Engineering Mutation-Tolerant Genes, Thesis of Genetics

This thesis uses ideas from coding theory to demonstrate, in theory, how mutation-tolerant genes could be engineered. It focuses on tolerating substitution mutations and develops an algebraic representation of the nucleotide bases so that nucleotide sequences can be converted into algebraic sequences. The work assumes the existence of synthetic protein-assembling entities that function like tRNAs but can read k nucleotides at a time, with k ≥ 5. Realizing this assumption is presented as a challenge to the research community.

Typology: Thesis

2018/2019

Engineering Mutation-Tolerant Genes
by
Prince Kwame Ampofo
A Thesis Presented in Partial Fulfillment
of the Requirements for the Degree
Master of Science
Approved April 2019 by the
Graduate Supervisory Committee:
Xiaojun Tian, Chair
Samira Kiani
Yang Kuang
ARIZONA STATE UNIVERSITY
May 2019


ABSTRACT

Ideas from coding theory are employed to theoretically demonstrate the engineering of mutation-tolerant genes, genes that can sustain up to some arbitrarily chosen number of mutations and still express the originally intended protein. Attention is restricted to tolerating substitution mutations. Future advances in genomic engineering will make possible the ability to synthesize entire genomes from scratch. This presents an opportunity to embed desirable capabilities like mutation-tolerance, which will be useful in preventing cell deaths in organisms intended for research or industrial applications in highly mutagenic environments. In the extreme case, mutation-tolerant genes (mutols) can make organisms resistant to retroviral infections.

An algebraic representation of the nucleotide bases is developed. This algebraic representation makes it possible to convert nucleotide sequences into algebraic sequences, apply mathematical ideas and convert results back into nucleotide terms. Using the algebra developed, a mapping is found from the naturally-occurring codons to an alternative set of codons which makes genes constructed from them mutation-tolerant, provided no more than one substitution mutation occurs per codon. The ideas discussed naturally extend to finding codons that can tolerate an arbitrarily chosen number t of mutations per codon. Finally, random substitution events are simulated in both a wild-type green fluorescent protein (GFP) gene and its mutol variant, and the amino acid sequence expressed from each post-mutation is compared with the amino acid sequence pre-mutation.

This work assumes the existence of synthetic protein-assembling entities that function like tRNAs but can read k nucleotides at a time, with k ≥ 5. The realization of this assumption is presented as a challenge to the research community.


ACKNOWLEDGMENTS

I would like to express my gratitude to Dr. Xiaojun Tian, Dr. Samira Kiani and Dr. Yang Kuang for serving on my committee and for their constructive feedback on my work. Special thanks also to the Mastercard Foundation for the scholarship that made my education at Arizona State University, which culminated in this thesis, possible.


Chapter 1

ALGEBRAIC REPRESENTATION OF NUCLEOTIDE BASES

1.1 Introduction

In our effort to develop genes that can tolerate mutations, we will be employing some mathematical ideas. To make this possible, we need to represent nucleotide bases, the building blocks from which DNA and RNA are constructed, with suitable mathematical entities. This way, we can easily apply mathematical ideas to algebraic representations of codons or nucleotide sequences and convert results back into nucleotide terms.

This chapter is devoted to finding suitable algebraic representations for the bases: adenine (A), guanine (G), cytosine (C), thymine (T), and uracil (U). In section 1.2, an interesting algebraic structure called a field is introduced. In section 1.3, we construct a field whose elements will be used as algebraic representations of the bases. In section 1.4, we assign elements of the field constructed in section 1.3 to the bases. Finally, in section 1.5, we discuss how computation is done in a field.

Most concepts utilized throughout this work are from the areas of linear algebra and coding theory. The uninitiated will find the following references helpful: Axler (2007), Ling (2004), and Lin and Costello (1983).

1.2 Field

A field F is a set together with two binary operations addition (denoted by +) and multiplication (denoted by ∗) defined such that the following axioms are satisfied:

  1. Closure. ∀ x, y ∈ F, x + y ∈ F and x ∗ y ∈ F

1.3 Constructing a Field of Order 4

In general, if p is prime and addition and multiplication are done modulo p, the set {0, 1, 2, ..., p − 1} satisfies all the field axioms outlined in Section 1.2 and thus forms a field of p elements. The number of elements in a field is called its order. When the order of a field is finite, we call it a Galois field and denote it by GF(number of elements). Also, when the number of elements in a Galois field is prime, we call the field a prime field.

We desire to construct a field of 4 elements: GF(4). However, 4 is not prime and so {0, 1, 2, ..., 4 − 1} = {0, 1, 2, 3} is not a field under modulo-4 arithmetic (for instance, 2 has no multiplicative inverse modulo 4). We can, however, employ a well-known idea from field theory which allows the construction of any field GF(p^q) from GF(p), where p is prime and q is any positive integer. Thus, we can construct our desired field GF(2^2) from GF(2) = {0, 1}, which is the field formed from {0, 1, 2, ..., p − 1} when p = 2. When GF(2) is extended to get GF(2^2), GF(2) is called the base field and GF(2^2) is called an extension field. To utilize this strategy, we first introduce two concepts:

  1. Polynomials over fields. If we construct an n-degree polynomial

     \[ f(\alpha) = f_0 + f_1\alpha + f_2\alpha^2 + \cdots + f_{n-1}\alpha^{n-1} + f_n\alpha^n \]

     such that its coefficients f_0, f_1, f_2, ..., f_{n−1}, f_n are all taken from a field F, we say that f(α) is a “polynomial over field F”. As an example, polynomials over GF(2) are polynomials with all coefficients either 0 or 1. In general, there are 2^n polynomials of degree n that can be constructed over GF(2). To see this, notice that for any polynomial of degree n over GF(2), its coefficients f_0, f_1, f_2, ..., f_{n−1} can each have two possible values (0 or 1), whereas coefficient f_n must be 1 since if it is 0, the polynomial will no longer be of degree n. Thus, the total number of n-degree polynomials over GF(2) is

     \[ \underbrace{2 * 2 * \cdots * 2}_{n \text{ times}} * 1 = 2^n. \]

     For example, there are 2^2 = 4 polynomials over GF(2) that are of degree 2. They are:

     \[ \alpha^2, \quad \alpha + \alpha^2, \quad 1 + \alpha^2, \quad \text{and} \quad 1 + \alpha + \alpha^2 \qquad (1.1) \]

  2. Primitive polynomials. A q-degree polynomial f(α) over a prime field GF(p) is called primitive if the smallest positive integer n for which f(α) divides α^n − 1 without a remainder is n = p^q − 1. Over GF(2), where −1 = 1, this is the same as requiring that f(α) divide α^n + 1. Primitive polynomials are important because they are used in constructing extension fields from base fields. It is difficult to test all polynomials over a given field to find ones that are primitive. As such, tables of primitive polynomials exist from which desired primitive polynomials can be obtained. Using one such table, we find that of the four polynomials of degree 2 over GF(2) (listed in 1.1), only one is primitive:

     \[ p(\alpha) = 1 + \alpha + \alpha^2 \qquad (1.2) \]

With these two concepts introduced, we state without proof a theorem with which we can construct GF(2^2) from GF(2).

Theorem 1.1. If f(α) is a q-degree primitive polynomial over GF(p), the set of all q-degree polynomials over GF(p) taken modulo f(α) forms GF(p^q).

Remark. Just as an integer like 9 can be expressed in modulo 3 by dividing 9 by 3 and keeping the remainder, a polynomial f_1(α) can be expressed modulo another polynomial f_2(α) by dividing f_1(α) by f_2(α) and keeping the remainder.
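The two concepts and the remark above can be tried out numerically. The following is a minimal Python sketch restricted to the base field GF(2); the bitmask encoding (bit i holds the coefficient of α^i) and the function names `poly_mod`, `degree_n_polys`, and `is_primitive` are assumptions of this sketch, not notation from the thesis. It enumerates the degree-2 polynomials of (1.1), reduces one polynomial modulo another by repeated subtraction (XOR), and applies the definition of primitivity directly instead of consulting a table.

```python
def poly_mod(a, m):
    """Remainder of a divided by m, for polynomials over GF(2).

    Polynomials are stored as integers: bit i is the coefficient of alpha^i.
    Subtraction over GF(2) is XOR. m must be nonzero.
    """
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        shift = (a.bit_length() - 1) - deg_m
        a ^= m << shift  # cancel the leading term of a
    return a


def degree_n_polys(n):
    """All polynomials of exact degree n over GF(2): the leading bit is fixed to 1."""
    return [(1 << n) | low for low in range(1 << n)]


def is_primitive(f):
    """Apply the definition: the smallest n with f | alpha^n + 1 must be 2^q - 1."""
    q = f.bit_length() - 1
    target = (1 << q) - 1
    for n in range(1, target + 1):
        if poly_mod((1 << n) | 1, f) == 0:  # (1 << n) | 1 encodes alpha^n + 1
            return n == target
    return False


degree_two = degree_n_polys(2)
print(len(degree_two))                                  # 4 polynomials of degree 2
print([bin(f) for f in degree_two if is_primitive(f)])  # ['0b111'], i.e. 1 + alpha + alpha^2
```

The brute-force search is only practical for small degrees, which is one reason published tables of primitive polynomials are normally used.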

Using Theorem 1.1 above, we will divide each of the 4 polynomials in Equation (1.1) by the primitive polynomial in Equation (1.2), keeping the remainder in each case. The set of these four remainders, under modulo 2 addition and multiplication, will satisfy all the field axioms in Section 1.2 and thus form the desired field

\[
\begin{aligned}
1 + \alpha + \alpha^2 &= 0 \\
1 + \alpha + (\alpha^2 + \alpha^2) &= 0 + \alpha^2 && \text{(added } \alpha^2 \text{ to both sides)} \\
\alpha^2 + \alpha^2 = (1 + 1)\alpha^2 = (0)\alpha^2 &= 0 && \text{(} 1 + 1 = 0 \text{ due to modulo 2 addition)} \\
\therefore \; 1 + \alpha &= \alpha^2 && (1.4)
\end{aligned}
\]

According to (1.4) above, the element α + 1 in GF(2^2) is equal to α^2. We can thus write GF(2^2) = {0, 1, α, α + 1} as {0, 1, α, α^2}. Because α^2 is more compact, we will use it more often. Being in exponent form, it will also be used primarily in multiplication operations, whereas the α + 1 form will be used mainly in addition operations.
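The construction prescribed by Theorem 1.1 is small enough to carry out mechanically. The sketch below, which assumes a bitmask encoding of polynomials over GF(2) (bit i is the coefficient of α^i) and a hypothetical helper `poly_mod`, divides each of the four degree-2 polynomials of (1.1) by p(α) = 1 + α + α^2 and collects the remainders. It is an illustration under those assumptions, not the thesis's own computation.

```python
def poly_mod(a, m):
    """Remainder of a divided by m over GF(2); polynomials are bitmasks."""
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << ((a.bit_length() - 1) - deg_m)
    return a


P = 0b111  # p(alpha) = 1 + alpha + alpha^2, Equation (1.2)

# The four degree-2 polynomials of Equation (1.1), keyed by a readable label.
POLYS = {
    "α^2": 0b100,
    "α + α^2": 0b110,
    "1 + α^2": 0b101,
    "1 + α + α^2": 0b111,
}

# Readable names for the possible remainders (degree < 2).
NAMES = {0b00: "0", 0b01: "1", 0b10: "α", 0b11: "α + 1"}

remainders = {label: poly_mod(f, P) for label, f in POLYS.items()}
for label, r in remainders.items():
    print(f"({label}) mod p(α) = {NAMES[r]}")

# The remainders are exactly the four elements 0, 1, α, α + 1.
field_elements = sorted(remainders.values())
print([NAMES[r] for r in field_elements])
```

The first line of output shows α^2 reducing to α + 1, which is the identity stated in (1.4).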

1.4 Assignment of Field Elements to Nucleotide Bases

Our goal at the beginning of this chapter was to find suitable algebraic entities to represent the nucleotide bases. We constructed a field of four elements, GF(2^2), in the preceding section. Here, we assign these elements to the bases. As outlined in Table 1.1 below, we will represent Adenine with 0, Guanine with α, and Cytosine with α^2. Because the RNA “equivalent” of Thymine is Uracil, they are both assigned 1. These assignments are not arbitrary. They have been done such that adding 1 to the algebraic representation of any base gives the algebraic representation of its Watson-Crick base pair. Examples (all operations are done modulo 2; also, note that α^2 = α + 1 by Equation 1.4):

  1. 0 (Adenine) + 1 = 1 (Thymine)
  2. 1 (Thymine) + 1 = 0 (Adenine)
  3. α (Guanine) + 1 = α^2 (Cytosine)
  4. α^2 (Cytosine) + 1 = (α + 1) + 1 = α (Guanine)

Base                        Algebraic Representation
Adenine (A)                 0
Guanine (G)                 α
Cytosine (C)                α^2
Thymine (T) & Uracil (U)    1

Table 1.1: Algebraic representation of Nucleotide Bases. Note that α^2 = α + 1

We will refer to the set of elements used to represent the bases as B (as in Bases), where the subscripts used are simply to indicate which base the element represents and may be left out for the sake of compactness:

\[ B = \{\, 0_{A},\; 1_{T \text{ or } U},\; \alpha_{G},\; \alpha^2_{C} \,\} \qquad (1.5) \]

As stated earlier, notice that 1 is used to represent both Thymine (T) and Uracil (U) since each can be thought of as the DNA or RNA “equivalent” of the other. As such, when 1 appears in the algebraic representation of a nucleotide sequence, it should be seen as representing Thymine or Uracil based on the context. Obviously, if 1 appears in a DNA sequence, it represents Thymine. In an RNA sequence, it represents Uracil.
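To make the assignment concrete, here is a minimal Python sketch of the base-to-element mapping and of the "add 1 to get the Watson-Crick partner" property. The integer encoding (0, 1, 2 for α, 3 for α^2 = α + 1, so that field addition becomes bitwise XOR) and the names `encode`, `decode`, and `complement` are assumptions made for illustration; the thesis itself works with the symbols 0, 1, α, α^2.

```python
# Integer stand-ins for the field elements: 0 -> 0, 1 -> 1, 2 -> α, 3 -> α^2 (= α + 1).
BASE_TO_GF4 = {"A": 0, "T": 1, "U": 1, "G": 2, "C": 3}
GF4_TO_DNA = {0: "A", 1: "T", 2: "G", 3: "C"}
GF4_TO_RNA = {0: "A", 1: "U", 2: "G", 3: "C"}


def encode(seq):
    """Convert a nucleotide string into a list of field elements."""
    return [BASE_TO_GF4[base] for base in seq.upper()]


def decode(values, rna=False):
    """Convert field elements back to bases; the element 1 is read as U only in an RNA context."""
    table = GF4_TO_RNA if rna else GF4_TO_DNA
    return "".join(table[v] for v in values)


def complement(seq, rna=False):
    """Base-wise Watson-Crick complement: add 1 to every element (XOR with 1)."""
    return decode([v ^ 1 for v in encode(seq)], rna=rna)


print(encode("GATTACA"))          # [2, 0, 1, 1, 0, 3, 0]
print(complement("GATTACA"))      # CTAATGT
print(decode(encode("AUG")))      # ATG (same elements, read as DNA)
print(decode(encode("AUG"), rna=True))  # AUG
```

Note that `complement` pairs bases position by position and does not reverse the strand.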

1.5 Doing Algebra with the Base Representations

Our goal for this chapter was to find suitable algebraic representations for the nucleotide bases so that we can write any nucleotide sequence as a sequence of algebraic entities, apply mathematical ideas to it, and map results back into nucleotide terms, with the aim of developing mutation-tolerant genes. Having found such suitable algebraic entities (elements of set B), we now discuss how addition, multiplication, subtraction, and division can be done with them.


(a) Addition Table

 +   | 0    1    α    α^2
-----+-------------------
 0   | 0    1    α    α^2
 1   | 1    0    α^2  α
 α   | α    α^2  0    1
 α^2 | α^2  α    1    0

(b) Multiplication Table

 ∗   | 0    1    α    α^2
-----+-------------------
 0   | 0    0    0    0
 1   | 0    1    α    α^2
 α   | 0    α    α^2  1
 α^2 | 0    α^2  1    α

Table 1.2: Arithmetic table for GF(2^2). Note that + and ∗ are done mod 2.
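The entries of Table 1.2 can be regenerated from the identity α^2 = α + 1. The sketch below assumes an integer encoding of the field elements (0, 1, 2 for α, 3 for α^2) in which the two bits are the coefficients of 1 and α; the function names are hypothetical. Addition is then coefficient-wise modulo 2 (XOR), and multiplication is polynomial multiplication followed by reduction modulo 1 + α + α^2.

```python
NAMES = {0: "0", 1: "1", 2: "α", 3: "α^2"}


def gf4_add(a, b):
    """Add two GF(4) elements: coefficient-wise addition modulo 2."""
    return a ^ b


def gf4_mul(a, b):
    """Multiply two GF(4) elements, reducing with α^2 = α + 1."""
    prod = 0
    for i in range(2):          # b has at most two terms: 1 and α
        if (b >> i) & 1:
            prod ^= a << i
    if prod & 0b100:            # an α^2 term appeared: subtract 1 + α + α^2
        prod ^= 0b111
    return prod


add_table = [[gf4_add(a, b) for b in range(4)] for a in range(4)]
mul_table = [[gf4_mul(a, b) for b in range(4)] for a in range(4)]

for title, table in (("+", add_table), ("∗", mul_table)):
    print(title, *(NAMES[b] for b in range(4)), sep="\t")
    for a, row in enumerate(table):
        print(NAMES[a], *(NAMES[x] for x in row), sep="\t")
    print()
```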

  • Subtraction. Note that a − b = a + (−b), where −b is the additive inverse of b. Thus, to subtract b from a, we first find the additive inverse of b (which is denoted by −b) and add it to a. Note that an additive inverse of an element b is the element which when added to b, gives 0. Looking at the addition table in Table 1.2a, we notice that adding any element to itself gives 0. This means that for set B, the additive inverse of any element is itself. This has two interesting implications: 1) Subtracting an element from another is the same as adding them. For example, a − b = a + (−b) = a + b since −b, which denotes the inverse of b, equals b itself. 2) Because of implication 1, the addition table shown in Table 1.2a also acts as a subtraction table for the elements in B.
  • Division. Note that a/b = a ∗ b^(−1), where b^(−1) is the multiplicative inverse of b. Thus, to divide a by b, we first find the multiplicative inverse of b then multiply it by a. Remember that the multiplicative inverse of an element b is an element which when multiplied by b, gives 1. For example, to find 1/α^2, we first find the multiplicative inverse of α^2 by looking at the multiplication table (Table 1.2b) to find the element which when multiplied by α^2, gives 1. Since α^2 ∗ α = 1, α is the multiplicative inverse of α^2. Thus, 1/α^2 = 1 ∗ (α^2)^(−1) = 1 ∗ α = α.
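Subtraction and division as described in the two bullets above can be sketched in a few lines of Python. The integer encoding of the elements (0, 1, 2 for α, 3 for α^2) and the helper names are assumptions of this sketch. The inverse is found the same way the text does it, by scanning for the element whose product with b is 1.

```python
def gf4_add(a, b):
    """Addition in GF(4) under the integer encoding 0, 1, 2 (α), 3 (α^2)."""
    return a ^ b


def gf4_mul(a, b):
    """Multiplication in GF(4), reducing with α^2 = α + 1."""
    prod = 0
    for i in range(2):
        if (b >> i) & 1:
            prod ^= a << i
    if prod & 0b100:
        prod ^= 0b111
    return prod


def gf4_sub(a, b):
    """Every element is its own additive inverse, so a - b = a + b."""
    return gf4_add(a, b)


def gf4_inv(b):
    """Multiplicative inverse: the element x with b * x = 1 (undefined for 0)."""
    if b == 0:
        raise ZeroDivisionError("0 has no multiplicative inverse")
    return next(x for x in range(1, 4) if gf4_mul(b, x) == 1)


def gf4_div(a, b):
    """a / b = a * b^(-1)."""
    return gf4_mul(a, gf4_inv(b))


ALPHA, ALPHA_SQ = 2, 3
print(gf4_inv(ALPHA_SQ) == ALPHA)     # True: α is the inverse of α^2
print(gf4_div(1, ALPHA_SQ) == ALPHA)  # True: 1/α^2 = α
```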


1.6 Algebraic Representation of Codons

Having found suitable algebraic representation for the nucleotide bases, we can now write each of the 64 DNA sense (5′ → 3′) codons in terms of the algebraic representation of its nucleotide bases.
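Rewriting a codon in algebraic terms is a base-by-base lookup in Table 1.1. As a minimal sketch, with the symbols stored as strings and the names `codon_to_tuple` and `ALL_TUPLES` chosen here purely for illustration, the conversion and the enumeration of all 64 possible 3-tuples look like this:

```python
from itertools import product

# Table 1.1 as a lookup; T and U share the element 1.
BASE_TO_SYMBOL = {"A": "0", "T": "1", "U": "1", "G": "α", "C": "α^2"}
SYMBOLS = ("0", "1", "α", "α^2")


def codon_to_tuple(codon):
    """Write a codon such as 'ATG' as a tuple of field-element symbols."""
    return tuple(BASE_TO_SYMBOL[base] for base in codon.upper())


# Every 3-tuple over {0, 1, α, α^2}: 4^3 = 64 of them.
ALL_TUPLES = list(product(SYMBOLS, repeat=3))

# Every DNA codon, converted one by one.
DNA_CODONS = ["".join(bases) for bases in product("ATGC", repeat=3)]
converted = {codon: codon_to_tuple(codon) for codon in DNA_CODONS}

print(codon_to_tuple("ATG"))   # ('0', '1', 'α')
print(codon_to_tuple("GCT"))   # ('α', 'α^2', '1')
print(len(ALL_TUPLES), len(set(converted.values())))  # 64 64
```

Because the base-to-element map is one-to-one on DNA bases, the 64 codons and the 64 tuples correspond exactly.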

Amino acid   Codons (algebraic representation)
Ala          GCT (α α^2 1), GCA (α α^2 0), GCC (α α^2 α^2), GCG (α α^2 α)
Asn          AAT (0 0 1), AAC (0 0 α^2)
Asp          GAT (α 0 1), GAC (α 0 α^2)
Cys          TGT (1 α 1), TGC (1 α α^2)
Glu          GAA (α 0 0), GAG (α 0 α)
Gln          CAA (α^2 0 0), CAG (α^2 0 α)
Gly          GGT (α α 1), GGA (α α 0), GGC (α α α^2), GGG (α α α)
His          CAT (α^2 0 1), CAC (α^2 0 α^2)
Ile          ATT (0 1 1), ATC (0 1 α^2), ATA (0 1 0)
Leu          TTA (1 1 0), TTG (1 1 α), CTT (α^2 1 1), CTA (α^2 1 0), CTC (α^2 1 α^2), CTG (α^2 1 α)
Lys          AAA (0 0 0), AAG (0 0 α)
Met          ATG (0 1 α)
Pro          CCT (α^2 α^2 1), CCA (α^2 α^2 0), CCC (α^2 α^2 α^2), CCG (α^2 α^2 α)
Phe          TTT (1 1 1), TTC (1 1 α^2)
Thr          ACT (0 α^2 1), ACC (0 α^2 α^2), ACA (0 α^2 0), ACG (0 α^2 α)
Tyr          TAT (1 0 1), TAC (1 0 α^2)
Val          GTT (α 1 1), GTA (α 1 0), GTC (α 1 α^2), GTG (α 1 α)
Ser          TCT (1 α^2 1), TCA (1 α^2 0), TCC (1 α^2 α^2), TCG (1 α^2 α), AGT (0 α 1), AGC (0 α α^2)
Trp          TGG (1 α α)
Arg          CGT (α^2 α 1), CGC (α^2 α α^2), CGA (α^2 α 0), CGG (α^2 α α), AGA (0 α 0), AGG (0 α α)
Stop         TAA (1 0 0), TAG (1 0 α), TGA (1 α 0)

C = {
  (0 0 0), (0 0 1), (0 0 α), (0 0 α^2),
  (0 1 0), (0 1 1), (0 1 α), (0 1 α^2),
  (0 α 0), (0 α 1), (0 α α), (0 α α^2),
  (0 α^2 0), (0 α^2 1), (0 α^2 α), (0 α^2 α^2),
  (1 0 0), (1 0 1), (1 0 α), (1 0 α^2),
  (1 1 0), (1 1 1), (1 1 α), (1 1 α^2),
  (1 α 0), (1 α 1), (1 α α), (1 α α^2),
  (1 α^2 0), (1 α^2 1), (1 α^2 α), (1 α^2 α^2),
  (α 0 0), (α 0 1), (α 0 α), (α 0 α^2),
  (α 1 0), (α 1 1), (α 1 α), (α 1 α^2),
  (α α 0), (α α 1), (α α α), (α α α^2),
  (α α^2 0), (α α^2 1), (α α^2 α), (α α^2 α^2),
  (α^2 0 0), (α^2 0 1), (α^2 0 α), (α^2 0 α^2),
  (α^2 1 0), (α^2 1 1), (α^2 1 α), (α^2 1 α^2),
  (α^2 α 0), (α^2 α 1), (α^2 α α), (α^2 α α^2),
  (α^2 α^2 0), (α^2 α^2 1), (α^2 α^2 α), (α^2 α^2 α^2)
}

Definition 1. (Vector Space) A vector space V is a set of objects (called vectors) that can be added (+) together and multiplied (·) by scalars such that the following axioms are satisfied for any choice of vectors u, v, and w and scalars a and b :

  1. Closure. u + w ∈ V and a · v ∈ V
  2. Commutativity. u + w = w + u
  3. Additive identity. There exists a zero vector, denoted 0, in V such that u + 0 = u
  4. Multiplicative identity. 1 · v = v
  5. Additive inverses. For any vector u ∈ V, there exists −u ∈ V such that u + (−u) = 0
  6. Associativity. u + (v + w) = (u + v) + w and a · (b · v) = (a · b) · v
  7. Distributivity. a · (u + v) = (a · u) + (a · v) and (a + b) · u = (a · u) + (b · u)

What the definition says is that if we form a set and define a way of adding the elements of the set and also define multiplication of the elements by some scalars such that the addition and multiplication operations obey all the 7 axioms listed, then we have a vector space. Thus, to demonstrate that the set of p-codons C (with B = {0, 1, α, α^2} as scalars) is a vector space, we need to do two things as per the definition: 1) define two operations: the addition of two p-codons and multiplication of a p-codon by a scalar and 2) show that the 7 axioms are satisfied under those operations. We do these in sections 2.2.1 through 2.2.3 below.

Addition of p-codons and multiplication of p-codons by scalars are defined in terms of tuples of length n so as to generalize to our later needs. In the context of C, we simply set n = 3.

2.2.1 Addition of p-codons. We define the addition of p-codons a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) as a + b = (a_1 + b_1, a_2 + b_2, ..., a_n + b_n), where the + in a_i + b_i for i = 1, 2, ..., n is modulo 2 addition as given in the addition table (Table 1.2a).

2.2.2 Multiplication of p-codons by scalars. We define the multiplication of a p-codon a = (a_1, a_2, ..., a_n) by a scalar λ in B as λ · a = (λa_1, λa_2, ..., λa_n). We will sometimes write λ · a simply as λa.
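The two operations just defined act component by component, so they translate directly into code. The sketch below assumes the integer encoding 0, 1, 2 (for α), 3 (for α^2) of the scalars in B and uses hypothetical function names `pcodon_add` and `pcodon_scale`; it works for tuples of any length n, with n = 3 in the examples.

```python
def gf4_mul(a, b):
    """Multiplication in GF(4) under the integer encoding, reducing with α^2 = α + 1."""
    prod = 0
    for i in range(2):
        if (b >> i) & 1:
            prod ^= a << i
    if prod & 0b100:
        prod ^= 0b111
    return prod


def pcodon_add(a, b):
    """Component-wise addition of two tuples of equal length (Section 2.2.1)."""
    if len(a) != len(b):
        raise ValueError("tuples must have the same length")
    return tuple(x ^ y for x, y in zip(a, b))   # GF(4) addition is XOR


def pcodon_scale(lam, a):
    """Multiply every component of a by the scalar lam (Section 2.2.2)."""
    return tuple(gf4_mul(lam, x) for x in a)


ATG = (0, 1, 2)      # (0 1 α)
GCT = (2, 3, 1)      # (α α^2 1)
ALPHA = 2

print(pcodon_add(ATG, GCT))      # (2, 2, 3), i.e. (α α α^2)
print(pcodon_scale(ALPHA, ATG))  # (0, 2, 3), i.e. (0 α α^2)
print(pcodon_add(ATG, ATG))      # (0, 0, 0): every tuple is its own additive inverse
```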

2.2.3 Proof that axioms 1 through 7 are satisfied. Having defined addition of p-codons and scalar multiplication, we are now in a position to show that set C satisfies axioms 1 through 7 and is thus a vector space. For the proofs below, it is important to keep two things in mind. First, the set of scalars B = {0, 1, α, α^2} is a field and so addition or multiplication of any members of the set always results in another member of the set (see Table 1.2). Second, C is the set of all 3-tuples whose 3 elements are chosen from B and so to show that some arbitrary 3-tuple (λ_1 λ_2 λ_3) is a member of C, we only need to show that λ_1, λ_2, and λ_3 are elements of B.

  1. Closure.
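Because C has only 64 elements and B only 4, the seven axioms can also be checked exhaustively by machine. The following Python sketch is a sanity check under an assumed integer encoding of B (0, 1, 2 for α, 3 for α^2), with hypothetical helper names; it complements the axiom-by-axiom argument and is not a substitute for it.

```python
from itertools import product


def gf4_mul(a, b):
    """Multiplication in GF(4) under the integer encoding, reducing with α^2 = α + 1."""
    prod = 0
    for i in range(2):
        if (b >> i) & 1:
            prod ^= a << i
    if prod & 0b100:
        prod ^= 0b111
    return prod


def add(u, w):
    """p-codon addition: component-wise GF(4) addition (XOR)."""
    return tuple(x ^ y for x, y in zip(u, w))


def scale(a, v):
    """Scalar multiplication of a p-codon."""
    return tuple(gf4_mul(a, x) for x in v)


B = range(4)                      # scalars 0, 1, α, α^2
C = list(product(B, repeat=3))    # all 64 p-codons
C_SET = set(C)
ZERO = (0, 0, 0)

checks = {
    "closure": all(add(u, w) in C_SET for u in C for w in C)
    and all(scale(a, v) in C_SET for a in B for v in C),
    "commutativity": all(add(u, w) == add(w, u) for u in C for w in C),
    "additive identity": all(add(u, ZERO) == u for u in C),
    "multiplicative identity": all(scale(1, v) == v for v in C),
    "additive inverses": all(any(add(u, w) == ZERO for w in C) for u in C),
    "associativity": all(
        add(u, add(v, w)) == add(add(u, v), w) for u in C for v in C for w in C
    )
    and all(
        scale(a, scale(b, v)) == scale(gf4_mul(a, b), v)
        for a in B for b in B for v in C
    ),
    "distributivity": all(
        scale(a, add(u, v)) == add(scale(a, u), scale(a, v))
        for a in B for u in C for v in C
    )
    and all(
        scale(a ^ b, u) == add(scale(a, u), scale(b, u))
        for a in B for b in B for u in C
    ),
}

for axiom, holds in checks.items():
    print(f"{axiom}: {holds}")
```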