























































This thesis explores the theoretical demonstration of engineering mutation-tolerant genes using ideas from coding theory. The focus is on tolerating substitution mutations and developing an algebraic representation of nucleotide bases to convert nucleotide sequences into algebraic sequences. The work assumes the existence of synthetic protein-assembling entities that function like tRNAs but can read k nucleotides at a time, with k ≥ 5. The realization of this assumption is presented as a challenge to the research community.
Typology: Thesis
























































Engineering Mutation-Tolerant Genes by Prince Kwame Ampofo
A Thesis Presented in Partial Fulfillment of the Requirements for the Degree Master of Science
Approved April 2019 by the Graduate Supervisory Committee: Xiaojun Tian, Chair; Samira Kiani; Yang Kuang
May 2019
Ideas from coding theory are employed to theoretically demonstrate the engineering of mutation-tolerant genes: genes that can sustain up to some arbitrarily chosen number of mutations and still express the originally intended protein. Attention is restricted to tolerating substitution mutations. Future advances in genomic engineering will make it possible to synthesize entire genomes from scratch. This presents an opportunity to embed desirable capabilities like mutation-tolerance, which will be useful in preventing cell deaths in organisms intended for research or industrial applications in highly mutagenic environments. In the extreme case, mutation-tolerant genes (mutols) can make organisms resistant to retroviral infections.
An algebraic representation of the nucleotide bases is developed. This algebraic representation makes it possible to convert nucleotide sequences into algebraic sequences, apply mathematical ideas, and convert results back into nucleotide terms. Using the algebra developed, a mapping is found from the naturally occurring codons to an alternative set of codons which makes genes constructed from them mutation-tolerant, provided no more than one substitution mutation occurs per codon. The ideas discussed naturally extend to finding codons that can tolerate an arbitrarily chosen number t of mutations per codon. Finally, random substitution events are simulated in both a wild-type green fluorescent protein (GFP) gene and its mutol variant, and the amino acid sequence expressed from each post-mutation is compared with the amino acid sequence pre-mutation.
This work assumes the existence of synthetic protein-assembling entities that function like tRNAs but can read k nucleotides at a time, with k ≥ 5. The realization of this assumption is presented as a challenge to the research community.
I would like to express my gratitude to Dr. Xiaojun Tian, Dr. Samira Kiani and Dr. Yang Kuang for serving on my committee and for their constructive feedback on my work. Special thanks also to the Mastercard Foundation for the scholarship that made my education at Arizona State University, which culminated in this thesis, possible.
Chapter 1
ALGEBRAIC REPRESENTATION OF NUCLEOTIDE BASES
1.1 Introduction
In our effort to develop genes that can tolerate mutations, we will employ some mathematical ideas. To make this possible, we need to represent nucleotide bases, the building blocks from which DNA and RNA are constructed, with suitable mathematical entities. This way, we can easily apply mathematical ideas to algebraic representations of codons or nucleotide sequences and convert results back into nucleotide terms. This chapter is devoted to finding suitable algebraic representations for the bases: adenine (A), guanine (G), cytosine (C), thymine (T), and uracil (U).
In section 1.2, an algebraic structure called a field is introduced. In section 1.3, we construct a field whose elements will be used as algebraic representations of the bases. In section 1.4, we assign elements of the field constructed in section 1.3 to the bases. Finally, in section 1.5, we discuss how computation is done in a field.
Most concepts utilized throughout this work are from the areas of linear algebra and coding theory. The uninitiated will find the following references helpful: Axler (2007), Ling (2004), and Lin and Costello (1983).
1.2 Field
A field F is a set together with two binary operations addition (denoted by +) and multiplication (denoted by ∗) defined such that the following axioms are satisfied:
1.3 Constructing Field of order 4
In general, if p is prime and addition and multiplication are done modulo p, the set {0, 1, 2, ..., p − 1} satisfies all the field axioms outlined in Section 1.2 above and thus forms a field of p elements. The number of elements in a field is called its order. When the order of a field is finite, we call it a Galois field and denote it by GF(number of elements). Also, when the number of elements in a Galois field is prime, we call the field a prime field.
We desire to construct a field of 4 elements: GF(4). However, 4 is not prime and so {0, 1, 2, ..., 4 − 1} = {0, 1, 2, 3} is not a field. We can however employ a well-known idea from field theory which allows the construction of any field GF(p^q) from GF(p), where p is prime and q is any positive integer. Thus, we can construct our desired field GF(2^2) from GF(2) = {0, 1}, which is the field formed from {0, 1, 2, ..., p − 1} when p = 2. When GF(2) is extended to get GF(2^2), GF(2) is called the base field and GF(2^2) is called an extension field. To utilize this strategy, we first introduce two concepts:
∗ 1 = 2^n. For example, there are 2^2 = 4 polynomials over GF(2) that are of degree 2. They are:

α^2, α + α^2, 1 + α^2, and 1 + α + α^2    (1.1)
Theorem 1.1. If f(α) is a q-degree primitive polynomial over GF(p), the set of all q-degree polynomials over GF(p) taken modulo f(α) forms GF(p^q).
Remark. Just as an integer like 9 can be expressed modulo 3 by dividing 9 by 3 and keeping the remainder, a polynomial f_1(α) can be expressed modulo another polynomial f_2(α) by dividing f_1(α) by f_2(α) and keeping the remainder.
Using Theorem 1.1 above, we will divide each of the 4 polynomials in Equation (1.1) by the primitive polynomial in Equation (1.2), keeping the remainder in each case. The set of these four remainders, under modulo 2 addition and multiplication, will satisfy all the field axioms in Section 1.2 and thus form the desired field
1 + α + α^2 = 0
1 + α + (α^2 + α^2) = 0 + α^2    (added α^2 to both sides)
α^2 + α^2 = (1 + 1)α^2 = (0)α^2 = 0    (1 + 1 = 0 due to modulo 2 addition)
∴ 1 + α = α^2    (1.4)

According to (1.4) above, the element α + 1 in GF(2^2) is equal to α^2. We can thus write GF(2^2) = {0, 1, α, α + 1} as {0, 1, α, α^2}. Because α^2 is more compact, we will use it more often. Being in exponent form, it will also be used primarily in multiplication operations, whereas the α + 1 form will be used mainly in addition operations.
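The reduction described in the Remark can be sketched in a few lines of Python. This is only an illustration, not part of the original derivation: it assumes the primitive polynomial is 1 + α + α^2 (the polynomial set to zero in the derivation of (1.4)), and the bit-mask encoding and the function name poly_mod_gf2 are our own choices.

```python
def poly_mod_gf2(dividend: int, divisor: int) -> int:
    """Remainder of dividend / divisor for polynomials over GF(2).

    Polynomials are encoded as bit masks: bit i holds the coefficient of α^i,
    so 0b111 stands for 1 + α + α^2.
    """
    deg_divisor = divisor.bit_length() - 1
    while dividend and dividend.bit_length() - 1 >= deg_divisor:
        shift = dividend.bit_length() - 1 - deg_divisor
        dividend ^= divisor << shift  # subtraction is the same as XOR modulo 2
    return dividend


PRIMITIVE = 0b111  # 1 + α + α^2 (assumption, inferred from the derivation of (1.4))

# The four degree-2 polynomials of Equation (1.1).
DEGREE_2 = {
    "α^2": 0b100,
    "α + α^2": 0b110,
    "1 + α^2": 0b101,
    "1 + α + α^2": 0b111,
}
NAMES = {0b00: "0", 0b01: "1", 0b10: "α", 0b11: "α + 1"}

remainders = {name: NAMES[poly_mod_gf2(p, PRIMITIVE)] for name, p in DEGREE_2.items()}
for name, rem in remainders.items():
    print(f"({name}) mod (1 + α + α^2) = {rem}")
```

Under these assumptions the four remainders are α + 1, 1, α, and 0, which are exactly the four elements of GF(2^2), and the first line of output reproduces the identity α^2 = α + 1 of (1.4).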
1.4 Assignment of Field elements to nucleotide Bases
Our goal at the beginning of this chapter was to find suitable algebraic entities to represent the nucleotide bases. We constructed a field of four elements, GF(2^2), in the preceding section. Here, we assign these elements to the bases. As outlined in Table 1.1 below, we will represent Adenine with 0, Guanine with α, and Cytosine with α^2. Because the RNA “equivalent” of Thymine is Uracil, they are both assigned 1. These assignments are not arbitrary. They have been made such that adding 1 to the algebraic representation of any base gives the algebraic representation of its Watson-Crick base pair. For example, A + 1 = 0 + 1 = 1, which represents T, and G + 1 = α + 1 = α^2, which represents C.
Base                        Algebraic Representation
Adenine (A)                 0
Guanine (G)                 α
Cytosine (C)                α^2
Thymine (T) & Uracil (U)    1
Table 1.1: Algebraic representation of Nucleotide Bases. Note that α^2 = α + 1
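The base-pairing property of the assignments in Table 1.1 can be checked with a short Python sketch. The 2-bit integer encoding (bit 1 for the coefficient of α, bit 0 for the constant term) and the helper name complement are illustrative assumptions of ours; only DNA bases are shown, with T standing in for U.

```python
# Field elements as 2-bit integers: bit 1 is the coefficient of α, bit 0 the constant.
# Addition in GF(2^2) is coefficient-wise modulo 2, i.e. bitwise XOR.
BASE_TO_ELEM = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}  # C = α^2 = α + 1
ELEM_TO_BASE = {elem: base for base, elem in BASE_TO_ELEM.items()}


def complement(base: str) -> str:
    """Return the Watson-Crick partner by adding 1 to the base's representation."""
    return ELEM_TO_BASE[BASE_TO_ELEM[base] ^ 0b01]


pairs = {base: complement(base) for base in "ATGC"}
print(pairs)  # {'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
```

Adding 1 twice returns the original base, since 1 + 1 = 0 under modulo 2 addition.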
We will refer to the set of elements used to represent the bases as B (as in Bases), where the subscripts used are simply to indicate which base the element represents and may be left out for the sake of compactness:
B = {0_A, 1_(T or U), α_G, α^2_C}    (1.5)

As stated earlier, notice that 1 is used to represent both Thymine (T) and Uracil (U) since each can be thought of as the DNA or RNA “equivalent” of the other. As such, when 1 appears in the algebraic representation of a nucleotide sequence, it should be read as Thymine or Uracil depending on the context: in a DNA sequence it represents Thymine, and in an RNA sequence it represents Uracil.
1.5 Doing Algebra with the Base Representations
Our goal for this chapter was to find suitable algebraic representations for the nucleotide bases so that we can write any nucleotide sequence as a sequence of algebraic entities, apply mathematical ideas to it, and map results back into nucleotide terms, with the aim of developing mutation-tolerant genes. Having found such suitable algebraic entities (elements of set B), we now discuss how addition, multiplication, subtraction, and division can be done with them.
∗      0    1    α    α^2
0      0    0    0    0
1      0    1    α    α^2
α      0    α    α^2  1
α^2    0    α^2  1    α

(b) Multiplication Table
Table 1.2: Arithmetic table for GF(2^2 ). Note that + and ∗ are done mod 2.
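The four operations on GF(2^2) can be sketched in Python as follows. This is a minimal illustration under our own assumptions: elements are encoded as 2-bit integers (0, 1, α, α^2 as 0, 1, 2, 3), and the function names gf4_add, gf4_sub, gf4_mul, gf4_div, and table are hypothetical helpers, not notation from the text.

```python
# Elements of GF(2^2) as 2-bit integers: bit 1 = coefficient of α, bit 0 = constant.
NAMES = ["0", "1", "α", "α^2"]  # index 3 is α + 1, written α^2 as in (1.4)


def gf4_add(x: int, y: int) -> int:
    """Coefficient-wise addition modulo 2, which is bitwise XOR."""
    return x ^ y


def gf4_sub(x: int, y: int) -> int:
    """Subtraction equals addition, since every element is its own additive inverse."""
    return gf4_add(x, y)


def gf4_mul(x: int, y: int) -> int:
    """Polynomial product over GF(2), reduced using 1 + α + α^2 = 0."""
    product = 0
    for i in range(2):
        if (y >> i) & 1:
            product ^= x << i
    if product & 0b100:   # an α^2 term is present
        product ^= 0b111  # remove it by adding 1 + α + α^2, which is 0 in the field
    return product


def gf4_div(x: int, y: int) -> int:
    """Divide by multiplying with the multiplicative inverse of y."""
    if y == 0:
        raise ZeroDivisionError("division by 0 in GF(2^2)")
    inverse = next(z for z in range(1, 4) if gf4_mul(y, z) == 1)
    return gf4_mul(x, inverse)


def table(op):
    """Full 4 x 4 operation table, with entries written as field-element names."""
    return [[NAMES[op(x, y)] for y in range(4)] for x in range(4)]


for symbol, op in (("+", gf4_add), ("∗", gf4_mul)):
    print(symbol, *NAMES)
    for name, row in zip(NAMES, table(op)):
        print(name, *row)
```

The multiplication table printed by this sketch can be compared entry by entry with Table 1.2(b); for instance, α ∗ α^2 = 1, so 1 divided by α is α^2.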
1.6 Algebraic representation of Codons
Having found suitable algebraic representations for the nucleotide bases, we can now write each of the 64 DNA sense (5′ → 3′) codons in terms of the algebraic representations of its nucleotide bases.
Amino acid   Algebraic representation and DNA codon
Ala          (α α^2 1) GCT, (α α^2 0) GCA, (α α^2 α^2) GCC, (α α^2 α) GCG
Asn          (0 0 1) AAT, (0 0 α^2) AAC
Asp          (α 0 1) GAT, (α 0 α^2) GAC
Cys          (1 α 1) TGT, (1 α α^2) TGC
Glu          (α 0 0) GAA, (α 0 α) GAG
Gln          (α^2 0 0) CAA, (α^2 0 α) CAG
Gly          (α α 1) GGT, (α α 0) GGA, (α α α^2) GGC, (α α α) GGG
His          (α^2 0 1) CAT, (α^2 0 α^2) CAC
Ile          (0 1 1) ATT, (0 1 α^2) ATC, (0 1 0) ATA
Leu          (1 1 0) TTA, (1 1 α) TTG, (α^2 1 1) CTT, (α^2 1 0) CTA, (α^2 1 α^2) CTC, (α^2 1 α) CTG
Lys          (0 0 0) AAA, (0 0 α) AAG
Met          (0 1 α) ATG
Pro          (α^2 α^2 1) CCT, (α^2 α^2 0) CCA, (α^2 α^2 α^2) CCC, (α^2 α^2 α) CCG
Phe          (1 1 1) TTT, (1 1 α^2) TTC
Thr          (0 α^2 1) ACT, (0 α^2 α^2) ACC, (0 α^2 0) ACA, (0 α^2 α) ACG
Tyr          (1 0 1) TAT, (1 0 α^2) TAC
Val          (α 1 1) GTT, (α 1 0) GTA, (α 1 α^2) GTC, (α 1 α) GTG
Ser          (1 α^2 1) TCT, (1 α^2 0) TCA, (1 α^2 α^2) TCC, (1 α^2 α) TCG, (0 α 1) AGT, (0 α α^2) AGC
Trp          (1 α α) TGG
Arg          (α^2 α 1) CGT, (α^2 α α^2) CGC, (α^2 α 0) CGA, (α^2 α α) CGG, (0 α 0) AGA, (0 α α) AGG
Stop         (1 0 0) TAA, (1 0 α) TAG, (1 α 0) TGA
C = {
(0 0 0), (0 0 1), (0 0 α), (0 0 α^2),
(0 1 0), (0 1 1), (0 1 α), (0 1 α^2),
(0 α 0), (0 α 1), (0 α α), (0 α α^2),
(0 α^2 0), (0 α^2 1), (0 α^2 α), (0 α^2 α^2),
(1 0 0), (1 0 1), (1 0 α), (1 0 α^2),
(1 1 0), (1 1 1), (1 1 α), (1 1 α^2),
(1 α 0), (1 α 1), (1 α α), (1 α α^2),
(1 α^2 0), (1 α^2 1), (1 α^2 α), (1 α^2 α^2),
(α 0 0), (α 0 1), (α 0 α), (α 0 α^2),
(α 1 0), (α 1 1), (α 1 α), (α 1 α^2),
(α α 0), (α α 1), (α α α), (α α α^2),
(α α^2 0), (α α^2 1), (α α^2 α), (α α^2 α^2),
(α^2 0 0), (α^2 0 1), (α^2 0 α), (α^2 0 α^2),
(α^2 1 0), (α^2 1 1), (α^2 1 α), (α^2 1 α^2),
(α^2 α 0), (α^2 α 1), (α^2 α α), (α^2 α α^2),
(α^2 α^2 0), (α^2 α^2 1), (α^2 α^2 α), (α^2 α^2 α^2)
}
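The set C is the set of all 3-tuples over B, so it can be generated mechanically rather than written out by hand. The Python sketch below does this and converts between DNA codons and their algebraic form using the assignments of Table 1.1. The names SYMBOL, encode, and decode are hypothetical helpers introduced only for this illustration, and field elements are kept as strings for readability.

```python
from itertools import product

# Assignments of Table 1.1 (DNA form: 1 is read as T).
SYMBOL = {"A": "0", "T": "1", "G": "α", "C": "α^2"}
BASE_OF = {symbol: base for base, symbol in SYMBOL.items()}


def encode(codon: str) -> tuple:
    """Write a DNA codon such as 'GCT' as a tuple of field elements."""
    return tuple(SYMBOL[base] for base in codon)


def decode(elements: tuple) -> str:
    """Map a tuple of field elements back to its DNA codon."""
    return "".join(BASE_OF[symbol] for symbol in elements)


# All 3-tuples over B = {0, 1, α, α^2}, in the same order as the listing of C.
C = list(product(["0", "1", "α", "α^2"], repeat=3))

print(len(C))                  # 64
print(encode("GCT"))           # ('α', 'α^2', '1')
print(decode(("0", "1", "α")))  # ATG
```

Because encode and decode are inverse to each other, decoding every element of C yields each of the 64 DNA codons exactly once.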
Definition 1. (Vector Space) A vector space V is a set of objects (called vectors) that can be added (+) together and multiplied (·) by scalars such that the following axioms are satisfied for any choice of vectors u, v, and w and scalars a and b :
What the definition says is that if we form a set, define a way of adding the elements of the set, and define multiplication of the elements by some scalars such that the addition and multiplication operations obey all 7 axioms listed, then we have a vector space. Thus, to demonstrate that the set of p-codons C (with B = {0, 1, α, α^2} as scalars) is a vector space, we need to do two things as per the definition: 1) define two operations, the addition of two p-codons and the multiplication of a p-codon by a scalar, and 2) show that the 7 axioms are satisfied under those operations. We do these in sections 2.2.1 through 2.2.3 below. Addition of p-codons and multiplication of p-codons by scalars are defined in terms of tuples of length n so as to generalize to our later needs. In the context of C, we simply set n = 3.
2.2.1 Addition of p-codons. We define the addition of p-codons a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) as a + b = (a_1 + b_1, a_2 + b_2, ..., a_n + b_n), where the + in a_i + b_i for i = 1, 2, ..., n is modulo 2 addition as given in addition Table 1.2a.
2.2.2 Multiplication of p-codons by scalars. We define the multiplication of a p-codon a = (a_1, a_2, ..., a_n) by a scalar λ in B as λ · a = (λa_1, λa_2, ..., λa_n). We will sometimes write λ · a simply as λa.
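The two definitions above can be sketched in Python for tuples of any length n. As before, the 2-bit integer encoding of field elements (0, 1, α, α^2 as 0, 1, 2, 3) and the names add_pcodons and scale_pcodon are assumptions made for this illustration only.

```python
def gf4_add(x: int, y: int) -> int:
    """Addition in GF(2^2): coefficient-wise modulo 2, i.e. bitwise XOR."""
    return x ^ y


def gf4_mul(x: int, y: int) -> int:
    """Multiplication in GF(2^2), reduced using 1 + α + α^2 = 0."""
    product = 0
    for i in range(2):
        if (y >> i) & 1:
            product ^= x << i
    if product & 0b100:
        product ^= 0b111
    return product


def add_pcodons(a: tuple, b: tuple) -> tuple:
    """Component-wise sum a + b of two tuples of the same length n."""
    if len(a) != len(b):
        raise ValueError("p-codons must have the same length")
    return tuple(gf4_add(x, y) for x, y in zip(a, b))


def scale_pcodon(lam: int, a: tuple) -> tuple:
    """Scalar multiple λ · a, multiplying every component by λ."""
    return tuple(gf4_mul(lam, x) for x in a)


ZERO, ONE, ALPHA, ALPHA2 = 0, 1, 2, 3

gct = (ALPHA, ALPHA2, ONE)   # GCT
atg = (ZERO, ONE, ALPHA)     # ATG

print(add_pcodons(gct, atg))     # (2, 2, 3), i.e. (α α α^2) = GGC
print(scale_pcodon(ALPHA, gct))  # (3, 1, 2), i.e. (α^2 1 α) = CTG
```

Both results are again 3-tuples over B, which is the closure property used in the proofs of section 2.2.3.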
2.2.3 Proof that axioms 1 through 7 are satisfied. Having defined addition of p-codons and scalar multiplication, we are now in a position to show that set C satisfies axioms 1 through 7 and is thus a vector space. For the proofs below, it is important to keep two things in mind. First, the set of scalars B = {0, 1, α, α^2} is a field and so addition or multiplication of any members of the set always results in another member of the set (see Table 1.2). Second, C is the set of all 3-tuples whose 3 elements are chosen from B and so to show that some arbitrary 3-tuple (λ_1 λ_2 λ_3) is a member of C, we only need to show that λ_1, λ_2, and λ_3 are elements of B.
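Since C has only 64 elements and B only 4, several vector-space properties can also be confirmed by exhaustive enumeration. The Python sketch below is a numerical sanity check of a few representative properties (closure, commutativity, the zero vector, additive inverses, and distributivity); it is not a substitute for the proofs and is not claimed to cover every axiom. The integer encoding of field elements and all function names are assumptions of this sketch.

```python
from itertools import product


def gf4_add(x: int, y: int) -> int:
    """Addition in GF(2^2) with elements 0, 1, α, α^2 encoded as 0, 1, 2, 3."""
    return x ^ y


def gf4_mul(x: int, y: int) -> int:
    """Multiplication in GF(2^2), reduced using 1 + α + α^2 = 0."""
    result = 0
    for i in range(2):
        if (y >> i) & 1:
            result ^= x << i
    if result & 0b100:
        result ^= 0b111
    return result


def add(a: tuple, b: tuple) -> tuple:
    return tuple(gf4_add(x, y) for x, y in zip(a, b))


def scale(lam: int, a: tuple) -> tuple:
    return tuple(gf4_mul(lam, x) for x in a)


B = range(4)
C = list(product(B, repeat=3))  # all 64 p-codons
C_SET = set(C)
ZERO = (0, 0, 0)

checks = {
    "closed under addition": all(add(a, b) in C_SET for a in C for b in C),
    "closed under scaling": all(scale(lam, a) in C_SET for lam in B for a in C),
    "addition commutes": all(add(a, b) == add(b, a) for a in C for b in C),
    "zero vector": all(add(a, ZERO) == a for a in C),
    "additive inverse": all(add(a, a) == ZERO for a in C),
    "scalar identity": all(scale(1, a) == a for a in C),
    "distributivity": all(
        scale(lam, add(a, b)) == add(scale(lam, a), scale(lam, b))
        for lam in B for a in C for b in C
    ),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
```

Every check evaluates to True under these definitions. Note that each p-codon is its own additive inverse, a consequence of 1 + 1 = 0 in modulo 2 addition.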