.introducing dna sequence.v.0.1, Study notes of Biogenetics and Computers

DNA sequencing in Bio-Informatics

Typology: Study notes

2014/2015

Uploaded on 02/02/2015

Qazi.Masud

Exploring Bioinformatics
DIU and Team Bio-Bio-1
January 24, 2015

Contents

  • 2 Introducing DNA Sequence
    • 2.1 DNA Sequencing
    • 2.2 History of DNA Sequencing
    • 2.3 Methods of DNA Sequencing
    • 2.4 DNA Sequencing Process
    • 2.5 Automated DNA Sequencing
    • 2.6 Computer Analysis
    • 2.7 DNA Sequencing in Real Time
    • 2.8 Next Generation DNA Sequencing
    • 2.9 Sequencing Larger DNA Sequences
    • 2.10 Complete Genome Sequencing
    • 2.11 Shotgun Sequencing
    • 2.12 Challenges of DNA Sequencing
    • 2.13 Applications of DNA Sequencing
    • 2.14 DNA Sequencing: Where to Next
    • 2.15 Case Study: Human Genome Project (HGP)
    • 2.16 All Life depends on 3 critical molecules
    • 2.17 DNA Replication
    • 2.18 DNA Polymerase
    • 2.19 DNA Replication Process
    • 2.20 RNA
      • 2.20.1 RNA vs DNA
      • 2.20.2 Major RNA Types
    • 2.21 Sequence Formats
      • 2.21.1 GenBank
      • 2.21.2 Fasta
    • 2.22 Some Important Concepts
      • 2.22.1 Prokaryotes vs Eukaryotes
      • 2.22.2 Genome
      • 2.22.3 Gene
      • 2.22.4 Gene Coding Region
      • 2.22.5 Exon
      • 2.22.6 Intron
      • 2.22.7 Proteins
      • 2.22.8 Structure of Protein Coding Gene
      • 2.22.9 TATA Box
      • 2.22.10 GC Box
      • 2.22.11 CAAT Box
      • 2.22.12 Reverse Complement
      • 2.22.13 Reverse Palindrome
      • 2.22.14 Mutation
      • 2.22.15 Single Nucleotide Polymorphism (SNP)
      • 2.22.16 Hamming Distance
      • 2.22.17 Genome Rearrangement

List of Figures

  • 2.1 DNA to DNA Sequence
  • 2.2 DNA Sequencing Process
  • 2.3 Automated DNA Sequence Readouts
  • 2.4 Computer Analysis of DNA Sequence
  • 2.5 Modern Sequencer Machine
  • 2.6 Sequencing Larger DNA Sequences
  • 2.7 Whole Genome Shotgun Sequencing and Hierarchical Shotgun Sequencing
  • 2.8 Tiling Path
  • 2.9 DNA Replication
  • 2.10 DNA Replication Process
  • 2.11 GenBank Sequence Format
  • 2.12 Fasta Sequence Format
  • 2.13 Gene Structure
  • 2.14 Eukaryotic Gene Structure
  • 2.15 Eukaryotic Gene Structure (Zoom-In)
  • 2.16 Gene Coding Region, Exon and Intron
  • 2.17 Exon and Intron
  • 2.18 Structure of Protein Coding Gene
  • 2.19 Promoter Elements
  • 2.20 Reverse Complement
  • 2.21 Five Types of Chromosomal Mutations
  • 2.22 DNA molecule 1 differs from DNA molecule 2 at a single base-pair location
  • 2.23 Hamming Distance

Chapter 2

Introducing DNA Sequence

—Fokhruzzaman

—Saddam Hossain

—Nazmun Nessa Moon

2.1 DNA Sequencing

Consider the hypothesis that the single mother of all humans is Eve and the single father of all humans is Adam. How could we explore this hypothesis without knowing the code of a human being, which is encoded in DNA and undergoes evolutionary change in the form of mutations? DNA is representative of a species. Nowadays it is a routine job for any biomolecular lab to read a piece of DNA and work out its code, which is called the DNA sequence.

Definition 1. DNA Sequence: A DNA sequence is the ordered arrangement of the nucleotides that make up a DNA molecule. The length of a DNA sequence is measured as the number of nucleotides it contains; the usual unit is the base pair (bp).

Definition 2. DNA Sequencing: DNA sequencing is the process of determining the complete ordered sequence of nucleotides (A, T, C and G) of all or part of the DNA of an organism.

Figure 2.1 illustrates DNA and a DNA sequence. The sequence shown has been extracted from the DNA as the output of DNA sequencing.

  • DNA sequencing is the biochemical process of determining the exact order of the chemical building blocks (called bases and abbreviated A, T, C, and G) in a DNA oligonucleotide
  • DNA serves as the blueprint of life
  • Determining the DNA sequence is basic to understanding biological processes

Figure 2.1: DNA to DNA Sequence

  • Automated DNA sequencing is a core research tool used by almost every biochemistry research lab
  • Modern DNA sequencing technology has accelerated biological research tremendously
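To make Definitions 1 and 2 concrete, a sequencing result can be held as a plain string over the letters A, T, C and G, and its length in bp is just the string length. The sketch below is only an illustration; the function name `describe_sequence` and the example string are assumptions, not part of the notes.

```python
from collections import Counter

VALID_BASES = set("ATCG")

def describe_sequence(seq: str) -> dict:
    """Return the length in base pairs and the per-base counts of a DNA string."""
    seq = seq.upper()
    unknown = set(seq) - VALID_BASES
    if unknown:
        # Anything other than A, T, C, G is rejected in this simple sketch.
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return {"length_bp": len(seq), "counts": dict(Counter(seq))}

info = describe_sequence("ATGCGTTA")  # hypothetical 8 bp sequence
print(info["length_bp"], info["counts"])
```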

2.2 History of DNA Sequencing

Prior to the mid-1970s there was no direct method for sequencing DNA. During that period, scientists relied on their knowledge and experience and on a form of reverse genetics, in which the amino acid sequence of the gene product of interest is back-translated into a nucleotide sequence using the appropriate codons. Given the degeneracy of the genetic code, this process is tricky at best and is neither accurate nor direct. Around the mid-1970s, scientists invented direct methods for DNA sequencing and the revolution started.

The DNA of many viruses was sequenced in the early years. The first free-living organism to be completely sequenced is sometimes said to be E. coli, sequenced by Fred Blattner's group in 1997, but the genome of Haemophilus influenzae was published earlier, in 1995. Both organisms are prokaryotes. With faster automated sequencing and computational models, the Human Genome Project set the greatest landmark in the field of bioinformatics by completing the human genome sequence in 2003. About 1000 more eukaryotic genomes are currently in production.
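The degeneracy problem mentioned above can be made concrete by counting how many nucleotide sequences encode the same peptide. The sketch below uses the codon counts of the standard genetic code; the function name `count_back_translations` and the example peptide are assumptions for illustration.

```python
from math import prod

# Number of codons per amino acid (one-letter codes) in the standard genetic code.
CODONS_PER_AA = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "M": 1, "W": 1,
}

def count_back_translations(peptide: str) -> int:
    """Count the distinct coding sequences that translate to the given peptide."""
    return prod(CODONS_PER_AA[aa] for aa in peptide.upper())

# Met-Leu-Ser-Arg: 1 * 6 * 6 * 6 candidate nucleotide sequences
print(count_back_translations("MLSR"))  # 216
```

Even a four-residue peptide is ambiguous in this way, which is why back-translation cannot recover the original DNA sequence directly.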

2.4 DNA Sequencing Process

DNA sequencing is a long process. It comprises the following steps, which together produce a sequenced DNA fragment. These steps are followed in many bioinformatics laboratories.

Extraction of Genomic DNA and Sample Preparation. The first step is to extract high-quality DNA from the organism to be sequenced. Different kits and protocols are available to extract clean genomic DNA efficiently from the organism in question. Starting with whole-genome DNA or targeted gene fragments, the initial step in the process is a universal library preparation for any sample.

Genome Fragmentation and DNA Amplification. The genome map is a prerequisite for the sequencing task. From the genome map, the region of the genome to be sequenced is identified first. Whole chromosomes are too large to deal with, so the DNA is broken into manageably sized, overlapping segments. A set of clones from this region is then selected; these are the mapped clones. To sequence a piece of DNA, we first need to amplify it. This is sometimes done by a process called the polymerase chain reaction (PCR). An alternative to PCR is to clone the DNA piece by inserting it into the DNA of a bacterium. Replicating the bacterium then replicates the DNA.
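The idea of breaking DNA into overlapping segments can be sketched in code. This is a simplified, deterministic illustration (real fragmentation is random); the function name `overlapping_segments`, the segment size and the overlap are all assumed values.

```python
def overlapping_segments(seq: str, size: int, overlap: int) -> list:
    """Split seq into segments of `size` bases; neighbours share `overlap` bases."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    segments = []
    for start in range(0, len(seq), step):
        segments.append((start, seq[start:start + size]))
        if start + size >= len(seq):
            break  # the last segment already reaches the end of the sequence
    return segments

for start, segment in overlapping_segments("ATGCGTACGTTAGC", size=6, overlap=2):
    print(start, segment)
```

Because each segment shares its last bases with the start of the next one, the original order can later be recovered from the overlaps.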

Library Creation. The amplified DNA, generated by either PCR or cloning, is gathered in a library. This pool of clones acts as a clone library for further sequencing work. The DNA fragments are already denatured (i.e. melted), so that the two strands have split apart.

Template Preparation. In this stage DNA is purified from the smaller clones. The denatured DNA is added to the reaction using the wet-lab set-up for the chosen sequencing chemistry (any of the methods discussed above).

Gel Electrophoresis. The fragments are separated on a gel by gel electrophoresis and their lengths are determined. Working out the DNA sequence is now like solving a jigsaw puzzle.
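As a rough sketch of how fragment lengths lead to a sequence, assume a chain-termination chemistry in which every fragment ends at a labelled base whose identity is known. This is an assumption, since these notes do not state which chemistry is used at this step. Sorting the fragments by length then spells out the newly synthesized strand. The function name `read_from_fragments` is hypothetical.

```python
def read_from_fragments(fragments) -> str:
    """fragments: iterable of (length, terminal_base) pairs, in any order."""
    ordered = sorted(fragments)  # shortest fragment first, as read off a gel
    lengths = [length for length, _ in ordered]
    if lengths != list(range(1, len(ordered) + 1)):
        raise ValueError("missing or duplicate fragment lengths")
    return "".join(base for _, base in ordered)

# Hypothetical gel result: lengths 1..4 with their terminal bases
fragments = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_from_fragments(fragments))  # ATGC
```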

Sequence Assembly Using Genome Mapping. As noted earlier, the genome map is a prerequisite for DNA sequencing, and it is needed again in this step. In the previous steps, the chromosomes were cut into large pieces that were cloned into bacteria, creating a whole library of DNA segments. The segments were cut open to look for common sequence landmarks in overlapping fragments. These landmarks were used to fingerprint the fragments, so that the position of each fragment in the chromosome is known; this is called mapping. The fragments were then cut into smaller pieces, the process was repeated, and the small fragments were sequenced. Finally the whole sequence is built up by assembling the fragments using the corresponding genome maps and fingerprints.

If shotgun sequencing is applied, it dispenses with the need for mapping and so is much faster. It involves chopping the DNA into fragments of about 2,000 base pairs (bp) and 10,000 bp. The first and last 500 bp of each fragment are sequenced. Computer algorithms then assemble the sequenced fragments to reconstruct the sequence of the initial DNA. Shotgun sequencing is much faster: it took a matter of months to obtain a draft sequence of the fruit fly, Drosophila melanogaster (135M bp), while a conventional sequencing effort may take several years to achieve a similar level of completion. However, assembling the pieces for a large genome like the human genome (3 × 10^9 bp) requires very powerful computers. Repetitive DNA, which is common in eukaryotic genomes, causes great difficulties in the assembly process and may lead to a wrong assembly.
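The fragment-and-read-the-ends step can be sketched as a small simulation. The fragment sizes (2,000 and 10,000 bp) and the 500 bp read length come from the text above; everything else (the function name `shotgun_end_reads`, the random seed, the toy genome) is assumed. For simplicity both reads are taken from the same strand, whereas a real protocol would read the far end from the opposite strand.

```python
import random

def shotgun_end_reads(genome, n_fragments, sizes=(2000, 10000), read_len=500, seed=0):
    """Cut random fragments and return (first, last) reads of read_len bases each."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    reads = []
    for _ in range(n_fragments):
        size = min(rng.choice(sizes), len(genome))
        start = rng.randrange(len(genome) - size + 1)
        fragment = genome[start:start + size]
        reads.append((fragment[:read_len], fragment[-read_len:]))
    return reads

# Toy random "genome" of 20,000 bp, for illustration only
toy_genome = "".join(random.Random(1).choice("ACGT") for _ in range(20000))
pairs = shotgun_end_reads(toy_genome, n_fragments=5)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```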

Pre-finishing. Special sequencing techniques are used to produce high-quality sequences. This is a crucial step, because the DNA sequence is cleaned up here and made less error-prone. Genomes must be sequenced several times over on average, both to ensure that complete coverage of the genome is achieved and because sequencing data is somewhat error-prone.
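"Several times over" is usually expressed as coverage, c = N × L / G, for N reads of length L over a genome of G bases. The sketch below also uses the idealized Lander-Waterman estimate 1 − e^(−c) for the fraction of the genome covered, which assumes uniformly random reads and is only an approximation. The function names and the tenfold target are assumptions; the genome size and read length are the Drosophila and 500 bp figures used in these notes.

```python
from math import ceil, exp

def coverage(n_reads: int, read_len: int, genome_len: int) -> float:
    """Average number of times each base is sequenced."""
    return n_reads * read_len / genome_len

def reads_needed(genome_len: int, read_len: int, target_coverage: float) -> int:
    """Reads required to reach the target average coverage."""
    return ceil(target_coverage * genome_len / read_len)

def expected_fraction_covered(c: float) -> float:
    """Idealized estimate (uniform random reads) of the fraction of bases covered."""
    return 1 - exp(-c)

n = reads_needed(135_000_000, 500, 10)  # Drosophila-sized genome at 10x
print(n, coverage(n, 500, 135_000_000), expected_fraction_covered(10))
```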

Finishing. This is the stage where the final sequenced DNA is produced; it is the last in-laboratory quality pass. The quality of sequence data may vary depending on the purity and concentration of the template DNA, the presence of extra PCR bands, and the quality of the dye-terminators, electrophoresis matrix, and other reagents. Once the required level of quality is ensured, the sequenced DNA is ready to be published as DNA sequence data for further use.

Data Editing and DNA Annotation. To make the sequenced DNA available for later bioinformatics research, it needs to be stored in a library or genome bank. Before submission to public databases, some steps of quality assurance, verification and biological annotation are needed.

2.5 Automated DNA Sequencing

In automated DNA sequencing, we don't even have to "read" the sequence from the gel: the computer does that for us! A computer read-out of the gel generates a "false color" image in which each color corresponds to a base. The intensities are then translated into peaks that represent the sequence. This is a plot of the colors detected in one "lane" of a gel (one sample), scanned from the smallest fragments to the largest. The computer even interprets the colors by printing the

Figure 2.4: Computer Analysis of DNA Sequence

  • The peaks reveal the sequence of bases in the original DNA sample
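A minimal sketch of how peaks become letters: assume each peak position has four intensities, one per dye color, and call the base whose channel is strongest. The channel order A, C, G, T, the function name `call_bases` and the intensity values are all assumptions; real base callers also correct for noise, dye mobility and peak spacing.

```python
BASES = "ACGT"  # assumed channel order of the four dye colors

def call_bases(peaks) -> str:
    """peaks: list of 4-tuples of intensities, one tuple per peak position."""
    called = []
    for intensities in peaks:
        strongest = max(range(4), key=lambda i: intensities[i])
        called.append(BASES[strongest])
    return "".join(called)

# Hypothetical intensities for three consecutive peaks
peaks = [(0.9, 0.1, 0.0, 0.1), (0.1, 0.0, 0.1, 0.8), (0.0, 0.1, 0.7, 0.2)]
print(call_bases(peaks))  # ATG
```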

2.7 DNA Sequencing in Real Time

Imagine that you visit a doctor, and the doctor prescribes some diagnostic blood tests along with another test to find the DNA sequence of a specific region of your genome. The doctor may need the sequence of that region to draw a conclusion about the presence or absence of a DNA pattern there. The day will come when sequencing a fragment of DNA is almost routine work in clinical practice. The primary requirement for this is a high-speed, ultra-fast DNA sequencing mechanism. Several bioinformatics research institutes, academics, and companies are working towards this goal.

Such future techniques of DNA sequencing must differ substantially from the current and probable next-generation methods, and must offer very high throughput. One technique that may open the door to this future is the nanopore sequencing approach. In this approach the nucleic acids are driven through a nanopore (either a biological membrane protein such as alpha-hemolysin or a synthetic pore). Fluctuations in DNA conductance through the pore or, potentially, the detection of interactions of individual bases with the pore are used to infer the nucleotide sequence. Although progress has been made in achieving early proof-of-concept demonstrations with such methods, major technical challenges remain along the path to a truly practical nanopore-based sequencing platform.

Figure 2.5: Modern Sequencer Machine

2.8 Next Generation DNA Sequencing

The expected and proposed next generation of DNA sequencing must be able to run thousands of individual mini-sequencing reactions on a single plate, so that millions of base pairs can be sequenced in a single run. Sequence capture arrays will become available to focus sequencing on genes of interest. This will help in the comprehensive analysis of full-length genes. Obviously this will be expensive in terms of instruments and reagents.

2.9 Sequencing Larger DNA Sequences

Overlapping sequence data from many clones is analyzed by powerful computers, which regenerate the full-length sequence by piecing the short sequences together like a puzzle. Entire genomes can be sequenced in this manner.
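The puzzle-piecing idea can be sketched with a greedy overlap merge: repeatedly join the two reads with the longest suffix-to-prefix overlap. This is a toy illustration, not the algorithm of any particular assembler; it ignores sequencing errors, reverse strands, reads contained inside other reads, and repeats. The function names and the example reads are assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len: int = 3) -> list:
    """Merge reads pairwise by best overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # nothing overlaps any more; return the separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["GTACGTTA", "ATGCGTAC", "GTTAGC"]))
```

With repetitive DNA, several reads can overlap equally well, and a greedy choice like this one may join the wrong pair.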

2.10 Complete Genome Sequencing

Complete genome sequencing refers to sequencing the DNA of an organism end to end. Whole-genome sequencing strategies are based on genome mapping (i.e. physical mapping), primer walking, shotgun sequencing, etc. The genome map is the primary tool for starting or planning a complete DNA sequencing project.

Figure 2.7: Whole Genome Shotgun Sequencing and Hierarchical Shotgun Sequencing

  • Assembles a linear map from sub-clone sequences without knowing their order on the chromosome
  • Contigs are assembled based on alignments of all possible sequence pairs in the computer
  • Up to tenfold redundancy is used to ensure correctness
  • First used for H. influenzae. Now routinely used for microbial genomes and genome fragments
  • Can shotgun sequencing be used for genomes with repetitive sequences?

· Celera Genomics used it for Drosophila after removing repetitive regions

· Some other methods were used to avoid the problem

2.12 Challenges of DNA Sequencing

The main challenge of DNA sequencing is that there is no machine that takes long DNA as input and gives the complete sequence as output. The available methods can only sequence around 500 letters (base pairs) at a time. Increasing the sensitivity of current instruments (in terms of sequence length) is essential. In the chemistry lab, there is a need for additional fluor combinations to enable reaction multiplexing, which can save time and money.

Lowering the cost of sequencing is another challenge ahead of us, along with increasing the throughput. Historically, most cost decreases have been incremental rather than monumental. In this case large cost decreases are needed, which may require revolutionary approaches. Making the applications and related instruments widely available is another concern in DNA sequencing.

Statistics suggest that current laboratory set-ups for DNA sequencing (e.g. the 3100 Genetic Analyzer) can sequence on the order of 100 to 200 samples in one day: 16-20 samples make up one run, 6-10 runs fill a plate, and 2 plates can be loaded at once. So the average daily capacity is about 200 samples. This throughput is very low compared with the current and upcoming demand for sequenced DNA. A well-maintained machine is also vital to a successful sequence. The final challenge of DNA sequencing lies in the analysis of the data.
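The throughput figures above can be checked with a little arithmetic. The sketch multiplies the quoted ranges (samples per run, runs per plate, plates) and then estimates how long a single machine would need for one pass over a human-sized genome, assuming (an assumption, not a figure from these notes) that each sample yields one 500 bp read.

```python
def daily_samples(samples_per_run: int, runs_per_plate: int, plates: int) -> int:
    """Samples processed per day for one machine."""
    return samples_per_run * runs_per_plate * plates

low = daily_samples(16, 6, 2)     # lower end of the quoted ranges
high = daily_samples(20, 10, 2)   # upper end of the quoted ranges
print(low, high)                  # 192 400

# Assumed: one 500 bp read per sample, single pass (1x) over 3 * 10^9 bp
reads_for_one_pass = 3_000_000_000 // 500
print(reads_for_one_pass / low)   # days needed at the lower-end throughput
```

The lower end, 192 samples, matches the "about 200 samples" figure, and the resulting number of days shows why such throughput cannot meet the demand for whole-genome sequencing.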

2.13 Applications of DNA Sequencing

By some estimates there are about 100 million species, and each individual has different DNA. Even within an individual, some cells have different DNA (e.g. in cancer). How many sequences are there? They all need to be sequenced before the population can be studied from any direction. If we want to explore which genes are switched on, when, and in which cell, we need to know the sequence first. Where do molecules bind to DNA? This study also needs sequenced DNA. The

Opportunities for discovery are virtually endless, from complex diseases to paleogenomics and museomics (analysis of ancient DNA), from searching for new organisms in the deep ocean and volcanoes to manipulating valuable traits in livestock and molecular plant breeding. This is where the challenges as well as major opportunities lie in the future.

2.15 Case Study: Human Genome Project (HGP)

The Human Genome Project was a large, federally funded collaborative project completed in 2003. It was a USD 3 billion project to sequence the human genome of 3 billion nucleotides. The project developed from an idea discussed at scientific meetings in 1984 and 1985, and a pilot project, the Human Genome Initiative, was begun by the Department of Energy (DOE) in 1986. National Institutes of Health funding of the project began in 1987 under the Office of Genome Research. The program was later constituted as the National Human Genome Research Institute. In 1998, a new commercial venture under the leadership of Craig Venter was formed to sequence the majority of the human genome using intensive computer processing of data; it also completed the Drosophila and mouse genome sequences. Both groups jointly announced a working draft of the human genome in 2000, and the public project declared the sequence complete in 2003. Officially the project ran from 1990 to 2003. It was the largest individual project ever undertaken in bioinformatics research.

Though the first human genome was sequenced at a cost of USD 3 billion and around 13 years of labor, by 2005 the cost had come down to USD 20-30 million, with a timeline of 6 months. The cost has already come down to about USD 50,000-100,000, but this is still a very high price for sequencing a human genome.

The blueprint of life is contained in the DNA, in the nuclei of eukaryotic cells and simply within prokaryotic cells. But the Human Genome Project has only obtained the list of approximately 3 × 10^9 bases (As, Cs, Gs and Ts) in the 23 chromosomes. Extracting useful information from this list, and from the genome sequences of other organisms, relies on computer-intensive data handling: bioinformatics.
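The cost figures quoted above are easier to compare as a cost per base. The sketch below divides each quoted total by the approximate genome size of 3 × 10^9 bases; the labels and the function name `cost_per_base` are assumptions for illustration.

```python
GENOME_BASES = 3_000_000_000  # approximate human genome size used in the text

def cost_per_base(total_cost_usd: float) -> float:
    """Cost in USD of sequencing one base, given the total cost of one genome."""
    return total_cost_usd / GENOME_BASES

quoted_costs = {
    "HGP (completed 2003)": 3_000_000_000,
    "2005, lower bound": 20_000_000,
    "2005, upper bound": 30_000_000,
    "later, lower bound": 50_000,
    "later, upper bound": 100_000,
}
for label, cost in quoted_costs.items():
    print(f"{label}: USD {cost_per_base(cost):.6f} per base")
```

On these numbers the original project cost about one dollar per base, and the 2005 figure is roughly a hundred times cheaper.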

The sequencing strategy for this project was clone-based physical mapping. The whole genome was digested and cloned into Bacterial Artificial Chromosomes (BACs). These BACs were then digested to create fingerprints, and the fingerprints were used to organize the BACs into contigs. BAC clones were selected for sequencing. After shearing the BACs and shotgun cloning, the sequenced clones were assembled using the genome map to reconstruct the whole genome from the sequenced DNA.
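One way to picture fingerprint-based contig building: treat each BAC fingerprint as a set of restriction fragment sizes, say that two clones overlap when they share enough fragments, and group overlapping clones together. The sketch below is a simplified illustration, not the actual procedure used by the project. The clone names, fragment sizes and the `min_shared` threshold are hypothetical, and real fingerprint comparison allows for measurement tolerance in fragment sizes.

```python
def build_contigs(fingerprints: dict, min_shared: int = 2) -> list:
    """Group clones whose fingerprints share at least min_shared fragment sizes."""
    parent = {clone: clone for clone in fingerprints}

    def find(clone):
        # Union-find lookup with path halving.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    clones = list(fingerprints)
    for i, a in enumerate(clones):
        for b in clones[i + 1:]:
            if len(fingerprints[a] & fingerprints[b]) >= min_shared:
                parent[find(a)] = find(b)  # a and b belong to the same contig

    groups = {}
    for clone in clones:
        groups.setdefault(find(clone), []).append(clone)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical fingerprints (fragment sizes in bp)
example = {
    "BAC1": {1200, 3400, 560, 7800},
    "BAC2": {3400, 560, 910, 2300},
    "BAC3": {910, 2300, 4100},
    "BAC4": {150, 6600, 980},
}
print(build_contigs(example))  # [['BAC1', 'BAC2', 'BAC3'], ['BAC4']]
```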

Figure 2.8: Tiling Path

The working-draft DNA sequence and the more polished 2003 version represent an enormous achievement, akin in scientific importance, some say, to developing the periodic table of elements. And, as in most major scientific advances, much work remains to realize the full potential of the accomplishment. Deriving meaningful knowledge from DNA sequences will define biological research through the coming decades and require the expertise and creativity of teams of biologists, chemists, engineers, and computational scientists, among others. A sampling follows of some research challenges in genetics: what we still don't know, even with the full human DNA sequence in hand.

2.16 All Life depends on 3 critical molecules

  • DNAs

· Hold information on how cell works

  • RNAs

· Act to transfer short pieces of information to different parts of cell

· Provide templates to synthesize into protein

  • Proteins

· Form enzymes that send signals to other cells and regulate gene activity

· Form body's major components (e.g. hair, skin, etc.)

· Are life's laborers!
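The flow of information between the three molecules (DNA to RNA to protein) can be sketched in a few lines. The codon table below is the standard genetic code written compactly; the function names and the example DNA string are assumptions, and the sketch treats the input as the coding strand and ignores splicing and other processing.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; "*" marks stop.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def transcribe(dna: str) -> str:
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

mrna = transcribe("ATGGCCTGGTAA")  # hypothetical coding sequence
print(mrna, translate(mrna))        # AUGGCCUGGUAA MAW
```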