Understanding DNA Replication: Identifying the Origin of Replication (OriC), Lecture notes of Biology

Biology

This document covers the process of DNA replication, focusing on the identification of the origin of replication (OriC) in a DNA sequence. It reviews the biology of DNA strands, the role of DNA polymerases, and the differences between prokaryotic and eukaryotic replication. It also covers the implications of leaving single-stranded DNA exposed and the mutations that can occur. The document concludes by suggesting a new algorithm for finding OriC based on genome-wide GC skew.

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

rubytuesday 🇺🇸


Where does DNA Replication Begin?

continued from last time...

After a couple of false starts, we continue our quest to develop an algorithm for finding the origin of replication (OriC) locus in a DNA sequence.


Let's take a closer look at the biology

Recall DNA Strands have Directions:

Replication progresses in one direction

Beginning at the oriC locus, the DNA molecule is pulled apart and two DNA polymerases, one on each strand, begin copying.

As they progress, the DNA separates more. The boundary between single-stranded and double-stranded DNA is called the replication fork. Eventually, this separation exposes a significantly large stretch of single-stranded DNA on each strand.

Once the replication fork opens enough...

This open region of single-stranded DNA eventually allows a second phase of the replication process to begin. A second DNA polymerase detects a primer sequence, then starts replicating the exposed sequence ahead of it, working towards the beginning of the previous replication primer.

However, this DNA polymerase does not have too far to go.

Eventually the whole genome is replicated

The lengths of Okazaki fragments in prokaryotes and eukaryotes differ. Prokaryotes tend to have longer Okazaki fragments (≈ 2,000 nucleotides long) than eukaryotes (100 to 200 nucleotides long).

Once completed, the adjacent Okazaki fragments are joined by another important protein called DNA ligase.

[Figure: cytosine and uracil]

Observations

The leading half-strand is copied as a single contiguous piece that progresses at a uniform rate as the DNA separates

The other, lagging half-strand lies exposed while waiting for the gap to enlarge enough and for another primer sequence to appear so that another DNA polymerase can start

Replication on the lagging half-strand proceeds in a stop-and-go fashion, extending by one Okazaki fragment at a time

A DNA repair mechanism then comes along to fix all of the lagging half-strand fragments

What is the downside of leaving single-stranded DNA exposed?

Single-stranded DNA is less stable than double-stranded

Single-stranded DNA can potentially mutate when exposed

The most common mutation type is called deamination

Deamination tends to convert C nucleotides into T nucleotides.

Let's look for evidence

Recall Thermotoga petrophila from last lecture (the bacterium whose k-mers did not match the frequent ones that we found in Vibrio cholerae). Let's examine the nucleotide counts on either side of its OriC region:

base     Total   Forward   Reverse     Diff
C       427419    207901    219518   -11617
G       413241    211607    201634     9973
A       491488    247525    243963     3562
T       491363    244722    246641    -1919

The lagging strand in the forward direction corresponds to exposed Cs, while Gs in the reverse direction correspond to Cs of the lagging strand. Thus, the lagging strands have 9973 + 11617 = 21590 fewer Cs than the leading strands.
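The arithmetic behind that total can be checked with a few lines of Python. This is only a sketch: the counts are copied from the table above, and the names `counts` and `strand_differences` are made up for illustration.

```python
# Counts copied from the table above: base -> (total, forward, reverse)
counts = {
    "C": (427419, 207901, 219518),
    "G": (413241, 211607, 201634),
    "A": (491488, 247525, 243963),
    "T": (491363, 244722, 246641),
}

def strand_differences(table):
    """Return the forward-minus-reverse count for each base."""
    return {base: fwd - rev for base, (total, fwd, rev) in table.items()}

diffs = strand_differences(counts)

# The G surplus and the C deficit on the forward half both point the same way,
# so their magnitudes add: 9973 + 11617.
c_deficit = diffs["G"] - diffs["C"]

for base in "CGAT":
    print(base, diffs[base])
print("combined:", c_deficit)
```

Note that A and T show a much smaller imbalance than C and G, which is what makes the C/G difference a useful signal.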

Code for reading sequences from last time

import gzip

def loadFasta(filename):
    """ Parses a classically formatted and possibly compressed FASTA file into
        a list of headers and fragment sequences for each sequence contained"""
    # open in text mode so that the string operations below work in Python 3
    if filename.endswith(".gz"):
        fp = gzip.open(filename, 'rt')
    else:
        fp = open(filename, 'r')
    # split at headers
    data = fp.read().split('>')
    fp.close()
    # ignore whatever appears before the 1st header
    data.pop(0)
    headers = []
    sequences = []
    for sequence in data:
        lines = sequence.split('\n')
        headers.append(lines.pop(0))
        # add an extra "+" to make string "1-referenced"
        sequences.append('+' + ''.join(lines))
    return (headers, sequences)

Counting base occurrences in large strings

Here's a somewhat standard approach to counting characters in a string.

def getStatsV1(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if terC > len(sequence):
        terC = terC - len(sequence) + 1
    total = {base: 0 for base in "ACGT"}
    forwardCount = {base: 0 for base in "ACGT"}
    reverseCount = {base: 0 for base in "ACGT"}
    # position 0 holds the "+" padding, so start scanning at 1
    for position in range(1, len(sequence)):
        base = sequence[position]
        total[base] += 1
        if terC > start:
            # forward half does not wrap around the end of the string
            if position >= start and position < terC:
                forwardCount[base] += 1
            else:
                reverseCount[base] += 1
        else:
            # forward half wraps around the end of the string
            if position >= start or position < terC:
                forwardCount[base] += 1
            else:
                reverseCount[base] += 1
    return {key: (total[key], forwardCount[key], reverseCount[key]) for key in total}
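The wrap-around arithmetic for terC is easy to misread, so here is a small sketch that lists which positions the logic in getStatsV1 treats as the forward half. The helper name `forward_half_positions` and the toy sequence are made up for illustration; the sketch assumes the same "+"-padded, 1-referenced string convention as loadFasta.

```python
def forward_half_positions(sequence, start):
    """List the 1-referenced positions that fall in the forward half.

    Assumes sequence[0] is the '+' padding character, as produced by loadFasta.
    """
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle the genome's circular nature, skipping the padding at index 0
    if terC > len(sequence):
        terC = terC - len(sequence) + 1
    positions = range(1, len(sequence))
    if terC > start:
        return [p for p in positions if start <= p < terC]
    return [p for p in positions if p >= start or p < terC]

toy = "+ACGTACGT"                        # 8 bases plus the padding character
no_wrap = forward_half_positions(toy, 2)  # half fits before the end
wrapped = forward_half_positions(toy, 7)  # half runs past the end and wraps
print(no_wrap)
print(wrapped)
```

In both cases the forward half covers four of the eight bases; only the second one crosses the end of the string.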

Another way to count

This version makes four passes, one for each base, but moves the dictionary overhead outside of the linear scan.

def getStatsV2(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if terC > len(sequence):
        terC = terC - len(sequence) + 1
    stats = {}
    for base in "ACGT":
        total = sequence.count(base)
        if terC > start:
            # count the contiguous forward half, derive the reverse half
            forwardCount = sequence[start:terC].count(base)
            reverseCount = total - forwardCount
        else:
            # forward half wraps, so count the contiguous reverse half instead
            reverseCount = sequence[terC:start].count(base)
            forwardCount = total - reverseCount
        stats[base] = (total, forwardCount, reverseCount)
    return stats

One more contender

Python provides an optimized library called "numpy" for processing vectorized data. Our sequence can be considered a vector of bases.

import numpy

def getStatsV3(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if terC > len(sequence):
        terC = terC - len(sequence) + 1
    # view the sequence as a vector of byte codes
    # (numpy.fromstring's binary mode is deprecated, so use frombuffer)
    genome = numpy.frombuffer(sequence.encode('ascii'), dtype="uint8")
    total = numpy.bincount(genome)
    if terC > start:
        forwardCount = numpy.bincount(genome[start:terC])
        reverseCount = total - forwardCount
    else:
        reverseCount = numpy.bincount(genome[terC:start])
        forwardCount = total - reverseCount
    return {b: (total[ord(b)], forwardCount[ord(b)], reverseCount[ord(b)]) for b in "ACGT"}

Verify and time it

answer = getStatsV3(seq[0], oriCStart + oriOffset)

for base in "CGAT":
    total, forwardCount, reverseCount = answer[base]
    print("%s: %8d %8d %8d %8d" % (base, total, forwardCount, reverseCount, forwardCount - reverseCount))
print()

%timeit getStatsV2(seq[0], oriCStart + oriOffset)
%timeit getStatsV3(seq[0], oriCStart + oriOffset)

C:   427419   207901   219518   -11617
G:   413241   211607   201634     9973
A:   491488   247525   243963     3562
T:   491363   244723   246640    -1917

10 loops, best of 3: 36.8 ms per loop
100 loops, best of 3: 14.7 ms per loop

Counting with cumulative sums

We'll use a vectorized cumulative sum method to compute the G-C skew genome-wide. Given an input vector, V, of length N, S = V.cumsum() returns:

$S_i = \sum_{j=0}^{i} V_j$

Cumulative sums can be used to compute counts over any interval. If a leading 0 is prepended to S, so that $S_i = \sum_{j=0}^{i-1} V_j$, then:

$\mathrm{Count}[i, j) = S_j - S_i$

Example:

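# random 0/1 vector, with a leading 0 prepended to its cumulative sum
v = numpy.array(numpy.random.random(20) < 0.25, dtype="int8")
s = numpy.concatenate(([0], v.cumsum()))
print(v)
print(s)
# number of ones in v[5:15]
print(s[15] - s[5])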

[0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 2 3 3 4 5 6 7 7 7 7 7 7 7 7 7 7]
5
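To make the interval formula concrete without random input, the sketch below replays the printed vector with plain Python. It uses `itertools.accumulate` as a stand-in for numpy's `cumsum` (an illustrative substitution, not the lecture's code), and the helper name `count_interval` is made up.

```python
from itertools import accumulate

# The vector printed in the example above
v = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# Prefix sums with a leading 0, so s[i] is the number of ones in v[0:i]
s = [0] + list(accumulate(v))

def count_interval(prefix, i, j):
    """Number of ones in the half-open interval [i, j) of the original vector."""
    return prefix[j] - prefix[i]

ones_5_to_15 = count_interval(s, 5, 15)
print(ones_5_to_15)
print(sum(v[5:15]))  # direct count over the same interval, for comparison
```

Once the prefix sums exist, every interval count costs a single subtraction, regardless of the interval's length. That is what makes a genome-wide scan affordable.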

Finding the genome-wide GC skew

def GCSkew(sequence):
    half = len(sequence) // 2
    full = len(sequence)
    # double the sequence so that circular windows become contiguous slices
    genome = numpy.frombuffer((sequence + sequence).encode('ascii'), dtype='uint8')
    # zero-padded cumulative counts of Cs and Gs
    matchC = numpy.concatenate(([0], numpy.array(genome == ord('C'), dtype="int8").cumsum()))
    matchG = numpy.concatenate(([0], numpy.array(genome == ord('G'), dtype="int8").cumsum()))
    matchGC = matchG - matchC
    # (G - C) in the half following each position minus (G - C) in the half preceding it
    skew = matchGC[half:half + full] - matchGC[0:full] + matchGC[full - half:2 * full - half] - matchGC[full:2 * full]
    return skew

Let's test the function on the short sequence: CATGGGCATCGGCCATACGCC

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

test = "+CATGGGCATCGGCCATACGCC"
y = GCSkew(test)
plt.figure(num=None, figsize=(24, 8), dpi=100)
plt.ylim([-10, 10])
plt.xticks(range(len(test)), [c for c in test])
result = plt.plot(range(len(y)), y)
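To see what the vectorized GCSkew expression computes, it can help to compare it with a brute-force version. The sketch below is a reading of the slicing arithmetic in GCSkew, not code from the lecture: for each position it takes (G - C) over the half-genome that follows and subtracts (G - C) over the half-genome that precedes it, wrapping around circularly. The name `naive_gc_skew` is made up.

```python
def naive_gc_skew(sequence):
    """Brute-force windowed G-C skew on a circular sequence, O(n^2) time."""
    full = len(sequence)
    half = full // 2

    def gc(base):
        # +1 for G, -1 for C, 0 for anything else
        return (base == 'G') - (base == 'C')

    skew = []
    for i in range(full):
        # (G - C) over the half starting at position i
        ahead = sum(gc(sequence[(i + k) % full]) for k in range(half))
        # (G - C) over the half ending just before position i
        behind = sum(gc(sequence[(i - half + k) % full]) for k in range(half))
        skew.append(ahead - behind)
    return skew

tiny = naive_gc_skew("GGCC")
print(tiny)
print(naive_gc_skew("+CATGGGCATCGGCCATACGCC"))
```

On the four-base toy input the value peaks at position 0, where the G-rich half begins, and bottoms out at position 2, where the C-rich half begins. The quadratic running time is the reason the lecture's version relies on cumulative sums instead.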
