Understanding DNA Replication: Identifying the Origin of Replication (OriC), Lecture notes of Biology

These notes cover the process of DNA replication, focusing on the identification of the origin of replication (OriC) in a DNA sequence. They review the biology of DNA strands, the role of DNA polymerases, and the differences between prokaryotic and eukaryotic replication. They also cover the implications of leaving single-stranded DNA exposed and the mutations that can occur. The document concludes by suggesting a new algorithm for finding OriC based on genome-wide GC skew.

Typology: Lecture notes

2021/2022

Uploaded on 09/12/2022

rubytuesday

Where does DNA Replication Begin?

continued from last time...

After a couple of false starts, we continue our quest to develop an algorithm for finding the origin of replication, the OriC locus, in a DNA sequence.

Let's take a closer look at the biology

Recall DNA Strands have Directions:

Replication progresses in one direction

Beginning at the OriC locus, the DNA molecule is pulled apart, and two DNA polymerases, one on each strand, begin copying.

As they progress, the DNA separates further. The boundary between single-stranded and double-stranded DNA is called the replication fork. Eventually, this separation exposes a long stretch of single-stranded DNA on each strand.

Once the replication fork opens enough...

This open region of single-stranded DNA eventually allows a second phase of the replication process to begin. A second DNA polymerase detects a primer sequence, starts replicating the exposed sequence ahead of it, and works toward the beginning of the previous replication primer. However, this DNA polymerase does not have far to go.

Eventually the whole genome is replicated

The short pieces copied in this second phase are called Okazaki fragments. Their lengths differ between prokaryotes and eukaryotes: prokaryotes tend to have longer Okazaki fragments (≈ 2,000 nucleotides long) than eukaryotes (100 to 200 nucleotides long).

Once completed, the adjacent Okazaki fragments are joined by another important protein called DNA ligase.

Observations

The leading half-strand is copied as a single contiguous piece that progresses at a uniform rate as the DNA separates

The other, lagging half-strand lies exposed while waiting for the gap to enlarge enough, and until another primer sequence appears so that another DNA polymerase can start

Replication on the lagging half-strand proceeds in a stop-and-go fashion, extending by one Okazaki fragment at a time

A DNA repair mechanism then comes along to join all of the lagging half-strand fragments

What is the downside of leaving single-stranded DNA exposed?

Single-stranded DNA is less stable than double-stranded

Single-stranded DNA can potentially mutate when exposed

The most common mutation type is called deamination

Deamination tends to convert C nucleotides into T nucleotides (cytosine is deaminated to uracil, which is then read as T)

Let's look for evidence

Recall Thermotoga petrophila from last lecture (the bacterium whose k-mers did not match the frequent ones that we found in Vibrio cholerae). Let's examine the nucleotide counts on either side of its OriC region:

base     Total   Forward   Reverse     Diff
C       427419    207901    219518   -11617
G       413241    211607    201634     9973
A       491488    247525    243963     3562
T       491363    244722    246641    -1919

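In the forward half, the sequenced strand is the lagging strand, so its exposed Cs are the ones lost (11617 fewer Cs than in the reverse half). In the reverse half, the lagging strand is the complementary one, so its Cs show up as Gs in our sequence (9973 fewer Gs than in the forward half). Thus, the lagging strands have 9973 + 11617 = 21590 fewer Cs than the leading strands.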
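The arithmetic behind that total can be checked directly from the table. The snippet below is only a sketch: the counts are copied from the table, and the variable names are made up for illustration.

```python
# Counts copied from the table above: base -> (forward, reverse)
counts = {
    "C": (207901, 219518),
    "G": (211607, 201634),
    "A": (247525, 243963),
    "T": (244722, 246641),
}

# Diff column: forward count minus reverse count
diff = {base: fwd - rev for base, (fwd, rev) in counts.items()}

# Cs missing from the forward half, plus Gs missing from the reverse half
# (a missing G on the sequenced strand is a missing C on its complement)
c_missing_forward = -diff["C"]
g_missing_reverse = diff["G"]
lagging_c_shortfall = c_missing_forward + g_missing_reverse

print(diff)
print(lagging_c_shortfall)
```

Note that the A and T differences are much smaller than the C and G differences, which is what makes the G-C imbalance a usable signal.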

Code for reading sequences from last time

import gzip

def loadFasta(filename):
    """ Parses a classically formatted and possibly compressed FASTA file into
        a list of headers and fragment sequences for each sequence contained"""
    # open in text mode so the contents can be split on str delimiters
    if (filename.endswith(".gz")):
        fp = gzip.open(filename, 'rt')
    else:
        fp = open(filename, 'r')
    # split at headers
    data = fp.read().split('>')
    fp.close()
    # ignore whatever appears before the 1st header
    data.pop(0)
    headers = []
    sequences = []
    for sequence in data:
        lines = sequence.split('\n')
        headers.append(lines.pop(0))
        # add an extra "+" to make string "1-referenced"
        sequences.append('+' + ''.join(lines))
    return (headers, sequences)
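To see what the "1-referenced" convention looks like in practice, here is a small sketch that applies the same splitting logic to an in-memory string. The helper name parse_fasta_text and the toy records are hypothetical, introduced only for this illustration; they are not part of the lecture code.

```python
def parse_fasta_text(text):
    """Hypothetical in-memory variant of the loader: same splitting logic, no file I/O."""
    records = text.split('>')[1:]            # drop whatever precedes the first header
    headers, sequences = [], []
    for record in records:
        lines = record.split('\n')
        headers.append(lines.pop(0))         # first line of a record is its header
        sequences.append('+' + ''.join(lines))   # '+' occupies index 0
    return headers, sequences

example = ">toy genome\nCATG\nGGCA\n>second\nTT\n"
headers, sequences = parse_fasta_text(example)

print(headers)
print(sequences)
print(sequences[0][1])           # first base of the first record
print(len(sequences[0]) - 1)     # number of bases, excluding the '+' pad
```

Because of the padding character, position 1 in the string is the first base, and the number of bases is one less than the string length.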

Counting base occurrences in large strings

Here's a somewhat standard approach to counting characters in a string.

def getStatsV1(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if (terC > len(sequence)):
        terC = terC - len(sequence) + 1
    total = { base: 0 for base in "ACGT" }
    forwardCount = { base: 0 for base in "ACGT" }
    reverseCount = { base: 0 for base in "ACGT" }
    # position 0 holds the "+" pad, so start at 1
    for position in range(1, len(sequence)):
        base = sequence[position]
        total[base] += 1
        if (terC > start):
            # forward half is the contiguous range [start, terC)
            if position >= start and position < terC:
                forwardCount[base] += 1
            else:
                reverseCount[base] += 1
        else:
            # forward half wraps around the end of the string
            if position >= start or position < terC:
                forwardCount[base] += 1
            else:
                reverseCount[base] += 1
    return {key: (total[key], forwardCount[key], reverseCount[key]) for key in total}

Another way to count

This version makes four passes, one for each base, but moves the dictionary overhead outside of the linear scan.

def getStatsV2(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if (terC > len(sequence)):
        terC = terC - len(sequence) + 1
    stats = {}
    for base in "ACGT":
        total = sequence.count(base)
        if (terC > start):
            # forward half is contiguous: count it, derive the rest
            forwardCount = sequence[start:terC].count(base)
            reverseCount = total - forwardCount
        else:
            # reverse half is contiguous: count it, derive the rest
            reverseCount = sequence[terC:start].count(base)
            forwardCount = total - reverseCount
        stats[base] = (total, forwardCount, reverseCount)
    return stats
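Since the two counting strategies are meant to give identical answers, a quick cross-check on a toy genome is a useful habit before timing anything. The sketch below re-implements both ideas in compact form; the helper names (half_bounds, count_by_loop, count_by_slices) are hypothetical and only the forward-half counts are compared.

```python
def half_bounds(length, start):
    """Return (start, terC) for a '+'-padded sequence of the given length."""
    terC = start + length // 2
    if terC > length:                  # wrap around the circular genome
        terC = terC - length + 1
    return start, terC

def count_by_loop(sequence, start):
    """Forward-half counts using one pass over every position."""
    start, terC = half_bounds(len(sequence), start)
    forward = {b: 0 for b in "ACGT"}
    for position in range(1, len(sequence)):     # skip the '+' at index 0
        if terC > start:
            in_forward = start <= position < terC
        else:
            in_forward = position >= start or position < terC
        if in_forward:
            forward[sequence[position]] += 1
    return forward

def count_by_slices(sequence, start):
    """Forward-half counts using str.count on slices."""
    start, terC = half_bounds(len(sequence), start)
    forward = {}
    for b in "ACGT":
        if terC > start:
            forward[b] = sequence[start:terC].count(b)
        else:
            forward[b] = sequence.count(b) - sequence[terC:start].count(b)
    return forward

toy = "+CATGGGCATCGGCCATACGCC"
agree = all(count_by_loop(toy, s) == count_by_slices(toy, s)
            for s in range(1, len(toy)))
print(agree)
print(count_by_slices(toy, 1))
```

Trying every possible start position exercises both the contiguous case and the wrap-around case.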

One more contender

The "numpy" library provides optimized routines for processing vectorized data. Our sequence can be considered a vector of bases.

import numpy

def getStatsV3(sequence, start):
    halflen = len(sequence) // 2
    terC = start + halflen
    # handle genome's circular nature
    if (terC > len(sequence)):
        terC = terC - len(sequence) + 1
    # view the sequence as a vector of ASCII codes
    genome = numpy.frombuffer(sequence.encode('ascii'), dtype="uint8")
    # minlength keeps every count vector the same length
    total = numpy.bincount(genome, minlength=256)
    if (terC > start):
        forwardCount = numpy.bincount(genome[start:terC], minlength=256)
        reverseCount = total - forwardCount
    else:
        reverseCount = numpy.bincount(genome[terC:start], minlength=256)
        forwardCount = total - reverseCount
    return {b: (total[ord(b)], forwardCount[ord(b)], reverseCount[ord(b)]) for b in "ACGT"}
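As a small illustration of the idea, the sketch below counts bases in a toy string with numpy.bincount. It assumes numpy is installed; the toy string and variable names are made up for this example. It also shows one detail worth knowing: without minlength, the result is only as long as the largest code seen, so counts taken from different slices may not line up.

```python
import numpy

seq = "GATTACA"
# each character becomes its ASCII code (e.g. 'A' -> 65)
codes = numpy.frombuffer(seq.encode("ascii"), dtype=numpy.uint8)

# one counter per possible byte value
counts = numpy.bincount(codes, minlength=256)
base_counts = {b: int(counts[ord(b)]) for b in "ACGT"}
print(base_counts)

# without minlength, the length depends on the largest code in the slice
short = numpy.bincount(codes[4:6])       # codes for "AC"
print(len(short), len(counts))
```

Indexing the count vector with ord(base) recovers the count for that base.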

Verify and time it

answer = getStatsV3(seq[0], oriCStart + oriOffset)

for base in "CGAT":
    total, forwardCount, reverseCount = answer[base]
    print("%s: %8d %8d %8d %8d" % (base, total, forwardCount, reverseCount, forwardCount - reverseCount))
print()

%timeit getStatsV2(seq[0], oriCStart + oriOffset)
%timeit getStatsV3(seq[0], oriCStart + oriOffset)

C:   427419   207901   219518   -11617
G:   413241   211607   201634     9973
A:   491488   247525   243963     3562
T:   491363   244723   246640    -1917

10 loops, best of 3: 36.8 ms per loop
100 loops, best of 3: 14.7 ms per loop

Counting with cumulative sums

We'll use a vectorized cumulative sum method to compute counts in the G-C skew genome-wide. Given an input vector, V, of length N, S = V.cumsum() returns:

$$S_i = \sum_{j=0}^{i} V_j$$

Cumulative sums can be used to compute counts over any interval. If a zero is prepended to S, so that $S_i$ holds the sum of the first i elements of V, then:

$$\mathrm{Count}[i, j) = S_j - S_i$$

Example:

v = numpy.array(numpy.random.random(20) < 0.25, dtype="int8")
# prepend 0 so that s[i] is the number of ones in v[0:i]
s = numpy.concatenate(([0], v.cumsum()))
print(v)
print(s)
# number of ones in v[5:15]
print(s[15] - s[5])

[0 1 0 0 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 2 3 3 4 5 6 7 7 7 7 7 7 7 7 7 7]
5
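Because the example above uses random input, it helps to repeat it with a fixed vector and compare against direct summation. The sketch below reuses the vector printed above and builds the prefix sums with the standard library instead of numpy; the helper name count is hypothetical.

```python
from itertools import accumulate

# the vector printed in the example above
v = [0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# prefix sums with a leading 0, so s[i] == sum(v[:i])
s = [0] + list(accumulate(v))

def count(i, j):
    """Number of ones in the half-open interval v[i:j]."""
    return s[j] - s[i]

print(count(5, 15))

# every interval agrees with summing the slice directly
consistent = all(count(i, j) == sum(v[i:j])
                 for i in range(len(v) + 1)
                 for j in range(i, len(v) + 1))
print(consistent)
```

The payoff is that after one linear pass to build s, each interval count costs a single subtraction regardless of the interval's length.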

Finding the genome-wide GC skew

def GCSkew(sequence):
    half = len(sequence) // 2
    full = len(sequence)
    # double the sequence so windows can wrap around the circular genome
    genome = numpy.frombuffer((sequence + sequence).encode('ascii'), dtype='uint8')
    # cumulative counts of C and G, each with a leading 0
    matchC = numpy.concatenate(([0], numpy.array(genome == ord('C'), dtype="int8").cumsum()))
    matchG = numpy.concatenate(([0], numpy.array(genome == ord('G'), dtype="int8").cumsum()))
    matchGC = matchG - matchC
    # (G - C) over the half genome ahead of each position,
    # minus (G - C) over the half genome behind it
    skew = matchGC[half:half + full] - matchGC[0:full] + matchGC[full - half:2 * full - half] - matchGC[full:2 * full]
    return skew
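One way to build confidence in the index arithmetic of GCSkew is to compare it against a direct, much slower loop. The sketch below is such a reference under the assumption that the skew at position i is (G minus C) over the half genome starting at i, minus (G minus C) over the half genome ending just before i, with circular wrap-around. The name gc_skew_bruteforce is hypothetical.

```python
def gc_skew_bruteforce(sequence):
    """Quadratic-time reference: recount both half-windows at every position."""
    full = len(sequence)
    half = full // 2

    def gc(ch):
        # +1 for G, -1 for C, 0 for anything else
        return (ch == 'G') - (ch == 'C')

    skew = []
    for i in range(full):
        # half-window starting at i, wrapping around the end
        ahead = sum(gc(sequence[(i + k) % full]) for k in range(half))
        # half-window ending just before i, wrapping around the start
        behind = sum(gc(sequence[(i - half + k) % full]) for k in range(half))
        skew.append(ahead - behind)
    return skew

print(gc_skew_bruteforce("GGCC"))
print(gc_skew_bruteforce("AATT"))
```

On "GGCC" the skew is largest where all the Gs lie ahead and all the Cs lie behind, and it flips sign half a genome later. A sequence with no G or C gives a flat zero skew.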

Let's test the function on the short sequence: CATGGGCATCGGCCATACGCC

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

test = "+CATGGGCATCGGCCATACGCC"
y = GCSkew(test)
plt.figure(num=None, figsize=(24, 8), dpi=100)
plt.ylim([-10, 10])
# label each tick with the base at that position
plt.xticks(range(len(test)), list(test))
result = plt.plot(range(len(y)), y)