ALGORITHMS AND DATA STRUCTURES IN BIOLOGY - GENOMICS
ALGORITHMS AND
DATA STRUCTURES
algorithm = finite sequence of unambiguous instructions; an algorithm is correct for a combinatorial problem if the steps it dictates solve the problem
pseudocode = a way of specifying algorithms that can easily be translated into concrete programming languages
PROBLEMS AND COMPLEXITY
combinatorial problem = unambiguous and precise problem concerning the production of some outputs from
some inputs
how to prove correctness =
- testing the algorithm = running it on sample inputs and checking the outputs (experimental methodology)
- proving the algorithm = mathematical proof that it does what it's supposed to do (analytical methodology)
TYPES OF PROBLEMS
recursive problems = base case (simplest case, solved directly, which ends the recursion) + recursive case (the function calls itself)
how to prove correctness =
- identify a property that can be helpful for the proof
- base of induction = show that the algorithm satisfies the property in the base case
- inductive hypothesis = assume that the algorithm satisfies the property for all the recursive cases up to the (n-1)-th one
- inductive step = show that the algorithm satisfies the property for the n-th recursive case, based on the assumption that the inductive hypothesis is correct
sorting problems = arranging the elements of an array in order
- selection sort = the array is scanned from position 0 to the end; at each position i the smallest element of the remaining (unsorted) part is found and swapped with the i-th element of the array
- merge sort = the array is divided in 2 halves that are sorted separately, then merged back together by adding numbers to a new list one by one, always choosing the smallest one (to find the smallest one we only need to check the first number of each sorted list, since it is the smallest of that list)
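The two sorting strategies above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are chosen here, and both functions return a new sorted list instead of sorting in place.

```python
def selection_sort(a):
    """Return a sorted copy of a using selection sort."""
    a = list(a)
    for i in range(len(a)):
        # index of the smallest element in the unsorted part a[i:]
        smallest = min(range(i, len(a)), key=a.__getitem__)
        a[i], a[smallest] = a[smallest], a[i]
    return a


def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller first element."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # one of the two lists is exhausted: append what is left of the other
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(a):
    """Return a sorted copy of a using merge sort."""
    if len(a) <= 1:  # base case: nothing to sort
        return list(a)
    mid = len(a) // 2
    return merge(merge_sort(a[:mid]), merge_sort(a[mid:]))
```

Selection sort makes about n^2/2 comparisons, while merge sort halves the input at every level of the recursion and does linear work per level.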
EXHAUSTIVE SEARCH ALGORITHMS
exhaustive search algorithms = generate all possible candidate solutions in a domain and search through them one by one to find the solution; high complexity (typically used for NP-hard problems) but correctness is easy to prove
- properties = the domain must be finite, must contain the solution and must be ordered so that it can be searched through
explore the search space (space of tuples) in a straightforward way → trees = arrangement of tuples
motif finding problem = find frequent subsequences showing little variance: given t sequences of length n, choose one l-nucleotide subsequence from each sequence so that the chosen subsequences differ as little as possible from a common pattern. The scoring system counts, position by position, the number of chosen l-subsequences having in that position a nucleotide matching the one in the consensus string. The output is an array of t starting positions s maximizing the score
complexity = O(l*t*n^t)
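A brute-force version of the motif search described above might look like the sketch below. It assumes that all sequences have the same length n and uses 0-based starting positions; the names score and brute_force_motif_search are chosen for this example.

```python
from collections import Counter
from itertools import product


def score(starts, dna, l):
    """Consensus score: for each column, the count of the most frequent nucleotide."""
    motifs = [seq[s:s + l] for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*motifs))


def brute_force_motif_search(dna, l):
    """Try every tuple of starting positions and keep the best-scoring one."""
    n = len(dna[0])  # assumption: all sequences have length n
    best_starts, best_score = None, -1
    for starts in product(range(n - l + 1), repeat=len(dna)):
        current = score(starts, dna, l)
        if current > best_score:
            best_starts, best_score = starts, current
    return best_starts, best_score
```

There are (n-l+1)^t tuples and each score costs O(l*t), which gives the O(l*t*n^t) bound.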
median string (less complex version of motif finding problem) = find the string of length l that minimizes the total distance from the t sequences
- given two strings of the same length u and v, their hamming distance is the number of positions at which they differ
- the total distance between a string u of length l and a txn matrix of DNA is the minimum, over all tuples s of starting positions, of the hamming distance between u and the l-subsequences starting at s (summed over the t sequences); complexity of the total distance = O(n*l*t)
- all possible strings of length l are generated and, if the total distance of a string is lower than the best distance, its total distance becomes the new best distance
complexity = O(n*l*t*4^l)
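The exhaustive search over all 4^l candidate words can be sketched as follows. The function names are chosen for this example; the minimum over all tuples of starting positions is computed sequence by sequence, which gives the same value.

```python
from itertools import product


def hamming(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def total_distance(word, dna):
    """Sum, over the sequences, of the distance to the closest l-subsequence."""
    l = len(word)
    return sum(
        min(hamming(word, seq[i:i + l]) for i in range(len(seq) - l + 1))
        for seq in dna
    )


def median_string(dna, l):
    """Generate every word from AA...A to TT...T and keep the closest one."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance
```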
restriction mapping = searching for the positions of restriction sites
complexity of the brute-force search = O(M^(n-2)), where M is the biggest distance and n the number of points
- build a table with the points of the line as rows and columns (ordered from smallest to biggest), compute for each box the difference (column - row) keeping only the positive values, and put the results into a list
- find the biggest number M in the list: it represents the furthest point from zero (add the points 0 and M)
- find the second biggest number y: it can be a distance measured from 0 or from M, so the new point is either y or M - y and we don't know for sure which one (branching point)
- repeat the previous step until the list is empty to find all the restriction points
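The branching procedure above can be sketched as a backtracking search. This is an illustrative version, not the notes' own code: the list of distances is held in a Counter (a multiset), the name partial_digest is chosen here, and only one solution is returned (its mirror image is equally valid).

```python
from collections import Counter


def partial_digest(lengths):
    """Return sorted points whose pairwise distances equal lengths, or None."""
    remaining = Counter(lengths)
    width = max(remaining)  # biggest distance: the points 0 and width
    remaining[width] -= 1
    points = [0, width]

    def place():
        if sum(remaining.values()) == 0:
            return sorted(points)
        y = max(d for d, count in remaining.items() if count > 0)
        # branching point: y is measured either from 0 or from width
        for candidate in (y, width - y):
            deltas = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= c for d, c in deltas.items()):
                remaining.subtract(deltas)
                points.append(candidate)
                solution = place()
                if solution is not None:
                    return solution
                points.pop()  # backtrack: undo the choice
                remaining.update(deltas)
        return None

    return place()
```

Calling partial_digest([2, 2, 3, 3, 4, 5, 6, 7, 8, 10]) returns a set of five points whose ten pairwise distances are exactly the input list.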
trees = arrangement of tuples (all leaves have the same height h and every node has either a fixed number k of children (branching factor) or no children at all)
the total number of leaves is k^h
search every leaf
skip from one vertex to another
search all the vertices of the tree (internal vertices and leaves)
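Visiting every leaf of such a tree amounts to enumerating all tuples of length h with entries from 1 to k. The sketch below uses a hypothetical next_leaf helper that works like an odometer; the names and the 1-based digits are assumptions made for the example.

```python
def next_leaf(leaf, k):
    """Return the leaf following `leaf`; wraps around to (1, ..., 1) after the last one."""
    leaf = list(leaf)
    for i in reversed(range(len(leaf))):
        if leaf[i] < k:
            leaf[i] += 1
            return tuple(leaf)
        leaf[i] = 1  # this digit overflows: reset it and carry to the left
    return tuple(leaf)


def all_leaves(h, k):
    """List the k**h leaves of a tree of height h and branching factor k."""
    first = (1,) * h
    leaf = first
    leaves = []
    while True:
        leaves.append(leaf)
        leaf = next_leaf(leaf, k)
        if leaf == first:  # wrapped around: every leaf has been visited
            return leaves
```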
GREEDY ALGORITHMS
greedy algorithms = make choices that are locally optimal and never questioned afterwards, in order to lower the complexity (they generate a possibly non-optimal solution in polynomial time)
approximation algorithm = gives an approximate (correct but not necessarily optimal) solution to an optimization problem
evaluate the quality of the solution = distance of the solution A(x) from the optimal solution, where OPT(x) = cost of the optimal solution for the input x
approximation ratio (AR) of an algorithm on instances of length n (the max is taken over all inputs x of length n) =
- maximization problem AR(n) ≥ max (OPT(x) / A(x))
- minimization problem AR(n) ≥ max (A(x) / OPT(x))
greedy motif search: complexity = O(l*n^2 + l*n*t)
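The complexity above belongs to the greedy version of motif search: the best pair of positions in the first two sequences is found exhaustively, then one sequence at a time is added without revisiting earlier choices. A minimal sketch, assuming equal-length sequences, at least two of them, and 0-based positions (the function names are chosen here):

```python
from collections import Counter


def partial_score(starts, dna, l):
    """Consensus score restricted to the first len(starts) sequences."""
    motifs = [seq[s:s + l] for s, seq in zip(starts, dna)]
    return sum(max(Counter(column).values()) for column in zip(*motifs))


def greedy_motif_search(dna, l):
    n = len(dna[0])  # assumption: all sequences have length n
    positions = range(n - l + 1)
    # step 1: best pair of starting positions in the first two sequences
    best_pair = max(
        ((s1, s2) for s1 in positions for s2 in positions),
        key=lambda s: partial_score(s, dna, l),
    )
    starts = list(best_pair)
    # step 2: fix one more sequence at a time (choices are never questioned)
    for _ in range(2, len(dna)):
        starts.append(
            max(positions, key=lambda s: partial_score(starts + [s], dna, l))
        )
    return tuple(starts)
```

The result is not guaranteed to be optimal: a poor choice in the first two sequences cannot be corrected later.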
sequence alignment = used to find the function of newly sequenced genes by comparing their sequence with similar genes of known function
- hamming distance = number of mismatches between two sequences, assuming that the i-th symbol of one sequence is aligned with the i-th symbol of the other
- edit distance = minimum number of editing operations (insertions, deletions, substitutions) transforming one string into the other
- build an alignment grid = matrix with the two sequences as rows and columns
- use a scoring function to assign weights to the edges depending on matches, mismatches or gaps, in order to evaluate each alignment
- generate a path where horizontal edges correspond to insertions, vertical edges correspond to deletions and diagonal edges correspond to substitutions or matches
- the best-scoring path is the optimal alignment between the two initial sequences
global sequence alignment = similarities between the entire strings
local sequence alignment = similarities between substrings
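As an illustration of the alignment grid, the edit distance can be computed by filling the grid cell by cell. The sketch below assumes unit cost for every insertion, deletion and substitution; the function name is chosen for the example.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between v[:i] and w[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i  # only vertical edges: i deletions
    for j in range(1, m + 1):
        d[0][j] = j  # only horizontal edges: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,  # vertical edge: deletion
                d[i][j - 1] + 1,  # horizontal edge: insertion
                d[i - 1][j - 1] + (v[i - 1] != w[j - 1]),  # diagonal edge
            )
    return d[n][m]
```

Unlike the hamming distance, this also works for strings of different lengths, because gaps are allowed.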
knapsack problem = there are various objects that we can choose from, but the weight that we can carry is limited; the algorithm chooses the maximum number of objects whose total weight stays within the limit (to be implemented!)
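For the variant stated above (maximize the number of objects, not their value), a greedy choice is enough: always take the lightest remaining object. The sketch below assumes non-negative weights; the name max_objects is chosen here. The classic knapsack problem with values attached to the objects needs a different technique.

```python
def max_objects(weights, capacity):
    """Return the indices of a largest set of objects whose total weight fits."""
    chosen, total = [], 0
    # visit the objects from the lightest to the heaviest
    for index in sorted(range(len(weights)), key=weights.__getitem__):
        if total + weights[index] > capacity:
            break  # every remaining object is at least as heavy
        chosen.append(index)
        total += weights[index]
    return chosen
```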
bin packing problem = pack items into the minimum number of boxes
takes as input an array of elements and an associated array with the sizes or weights of the elements, and returns an array of arrays containing the partitioning of the elements
first fit = add the elements one after the other: if an element fits in one of the boxes already opened we add it to the first such box, otherwise we put it in a new box
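A possible first-fit implementation with the input and output described above. The capacity parameter (size of each box) is an assumption added for the example, as are the function and variable names.

```python
def first_fit(elements, sizes, capacity):
    """Partition elements into boxes of the given capacity using first fit."""
    boxes, free = [], []  # free[b] = room left in boxes[b]
    for element, size in zip(elements, sizes):
        if size > capacity:
            raise ValueError(f"{element!r} does not fit in an empty box")
        for b, room in enumerate(free):
            if size <= room:  # first box with enough room
                boxes[b].append(element)
                free[b] -= size
                break
        else:  # no existing box can hold the element: open a new one
            boxes.append([element])
            free.append(capacity - size)
    return boxes
```

First fit is greedy: it never moves an element once placed, so the number of boxes used can be larger than the optimum.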
DIVIDE-AND-CONQUER
ALGORITHMS
divide-and-conquer algorithm = split the input into two (or more) parts, solve them separately and then combine the partial solutions
complexity = O(n*log(n)) (as for merge sort)
space-efficient sequence alignment:
- the algorithm builds the matrix up to the middle column and keeps an array containing, for each position i (index of the array and point in the matrix), the score of the best path from the initial point to the point i of the middle column
- then it computes the same array for the paths from the middle column to the end
- the two arrays are summed and the highest score corresponds to the best point i in the middle column
- therefore we can split the matrix into two parts at position i and repeat the procedure on each part, to find the new best middle points and reconstruct the best path
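The middle-point step can be sketched as follows. To keep the example short it assumes the simplest scoring function (1 for a match, 0 for mismatches and gaps, i.e. the longest common subsequence); the function names are chosen here, and only one column of the matrix is kept in memory at a time.

```python
def last_column_scores(v, w):
    """Best score of v[:i] against the whole of w, for every i, one column at a time."""
    column = [0] * (len(v) + 1)
    for ch in w:
        previous = column
        column = [0]
        for i, vc in enumerate(v, start=1):
            column.append(max(
                previous[i],  # gap in v
                column[i - 1],  # gap in w
                previous[i - 1] + (vc == ch),  # match or mismatch
            ))
    return column


def middle_vertex(v, w):
    """Return (i, score): the best point i of the middle column and the score through it."""
    mid = len(w) // 2
    prefix = last_column_scores(v, w[:mid])  # start -> middle column
    # middle column -> end, computed on the reversed strings
    suffix = last_column_scores(v[::-1], w[mid:][::-1])[::-1]
    totals = [p + s for p, s in zip(prefix, suffix)]
    best_i = max(range(len(totals)), key=totals.__getitem__)
    return best_i, totals[best_i]
```

The full alignment is then obtained by calling the same procedure on the two sub-matrices on either side of the middle point.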
questions
motif finding
formal definition = given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score; Score(s, DNA) is the sum, position by position, of the number of l-subsequences having in that position a nucleotide matching the one in the consensus string
input = t x n matrix of DNA (t = number of sequences, n = length of the sequences) and l, the length of the pattern
output = array of t starting positions s = (s1, ..., st) maximizing the score
used technique = greedy
complexity analysis = O(n^2*l + n*l*t)
def motif-finding(DNA, t, n, l):
    best_motif <- (1, ..., 1)
    s <- (1, ..., 1)
    for s1 <- 1 to n-l+1:
        for s2 <- 1 to n-l+1:
            if Score(s, 2, DNA) > Score(best_motif, 2, DNA):
                best_motif_1 <- s1
                best_motif_2 <- s2
    s1 <- best_motif_1
    s2 <- best_motif_2
    for i <- 3 to t:
        for si <- 1 to n-l+1:
            if Score(s, i, DNA) > Score(best_motif, i, DNA):
                best_motif_i <- si
        si <- best_motif_i
    return best_motif
median string
formal definition = find the string of l nucleotides with the minimum total distance from the sequences; the distance is the hamming distance between two sequences, i.e. the number of positions at which they differ
input = t x n matrix of DNA (t = number of sequences, n = length of the sequences) and l, the length of the pattern
output = a string word of l nucleotides that minimizes the total distance
used technique = exhaustive search
complexity analysis = O(n*l*t*4^l)
brute-force-median-string(DNA, t, n, l):
    best_word <- AA...A
    best_distance <- ∞
    for each l-mer word from AA...A to TT...T:
        if total_distance(word, DNA) < best_distance:
            best_distance <- total_distance(word, DNA)
            best_word <- word
    return best_word
brute-force-PDP(L, n):
    M <- maximum element of L
    for every set of n-2 integers 0 < x2 < ... < x(n-1) < M:
        X <- {0, x2, ..., x(n-1), M}
        if delta(X) = L:
            return X
    output "no solution"
place(L, X):
    ...
    if delta(width - y, X) is part of L:
        add width - y to X and remove lengths delta(width - y, X) from L
        place(L, X)
        remove width - y from X and add lengths delta(width - y, X) to L
    return
brute-force-motif-search(DNA, t, n, l):
    best_score <- 0
    for each s = (s1, ..., st) from (1, ..., 1) to (n-l+1, ..., n-l+1):
        if Score(s, DNA) > best_score:
            best_score <- Score(s, DNA)
            best_motif <- (s1, ..., st)
    return best_motif
complexity = O(t*l*n^t)
simple-reversal-sort(π):
    for i <- 1 to n-1:
        j <- position of element i in π
        if j ≠ i:
            π <- π·ρ(i, j)
            output π
        if π is the identity permutation:
            return
complexity = O(n^2)
improved-reversal-sort(π):
    while b(π) > 0:
        if π has a decreasing strip:
            among all reversals choose ρ minimizing b(π·ρ)
        else:
            choose a reversal ρ that flips an increasing strip in π
        π <- π·ρ
        output π
    return
complexity = O(n^3)
manhattan-tourist(Wi, Wj, n, m):
    S(0,0) <- 0
    for i <- 1 to n:
        S(i,0) <- S(i-1,0) + Wi(i,0)
    for j <- 1 to m:
        S(0,j) <- S(0,j-1) + Wj(0,j)
    for i <- 1 to n:
        for j <- 1 to m:
            S(i,j) <- max(S(i-1,j) + Wi(i,j), S(i,j-1) + Wj(i,j))
    return S(n,m)
complexity = O(n*m)
longest-path(G): (topological ordering of the vertices)
    vertices_left = V
    edges_left = E
    result = []
    while vertices_left ≠ ∅:
        top = vertices of vertices_left not having entering edges
        result.Append(top)
        vertices_left.Remove(top)
        edges_left.Remove(edges having an endpoint in top)
    return result
complexity = O(n + m)
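The Manhattan tourist recurrence above translates almost line by line into Python. In this sketch the two weight matrices are 0-based and indexed by the vertex an edge leaves from, which is a convention chosen for the example: down[i][j] is the weight of the edge from (i, j) to (i+1, j), right[i][j] the weight of the edge from (i, j) to (i, j+1).

```python
def manhattan_tourist(down, right):
    """Weight of the heaviest path from (0, 0) to (n, m) moving only down or right."""
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # first column: only vertical moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):  # first row: only horizontal moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j] + down[i - 1][j],  # arrive from above
                s[i][j - 1] + right[i][j - 1],  # arrive from the left
            )
    return s[n][m]
```

Every cell is computed once from its two predecessors, which gives the O(n*m) running time.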