ALGORITHMS AND DATA STRUCTURES IN BIOLOGY - GENOMICS, Outlines and concept maps for Information Systems

Concepts of algorithm and computational complexity: definition of algorithm, recursive and iterative algorithms, asymptotic notation. Exhaustive search algorithms: restriction mapping, motif finding. Greedy algorithms: sorting by reversals, approximation algorithms. Dynamic programming: edit distance, Manhattan distance. The divide-and-conquer technique.

Type: Outlines and concept maps

2024/2025

On sale since 25/02/2026

vivi-lerose


ALGORITHMS AND DATA STRUCTURES

algorithm = finite sequence of unambiguous instructions; an algorithm is correct for a combinatorial problem if the steps it dictates solve the problem for every input

pseudocode = a way of specifying algorithms that can easily be translated into concrete programming languages

PROBLEMS AND COMPLEXITY

combinatorial problem = unambiguous and precise problem concerning the production of some outputs from some inputs

how to prove correctness =

  • testing the algorithm = running it on inputs and checking the outputs it produces (experimental methodology)
  • proving the algorithm = mathematical proof that it does what it is supposed to do (analytical methodology)

TYPES OF PROBLEMS

recursive problems = base case (simplest case, solved directly, which ends the recursion) + recursive case (the function calls itself)

how to prove correctness =

  • identify a property of the algorithm that helps the proof
  • base of induction = show that the algorithm satisfies the property in the base case
  • inductive hypothesis = assume that the algorithm satisfies the property for all the recursive cases up to the (n-1)-th one
  • inductive step = show that the algorithm satisfies the property for the n-th recursive case, assuming that the inductive hypothesis holds

sorting problems = arranging the elements of an array in order

  • selection sort = the array is scanned from position 0 to the end; at each position i the smallest element of the unsorted part is found and swapped with the i-th element of the array
  • merge sort = the array is divided in 2 halves that are sorted separately and then merged back together by adding numbers to a new list one by one, always choosing the smallest one (to find the smallest one we only need to check the first number of each sorted list, since it is the smallest of that list)
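The two sorting strategies above can be sketched in Python as follows. This is a minimal illustration, not a tuned implementation; the function names `selection_sort`, `merge` and `merge_sort` are chosen here for the example.

```python
def selection_sort(values):
    """Return a sorted copy of `values` using selection sort."""
    a = list(values)
    for i in range(len(a)):
        # Find the index of the smallest element in the unsorted part a[i:].
        smallest = i
        for j in range(i + 1, len(a)):
            if a[j] < a[smallest]:
                smallest = j
        # Swap it with the i-th element.
        a[i], a[smallest] = a[smallest], a[i]
    return a


def merge(left, right):
    """Merge two sorted lists by repeatedly taking the smaller first element."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted: the rest of the other is already sorted.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(values):
    """Return a sorted copy of `values` using merge sort."""
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    return merge(merge_sort(values[:mid]), merge_sort(values[mid:]))
```

Selection sort performs a quadratic number of comparisons, while merge sort halves the input at each level of recursion and merges in linear time.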

EXHAUSTIVE SEARCH ALGORITHMS

exhaustive search algorithms = generate all possible candidate solutions in a domain and check them one by one to find the solution; high complexity (used for NP-hard problems) but correctness is easy to prove

  • properties = the domain must be finite, must contain the solution and must be ordered so that it can be searched through

explore the search space (space of tuples) in a straightforward way → trees = arrangement of tuples

motif finding problem = find frequent subsequences showing little variance: choose one l-mer from each sequence so that the chosen l-mers differ as little as possible from their consensus string. The score counts, position by position, the number of chosen l-mers having in that position a nucleotide matching the one in the consensus string; the output is an array of t starting positions s maximizing the score

complexity = O(t·l·n^t)
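A minimal Python sketch of the exhaustive motif search described above. It assumes that all sequences have the same length n and uses 0-based starting positions; the names `score` and `brute_force_motif_search` are illustrative.

```python
from itertools import product


def score(starts, dna, l):
    """Consensus score of the l-mers that begin at `starts`, one per sequence.

    For every column of the aligned l-mers, count the occurrences of the
    most frequent nucleotide (the consensus one) and sum over the columns.
    """
    motifs = [seq[s:s + l] for seq, s in zip(dna, starts)]
    total = 0
    for column in zip(*motifs):
        total += max(column.count(base) for base in "ACGT")
    return total


def brute_force_motif_search(dna, l):
    """Try all (n-l+1)^t tuples of starting positions and keep the best."""
    t = len(dna)
    n = len(dna[0])  # assumption: every sequence has length n
    best_score, best_starts = -1, None
    for starts in product(range(n - l + 1), repeat=t):
        current = score(starts, dna, l)
        if current > best_score:
            best_score, best_starts = current, starts
    return best_starts, best_score
```

For example, with `dna = ["AACGT", "CGTAA", "TCGTA"]` and `l = 3`, the only 3-mer shared by all three sequences is `CGT`, so the search returns the starting positions `(2, 0, 1)` with score 9.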

median string (less complex version of the motif finding problem) = find the string of length l that minimizes the total distance from the DNA sequences

  • Hamming distance = given two strings u and v of the same length, the number of positions at which they differ
  • the total distance between a string u and a t×n matrix of DNA is the minimum, over all tuples of starting positions s, of the sum of the Hamming distances between u and the l-mers starting at s; complexity of the total distance = O(n·l·t)
  • all possible strings of length l are generated and, if the total distance of a string is lower than the best distance, that total distance becomes the new best distance

complexity = O(n·l·t·4^l)
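The median string search above can be sketched in Python as follows. The function names are chosen for this example; the total distance is computed sequence by sequence, taking in each sequence the l-mer closest to the candidate word.

```python
from itertools import product


def hamming_distance(u, v):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(u, v))


def total_distance(word, dna):
    """Sum over the sequences of the distance to the closest l-mer."""
    l = len(word)
    total = 0
    for seq in dna:
        total += min(hamming_distance(word, seq[i:i + l])
                     for i in range(len(seq) - l + 1))
    return total


def median_string(dna, l):
    """Try all 4^l words of length l and keep the one with minimum total distance."""
    best_word, best_distance = None, float("inf")
    for letters in product("ACGT", repeat=l):
        word = "".join(letters)
        distance = total_distance(word, dna)
        if distance < best_distance:
            best_word, best_distance = word, distance
    return best_word, best_distance
```

The loop over `product("ACGT", repeat=l)` accounts for the 4^l factor in the complexity, and each call to `total_distance` costs O(n·l·t).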

restriction mapping = searching for the positions of restriction sites from the distances between them

complexity of the brute-force search = O(max^(n-2)), where max is the largest distance in the list

  • build a table with the points as rows and columns (ordered from smallest to biggest), compute the difference (column - row) for each box, keep only the positive values and put the results into a list
  • find the biggest number M in the list: it represents the furthest point from zero (add the points 0 and M)
  • find the second biggest number y: it can be a distance measured from 0 or from M, so the new point is either y or M - y and we do not know for sure which one (branching point)
  • repeat the previous step until the list is empty to find all the restriction points
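The branching procedure in the steps above can be sketched as a recursive search. This is an illustrative Python version under the assumption that the input is the complete multiset of pairwise distances; the names `distances` and `partial_digest` are chosen for the example. It is the branching strategy, not the brute-force search whose complexity is quoted above.

```python
from collections import Counter


def distances(y, points):
    """Multiset of distances between a candidate point y and the placed points."""
    return Counter(abs(y - x) for x in points)


def partial_digest(lengths):
    """Return every sorted point set whose pairwise distances equal `lengths`."""
    remaining = Counter(lengths)
    width = max(remaining)                      # furthest point from zero
    remaining = remaining - Counter([width])
    solutions = set()

    def place(remaining, points):
        if not remaining:                       # every distance is explained
            solutions.add(tuple(sorted(points)))
            return
        y = max(remaining)
        # Branching point: y is measured either from 0 or from width.
        for candidate in (y, width - y):
            needed = distances(candidate, points)
            # Counter subtraction drops non-positive counts, so an empty
            # result means that `needed` is contained in `remaining`.
            if not (needed - remaining):
                place(remaining - needed, points + [candidate])

    place(remaining, [0, width])
    return [list(s) for s in sorted(solutions)]
```

For the distance list `[2, 2, 3, 3, 4, 5, 6, 7, 8, 10]` the search finds the point set `[0, 2, 4, 7, 10]` and its mirror image `[0, 3, 6, 8, 10]`, which produce the same distances.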

trees = arrangement of tuples (all leaves have the same height h and every node has either a fixed number k of children (branching factor) or no children at all)

the total number of leaves is k^h

search every leaf

skip from one vertex to another

search all the tree (vertices and leaves)
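As a small illustration of searching every leaf, the sketch below represents each leaf as a tuple of h numbers between 1 and k and moves from one leaf to the next in lexicographic order. The representation and the names `next_leaf` and `all_leaves` are assumptions made for this example.

```python
def next_leaf(leaf, k):
    """Return the leaf that follows `leaf`; wraps around to (1, ..., 1)."""
    a = list(leaf)
    for i in reversed(range(len(a))):
        if a[i] < k:
            a[i] += 1          # advance the rightmost position that can grow
            return tuple(a)
        a[i] = 1               # reset this position and carry to the left
    return tuple(a)


def all_leaves(h, k):
    """List all k**h leaves of a tree with height h and branching factor k."""
    first = (1,) * h
    leaves = []
    leaf = first
    while True:
        leaves.append(leaf)
        leaf = next_leaf(leaf, k)
        if leaf == first:      # wrapped around: every leaf has been visited
            return leaves
```

With `h = 3` and `k = 2` the function visits 2^3 = 8 leaves, in agreement with the k^h count.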

greedy algorithms = make locally optimal choices that are never questioned afterwards, in order to lower the complexity (they produce a possibly non-optimal solution in polynomial time)

approximation algorithm = gives an approximate (correct but not necessarily optimal) solution to an optimization problem

evaluate the quality of the solution = distance of the solution from the optimal solution

OPT(input) = cost of the optimal solution; A(input) = cost of the solution returned by the algorithm

approximation ratio (AR) of an algorithm on instances x of length n =

  • maximization problem AR(n) ≥ max (OPT(x) / A(x))
  • minimization problem AR(n) ≥ max (A(x) / OPT(x))

GREEDY ALGORITHMS

complexity of greedy motif search = O(l·n^2 + l·n·t)

sequence alignment = used to find the function of newly sequenced genes by comparing their sequence with similar genes of known function

  • Hamming distance = number of mismatches between two sequences, assuming that the i-th symbol of one sequence is aligned with the i-th symbol of the other
  • edit distance = minimum number of editing operations (insertions, deletions, substitutions) that transform one string into the other
  • build an alignment grid = matrix with the two sequences as rows and columns
  • use a scoring function to assign weights to the edges depending on whether they represent matches, mismatches or gaps, in order to evaluate each alignment
  • generate a path where horizontal edges correspond to insertions, vertical edges correspond to deletions and diagonal edges correspond to substitutions or matches
  • the highest-scoring path is the optimal alignment between the two initial sequences
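The edit distance defined above can be computed on the alignment grid with dynamic programming. The sketch below is a minimal Python version that assumes unit cost for every insertion, deletion and substitution; the function name `edit_distance` is chosen for the example.

```python
def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    n, m = len(v), len(w)
    # d[i][j] = edit distance between the prefixes v[:i] and w[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete all i symbols of v[:i]
    for j in range(1, m + 1):
        d[0][j] = j                      # insert all j symbols of w[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if v[i - 1] == w[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # vertical edge
                          d[i][j - 1] + 1,                 # horizontal edge
                          d[i - 1][j - 1] + substitution)  # diagonal edge
    return d[n][m]
```

Each cell depends only on its three neighbours (above, left, diagonal), so the table is filled in O(n·m) time.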

global sequence alignment = similarities between the entire strings

local sequence alignment = similarities between substrings

knapsack problem = there are various objects that we can choose from, but the weight that we can carry is limited; the algorithm chooses the maximum number of objects that fits within a fixed total weight (to be programmed!)
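A possible Python sketch for the version of the problem stated above, where the goal is only to maximize the number of objects and objects have no values. Under that assumption a greedy choice of the lightest objects first is enough; the classic knapsack problem with values attached to the objects would need a different technique. The name `max_items` is illustrative.

```python
def max_items(weights, capacity):
    """Indices of a largest set of objects whose total weight fits in `capacity`."""
    chosen = []
    remaining = capacity
    # Greedy choice: always take the lightest object that is still available.
    for index in sorted(range(len(weights)), key=lambda i: weights[i]):
        if weights[index] > remaining:
            break              # every later object is at least as heavy
        chosen.append(index)
        remaining -= weights[index]
    return chosen
```

For example, `max_items([5, 1, 3, 4], 8)` selects the objects with weights 1, 3 and 4, that is the indices `[1, 2, 3]`.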

bin packing problem = pack items into the minimum number of boxes

takes as input an array of elements and an associated array with the sizes or weights of the elements, and returns an array of arrays containing the partitioning of the elements

first fit = add the elements one after the other: if an element fits in one of the previous boxes we add it there, otherwise we put it in a new box
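A minimal Python sketch of the first-fit strategy described above. The `capacity` parameter (the size of each box) is an assumption of this example, since the notes do not name it.

```python
def first_fit(elements, sizes, capacity):
    """Partition `elements` into boxes of size `capacity` with first fit."""
    boxes = []   # boxes[b] = list of the elements placed in box b
    free = []    # free[b]  = space still available in box b
    for element, size in zip(elements, sizes):
        if size > capacity:
            raise ValueError("element larger than a box")
        for b in range(len(boxes)):
            if size <= free[b]:          # first previous box with enough room
                boxes[b].append(element)
                free[b] -= size
                break
        else:                            # no previous box fits: open a new one
            boxes.append([element])
            free.append(capacity - size)
    return boxes
```

First fit is a greedy approximation: it is fast, but the number of boxes it uses is not guaranteed to be the minimum.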

DIVIDE-AND-CONQUER ALGORITHMS

divide-and-conquer algorithm = split the input into two (or more) parts, solve them separately and then combine the solutions

complexity = O(n*log(n))

applied to the best-path problem on a matrix, the algorithm computes for the middle column an array containing the score of the best path from the initial point to the middle point in position i (index of the array and row of the matrix)

then it computes the same array for the paths from the middle column to the end point

finally the two arrays are summed, and the highest sum corresponds to the best point i in the middle column

therefore we can split the matrix into two parts at position i and repeat the procedure on each part to find the new best middle points and reconstruct the best path
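The middle-point computation described above can be sketched as follows. As an assumption made only for this example, the score of a path is the number of matching symbols between two strings (longest common subsequence scoring); the names `lcs_column` and `middle_vertex` are illustrative.

```python
def lcs_column(v, w):
    """Scores of the best paths from the start to every row of the last column.

    result[i] = length of the longest common subsequence of v[:i] and w,
    computed column by column so that only two columns are kept in memory.
    """
    previous = [0] * (len(v) + 1)
    for symbol in w:
        current = [0] * (len(v) + 1)
        for i in range(1, len(v) + 1):
            if v[i - 1] == symbol:
                current[i] = previous[i - 1] + 1
            else:
                current[i] = max(previous[i], current[i - 1])
        previous = current
    return previous


def middle_vertex(v, w):
    """Row i of the middle column crossed by a best path, with the path score."""
    mid = len(w) // 2
    # Best scores from the initial point to each row of the middle column.
    prefix = lcs_column(v, w[:mid])
    # Best scores from each row of the middle column to the end point,
    # obtained by running the same computation on the reversed strings.
    suffix = lcs_column(v[::-1], w[mid:][::-1])[::-1]
    totals = [p + s for p, s in zip(prefix, suffix)]
    best_row = max(range(len(totals)), key=totals.__getitem__)
    return best_row, mid, totals[best_row]
```

Once the best row is known, the same function can be applied to the part of the matrix before the middle point and to the part after it, which is the divide-and-conquer step.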

questions

formal definition = given a set of DNA sequences, find a set of l-mers, one from each sequence, that maximizes the consensus score; Score(s, DNA) is the sum, position by position, of the number of l-subsequences having in that position a nucleotide matching the one in the consensus string

input = t×n matrix of DNA (t = number of sequences, n = length of the sequences) and the length l of the pattern

output = array of t starting positions s = (s1, ..., st) maximizing the score

used technique = greedy

complexity analysis = O(n^2·l + n·l·t)

def motif-finding(DNA, t, n, l):
    best-motif ← (1, ..., 1)
    s ← (1, ..., 1)
    for s1 ← 1 to n-l+1:
        for s2 ← 1 to n-l+1:
            if Score(s, 2, DNA) > Score(best-motif, 2, DNA):
                best-motif-1 ← s1
                best-motif-2 ← s2
    s1 ← best-motif-1
    s2 ← best-motif-2
    for i ← 3 to t:
        for si ← 1 to n-l+1:
            if Score(s, i, DNA) > Score(best-motif, i, DNA):
                best-motif-i ← si
        si ← best-motif-i
    return best-motif

formal definition = find the string of l nucleotides that minimizes the total distance from the DNA sequences, where the distance is the Hamming distance between two sequences, i.e. the number of positions at which they differ

input = t×n matrix of DNA (t = number of sequences, n = length of the sequences) and the length l of the pattern

output = a string word of l nucleotides that minimizes the total distance

used technique = exhaustive search

complexity analysis = O(n·l·t·4^l)

brute-force-median-string(DNA, t, n, l):
    best-word ← AA...A
    best-distance ← ∞
    for each l-mer word from AA...A to TT...T:
        if total-distance(word, DNA) < best-distance:
            best-distance ← total-distance(word, DNA)
            best-word ← word
    return best-word

def PDP(L):
    width ← maximum element in L
    remove width from L
    X ← {0, width}
    place(L, X)

place(L, X):
    if L is empty:
        output X
        return
    y ← maximum element in L
    if delta(y, X) is part of L:
        add y to X and remove lengths delta(y, X) from L
        place(L, X)
        remove y from X and add lengths delta(y, X) to L
    if delta(width - y, X) is part of L:
        add width - y to X and remove lengths delta(width - y, X) from L
        place(L, X)
        remove width - y from X and add lengths delta(width - y, X) to L
    return

brute-force-PDP(L, n):
    M ← maximum element of L
    for every set of n-2 integers 0 < x2 < ... < x(n-1) < M:
        X ← {0, x2, ..., x(n-1), M}
        form delta(X) from X
        if delta(X) = L:
            return X
    output "no solution"

brute-force-motif-search(DNA, t, n, l):
    best-score ← 0
    for each s = (s1, ..., st) from (1, ..., 1) to (n-l+1, ..., n-l+1):
        if Score(s, DNA) > best-score:
            best-score ← Score(s, DNA)
            best-motif ← (s1, ..., st)
    return best-motif

complexity = O(t·l·n^t)

Simple-Reversal-sort(π):
    for i ← 1 to n-1:
        j ← position of element i in π
        if j ≠ i:
            π ← π · ρ(i, j)
            output π
        if π is the identity permutation:
            return

complexity = O(n^2)

improved-Reversal-sort(π):
    while b(π) > 0:
        if π has a decreasing strip:
            among all reversals choose ρ minimizing b(π · ρ)
        else:
            choose a reversal ρ that flips an increasing strip in π
        π ← π · ρ
        output π
    return

complexity = O(n^3)

Manhattan-tourist(w↓, w→, n, m):
    s(0,0) ← 0
    for i ← 1 to n:
        s(i,0) ← s(i-1,0) + w↓(i,0)
    for j ← 1 to m:
        s(0,j) ← s(0,j-1) + w→(0,j)
    for i ← 1 to n:
        for j ← 1 to m:
            s(i,j) ← max(s(i-1,j) + w↓(i,j), s(i,j-1) + w→(i,j))
    return s(n,m)

complexity = O(n·m)

Longest-path(G):
    vertices-left ← V
    edges-left ← E
    result ← []
    while vertices-left is not empty:
        top ← a vertex of vertices-left not having entering edges
        result.Append(top)
        vertices-left.Remove(top)
        edges-left.Remove(edges having an endpoint in top)
    return result

complexity = O(n + m)
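The Manhattan tourist recurrence above translates almost line by line into Python. In this sketch the indexing convention is an assumption: `w_down[i][j]` is the weight of the edge entering point (i, j) from above and `w_right[i][j]` is the weight of the edge entering it from the left, so row 0 of `w_down` and column 0 of `w_right` are never read.

```python
def manhattan_tourist(w_down, w_right, n, m):
    """Weight of the best path from (0, 0) to (n, m) moving only down or right."""
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: the only way to reach (i, 0) is from above.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + w_down[i][0]
    # First row: the only way to reach (0, j) is from the left.
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + w_right[0][j]
    # Every other point is reached from above or from the left.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + w_down[i][j],
                          s[i][j - 1] + w_right[i][j])
    return s[n][m]
```

Each of the (n+1)·(m+1) points is computed once from at most two neighbours, which gives the O(n·m) running time.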