Bioinformatics exam
February 2009
Duration: 2 h (or 2 h 30) - No documents allowed
Part one (4 points)
1) Is the sequence below in FASTA format (justify your answer)? 1 pt
>tr|A0A098|A0A098_CHLRE
MASMAAELRPSDGGSSLHMLDSLLMMGLSSGGGVGGGGSSQSQILDSAGAAELAALLLPQ
HSNDPLHLMSTGDAALGLAGPMAAAEHHQHHPHHQHHSVPATAGFPSQTPPPPLFSNATA
GAAPATRVRAAGSCGSGGVAGGTTSHSSEDGVFHSADPHHHHQQHLQQPQPQQQQ
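As an illustration of what a program expects from this kind of record, here is a minimal sketch of a FASTA reader. It assumes only the usual convention of a header line starting with ">" followed by one or more sequence lines; the function name `parse_fasta` is hypothetical and the sketch does not validate the alphabet or the identifier syntax.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            if header is None:
                raise ValueError("sequence data found before any '>' header line")
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>tr|A0A098|A0A098_CHLRE
MASMAAELRPSDGGSSLHMLDSLLMMGLSSGGGVGGGGSSQSQILDSAGAAELAALLLPQ
HSNDPLHLMSTGDAALGLAGPMAAAEHHQHHPHHQHHSVPATAGFPSQTPPPPLFSNATA
GAAPATRVRAAGSCGSGGVAGGTTSHSSEDGVFHSADPHHHHQQHLQQPQPQQQQ
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header, len(sequence))
```

The reader only checks the line structure; whether a record is acceptable for a given tool is a separate question.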
2) Which problem will be encountered with this sequence in a similarity search? 1.5 pt
3) Define the GO database. 1.5 pt
Part two (7.5 points)
A multiple alignment of eukaryotic sequences is shown on page 2.
1) Which groups can you distinguish in this alignment? Give 2 discriminating residues for each group. 2 pt
2) Give a probable sequence error in this alignment. 1.5 pt
3) What is the homology relationship between zn143_human and znf76_human? Between zn143_human and q8ci27_mouse? Between zn143_human and znf76_mouse? 1.5 pt
4) A tree was built from this alignment using the neighbor-joining method. In this tree (page 3), 3 sequence identifiers have been replaced by x, y and z. 2.5 pt
a) Is this tree consistent with your analysis of the alignment? Justify your answer.
b) In your opinion, which sequences correspond to x, y and z respectively?
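Neighbor-joining works from a matrix of pairwise distances. As a rough illustration of how such a matrix could be derived from aligned sequences, the sketch below computes the p-distance (fraction of differing columns, skipping columns with a gap) on the first 24 alignment columns of three of the sequences. The helper names `p_distance` and `distance_matrix` are hypothetical, and the p-distance is an assumption made for the example: the exam does not say which distance was used to build the tree.

```python
from itertools import combinations


def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of differing positions among columns without a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # gapped columns are ignored in this sketch
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differing / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict {name: aligned sequence}."""
    return {
        (n1, n2): p_distance(aligned[n1], aligned[n2])
        for n1, n2 in combinations(aligned, 2)
    }


# First 24 alignment columns of three sequences from the alignment.
fragments = {
    "ZN143_HUMAN": "MEGVSLQAVTLADGSTAYIQHNSK",
    "Q6VQB0_FUGRU": "MDTVSLQAVTLADGSTAYIQHDSK",
    "ZNF76_HUMAN": "MESLGLHTVTLSDGTTAYVQQAVK",
}

for pair, distance in distance_matrix(fragments).items():
    print(pair, round(distance, 3))
```

A real analysis would use the full alignment and usually a corrected distance; this fragment only shows the principle.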
ZN143_HUMAN   MEGVSLQAVTLADGSTAYIQHNSK----DAKLIDGQVIQLEDGSAAYVQHVPIPKSTGDSLRLEDGQAVQLEDGTTAFIHHTSKDSYDQSALQAVQLEDGTTAYIHHAVQ
A6QQW0_BOVIN  MEGVSLQAVTLADGSTAYIQHNS-----------------KDGSAAYVQHVPIPKTTGDSLRLEDGQAVQLED------------SYDQSALQAVQLEDGTTAYIHHAVQ
Q8CI27_MOUSE  MEGVSLQAVTLADGSTAYIQHNSK----DGRLIDGQVIQLEDGSAAYVQHVPIPKS----------------------------NSYDQSSLQAVQLEDGTTAYIHHAVQ
Q6VQB0_FUGRU  MDTVSLQAVTLADGSTAYIQHDSKASFSDGQIMDGQVIQLEDGSAAYVQHVSMPKAGGDSLQLEDGQTVQLEDGTTAYIHTP-KETYDQSGLQEVQLEDGSTAYIQHTVH
A0AUQ7_DANRE  MDTVSLQAVTLVDGSTAYIQHSPKVSLTENKIMEGQVIQLEDGSAAYVQHLPMSKTGGEGLRLEDGQAVQLEDGTTAYTHAP-KETYDQGGLQAVQLEDGTTAYIQH---
Q4S173_TETNG  MDTVSLQAVTLADGSTAYIQHDSKASFPDGQIMDGQVIQLEDGSAAYVQHVSMPKAGGESLQLEDGQTVQLEDGTTAYIHAP-KETYDQSGLQEVQLEDGSTAYIQHTVH
Q6GPP5_XENLA  MESMSLQAVTLADGSTAYIQHNTK----DGKLMEGQVIQLEDGSAAYVQHIP----KGDDLSLEDGQAVQLEDGTTAYIHHSSKESYDQSSVQAVQLEDGTTAYIHHAVQ
ZNF76_MOUSE   MESLGLQTVRLSDGTTAYVQQAVK----GEKLLEGQVIQLEDGTTAYIHQVTI---QKESFSFEDGQPVQLEDGSMAYIHHTPKEGCDPSALEAVQLEDGSTAYIHHPVP
ZNF76_HUMAN   MESLGLHTVTLSDGTTAYVQQAVK----GEKLLEGQVIQLEDGTTAYIHQVTV---QKEALSFEDGQPVQLEDGSMAYIHRTPREGYDPSTLEAVQLEDGSTAYIHHPVA

ZN143_HUMAN   VPQSDTILAIQADGTVAGLHT-GDATIDPDTISALEQYAAKVSIDGSESVAGTGMIGENEQEKKMQIVLQGHATRVTAKSQQSGEKAFRCEYDGCG--
A6QQW0_BOVIN  VPQSDTILAIQADGTVAGLHT-GDAAIDPDTISALEQYAAKVSIDGSEGVTGSGIIGENEQEKKMQIVLQGHATRVTAKSQQSGEKAFRCGYDGCG--
Q8CI27_MOUSE  VPQSDTILAIQADGTVAGLHT-GDATIDPDTISALEQYAAKVSIDGSDGVTSTGMIGENEQEKKMQIVLQGHATRVTPKSQQSGEKAFRCKYDGCG--
Q6VQB0_FUGRU  MPQSNTILAIQADGTIADLQA-DATGLNPETISVLEQYATKVESIENQLG--SYSRAEADNGVHMRIVLQDQDNRQS-RSTNVGEKSFRCEYEGCG--
A0AUQ7_DANRE  MPQSNTILAIQADGTVADLQT-EGT-IDAETISVLEQYSTKMEATECGTG--LIGRGDSD-GVHMQIVLQGQDCRSP-RIQHVGEKAFRCEHEGCG--
Q4S173_TETNG  MPQSNTILAIQADGTIADLQA-DAAGLNPETISVLEQYATKVPLVSGLRLRLLWAGGEYRKPVGLLQPAGGGERRPH-ADCFTRSRQQAVAEHQCGRE
Q6GPP5_XENLA  VPQSDTILAIQADGTVAGLHT-GEASIDPDTITALEQYAAKVSIEGGEGAGSNALITESESEKKMQIVLS-HGSRVPVKVPQTNEKAFRCDYEGCG--
ZNF76_MOUSE   VPSDSAILAVQTEAGLEDLAAEDEEGFGTDTVVALEQYASKVLHDS--------------------------PASHNGKGQQVGDRAFRCGYKGCG--
ZNF76_HUMAN   VPSESTILAVQTEVGLEDLAAEDDEGFSADAVVALEQYASKVLHDS--------------------------QIPRNGKGQQVGDRAFRCGYKGCG--
DE   Homo sapiens MHC class I antigen (HLA-A) gene, HLA-A01 variant allele,
DE   alternatively spliced.
...
XX
FH   Key             Location/Qualifiers
FT   source          1..
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT   gene            <1..>
FT                   /gene="HLA-A"
FT                   /allele="HLA-A01 variant"
FT   mRNA            join(<1..373,504..773,1015..1266,1870..2145,2248..2364,
FT                   2807..2839,2982..3029,3199..>3374)
FT   exon            <1..
FT                   /number=
FT   5'UTR           <1..
FT                   /allele="HLA-A01 variant"
FT   CDS             join(301..373,504..773,1015..1266,1870..2145,2248..2364,
FT                   2807..2839,2982..3029,3199..3203)
FT                   /gene="HLA-A"
FT                   /product="MHC class I antigen"
FT                   /protein_id="AAW30165.1"
FT                   /translation="MAVMAPRTLLLLLSGALALTQTWAGSHSMRYFFTSVSRPGRGEPR
FT                   FIAVGYVDDTQFVRFDSDAASQKMEPRAPWIEQEGPEYWDQETRNMKAHSQTDRANLGT
FT                   LRGYYNQSEDGSHTIQIMYGCDVGPDGRFLRGYRQDAYDGKDYIALNEDLRSWTAADMA
FT                   AQITKRKWEAVHAAEQRRVYLEGRCVDGLRRYLENDPPKTHMTHHPISDHEATLRCWAL
FT                   GFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWAAVVVPSGEEQRYTCHVQHEG
FT                   LPKPLTLRWELSSQPTIPIVGIIAGLVLLGAVITGAVVAAVMWRRKSSDRKGGSYTQAA
FT                   SSDSAQGSDVSLTACKV"
FT   exon            504..
FT                   /number=
FT   exon            1015..
FT                   /number=
FT   variation       1268
FT                   /note="alternatively spliced compared to HLA-A010101;
FT                   results in altered exon and protein length; no membrane
FT                   expression detected"
FT                   /replace="g"
FT                   /gene="HLA-A"
FT   exon            1870..
FT                   /number=
FT   exon            2248..
FT                   /number=
FT   exon            2807..
FT                   /number=
FT   exon            2982..
FT                   /number=
FT   exon            3199..>
FT                   /number=
FT   3'UTR           3204..>
...
- A blastp search was performed with the HLA-A protein sequence (357 aa). This protein is similar to immunoglobulins, as shown by the alignments with the protein MUCM_RABIT. 3 pt
a) Draw a schematic representation of the 2 proteins, indicating the conserved regions.
b) Draw the result of a dot plot (dot matrix) comparison of the two proteins.
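For readers unfamiliar with the dot matrix method, the following is a minimal sketch of how a dot plot can be computed: a cell is marked when a window starting at position i of one sequence matches a window starting at position j of the other in at least a given number of positions. The function names, the window parameters and the toy sequences are all hypothetical; they are not taken from the exam data.

```python
def dotplot(seq_a, seq_b, window=1, min_matches=1):
    """Boolean matrix: cell [i][j] is True when the windows starting at
    seq_a[i] and seq_b[j] share at least `min_matches` identical positions."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


def render(matrix):
    """Text rendering: '*' for a dot, '.' for an empty cell."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


# Toy example (hypothetical): the motif ABCD occurs twice in the first
# sequence and once in the second, which gives two diagonal segments.
long_seq = "ABCDXYABCD"
short_seq = "ABCD"
plot = dotplot(long_seq, short_seq, window=2, min_matches=2)
print(render(plot))
```

A repeated region in one sequence shows up as several parallel diagonals against the same region of the other sequence, which is the kind of pattern a hand-drawn dot plot is meant to convey.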
>sp|P04221|MUCM_RABIT Ig mu chain C region membrane-bound form
Length = 479

Score = 45.1 bits (105), Expect = 3e-
Identities = 29/94 (30%), Positives = 49/94 (52%), Gaps = 11/94 (11%)

Query: 214 EATLRCWALGFYPAEITLTWQRDGED-----QTQDTELVETRPAGDGTFQKWAAVVVPSG 268
           ++ L C A GF P +I+++W RDG+       T+  E  ET+ AG  TF   + + +
Sbjct: 132 KSRLICQATGFSPKQISVSWLRDGQKVESGVLTKPVE-AETKGAGPATFSISSMLTITES 190

Query: 269 E---EQRYTCHVQHEGL--PKPLTLRWELSSQPT 297
           +   +  YTC V H G+   K +++  E S+ P+
Sbjct: 191 DWLSQSLYTCRVDHRGIFFDKNVSMSSECSTTPS 224

Score = 40.4 bits (93), Expect = 7e-
Identities = 24/81 (29%), Positives = 37/81 (45%), Gaps = 6/81 (7%)

Query: 215 ATLRCWALGFYPAEITLTWQRDGEDQTQD---TELVETRPAGDGTFQKWAAVVVPS---G 268
           AT+ C   GF PA++ + WQ+ G+  + D   T      P   G +   + + V
Sbjct: 352 ATVTCLVKGFSPADVFVQWQQRGQPLSSDKYVTSAPAPEPQAPGLYFTHSTLTVTEEDWN 411

Query: 269 EEQRYTCHVQHEGLPKPLTLR 289
             + +TC V HE LP  +T R
Sbjct: 412 SGETFTCVVGHEALPHMVTER 432
Score = 31.2 bits (69), Expect = 4e-
Identities = 23/85 (27%), Positives = 37/85 (43%), Gaps = 10/85 (11%)

Query: 219 CWALGFYPAEITLTWQRDGEDQTQDTELVETRPA---GDGTFQKWAAVVVPS-----GEE 270
           C A  F P+ +T +W      +      V T P    GD  +   + V+VPS     G E
Sbjct: 28  CLARDFLPSSVTFSWSFKNNSEIS-SRTVRTFPVVKRGD-KYMATSQVLVPSKDVLQGTE 85

Query: 271 QRYTCHVQHEGLPKPLTLRWELSSQ 295
           +   C VQH    + L + + + S+
Sbjct: 86  EYLVCKVQHSNSNRDLRVSFPVDSE 110
Part two (6 points)
A genomic region (982 bases, accession GQ2293385) of an H1N1 virus strain was compared to a nucleic acid sequence database with the fasta and blastn programs. One of the sequences detected is the synthetic sequence CS723756 (4700 bp). The GQ2293385 and CS723756 sequences were also aligned with the optimal alignment program Water. The alignments between these 2 sequences obtained with the three methods are shown below.
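Water is the EMBOSS implementation of the Smith-Waterman optimal local alignment. As background, here is a minimal sketch of that dynamic-programming algorithm. It is a simplification under stated assumptions: the scores (match = 2, mismatch = -1) and the linear gap penalty (-2) are arbitrary illustrative values, whereas Water uses a substitution matrix with separate gap opening and extension penalties. The function name and the toy sequences are hypothetical.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for the best local alignment.

    Simplified scoring: fixed match/mismatch values and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0 (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a cell with score 0 is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Toy sequences (hypothetical): only the ACG core is shared.
best_score, aligned_a, aligned_b = smith_waterman("TTTACGTTT", "GGACGGG")
print(best_score)
print(aligned_a)
print(aligned_b)
```

Because the algorithm explores every cell of the matrix, it is guaranteed to find the best-scoring local alignment for the chosen scoring scheme, unlike heuristic programs that only extend from seed matches.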
1) What can you deduce about the similarity between the 2 sequences? What are the main differences between the 3 alignments obtained? How do you explain them? 3 pt
2) Is Megablast suitable for this search? Why? 1 pt
3) Sketch the result of a dot plot (dot matrix) comparison of these two sequences. 2 pt
BlastN
>emb|CS723756.1| Sequence 14 from Patent WO
Length=
Score = 105 bits (116), Expect = 6e-
Identities = 121/163 (74%), Gaps = 0/163 (0%)
Strand=Plus/Plus

Query  818   GATCGTCtttttttCAAATGTATTTATCGTCGCTTTAAATACGGTTTGAAAAGAGGGCCT  877
             || || || || ||||| || || || ||  |  | || || ||  |||| ||||| |||
Sbjct  1503  GACCGGCTGTTCTTCAAGTGCATCTACCGGAGACTGAAGTATGGACTGAAGAGAGGACCT  1562

Query  878   TCTACGGAAGGAGTGCCTGAGTCCATGAGGGAAGAATATCAACAGGAACAGCAGAGTGCT  937
              | || |  ||||||||||| || ||| |||| || |||  ||||||||||||||| ||
Sbjct  1563  GCCACAGCCGGAGTGCCTGAATCTATGCGGGAGGAGTATAGACAGGAACAGCAGAGCGCC  1622

Query  938   GTGGATGTTGACGATGGTCATTTTGTCAACATAGAGCTAGAGT  980
             |||||||| || ||||| || || || || || ||||| ||||
Sbjct  1623  GTGGATGTGGATGATGGCCACTTCGTGAATATCGAGCTGGAGT  1665

Score = 48.2 bits (52), Expect = 0.
Identities = 32/36 (88%), Gaps = 0/36 (0%)
Strand=Plus/Plus

Query  716   CCTACCAGAAGCGAATGGGAGTGCAGATGCAGCGAT  751
             ||||||||||  || |||||||||||||| ||||||
Sbjct  1464  CCTACCAGAAATGAGTGGGAGTGCAGATGTAGCGAT  1499
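The "Identities" figure of an alignment can be recounted directly from the two aligned rows. The short sketch below does this for the second BlastN alignment above (query 716-751 against subject 1464-1499). The function name `count_identities` is hypothetical; the integer percentage is truncated rather than rounded, an assumption that is consistent with the values printed in this output.

```python
def count_identities(query_row, subject_row, gap="-"):
    """Return (identical columns, alignment length) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    identical = sum(
        1
        for q, s in zip(query_row.upper(), subject_row.upper())
        if q == s and q != gap
    )
    return identical, len(query_row)


query_row = "CCTACCAGAAGCGAATGGGAGTGCAGATGCAGCGAT"
subject_row = "CCTACCAGAAATGAGTGGGAGTGCAGATGTAGCGAT"

identical, length = count_identities(query_row, subject_row)
percent = 100 * identical // length  # truncated integer percentage
print(f"Identities = {identical}/{length} ({percent}%)")
# prints "Identities = 32/36 (88%)"
```

The comparison is case-insensitive because BLAST can print masked query regions in lower case, as in the first alignment.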
Fasta
>>EM_PAT:CS723756; CS723756 Sequence 14 from Patent WO20 (4700 nt)
initn: 378 init1: 238 opt: 549 Z-score: 279.9 bits: 65.5 E(): 7.6e-
58.0% identity (69.2% similar) in 357 nt overlap (631-982:1319-1667)
Sequen GAGGCCAUGGAGGUUGCUAAUCAGACUAGGCAGAUGGUACAUGCAAUGAGAACUAUUGGG
EM_PAT GCUGACAGACUAACAGACUGUUCCUUUCCAUGGGUCUUUUCUGCAGUCACCGUCGUCGAC
Sequen ACUCAU--CCUAGCUCCAGUGCUGGUCU-GAAAGAUGACCUUCUUG-AAAAUUUGCAGGC
EM_PAT ACGUGUGAUCAGAUAUCGCGGCCGCUCUAGAGAUAUCGCCACCAUGCAGUACAUCAAGGC
Sequen CUACCAGAAGCGAAU-GGGAGUGCAGAUGCAGCGAUUCAAGUGAUCCUCUCGUCAUUGCA
EM_PAT CAACAGCAAGUUUAUCGGCAUCACAGAGCUGUCUCUGCUGACAGAAGUGGAGAC-CCCUA
Sequen GCAAAUAUCAUUGGGAUCUUGCACCUGAUAUUGUGGAUUACUGAUCGUCUUUUUUUCAAA
EM_PAT CCAGAAAUGAGUGGGA--GUGCA---GAUGUAG-CGAUAGC-GACCGGCUGUUCUUCAAG
Sequen UGUAUUUAUCGUCGCUUUAAAUACGGUUUGAAAAGAGGGCCUUCUACGGAAGGAGUGCCU
EM_PAT UGCAUCUACCGGAGACUGAAGUAUGGACUGAAGAGAGGACCUGCCACAGCCGGAGUGCCU
Sequen GAGUCCAUGAGGGAAGAAUAUCAACAGGAACAGCAGAGUGCUGUGGAUGUUGACGAUGGU
EM_PAT GAAUCUAUGCGGGAGGAGUAUAGACAGGAACAGCAGAGCGCCGUGGAUGUGGAUGAUGGC
Sequen CAUUUUGUCAACAUAGAGCUAGAGUAA
EM_PAT CACUUCGUGAAUAUCGAGCUGGAGUGAACACGUGGGAUCCAGAUCUGCUGUGCCUUCUAG
c) What can you say about the similarity between the two proteins? 1.5 pt
FIRST iteration
>sp|Q57979.2|SURE_METJA RecName: Full=5'-nucleotidase surE; AltName: Full=Nucleoside 5'-monophosphate phosphohydrolase
Length=

Score = 58.9 bits (141), Expect = 3e-06, Method: Compositional matrix adjust.
Identities = 55/219 (25%), Positives = 97/219 (44%), Gaps = 45/219 (20%)

Query  1    MRVLITNDDGPLSDQFSPYIRPFIQHIKRNYPEWKITVCVPHVQKSWVGKAHLAGKNLTA  60
            M +LI NDDG     +SP +      +K  + +  IT+  P  Q+S +G+A
Sbjct  1    MEILIVNDDG----IYSPSLIALYNALKEKFSDANITIVAPTNQQSGIGRAI--------  48

Query  61   QFIYSKVDAEDNTFWGPFIQPQIRSENSKLPYVLNAEIPKDTIEWILIDGTPASCANIGL  120
                        + + P    +++             + KD + +  + GTP  C  +G+
Sbjct  49   ------------SLFEPLRMTKVK-------------LAKDIVGY-AVSGTPTDCVILGI  82

Query  121  HLLSNEPFDLVLSGPNVGRNTSAAYITSSGTVGGAMESVITGNTKAIAISWAYFN---GL  177
            + +  +  DLV+SG N+G N     I +SGT+G A E+   G  K+IA S    +
Sbjct  83   YQILKKVPDLVISGINIGENLGTE-IMTSGTLGAAFEAAHHG-AKSIASSLQITSDHLKF  140

Query  178  KNVS-PLLMEKASKRSLDVIKHLVKNWDPKTDLYSINIP  215
            K +  P+  E  +K +  + +  +  ++D   D+ +INIP
Sbjct  141  KELDIPINFEIPAKITAKIAEKYL-DYDMPCDVLNINIP  178
SECOND iteration
>sp|Q57979.2|SURE_METJA RecName: Full=5'-nucleotidase surE; AltName: Full=Nucleoside 5'-monophosphate phosphohydrolase
Length=

Score = 210 bits (534), Expect = 5e-52, Method: Composition-based stats.
Identities = 67/318 (21%), Positives = 119/318 (37%), Gaps = 71/318 (22%)

Query  1    MRVLITNDDGPLSDQFSPYIRPFIQHIKRNYPEWKITVCVPHVQKSWVGKAHLAGKNLTA  60
            M +LI NDDG     +SP +      +K  + +  IT+  P  Q+S +G+A    + L
Sbjct  1    MEILIVNDDGI----YSPSLIALYNALKEKFSDANITIVAPTNQQSGIGRAISLFEPLRM  56

Query  61   QFIYSKVDAEDNTFWGPFIQPQIRSENSKLPYVLNAEIPKDTIEWILIDGTPASCANIGL  120
              +    D                                  I    + GTP  C  +G+
Sbjct  57   TKVKLAKD----------------------------------IVGYAVSGTPTDCVILGI  82

Query  121  HLLSNEPFDLVLSGPNVGRNTSAAYITSSGTVGGAMESVITGN---TKAIAISWAYFNGL  177
            + +  +  DLV+SG N+G N     I +SGT+G A E+   G      ++ I+  +
Sbjct  83   YQILKKVPDLVISGINIGENLGTE-IMTSGTLGAAFEAAHHGAKSIASSLQITSDHLKFK  141

Query  178  KNVSPLLMEKASKRSLDVIKHLVKNWDPKTDLYSINIPLVESLSDDTKVYYAPIWENRWI  237
            +   P+  E  +K +  + +       P  D+ +INIP  E+ + +T +    +    +
Sbjct  142  ELDIPINFEIPAKITAKIAEKYLDYDMP-CDVLNINIP--ENATLETPIEITRLARKMYT  198

Query  238  PIFNGPHINLENSFAEIEDGNESSSISFNWAPKFGAHKDSIHYMDEYKDRTVLTDAEVI-  296
                            +E+  +    S+ W        D     +E +D    TD  V+
Sbjct  199  --------------THVEERIDPRGRSYYW-------IDGYPIFEEEED----TDVYVLR  233

Query  297  ESEMISVTPMKATFKGVN  314
            +   IS+TP+       N
Sbjct  234  KKRHISITPLTLDTTIKN  251