
Master in Bioinformatics January 9th, 2013
Universitat Autònoma de Barcelona
Module 2: Core Bioinformatics
FINAL EXAM SOLUTIONS
Question 1: Which statement does NOT apply to the FASTA format?
a. FASTA format can be used to store multiple sequences one after another in a single computer file.
b. In a FASTA file the definition line always starts with the greater-than (>) symbol, and usually no
constraints restrict its length.
c. Apart from the plain text file extension (.txt), there is no other file extension for a text file
containing FASTA-formatted sequences.
d. FASTA sequence lines can contain line breaks or paragraph marks <¶> at the end of each line.
Question 2: Search SWISS-PROT using the Sequence Retrieval System (SRS) and determine how many mouse
(Mus musculus) proteins inferred from homology are larger than 100 kDa.
a. No entries found
b. <10 entries
c. ≥10 and <100 entries
d. ≥100 entries
Explanation:
Browse the UniProtKB/Swiss-Prot database at the EBI-SRS using the combined search terms:
- Species: Mus musculus
- Protein existence: Inferred from homology
- MolWeight: >100000 Da
You will find 9 entries, so the right answer is b).
The numbers of occurrences of each dinucleotide in the genome of Dengue 1 virus are the following:
  aa   ac   ag   at   ca   cc   cg   ct   ga   gc   gg   gt   ta   tc   tg   tt
1108  720  890  708  901  523  261  555  976  500  787  507  440  497  832  529
Note that 1108 + 720 + … + 529 = 10734.
Moreover, for this virus the nucleotide frequencies are the following:
a          c          g          t
0.3191430  0.2086633  0.2580345  0.2141593
Question 3: We are specifically interested in the dinucleotides tg and cg. Are they over or under-represented?
a. Both are over-represented
b. tg is over-represented and cg is under-represented
c. tg is under-represented and cg is over-represented
d. Both are under-represented
Explanation: To answer the question we need to calculate ρ_tg and ρ_cg, the ratio of each dinucleotide's observed frequency to the frequency expected if its two nucleotides occurred independently. The dinucleotide frequencies are f_tg = 832/10734 and f_cg = 261/10734, and the nucleotide frequencies are f_c = 0.2086633, f_g = 0.2580345 and f_t = 0.2141593. Therefore ρ_tg = f_tg/(f_t × f_g) ≈ 1.403 and ρ_cg = f_cg/(f_c × f_g) ≈ 0.452. Consequently, tg is over-represented (ρ_tg > 1) and cg is under-represented (ρ_cg < 1), so the right answer is b).
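
The ratio used in this explanation can be checked with a few lines of Python. This is only a sketch: the counts and frequencies are copied from the exam text, while the names `counts`, `freq` and `rho` are chosen here for illustration.

```python
# Dinucleotide counts and nucleotide frequencies copied from the exam text.
counts = {
    "aa": 1108, "ac": 720, "ag": 890, "at": 708,
    "ca": 901, "cc": 523, "cg": 261, "ct": 555,
    "ga": 976, "gc": 500, "gg": 787, "gt": 507,
    "ta": 440, "tc": 497, "tg": 832, "tt": 529,
}
freq = {"a": 0.3191430, "c": 0.2086633, "g": 0.2580345, "t": 0.2141593}

total = sum(counts.values())  # total number of dinucleotides


def rho(dinucleotide):
    """Observed dinucleotide frequency divided by the product of its nucleotide frequencies."""
    first, second = dinucleotide
    return (counts[dinucleotide] / total) / (freq[first] * freq[second])


for pair in ("tg", "cg"):
    value = rho(pair)
    label = "over" if value > 1 else "under"
    print(f"rho_{pair} = {value:.3f} ({label}-represented)")
```

A value above 1 means the dinucleotide occurs more often than independence of the two positions would predict, and a value below 1 means it occurs less often.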

Question 4: Assuming that a Markov chain model is appropriate to describe the behavior of this DNA sequence, the estimated transition probability p_cg = P(X_n = g | X_{n-1} = c) is:

a. 0.2086633 × 0.2580345 ≈ 0.0538
b. 261/(1108 + 720 + … + 529) = 261/10734 ≈ 0.0243
c. 261/(901 + 523 + 261 + 555) = 261/2240 ≈ 0.1165
d. (261 + 500)/(1108 + 720 + … + 529) = 761/10734 ≈ 0.0709

Explanation : The transition probability to pass from c to g is calculated as the number of times that cg occurs (261) divided by the total number of occurrences of all the dinucleotides beginning with a c (ca, cc, cg, ct). This is just 261/(901+523+261+555), that is, the answer c).
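
As a small illustration of this estimate, the sketch below normalises the counts of the dinucleotides that start with c. The counts come from the exam table; the variable names are assumptions made for this example.

```python
# Counts of the four dinucleotides that start with c, from the exam table.
counts_from_c = {"ca": 901, "cc": 523, "cg": 261, "ct": 555}

# Number of times the chain is observed leaving state c.
row_total = sum(counts_from_c.values())

# Estimated transition probability from c to each next nucleotide:
# count of the dinucleotide divided by the count of all dinucleotides starting with c.
p_from_c = {pair[1]: n / row_total for pair, n in counts_from_c.items()}

print(f"p_cg = {counts_from_c['cg']}/{row_total} = {p_from_c['g']:.4f}")
```

The four estimated probabilities out of c sum to 1, which is what distinguishes a transition probability from the joint dinucleotide frequency in option b.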

Question 5: The most commonly used multiple aligners

a. Align all the sequences at once
b. Align the sequences 2 by 2 using a progressive approach
c. Add the sequences 1 after the other to a larger alignment
d. Add the sequences 3 by 3 using a progressive approach

Question 6: Given the following guide tree, which proposition is most compatible with the order in which the sequences could be aligned?

a. A with B, followed by C with D, followed by E with F…
b. A with B, followed by A with C, followed by A with D…
c. A with B, followed by C with D, followed by AB with CD, followed by ABCD with E…
d. Any of the above-mentioned alignment orders is compatible with the tree

been extensively used to find sequence motifs in RMF and there are several Transcription Factor Binding Site (TFBS) databases that were built on sets of PWMs (like TRANSFAC, JASPAR, ORegAnno). Hidden Markov Models (HMMs) have been used to find sequence content biases (like coding potential, codon periodicity, etc.), as well as signals (donor and acceptor splice sites), or a combination of both (as in the General-HMMs that define a gene model where each node can be a combination of signal and content sensors).

Comparative Genomics (CG) techniques can help both CGP and RMF, as they highlight regions that have been more conserved than expected, since evolution tends to keep what is important for a biological function. Finding protein-coding genes benefits from whole-genome comparison, as changes in the nucleotide sequence can be synonymous at the protein sequence level and amino acid substitutions can be better modelled with a scoring matrix (like those used by BLAST-like tools). Phylogenetic Footprinting (PF), based on the comparison of genomic sequences from species at an appropriate phylogenetic distance, can point out regions conserved due to a functional constraint; these may correspond to protein-coding features, but also to regulatory features and non-canonical genes (also known as non-coding RNAs, ncRNAs). When species are too closely related, so that their genomic sequences have not had enough time to diverge, we can still take advantage of the CG approach by applying Phylogenetic Shadowing (PS). Next-Generation Sequencing (NGS) methodologies provide a myriad of new experimental evidence to define gene structures, based on RNA-seq data, and to detect novel regulatory motifs, thanks to ChIP-seq approaches. However, both approaches require a reference genome to which sequencing reads are mapped in order to detect functional regions on that genome.

Question 10: The surname “Barbaluenga” is very rare in Spain. In fact, only a single family carries it. Mr and Mrs Barbaluenga have a single son (called Zifban Barbaluenga). Calculate the probability that this surname is lost in the next generation.

a. 25%
b. 37%
c. 63%
d. 75%

Explanation: Surnames are passed on to all male children by their father. One surname present in a single man is akin to a new mutation in a population. The surname will be lost in the next generation unless the man has a male child.

The population of Spain is large (in practical terms infinite) and constant in size (at least over the short time of a generation). Thus, the average number of children per family is 2 and the distribution of family size (k) can be described using the Poisson distribution (ppt presentation 2, slide 10). The probability that a child is a male is 1/2.

The probability of extinction is the probability of having k children, none of whom is male, summed over all k:

e^-2 (1) + 2 e^-2 (1/2) + (2^2/2!) e^-2 (1/4) + … + (2^k/k!) e^-2 (1/2)^k + … = e^-2 × e^1 = e^-1 ≈ 0.368 ≈ 37%, so the right answer is b).

(ppt presentation 2, slide 11)
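
The sum can also be evaluated numerically. The sketch below assumes, as the explanation does, a Poisson family size with mean 2 and a probability of 1/2 that a child is male; the function name and the cut-off `max_k` are choices made for this example.

```python
import math

MEAN_CHILDREN = 2.0  # Poisson mean family size, as stated in the explanation
P_MALE = 0.5         # probability that a child is male


def extinction_probability(max_k=60):
    """Sum P(k children) * P(none of the k children is male) for k = 0..max_k."""
    total = 0.0
    for k in range(max_k + 1):
        p_k = math.exp(-MEAN_CHILDREN) * MEAN_CHILDREN**k / math.factorial(k)
        total += p_k * (1 - P_MALE) ** k
    return total


p_lost = extinction_probability()
print(f"truncated sum: {p_lost:.4f}")
print(f"closed form e^-1: {math.exp(-1):.4f}")
```

The terms shrink factorially, so truncating the series at a few dozen terms already agrees with the closed form to many decimal places.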

Question 11: In the practical exercise, we calculated the (absolute) number of neutral mutations fixed in one million years in populations with four different sizes (N= 500, 10000, 500000, 10000000). Which of the following statements is true?

a. This number was identical in the four cases.
b. This number was higher in large populations than in small populations.
c. This number was higher in small populations than in large populations.
d. This number was higher in populations with intermediate size than in populations with extreme sizes.

Explanation: In a population of size N, the number of new neutral mutations arising per generation is 2Nμ, where μ stands for the rate of neutral mutation. The fixation probability of a single new neutral mutation is 1/(2N). Thus, the rate of neutral evolution (number of neutral mutations fixed per generation) is:

2Nμ × 1/(2N) = μ

and does not depend on population size. Therefore, the absolute number of neutral mutations fixed in 10^6 years (or generations) is identical in the four cases, and the right answer is a).
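
The cancellation of N can be checked for the four population sizes from the question. In this sketch the mutation rate `MU` is a hypothetical value chosen only for illustration (the exam does not give one), and exact fractions are used so that the comparison is not blurred by floating-point rounding.

```python
from fractions import Fraction

MU = Fraction(1, 10**8)  # hypothetical neutral mutation rate per generation (assumption)
GENERATIONS = 10**6      # the explanation treats one year as one generation


def fixed_neutral_mutations(n):
    """New neutral mutations per generation (2N*mu) times the fixation probability 1/(2N)."""
    new_per_generation = 2 * n * MU
    fixation_probability = Fraction(1, 2 * n)
    return new_per_generation * fixation_probability * GENERATIONS


for n in (500, 10_000, 500_000, 10_000_000):
    print(n, fixed_neutral_mutations(n))
```

Whatever value is assumed for the mutation rate, the result is the same for every N, because the population size enters once in the numerator and once in the denominator.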

Question 12: Indicate which of the following statements about the human genome is false:

a. The current assembly of the human genome contains more than 350 gaps
b. The human genome has ≈20,000 protein-coding genes and ≈13,000 long non-coding RNAs
c. Exons represent 3% of the human genome but only 1.2% of the genome is coding
d. The repetitive content of the human genome is 35%

Question 13: Which of the following statements about the functional elements found in the positions 73,031,000-73,087,000 of the human X chromosome is false?

a. There is a long non-coding intergenic RNA with several alternative transcripts
b. An active promoter has been found close to the transcription start site of the longest transcript produced from this genomic region
c. The transcription of this gene has been detected by RNA-Seq but no ESTs mapping to this region are currently known
d. The second exon of the longest transcript overlaps a transposable element insertion

Question 14: The observed nucleotide divergence between the sequences of two different species shows that there are 200 differences in 1000 studied positions (that is 20% observed divergence). Is this value a good estimate of the species divergence?

a. Yes, the observed divergence is proportional to the time of divergence by the following relation: Div_obs = 2sT, where s is the substitution rate and T is the time to the ancestor.
b. No, the observed divergence must be corrected first for possibly recurrent mutations using the relation Div_real = -3/4 × ln(1 - 4/3 × Div_obs)
c. No, the observed divergence must be corrected for possibly recurrent mutations using an appropriate evolutionary model.
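
For reference, the formula quoted in option b is the Jukes-Cantor correction, and it can be evaluated for the 20% observed divergence in the question. The sketch below only shows what that particular formula yields; it is not meant to indicate which option is correct, and the function name is an assumption made for this example.

```python
import math


def jc_corrected_divergence(div_obs):
    """Apply the relation quoted in option b: -3/4 * ln(1 - 4/3 * div_obs)."""
    if not 0 <= div_obs < 0.75:
        raise ValueError("the formula is only defined for 0 <= div_obs < 0.75")
    return -0.75 * math.log(1 - (4 / 3) * div_obs)


div_obs = 200 / 1000  # 200 differences in 1000 positions
div_corrected = jc_corrected_divergence(div_obs)
print(f"observed: {div_obs:.4f}, corrected: {div_corrected:.4f}")
```

The corrected value is larger than the observed one, since some positions are expected to have changed more than once.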

Question 15: A graphical representation of the relationships among several Drosophila species has been obtained for the gene Adh. A distance matrix was calculated using an appropriate evolutionary model and the method of neighbor-joining (NJ) for phylogenetic reconstruction was used. This phylogeny represents:

Question 19: Explain the main concepts that differentiate Quantum Mechanical and Molecular Mechanics approaches.

Solution:

QM:
- Explicit representation of electrons and nuclei
- Resolution of the Schrödinger equation
- Allows dealing with changes in chemical states
- Computationally demanding, so limited to small systems

MM:
- Parameters for atoms and their intra- and intermolecular interactions
- Uses the laws of classical mechanics
- Allows large conformational explorations
- Low computational cost, so applicable to large systems

Points: 1 for three or more concepts correctly mentioned; 0.75 for two concepts correctly mentioned; 0.5 for only one; 0 for none

Question 20: Could you briefly overview the limitations of geometry optimizations versus Molecular Dynamics?

Solution:

Geometry optimization only characterizes the most stable conformation of a molecule that is closest to the initial geometry (generally the most populated one). Because it locates a minimum on the potential energy surface, temperature is not considered and hence no dynamical effects are captured by these calculations. MD provides a trajectory of the molecule (its motions) and generates an ensemble of conformations that allows statistical analysis.

Points: 1 point if correctly described, 0.5 if approximate, 0.25 if the comparison is not made but one of the methods is well described, 0 if off topic

Question 21: Which region(s) of a protein are the most challenging to model using homology modeling approaches?

Solution:

The most challenging regions to model by HM are those with low sequence identity between target and template. Those are generally the loops and external regions of a protein (such as those involved in protein-protein interactions or allosteric regulation)

Points: 1 point if both terms correctly mentioned, 0.75 if only one, 0.5 if approximate, 0.25 if unclear "sounds like" answers, and 0 if off topic.

Question 22: A mass spectrometer gives us:

a. the mass of an analyte
b. the weight of an analyte
c. the mass-to-charge ratio of an analyte
d. the radiofrequency of an analyte

Question 23: Which sentence is NOT correct:

a. The electrospray ionization methods generate multiply charged states of an analyte
b. MALDI-TOF only generates mono-protonated ions
c. The electrospray ionization methods use solid samples
d. ESI and MALDI-TOF are the softest ionization methods

Question 24: Create a workflow using Galaxy that answers the following question: Which is the longest exon in chromosome 21 in humans? (note: use database hg19) (hint: Text Manipulation>Compute may help you)

Solution:

Any of the following exons in chr21 is correct:

chr21 34921781 34927697 uc002ysb.1_cds_2_0_chr21_34921782_f 0 + 5916.
chr21 34921781 34927697 uc002ysc.3_cds_2_0_chr21_34921782_f 0 + 5916.
chr21 34921781 34927697 uc002yse.1_cds_2_0_chr21_34921782_f 0 + 5916.

This exon in chr21 is also accepted:

chr21 40557403 40569341 uc021wjf.1_exon_0_0_chr21_40557404_r 0 - 11938.

or this exon in chr

chr19 9056172 9077865 uc002mkp.3_exon_81_0_chr19_9056173_r 0 - 21693.
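
Outside Galaxy, the same idea (compute end minus start for each row, then keep the largest value on the chromosome of interest) can be sketched in Python. The rows below are a subset copied from the solution above; the function name and the whitespace-separated parsing are assumptions made for this example, not the Galaxy workflow itself.

```python
# Rows copied from the solution (BED columns: chrom, start, end, name, score, strand).
bed_rows = """\
chr21 34921781 34927697 uc002ysb.1_cds_2_0_chr21_34921782_f 0 +
chr21 40557403 40569341 uc021wjf.1_exon_0_0_chr21_40557404_r 0 -
chr19 9056172 9077865 uc002mkp.3_exon_81_0_chr19_9056173_r 0 -
"""


def longest_feature(text, chrom):
    """Return (name, length) of the longest feature on chrom, with length = end - start."""
    best = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0] != chrom:
            continue  # skip malformed rows and other chromosomes
        length = int(fields[2]) - int(fields[1])
        if best is None or length > best[1]:
            best = (fields[3], length)
    return best


print(longest_feature(bed_rows, "chr21"))
```

Filtering on the chromosome before taking the maximum matters here, because the chr19 row is longer than either chr21 row.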

Question 25: Which web services have you used in the previous workflow?

Solution:

The tools in Galaxy are not strictly web services (but under alternative definitions of web services the data retrieval tools or the Galaxy platform as a whole could be considered web services)

Question 26: What are 'ID' and 'class' in HTML and the differences between them? Why are they so useful?

Solution:

ID and class are attributes of HTML elements. The main difference is that an ID must be unique within a page, whereas a class can be shared by many elements. Both are useful for referencing elements, for example to style them with CSS or to attach functionality with JavaScript.
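
The difference can be seen by counting how often each attribute value occurs in a page. The sketch below uses Python's standard `html.parser` module on a hypothetical HTML fragment written for this example; the class name `AttributeCounter` is also an assumption.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical page fragment: one unique id, one class shared by two elements.
PAGE = """
<h1 id="title">Exam</h1>
<p class="question">Question 1</p>
<p class="question">Question 2</p>
"""


class AttributeCounter(HTMLParser):
    """Count how often each id value and each class name appears in the parsed HTML."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()
        self.classes = Counter()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue  # attribute written without a value
            if name == "id":
                self.ids[value] += 1
            elif name == "class":
                # A class attribute may list several space-separated class names.
                self.classes.update(value.split())


counter = AttributeCounter()
counter.feed(PAGE)
print(dict(counter.ids))
print(dict(counter.classes))
```

In this fragment the id value appears once and the class name appears twice, which mirrors the rule that an ID identifies one element while a class groups several.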