




Three machine learning projects for the Special Course in Bioinformatics II. Students will implement and apply different methods for retention time prediction using kernel regression, metabolite identification from tandem mass spectra using CSI:FingerID, and drug-protein interaction prediction using multiple kernel learning regression. Prerequisites include programming skills in R, MATLAB, or Python, and basic knowledge of machine learning and linear algebra.





Instructor: Eric Bach ([email protected])
Background: In untargeted metabolomics studies, complex biological samples with possibly thousands of molecules are encountered. Tandem mass spectrometry (MS/MS) is a widely used technique for extracting patterns from biological samples in order to identify the molecules they contain. However, the sensitivity of a mass spectrometer depends on the ability to reduce the complexity of the biological sample, e.g. to prevent MS/MS spectra from representing more than one molecule. Liquid chromatography (LC) is a technique for such complexity reduction. If a properly prepared biological sample is passed through an LC column, the molecules in the sample interact differently with the column's stationary phase. As a result, the molecules separate over time according to their molecular properties: some pass through the column faster than others. The time at which a molecule leaves the column is called its retention time.
The retention time can serve as orthogonal information for metabolite identification, e.g. it can exclude molecular candidates that are expected to have a different retention time [Aic+15] or make it possible to distinguish diastereoisomers [SNV15]. Unfortunately, retention time measurements are available for only a small number of molecules and are not comparable between different chromatographic systems. On the other hand, the set of molecular candidates for the identification of a single molecule (given its MS/MS spectrum) can contain thousands of molecules. Therefore, machine learning algorithms have been applied to predict retention times from the structure of a molecule [Aic+15; Fal+16].
Goal: In this project the student will implement and apply two different kernelized regression approaches to predict the retention time of molecules given their structure.
Methods and materials: For the project the student will be provided with a data set containing retention time measurements for 596 molecules. The molecular descriptors and fingerprints will be given to the student. The student will implement Kernel Ridge Regression (KRR) and Magnitude-Preserving Kernel Regression (MPKR) [CMR07], and apply both approaches to predict the retention times for the molecular structures in the data set. The student will then compare the performance of KRR and MPKR and investigate whether the magnitude-preserving error term leads to better retention time prediction.
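The two regressors described above can be sketched on a precomputed kernel matrix as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: the function names, the Gaussian kernel, and the synthetic data are illustrative, and the pairwise-difference objective in `fit_pairwise` is one reading of the magnitude-preserving idea whose exact form and normalisation should be checked against [CMR07].

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_krr(K, y, lam):
    """Kernel ridge regression: minimise lam * a^T K a + ||K a - y||^2,
    which is solved by (K + lam * I) a = y."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def fit_pairwise(K, y, lam):
    """Pairwise-difference variant (an assumed reading of the MPKR idea):
    minimise lam * a^T K a + 0.5 * sum_ij ((f_i - f_j) - (y_i - y_j))^2
    with f = K a. The double sum equals 2 * (f - y)^T L (f - y) for
    L = m * I - 1 1^T, so a stationary point satisfies (L K + lam * I) a = L y."""
    m = K.shape[0]
    L = m * np.eye(m) - np.ones((m, m))
    return np.linalg.solve(L @ K + lam * np.eye(m), L @ y)

def predict(K_new_train, alpha):
    """Predict from the kernel between new points (rows) and training points."""
    return K_new_train @ alpha

# Toy usage on synthetic data (not the 596-molecule data set).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=30)
K = rbf_kernel(X, X, gamma=0.1)
alpha_krr = fit_krr(K, y, lam=1.0)
alpha_pairwise = fit_pairwise(K, y, lam=1.0)
```

Because the pairwise loss only constrains differences between predictions, and the matrix L scales with the number of training points, the regularisation parameter of the two models is not directly comparable and would need to be tuned separately for each.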
Prerequisite: Basic knowledge of machine learning (especially kernel methods) and parameter estimation (i.e. cross-validation), linear algebra, and programming skills in R, MATLAB or Python. Some basic knowledge of molecular biology and chemoinformatics is beneficial.
References
[Aic+15] Fabian Aicheler et al. "Retention Time Prediction Improves Identification in Nontargeted Lipidomics Approaches". In: Analytical Chemistry 87.15 (2015), pp. 7698–7704.
[CMR07] Corinna Cortes et al. "Magnitude-preserving Ranking Algorithms". In: Proceedings of the 24th International Conference on Machine Learning. ICML '07. ACM, 2007. url: http://doi.acm.org/10.1145/1273496.1273518.
[Fal+16] Federico Falchi et al. "Kernel-Based, Partial Least Squares Quantitative Structure-Retention Relationship Model for UPLC Retention Time Prediction: A Useful Tool for Metabolite Identification". In: Analytical Chemistry (2016).
[SNV15] Jan Stanstrup et al. "PredRet: Prediction of Retention Time by Direct Mapping between Multiple Chromatographic Systems". In: Analytical Chemistry 87.18 (2015). PMID: 26289378, pp. 9421–9428. url: http://dx.doi.org/10.1021/acs.analchem.5b02287.
multiple kernel learning on fragmentation trees. Bioinformatics, 30(12):i157-i164.
[3] Dührkop, K., Shen, H., Meusel, M., Rousu, J., and Böcker, S. (2015). Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proceedings of the National Academy of Sciences, 112(41):12580-12585.
3 Multiple kernel learning for drug-protein interaction prediction
Instructor: Anna Cichonska ([email protected])
Background: Drug-like chemical compounds exert their actions mainly by modulating cellular targets, such as proteins. Experimental determination of interactions between chemical compounds and protein targets is time-consuming and expensive. In recent years, therefore, a lot of effort has gone into developing computational methods that could provide fast, large-scale and systematic pre-screening of chemical probes. In particular, a lot of work has been devoted to compound-based interaction prediction methods, including quantitative structure-activity relationship (QSAR) models, which aim to relate structural properties of chemical molecules to their bioactivity profiles. Another class of computational methods, so-called target-based methods, focuses on evaluating similarities between amino acid sequences or three-dimensional structures of protein targets. In these supervised learning approaches, models are trained using available bioactivity data together with either compound or protein information, which then allows predicting either new targets of a given drug or new drugs targeting a given protein.
As a more recent class of computational modelling approaches, systems-based frameworks take advantage of the information available on both compounds and proteins. A key assumption is that similar drug compounds interact with similar proteins. A proper representation and use of similarities, which is equivalent to a kernel choice, is therefore a first critical prerequisite for achieving high-quality drug-protein interaction (DPI) predictions. Classical kernel-based methods rely on a single kernel. However, such approaches are unlikely to be optimal when a growing variety of biological and molecular data sources becomes available simultaneously. Multiple kernel learning (MKL) methods search for an optimal combination of several kernels, which makes it possible to use different information sources simultaneously and to learn their importance for the prediction task. These methods are therefore receiving increasing attention.
Typically, a binary-valued DPI prediction setup is employed. However, molecular interactions are not simple on-off relationships, and predicting real-valued binding affinities is more appealing.
Goal: The goal of the project is to compute several protein kernels as well as drug kernels, and then use them in an MKL regression framework to predict drug-protein binding affinities.
Materials and Methods: The data set consists of 50 drug compounds and 50 protein targets, which is a subset of the data from the experimental study of Metz et al. (2011). DPIs are represented as real values reflecting how tightly a compound binds to a protein. The student will calculate Tanimoto kernels for drug compounds based on several fingerprints implemented in the ChemmineR R package. For proteins, the Smith-Waterman amino acid sequence alignment as well as the Generic String kernel will be adopted. The student can also choose to compute other molecular descriptors. Then, pairwise kernels that directly relate drug-protein pairs will be constructed by taking the Kronecker product of each pair of drug kernel and protein kernel. The student will use the pairwise kernels with the two-stage MKL algorithm ALIGNF. In the first stage, kernel mixture weights are determined by maximising the centred alignment, i.e. a matrix similarity measure, between the combined kernel and the ideal, so-called target kernel derived from the label values. In the second stage, the combined kernel is used with Kernel Ridge Regression (KRR) as the prediction model. The student will be provided with a script for calculating the kernel mixture weights (first stage) but should implement KRR (second stage). The UNIMKL algorithm will form a baseline model, in which all kernel mixture weights are equal to 1/P, P being the number of input kernels. The student will implement nested cross-validation to tune the regularisation parameter λ of KRR and assess the predictive performance of the model.
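The kernel construction steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the random toy data are hypothetical, the protein kernel is a random positive semidefinite placeholder rather than a Smith-Waterman or Generic String kernel, and the ALIGNF weight optimisation itself is not shown because it is covered by the provided script.

```python
import numpy as np

def tanimoto_kernel(F):
    """Tanimoto similarity between binary fingerprint rows of F (n x d)."""
    F = np.asarray(F, dtype=float)
    inter = F @ F.T                      # bits set in both fingerprints
    counts = F.sum(axis=1)
    union = counts[:, None] + counts[None, :] - inter
    # Assumed convention: two all-zero fingerprints get similarity 1.
    return np.divide(inter, union, out=np.ones_like(inter), where=union > 0)

def pairwise_kernel(K_drug, K_prot):
    """Kronecker product kernel over (drug, protein) pairs.
    Row i * n_prot + j corresponds to the pair (drug i, protein j),
    matching Y.ravel() for an affinity matrix Y of shape (n_drug, n_prot)."""
    return np.kron(K_drug, K_prot)

def center(K):
    """Centre a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def centered_alignment(K1, K2):
    """Centred alignment: normalised Frobenius inner product of centred kernels."""
    K1c, K2c = center(K1), center(K2)
    return np.sum(K1c * K2c) / (np.linalg.norm(K1c) * np.linalg.norm(K2c))

def uniform_combination(kernels):
    """UNIMKL-style baseline: every kernel gets weight 1/P."""
    return sum(kernels) / len(kernels)

# Toy usage with random data (not the Metz et al. subset).
rng = np.random.default_rng(0)
drug_kernels = [tanimoto_kernel(rng.integers(0, 2, size=(4, 16))) for _ in range(2)]
A = rng.normal(size=(3, 5))
K_prot = A @ A.T                         # placeholder protein kernel
Y = rng.normal(size=(4, 3))              # toy binding affinities
y = Y.ravel()
K_target = np.outer(y, y)                # target kernel derived from the labels
pair_kernels = [pairwise_kernel(Kd, K_prot) for Kd in drug_kernels]
alignments = [centered_alignment(Kp, K_target) for Kp in pair_kernels]
K_uniform = uniform_combination(pair_kernels)
```

For 50 drugs and 50 proteins each pairwise kernel is a 2500 x 2500 matrix, which is small enough to form explicitly with `np.kron`.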
Prerequisite: Programming skills (MATLAB, R, Python), basic knowledge of machine learning. Some knowledge of chemoinformatics will be beneficial.
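The nested cross-validation mentioned in Materials and Methods can be sketched on a precomputed kernel matrix as below. This is a minimal sketch, assuming random folds over samples and mean squared error as the score; the function names, fold counts, and λ grid are illustrative choices, not requirements of the project.

```python
import numpy as np

def krr_predict(K, y, train, test, lam):
    """Fit KRR on the train indices and predict the test indices."""
    K_tr = K[np.ix_(train, train)]
    alpha = np.linalg.solve(K_tr + lam * np.eye(len(train)), y[train])
    return K[np.ix_(test, train)] @ alpha

def split_folds(idx, k, rng):
    """Shuffle idx and split it into k roughly equal folds."""
    return np.array_split(rng.permutation(idx), k)

def cv_error(K, y, folds, lam):
    """Mean squared error of KRR averaged over the given folds."""
    errors = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = krr_predict(K, y, train, test, lam)
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))

def nested_cv(K, y, lams, outer_k=5, inner_k=3, seed=0):
    """Outer loop estimates performance; inner loop selects lam per outer fold."""
    rng = np.random.default_rng(seed)
    outer = split_folds(np.arange(len(y)), outer_k, rng)
    outer_errors, chosen = [], []
    for i, test in enumerate(outer):
        train = np.concatenate([f for j, f in enumerate(outer) if j != i])
        # The same inner folds are reused for every candidate lam.
        inner = split_folds(train, inner_k, rng)
        scores = [cv_error(K, y, inner, lam) for lam in lams]
        best = lams[int(np.argmin(scores))]
        pred = krr_predict(K, y, train, test, best)
        outer_errors.append(np.mean((pred - y[test]) ** 2))
        chosen.append(best)
    return float(np.mean(outer_errors)), chosen
```

With pairwise drug-protein kernels, random folds over pairs mean that each test drug and test protein usually also appears in some training pair; evaluating predictions for entirely new drugs or proteins would require folds built over drugs or proteins instead.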
References
[1] Ding H, Takigawa I, Mamitsuka H, Zhu S. Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Briefings in Bioinformatics 2014; 15(5): 734–47.
[2] Cichonska A, Rousu J, Aittokallio T. Identification of drug candidates and repurposing opportunities through compound-target interaction networks. Expert Opinion on Drug Discovery 2015; 10(12): 1333–45.
[3] Yamanishi Y, Araki M, Gutteridge A, Honda W, Kanehisa M. Prediction of drug-target interaction networks from the integration of chemical and genomic spaces. Bioinformatics 2008; 24(13): i232–40.
[4] Giguere S, Marchand M, Laviolette F, Drouin A, Corbeil J. Learning a peptide-protein binding affinity predictor with kernel ridge regression. BMC Bioinformatics 2013; 14(1): 82.
[5] Cortes C, Mohri M, Rostamizadeh A. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research 2012; 13(Mar): 795–828.
[6] Metz JT, Johnson EF, Soni NB et al. Navigating the kinome. Nature Chemical Biology 2011; 7(4): 200–2.
of non-stationary GPs. The goal is to analyse which gene expression time series warrant a non-stationary GP, and to analyse the model improvement and runtime effects of adding non-stationarity [see Tolvanen et al. 2014].
More detailed instructions will be available from the instructor.
Required background knowledge/skills: Programming skills (MATLAB, R, Python), basic statistics, basic Bayesian statistics and machine learning. Some knowledge of biology will be useful.
References
5 Learning molecular representation with an autoencoder
Instructor: Huibin Shen ([email protected])
Background: Current representations of molecules include binary vector representations such as molecular fingerprints, string representations such as InChI or SMILES, and 2D/3D graphs. Many applications involving molecules build on one of these representations.
Learning better representations of data is at the core of deep learning. Today's compound databases contain molecules on the scale of millions. With deep learning, it becomes possible to learn a compact and continuous vector representation from such data.
Goal: In this project, we will use a variational autoencoder to learn such a representation and test it in a metabolite identification pipeline. We will first test the autoencoder on a subset of 5M molecules with either a fingerprint representation or a SMILES string representation. The code and data are already available. The student will run the code on GPU nodes on Triton.
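For orientation, the two ingredients that distinguish a variational autoencoder from a plain autoencoder can be sketched in a few lines of NumPy. This is not the project's code, which is provided separately; the function names are hypothetical, and the sketch assumes a diagonal Gaussian encoder with a standard normal prior.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so that the sample
    is a deterministic function of the encoder outputs mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, diag(exp(log_var))) and N(0, I),
    summed over the latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=-1)

# Toy usage: a batch of 4 molecules encoded into an 8-dimensional latent space.
rng = np.random.default_rng(0)
mu = rng.normal(size=(4, 8))
log_var = rng.normal(size=(4, 8))
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

In training, the KL term is added to a reconstruction loss on the decoded fingerprint or SMILES string; the latent vector `z` (or `mu`) then serves as the continuous molecular representation.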
Prerequisite: Python and basic knowledge of machine learning and deep learning.
References
[1] Gómez-Bombarelli, R., Duvenaud, D., Hernández-Lobato, J. M., Aguilera-Iparraguirre, J., Hirzel, T. D., Adams, R. P., and Aspuru-Guzik, A. (2016). Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415.
[2] Kalchbrenner, N., Grefenstette, E., and Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.
[3] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.