Module III: Data Analysis, Study notes of Data Analysis & Statistical Methods

Introduction to Biological Data Types: Genomic, transcriptomic, proteomic, and clinical data. Structured vs. unstructured data. Data Pre-processing and Quality Control: Missing values, normalization, scaling, Outlier detection. Exploratory Data

Typology: Study notes

2025/2026

Available from 01/05/2026

mitsuha-miyamizu-5 🇮🇳

MSc Bioinformatics
Research methodology and ethics (MMB10400)
1st Year, Sem I, 2025-26
Dr. Priyanka Sen Guha, Assistant Professor, Department of Biotechnology, Brainware University.

Study Material (Research methodology and ethics - MMB10400)
MODULE 3

Table of contents
  • Visualizations
  • Dimensionality Reduction
  • Clustering and Pattern Discovery
  • Classification vs. regression
  • Question and Answer

Visualizations and Dimensionality Reduction in Data Analysis

1. Introduction

Data visualization is a crucial step in exploratory data analysis (EDA) because it allows researchers to see patterns, detect anomalies, identify trends, and communicate results effectively. Different visualization techniques provide different perspectives on data distribution, variability, and relationships among variables. For biological data, which are often complex and high-dimensional, visualization helps simplify interpretation. Common visualization tools include boxplots, histograms, and scatterplots, each serving a distinct analytical purpose.

In addition to visualization, when datasets contain a large number of variables (such as in genomics, transcriptomics, and proteomics), statistical techniques such as dimensionality reduction are required to reduce complexity while preserving important information. One of the most widely used techniques is Principal Component Analysis (PCA).

2. Boxplots

A boxplot (or box-and-whisker plot) is a graphical representation of data distribution based on the five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.

2.1 Structure
  • Box: Represents the interquartile range (IQR = Q3 − Q1), where the central 50% of values lie.
  • Median line: A line inside the box indicating the central tendency.
  • Whiskers: Lines extending from the box to the smallest and largest values that lie within 1.5 × IQR of the box edges (Q1 and Q3).
  • Outliers: Points beyond the whiskers, plotted as dots, showing extreme values.

2.2 Applications
  • Comparing distributions between groups (e.g., gene expression levels in treatment vs control).
  • Detecting outliers that may represent measurement errors or rare biological events.
  • Assessing data skewness and spread.

2.3 Example in Biology

In clinical studies, boxplots are often used to compare biomarker levels between healthy and diseased individuals, which makes marked differences between the groups easy to see.
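The quantities a boxplot draws can be computed directly. The following is a minimal sketch of the 1.5 × IQR rule described in Section 2.1, assuming NumPy is available; the function name `boxplot_stats` and the biomarker readings are hypothetical and made up purely for illustration.

```python
import numpy as np

def boxplot_stats(values):
    """Return the quantities a boxplot draws, using the 1.5 x IQR rule."""
    x = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    # Fences: any point outside them is drawn as an outlier dot.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = x[(x >= lower_fence) & (x <= upper_fence)]
    outliers = x[(x < lower_fence) | (x > upper_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        # Whiskers stop at the most extreme values still inside the fences.
        "whisker_low": inside.min(),
        "whisker_high": inside.max(),
        "outliers": outliers,
    }

# Hypothetical biomarker readings (made-up numbers, not real data).
biomarker = [4.1, 4.5, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 12.0]
stats = boxplot_stats(biomarker)
print(stats)
```

In this made-up sample the value 12.0 falls above the upper fence, so a boxplot would draw it as a separate dot rather than extend the whisker to it. Note that quartile conventions differ slightly between software packages; `np.percentile` uses linear interpolation by default.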

Examples of data shapes using histograms. Histograms can have different shapes depending on the data they are representing.

4. Scatterplots

A scatterplot is used to visualize the relationship between two continuous variables by plotting data points on a Cartesian coordinate system.

Correlation

Correlation is any statistical relationship between two random variables, regardless of whether the relationship is causal (one variable causes the other) or not. Although correlation technically refers to any statistical association, it is typically used to describe how linearly related two variables are. Even though correlation cannot be used to prove a causal relationship between two variables, it can be used to make predictions. For example, given two variables that are highly correlated, we can predict the value of one from the other with reasonable accuracy. Correlation between two random variables is typically presented graphically using a scatter plot, or numerically using a correlation coefficient.

Scatter plots

Scatter plots are graphs that depict clusters of dots that represent all of the pairs of data in an experiment. For example, a plot of weight vs. height will show a positive correlation: as height increases, weight also increases.

Scatter plots are constructed by plotting two variables along the horizontal (x) and vertical (y) axes. A scatter plot can show a positive correlation, a negative correlation, or little to no correlation. Note that the more closely the cluster of dots resembles a straight line, the stronger the correlation.
  • Positive correlation: The two random variables increase together. There is a positive correlation between height and weight: weight increases as height increases.
  • Negative correlation: One of the random variables increases as the other decreases. There is a negative correlation between speed and the amount of time it takes to get somewhere: as speed increases, it takes a shorter amount of time to get to a destination.
  • No correlation: There is no linear relationship between the two random variables. There is no correlation between being able to write in cursive and the number of fish in the ocean.

Correlation coefficient

A correlation coefficient is a numerical representation of the relationship between a pair of random variables. There are several different correlation coefficients, the most commonly used of which is the Pearson correlation coefficient. The Pearson correlation coefficient (r), also referred to as Pearson's r, is a value between -1 and +1 that describes the linear relationship between two random variables. The closer r is to -1 or +1, the more linear the relationship between the variables. An r of 0 means that there is no linear correlation between the variables at all.
  • r = 1: perfect positive correlation
  • r = -1: perfect negative correlation
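As a rough illustration of how Pearson's r is obtained, the sketch below computes it from its standard definition (the sum of products of deviations from the means, divided by the square root of the product of the summed squared deviations). It assumes NumPy; the function name `pearson_r` and all numbers are hypothetical, chosen only to mirror the height/weight and speed/time examples above.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Made-up measurements for illustration only.
height_cm = [150, 160, 165, 172, 180, 188]
weight_kg = [52, 58, 63, 70, 77, 85]
speed_kmh = [20, 40, 60, 80, 100]
travel_time_h = [6.0, 3.0, 2.0, 1.5, 1.2]  # time for a fixed 120 km trip

r_pos = pearson_r(height_cm, weight_kg)      # close to +1
r_neg = pearson_r(speed_kmh, travel_time_h)  # clearly negative
print(round(r_pos, 3), round(r_neg, 3))
```

The speed/time pair is strongly related but not linearly (time is proportional to 1/speed), which is why its r is negative without reaching -1: Pearson's r measures only the linear part of a relationship.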

5.2 Steps in PCA

  1. Standardization: Data are scaled so each variable contributes equally.
  2. Covariance matrix computation: Identifies relationships between variables.
  3. Eigen decomposition: Eigenvalues (amount of variance explained) and eigenvectors (directions) are calculated.
  4. Principal components selection: PCs with highest eigenvalues are chosen.
  5. Projection: Data are transformed into a new coordinate system defined by selected PCs.

5.3 Applications in Biology
  • Genomics: Reducing gene expression data for clustering and visualization.
  • Population genetics: Understanding population structure and ancestry.
  • Proteomics/Metabolomics: Identifying dominant features differentiating disease vs control.
  • Clinical datasets: Simplifying patient biomarker data for classification.

5.4 Interpretation
  • Scree plots show how much variance each PC explains.
  • Score plots (PC1 vs PC2) visualize sample clustering.
  • Loading plots indicate which variables contribute most to each PC.

5.5 Example

Imagine you want to study ageing and want to find out what factors contribute to a longer or shorter lifespan. You start with a dataset that looks like this:

Let’s say we have data from 20 people and the age at which they passed away. And then we have many factors like their height, weight, sex, whether they smoked or not, their diet, economic status, profession, age of retirement… 200 biological (or not) factors.

In order to understand the data we need to visualise it first. But we cannot visualise so many dimensions all at once. We could, for example, pick 2 factors and plot them. We might think smoking and blood cholesterol levels are major contributors to life expectancy and plot just those two. But then we lose possibly valuable information contained in other factors of the data, such as weight, diet… Is there a way of taking into account ALL factors?

I guess you know what the answer is. An amazing solution to our problem is Principal Component Analysis, or PCA for short. PCA takes all the factors, combines them in a smart way and produces new factors. These new factors are called principal components, or PCs for short. And it does that in such a way that if you just focus on the first few principal components, you will keep most of the information from the dataset.

Principal component analysis (PCA) is a technique that transforms high-dimensional data into lower-dimensional data while retaining as much information as possible.

Going back to our example, imagine we computed principal component analysis on our dataset, and reduced our 200 dimensions to 5 principal components.
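The five steps listed in Section 5.2 can be sketched in a few lines. This is only an illustrative sketch, assuming NumPy; the function name `pca` and the randomly generated dataset (20 people, 6 factors instead of 200) are hypothetical stand-ins, not the data from the example.

```python
import numpy as np

def pca(data, n_components):
    """PCA via eigen decomposition of the covariance matrix (rows = samples)."""
    X = np.asarray(data, dtype=float)
    # 1. Standardization: zero mean, unit variance per variable.
    #    (Assumes no variable is constant, otherwise this divides by zero.)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2. Covariance matrix of the standardized variables.
    cov = np.cov(Z, rowvar=False)
    # 3. Eigen decomposition (eigh is meant for symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Selection: sort by eigenvalue, largest first, keep the top ones.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]
    # 5. Projection onto the selected principal components.
    scores = Z @ components
    explained_ratio = eigvals / eigvals.sum()
    return scores, components, explained_ratio

# Hypothetical dataset: 20 people x 6 factors driven by 2 hidden trends.
rng = np.random.default_rng(0)
latent = rng.normal(size=(20, 2))
mixing = rng.normal(size=(2, 6))
data = latent @ mixing + 0.1 * rng.normal(size=(20, 6))

scores, components, explained_ratio = pca(data, n_components=2)
print(scores.shape)              # one (PC1, PC2) pair per person
print(explained_ratio.round(3))  # fraction of variance per component
```

The `scores` array is what a score plot (PC1 vs PC2) displays, and `explained_ratio` is what a scree plot displays. In practice a library routine would normally be used instead of hand-written linear algebra.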

In this case, we coloured the points (people) by age at death. We can see that our samples already cluster together really nicely by age: those people who lived longer are grouped together, and those who lived shorter lives are grouped together. So PCA took into account all our biological factors, transformed them into new variables called principal components, and if we just take the first two we already see some interesting trends in our data.

In summary, PCA is a great way of representing large datasets to observe trends, jumps, clusters and outliers.

But what about the other PCs, you may ask? How do we know if the first 2 principal components are enough to capture most of the information, or variance, in the dataset? The solution to this is a scree plot. A scree plot tells you how much variance of the dataset (basically how much information) is explained by each principal component.

In this case, the first principal component, or PC1, explains 50% of the variance in our dataset. This means that 50% of the total variation across all the measured factors is captured by the first principal component alone. If you add principal component 2, that’s another 35%, which makes 85%. Of course, it depends on your objectives, but capturing 85% of the variance in the dataset with two components is pretty nice. Ideally, we want to get around 90% of the variance with just 2 to 3 components, so that enough information is retained while we can still visualize our data on a plot.
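The arithmetic behind reading a scree plot is a running total. The sketch below uses the 50% and 35% figures from the example for PC1 and PC2; the values for PC3 to PC5 and the helper name `components_needed` are assumptions added only so that the list sums to 100%.

```python
import numpy as np

def components_needed(explained_ratio, threshold):
    """Smallest number of leading PCs whose cumulative variance reaches threshold."""
    cumulative = np.cumsum(explained_ratio)
    if cumulative[-1] < threshold - 1e-9:
        raise ValueError("threshold is never reached")
    # Small tolerance guards against floating-point rounding in the sum.
    return int(np.argmax(cumulative >= threshold - 1e-9)) + 1

# PC1 and PC2 are from the example; PC3..PC5 are made-up filler values.
explained_ratio = [0.50, 0.35, 0.08, 0.04, 0.03]

cumulative = np.cumsum(explained_ratio)
print(cumulative.round(2))
print(components_needed(explained_ratio, 0.85))
print(components_needed(explained_ratio, 0.90))
```

With these numbers, two components reach the 85% mentioned in the text, and a third would be needed to pass 90%.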

A scree plot tells you how much variance of the dataset each principal component holds.

But what exactly is a ‘Principal Component’? What does ‘Principal Component’ mean?

As a biologist who is interpreting biological data, you are interested in knowing which variables (biological factors) are responsible for the patterns seen among the observations (people). Basically, you would like to know which variables are influential, and also how the variables are correlated. This is given by the principal component loadings. Each variable gets a loading, or weight, for each principal component, which tells you how much it contributes to that PC. Let’s represent the loadings for some of the variables in a plot:

When variables are negatively (“inversely”) correlated, they are positioned on opposite sides of the plot origin, in diagonally opposed quadrants. For example, heart rate and frequent exercise are inversely correlated, meaning that when frequent exercise increases, your average heart rate decreases, and vice versa. Furthermore, the distance to the origin also conveys information.

The further away a variable is from the origin, the stronger its impact on the model. For example, here obesity, blood pressure and average heart rate seem to be good variables to separate longer lifespans from shorter lifespans.

A loading plot is a great way of seeing the relationships between all our variables at the same time. It lets you know which variables are influential, and also how the variables are correlated.

So how do I interpret a PCA?

So back to our PCA plot. How do we use this to draw conclusions from our dataset? Let’s have a look at this other example. We have data from the gene expression profiles of 50 different patients with lung cancer. And for each patient, we measured the expression of 30,000 genes.

Nice! This is our PCA plot. Each point is one of our patients. Patients with similar gene expression profiles are now clustered together. Just glancing at this plot, we can see that there are 3 clusters of patients. Let’s colour each of the three clusters to visualise it better.

This means that overall, we have 3 distinct gene expression profiles. And this is very interesting, because it might mean that one group of patients will respond better to drug X and another group will respond better to radiotherapy.

What other conclusions can we draw from this PCA plot? The orange and green clusters are different based on PC1. So the differences in gene expression profiles are probably due to genes that have heavy influences on PC1. Remember that the loadings tell us which genes have heavier weights on the PC. The pink and green clusters are different based on PC2, so the genes that influence PC2 more are likely to be responsible for this.

But remember, PCs are ranked by how much of the variation in the data they describe. PC1 is more important than PC2, so differences between clusters along the PC1 axis are actually larger than similar-looking distances along the PC2 axis. In this case, the orange and green clusters are more different than the green and pink clusters.
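To see how loadings point to the genes behind a separation along PC1, the sketch below ranks variables by the absolute size of their loading on a chosen component. It assumes NumPy and obtains the principal directions from a singular value decomposition of the standardized data; the helper `top_loading_variables`, the five gene names and the simulated expression values are all hypothetical (50 patients, 5 genes rather than 30,000).

```python
import numpy as np

def top_loading_variables(data, names, pc=0, top=3):
    """Rank variables by the absolute value of their loading on one PC."""
    X = np.asarray(data, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of vt are the principal directions; vt[pc][j] is the weight
    # (loading) of variable j on that component. The overall sign of a
    # row is arbitrary, only relative signs are meaningful.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    loadings = vt[pc]
    order = np.argsort(np.abs(loadings))[::-1][:top]
    return [(names[i], float(loadings[i])) for i in order]

# Simulated data: two patient groups that differ in geneA (up) and geneB (down).
rng = np.random.default_rng(1)
group = np.repeat([0.0, 1.0], 25)
data = rng.normal(size=(50, 5))
data[:, 0] += 8.0 * group
data[:, 1] -= 8.0 * group
names = ["geneA", "geneB", "geneC", "geneD", "geneE"]

top = top_loading_variables(data, names, pc=0, top=2)
print(top)
```

In this simulation the two genes that actually differ between groups carry the heaviest PC1 loadings, with opposite signs because they are inversely correlated, which matches the way a loading plot places such variables on opposite sides of the origin.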

Wait, but are 2 principal components enough to represent ALL our 30,000 genes? Perhaps you remember how to check this. That’s right, we need to take a look at the scree plot.

In this case, the first 2 PCs cover most of the variation in our data. So we’re happy to segment our patients into 3 groups. And this is one of the first steps of analysis.

  • Scalability: Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.
  • Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.
  • Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
  • Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but it also makes the quality of clustering difficult to control.
  • Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may lead to clusters of poor quality.
  • Incremental clustering and insensitivity to the order of input records: Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are sensitive to the order of input data. That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order of presentation of the input objects. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.
  • High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.
  • Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city’s rivers and highway networks, and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy specified constraints.
  • Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

4.2 Major Clustering Methods:
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Methods
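To make the idea of a partitioning method concrete, here is a minimal sketch of k-means, a widely known algorithm of that family. The notes above do not name a specific algorithm, so choosing k-means is an assumption made for illustration. It assumes NumPy; the function name `kmeans`, its parameters and the simulated points are hypothetical. The sketch also shows two requirements from the list above in action: the user must supply the number of clusters `k`, and assignment by Euclidean distance favours roughly spherical clusters.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means: alternate nearest-centroid assignment and mean update."""
    X = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its points (keep it if empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    return labels, centroids

# Simulated 2-D data: three compact groups of 20 points each.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + 0.5 * rng.normal(size=(20, 2)) for c in centers])

labels, centroids = kmeans(points, k=3)
print(np.bincount(labels, minlength=3))  # cluster sizes
print(centroids.round(2))
```

Because the starting centroids are random, different seeds can give different partitions, and a poor choice of `k` gives poor clusters; both are instances of the sensitivity to input parameters described above.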