Data Pre-processing and Quality Control, Study notes of Computational and Statistical Data Analysis

Descriptive statistics (mean, median, SD). Visualizations: boxplots, histograms, scatterplots. Dimensionality Reduction: Principal Component Analysis (PCA). Clustering and Pattern Discovery, Clustering Techniques, and Evaluation of Clustering. Classification vs. regression. Introduction to Hypothesis Testing: p-values, t-test, ANOVA, Multiple testing correction (FDR, Bonferroni)

Typology: Study notes

2025/2026

Available from 01/05/2026

MSc Bioinformatics
Research methodology and ethics (MMB10400)
1st Year, Sem I
2025-26
Dr. Priyanka Sen Guha,
Assistant Professor,
Department of Biotechnology,
Brainware University.

Study Material
(Research methodology and ethics - MMB10400)
MODULE 3

Table of contents
Introduction to biological data types
Structured vs. Unstructured Data
Data Pre-processing and quality control
Exploratory Data Analysis


1. Introduction to Biological Data Types

With the advent of high-throughput technologies and bioinformatics tools, the life sciences have witnessed an explosion in the quantity and diversity of biological data. These data are generated across different biological levels, from DNA to RNA to proteins, and even at the level of clinical outcomes. Understanding the types and characteristics of biological data is essential for researchers aiming to explore complex biological systems, uncover disease mechanisms, and contribute to translational medicine.

Among the vast types of data generated, four major categories form the core of biological and biomedical research: genomic, transcriptomic, proteomic, and clinical data. Each represents a different layer of biological information, and their integration enables a systems-level understanding of health and disease.

a. Genomic Data

Genomic data refers to the complete set of DNA sequences within an organism, encompassing genes (both coding and non-coding), regulatory regions, and structural elements of chromosomes. This data type serves as the foundational blueprint of life, dictating cellular function, inheritance patterns, and susceptibility to diseases.

Genomic data is primarily obtained through advanced sequencing technologies such as Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES). While WGS provides a comprehensive view of an organism's entire DNA, WES focuses only on the protein-coding regions (exons), which make up about 1-2% of the genome but harbor the majority of known disease-causing mutations.

Applications of genomic data are vast and transformative. In clinical settings, genomic analysis is increasingly used for diagnostic purposes, particularly in rare genetic disorders and cancers. In population genomics, large-scale studies identify genetic variants associated with traits or diseases through Genome-Wide Association Studies (GWAS). Additionally, the field of pharmacogenomics uses genomic profiles to predict individual responses to drugs, enabling personalized treatment plans. Despite its potential, genomic data also present challenges such as variant interpretation, data storage, and ethical concerns, especially related to privacy and genetic discrimination.

b. Transcriptomic Data

While the genome provides the instructions, the transcriptome reflects the dynamic expression of those instructions under specific physiological or pathological conditions. Transcriptomic data refers to the complete set of RNA molecules transcribed from the genome at a given time. These include messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and various types of non-coding RNAs. Unlike the relatively static genome, the transcriptome is highly responsive to internal and external stimuli, making it a powerful indicator of cellular activity.

The most widely used technologies for transcriptome profiling are RNA sequencing (RNA-Seq) and microarray analysis. RNA-Seq provides a quantitative and qualitative assessment of transcripts, enabling the identification of differentially expressed genes, alternative splicing events, and novel transcripts. Transcriptomic data is extensively used in biomedical research to understand disease mechanisms, especially in cancer, where specific expression signatures can distinguish between

However, clinical data pose several challenges. Data quality can be inconsistent due to missing values, different coding systems (e.g., ICD-10), and variability in data collection methods across institutions. Additionally, ethical and legal issues concerning patient consent, privacy, and data sharing must be addressed to facilitate responsible data use.

In summary, biological data types such as genomic, transcriptomic, proteomic, and clinical data represent distinct yet interconnected layers of biological information. Genomic data provide the heritable code, transcriptomic data reveal gene expression patterns, proteomic data describe the functional protein landscape, and clinical data offer real-world patient insights. Together, they enable comprehensive exploration of biological systems and support translational research aiming to bridge bench-to-bedside applications. Mastery in understanding and analyzing these data types is critical for modern biologists, bioinformaticians, and healthcare professionals, especially in the era of integrative and personalized medicine.

2. Structured vs. Unstructured Data

In data science and bioinformatics, data can be broadly classified into structured and unstructured categories, depending on the format, organization, and ease of processing.

1. Structured Data

Structured data refers to information that is organized in a defined format, often stored in relational databases or spreadsheets, making it easily searchable and processable by algorithms.

  • Format & Storage: Typically arranged in rows and columns with well-defined data types (e.g., integers, dates, strings).
  • Examples in Life Sciences:
    o Genomic sequences stored in FASTA/VCF formats with defined identifiers
    o Patient clinical records with fixed fields (age, sex, diagnosis codes)
    o Laboratory measurements (enzyme activity, protein concentrations) in tabular datasets
  • Advantages:
    o Easy to store, retrieve, and analyze using SQL and statistical tools.
    o High compatibility with machine learning pipelines.
  • Limitations:
    o Less suited for capturing complex or context-rich data.
    o Requires predefined schemas, which limit flexibility.

2. Unstructured Data

Unstructured data lacks a predefined organizational model, making it more complex to store, search, and analyze.

  • Format & Storage: Often stored as text documents, images, audio/video files, or free-form notes; metadata may be minimal.
  • Examples in Life Sciences:
    o Microscopy images
    o Pathology reports in free text
    o Audio recordings of patient interviews
    o Next-generation sequencing (NGS) raw reads before preprocessing
  • Advantages:
    o Captures richer, more detailed information.
    o More adaptable for qualitative analysis and AI-driven pattern recognition.
  • Limitations:
    o Requires specialized tools (natural language processing, image recognition) for analysis.
    o High computational and storage requirements.

3. Relevance in Biotechnology & Bioinformatics

Modern biological research often integrates both types of data. For example, structured clinical parameters may be combined with unstructured imaging data to develop diagnostic AI models. The ability to manage and analyse both data types is crucial for precision medicine, large-scale biological databases, and integrative omics studies.

3. Data Pre-processing and Quality Control

Introduction

In the modern era of data-intensive research, especially in bioinformatics, biotechnology, genomics, proteomics, and systems biology, the quality of raw data directly determines the validity of downstream analysis. Raw datasets often contain errors, inconsistencies, biases, and noise due to limitations in experimental design, sample handling, measurement technology, or human factors.

Data preprocessing is the process of transforming raw, unrefined data into clean, consistent, and analyzable formats. Quality control (QC) ensures that the processed data maintains accuracy, reproducibility, and biological validity. Without preprocessing and QC, statistical analysis may produce false positives/negatives, leading to misleading conclusions and wasted research resources.

1. Handling Missing Values

1.1 Nature and Impact

Missing data can occur when a measurement or observation fails to be recorded. In biotechnology, missing values can arise from:
  • Instrument limitations (e.g., below detection limit in ELISA assays)
  • Sample degradation before measurement
  • Human oversight in recording data
  • Sequencing gaps in next-generation sequencing (NGS) data

Impact of Missing Values:
  • Distorts statistical parameters (mean, variance, correlation)
  • Reduces statistical power due to smaller sample sizes
  • Can bias models, particularly if missingness is systematic

1.2 Types of Missing Data
  1. MCAR (Missing Completely at Random): Probability of missingness is unrelated to observed or unobserved data. Example: Random instrument failure.
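
As a minimal sketch of how missingness might be assessed before any handling strategy is chosen, the snippet below counts the fraction of missing entries per feature and the number of complete cases that would survive row-wise deletion. The small table, its column names (`elisa_od`, `expression`), and the helper functions are hypothetical illustrations and are not taken from the notes.

```python
# Hypothetical assay table: None marks a value that was not recorded.
samples = [
    {"sample": "S1", "elisa_od": 0.82, "expression": 5.1},
    {"sample": "S2", "elisa_od": None, "expression": 4.8},
    {"sample": "S3", "elisa_od": 0.79, "expression": None},
    {"sample": "S4", "elisa_od": 0.91, "expression": 5.4},
    {"sample": "S5", "elisa_od": None, "expression": 5.0},
]
features = ["elisa_od", "expression"]


def missing_fraction(rows, column):
    """Fraction of rows in which `column` is missing."""
    return sum(row[column] is None for row in rows) / len(rows)


def complete_cases(rows, columns):
    """Rows in which every listed column has a recorded value."""
    return [row for row in rows if all(row[c] is not None for c in columns)]


fractions = {c: missing_fraction(samples, c) for c in features}
kept = complete_cases(samples, features)

print(fractions)                  # missing fraction per feature
print(len(kept), "of", len(samples), "samples are complete")
```

In this toy table, dropping incomplete rows leaves only two of five samples, which illustrates the loss of statistical power listed above.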

Method | Description | Application in Biotech
Quantile | Equalizes distributions across samples | Microarray data normalization
Log Transform | Reduces skewness, stabilizes variance | Metabolomics, gene expression

Example: In RNA-seq, raw read counts can be normalized using:
  • TPM (Transcripts Per Million)
  • RPKM (Reads Per Kilobase of transcript per Million mapped reads)
  • Median-of-ratios method (DESeq2)

3. Scaling

3.1 Concept

While normalization addresses distributional differences, scaling ensures that feature magnitudes are comparable, which matters particularly for distance- and variance-based methods (e.g., K-means clustering, SVM, PCA).

3.2 Types of Scaling
  1. Standard Scaling (Z-score scaling): $z = \frac{x - \mu}{\sigma}$. Suitable for normally distributed features.
  2. Min-Max Scaling: $x' = \frac{x - x_{min}}{x_{max} - x_{min}}$. Preserves shape but rescales range.
  3. Robust Scaling: Uses median and interquartile range (IQR); resistant to outliers.

3.3 Importance in Biotechnology
  • In gene expression clustering, genes with high absolute counts may dominate analysis unless scaled.
  • In metabolomics, concentrations vary widely; scaling ensures that low-abundance metabolites still contribute to multivariate models.

4. Outlier Detection

4.1 Definition

An outlier is an observation significantly different from the rest of the dataset.

4.2 Causes
  • Experimental error (e.g., pipetting mistakes)
  • Contamination (microbial, chemical)
  • Genuine biological variation (rare disease mutations)

4.3 Detection Methods

Method | Description | Best Use Case
Z-score | Flag values more than 3 SD from the mean | Gaussian data
IQR Method | Flag values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR | Skewed data
Visualization | Boxplots, scatter plots | Quick detection
Machine Learning | Isolation Forest, DBSCAN | High-dimensional omics data

4.4 Treatment of Outliers
  • Investigate biological plausibility before removal.
  • Remove if confirmed as error.
  • Transform data (log, Box-Cox) to reduce influence.
  • Winsorize (cap/floor) extreme values.

5. Workflow for Data Preprocessing and QC
  1. Raw Data Collection
  2. Missing Value Assessment
  3. Imputation or Removal
  4. Normalization
  5. Scaling
  6. Outlier Detection
  7. Final QC Report
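
The workflow steps above can be sketched on a single feature as follows. This is an illustrative sketch under stated assumptions, not the procedure prescribed by the notes: median imputation, the `log2(x + 1)` pseudocount, the function name `preprocess`, and the toy `raw_counts` values are all choices made here for demonstration.

```python
import math
import statistics


def preprocess(values):
    """Toy version of workflow steps 2-7 for one feature (None = missing)."""
    # Step 2: missing value assessment
    n_missing = sum(v is None for v in values)

    # Step 3: imputation (assumption: fill with the median of observed values)
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in values]

    # Step 4: normalization via log transform (assumption: log2 with pseudocount 1)
    logged = [math.log2(v + 1) for v in imputed]

    # Step 5: z-score scaling, z = (x - mean) / sample SD
    mu = statistics.mean(logged)
    sigma = statistics.stdev(logged)
    scaled = [(v - mu) / sigma for v in logged]

    # Step 6: outlier detection with the IQR rule
    q1, _, q3 = statistics.quantiles(scaled, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [i for i, v in enumerate(scaled) if v < low or v > high]

    # Step 7: minimal QC report
    return {"n_missing": n_missing, "scaled": scaled, "outliers": outliers}


# Hypothetical read counts for one gene across eight samples.
raw_counts = [120, 135, None, 128, 142, 131, 9000, 126]
report = preprocess(raw_counts)
print("missing values:", report["n_missing"])
print("outlier positions:", report["outliers"])
```

Flagged positions are only candidates: as noted under treatment of outliers, biological plausibility should be investigated before anything is removed.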

provide numerical and graphical summaries that describe and simplify data, enabling researchers to understand patterns, central tendencies, and variability before applying inferential statistical methods. Three fundamental descriptive measures are the mean, median, and standard deviation (SD). They help in summarizing the dataset's central value and spread, facilitating comparison between datasets.

2. Mean (Arithmetic Mean)

Definition: The mean, or average, is the sum of all values in a dataset divided by the number of observations.

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Where:
  • $\bar{x}$ = mean
  • $x_i$ = each data point
  • $n$ = total number of observations

Example: If five protein concentrations (mg/mL) are 3.2, 4.1, 3.8, 4.5, 4.4, then

$\bar{x} = \frac{3.2 + 4.1 + 3.8 + 4.5 + 4.4}{5} = \frac{20.0}{5} = 4.0$

Applications in Life Sciences:
  • Determining the average gene expression level across replicates.
  • Calculating mean enzyme activity from multiple assays.

Advantages:
  • Simple to calculate.
  • Uses all data points.

Limitations:
  • Sensitive to extreme values (outliers).

3. Median

Definition: The median is the middle value in an ordered dataset. It divides the dataset into two equal halves, with 50% of values below and 50% above it.

Steps to Calculate:

  1. Arrange data in ascending order.
  2. If $n$ is odd → median is the middle value.
  3. If $n$ is even → median is the average of the two middle values.

Example:
Dataset: 2.4, 3.1, 3.5, 4.0, 4.8 (odd, $n = 5$) → Median = 3.5
Dataset: 2.4, 3.1, 3.5, 4.0 (even, $n = 4$) → Median = $\frac{3.1 + 3.5}{2} = 3.3$

Applications in Life Sciences:
  • Reporting median survival times in clinical trials.
  • Summarizing skewed biological measurements (e.g., biomarker levels).

Advantages:
  • Not affected by extreme values.
  • Represents the central point in skewed distributions.

Limitations:
  • Does not consider all data points directly in the calculation.

4. Standard Deviation (SD)

Definition: Standard deviation measures the spread or variability of a dataset around the mean. A small SD indicates data points are close to the mean, while a large SD suggests high variability.

Formula (Sample SD):

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

Where:
  • $s$ = sample standard deviation
  • $x_i$ = each data value
  • $\bar{x}$ = sample mean
  • $n$ = number of observations

Example:
Dataset: 2, 4, 4, 4, 5, 5, 7, 9
Mean = 5
SD calculation:
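
The worked calculation for this dataset is not included in the preview, so the following is a sketch of how it might proceed, following the sample SD formula given above. The variable names are illustrative. The population SD (dividing by $n$) is shown only for comparison and is not a formula stated in the notes.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)

# Mean: 40 / 8 = 5.0
mean = sum(data) / n

# Squared deviations from the mean: 9, 1, 1, 1, 0, 0, 4, 16
squared_dev = [(x - mean) ** 2 for x in data]
ss = sum(squared_dev)                     # sum of squares = 32.0

# Sample SD divides by n - 1, as in the formula above: sqrt(32 / 7) ≈ 2.14
sample_sd = math.sqrt(ss / (n - 1))

# Population SD divides by n (shown for comparison): sqrt(32 / 8) = 2.0
population_sd = math.sqrt(ss / n)

print("mean:", mean)
print("sample SD:", round(sample_sd, 3))
print("population SD:", population_sd)
print("statistics.stdev agrees:", math.isclose(sample_sd, statistics.stdev(data)))
```

The two values differ noticeably for small $n$; the sample SD is the appropriate choice when the data are a sample used to estimate the variability of a larger population.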

A solid understanding of these statistics allows researchers to summarize experimental results meaningfully, identify trends, and assess data reliability before applying advanced statistical models.

Questions

Q1. Which of the following best describes genomic data?
A. The complete set of RNA molecules in a cell at a given time
B. The total protein composition within a tissue
C. The complete DNA sequence, including coding and non-coding regions
D. The real-world patient-level health records
Answer: C

Q2. Transcriptomic data is particularly valuable because it:
A. Remains constant regardless of environmental changes
B. Reflects dynamic gene expression under specific conditions
C. Only represents protein-coding genes
D. Is easier to store and process than genomic data
Answer: B

Q3. Which technique is most commonly associated with large-scale proteomic analysis?
A. PCR
B. RNA-Seq
C. Mass spectrometry
D. Southern blotting
Answer: C

Q4. Which of the following is not an example of structured biological data?
A. FASTA sequence files
B. Enzyme activity values in a spreadsheet
C. Microscopy images
D. Patient age and diagnosis code in an EHR
Answer: C

Q5. In handling missing data, Multiple Imputation (MI) is preferred because:
A. It fills in missing values with zero
B. It generates several plausible values and combines results
C. It removes all missing records to maintain accuracy
D. It guarantees the original dataset remains unchanged
Answer: B

Q6. Which normalization method is particularly suited for making gene expression data comparable across samples in RNA-Seq?
A. Min-Max Scaling
B. Quantile Normalization
C. TPM (Transcripts Per Million)
D. Log Transformation only
Answer: C

Q7. Scaling in data preprocessing is most critical when:
A. Variables have the same units but different ranges
B. All features are already normally distributed
C. Outliers have been completely removed

Short Answer Questions

  1. Define transcriptomic data and mention one common technology used to generate it.
  2. What is the key difference between structured and unstructured biological data?
  3. List two potential causes of missing values in biotechnology datasets.
  4. Name one normalization method used in RNA-Seq data and its purpose.
  5. What does the Interquartile Range (IQR) method detect in data analysis?

Long Answer Questions
  6. Explain the four major types of biological data—genomic, transcriptomic, proteomic, and clinical—highlighting their sources, applications, and limitations.
  7. Discuss the differences between structured and unstructured data in bioinformatics, providing examples and relevance in precision medicine.
  8. Describe various approaches to handling missing values in biological datasets, and evaluate their advantages and disadvantages.
  9. Explain normalization and scaling in data preprocessing. Include examples of methods and their importance in omics data analysis.
  10. Outline the complete workflow for data preprocessing and quality control in biotechnology research, highlighting each step with examples.