









Descriptive statistics (mean, median, SD). Visualizations: boxplots, histograms, scatterplots. Dimensionality Reduction: Principal Component Analysis (PCA). Clustering and Pattern Discovery, Clustering Techniques, and Evaluation of Clustering. Classification vs. regression. Introduction to Hypothesis Testing: p-values, t-test, ANOVA, Multiple testing correction (FDR, Bonferroni)
Typology: Study notes










Research methodology and ethics (MMB 10400), 1st Year, Sem I 2025-26
Dr. Priyanka Sen Guha, Assistant Professor, Department of Biotechnology
1. Introduction to Biological Data Types

With the advent of high-throughput technologies and bioinformatics tools, the life sciences have witnessed an explosion in the quantity and diversity of biological data. These data are generated across different biological levels, from DNA to RNA to proteins, and even at the level of clinical outcomes. Understanding the types and characteristics of biological data is essential for researchers aiming to explore complex biological systems, uncover disease mechanisms, and contribute to translational medicine.

Among the vast types of data generated, four major categories form the core of biological and biomedical research: genomic, transcriptomic, proteomic, and clinical data. Each represents a different layer of biological information, and their integration enables a systems-level understanding of health and disease.

a. Genomic Data

Genomic data refers to the complete set of DNA sequences within an organism, encompassing genes (both coding and non-coding), regulatory regions, and structural elements of chromosomes. This data type serves as the foundational blueprint of life, dictating cellular function, inheritance patterns, and susceptibility to diseases.

Genomic data is primarily obtained through advanced sequencing technologies such as Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES). While WGS provides a comprehensive view of an organism’s entire DNA, WES focuses only on the protein-coding regions (exons), which make up about 1-2% of the genome but harbor the majority of known disease-causing mutations.

Applications of genomic data are vast and transformative. In clinical settings, genomic analysis is increasingly used for diagnostic purposes, particularly in rare genetic disorders and cancers. In population genomics, large-scale studies identify genetic variants associated with traits or diseases through Genome-Wide Association Studies (GWAS).
Additionally, the field of pharmacogenomics uses genomic profiles to predict individual responses to drugs, enabling personalized treatment plans. Despite its potential, genomic data also present challenges such as variant interpretation, data storage, and ethical concerns, especially related to privacy and genetic discrimination.

b. Transcriptomic Data

While the genome provides the instructions, the transcriptome reflects the dynamic expression of those instructions under specific physiological or pathological conditions. Transcriptomic data refers to the complete set of RNA molecules transcribed from the genome at a given time. These include messenger RNA (mRNA), ribosomal RNA (rRNA), transfer RNA (tRNA), and various types of non-coding RNAs. Unlike the relatively static genome, the transcriptome is highly responsive to internal and external stimuli, making it a powerful indicator of cellular activity.

The most widely used technologies for transcriptome profiling are RNA sequencing (RNA-Seq) and microarray analysis. RNA-Seq provides a quantitative and qualitative assessment of transcripts, enabling the identification of differentially expressed genes, alternative splicing events, and novel transcripts. Transcriptomic data is extensively used in biomedical research to understand disease mechanisms, especially in cancer, where specific expression signatures can distinguish between
However, clinical data pose several challenges. Data quality can be inconsistent due to missing values, different coding systems (e.g., ICD-10), and variability in data collection methods across institutions. Additionally, ethical and legal issues concerning patient consent, privacy, and data sharing must be addressed to facilitate responsible data use.

In summary, biological data types such as genomic, transcriptomic, proteomic, and clinical data represent distinct yet interconnected layers of biological information. Genomic data provide the heritable code, transcriptomic data reveal gene expression patterns, proteomic data describe the functional protein landscape, and clinical data offer real-world patient insights. Together, they enable comprehensive exploration of biological systems and support translational research aiming to bridge bench-to-bedside applications. Mastery in understanding and analyzing these data types is critical for modern biologists, bioinformaticians, and healthcare professionals, especially in the era of integrative and personalized medicine.
2. Structured vs. Unstructured Data

In data science and bioinformatics, data can be broadly classified into structured and unstructured categories, depending on the format, organization, and ease of processing.

1. Structured Data

Structured data refers to information that is organized in a defined format, often stored in relational databases or spreadsheets, which makes it easily searchable and processable by algorithms.

- Format & Storage: Typically arranged in rows and columns with well-defined data types (e.g., integers, dates, strings).
- Examples in Life Sciences:
  o Genomic sequences and variant calls stored in FASTA/VCF formats with defined identifiers
  o Patient clinical records with fixed fields (age, sex, diagnosis codes)
  o Laboratory measurements (enzyme activity, protein concentrations) in tabular datasets
- Advantages:
  o Easy to store, retrieve, and analyze using SQL and statistical tools.
  o High compatibility with machine learning pipelines.
- Limitations:
  o Less suited for capturing complex or context-rich data.
  o Requires predefined schemas, which limit flexibility.

2. Unstructured Data

Unstructured data lacks a predefined organizational model, making it more complex to store, search, and analyze.

- Format & Storage: Often stored as text documents, images, audio/video files, or free-form notes; metadata may be minimal.
- Examples in Life Sciences:
  o Microscopy images
  o Pathology reports in free text
  o Audio recordings of patient interviews
  o Next-generation sequencing (NGS) raw reads before preprocessing
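The practical difference between the two categories can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name, column names, patient records, and the free-text report are all hypothetical and not taken from any real dataset. The structured records are queried with SQL through the standard-library `sqlite3` module, while the free-text note has to be parsed with a pattern that may not generalize to other reports.

```python
import re
import sqlite3

# Structured data: hypothetical patient records with fixed, typed fields.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients "
    "(patient_id TEXT, age INTEGER, sex TEXT, diagnosis_code TEXT)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [
        ("P001", 54, "F", "C50.9"),  # illustrative ICD-10-style codes
        ("P002", 61, "M", "E11.9"),
        ("P003", 47, "F", "C50.9"),
    ],
)

# Because the schema is predefined, retrieval is a single query.
structured_hits = [
    row[0]
    for row in conn.execute(
        "SELECT patient_id FROM patients "
        "WHERE diagnosis_code = ? ORDER BY patient_id",
        ("C50.9",),
    )
]

# Unstructured data: a hypothetical free-text pathology note.
report = "Pt is a 54 y/o female. Biopsy shows invasive ductal carcinoma, grade 2."

# No schema exists, so each field must be extracted with ad hoc rules.
age_match = re.search(r"(\d+)\s*y/o", report)
extracted_age = int(age_match.group(1)) if age_match else None
mentions_carcinoma = "carcinoma" in report.lower()

print(structured_hits)                    # ['P001', 'P003']
print(extracted_age, mentions_carcinoma)  # 54 True
```

The regular expression works only for notes that happen to write the age as "54 y/o"; a report phrased differently would need a different rule, which is the kind of extra effort unstructured data demands.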
Method         | Description                             | Application in Biotech
Quantile       | Equalizes distributions across samples  | Microarray data normalization
Log Transform  | Reduces skewness, stabilizes variance   | Metabolomics, gene expression

Example: In RNA-seq, raw read counts can be normalized using:
An outlier is an observation significantly different from the rest of the dataset.

4.2 Causes
provide numerical and graphical summaries that describe and simplify data, enabling researchers to understand patterns, central tendencies, and variability before applying inferential statistical methods. Three fundamental descriptive measures are the mean, median, and standard deviation (SD). They help in summarizing the dataset’s central value and spread, facilitating comparison between datasets.
2. Mean (Arithmetic Mean)

Definition: The mean, or average, is the sum of all values in a dataset divided by the number of observations.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Where:
- $\bar{x}$ = mean
- $x_i$ = each data point
- $n$ = total number of observations

Example: If five protein concentrations (mg/mL) are 3.2, 4.1, 3.8, 4.5, 4.4,

\[ \bar{x} = \frac{3.2 + 4.1 + 3.8 + 4.5 + 4.4}{5} = \frac{20.0}{5} = 4.0 \]

Applications in Life Sciences:
- Determining the average gene expression level across replicates.
- Calculating mean enzyme activity from multiple assays.

Advantages:
- Simple to calculate.
- Uses all data points.

Limitations:
- Sensitive to extreme values (outliers).

3. Median

Definition: The median is the middle value in an ordered dataset. It divides the dataset into two equal halves, with 50% of values below and 50% above it.
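As a quick check of the mean example and the median definition, the following minimal sketch uses Python's standard-library `statistics` module on the five protein concentrations given in the notes. The second list, in which 4.4 is replaced by 44.0, is a hypothetical data-entry error added here only to illustrate how an outlier shifts the mean while leaving the median unchanged. The SD line assumes the sample (n - 1) form, which the notes do not specify.

```python
import statistics

# Protein concentrations (mg/mL) from the worked example.
concentrations = [3.2, 4.1, 3.8, 4.5, 4.4]

mean_value = statistics.mean(concentrations)
median_value = statistics.median(concentrations)
# Sample standard deviation (n - 1 denominator); an assumption, since the
# notes do not say whether the sample or population form is intended.
sd_value = statistics.stdev(concentrations)

# Hypothetical outlier: 4.4 mistyped as 44.0.
with_outlier = [3.2, 4.1, 3.8, 4.5, 44.0]
outlier_mean = statistics.mean(with_outlier)
outlier_median = statistics.median(with_outlier)

print(round(mean_value, 2), median_value)      # 4.0 4.1
print(round(outlier_mean, 2), outlier_median)  # 11.92 4.1
```

A single extreme value nearly triples the mean, whereas the median stays at 4.1, which is why the median is often reported alongside the mean for skewed biological measurements.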
Steps to Calculate:
A solid understanding of these statistics allows researchers to summarize experimental results meaningfully, identify trends, and assess data reliability before applying advanced statistical models.

Questions

Q1. Which of the following best describes genomic data?
A. The complete set of RNA molecules in a cell at a given time
B. The total protein composition within a tissue
C. The complete DNA sequence, including coding and non-coding regions
D. The real-world patient-level health records
Answer: C

Q2. Transcriptomic data is particularly valuable because it:
A. Remains constant regardless of environmental changes
B. Reflects dynamic gene expression under specific conditions
C. Only represents protein-coding genes
D. Is easier to store and process than genomic data
Answer: B

Q3. Which technique is most commonly associated with large-scale proteomic analysis?
A. PCR
B. RNA-Seq
C. Mass spectrometry
D. Southern blotting
Answer: C
Q4. Which of the following is not an example of structured biological data?
A. FASTA sequence files
B. Enzyme activity values in a spreadsheet
C. Microscopy images
D. Patient age and diagnosis code in an EHR
Answer: C

Q5. In handling missing data, Multiple Imputation (MI) is preferred because:
A. It fills in missing values with zero
B. It generates several plausible values and combines results
C. It removes all missing records to maintain accuracy
D. It guarantees the original dataset remains unchanged
Answer: B

Q6. Which normalization method is particularly suited for making gene expression data comparable across samples in RNA-Seq?
A. Min-Max Scaling
B. Quantile Normalization
C. TPM (Transcripts Per Million)
D. Log Transformation only
Answer: C

Q7. Scaling in data preprocessing is most critical when:
A. Variables have the same units but different ranges
B. All features are already normally distributed
C. Outliers have been completely removed
Short Answer Questions