Mitochondrial Protein Sequence-Computer Sciences Applications and Systems-Project Presentation, Slides of Applications of Computer Sciences

This is a project presentation for the Applications of Computer Sciences course, delivered in the presence of Prof. Ashish Behari at Alliance University. Its main points are: Mitochondrial, Protein, Sequence, Extraction, Selection, Strategies, Classification, Techniques, Amino, Acids

Typology: Slides

2011/2012

Uploaded on 07/16/2012

samderiya

Outline of Presentation

• Introduction

• Basic Concepts

• Feature Extraction strategies

• Feature Selection strategies

• Classification techniques

• Description of Author’s results

• Comparison with Author’s results

• Comparison with other methods

• Conclusions

• Future Work

Basic Concepts

• Before discussing the project, a few basic concepts are needed:

  • Proteins
  • Amino acids
  • Mitochondrial protein sequence

Proteins

• Proteins are the major components of living organisms and constitute more than 25% of the weight of a cell.

• They perform functions such as catalysis, transport, digestion, movement, sensory capabilities (for example taste and vision), and control of gene function.

• Proteins are made up of strings of amino acids (usually represented by English letters).

Amino acids

Amino acid       3-letter abbreviation   1-letter abbreviation
Alanine          Ala                     A
Cysteine         Cys                     C
Aspartic acid    Asp                     D
Glutamic acid    Glu                     E
Phenylalanine    Phe                     F
Glycine          Gly                     G
Histidine        His                     H
Isoleucine       Ile                     I
Lysine           Lys                     K
Leucine          Leu                     L
Methionine       Met                     M
Asparagine       Asn                     N
Proline          Pro                     P
Glutamine        Gln                     Q
Arginine         Arg                     R
Serine           Ser                     S
Threonine        Thr                     T
Valine           Val                     V
Tryptophan       Trp                     W
Tyrosine         Tyr                     Y

Mitochondrial Protein Sequence

>sp|P31937|3HIDH_HUMAN 3-hydroxyisobutyrate dehydrogenase, mitochondrial precursor (EC 1.1.1.31) (HIBADH) - Homo sapiens (Human).
MAASLRLLGAASGLRYWSRRLRPAAGSFAAVCSRSVASKTP
VGFIGLGNMGNPMAKNLMKHGYPLIIYDVFPDACKEFQDAGE
QVVSSPADVAEKADRIITMLPTSINAIEAYSGANGILKKVKKGS
LLIDSSTIDPAVSKELAKEVEKMGAVFMDAPVSGGVGAARSG
NLTFMVGGVEDEFAAAQELLGCMGSNVVYCGAVGTGQAAKI
CNNMLLAISMIGTAEAMNLGIRLGLDPKLLAKILNMSSGRCWS
SDTYNPVPGVMDGVPSANNYQGGFGTTLMAKDLGLAQDSA
TSTKSPILLGSLAHQIYRMMCAKGYSKKDFSSVFQFLREEETF
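The entry above is in FASTA format: a header line starting with ">" followed by the sequence in one-letter codes. A minimal sketch of how such an entry might be read into a program is given below. The function name `parse_fasta` is a hypothetical helper, the example text holds only the first two sequence lines of the entry, and the "|"-separated header layout is assumed from this one entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Shortened copy of the entry above (first two sequence lines only).
example = """>sp|P31937|3HIDH_HUMAN 3-hydroxyisobutyrate dehydrogenase, mitochondrial precursor
MAASLRLLGAASGLRYWSRRLRPAAGSFAAVCSRSVASKTP
VGFIGLGNMGNPMAKNLMKHGYPLIIYDVFPDACKEFQDAGE
"""

records = parse_fasta(example)
header, sequence = records[0]
accession = header.split("|")[1]  # second "|"-separated field of the header
```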

Data set

• The dataset used in this project was generated by Jiang et al. (2006) and was received on request from [email protected].

• It comprises 499 mitochondrial (positive) and 681 non-mitochondrial (negative) sequences.

Feature Extraction Strategies

• We have used four kinds of protein representation:

  • Amino acid composition (AAC).
  • Pseudo amino acid composition (PseAAC).
  • Dipeptide composition (Dp).
  • Split amino acid composition (SAAC).

Pseudo amino acid composition

• PseAAC contains a set of more than 20 discrete factors: the first 20 are the components of the conventional amino acid composition, while the additional factors incorporate some sequence-order information via various modes. The feature vector is $P = [P_1, P_2, \ldots, P_{20}, P_{21}, \ldots, P_{20+\lambda}]$

• Here $P_1, P_2, \ldots, P_{20}$ are the normalized occurrence frequencies of the 20 amino acids, and

• $P_{21}, P_{22}, \ldots, P_{20+\lambda}$ are the 1st-tier to $\lambda$-tier correlation factors of the amino acid sequence in the protein chain, determined from hydrophobicity and hydrophilicity.

• Hydrophobicity and hydrophilicity correlation functions are used.
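The slides do not give the correlation formula, so the following is only a sketch under stated assumptions. It follows the commonly cited form in which the k-th tier factor is the mean squared difference of a physicochemical property between residues k positions apart, and in which the 20 frequencies and the λ factors are normalized together with a weight w. The weight value 0.05, the function names, and the requirement that the caller supplies the property scales (for example hydrophobicity and hydrophilicity tables) are all assumptions, not details taken from the presentation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def correlation_factor(seq, scales, k):
    """k-th tier factor: mean squared property difference between
    residues k apart, averaged over the supplied property scales."""
    pairs = len(seq) - k
    total = 0.0
    for i in range(pairs):
        a, b = seq[i], seq[i + k]
        total += sum((s[a] - s[b]) ** 2 for s in scales) / len(scales)
    return total / pairs


def pseudo_aac(seq, scales, lam, weight=0.05):
    """Return a (20 + lam)-dimensional vector.

    scales: list of dicts mapping each one-letter code to a property
            value (caller-supplied; no values are assumed here).
    weight: assumed weighting of the sequence-order part.
    """
    seq = "".join(aa for aa in seq.upper() if aa in AMINO_ACIDS)
    if lam >= len(seq):
        raise ValueError("lam must be smaller than the sequence length")
    freqs = [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]
    thetas = [correlation_factor(seq, scales, k) for k in range(1, lam + 1)]
    denom = sum(freqs) + weight * sum(thetas)
    return [f / denom for f in freqs] + [weight * t / denom for t in thetas]


# Toy scale for illustration only; real use needs published property tables.
toy_scale = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}
vector = pseudo_aac(AMINO_ACIDS * 2, [toy_scale], lam=3)
```

With this normalization the components of the vector sum to 1, and the first 20 entries reduce to the plain amino acid composition when the weight is 0.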

Dipeptide composition

• It is used to transform variable-length protein sequences into fixed-length feature vectors.

• The occurrence frequency of every consecutive pair of amino acids is calculated as $F(i) = P(i) / P$, where $P(i)$ is the number of occurrences of pair $i$ and $P$ is the total number of pairs in the protein sequence.

• This gives a 400-dimensional feature vector (20 × 20 possible pairs).
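A minimal sketch of the dipeptide composition described above is shown below. The function name is hypothetical, and skipping pairs that contain non-standard letters is an assumption that the slides do not address.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs, in a fixed order.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(seq):
    """Return the 400-dimensional vector F(i) = P(i) / P."""
    seq = seq.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # assumption: ignore pairs with non-standard letters
            counts[pair] += 1
            total += 1
    if total == 0:
        raise ValueError("sequence contains no valid amino acid pair")
    return [counts[p] / total for p in DIPEPTIDES]


features = dipeptide_composition("MAASLRLLGAASGLRYWSRR")
```

Whatever the sequence length, the result has 400 entries that sum to 1, which is what makes it usable as a fixed-length input to a classifier.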

Split amino acid composition (Contd.)

• A very small number of sequences in the dataset contain fewer than 50 amino acids.

• These sequences are divided into the same three parts, but using:

  • 10 amino acids of the N-terminus,
  • 10 amino acids of the C-terminus, and
  • the region between these two termini.
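The split described above can be sketched as follows, assuming that each of the three parts is represented by its own 20-dimensional amino acid composition (60 values in total). The terminus length used for sequences of 50 or more residues is not stated in this part of the slides, so it is left as a required parameter `terminus_len`; that parameter and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(segment):
    """Normalized frequency of each of the 20 amino acids in a segment."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def split_aac(seq, terminus_len, short_cutoff=50, short_terminus_len=10):
    """Composition of the N-terminal, middle, and C-terminal parts.

    terminus_len: terminus length for regular sequences (caller's choice;
                  not given in the slides).
    Sequences shorter than short_cutoff use short_terminus_len instead.
    """
    seq = seq.upper()
    n = short_terminus_len if len(seq) < short_cutoff else terminus_len
    if len(seq) < 2 * n:
        raise ValueError("sequence too short for the chosen terminus length")
    n_part = seq[:n]
    middle = seq[n:len(seq) - n]
    c_part = seq[len(seq) - n:]
    return composition(n_part) + composition(middle) + composition(c_part)


# A 40-residue toy sequence falls under the short-sequence rule (10 + 20 + 10).
short_seq = "A" * 10 + "C" * 20 + "D" * 10
vector = split_aac(short_seq, terminus_len=25)
```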

Feature Selection Strategies

• We have employed two different feature selection strategies for the dipeptide composition:

  • Rank-based feature selection
  • Feature selection through a genetic algorithm (GA)

Testing Methods

• Jackknife test

  • One protein sequence pattern is taken as the test sample, and the remaining N-1 sequence patterns are used as training patterns.
  • The label of the test sample is predicted using these N-1 training patterns.
  • This process is repeated for all N patterns.

• Independent test

• Self-consistency test
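The jackknife procedure listed above can be sketched as a leave-one-out loop. The nearest-neighbour rule used here is only a stand-in so that the example runs on its own; it is not the classifier used in the project, and the function names and toy data are hypothetical.

```python
def nearest_neighbor_predict(train_x, train_y, x):
    """Stand-in classifier: label of the closest training pattern."""
    def squared_distance(j):
        return sum((a - b) ** 2 for a, b in zip(train_x[j], x))
    best = min(range(len(train_x)), key=squared_distance)
    return train_y[best]


def jackknife_predictions(features, labels, predict=nearest_neighbor_predict):
    """Predict each pattern from the other N-1 patterns."""
    predictions = []
    for i in range(len(features)):
        # Hold out pattern i; train on the remaining N-1 patterns.
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(predict(train_x, train_y, features[i]))
    return predictions


# Toy data: two well-separated groups (1 = positive, 0 = negative).
features = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1],
            [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
labels = [1, 1, 1, 0, 0, 0]
predictions = jackknife_predictions(features, labels)
```

Any classifier with the same `predict(train_x, train_y, x)` shape can be passed in place of the stand-in.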

Performance Measurements

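• Sensitivity, specificity, accuracy (ACC), and Matthews correlation coefficient (MCC)

  • Sensitivity = $TP / (TP + FN)$
  • Specificity = $TP / (TP + FP)$ (this ratio is the positive predictive value, which some bioinformatics literature reports as specificity; the more common definition is $TN / (TN + FP)$)
  • ACC = $(TP + TN) / (TP + TN + FP + FN)$
  • MCC = $\dfrac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}}$

  Here TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.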
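A short sketch of how these measures might be computed from the four counts is given below. The ratio TP / (TP + FP) from the slide is reported under the key `ppv` (positive predictive value), and the conventional specificity TN / (TN + FP) is reported alongside it. Returning 0.0 when a denominator is zero is an assumption made here for convenience, and the counts in the example are made up for illustration.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a two-class problem."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def ratio(num, den):
    """Division that returns 0.0 for an empty denominator (assumption)."""
    return num / den if den else 0.0


def performance(tp, tn, fp, fn):
    mcc_den = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),  # conventional definition
        "ppv": ratio(tp, tp + fp),          # the slide's TP / (TP + FP)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Illustrative counts only, not results from the project.
scores = performance(tp=40, tn=45, fp=5, fn=10)
```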