Exploring Protein Structures: Visualization, Comparison, and Classification, Summaries of Bioinformatics

This summary explains the importance of protein structural visualization using atomic models built from PDB data files. It also covers protein structure comparison methods, such as superposition and combined approaches, and their applications in structural classification, including the SCOP and CATH databases and their hierarchical levels.

Typology: Summaries

2020/2021

Uploaded on 08/08/2021

ahmed-mohammed-16
CHAPTER 13

Protein Structure Visualization, Comparison, and Classification

The main feature of computer visualization programs is interactivity, which allows users to visually manipulate structural images through a graphical user interface. At the touch of a mouse button, a user can move, rotate, and zoom an atomic model on a computer screen in real time, or examine any portion of the structure in great detail, as well as draw it in various forms and in different colors.

PROTEIN STRUCTURAL VISUALIZATION

A Protein Data Bank (PDB) data file for a protein structure contains only the x, y, and z coordinates of atoms. The most basic requirement for a visualization program is therefore to build connectivity between atoms to make a view of a molecule.
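A minimal sketch of that first step, assuming the fixed-column ATOM record layout of the PDB format and a simple distance rule for connectivity. The function names, the 1.9 Å cutoff, and the four-atom glycine fragment are illustrative assumptions, not taken from any particular visualization program.

```python
import math

def parse_atoms(pdb_text):
    """Read ATOM/HETATM records using the fixed column layout of the PDB format."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # atom name, e.g. CA
                "residue": line[17:20].strip(),   # residue name, e.g. GLY
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def infer_bonds(atoms, cutoff=1.9):
    """Connect every pair of atoms closer than `cutoff` angstroms (assumed threshold)."""
    bonds = []
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if math.dist(atoms[i]["xyz"], atoms[j]["xyz"]) < cutoff:
                bonds.append((i, j))
    return bonds

# Invented backbone fragment with plausible bond lengths (not from a real entry).
EXAMPLE_PDB = "\n".join([
    "ATOM      1  N   GLY A   1       0.000   0.000   0.000",
    "ATOM      2  CA  GLY A   1       1.458   0.000   0.000",
    "ATOM      3  C   GLY A   1       2.000   1.420   0.000",
    "ATOM      4  O   GLY A   1       3.200   1.600   0.000",
])

atoms = parse_atoms(EXAMPLE_PDB)
for i, j in infer_bonds(atoms):
    print(atoms[i]["name"], "-", atoms[j]["name"])
```

The all-pairs loop is quadratic in the number of atoms, which is fine for a toy fragment; real programs would typically use residue templates or spatial indexing instead.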

Molecular structure visualization forms:

  1. A wire-frame diagram is a line drawing representing bonds between atoms. The wire frame is the simplest form of model representation and is useful for localizing positions of specific residues in a protein structure, or for displaying a skeletal form of a structure when the Cα atoms of each residue are connected.
  2. Balls and sticks are solid spheres and rods, representing atoms and bonds, respectively. These diagrams can also be used to represent the backbone of a structure.
  3. In a space-filling representation, each atom is drawn as a large solid sphere with a radius corresponding to the van der Waals radius of the atom.
  4. Ribbon diagrams use cylinders or spiral ribbons to represent α-helices and broad, flat arrows to represent β-strands. This type of representation is very attractive in that it allows easy

In practice, only the distances between Cα carbons of corresponding residues are measured. The goal of structural comparison is to achieve a minimum root mean square deviation (RMSD). However, the problem with RMSD is that it depends on the size of the proteins being compared: for the same degree of sequence identity, large proteins tend to have higher RMSD values than small proteins when an optimal alignment is reached. Recently, a logarithmic factor has been proposed to correct this size-dependency problem. This new measure is called RMSD100 and is determined by the following formula:

$$\mathrm{RMSD}_{100} = \frac{\mathrm{RMSD}}{-1.3 + 0.5\,\ln N}$$

where N is the total number of corresponding atoms.
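A minimal sketch of both measures, assuming paired Cα coordinates are already available as tuples. The function names are illustrative, and the constants follow the published logarithmic correction RMSD100 = RMSD / (-1.3 + 0.5 ln N), which is roughly equal to RMSD when N = 100.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation over paired, already superimposed coordinates."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))

def rmsd100(rmsd_value, n):
    """Size-corrected RMSD, normalised to a protein of 100 residues."""
    denominator = -1.3 + 0.5 * math.log(n)
    if denominator <= 0:
        # The correction is only meaningful for reasonably long alignments.
        raise ValueError("n is too small for the logarithmic correction")
    return rmsd_value / denominator

# Two invented two-atom "structures" that differ by 1 angstrom along z.
print(rmsd([(0, 0, 0), (1, 0, 0)], [(0, 0, 1), (1, 0, 1)]))
# The same raw RMSD is scaled down for a larger protein.
print(rmsd100(3.0, 400))
```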

A number of solutions have been proposed to compare more distantly related structures.

  • One approach is to delete sequence-variable regions outside secondary structure elements to reduce the search time required to find an optimum superposition. However, this method does not guarantee an optimal alignment.
  • Another approach is to divide the proteins into small fragments (e.g., every six to nine residues). Matching of similar regions at the three-dimensional level is then done fragment by fragment. After the best-fitting fragments are found, a joint superposition for the entire structure is performed.
  • The third approach is termed iterative optimization, during which the two sequences are first aligned using dynamic programming. The identified equivalent residues are used to guide a first round of superposition.

Figure 4: Simplified representation showing steps involved in the structure superposition of two protein molecules. (A) Two protein structures are positioned in different places in a three-dimensional space. Equivalent positions are identified using a sequence-based alignment approach. (B) To superimpose the two structures, the first step is to move one structure (left) relative to the other (right) through lateral and vertical movement, which is called translation. (C) The left structure is then rotated relative to the reference structure until the relative distances between equivalent positions are minimal.
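The translate-then-rotate idea in the figure can be sketched in two dimensions, where the best rotation angle has a simple closed form. This is a simplified 2D illustration with invented points and hypothetical function names; real structures need a three-dimensional least-squares fit, for which methods such as the Kabsch algorithm are commonly used.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def superpose_2d(mobile, reference):
    """Translate, then rotate `mobile` onto `reference`; return (fitted points, RMSD)."""
    cm, cr = centroid(mobile), centroid(reference)
    # Translation: move both centroids to the origin.
    p = [(x - cm[0], y - cm[1]) for x, y in mobile]
    q = [(x - cr[0], y - cr[1]) for x, y in reference]
    # Rotation: angle that minimises the summed squared distances between pairs.
    dot = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(p, q))
    cross = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(p, q))
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Apply the rotation and place the result on the reference centroid.
    fitted = [(c * x - s * y + cr[0], s * x + c * y + cr[1]) for x, y in p]
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(fitted, reference))
    return fitted, math.sqrt(sq / len(reference))

reference = [(0, 0), (1, 0), (1, 1), (0, 2)]
# The same shape rotated by 90 degrees and shifted by (5, -3).
mobile = [(5, -3), (5, -2), (4, -2), (3, -3)]
fitted, fit_rmsd = superpose_2d(mobile, reference)
print(round(fit_rmsd, 6))
```

The equivalent positions are assumed to be known in advance (same index in both lists), which corresponds to step (A) of the figure.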

PROTEIN STRUCTURE CLASSIFICATION

One of the applications of protein structure comparison is structural classification. The ability to compare protein structures allows classification of the structure data and identification of relationships among structures. The reason to develop a protein structure classification system is to establish hierarchical relationships among protein structures and to provide a comprehensive and evolutionary view of known structures.

The two most popular classification schemes are SCOP and CATH, both of which contain a number of hierarchical levels in their systems.

1- SCOP

SCOP is a database for comparing and classifying protein structures. It is constructed almost entirely based on manual examination of protein structures. The proteins are grouped into hierarchies of classes, folds, superfamilies, and families.

  • Families consist of proteins having high sequence identity (> 30%). Thus, the proteins within a family clearly share close evolutionary relationships and normally have the same functionality. The protein structures at this level are also extremely similar.
  • Superfamilies consist of families with similar structures but weak sequence similarity. It is believed that members of the same superfamily share a common ancestral origin, although the relationships between families are considered distant.
  • Folds consist of superfamilies with a common core structure, which is determined manually. This level describes similar overall secondary structures with similar orientation and connectivity between them. Members within the same fold do not always have evolutionary relationships. Some of the shared core structure may be a result of analogy.
  • Classes consist of folds with similar core structures. This is the highest level of the hierarchy, which distinguishes groups of proteins by secondary structure composition, such as all α, all β, α and β, and so on.

2- CATH

CATH classifies proteins based on the automatic structural alignment program SSAP as well as manual comparison. Structural domain separation is also carried out as a combined effort of a human expert and computer programs. Individual domain structures are classified at five major levels: class, architecture, fold/topology, homologous superfamily, and homologous family. The definition of class in CATH is similar to that in SCOP and is based on secondary structure content. Architecture is a unique level in CATH, intermediate between fold and class. This level describes the overall packing and arrangement of secondary structures independent of connectivity between the elements. The topology level is equivalent to the fold level in SCOP; it describes the overall orientation of secondary structures and takes into account the sequence connectivity between the secondary structure elements. The homologous superfamily and homologous family levels are equivalent to the superfamily and family levels in SCOP, respectively, with similar evolutionary definitions.
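The correspondence between the two hierarchies described above can be written down as a small lookup table. This is only an illustrative sketch; the variable and function names are hypothetical, and the level names are taken from these notes rather than from either database's file formats.

```python
# Levels listed from the broadest to the most specific, as described in the notes.
SCOP_LEVELS = ["class", "fold", "superfamily", "family"]
CATH_LEVELS = ["class", "architecture", "topology",
               "homologous superfamily", "homologous family"]

# CATH level -> equivalent SCOP level (None where SCOP has no counterpart).
CATH_TO_SCOP = {
    "class": "class",
    "architecture": None,          # unique to CATH, between class and fold
    "topology": "fold",
    "homologous superfamily": "superfamily",
    "homologous family": "family",
}

def scop_equivalent(cath_level):
    """Return the SCOP level matching a CATH level, or None if there is none."""
    return CATH_TO_SCOP[cath_level]

for level in CATH_LEVELS:
    print(f"{level:24s} -> {scop_equivalent(level)}")
```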

Comparison of SCOP and CATH

SCOP

  • Is almost entirely based on manual comparison of structures by human experts, with no quantitative criteria to group proteins.
  • It is argued that this approach offers some flexibility in recognizing distant structural relatives, because human brains may be more adept at recognizing slightly dissimilar structures that essentially have the same architecture.

CATH

  • Is a combination of manual curation and automated procedures, which makes the process less subjective. For example, in defining domains, CATH first relies on the consensus of three different algorithms to recognize domains. When the computer programs disagree, human intervention takes place.
  • In addition, the extra architecture level in CATH makes the structure classification more continuous.
  • The drawback of the system is that the fixed thresholds in structural comparison may make assignment less accurate.

CHAPTER 17

Genome Mapping, Assembly, and Comparison

Genomics is the study of genomes. Genomic studies are characterized by simultaneous analysis of a large number of genes using automated data gathering tools. The topics of genomics range from genome mapping, sequencing, and functional genomic analysis to comparative genomic analysis.

Genomic study can be tentatively divided into structural genomics and functional genomics.

  • Structural genomics refers to the initial phase of genome analysis, which includes construction of genetic and physical maps of a genome, identification of genes, annotation of gene features, and comparison of genome structures.
  • Functional genomics refers to the analysis of global gene expression and gene functions in a genome.

GENOME MAPPING

The first step to understanding a genome structure is genome mapping, which is the process of identifying the relative locations of genes, mutations, or traits on a chromosome. There are three types of genome maps, which describe genomes at different levels of resolution:

  • linkage maps
  • physical maps
  • cytologic maps

Their relations relative to the DNA sequence on a chromosome are illustrated in the figure below. More details of each type of genome map are discussed next.

Genetic linkage maps, also called genetic maps, identify the relative positions of genetic markers on a chromosome and are based on how frequently the markers are inherited together. The rationale behind genetic mapping is that the closer the two

Physical maps are maps of locations of identifiable landmarks on a genomic DNA regardless of inheritance patterns. The distance between genetic markers is measured directly as kilobases (Kb) or megabases (Mb). Because the distance is expressed in physical units, it is more accurate and reliable than centiMorgans used in genetic maps.

Cytologic maps refer to banding patterns seen on stained chromosomes, which can be directly observed under a microscope. The observable light and dark bands are the visually distinct markers on a chromosome.

Figure: Overview of various genome maps relative to the genomic DNA sequence. The maps represent different levels of resolution to describe a genome using genetic markers. Cytologic maps are obtained microscopically. Genetic maps (grey bar) are obtained through genetic crossing experiments in which chromosome recombinations are analyzed. Physical maps are obtained from overlapping clones identified by hybridizing the clone fragments (grey bars) with common probes (grey asterisks).

GENOME ANNOTATION

The genome annotation process provides comments for the features in a genome sequence. It involves two steps: gene prediction and functional assignment. Some examples of finished gene annotations in GenBank have been described in the Biological Database section.

Gene Ontology

Gene Ontology is a way to capture biological knowledge in written and computable form. It includes:

  • Biological process
  • Molecular function
  • Cellular component

How Many Genes in a Genome?

The exact number of genes in the human genome is unknown, but is likely to be in the same range as in most other eukaryotes. The gene number, however, does not dictate the complexity of a genome.

  • One of the main tasks of genome annotation is to try to give a precise account of the total number of genes in a genome.
  • Before the human genome sequencing was completed, the estimated gene numbers ranged from 20,000 to 120,000.
  • Since the completion of the sequencing of the human genome, and with the use of more sophisticated gene-finding programs, the estimated total number of human genes has dropped to between 25,000 and 30,000.
  • Although no exact number is agreed upon by all researchers, it is now widely believed that the total number of human genes will be no more than 30,000. This compares to estimates of 50,000 in rice, 30,000 in mouse, 26,000 in Arabidopsis, 18,400 in C. elegans, and 6,200 in yeast.
  • The discovery of the low gene count in humans may be ego-defeating to some as they realize that, by gene count, humans are only five times more complex than baker's yeast and apparently as complex as the mouse. What is worse, the food in their rice bowls has nearly twice as many genes. The finding seriously challenges the view that humans are a superior species on Earth.

CHAPTER 18

Functional Genomics

The field of genomics encompasses two main areas, structural genomics and functional genomics.

  • Structural genomics deals with genome structures, with a focus on the study of genome mapping and assembly as well as genome annotation and comparison.
  • Functional genomics is largely experiment based, with a focus on gene functions at the whole-genome level using high throughput approaches.

The high throughput analysis of all expressed genes is also termed transcriptome analysis, which is the expression analysis of the full set of RNA molecules produced by a cell under a given set of conditions.

Transcriptome analysis using ESTs, SAGE, and DNA microarrays forms the core of functional genomics and is key to understanding the interactions of genes and their regulation at the whole-genome level.

SEQUENCE-BASED APPROACHES

1. Expressed Sequence Tags (ESTs)

  • One of the high throughput approaches to genome-wide profiling of gene expression is sequencing expressed sequence tags (ESTs).
  • ESTs are short sequences obtained from cDNA clones and serve as short identifiers of full-length genes.
  • ESTs are typically in the range of 200 to 400 nucleotides in length and are obtained from either the 5′ end or the 3′ end of cDNA inserts.
  • EST sequences are often of low quality because they are automatically generated without verification and thus contain high error rates.
  • Despite these limitations, EST technology is still widely used. This is because EST libraries can be easily generated from various cell lines, tissues, and organs, and at various developmental stages.

2. SAGE (Serial Analysis of Gene Expression)

  • SAGE is another high throughput, sequence-based approach for global gene expression profile analysis.
  • Unlike EST sampling, SAGE is more quantitative in determining mRNA expression in a cell.
  • In this method, short fragments of DNA (usually 15 base pairs [bp]) are excised from cDNA sequences and used as unique markers of the gene transcripts.
  • This approach is much more efficient than EST analysis in that it uses a short nucleotide tag to define a gene transcript and allows sequencing of multiple tags in a single clone.
  • In a SAGE experiment, sequencing is the most costly and time-consuming step. It is difficult to know how many tags need to be sequenced to get good coverage of the entire transcriptome.
  • Another obvious drawback of this approach is its sensitivity to sequencing errors, owing to the small size of the oligonucleotide tags used for transcript representation. One or two sequencing errors in the tag sequence can lead to ambiguous or erroneous tag identification.
  • Another fundamental problem with SAGE is that a correctly sequenced SAGE tag may sometimes correspond to several genes or to no gene at all.
  • To improve the sensitivity and specificity of SAGE detection, the lengths of the tags need to be increased.
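The counting idea behind SAGE, and the ambiguity problem noted above, can be illustrated with a short sketch. The tags, gene names, and the tag-to-gene table are all invented for illustration; a real analysis would use a reference tag database.

```python
from collections import Counter

def count_tags(tags):
    """Tag frequency serves as a proxy for transcript abundance."""
    return Counter(tags)

def annotate(tag_counts, tag_to_genes):
    """Label each observed tag as unique, ambiguous, or unmatched."""
    report = {}
    for tag, count in tag_counts.items():
        genes = tag_to_genes.get(tag, [])
        if len(genes) == 1:
            status = "unique"
        elif genes:
            status = "ambiguous"   # one tag maps to several genes
        else:
            status = "unmatched"   # no known gene, or a sequencing error
        report[tag] = (count, status, genes)
    return report

# Hypothetical 15-bp tags read from concatenated clones.
observed = ["ACGTACGTACGTACG"] * 5 + ["TTGACCATGGCATCA"] * 2 + ["GGGCATTTACCGATA"]
# Hypothetical reference table mapping tags to genes.
reference_table = {
    "ACGTACGTACGTACG": ["geneA"],
    "TTGACCATGGCATCA": ["geneB", "geneC"],
}

for tag, (count, status, genes) in annotate(count_tags(observed), reference_table).items():
    print(tag, count, status, genes)
```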

MICROARRAY-BASED APPROACHES

  • The most commonly used global gene expression profiling method in current genomics research is the DNA microarray-based approach.
  • A microarray (or gene chip) is a slide attached with a high-density array of immobilized DNA oligomers (sometimes cDNAs) representing the entire genome of the species under study.
  • A typical DNA microarray experiment involves a multistep procedure:

o fabrication of microarrays by fixing properly designed oligonucleotides representing specific genes;
o hybridization of cDNA populations onto the microarray;
o scanning of hybridization signals and image analysis;
o transformation and normalization of data;
o analysis of data to identify differentially expressed genes as well as sets of genes that are coregulated.
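The transformation, normalization, and analysis steps can be sketched as follows, assuming a two-sample design in which each gene has one intensity in the test sample and one in a reference sample. The intensities, gene names, median-centering choice, and twofold cutoff are illustrative assumptions, not a prescribed protocol.

```python
import math
from statistics import median

def log2_ratios(sample, reference):
    """Transformation: log2(sample / reference) for each gene."""
    return {gene: math.log2(sample[gene] / reference[gene]) for gene in sample}

def median_normalize(ratios):
    """Normalization: centre the ratios on the median, assuming most genes are unchanged."""
    m = median(ratios.values())
    return {gene: r - m for gene, r in ratios.items()}

def differentially_expressed(normalized, threshold=1.0):
    """Genes with |log2 ratio| >= threshold (1.0 means twofold, an arbitrary cutoff)."""
    return sorted(g for g, r in normalized.items() if abs(r) >= threshold)

# Invented spot intensities; the test sample was scanned about twice as bright overall.
sample = {"geneA": 800, "geneB": 200, "geneC": 210, "geneD": 50, "geneE": 190}
reference = {"geneA": 100, "geneB": 100, "geneC": 100, "geneD": 100, "geneE": 100}

normalized = median_normalize(log2_ratios(sample, reference))
print(differentially_expressed(normalized))
```

Without the normalization step, geneB, geneC, and geneE would look roughly twofold induced simply because of the overall brightness difference between the two samples.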

Microarray Data Classification

One of the key features of DNA microarray analysis is to study the expression of many genes in parallel and identify groups of genes that exhibit similar expression patterns. The similar expression patterns are often a result of the fact that the genes involved are in the same metabolic pathway and have similar functions. The genetic basis of the coregulation could be common promoters and regulatory regions.

Supervised and Unsupervised Classification

Based on the computed distances between genes in an expression profile, genes with similar expression patterns can be grouped. The classification analysis can be either supervised or unsupervised.

  • A supervised analysis refers to classification of data into a set of predefined categories. For example, depending on the purpose of the experiment, the data can be classified into predefined “diseased” or “normal” categories.
  • An unsupervised analysis does not assume predefined categories, but identifies data categories according to actual similarity patterns. The unsupervised analysis is also called clustering, which groups patterns into clusters of genes with correlated profiles.

For microarray data, gene clustering may reveal the functions of previously uncharacterized genes. Clustering methods include hierarchical clustering and partitioning clustering (e.g., k-means, self-organizing maps [SOMs]). The following discussion focuses on several of the most frequently used clustering methods.
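Two distance measures that are commonly used for expression profiles can be sketched as below. The notes do not say which measure is intended, so both the choice of measures and the function names are assumptions made for illustration.

```python
import math

def euclidean_distance(a, b):
    """Sensitive to the absolute expression level of each gene."""
    return math.dist(a, b)

def pearson_distance(a, b):
    """1 - Pearson correlation: 0 for the same trend, 2 for opposite trends."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return 1 - cov / (sd_a * sd_b)

# Invented profiles over four conditions.
gene1 = [1, 2, 3, 4]
gene2 = [10, 20, 30, 40]   # same trend as gene1, ten times the magnitude
gene3 = [4, 3, 2, 1]       # opposite trend

print(euclidean_distance(gene1, gene2), pearson_distance(gene1, gene2))
print(euclidean_distance(gene1, gene3), pearson_distance(gene1, gene3))
```

The example shows why the choice matters: gene1 and gene2 are far apart by Euclidean distance yet perfectly correlated, so a correlation-based distance would group them as coregulated.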

Hierarchical Clustering. A hierarchical clustering method is in principle similar to the distance-based phylogenetic tree-building method. It produces a treelike structure that represents a hierarchy or relative relatedness of data groups.
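A minimal agglomerative sketch, assuming Euclidean distance and average linkage (the same joining rule used in UPGMA tree building). The gene names and profiles are invented, and the merge history returned here stands in for the tree.

```python
import math
from itertools import combinations

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(math.dist(profiles[a], profiles[b]) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def hierarchical_cluster(profiles):
    """Repeatedly join the two closest clusters; return the merge history."""
    clusters = [(name,) for name in profiles]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_linkage(pair[0], pair[1], profiles))
        distance = average_linkage(c1, c2, profiles)
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append((c1, c2, distance))
    return merges

# Invented two-condition expression profiles.
profiles = {"g1": (1.0, 1.1), "g2": (1.2, 0.9), "g3": (5.0, 5.2), "g4": (5.1, 4.9)}
for left, right, distance in hierarchical_cluster(profiles):
    print(left, "+", right, round(distance, 3))
```

Reading the merge history from first to last gives the tree from the leaves upward: the earliest merges are the most similar genes.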

k-Means Clustering. In contrast to hierarchical clustering algorithms, k-means clustering does not produce a dendrogram, but instead classifies data through a single-step partition. Thus, it is a divisive approach. In this method, data are partitioned into k clusters, the number of which is prespecified at the outset. The value of k is normally set randomly but can be adjusted if the results are found to be unsatisfactory.

Figure: Example of k-means clustering using four partitions. Closeness of data points is indicated by resemblance of colors (see color plate section).
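A compact sketch of the usual assign-and-update iteration for k-means, assuming Euclidean distance. The `initial_centroids` and `seed` parameters are conveniences introduced here for reproducibility and are not part of the description above; the data points are invented.

```python
import math
import random

def kmeans(points, k, initial_centroids=None, iterations=100, seed=0):
    """Partition `points` into k clusters; return (assignment, centroids)."""
    rng = random.Random(seed)
    # Start from k data points chosen at random unless starting centroids are given.
    centroids = list(initial_centroids) if initial_centroids else rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:
            break  # stable partition reached
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
assignment, centroids = kmeans(points, k=2, initial_centroids=[points[0], points[3]])
print(assignment)
```

Because the result can depend on the starting centroids, it is common to rerun the procedure with several random starts and with different values of k.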

Self-Organizing Maps. Clustering by SOMs is in principle similar to the k-means method. This pattern recognition algorithm employs neural networks. It starts by defining a number of nodes. The data points are initially assigned to the nodes at random. The distances between the input data points and the centroids are calculated. The data points are successively adjusted among the nodes, and their distances to the centroids are recalculated. After many iterations, a stabilized clustering pattern is reached with the minimum distances of the data points to the centroids.

COMPARISON OF SAGE AND DNA MICROARRAYS

SAGE and DNA microarrays are both high throughput techniques that determine global mRNA expression levels. A number of comparative studies have indicated that the gene expression measurements from these methods are largely consistent with each other. However, the two techniques have important differences.

  • First, SAGE does not require prior knowledge of the transcript sequence, whereas DNA microarray experiments can only detect the genes spotted on the microarray. Because SAGE is able to measure all the mRNA expressed in a sample, it has the potential to allow discovery of new, as yet unknown gene transcripts.
  • Second, SAGE measures “absolute” mRNA expression levels without arbitrary reference standards, whereas DNA microarrays indicate the