








This document sets out the rules and exceptions for the representation of readthrough transcripts in the CCDS dataset. Readthrough transcripts are defined as those that contain exons from two or more distinct and adjacent genes on the same strand. It covers various scenarios, such as inferred CCDS representations, public notes, downstream AUG exceptions, and NMD exceptions. It also discusses the use of non-canonical start codons and the representation of proteins from NMD transcripts.









Conservation: We define conservation by observing sequence similarity for orthologous loci at the level of the genome sequence between two or more species, with an emphasis (for curating human and mouse) on conservation observed in the genome assemblies for human, chimp, macaque, mouse, rat, dog, and cow. Genome conservation may also be observed within a species for paralogous loci. Agreement with other independently curated datasets, such as Swiss-Prot protein records, may also be taken into consideration. Genome conservation may be observed using existing public tools, such as the UCSC Vertebrate conservation track, or similar in-house tools provided to support curator staff.

a. Strong conservation: genome sequence is conserved in at least 2 species that are evolutionarily distant (e.g., different taxonomic orders). Strong conservation support (or experimental data) is needed when considering a large N-terminal extension (>100 aa).

b. Weak conservation: genome sequence is conserved in closely related species (e.g., within primates, within rodents, or within mouse strains) but not in more distantly related species.

c. Note: Variation at the protein termini is valid and expected and can be lineage-specific. Small differences in N-terminal length between, for instance, human and mouse, are expected. Large differences may be valid but should be supported by available transcript, publication, and conservation data.

Kozak signal strength:
The following start codon selection guidelines are used for a transcript that contains multiple possible in-frame start codons:

Default Rule: Always annotate the CDS starting from the upstream AUG unless one of the following exceptions applies.

Note: The expectation is that frequently none of the exceptions will apply; therefore, we will often annotate an upstream AUG with a weak Kozak signal and weak conservation.

Attribute: Downstream AUG exceptions represented in the CCDS data set are tracked with the ‘CDS uses downstream AUG’ attribute, which can be found in the ‘Attributes’ section of relevant CCDS reports, or in the ‘CCDS_attributes.[YearMonthDay/current].txt’ files in the CCDS FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In some cases, an explanatory Public Note may also accompany the attribute in the CCDS report page.
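As an illustration of what "multiple possible in-frame start codons" means in practice, the following minimal sketch walks upstream from a known stop codon and collects every ATG in the same frame that is not separated from the stop by another in-frame stop. The function names (`in_frame_starts`, `default_start`) and the transcript-coordinate convention (0-based index of the first base of the stop codon) are assumptions made for this example; they are not part of any CCDS tool.

```python
from typing import List, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_starts(transcript: str, stop_index: int) -> List[int]:
    """Return 0-based positions of ATG codons in frame with the stop codon
    that begins at stop_index, with no in-frame stop in between."""
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA letters
    starts = []
    pos = stop_index - 3
    while pos >= 0:
        codon = seq[pos:pos + 3]
        if codon in STOP_CODONS:
            break  # an upstream in-frame stop ends the open reading frame
        if codon == "ATG":
            starts.append(pos)
        pos -= 3
    return sorted(starts)


def default_start(transcript: str, stop_index: int) -> Optional[int]:
    """Apply the default rule only: pick the most upstream in-frame ATG."""
    starts = in_frame_starts(transcript, stop_index)
    return starts[0] if starts else None


# Hypothetical toy transcript: two in-frame ATGs before the stop at index 14.
toy = "CC" + "ATG" + "GCC" + "ATG" + "AAA" + "TAA" + "GG"
candidates = in_frame_starts(toy, 14)
chosen = default_start(toy, 14)
```

Here `candidates` lists both possible starts and `chosen` is the upstream one. The exceptions to the default rule depend on Kozak strength, conservation, and curator judgment, none of which this sketch evaluates.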
initiation at the upstream non-AUG site. Therefore, we should have good support for a decision to annotate the CDS initiating at a non-AUG site. Note : A hairpin secondary structure downstream of a weak Kozak site may facilitate ribosome pausing and thus increase the likelihood of initiation from that site (PMID: 2236042).
For annotating the CDS starting from the first AUG:

a) There is a strong Kozak signal for the first AUG and it is an extension of the primary ORF (regardless of the Kozak signal for the downstream AUG, and regardless of genome conservation).

b) There is a weak Kozak signal but there is strong genome conservation for the first AUG and the extension doesn’t conflict with experimental information about translation initiation or localization. In this context, strong conservation at the level of genome sequence is observed between two or more species; the species do not need to be widely diverged (e.g., primate-specific N-terminal differences are valid).

c) There is a weak Kozak signal and there is weak or no conservation for the first AUG, but the extension improves the protein with regard to adding or completing a domain or signal/transit peptide.

d) There is no functional information, whether direct or indirect (domains), for the protein function.

For annotating the CDS starting from an internal AUG:

a) There is a weak Kozak signal and no conservation for the first AUG, and very strong conservation and a strong Kozak signal at a downstream AUG. There is significant genome conservation observed among species with evolutionary distance and there is consistency in the location of the downstream AUG site and N-terminus region of the protein.

b) There is a very strong historical use; the protein as defined from the internal AUG is considered the community reference standard. Note: if you think this is a case where newer data indicates historical use is faulty, then it may be useful to consult with an expert on the gene/protein to confirm the N-terminus representation. The community standard N-terminus should be supported by available public data. In other words, there should be a compelling reason to not annotate from the upstream AUG, especially when there is conservation support or a good Kozak signal. In one real case, the community expert pointed out that the upstream AUG site being considered was invalid because the transcript representation was in error; promoter studies determined that the predominant transcript start occurred after the first AUG site that was in question. The transcript representation had been extended further 5’ of the known promoter based on weak transcript support that the scientific expert did not consider valid.

c) Note: if the ‘internal’ AUG site in question is also available as the first AUG site on a different transcript due to use of an alternate promoter, or alternate splicing, then (given sufficient support) both transcripts and both N-terminal protein options can be annotated. Naturally, all transcripts must themselves meet quality and abundance criteria to be considered for representation as annotated alternate transcripts. Cases where a leader peptide can be predicted for both N-termini are less clear and may require further discussion, with consideration for the signal peptide length.
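The two internal-AUG exceptions above can be sketched as a small decision function. This is a deliberately simplified illustration, not the collaboration's procedure: the `AugEvidence` class, the `choose_start` function, and the three-level conservation scale are hypothetical, "very strong conservation" is collapsed into "strong", and the domain, signal peptide, and experimental-evidence considerations that curators weigh are left out.

```python
from dataclasses import dataclass


@dataclass
class AugEvidence:
    strong_kozak: bool   # True if the Kozak signal is judged strong
    conservation: str    # assumed scale: "strong", "weak", or "none"


def choose_start(upstream: AugEvidence,
                 downstream: AugEvidence,
                 downstream_is_community_standard: bool = False) -> str:
    """Return "upstream" or "downstream" for two in-frame AUGs."""
    # Internal-AUG exception (b): very strong historical use.
    if downstream_is_community_standard:
        return "downstream"
    # Internal-AUG exception (a): the upstream site has a weak Kozak signal
    # and no conservation, while the downstream site has both.
    upstream_unsupported = (not upstream.strong_kozak
                            and upstream.conservation == "none")
    downstream_supported = (downstream.strong_kozak
                            and downstream.conservation == "strong")
    if upstream_unsupported and downstream_supported:
        return "downstream"
    # Default rule: annotate from the upstream AUG.
    return "upstream"


# A weak Kozak signal with weak conservation still keeps the upstream AUG.
decision = choose_start(AugEvidence(False, "weak"), AugEvidence(True, "strong"))
```

The function returns "downstream" only when one of the two exceptions fires, which mirrors the expectation stated earlier that an upstream AUG with a weak Kozak signal and weak conservation will often still be annotated.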
Nonsense-mediated mRNA decay (NMD) is a eukaryotic surveillance pathway that destroys abnormal transcripts encoding a truncated protein due to the presence of a premature termination codon (PMIDs: 12502788, 15040442 and 23435113). The CCDS collaboration NMD guidelines are based on the exon junction complex model (PMIDs: 15040442 and 23435113), whereby a transcript is assumed to be an NMD candidate if the stop codon is located >50 nts upstream of the last exon-exon junction. The products of NMD transcripts are generally not represented in the CCDS dataset unless the exceptions outlined in Cases 1 and 2 below apply.

Attribute: NMD exceptions with CCDS representation are tracked with the ‘Nonsense-mediated decay (NMD) candidate’ attribute, which can be found in the ‘Attributes’ section of relevant CCDS reports, or in the ‘CCDS_attributes.[YearMonthDay/current].txt’ files in the CCDS FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In some cases, an explanatory Public Note may also accompany the attribute in the CCDS report page.

Case 1: The gene is protein-coding; there is abundant transcript data; all available transcripts are NMD candidates. E.g., this is a known protein-coding gene and it is considered an error of omission to not represent the protein.

[Figure, Case 1: the distance from the stop codon to the last splice site is >50 nt, so the transcript is an NMD candidate. The protein is supported and the protein-coding locus type is not in question; annotate the protein and track as a known annotation concern.]
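The exon junction complex rule described above reduces to a simple distance check in transcript coordinates. The sketch below assumes that the distance is measured from the last base of the stop codon to the last exon-exon junction; the guidelines do not state which base of the stop codon is used, so that choice, along with the function names and the exon-length input format, is an assumption of this example.

```python
from typing import List, Optional


def last_junction(exon_lengths: List[int]) -> Optional[int]:
    """Transcript position of the last exon-exon junction, expressed as the
    number of bases that precede the final exon. None for single-exon
    transcripts, which have no junction."""
    if len(exon_lengths) < 2:
        return None
    return sum(exon_lengths[:-1])


def is_nmd_candidate(exon_lengths: List[int], stop_end: int,
                     threshold: int = 50) -> bool:
    """stop_end is the number of transcript bases up to and including the
    last base of the stop codon (assumed convention)."""
    junction = last_junction(exon_lengths)
    if junction is None:
        return False
    # Bases lying between the stop codon and the last junction.
    return junction - stop_end > threshold


# Hypothetical three-exon transcript; the last junction follows base 300.
exons = [100, 200, 150]
premature = is_nmd_candidate(exons, stop_end=240)   # 60 nt upstream
borderline = is_nmd_candidate(exons, stop_end=250)  # exactly 50 nt upstream
```

A stop codon exactly 50 nt upstream is not flagged, because the rule requires a distance greater than 50 nt. A positive result only marks a candidate; whether the protein is still represented depends on the exceptions in the cases that follow.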
Some exons of either gene may be skipped and novel exons may be included. Some readthrough transcripts may span more than two genes. Readthrough transcripts may encode a protein derived from coding exons from one or both loci (e.g., may be the same as the downstream locus; may be a fusion protein based on coding sequence from both genes), or they may encode a novel protein product due to CDS frameshifts with respect to one of the genes, or they may be non-coding due to NMD.

The CCDS collaboration definition of readthrough is very specific in that the individual partner genes must be distinct, and the readthrough transcripts must share >=1 exon (or >=2 splice sites except in the case of a shared terminal exon) with each of the distinct shorter loci. Unlike the broader definition of “conjoined” genes described in Prakash et al. (PMID: 20967262), the CCDS readthrough definition does not include cases where the genes are otherwise considered to be co-transcribed (e.g., human HOXC4, HOXC5 and HOXC6) (PMID: 2898768), bicistronic (e.g., human CERS1 and GDF1) (PMID: 2034669), or overlapping each other but not sharing splice sites (e.g., the 3’ exon of the mouse Mon1b gene overlaps the 5’ exon of the Syce1l gene), or genes that have nested arrangements relative to each other (e.g., human and mouse protocadherin gene clusters).

Note: Any readthrough transcript that has matching protein annotation from NCBI and Ensembl/Havana is eligible for CCDS representation. However, recent reports in the literature (e.g., PMID: 26861889) suggest that readthrough transcripts are pervasive, especially when cells are subject to stress, and may not encode a protein. Furthermore, both NCBI and Ensembl/Havana use two-gene or three-gene models to represent readthrough transcripts (see below), which can result in user confusion about transcript:gene association, and the third readthrough locus may artificially increase the gene count.
Therefore, the CCDS collaboration has decided to represent proteins encoded by readthrough transcripts only if there is strong transcript support from multiple sources or published experimental support for the protein encoded by the readthrough transcript. In the future, assignment of CCDS IDs to proteins encoded by readthrough transcripts will be considered by CCDS curators on a case-by-case basis.

In most cases, the collaboration uses a separate locus to represent readthrough transcripts (three-gene model when there are two distinct individual genes, see diagram below). However, depending on the locus type of the individual genes, in some cases a two-gene model is used and the readthrough transcript is treated as a variant of one of the individual genes. The decision to represent a two-gene versus a three-gene model also includes a protein similarity consideration, i.e., a consideration of whether the protein produced from the readthrough transcript is more similar to the protein product of one individual gene versus the other, or if the readthrough product is very different.

The following criteria are used for locus type combinations that could produce a protein-coding readthrough transcript:

Treatment          Upstream locus type       Downstream locus type
Three-gene model   Protein-coding            Protein-coding, non-coding, or transcribed pseudogene
Three-gene model   Transcribed pseudogene    Protein-coding
Two-gene model     Transcribed non-coding    Protein-coding

Evidence required*:

*Note: The following requirements are based on current RefSeq guidelines, which are more stringent with respect to transcript completeness and independent support evidence than currently required for Havana readthrough annotation (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/assets/guidelines.pdf). CCDS
data or predicted domain structure in the protein. In practice, however, many valid transcript variants or very long proteins lack full-length transcript support, as available in public International Nucleotide Sequence Database Collaboration (INSDC) databases. In order to fulfill the CCDS project’s goal to represent as many consistently annotated protein-coding genes as possible, the collaboration allows inferred exon combination representations in the dataset. This typically occurs in two scenarios: