TIGR Gene Naming and Annotation Conventions

All genomes sequenced at TIGR and all genomes on the CMR sequenced at other sequencing centers are taken through an automated annotation process at TIGR.  Names and functional annotation of genes in TIGR genomes are then manually curated, but CMR genomes sequenced at other centers do not undergo manual curation.  Name assignments usually are not based on experimental evidence for the gene itself, since such evidence rarely exists.  Rather, they are inferred from sequence similarity to previously characterized genes. Therefore, all TIGR name assignments should be regarded as provisional.  We strive to annotate each gene with as much information as we can confidently impart, but are also wary of inferring too much from sequence similarity.  We prefer to err on the side of caution and we have devised a nomenclature scheme that reflects our degree of confidence in a particular assignment.

We encourage feedback from the community to help identify errors or to provide suggestions to improve the annotation of our genes.

Information used during manual curation
  • Pairwise search results

    Protein translations of all genes are searched vs. a non-redundant amino acid database to generate a file of pairwise alignments.   Matches to experimentally characterized proteins are highlighted for special consideration. Curators may also perform their own queries of databases available on the Web.

  • HMM matches

    Protein translations of all genes are searched against Hidden Markov Models (HMMs) built at TIGR (TIGRFAMs) and at Sanger (Pfams).  HMMs are statistical models built from multiple alignments of proteins which share sequence similarity.  TIGR classifies HMMs into more than a dozen 'isology' types, each of which represents a different degree of confidence about function.

  • Paralogous families

    Protein translations of all genes in a genome are searched vs. themselves to identify clusters of similar sequences.

  • Biologically significant motifs and sites

    Protein translations of all genes are searched vs. PROSITE for biologically significant patterns.  Potential transmembrane domains are predicted by TMHMM.  For enzymes, curators review active site information from matches in SwissProt, MEROPS, and other databases.

  • Gene context

    Presence within a cluster of genes with a common functional theme can be significant in some assignments - particularly for genes such as ABC transporters or enzymes involved in a biosynthetic or metabolic pathway.

  • Genome Properties

    Each of TIGR's Genome Properties comprises a suite of genes that function in a known metabolic pathway, cellular activity, or cellular structure.  Protein translations may be automatically assigned to one or more Genome Properties based on HMM matches and context-based rules.


Levels of Database Match

We annotate each gene by assigning as many descriptors as are relevant.  Descriptors include at minimum a common name, role category, and Gene Ontology (GO) 'function' and 'process' terms for each gene, but may also include a gene symbol, an Enzyme Commission number, and public comments.  In the course of reviewing data we have developed the following criteria regarding assignments.
  1. Confident assignment: indicated by a specific name and gene symbol
    • The protein translation has a good database match to a protein of known function. Both pairwise and multiple protein sequence alignments reveal a high degree of identity/similarity (typically >35% identity) along the entire length of the protein.  There may be a match to a highly specific (i.e., 'equivalog' isology type) HMM.  Active sites, substrate or cofactor binding sites, or motifs that are characteristic of a protein should be conserved. Note: we use the Escherichia coli gene symbol when available.
  2. Function uncertain: indicated by 'putative' or 'homolog' in the name (gene symbol optional)
    • If we think the gene is almost certainly performing the function the name implies, but are less than fully confident, we precede the name with 'putative'. In this case, the evidence for function is very much like that for 'confident assignment', except for one or two weak lines of evidence, e.g., the percent identity/similarity or HMM score is marginally lower than for confident assignment.

    • A different type of 'function uncertain' assignment is indicated by the use of 'homolog' in the common name. The assignment can arise from two situations. In the first, sequence homology is very strong, but unlike a 'putative' match, we do NOT believe the query protein has the same function as the match. This might be because some critical piece of evidence is absent (e.g., non-conservation of catalytic residues in an enzyme), or because the function is not predicted to exist in this particular organism (e.g., photosynthetic enzyme matches in a non-photosynthetic organism).  In the second situation, evidence is too weak to apply 'confident' or 'putative' names, and there are also no family names available.  However, based on sequence homology there may be important information which would be lost if we called the gene product just a 'conserved hypothetical' protein, e.g., the matching protein in a pathogen is known to be toxic to host organisms, but has no other functional characterization. In this case we could use the matching protein's name but add 'homolog' to it, and apply descriptors appropriate for a protein of unknown function and process.

    • Note that while using 'homolog' to denote non-conserved function has been a long-time practice of TIGR annotators, using 'homolog' to capture important information that would otherwise be lost is a practice adopted in 2005. The criteria for 'putative' annotation have also been tightened.  Therefore it is likely that some older annotations that were called 'putative' would be called 'homolog' under the new naming criteria.
  3. Specific assignment not possible, but protein family or domain assignment is possible: indicated by protein family name or domain name.
    • When the best (or only) annotation evidence indicates membership in a defined family, rather than orthology to a specific gene, we use family names defined in TIGR or Pfam HMMs, curated databases such as SwissProt, or in the literature.

    • When the extent of sequence homology is limited to a defined protein domain (usually modelled as an HMM), rather than a defined family, we may use the domain name. Since in the literature domains sometimes are used to define a family, the distinction between such names is not rigid.

    • Note that the cellular function or process associated with a protein family or domain may be experimentally defined to some degree; however, it can also be completely unknown, in which case the family or domain name connotes nothing more than sequence homology.  This will be reflected in the role categories and GO term descriptors assigned to the protein.
  4. No evidence of function, defined family, or defined domain: indicated by the name 'conserved hypothetical protein'
    • These protein translations only produce full-length matches to conceptual translations in other species, i.e., there are no experimentally characterized matches, HMM matches, or family names that can be reasonably derived from the evidence.  Exceptions are made when there is a match to a lipoprotein motif or detection of substantial hydrophobic regions; these are called 'putative lipoprotein' and 'putative membrane protein', respectively.
  5. No database matches: indicated by the name 'hypothetical protein'
    • These protein translations have no significant sequence similarity to any characterized or uncharacterized genes. In these cases, the open reading frame was identified by the gene-finding algorithm but there is no additional evidence to indicate whether it is or is not an actual gene.
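
The five levels above can be read as a rough decision cascade. The following Python sketch is only an illustration of that reading, not TIGR's actual annotation software: the `Evidence` fields, the function name `assign_common_name`, and the 30% "marginal" cutoff are assumptions introduced here, and only the 35% figure comes from the text. Curator judgment, the second 'homolog' situation, and the 'putative lipoprotein' / 'putative membrane protein' exceptions are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENT_IDENTITY = 35.0  # from the text: "typically >35% identity"
MARGINAL_IDENTITY = 30.0   # hypothetical cutoff for "marginally lower" evidence


@dataclass
class Evidence:
    """Simplified, hypothetical summary of the evidence for one gene."""
    match_name: Optional[str] = None           # best experimentally characterized match
    percent_identity: float = 0.0              # identity to that match
    full_length: bool = False                  # alignment spans the entire protein
    key_sites_conserved: bool = True           # active/binding sites and motifs conserved
    function_expected_in_organism: bool = True
    family_or_domain_name: Optional[str] = None
    any_database_match: bool = False           # includes uncharacterized matches


def assign_common_name(ev: Evidence) -> str:
    """Map evidence to a common name following the five levels of match."""
    strong = (ev.match_name is not None and ev.full_length
              and ev.percent_identity >= MARGINAL_IDENTITY)
    if strong:
        # Strong homology, but the function is not believed to be the same.
        if not (ev.key_sites_conserved and ev.function_expected_in_organism):
            return f"{ev.match_name} homolog"
        # Level 1: confident assignment.
        if ev.percent_identity > CONFIDENT_IDENTITY:
            return ev.match_name
        # Level 2: almost certainly the same function, one weak line of evidence.
        return f"putative {ev.match_name}"
    # Level 3: only family or domain membership can be asserted.
    if ev.family_or_domain_name:
        return ev.family_or_domain_name
    # Level 4: matches exist, but none supports a function, family, or domain.
    if ev.match_name or ev.any_database_match:
        return "conserved hypothetical protein"
    # Level 5: no database matches at all.
    return "hypothetical protein"
```

For example, a full-length match at 33% identity with conserved sites would come out as 'putative', while the same match with non-conserved catalytic residues would come out as 'homolog'. In real curation several lines of evidence are weighed together, so fixed cutoffs like these are at best a first approximation.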

Disrupted reading frames

Genes from the first four of the above categories can be qualified by appending the following terms to the common name, when there is evidence that the open reading frame (ORF) is disrupted:
  1. authentic frameshifts/authentic point mutations: indicated by 'authentic frameshift' or 'authentic point mutation'
    • When an ORF is disrupted by either a single confirmed frameshift or a single confirmed point mutation, we simply add this information after the common name.
  2. multiple/mixed frameshifts and point mutations: indicated by 'degenerate'
    • When an ORF is disrupted by multiple frameshifts or a mixture of frameshifts and point mutations, we assume that the ORF is not functionally expressed, and we denote this with the term 'degenerate' after the common name.
  3. interruptions: indicated by 'interruption'
    • Interruptions are cases in which conserved amino and carboxyl terminal portions of an ORF are separated by some other sequence, such as a transposon. These are labelled with 'interruption-N' and 'interruption-C' after the common name.
  4. truncations: indicated by 'truncation'
    • When a significant segment of the ORF is missing from the N- or C-terminal end - enough so that we believe that it is no longer functionally expressed - we add 'truncation' to the common name.
  5. programmed frameshifts: indicated by 'programmed frameshift'
    • When an ORF contains an in-frame termination codon and a naturally-occurring frameshift prior to the termination codon regulates translation of the ORF, we add 'programmed frameshift' to the common name.
  6. internal deletions: indicated by 'internal deletion'
    • An internal deletion is the absence of a region of DNA in the interior of an ORF relative to its orthologs. Internal deletions are shorter than interruptions, but long enough that we expect the deletion to impair function.  We denote them by adding 'internal deletion' to the common name.
  7. fusions: indicated by 'fusion'
    • Two proteins which have been fused into one reading frame by a deletion event in the genome are denoted by 'fusion' in the common name.
  8. selenocysteine-containing proteins: indicated by 'selenocysteine-containing'
    • In certain organisms the 'stop' codon TGA encodes the amino acid selenocysteine. The genome must contain a selenocysteine-tRNA and the enzyme selenide, water dikinase. Proteins which meet these criteria have 'selenocysteine-containing' added to their common names.
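
The frameshift and point-mutation rules, and the restriction of qualifiers to the first four naming categories, can be sketched in Python as follows. This is an illustrative sketch rather than TIGR's implementation: the function names `disruption_qualifier` and `qualify_name` are hypothetical, and the comma used to join the qualifier to the common name is an assumption, since the text only says the term is added after the name.

```python
from typing import Optional

# Qualifier terms listed in the text.
QUALIFIERS = {
    "authentic frameshift", "authentic point mutation", "degenerate",
    "interruption-N", "interruption-C", "truncation",
    "programmed frameshift", "internal deletion", "fusion",
    "selenocysteine-containing",
}


def disruption_qualifier(frameshifts: int, point_mutations: int) -> Optional[str]:
    """Choose a qualifier from counts of confirmed frameshifts and point mutations."""
    if frameshifts == 0 and point_mutations == 0:
        return None  # reading frame is intact
    if frameshifts == 1 and point_mutations == 0:
        return "authentic frameshift"
    if frameshifts == 0 and point_mutations == 1:
        return "authentic point mutation"
    # Multiple frameshifts, multiple point mutations, or a mixture of both.
    return "degenerate"


def qualify_name(common_name: str, qualifier: str) -> str:
    """Append a qualifier to a common name (the comma separator is an assumption)."""
    if qualifier not in QUALIFIERS:
        raise ValueError(f"unknown qualifier: {qualifier!r}")
    # Only the first four naming categories may be qualified, which
    # excludes genes named simply 'hypothetical protein'.
    if common_name == "hypothetical protein":
        raise ValueError("'hypothetical protein' cannot take a qualifier")
    return f"{common_name}, {qualifier}"
```

The text describes only a single lesion and multiple or mixed lesions explicitly; treating multiple point mutations on their own as 'degenerate' is an extrapolation made in this sketch.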

© J. Craig Venter Institute