All genomes sequenced at TIGR and all genomes on the CMR sequenced at
other sequencing centers are taken through an automated
annotation process at TIGR. Names and functional annotation of genes in
TIGR genomes are then manually
curated, but CMR genomes sequenced at other centers do not
undergo manual curation. Name assignments usually are not based
on experimental evidence for the gene itself, since such evidence
rarely exists. Rather, they are inferred from sequence similarity
to previously characterized genes. Therefore, all TIGR name assignments
should be regarded as provisional. We strive to annotate each
gene with as much information as we can confidently impart, but are
also wary of inferring too much from sequence similarity. We
prefer to err on the side of caution and we have devised a nomenclature
scheme that reflects our degree of confidence in a particular
assignment.
We encourage feedback from the community to help identify errors or to
provide suggestions to improve the annotation of our genes.
Information used during manual curation
- Pairwise search results
Protein translations of all genes are searched vs. a non-redundant
amino acid database to generate a file of pairwise
alignments. Matches to experimentally characterized
proteins are highlighted for special consideration. Curators may also
perform their own queries of databases available on the Web.
- HMM matches
Protein translations of all genes are searched against Hidden Markov
Models (HMMs) built at TIGR (TIGRFAMs) and at Sanger (Pfams). HMMs are
statistical models built from multiple alignments of proteins which
share sequence similarity. TIGR classifies HMMs into more than a
dozen 'isology' types, each of which represents a different degree of
confidence about function.
- Paralogous families
Protein translations of all genes in a genome are
searched vs. themselves to identify clusters of similar sequences.
- Biologically significant motifs and sites
Protein translations of all genes are searched vs. PROSITE for biologically
significant patterns. Potential transmembrane domains are
predicted by TmHMM.
For enzymes, curators review active site information from matches in SwissProt, MEROPS, and other databases.
- Gene context
Presence within a cluster of genes with a common functional theme can
be significant in some assignments - particularly for genes such as ABC
transporters or enzymes involved in a biosynthetic or metabolic pathway.
- Genome Properties
Each of TIGR's Genome Properties comprises a suite of genes that
function in a known metabolic pathway, cellular activity, or cellular
structure. Protein translations may be automatically assigned to one or
more Genome Properties based on HMM matches and context-based rules.
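The paralogous-family step described in the list above (searching all protein translations in a genome against themselves to find clusters of similar sequences) can be illustrated with a small sketch. The text does not say how clusters are formed, so this example assumes single-linkage grouping: any two genes joined by a significant pairwise hit land in the same cluster. The function name `cluster_paralogs` and the input format (pairs of gene identifiers that already passed a significance cutoff) are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def cluster_paralogs(hits: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group genes into clusters from significant self-vs-self hits.

    Assumption: single-linkage clustering via union-find. The source
    does not describe the actual clustering method.
    """
    parent: Dict[str, str] = {}

    def find(gene: str) -> str:
        # Register unseen genes as their own cluster root.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for query, subject in hits:
        root_q, root_s = find(query), find(subject)
        if root_q != root_s:
            parent[root_s] = root_q  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), []).append(gene)

    # A gene that only matches itself is not a paralogous family.
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


if __name__ == "__main__":
    example_hits = [("g1", "g2"), ("g2", "g3"), ("g4", "g4"), ("g5", "g6")]
    for family in cluster_paralogs(example_hits):
        print(family)
```

Single linkage is the simplest choice and tends to merge families that share only one domain; a stricter rule (for example, requiring coverage over most of both sequences before recording a hit) would be applied before the pairs reach this function.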
Levels of Database Match
We annotate each gene by assigning as many descriptors as are
relevant. Descriptors include at minimum a common name, role
category, and Gene Ontology (GO) 'function' and 'process' terms for
each gene, but may also include a gene symbol, an Enzyme Commission
number, and public comments. In the course of reviewing data we have
developed the following criteria regarding assignments.
- Confident assignment: indicated by a specific
name and gene symbol
- The protein translation has a good database match to a
protein of known function. Both pairwise and multiple protein sequence
alignments reveal a high degree of identity/similarity (typically
>35% identity) along the entire length of the protein. There may be a
match to a highly specific (i.e., 'equivalog' isology type) HMM. Active
sites, substrate or cofactor binding sites, or motifs that are
characteristic of a protein should be conserved. Note: we use the
Escherichia coli gene symbol when available.
- Function uncertain: indicated by
"putative" or 'homolog' in the name (gene symbol optional)
- If we think the gene is almost certainly performing the
function the name implies, but are less than fully confident, we
precede
the name with "putative". In this case, the evidence for function
is very much like that for 'confident assignment', except for one or
two weak lines of evidence, e.g.,
the percent identity/similarity or HMM score is marginally lower than
for confident assignment.
- A different type of 'function uncertain' assignment is
indicated by the use of 'homolog' in the common name. The assignment
can arise from two situations. In the first, sequence homology is very
strong, but unlike a ‘putative’ match, we do NOT
believe the query protein has the same function as the match. This
might be because some critical piece of evidence is absent (e.g.,
non-conservation of catalytic residues in an enzyme), or because the
function is not
predicted to exist in this particular organism (e.g.,
photosynthetic enzyme matches in a non-photosynthetic organism).
In the second situation, evidence is too weak to
apply 'confident' or putative names, and there are also no family names
available. However, based on sequence homology there may be
important information which would be lost if we called the gene product
just a
'conserved hypothetical' protein, e.g.,
the matching protein in a pathogen is
known to be toxic to host organisms, but has no other functional
characterization. In this case we could use the matching protein's name
but add 'homolog' to it, and apply descriptors appropriate for a
protein of unknown function and process.
- Note that while using 'homolog' to denote non-conserved
function has been a long-time practice of TIGR annotators, using
'homolog' to capture important information that would otherwise be lost
is a practice adopted in 2005. The criteria for 'putative' annotation
have also been tightened. Therefore it is likely that some older
annotations that were called 'putative' would be called 'homolog' under
the new naming criteria.
- Specific assignment not possible, but protein family
or domain assignment is possible:
indicated by protein family name or domain name.
- When the best
(or only) annotation evidence indicates membership in a defined family,
rather than orthology to a specific gene, we use family
names defined in TIGR or Pfam HMMs, curated databases such as
SwissProt, or in the literature.
- When the extent of sequence homology is limited to a
defined protein domain (usually modelled as an HMM), rather than a
defined family, we may
use the domain name. Since in the literature domains sometimes
are used to define a family, the distinction between such names is not
rigid.
- Note that the cellular function or process associated
with a
protein family or domain may be experimentally defined to some degree;
however, they can also be completely unknown, in which case the family
or domain name connotes nothing more than sequence homology. This
will be reflected in the role categories and GO term descriptors
assigned to the protein.
- No evidence of function, defined family, or defined domain:
indicated by the name 'conserved
hypothetical protein'
- These protein translations produce full-length matches only to
conceptual translations in other species, i.e.,
there are no experimentally characterized matches, HMM matches, or
family names that can be reasonably derived from the evidence.
Exceptions are made when there is a match to a lipoprotein motif or
detection of substantial hydrophobic regions; these are called
'putative lipoprotein' and 'putative membrane protein', respectively.
- No database matches: indicated by the name
'hypothetical
protein'
- These protein translations have no significant sequence
similarity to any characterized
or uncharacterized genes. In these cases, the open reading frame was
identified by the gene-finding algorithm but there is no additional
evidence to indicate
whether it is or is not an actual gene.
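The five levels above amount to a decision cascade, which the following sketch makes explicit. It is an illustration only: curation is manual, and the field names, the function `suggest_common_name`, and the 30% cutoff standing in for "marginally lower" are all hypothetical. Only the 35% figure comes from the text, where it is given as typical rather than fixed. The second 'homolog' situation (weak evidence that still carries important information) depends on curator judgment and is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

CONFIDENT_IDENTITY = 35.0  # "typically >35% identity" in the text
PUTATIVE_IDENTITY = 30.0   # hypothetical stand-in for "marginally lower"


@dataclass
class Evidence:
    """Hypothetical summary of the evidence a curator reviews for one gene."""
    characterized_match: bool = False       # hit to an experimentally characterized protein
    full_length: bool = False               # alignment spans the entire protein
    percent_identity: float = 0.0
    key_sites_conserved: bool = True        # active sites, binding sites, motifs
    function_expected_in_organism: bool = True
    family_or_domain_name: Optional[str] = None
    any_database_match: bool = False        # includes uncharacterized proteins
    lipoprotein_motif: bool = False
    hydrophobic_regions: bool = False


def suggest_common_name(ev: Evidence, matched_name: str = "") -> str:
    """Return a provisional common name following the levels in the text."""
    same_function_plausible = ev.key_sites_conserved and ev.function_expected_in_organism
    if ev.characterized_match and ev.full_length:
        if ev.percent_identity > CONFIDENT_IDENTITY:
            # Strong homology but function not believed to be conserved.
            if not same_function_plausible:
                return f"{matched_name} homolog"
            return matched_name
        if ev.percent_identity > PUTATIVE_IDENTITY and same_function_plausible:
            return f"putative {matched_name}"
    if ev.family_or_domain_name:
        return ev.family_or_domain_name
    if ev.any_database_match or ev.characterized_match:
        if ev.lipoprotein_motif:
            return "putative lipoprotein"
        if ev.hydrophobic_regions:
            return "putative membrane protein"
        return "conserved hypothetical protein"
    return "hypothetical protein"


if __name__ == "__main__":
    strong = Evidence(characterized_match=True, full_length=True, percent_identity=62.0)
    print(suggest_common_name(strong, "DNA gyrase, A subunit"))
    print(suggest_common_name(Evidence(any_database_match=True)))
    print(suggest_common_name(Evidence()))
```

Ordering the checks from most to least specific mirrors the stated preference to err on the side of caution: a gene falls through to a weaker name whenever any stronger level's conditions are not met.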
Disrupted reading frames
Genes from the first four of the above categories can be qualified by
appending the following terms to the common name when there is evidence
that the open reading frame (ORF) is disrupted:
- authentic frameshifts/authentic point mutations:
indicated by 'authentic frameshift' or 'authentic point mutation'
- When an ORF is disrupted by either a single confirmed
frameshift or point mutation, we simply add this information after
the common name.
- multiple/mixed frameshifts and point mutations:
indicated by 'degenerate'
- When an ORF is disrupted by multiple frameshifts or a
mixture of frameshifts and point mutations we assume that the ORF is
not functionally expressed, and we denote this with the term
"degenerate" after the common name.
- interruptions: indicated by 'interruption'
- Interruptions are cases in which conserved amino and
carboxyl terminal portions of an ORF are separated by some
other sequence, such as a transposon. These are labelled with
"interruption-N" and "interruption-C" after the common name.
- truncations: indicated by 'truncation'
- When a significant segment of the ORF is missing from the N- or
C-terminal end, enough that we believe it is no longer functionally
expressed, we add 'truncation' to the common name.
- programmed frameshifts: indicated by 'programmed
frameshift'
- When an ORF contains an in-frame
termination codon and a naturally-occurring frameshift prior to the
termination codon regulates translation of the ORF, we
add "programmed frameshift" to the common name.
- internal deletions: indicated by 'internal deletion'
- An internal deletion is the absence of a region
of DNA in the interior of an ORF relative to its orthologs. Internal
deletions are shorter than interruptions, but long enough such that we
expect the deletion to impair function. We denote them by adding
'internal deletion' to the common name.
- fusions: indicated by 'fusion'
- Two proteins which have been fused
into one reading frame by a deletion event in the genome are denoted by
'fusion' in the common name.
- selenocysteine-containing proteins: indicated by
'selenocysteine-containing'
- In certain organisms the 'stop' codon TGA encodes the
amino acid
selenocysteine. The genome must contain a
selenocysteine-tRNA and the enzyme selenide, water dikinase. Proteins
which meet these criteria have 'selenocysteine-containing' added to
their common names.
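As a rough illustration of the frameshift and point-mutation rules above, the sketch below picks the qualifier from the number of confirmed lesions and appends it to a common name. The function names and the ", " separator are assumptions; the text says only that the term is added after the common name. The other qualifiers (interruption, truncation, fusion, and so on) need evidence that simple counts cannot capture, so they would be passed to `qualify_name` directly.

```python
from typing import Optional


def disruption_term(frameshifts: int, point_mutations: int) -> Optional[str]:
    """Choose the qualifier for an ORF from counts of confirmed lesions."""
    total = frameshifts + point_mutations
    if total == 0:
        return None  # reading frame is intact
    if total == 1:
        return "authentic frameshift" if frameshifts else "authentic point mutation"
    # Multiple frameshifts, or a mixture of frameshifts and point mutations.
    return "degenerate"


def qualify_name(common_name: str, term: Optional[str], sep: str = ", ") -> str:
    """Append a disruption qualifier to a common name.

    Only the first four naming categories can be qualified, so a gene
    named 'hypothetical protein' (no database matches) is rejected.
    The separator is an assumption, not taken from the source.
    """
    if term is None:
        return common_name
    if common_name == "hypothetical protein":
        raise ValueError("genes with no database matches are not qualified")
    return f"{common_name}{sep}{term}"


if __name__ == "__main__":
    print(qualify_name("ribosomal protein L2", disruption_term(1, 0)))
    print(qualify_name("conserved hypothetical protein", disruption_term(2, 1)))
```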