Wooster Home Page
 
Timken Science Library

Reading Sequence Records

To search the NCBI Entrez sequence databases effectively and to interpret the data in the records, you need to understand the data fields within those records. NCBI provides a Sample GenBank Record on its site with the data fields annotated. You can obtain important information from these records if you know where to look. Review the Sample GenBank Record to be sure you know what is presented, because this information can help you complete your assignments in Cell Physiology.

Two Databases You Will Encounter

We will be most concerned with two databases in this class, both of which are accessible through the Entrez Nucleotide database: GenBank and RefSeq, described below.

GenBank is an archival database that serves as a repository for sequences submitted by researchers from all over the world. It is a redundant database, meaning that there may be many records for the same sequence, submitted by different researchers. Submitters maintain control over the content of their records. There is no controlled vocabulary.

RefSeq is a curated database in which each record provides a summary of what is currently known about a particular gene or protein. The Reference Sequence (RefSeq) collection aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA) and protein products, for major research organisms. RefSeq standards serve as the basis for medical, functional and diversity studies. They provide a stable reference for gene identification and characterization, mutation analysis, expression studies, polymorphism discovery and comparative analyses. RefSeqs are used as a standard for the functional annotation of some genome sequencing projects, including those of human and mouse. What makes RefSeq records significant is the value-added information supplied by experts (curators). Curation is an ongoing process, and the Comment block of a RefSeq record indicates the status or level of curation of the record.

A RefSeq record is a good source for learning more about the function of a gene because it pulls together related sequence records and literature citations in one place. See the NCBI RefSeq Definitions page to learn more about the RefSeq accession format and curation status as well as for hints on retrieving RefSeq records using the Entrez search system.

The two types of records are compared in detail on NCBI’s Format of Sequence Record page. This comparison is from an NCBI tutorial titled Introduction to Molecular Biology Information Resources and compares the format of primary (archival) and reference (curated) sequence records for a specific human cancer gene from GenBank and RefSeq. If you examine the sample GenBank record (U07343) and the sample RefSeq record (NM_000249), you will note differences between the two. Scrolling down on the page will reveal comments, analysis and highlights concerning the similarities and differences between these two records and other related protein sequence records.

Annotations

The files you encounter when searching sequence databases will be annotated, meaning they will have additional information that identifies important regions in your sequence or tells you something about the history, function, or source of the sequence. To get the most out of these files you should be familiar with what these annotations can tell you. For our purposes, the most important annotations are the Features and Definition fields.

Features

The information presented in the Features annotation can tell you about a gene’s structure. For example, in some cases you may want to know where the introns are in a sequence. The features section will have introns identified by nucleotide number on the left hand side. The labeled Sample GenBank Record also indicates a record in which the coding sequences are identified and their functions listed (if known). This can be valuable in helping you parse a sequence and extract the information you need. The range of features represented in a sequence record may include regions that:

  • perform a biological function;
  • affect or are the result of the expression of a biological function;
  • interact with other molecules;
  • affect replication of a sequence;
  • affect or are the result of recombination of different sequences;
  • are a recognizable repeated unit;
  • have secondary or tertiary structure;
  • exhibit variation, or have been revised or corrected.

The overall goal of the Features annotation is to provide an extensive vocabulary for describing features in a flexible framework for manipulating them. The Features annotation uses a set of shared rules that allow the bioinformatics databases to exchange data. Ultimately the data can be retrieved and manipulated by software because of adherence to these rules. See the Features link of the sample GenBank record, or the additional links contained within, to learn more about the many features that may be included in sequence records. The following are a few of the Features that might appear in a sequence record as defined in NCBI documentation. For more information about these and other features, see the GenBank Feature Key Reference Manual.

CDS: coding sequence; sequence of nucleotides that corresponds with the sequence of amino acids in a protein (location includes stop codon); feature includes amino acid conceptual translation.
Exon: region of a genome that codes for a portion of spliced mRNA, rRNA and tRNA; may include 5’UTR, all CDSs and 3’UTR.
Intron: a segment of DNA that is transcribed, but removed from within the transcript by splicing together the sequences (exons) on either side of it.
Gene: region of biological interest identified as a gene and for which a name has been assigned.
mRNA: messenger RNA; includes 5’ untranslated region (5’UTR), coding sequences (CDS, exon) and 3’ untranslated region (3’UTR).
Promoter: region on a DNA molecule involved in RNA polymerase binding to initiate transcription.
Protein_bind: non-covalent protein binding site on a nucleic acid.
Source: identifies the biological source of the specified span of the sequence.
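
Because the Features annotation follows shared rules, software can read it. The Python sketch below is one rough illustration: it pulls feature keys and simple start..end locations out of a few feature-table lines. The sample lines are invented for illustration and do not come from a real record, the name parse_simple_features is hypothetical, and compound locations such as join(...) are deliberately skipped rather than handled.

```python
import re

# Hypothetical feature-table excerpt in the flat-file style
# (feature key, then a location). Not copied from a real record.
SAMPLE_FEATURES = """\
     source          1..1200
     gene            101..1100
     exon            101..350
     intron          351..600
     exon            601..1100
     CDS             join(151..350,601..1000)
"""

# Matches a feature key followed by a simple "start..end" location only.
SIMPLE_FEATURE = re.compile(r"^\s+(\S+)\s+(\d+)\.\.(\d+)\s*$")

def parse_simple_features(text):
    """Return (key, start, end) tuples for lines with a simple location."""
    features = []
    for line in text.splitlines():
        match = SIMPLE_FEATURE.match(line)
        if match:
            key, start, end = match.groups()
            features.append((key, int(start), int(end)))
    return features

features = parse_simple_features(SAMPLE_FEATURES)

# Pick out the intron coordinates, as you might when looking for
# where the introns fall in a sequence.
introns = [(start, end) for key, start, end in features if key == "intron"]
print(introns)
```

For real records, an established parser is a safer choice than a hand-written pattern like this one, since actual feature locations are often compound.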

Definition (Search as TITL)

This section has a brief description of the sequence. It includes information such as source organism, gene name/protein name or some description of the sequence’s function (if the sequence is non-coding). This field is the equivalent of the title of the record and may be useful for some types of global search routines (such as all genes from an organism or all identified members of a gene family). If you want to limit your Entrez Nucleotide or Entrez Protein search to this field, enter your search term into the search window followed by the field tag [TITL], e.g. TCP1-beta[TITL]. If the sequence has a coding region (CDS), a description may be followed by a completeness qualifier, such as “complete cds.” This may help you to identify complete sequences for your work. See the Sample GenBank Record and click on the definition link to see a detailed description of the definition field and its contents.
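
The field-tag syntax above can also be assembled by a program. The following Python sketch appends [TITL] to a search term and builds, without sending, a request URL for NCBI's E-utilities ESearch service. The use of E-utilities is an assumption added here (the guide itself describes only the Entrez search window), and the helper names title_query and esearch_url are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service (an assumption of
# this sketch; the guide itself only describes the Entrez search window).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def title_query(term):
    """Append the [TITL] field tag so the search is limited to the Definition field."""
    return f"{term}[TITL]"

def esearch_url(term, db="nucleotide"):
    """Build (but do not send) an ESearch URL for a title-limited search."""
    params = {"db": db, "term": title_query(term)}
    return f"{ESEARCH_URL}?{urlencode(params)}"

query = title_query("TCP1-beta")
url = esearch_url("TCP1-beta")
print(query)
print(url)
```

The query string is the same text you would type into the Entrez search window; the URL form simply percent-encodes the square brackets.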

Other Important Annotations

As you gain experience in life science research, you will find that you need the information found in other annotations. Some of these that might help you in this class are listed below.

Locus field: The top line of the record provides important information at a glance, including the locus name, sequence length (in number of base pairs or amino acid residues), molecule type, GenBank division (groups of organisms or sequences from many organisms generated by specific technologies) and most recent modification date.
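
As a minimal sketch of how the Locus fields line up, the Python below splits a single LOCUS line into the pieces named above. The line is written in the style of NCBI's sample record and should be treated as illustrative; the function name parse_locus is hypothetical, and the sketch assumes a nucleotide record whose fields are separated by whitespace.

```python
# Illustrative LOCUS line in the flat-file style; treat the exact values
# and spacing as an assumption rather than a verified copy of a record.
LOCUS_LINE = "LOCUS       SCU49845     5028 bp    DNA             PLN       21-JUN-1999"

def parse_locus(line):
    """Split a nucleotide LOCUS line into the fields described above."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a recognizable LOCUS line")
    return {
        "locus_name": tokens[1],
        "length": int(tokens[2]),
        "unit": tokens[3],                 # "bp" for a nucleotide record
        "molecule_type": tokens[4],
        "division": tokens[-2],            # GenBank division, e.g. PLN
        "modification_date": tokens[-1],
    }

locus = parse_locus(LOCUS_LINE)
print(locus)
```

Reading the division and date from the end of the line keeps the sketch working if an extra token appears between the molecule type and the division.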

Source/Organism: The Source field provides the name of the organism from which the sequence was obtained. The Organism field includes the scientific name of the organism and the lineage.

Reference/Submitter block: This field lists the publications by the authors of the sequence that discuss the data reported in the record. Note that there are typically many more references in a RefSeq record than in an archival record. The submitter block is usually the last citation in the reference list and contains information about the researcher(s) and laboratory that submitted the sequence. The words “Direct Submission” appear instead of an article title.

Identification Numbers: There are several different kinds of identification numbers that appear in sequence records.

  • The accession number is the unique identifier for a sequence record. An accession number (U49845) applies to the complete database record and remains stable even if updates/revisions are made to the record.
  • The version number (U49845.1) and GI (1293613) are unique identifiers for the sequence data within a record. If the sequence data change in any way, however small, the version suffix is incremented by one (for example, U49845.1 would become U49845.2) and a new GI number is assigned.
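
The difference between an accession number and a version number can be sketched in a few lines of Python. The helper names below are hypothetical, and U49845.2 is an invented later version used only to show the comparison.

```python
def split_accession_version(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version); version is None if absent."""
    accession, dot, version = identifier.partition(".")
    return accession, (int(version) if dot else None)

def same_record(id_a, id_b):
    """True when both identifiers point at the same database record."""
    return split_accession_version(id_a)[0] == split_accession_version(id_b)[0]

def same_sequence_version(id_a, id_b):
    """True only when accession and version both match."""
    return split_accession_version(id_a) == split_accession_version(id_b)

# U49845.2 is a made-up later version, used only for comparison.
print(split_accession_version("U49845.1"))
print(same_record("U49845.1", "U49845.2"))
print(same_sequence_version("U49845.1", "U49845.2"))
```

The accession alone identifies the record, so the two identifiers match as records but not as sequence versions.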

Records from the RefSeq database of reference sequences have a different and easily recognizable accession number format: two letters, followed by an underscore and six or more digits. For example:

NT_123456   constructed genomic contigs
NM_123456   mRNAs
NP_123456   proteins
NC_123456   chromosomes

This distinctive accession number is important because it can help you identify a RefSeq record from a list of brief records retrieved in an Entrez search. Locating a RefSeq record may help you make sense of a large number of similar archival records retrieved from GenBank.
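
One way to apply this when scanning a list of search results is sketched below in Python. The pattern encodes only the format described above (two letters, an underscore, six or more digits, with an optional version suffix), and the lookup table holds only the four prefixes listed here. The function name describe_accession is hypothetical, and XM_123456 is included purely as an example of a prefix that is not in the table.

```python
import re

# Only the prefixes listed in this guide; RefSeq defines others as well.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}

# Two letters, an underscore, six or more digits, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6,})(?:\.\d+)?$")

def describe_accession(accession):
    """Label an accession as a known RefSeq type, RefSeq-style, or not RefSeq-style."""
    match = REFSEQ_PATTERN.match(accession)
    if match is None:
        return "not RefSeq-style"
    return REFSEQ_PREFIXES.get(match.group(1), "RefSeq-style, prefix not in table")

hits = ["U07343", "NM_000249", "NP_123456", "XM_123456"]
labels = {acc: describe_accession(acc) for acc in hits}
for acc, label in labels.items():
    print(acc, "->", label)
```

An accession without the underscore format, such as U07343, is flagged as not RefSeq-style, which is the cue that it is likely an archival record.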

The RefSeq Scope and Accessions page of the NCBI tutorial provides more information on how to recognize a RefSeq accession number.

References

Geer, R.C. & Messersmith, D.J. 2002. Introduction to Molecular Biology Information Resources. Retrieved January 9, 2006 from the National Center for Biotechnology Information site: http://www.ncbi.nlm.nih.gov/Class/MLACourse/.

Timken Science Library • 410 East University Street • Wooster, Ohio 44691 • 330-263-2079
Last updated: November 10, 2007