1. Introduction
The Universal Protein Resource (UniProt), a collaboration between the
European Bioinformatics Institute (EBI), the Swiss Institute of
Bioinformatics (SIB), and the Protein Information Resource (PIR),
comprises four databases, each optimized for different uses. The
UniProt Knowledgebase (UniProtKB) is the central access point for
extensively curated protein information, including function,
classification and cross-references. The UniProt Reference Clusters
(UniRef) combine closely related sequences into a single record to speed
up sequence similarity searches. The UniProt Archive (UniParc) is a
comprehensive repository of all protein sequences, consisting only of
unique identifiers and sequences. The UniProt Metagenomic and
Environmental Sequences (UniMES) database is a repository specifically
developed for metagenomic and environmental data.
2. UniProtKB

UniProtKB/Swiss-Prot: a curated protein sequence database that strives
to provide a high level of annotation (such as the description of the
function of a protein, its domain structure, post-translational
modifications, variants, etc.), a minimal level of redundancy and a
high level of integration with other databases.
Release 57.7 of 01-Sep-2009: 497293 entries.
UniProtKB/TrEMBL: a computer-annotated supplement of Swiss-Prot that
contains all the translations of EMBL nucleotide sequence entries not
yet integrated in Swiss-Prot. Release 40.7 of 01-Sep-2009: 9145906 entries.
3. Structure of a UniProtKB entry
The entries in the UniProt Knowledgebase
are structured so as to be usable by human readers as well
as by computer programs. The explanations, descriptions,
classifications and other comments are in ordinary English.
Wherever possible, symbols familiar to biochemists, protein
chemists and molecular biologists are used. Here is a sample
entry:
PTEN_HUMAN
Each line begins with a two-character line code, which indicates the
type of data contained in the line. The current line types and line
codes, and the order in which they appear in an entry, are shown in
the table below.
| Line code | Content | Occurrence in an entry |
| --- | --- | --- |
| ID | Identification | Once; starts the entry |
| AC | Accession number(s) | Once or more |
| DT | Date | Three times |
| DE | Description | Once or more |
| GN | Gene name(s) | Optional |
| OS | Organism species | Once or more |
| OG | Organelle | Optional |
| OC | Organism classification | Once or more |
| OX | Taxonomy cross-reference | Once |
| OH | Organism host | Optional |
| RN | Reference number | Once or more |
| RP | Reference position | Once or more |
| RC | Reference comment(s) | Optional |
| RX | Reference cross-reference(s) | Optional |
| RG | Reference group | Once or more (Optional if RA line) |
| RA | Reference authors | Once or more (Optional if RG line) |
| RT | Reference title | Optional |
| RL | Reference location | Once or more |
| CC | Comments or notes | Optional |
| DR | Database cross-references | Optional |
| PE | Protein existence | Once |
| KW | Keywords | Optional |
| FT | Feature table data | Once or more |
| SQ | Sequence header | Once |
| (blanks) | Sequence data | Once or more |
| // | Termination line | Once; ends the entry |
As shown in the above table,
some line types are found in all entries, others
are optional. Some line types occur many times
in a single entry. Each entry must begin with an
identification line (ID) and end with a
terminator line (//).
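As a rough illustration of the layout described above, the following Python sketch groups the lines of one entry by their two-character line code and checks that the entry begins with an ID line and ends with a // line. The function name `parse_entry` and the toy entry are made up for this example (the toy entry is not a real UniProtKB record), and the sketch assumes that sequence data lines start with blanks, as the table indicates.

```python
from collections import defaultdict

# A made-up entry for illustration only; it is NOT a real UniProtKB record.
TOY_ENTRY = "\n".join([
    "ID   TOY_EXAMPLE   Reviewed;   12 AA.",
    "AC   X00000;",
    "DE   Hypothetical toy protein.",
    "OS   Imaginary organism.",
    "SQ   SEQUENCE   12 AA;",
    "     MKTAYIAKQR QI",
    "//",
])


def parse_entry(text):
    """Return (fields, sequence) for a single flat-file entry.

    fields maps each two-character line code to the list of its line
    contents, in order of appearance; sequence is the concatenated
    sequence data with blanks removed.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("ID"):
        raise ValueError("entry must begin with an ID line")
    if lines[-1].strip() != "//":
        raise ValueError("entry must end with a // terminator line")

    fields = defaultdict(list)
    sequence_parts = []
    for line in lines[:-1]:  # skip the terminator line
        code = line[:2]
        if code == "  ":
            # Sequence data lines start with blanks instead of a line code.
            sequence_parts.append(line.replace(" ", ""))
        else:
            fields[code].append(line[2:].strip())
    return dict(fields), "".join(sequence_parts)


fields, sequence = parse_entry(TOY_ENTRY)
print(sorted(fields))  # line codes present in the toy entry
print(sequence)
```

Keeping a list per line code reflects the fact that some line types may occur many times in a single entry.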
See the full user manual (http://expasy.org/sprot/userman.html) and
the list of keywords used in Swiss-Prot.
4. Database query
Single entry retrieval:
http://www.uniprot.org/
Batch download:
http://www.uniprot.org/batch/
FTP:
Downloading the databases
BLAST search against the database
5. Other sections of UniProt
UniParc (An
Example entry)
UniParc is a comprehensive and
non-redundant database that contains most of the publicly available
protein sequences in the world. Proteins may exist in different source
databases and in multiple copies in the same database. UniParc avoids
such redundancy by storing each unique sequence only once and giving it
a stable and unique identifier (UPI), making it possible to identify the
same protein from different source databases. UniParc contains only
protein sequences. All other information about the protein must be
retrieved from the source databases using the database cross-references.
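The idea of storing each unique sequence once and pointing back to the source databases can be sketched in Python as follows. This is only a toy model, not UniParc's implementation: the class name `SequenceArchive`, its methods, the `TOY...` identifier format, and the database names and IDs in the usage lines are all hypothetical.

```python
from itertools import count


class SequenceArchive:
    """Toy archive: one record per unique sequence, plus cross-references."""

    def __init__(self):
        self._id_by_sequence = {}   # sequence -> archive identifier
        self._xrefs_by_id = {}      # archive identifier -> {(source_db, source_id)}
        self._counter = count(1)

    def add(self, sequence, source_db, source_id):
        """Register a sequence seen in a source database; return its archive ID."""
        seq = sequence.upper()
        archive_id = self._id_by_sequence.get(seq)
        if archive_id is None:
            # First time this exact sequence is seen: assign a new identifier.
            # The "TOY" prefix is hypothetical and is not the real UPI format.
            archive_id = f"TOY{next(self._counter):010d}"
            self._id_by_sequence[seq] = archive_id
            self._xrefs_by_id[archive_id] = set()
        self._xrefs_by_id[archive_id].add((source_db, source_id))
        return archive_id

    def cross_references(self, archive_id):
        """Return the sorted (source_db, source_id) pairs for an archive ID."""
        return sorted(self._xrefs_by_id[archive_id])


archive = SequenceArchive()
first = archive.add("MKTAYIAKQR", "db_one", "P1")    # hypothetical sources
second = archive.add("MKTAYIAKQR", "db_two", "Q7")   # same sequence, other source
print(first == second)
print(archive.cross_references(first))
```

Because the sequence itself is the lookup key, the same protein submitted by several source databases resolves to a single identifier, and everything else about it is reached through the stored cross-references.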
Currently UniParc contains protein sequences from the following
publicly available databases:
UniRef
In the UniRef90 and UniRef50 databases, no pair of sequences in the
representative set has more than 90% or 50% mutual sequence identity,
respectively. The UniRef100 database presents identical sequences and
sub-fragments as a single entry with protein IDs, sequences,
bibliography, and links to protein databases.
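To make the UniRef100 idea concrete, here is a minimal Python sketch that merges identical sequences and sub-fragments under one representative. It is a toy illustration under simple assumptions (a sub-fragment is treated as an exact substring of a longer sequence), not the actual UniRef procedure, which may apply further rules. The function name and the example IDs and sequences are hypothetical. UniRef90 and UniRef50 would additionally need a sequence identity computation, which is not attempted here.

```python
def collapse_identical_and_fragments(sequences):
    """Group identical sequences and sub-fragments into single clusters.

    sequences maps a protein ID to its sequence. The result maps each
    representative ID to the sorted list of member IDs in its cluster.
    """
    # Visit the longest sequences first, so that a fragment is always
    # examined after the sequence that contains it.
    ordered = sorted(sequences.items(), key=lambda item: (-len(item[1]), item[0]))
    clusters = {}
    for protein_id, seq in ordered:
        for rep_id in clusters:
            if seq in sequences[rep_id]:  # identical or exact sub-fragment
                clusters[rep_id].append(protein_id)
                break
        else:
            # No existing representative contains this sequence.
            clusters[protein_id] = [protein_id]
    return {rep: sorted(members) for rep, members in clusters.items()}


toy = {
    "A": "MKTAYIAKQR",
    "B": "MKTAYIAKQR",  # identical to A
    "C": "TAYIA",       # sub-fragment of A
    "D": "GGGG",        # unrelated
}
print(collapse_identical_and_fragments(toy))
```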
6. IPI
The International Protein Index (IPI) contains a number of non-redundant
proteome sets of higher eukaryotic organisms constructed from
UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, Ensembl and RefSeq.
Yet another database?
"Despite the complete determination of the genome
sequence of several higher eukaryotes, their proteomes remain relatively
poorly defined. Information about proteins identified by different
experimental and computational methods is stored in different databases,
meaning that no single resource offers full coverage of known and
predicted proteins. IPI (the International Protein Index) has been
developed to address these issues and offers complete nonredundant data
sets representing the human, mouse and rat proteomes, built from the
Swiss-Prot, TrEMBL, Ensembl and RefSeq databases."
Stats of the current release
| Date | Species | Version | Entries_count |
| --- | --- | --- | --- |
| 3-Sep-09 | human | 3.63 | 84118 |
|  | mouse | 3.63 | 56882 |
|  | rat | 3.63 | 39883 |
|  | zebrafish | 3.62 | 38433 |
|  | arabidopsis | 3.61 | 37183 |
|  | chicken | 3.57 | 25723 |
|  | cow | 3.49 | 31533 |
Check the
IPI FAQ to find out the
answers to these questions:
What is the difference between IPI and UniProtKB?
What is the difference between IPI and UniParc?
What is the difference between IPI and UniRef 100?
IPI databases can be downloaded here:
ftp://ftp.ebi.ac.uk/pub/databases/IPI/current/
7. References
Kersey P. J., Duarte J., Williams A., Karavidopoulou
Y., Birney E., Apweiler R. The International Protein Index: An
integrated database for proteomics experiments. Proteomics 4(7):
1985-1988 (2004).
http://expasy.org/sprot/userman.html