UniProt, IPI, PubMed, and Python programming

I519 Lab 2 (Sep.11 2009)


 UniProt and IPI





1. Introduction

The Universal Protein Resource (UniProt), a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB), and the Protein Information Resource (PIR), comprises four databases, each optimized for a different use. The UniProt Knowledgebase (UniProtKB) is the central access point for extensively curated protein information, including function, classification and cross-references. The UniProt Reference Clusters (UniRef) combine closely related sequences into a single record to speed up sequence similarity searches. The UniProt Archive (UniParc) is a comprehensive repository of all protein sequences, consisting only of unique identifiers and sequences. The UniProt Metagenomic and Environmental Sequences (UniMES) database is a repository specifically developed for metagenomic and environmental data.

2. UniProtKB

UniProtKB/Swiss-Prot: a curated protein sequence database that strives to provide a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. Release 57.7 of 01-Sep-2009: 497293 entries.

UniProtKB/TrEMBL: a computer-annotated supplement of Swiss-Prot that contains all the translations of EMBL nucleotide sequence entries not yet integrated in Swiss-Prot. Release 40.7 of 01-Sep-2009: 9145906 entries.

3. Structure of a UniProtKB entry

The entries in the UniProt Knowledgebase are structured so as to be usable by human readers as well as by computer programs. The explanations, descriptions, classifications and other comments are in ordinary English. Wherever possible, symbols familiar to biochemists, protein chemists and molecular biologists are used. Here is a sample entry: PTEN_HUMAN

Each line begins with a two-character line code, which indicates the type of data contained in the line. The current line types and line codes, and the order in which they appear in an entry, are shown in the table below.

Line code  Content                       Occurrence in an entry
ID         Identification                Once; starts the entry
AC         Accession number(s)           Once or more
DT         Date                          Three times
DE         Description                   Once or more
GN         Gene name(s)                  Optional
OS         Organism species              Once or more
OG         Organelle                     Optional
OC         Organism classification       Once or more
OX         Taxonomy cross-reference      Once
OH         Organism host                 Optional
RN         Reference number              Once or more
RP         Reference position            Once or more
RC         Reference comment(s)          Optional
RX         Reference cross-reference(s)  Optional
RG         Reference group               Once or more (Optional if RA line)
RA         Reference authors             Once or more (Optional if RG line)
RT         Reference title               Optional
RL         Reference location            Once or more
CC         Comments or notes             Optional
DR         Database cross-references     Optional
PE         Protein existence             Once
KW         Keywords                      Optional
FT         Feature table data            Once or more
SQ         Sequence header               Once
(blanks)   Sequence data                 Once or more
//         Termination line              Once; ends the entry

As shown in the table above, some line types are found in all entries, while others are optional. Some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
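
Because every line starts with a two-character code, an entry is easy to take apart in a script. The following is a minimal sketch in Python 3, not an official parser: it groups the lines of one entry by line code and joins the sequence lines. The sample entry is made up and heavily abbreviated, and the function name parse_entry is our own choice.

```python
from collections import defaultdict

# A made-up, heavily abbreviated entry for illustration only (not real data).
TOY_ENTRY = """\
ID   TOY_HUMAN               Reviewed;          12 AA.
AC   X00000;
DE   RecName: Full=Toy protein;
OS   Homo sapiens (Human).
SQ   SEQUENCE   12 AA;
     MTAIIKEIVS RN
//
"""


def parse_entry(text):
    """Group the lines of one flat-file entry by their two-character line code.

    Sequence lines start with blanks; they are joined into one string
    stored under the key 'sequence'.
    """
    fields = defaultdict(list)
    residues = []
    for line in text.splitlines():
        if line.startswith("//"):      # termination line ends the entry
            break
        code, data = line[:2], line[5:]
        if code.strip() == "":         # (blanks) = sequence data
            residues.append(data.replace(" ", ""))
        else:
            fields[code].append(data.strip())
    fields["sequence"] = ["".join(residues)]
    return dict(fields)


entry = parse_entry(TOY_ENTRY)
print(entry["ID"][0].split()[0])   # entry name
print(entry["AC"])                 # accession line(s)
print(entry["sequence"][0])        # concatenated sequence
```

Keeping a list per line code matters because codes such as AC, DE or FT may occur more than once in a real entry.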

Here is the full user manual, and here is a list of keywords used in Swiss-Prot.

4. Database query

Single entry retrieval: http://www.uniprot.org/

Batch download: http://www.uniprot.org/batch/

FTP: Downloading (or here) the databases

Blast search against the database
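
Single entries can also be retrieved from a script. The sketch below (Python 3) assumes the address pattern https://rest.uniprot.org/uniprotkb/<accession>.txt, which is not given in this handout and may change, so check the UniProt help pages before relying on it. The helper names entry_url and fetch_entry are our own.

```python
from urllib.request import urlopen

# Assumed URL pattern (not from this handout); verify against the UniProt help.
BASE_URL = "https://rest.uniprot.org/uniprotkb"


def entry_url(accession, fmt="txt"):
    """Build the address of one UniProtKB entry in the given format."""
    return "{}/{}.{}".format(BASE_URL, accession, fmt)


def fetch_entry(accession, fmt="txt"):
    """Download one entry and return it as text (needs network access)."""
    with urlopen(entry_url(accession, fmt)) as response:
        return response.read().decode("utf-8")


print(entry_url("P04637"))
# To actually download the entry, uncomment the next line:
# print(fetch_entry("P04637")[:200])
```

Separating the URL construction from the download makes it possible to check the address without sending a request.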

5. Other sections of UniProt

UniParc (an example entry)

UniParc is a comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world. Proteins may exist in different source databases and in multiple copies in the same database. UniParc avoids such redundancy by storing each unique sequence only once and giving it a stable and unique identifier (UPI), which makes it possible to identify the same protein from different source databases. UniParc contains only protein sequences. All other information about the protein must be retrieved from the source databases using the database cross-references.
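
The idea of storing each unique sequence once, with cross-references back to the source databases, can be sketched with a dictionary keyed by sequence. This is a toy illustration in Python 3, not how UniParc is implemented: the class name, the identifier format, and the example records are all made up.

```python
class SequenceArchive:
    """Toy non-redundant archive: one record per unique sequence."""

    def __init__(self):
        self._id_by_sequence = {}   # sequence -> archive identifier
        self._xrefs = {}            # archive identifier -> set of (database, accession)

    def add(self, sequence, database, accession):
        """Register a sequence seen in a source database; return its identifier."""
        sequence = sequence.upper()
        if sequence not in self._id_by_sequence:
            # Hypothetical identifier format, assigned once per unique sequence.
            new_id = "TOY{:07d}".format(len(self._id_by_sequence) + 1)
            self._id_by_sequence[sequence] = new_id
            self._xrefs[new_id] = set()
        archive_id = self._id_by_sequence[sequence]
        self._xrefs[archive_id].add((database, accession))
        return archive_id

    def cross_references(self, archive_id):
        """Return the source-database records that share this sequence."""
        return sorted(self._xrefs[archive_id])


archive = SequenceArchive()
# Made-up records: the same sequence reported by two source databases.
id_a = archive.add("MKTAYIAK", "DatabaseA", "A0001")
id_b = archive.add("MKTAYIAK", "DatabaseB", "B0042")
id_c = archive.add("MKTAYIAR", "DatabaseA", "A0002")
print(id_a == id_b, id_a == id_c)
print(archive.cross_references(id_a))
```

Identical sequences receive the same identifier regardless of where they came from, while a sequence that differs by even one residue gets its own record.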

Currently UniParc contains protein sequences from the following publicly available databases:


UniRef

In the UniRef90 and UniRef50 databases, no pair of sequences in the representative set has more than 90% or 50% mutual sequence identity, respectively. The UniRef100 database presents identical sequences and sub-fragments as a single entry with protein IDs, sequences, bibliography, and links to protein databases.
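
One way to obtain a representative set in which no two members exceed an identity threshold is greedy clustering. The sketch below (Python 3) is a toy under strong assumptions, not the UniRef procedure: identity is a naive position-by-position comparison without any alignment, the sequences are invented, and the function names are our own.

```python
def identity(seq_a, seq_b):
    """Naive identity: matching positions divided by the longer length (no alignment)."""
    matches = sum(1 for a, b in zip(seq_a, seq_b) if a == b)
    return matches / max(len(seq_a), len(seq_b))


def greedy_cluster(sequences, threshold):
    """Assign each sequence to the first representative it matches above the threshold.

    Longer sequences are considered first, so each representative is the
    longest member of its cluster. Returns {representative: [members]}.
    """
    clusters = {}
    for seq in sorted(sequences, key=len, reverse=True):
        for representative in clusters:
            if identity(seq, representative) > threshold:
                clusters[representative].append(seq)
                break
        else:
            clusters[seq] = [seq]   # no close representative: start a new cluster
    return clusters


# Invented toy sequences, all 20 residues long.
seqs = [
    "MKTAYIAKQRQISFVKSHFS",
    "MKTAYIAKQRQISFVKSHFA",   # 1 mismatch with the first  -> 95% identity
    "MKTAYLLKQKQISFVKSHFS",   # 3 mismatches with the first -> 85% identity
    "GGSSPPLLNNGGSSPPLLNN",   # unrelated
]
clusters_90 = greedy_cluster(seqs, 0.90)
clusters_50 = greedy_cluster(seqs, 0.50)
print(len(clusters_90), len(clusters_50))
```

A new representative is only created when a sequence is at or below the threshold against every existing representative, so the representatives never exceed the threshold among themselves. Lowering the threshold merges more sequences into fewer clusters.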

6. IPI

The International Protein Index (IPI) contains a number of non-redundant proteome sets of higher eukaryotic organisms constructed from UniProtKB/Swiss-Prot, UniProtKB/TrEMBL, Ensembl and RefSeq.

Yet another database?

"Despite the complete determination of the genome sequence of several higher eukaryotes, their proteomes remain relatively poorly defined. Information about proteins identified by different experimental and computational methods is stored in different databases, meaning that no single resource offers full coverage of known and predicted proteins. IPI (the International Protein Index) has been developed to address these issues and offers complete nonredundant data sets representing the human, mouse and rat proteomes, built from the Swiss-Prot, TrEMBL, Ensembl and RefSeq databases."

Stats of the current release

Species      Version  Entries
human        3.63     84118
mouse        3.63     56882
rat          3.63     39883
zebrafish    3.62     38433
arabidopsis  3.61     37183
chicken      3.57     25723
cow          3.49     31533

Check the IPI FAQ to find out the answers to these questions:

What is the difference between IPI and UniProtKB?
What is the difference between IPI and UniParc?
What is the difference between IPI and UniRef 100?

IPI databases can be downloaded here:

7. References

Kersey P. J., Duarte J., Williams A., Karavidopoulou Y., Birney E., Apweiler R. The International Protein Index: An integrated database for proteomics experiments. Proteomics 4(7): 1985-1988 (2004).


Practice - UniProt

  1. Retrieve information from UniProt for protein P04637.

  2. What is the name of the protein? What is the function of the protein?

  3. Get the entry in plain text format and take a look at it.

  4. Search for the same protein in UniParc.

PubMed and Google Scholar



1. Introduction

PubMed, available via the NCBI Entrez retrieval system, was developed by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), located at the U.S. National Institutes of Health (NIH). PubMed provides access to citations from biomedical literature. LinkOut provides access to full-text articles at journal Web sites and other related Web resources. PubMed also provides access and links to the other Entrez molecular biology resources. As of Sep. 7, 2009, there are 19129804 articles from 23336 journals.

2. Database query

A PubMed query can be run against various fields. You can use the field tags explicitly in your query, although PubMed does provide easier ways to do it.

  1. The "normal" way: Entrez PubMed, where you can add Limits or use the Advanced search.

What role does pain have in sleep disorders?
Here is a sample search result.

What articles have Watson JD published?
Enter the author’s last name plus initials without punctuation in the search box and click Go.

  2. Find a specific citation: Single Citation Matcher
  3. Match a list of citations: Batch Citation Matcher

Full author names may be searched for citations published from 2002 forward if the full author name is available in the article.

Here is the full PubMed help.
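
A field-tagged query can also be assembled in a script. The sketch below (Python 3) joins tagged terms with AND and builds an address for NCBI's ESearch service without sending any request. The tags [au] (author) and [dp] (publication date) are standard PubMed tags; the ESearch address and the helper names are assumptions that are not part of this handout, so check the PubMed help and the E-utilities documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed service address (NCBI E-utilities ESearch); not part of this handout.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def tagged_query(**fields):
    """Join field-tagged terms with AND, e.g. au='Watson JD' -> 'Watson JD[au]'."""
    return " AND ".join("{}[{}]".format(value, tag) for tag, value in fields.items())


def esearch_url(term):
    """Build an ESearch address for a PubMed query (no request is sent here)."""
    return ESEARCH_URL + "?" + urlencode({"db": "pubmed", "term": term})


# Example query: author Watson JD, restricted to an arbitrary publication year.
query = tagged_query(au="Watson JD", dp="1953")
print(query)
print(esearch_url(query))
```

The same query string can be pasted directly into the PubMed search box.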

3. Google Scholar

In case you don't know Google Scholar, just Google "Google Scholar". In case you don't know Google, just Google "Google".

Problem with Google Scholar (from Wikipedia):

A significant problem with Google Scholar is the secrecy about its coverage. Some publishers do not allow it to crawl their journals. Elsevier journals were not included before mid-2007, when Elsevier began to make most of its ScienceDirect content available to Google Scholar and Google's web search. As of February 2008 the absentees still included the most recent years of the American Chemical Society journals. Google Scholar does not publish a list of scientific journals crawled, and the frequency of its updates is unknown. It is therefore impossible to know how current or exhaustive searches are in Google Scholar. Nonetheless, it allows easy access to published articles without the difficulties encountered in some of the most expensive commercial databases.


Practice - PubMed

  1. How many articles has Watson, James D published?

  2. How many articles does PubMed have? And what is the year of the oldest article in PubMed?

  3. Find this article from PubMed:
    Nature Biotechnology 22, 1177 - 1178 (2004). Then find the same article with Google Scholar.

  4. How many articles have been published in the journal Nature Biotechnology?

Writing functions and classes in Python



1. Learning Resources

See the sections on functions and on classes in the tutorial on the official Python website.

A function:

>>> def fib(n):    # write Fibonacci series up to n
...     """Print a Fibonacci series up to n."""
...     a, b = 0, 1
...     while b < n:
...         print(b, end=' ')
...         a, b = b, a+b
...
>>> # Now call the function we just defined:
... fib(2000)
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597

A class:

class MyClass:
    """A simple example class"""
    i = 12345
    def f(self):
        return 'hello world'
x = MyClass()
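
The two examples above can be tried out as follows. This is a small sketch written for Python 3: fib_list is our own variant of fib that returns the numbers in a list instead of printing them, and MyClass is repeated so that the block runs on its own.

```python
def fib_list(n):
    """Return a list of the Fibonacci numbers below n instead of printing them."""
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a + b
    return result


class MyClass:
    """A simple example class"""
    i = 12345            # class attribute, shared by all instances

    def f(self):
        return 'hello world'


x = MyClass()            # create an instance
print(fib_list(100))     # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
print(x.i)               # 12345
print(x.f())             # hello world
```

Returning a value rather than printing it lets the caller store the result or pass it on to another function.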

2. Examples

The microwave class: How to Create a Class in Python from ehow

Practice - Python

  1. Write a function that adds 1 to a number.

  2. Write a class that has one method, which adds 1 to a number.

       Contact: Yong Fuga Li, x@y.z, with x = yonli, y = indiana, z = edu