fastacmd README 
===============
Last updated: 12/02/2004


Table of Contents

    Introduction

    Command line options

    Usage

    Return values

    Notes/Troubleshooting



Introduction
------------

fastacmd retrives FASTA formatted sequences from a blast database, as long as
it it was successfully formatted using the '-o' option.  

Command line options
--------------------
The fastacmd options are:

fastacmd 2.2.10   arguments:

  -d  Database [String]  Optional
    default = nr
  -p  Type of file
         G - guess mode (look for protein, then nucleotide)
         T - protein   
         F - nucleotide [String]  Optional
    default = G
  -s  Search string: GIs, accessions and loci may be used delimited
      by comma. [String]  Optional
  -i  Input file wilth GIs/accessions/loci for batch
      retrieval [String]  Optional
  -a  Retrieve duplicate accessions [T/F]  Optional
    default = F
  -l  Line length for sequence [Integer]  Optional
    default = 80
  -t  Definition line should contain target gi only [T/F]  Optional
    default = F
  -o  Output file [File Out]  Optional
    default = stdout
  -c  Use Ctrl-A's as non-redundant defline separator [T/F]  Optional
    default = F
  -D  Dump the entire database as (default is not to dump anything):
      1 FASTA
      2 Gi list
 [Integer]  Optional
    default = 0
  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String]  Optional
    default = 0,0
  -S  Strand on subsequence (nucleotide only): 1 is top, 2 is bottom [Integer]
    default = 1
  -T  Print taxonomic information for requested sequence(s) [T/F]
    default = F
  -I  Print database information only (overrides all other options) [T/F]
    default = F
  -P  Retrieve sequences with this PIG [Integer]  Optional

  Please note that options -t and -c are only relevant to non-redundant 
  databases only (ie: protein nr and pataa, as provided in the NCBI ftp site)

Usage
-----

1.) Retrieving a sequence by identifier:

fastacmd -d nt -s 555
>gi|555|emb|X65215.1|BTMISATN B.taurus microsatellite DNA (624bp)
ACCTCCACTAGCTTTGTTTGTAGTGATGCTCTGTAGCACCACTGGGAAGCCCTTTAATGAATGTGCCTTTCCGCAAATCA
CACACACACAAATACACTTATAGAAACAAGGTGATTTTCTTGAAATAATAAAACAAAATTTGGAAGAAGATTTTTACTGT
CTTAGGAAAAGTAAGGCATTGGAAGGTGGCTAGGTATGACATATGAAGTTGCATTTTAAAACTGGAATTGGACAACTGAT
ATTCAGTGATATTTATGCTACTACCTTCTAGAATCGAGAGCATGCACCCCACTCTGTACTCTTGCCTGGAGAATCCATGA
TGAGAGCCTGGTAGGCTGCAGTCCATGGGGTCACACAGAGTCGGACATGACTGAGCGACTTCACTTTCACTTTTCAATTT
CATGCATTGGAGCCGGAAATGGCAACCCACTCCAGTGTTCTTGCCTGGAGAATCCCAGGGATGGGGAAGCCTGGTGGGCT
GCTGTCTATGGGGTCGCAGAGAGTCAGACACGACTGAAGTGACTTAGCAGCAACCTTCTGGAATAAACGCCTCAGGCTTT
AAACTCTGGCTTGACCATTCACTAGCCATGGGATCCACTAGAGTCGACCTGCAGGCATGCAAGC

If the identifier is not a gi or an accession, you must pass the entire seqid
string to fastacmd. That is, if your sequence is

>gnl|mydb|myid my sequence description
ACGT...

, you must search for it with

fastacmd -d mydb -s 'gnl|mydb|myid'

2.) Printing a summary of database statistics:

fastacmd -d nt -I
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0,
1 or 2 HTGS sequences) 
           1,711,089 sequences; 7,976,531,563 total letters

File name:
/usr/ncbi/db/blast/nt
   Date: Mar 26, 2003 10:25 PM    Version: 4    Longest sequence: 1,421,559 bp


3.) Obtaining a FASTA file from a blast database:

fastacmd -D 1 -d nt -o nt.fsa
[output removed for brevity]

4.) Retrieving only part of a sequence:

fastacmd -d nt -s 555 -L0,32
gi|555:1-32 B.taurus microsatellite DNA (624bp)
ACCTCCACTAGCTTTGTTTGTAGTGATGCTCT

5.) Retrieving taxonomic information for a given sequence:

fastacmd -d nt -s 555 -T
NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
NCBI taxonomy id: 9913
Common name: cow
Scientific name: Bos taurus

6.) Obtaining a list of gis from a blast database:

fastacmd -D 2 -d nt -o nt.gis
[output removed for brevity]


Return values
-------------
The following exit values are returned:

     0     Completed successfully

     1     An error occurred

     2     Blast database was not found

     3     Failed search (accession, gi, taxonomy info)

     4     No taxonomy database was found


Notes/Troubleshooting
---------------------

A) Taxonomy information

In order to access to the taxonomy information using fastacmd, the blast
databases should have been obtained from the NCBI ftp site
(ftp://ftp.ncbi.nih.gov/blast/db) and an additional set of
files are needed. These files are archived as taxdb.tar.gz under the same
directory as the blast databases on the NCBI ftp site.
Please install these files in the same directory as the blast databases (and
do not forget to update your ncbi configuration file to point to this
directory). 
Here are some of the error messages one might encounter when accessing the
taxonomy information from the blast databases:

fastacmd -d testdb -s 555 -T
[fastacmd] ERROR: Taxonomy information not encoded in your blast database.

This blast database does not contain the taxonomy id encoded for this
gi/accession. Only preformatted blast databases provided by the NCBI contain 
taxonomy identifiers encoded (formatdb cannot add this).


fastacmd -d patnt -s 412262 -T
[fastacmd] ERROR: Taxonomy information is not available. Please download it from
ftp://ftp.ncbi.nih.gov/blast/db/taxdb.tar.gz

Download the required files and install them as described above.