fastacmd README
===============
Last updated: 12/02/2004

Table of Contents

    Introduction
    Command line options
    Usage
    Return values
    Notes/Troubleshooting

Introduction
------------

fastacmd retrieves FASTA formatted sequences from a blast database, as long as
the database was successfully formatted with formatdb using the '-o' option.

Command line options
--------------------

The fastacmd options are:

fastacmd 2.2.10   arguments:

  -d  Database [String]  Optional
    default = nr
  -p  Type of file
         G - guess mode (look for protein, then nucleotide)
         T - protein
         F - nucleotide [String]  Optional
    default = G
  -s  Search string: GIs, accessions and loci may be used delimited
      by comma. [String]  Optional
  -i  Input file with GIs/accessions/loci for batch retrieval [String]  Optional
  -a  Retrieve duplicate accessions [T/F]  Optional
    default = F
  -l  Line length for sequence [Integer]  Optional
    default = 80
  -t  Definition line should contain target gi only [T/F]  Optional
    default = F
  -o  Output file [File Out]  Optional
    default = stdout
  -c  Use Ctrl-A's as non-redundant defline separator [T/F]  Optional
    default = F
  -D  Dump the entire database as (default is not to dump anything):
         1 FASTA
         2 Gi list [Integer]  Optional
    default = 0
  -L  Range of sequence to extract (Format: start,stop)
      0 in 'start' refers to the beginning of the sequence
      0 in 'stop' refers to the end of the sequence [String]  Optional
    default = 0,0
  -S  Strand on subsequence (nucleotide only): 1 is top, 2 is bottom [Integer]
    default = 1
  -T  Print taxonomic information for requested sequence(s) [T/F]
    default = F
  -I  Print database information only (overrides all other options) [T/F]
    default = F
  -P  Retrieve sequences with this PIG [Integer]  Optional

Please note that options -t and -c are relevant only to non-redundant
databases (i.e., protein nr and pataa, as provided on the NCBI ftp site).

Usage
-----

1.) Retrieving a sequence by identifier:

fastacmd -d nt -s 555
>gi|555|emb|X65215.1|BTMISATN B.taurus microsatellite DNA (624bp)
ACCTCCACTAGCTTTGTTTGTAGTGATGCTCTGTAGCACCACTGGGAAGCCCTTTAATGAATGTGCCTTTCCGCAAATCA
CACACACACAAATACACTTATAGAAACAAGGTGATTTTCTTGAAATAATAAAACAAAATTTGGAAGAAGATTTTTACTGT
CTTAGGAAAAGTAAGGCATTGGAAGGTGGCTAGGTATGACATATGAAGTTGCATTTTAAAACTGGAATTGGACAACTGAT
ATTCAGTGATATTTATGCTACTACCTTCTAGAATCGAGAGCATGCACCCCACTCTGTACTCTTGCCTGGAGAATCCATGA
TGAGAGCCTGGTAGGCTGCAGTCCATGGGGTCACACAGAGTCGGACATGACTGAGCGACTTCACTTTCACTTTTCAATTT
CATGCATTGGAGCCGGAAATGGCAACCCACTCCAGTGTTCTTGCCTGGAGAATCCCAGGGATGGGGAAGCCTGGTGGGCT
GCTGTCTATGGGGTCGCAGAGAGTCAGACACGACTGAAGTGACTTAGCAGCAACCTTCTGGAATAAACGCCTCAGGCTTT
AAACTCTGGCTTGACCATTCACTAGCCATGGGATCCACTAGAGTCGACCTGCAGGCATGCAAGC

If the identifier is not a gi or an accession, you must pass the entire seqid
string to fastacmd. That is, if your sequence is

>gnl|mydb|myid my sequence description
ACGT...

you must search for it with

fastacmd -d mydb -s 'gnl|mydb|myid'

2.) Printing a summary of database statistics:

fastacmd -d nt -I
Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences)
           1,711,089 sequences; 7,976,531,563 total letters

File name: /usr/ncbi/db/blast/nt
Date: Mar 26, 2003  10:25 PM    Version: 4    Longest sequence: 1,421,559 bp

3.) Obtaining a FASTA file from a blast database:

fastacmd -D 1 -d nt -o nt.fsa
[output removed for brevity]

4.) Retrieving only part of a sequence:

fastacmd -d nt -s 555 -L0,32
>gi|555:1-32 B.taurus microsatellite DNA (624bp)
ACCTCCACTAGCTTTGTTTGTAGTGATGCTCT

5.) Retrieving taxonomic information for a given sequence:

fastacmd -d nt -s 555 -T
NCBI sequence id: gi|555|emb|X65215.1|BTMISATN
NCBI taxonomy id: 9913
Common name: cow
Scientific name: Bos taurus

6.) Obtaining a list of gis from a blast database:

fastacmd -D 2 -d nt -o nt.gis
[output removed for brevity]

Return values
-------------

The following exit values are returned:

    0   Completed successfully
    1   An error occurred
    2   Blast database was not found
    3   Failed search (accession, gi, taxonomy info)
    4   No taxonomy database was found

Notes/Troubleshooting
---------------------

A) Taxonomy information

In order to access the taxonomy information using fastacmd, the blast
databases must have been obtained from the NCBI ftp site
(ftp://ftp.ncbi.nih.gov/blast/db), and an additional set of files is needed.
These files are archived as taxdb.tar.gz under the same directory as the blast
databases on the NCBI ftp site. Please install these files in the same
directory as the blast databases (and do not forget to update your ncbi
configuration file to point to this directory).

Here are some of the error messages one might encounter when accessing the
taxonomy information from the blast databases:

fastacmd -d testdb -s 555 -T
[fastacmd] ERROR: Taxonomy information not encoded in your blast database.

This blast database does not contain the taxonomy id encoded for this
gi/accession. Only preformatted blast databases provided by the NCBI contain
encoded taxonomy identifiers (formatdb cannot add this).

fastacmd -d patnt -s 412262 -T
[fastacmd] ERROR: Taxonomy information is not available. Please download it from ftp://ftp.ncbi.nih.gov/blast/db/taxdb.tar.gz

Download the required files and install them as described above.
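
The exit values listed under "Return values" can be checked from a shell
script. The following is a minimal sketch, not part of fastacmd itself: the
function names fastacmd_status_message and run_fastacmd are hypothetical, and
the wrapper assumes that fastacmd is available on your PATH. The messages are
copied from the table of exit values above.

```shell
#!/bin/sh
# Hypothetical helper: translate a fastacmd exit value into the message
# documented under "Return values".
fastacmd_status_message() {
    case "$1" in
        0) echo "Completed successfully" ;;
        1) echo "An error occurred" ;;
        2) echo "Blast database was not found" ;;
        3) echo "Failed search (accession, gi, taxonomy info)" ;;
        4) echo "No taxonomy database was found" ;;
        *) echo "Unrecognized exit value: $1" ;;
    esac
}

# Hypothetical wrapper: run fastacmd with the given arguments, report any
# non-zero exit value on stderr, and pass the exit value through unchanged.
# Assumes fastacmd is on PATH.
run_fastacmd() {
    fastacmd "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "fastacmd: $(fastacmd_status_message "$status")" >&2
    fi
    return "$status"
}
```

With these definitions loaded, a script could call, for example,
run_fastacmd -d nt -s 555 -T and branch on the returned status to tell a
missing taxonomy database (4) apart from a failed search (3).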