phrap.doc

/*****************************************************************************
# Copyright (C) 1994-1999 by Phil Green.
# All rights reserved.
#
# This software is part of a beta-test version of the swat/cross_match/phrap
# package. It should not be redistributed or
# used for any commercial purpose, including commercially funded
# sequencing, without written permission from the author and the
# University of Washington.
#
# This software is provided ``AS IS'' and any express or implied
# warranties, including, but not limited to, the implied warranties of
# merchantability and fitness for a particular purpose, are disclaimed.
# In no event shall the authors or the University of Washington be
# liable for any direct, indirect, incidental, special, exemplary, or
# consequential damages (including, but not limited to, procurement of
# substitute goods or services; loss of use, data, or profits; or
# business interruption) however caused and on any theory of liability,
# whether in contract, strict liability, or tort (including negligence
# or otherwise) arising in any way out of the use of this software, even
# if advised of the possibility of such damage.
#
#*****************************************************************************/

DOCUMENTATION FOR PHRAP AND CROSS_MATCH (VERSION 0.990319)

phrap ("phragment assembly program", or "phil's revised assembly
program"; a homonym of "frappe" = French for "swat") -- a
program for assembling shotgun DNA sequence data. Key features:
allows use of entire read (not just trimmed high quality part); uses a
combination of user-supplied and internally computed data quality
information to improve accuracy of assembly in the presence of
repeats; constructs contig sequence as a mosaic of the highest quality
parts of reads (rather than a consensus); provides extensive information
about assembly (including quality values for contig sequence) to
assist trouble-shooting; able to handle very large datasets.
N.B. phrap does not provide editing or viewing capabilities; these
are available with consed and phrapview. It is strongly recommended
that phrap be used in conjunction with the base calls and base quality
values produced by the basecaller, phred; and with the sequence
editor/assembly viewer, consed.

cross_match -- a general-purpose utility (based on a "banded" version
of SWAT, an efficient implementation of the Smith-Waterman algorithm)
for comparing any two sets of (long or short) DNA sequences. For
example, it can be used to compare a set of reads to a set of vector
sequences and produce vector-masked versions of the reads; a set of
cDNA sequences to a set of cosmids; contig sequences found by two
alternative assembly procedures (e.g. phrap and xbap) to each other;
or phrap contigs to the final edited cosmid sequence. It is slower but
more sensitive (because it allows gaps) than BLASTN.

CONTENTS

I. INSTALLATION

II. RUNNING PHRAP & CROSS_MATCH
1. Phrap
2. Cross_match

III. INPUT DATA FILES
1. Read naming convention
2. Sequence files
3. Quality files
4. Vector screening

IV. COMMAND LINE OPTIONS
1. Scoring of pairwise alignments
2. Banded search
3. Filtering of matches
4. Input data interpretation
5. Assembly
6. Consensus sequence construction
7. Output
8. Miscellaneous

V. VIEWING PHRAP ASSEMBLIES WITH OTHER PROGRAMS
1. Consed
2. Phrapview
3. Gap

VI. PHRAP OUTPUT

VII. CROSS_MATCH OUTPUT

VIII. EST ASSEMBLIES (TO BE ADDED)

IX. PROBLEMS
1. Insufficient memory
2. Other phrap- or cross_match-generated error messages (TO BE ADDED)
3. Phrap- or cross_match-generated warning messages (TO BE ADDED)
4. "Crashes" reported by operating system
5. Long running time
6. Misassemblies, incomplete assemblies, incorrect consensus sequence (TO BE ADDED)
7. Reporting problems

APPENDIX: ALGORITHMS

I. INSTALLATION

The source code for the swat/cross_match/phrap package will be sent to
you in the form an email message containing a uuencoded .tar.Z file;
you will need to have access to a Unix system for the initial
unpacking, but once you've uudecoded it and unpacked the .tar file
(steps i and ii below), you should be able to compile the programs
on computers running other operating systems -- they should be portable
to almost anything with a decent C compiler and adequate memory (64 Mb
RAM or more is desirable). Here are the steps needed to unpack and
install the programs:

i. Save the email message as a file (for example, "temp.mail"). If
possible, do this using the Unix mail command, rather than another
mail program -- some mail programs (e.g. Pine) remove trailing spaces
on each line of incoming messages, which will corrupt a uuencoded
message.
Do not attempt to modify the saved mail message in any way. That is
unnecessary and may corrupt the message.

ii. To unpack the saved email message, execute the following two
commands on a Unix workstation, in the directory containing the file
created in step i above:

> uudecode temp.mail

> zcat distrib.tar.Z | tar xvf -

If either of these commands results in an error message, it is likely that
the email message was corrupted by your mail program (see step i above).

iii. To produce working versions of the programs, move (if necessary)
all of the files produced by the above command to the computer on
which you wish to run the programs (which must have a C compiler!),
and execute the following command in the directory that contains the
files:

> make

If your compiler does not recognize the -O2 optimization flag (which
should be evident from warning messages it produces), you should change
the line

CFLAGS= -O2

in the file "makefile" to

CFLAGS= -O

Then remove all files ending in .o produced by the original make, and
recompile.

iv. If you have datasets with more than 64,000 reads, or that include
sequences longer than 64,000 bp, you will need the .manyreads and/or
.longreads versions of phrap and cross_match. These are created using the
command

> make manyreads

v. If you are operating a non-commercial (academic or government)
computer facility which provides access to several independent
investigators, you are required by the licensing agreement to set the
permissions on the executables and source code to allow execute but
not read access, so that the programs may not be copied.

II. RUNNING PHRAP & CROSS_MATCH

1. Phrap.

i. PhredPhrap script. Generally if you are assembling reads
basecalled by phred you should run phrap via the phredPhrap script
rather than directly. PhredPhrap runs phred; runs another script,
phd2fasta, to make the fasta format sequence file from the .phd files
generated by phred; runs cross_match to mask vector sequence; runs
phrap; and finally runs some other programs that provide useful
information to consed (e.g. tagging of repeats). It will accept any
of the phrap command line options described below (section IV).
PhredPhrap and its documentation are included in the consed
distribution.
Although phredPhrap automates most of the steps in the assembly, you
will still need to do several things: (a) set up the directory
structure that is assumed by phredPhrap and make sure your
chromatogram files are appropriately located in this structure; (b)
adhere to the read naming convention assumed by phrap, or make other
arrangements to get template/chemistry/read direction information into
the .phd or fasta file (THIS IS IMPORTANT -- SEE section III.1 below);
and (c) make sure that appropriate clone and subclone (sequencing)
vector sequences are included in a fasta file used for screening with
cross_match (see section III.4 below).

ii. The following instructions pertain to "standalone" operation of
phrap. Before the run, you may want to increase the amount of memory
available to phrap: see the instructions for using the Unix
"limit" command under IX.1, "Insufficient memory" below.
The command line should be of the form

> phrap seq_file1 [seq_file2 ...] [-option value] [-option value] ...

where each seq_file is the name of a sequence file in FASTA format as
described below in III.2. Available options are described in section IV
below. The standard output is extensive and should be redirected to a
file (phrap.out in the following example).

Example:

> phrap C05D11.reads.screen -minmatch 20 -new_ace > phrap.out

Note that the quality file is not specified on the command line, but
instead must be named appropriately (as described below in III.3) to be
recognized by phrap; in this example its name would need to be
C05D11.reads.screen.qual. If more than one seq_file is provided,
then reads in the first file are assembled only against sequences in
the 2d (and later) files; no sequences are compared to other sequences
in the same file. [N.B. THE ABILITY TO USE MULTIPLE SEQUENCE FILES IS
NOT AVAILABLE AT PRESENT -- YOU SHOULD PROVIDE ONLY A SINGLE SEQUENCE
FILE TO PHRAP]. In this case, many features of phrap (e.g. checking
for anomalies, vector, etc.) which rely on comprehensive read-read
comparisons are turned off, SWAT scores are used in place of
LLR_scores, and implied merges are not performed -- i.e. each read in
the first file is merged with at most one sequence in the 2d
file. This usage of phrap is convenient for aligning a set of reads
against a known "finished" sequence in order to look for
polymorphisms, for example.

2. Cross_match.

The command line should be of the form

> cross_match seq_file1 [seq_file2 ...] [-option value] [-option value] ...

where each seq_file is the name of a sequence file in FASTA format as
described below in III.2. Available options are described in section
IV below. The standard output should generally be redirected to a
file. If there are at least two files, all sequences in the first
file are compared to those in the second (and any subsequent) files;
if there is only one file, all sequences in this file are compared to
each other (N.B. at present entries are not compared to their own
complements, but are compared to themselves in the same orientation,
and to all other entries in both orientations). Matches meeting
relevant criteria (see section IV) are written to the standard output.

III. INPUT FILES

Input files to cross_match and phrap are of two types: sequence files
(required), and quality files (optional, but strongly recommended for
phrap). If you are doing phrap assemblies of reads generated by the
basecalling program phred, then it is recommended that you generate
these input files automatically using the phd2fasta or phredPhrap
scripts distributed with phred and consed (see II.1.i above).
Even in this case however you should read the following section on the
read naming convention, since you will either need to make sure that
the read names attached to your chromatogram files adhere to the
naming convention, or you will need to modify the phredPhrap script
(or create your own script) so that correct template names and read
orientation information is appended to the .phd files after they are
produced by phred.

1. Read naming convention (phrap only)

In addition to the sequence and quality data, phrap needs to know
three things for each read:
(i) the name of the subclone (or other template) from which the read
is derived; this is used, for example, in checking for chimeric
subclones (for which purpose it is necessary to know when two reads
are from the same subclone so that they are not regarded as
independently confirming each other) and for certain other data
anomalies;
(ii) the orientation of the read (forward or reverse) within the
subclone, in cases where data is acquired from each end of a subclone
insert (at present this information is used only for consistency
checking following assembly, and not in the assembly itself, but this
will change in future versions);
(iii) the chemistry used to generate the read (which influences
phrap's decisions as to how to treat discrepancies between potentially
overlapping reads, and as to how to adjust qualities of reads which
confirm each other -- confirmation of a read by another read with
different chemistry counts as significantly as confirmation by an
opposite-strand read).

The above information is generally conveyed via the read names using
the "St. Louis" read naming convention described below. If your
laboratory uses a different naming system, you have several options:

(i) provide the relevant information in the description field of the
.fasta file instead (see below), or (if you are using the phd2Fasta
script) as tags in the .phd files for each read;
(ii) see whether appropriate values of the command line options
-subclone_delim and -n_delim (described below) can be used to bring
your own naming convention into conformity with the St. Louis
convention -- if your naming convention is such that the subclone name
always occurs as the first part of the read name, and the end of the
subclone name occurs at the first (or n-th for some fixed value of n)
occurrence of a particular character, then you may be able to use
these options without having to change your read names;
(iii) create a script which translates your read names into St. Louis
form (this does not mean that you cannot retain the original names as
well -- for example, for purposes of running phrap you could create a
directory of "links" to your chromatogram files, such that the link
names adhere to the naming convention, and then run the phredPhrap
script on this directory);
(iv) alter the subroutines in the source module "names.c" in order to
parse your read names correctly; or
(v) ignore the issue and hope that it does not cause problems. This
may have unpredictable adverse effects on assembly and is NOT
recommended.

The "St. Louis" read naming convention assumed by phrap is as follows:

The portion of the read name up to the first '.', if any, is
the name of the subclone from which the read is derived (i.e. the DNA
template used in the sequencing reaction). If desired, the command line
option -subclone_delim can be used to change the character which
delimits the end of the subclone name to something other than '.',
and/or the option -n_delim may be used to indicate that the subclone
name contains a fixed number of occurrences of this character within
it. See section IV below for a description of command line options.
The orientation of the read within the subclone, and the chemistry,
are indicated by the first letter following the '.', as follows:

"s" forward direction read on single stranded (SS) template, dye primer chemistry
"f" forward read on double stranded (DS) template, dye primer chemistry
"r" DS reverse read, dye primer chemistry
"x" SS forward read, standard dye terminator chemistry
"z" DS forward read, standard dye terminator chemistry
"y" DS reverse read, standard dye terminator chemistry
"i" SS forward read, big dye terminator chemistry
"b" DS forward read, big dye terminator chemistry
"g" DS reverse read, big dye terminator chemistry
"t" for T7 (cDNAs)
"p" for SP6 (cDNAs)
"e" for T3 (cDNAs)
"d" for special
"c" consensus pieces
"a" assembly pieces

The remainder of the name is ignored. If the read name ends in ".seq"
the .seq is removed.

Example: the name BAC112_a11b3.x3_pg indicates a forward read with
(old) dye terminator chemistry from the subclone BAC112_a11b3.

2. Sequence files.

With both phrap and cross_match, either a single sequence file or
multiple sequence files can be specified (N.B. WITH CURRENT VERSION OF
PHRAP ONLY A SINGLE SEQUENCE FILE SHOULD BE SPECIFIED). If a single
file is specified, all sequences in it are compared to each other; if
more than one file is specified, sequences in the first file are
compared to sequences in the second (and subsequent) files. Typically
phrap is used with a single input sequence file containing the reads,
unless one wants to assemble one set of reads "against" another
sequence or set of sequences (see below). Cross_match is typically
used with two input sequence files, the first containing the "query"
sequences and the second containing the "subject" sequences, but it
may also be used to compare the sequences in a single input file to
each other.

Phrap is designed to be able to use all of each read sequence in the
assembly, not just the trimmed (highest quality) part, so the full
sequence of each read should be provided when available. If
phred-generated qualities are not available however then the default
quality parameter -default_qual should be set appropriately (section
IV).

Sequence files must be in FASTA format:
(i) Each sequence entry has a single header line as follows: The
first character is '>'. This is followed immediately (i.e. with no
intervening spaces or other characters) by the sequence name, and then
(optionally) by a space or spaces, followed by descriptive information
about the sequence.
(ii) The header line is followed by a separate line or lines
containing the sequence.

The sequence names, descriptive information, and sequences can be of
arbitrary length (up to a combined total size for each type of
about 2 billion characters), since space for them is allocated
dynamically. If there are more than 64,000 entries in the first
(query) file, or if any of the sequences are longer than about 64,000
characters, the .manyreads version of phrap or cross_match must be
used.

Descriptive information on the header line may optionally be used to
indicate template (subclone), read orientation, read chemistry and
dye. If present, this information overrides whatever would be implied
by the read name. The template name is indicated by including the word
"TEMPLATE:", followed by a space, followed by the name of the
template. Orientation is indicated by the word "DIRECTION:", followed
by the word "fwd" or "rev". Chemistry is indicated by the word
"CHEM:", followed by the word "prim" (for dye primer), "term" for dye
terminator, or "unknown". Dye is indicated by the word "DYE:" followed
by "rhod", "big", "ET", "d-rhod", or "unknown". The chemistry and dye
information is provided automatically by newer versions of phred
(version 0.980904.a or later), which obtain it from the chromatogram
file. (Use the script phd2fasta distributed with phred and consed to
create the .fasta file from the .phd files created by phred).

Example:

>read#5 DIRECTION: rev CHEM: prim DYE: ET TEMPLATE: a23f1 any_other_information
GAAAGATCTCATTGATCACTCTATTCAAGTGGGAGTCTCCGGTCTTT
ATGATCGATTTGTGAATCTTCGTATCAAAGTTGGAGCTGACAAGTATCCA
TTGCTTGCGAAATGGGCTCAAATTTTCACTCAGGGAGTCGTCTTCGATCC
TTCAAGAATTCATCAATGTGCTCAAAAGTTGGCTGGAGAAGCTCGTGATC
GGAAGAGAGATGGATGTACTGTGGCATCAACTGCAGTAGCTTCAATGGTT
TATGGAAAGAGTATGTTATTt

All descriptive information in the header line is written back out to
the .ace and .singlets files.

Non-alphabetic characters (including '*', and digits) in the sequence
lines are automatically stripped out when the file is read in, except
for '-' which is converted to 'N'; '>' must not appear anywhere within
the sequence since it is assumed to start the header of a new sequence
(even if it is not at the beginning of a line). Lower case letters are
converted to upper case on readin, so it is not possible (as it was in
old phrap versions) to use case as a crude indication of quality.

3. Quality files.

For each input sequence file in FASTA format, you may optionally
include a corresponding file of data quality information. This is
strongly recommended for phrap runs since it greatly improves the
accuracy of assembly and of the consensus sequence; with cross_match,
the qualities are at present used only for purposes of annotating and
tabulating discrepancies and not in the alignment scoring. The name of
this file must consist of the name of the FASTA file, with ".qual"
appended.
The format of the .qual file is similar to that of the corresponding
FASTA file. For each read there should appear a header line identical
(except possibly for the "description" field) to that in the FASTA
file. This is followed by one or more lines giving the qualities for
each base. Quality values should be integers between 0 and 99
(inclusive), and should be separated by spaces. No other (non-digit,
non-white space) characters should appear. The total number of quality
values for each read must match the number of bases for that read in
the FASTA file. Also, the orders of the reads in the FASTA and
corresponding .qual file must be identical. N's are automatically
assigned quality 0, overriding any value present in the .qual file;
however X's retain whatever value is present in the .qual file.
If provided, quality information is used by phrap both in the
assembly itself (where discrepancies between high quality bases are
used to discriminate repeats from true overlaps) and in determining
the contig consensus sequence (which is formed as a mosaic of the
highest quality parts of the reads). Because phrap generates its own
quality measures (based on read-read confirmation) it performs
reasonably well even if no input quality is provided; however when it
is available, it is important to provide input data quality
information since this can substantially increase the accuracy of the
consensus, particularly in regions where there are strand-specific
effects on quality (e.g. compressions) -- in such cases the input
quality information may constitute the only basis for choosing one
strand over the other.
If your sequencing data is from automated sequencing machines it is
strongly recommended that you generate the read sequences using the
basecalling program phred, which will produce an appropriate quality
file for input to phrap. Phred generally gives more accurate base
calls overall, for a longer distance, than the ABI software does. On
the basis of the trace characteristics, phred computes a probability p
of an error in the base call at each position, and converts this to a
quality value q using the transformation q = -10 log_10(p). Thus a
quality of 30 corresponds to an error probability of 1 / 1000, a
quality of of 40 to an error probability of 1 / 10000, etc.
The quality value 99 is reserved for base calls that have been
visually inspected and verified as "highly accurate" (during editing),
and 98 for bases that have been edited but are not highly accurate
(these are converted to quality 0 in phrap). If two reads appear to
match but have discrepancies involving bases that have quality 99 in
both reads, phrap does not allow them to be merged during assembly.
At present this is the only way to break false joins made by phrap.
If you wish to generate your own quality values you should be
prepared to experiment. It is particularly important that possible
insertion errors (which are fairly frequent late in the trace, with ABI
basecalls) receive low quality, because in choosing the "consensus"
phrap tends to favor the reads with the highest total quality in a
given region. Also, if at all possible you should use quality values
that are defined in terms of error probabilities in the manner
described above, since to some extent phrap is tuned to expect these
(and its output qualities will then have a similar interpretation).
If all input quality values are relatively small (less than 15),
phrap assumes that they do not correspond to error probabilities and
attempts to rescale them so that the largest quality value is
approximately 30. This allows different input scales to be used.

4. Vector screening

Following creation of a FASTA file with the raw read sequences, it is
important to remove or screen out sequencing (subclone) vector
sequence before running phrap. (It is unnecessary to remove the
cloning (e.g. cosmid or BAC) vector, unless the inserts from multiple
overlapping clones are being assembled simultaneously; however it is
generally useful to remove at least the central part of the cloning
vector, since this then provides a natural point at which phrap can
break the circular sequence into a linear one). Unremoved sequencing
vector may cause reads to be identified as "possible chimeras", or
otherwise interfere with proper assembly. Some vector removal
programs may be unreliable at identifying vector sequences that are
rearranged or found in the lower quality part of the read, and I
recommend that you screen for sequencing vector using cross_match as
described here, whether or not you have already screened them using
another program.

To carry out the screening, first create a FASTA file, say "vector",
with all of the vector sequences you want to screen for (I use one
that contains all vector sequences used in any of the ongoing
sequencing projects, in case the vector is misidentified in the
current project). If your read file is named "C05D11.reads" (for
example), then run the command

> cross_match C05D11.reads vector -minmatch 10 -minscore 20 -screen > screen.out

(This uses somewhat more sensitive parameter settings than the
cross_match defaults). The '-screen' option causes a file named
"C05D11.reads.screen" to be created, containing "vector masked"
versions of the original sequences: i.e. any region that matches any
part of a vector sequence is replaced by X's. This ".screen" file is
what should be provided as input to phrap. The output from
cross_match (which has been redirected to the file screen.out in the
above example) lists the matches that were found (see below).

N.B. If you have created a .qual file for your reads (say
C05D11.reads.qual), be sure to rename or copy it to
C05D11.reads.screen.qual so that it will be recognized by phrap when
C05D11.reads.screen is used as the input FASTA file.

The phredPhrap script automatically carries out the above steps
(i.e. vector screening using cross_match, and renaming of the .qual
file); however it is your responsibility to make sure the vector
sequence file includes vector sequences appropriate for your cloning
and sequencing vectors.

IV. COMMAND LINE OPTIONS

This section describes all command line options available with
cross_match and phrap, grouped by category. Each option name is
followed immediately by the "default value" assumed by the program,
and a brief description. A '*' appearing as the default value
indicates that the option is a flag rather than a parameter; in this
case no value should be included on the command line after the option
name, and by default it is turned off. [None] means there is no
default, i.e. the parameter is not relevant when the option is
omitted.

Unless otherwise indicated, an option can be used with either
cross_match or phrap.

1. Scoring of pairwise alignments

The following options control the scoring of pairwise sequence
alignments. By default, matching residues receive a reward of +1,
mismatches get a penalty of -2, gap opening residues a penalty of -4
(= -2 - 2), and gap extension residues a penalty of -3 (= -2 - 1). The
highest scoring word-nucleated local alignment between a pair of
sequences is found, and its score is then complexity-adjusted
(i.e. scores of matches between sequence regions of biassed nucleotide
composition are adjusted downwards). These parameters are chosen such
that regions of two sequences that are about 70% or more identical
will tend to have a positive alignment score. For more stringent
comparisons, -penalty should be made more negative (e.g. a value of -9
will tend to find alignments that are 90% identical), and/or -minscore
should be increased; for more sensitive comparisons, a score matrix
should be used instead.

option name & default value

-penalty -2
Mismatch (substitution) penalty for SWAT comparisons.

-gap_init penalty-2
Gap initiation penalty for SWAT comparisons.

-gap_ext penalty-1
Gap extension penalty for SWAT comparisons.

-ins_gap_ext gap_ext
Insertion gap extension penalty for SWAT comparisons (insertion in
subject relative to query).

-del_gap_ext gap_ext
Deletion gap extension penalty for SWAT comparisons (deletion in
subject relative to query).

-matrix [None]
Score matrix for SWAT comparisons (if present, supersedes -penalty)
Matrix format: (TO BE ADDED)

-raw *
Use raw rather than complexity-adjusted Smith-Waterman scores.

2. Banded search

The following options control the region of the Smith-Waterman matrix
that is searched. In phrap and cross_match (as opposed to swat, which
performs full Smith-Waterman comparisons), word-nucleated banded
Smith-Waterman comparisons are performed: the set of input sequences
is scanned to find pairs of perfectly matching subsequences ("word
matches") meeting certain criteria, and for each such word match a
band in the Smith-Waterman matrix that is centered on the diagonal
defined by the match is searched to find an optimal-scoring alignment.
If there are multiple matching words for a given pair of sequences,
the union of the corresponding bands is searched. The search is
"recursive" in the sense that if an optimal-scoring alignment is found
whose score exceeds -minscore the remainder of the band is searched
again; as a result multiple alignments may be found within a single
band.
The word matches that are considered are "maximal" in the sense that
they cannot be extended in either direction without introducing a
discrepancy in one of the two sequences. If the length of the word
match is at least -maxmatch, it is accepted; otherwise if the
complexity-adjusted length is less than -minmatch, or if the matching
sequence occurs more than -max_group_size times on the forward strands
of the query file, it is rejected. The latter criterion effectively
downweights frequently occurring words. It can be turned off by
setting -maxmatch equal to -minmatch.

-minmatch 14
Minimum length of matching word to nucleate SWAT comparison. if
minmatch = 0, a full (non-banded) comparison is done [N.B. NOT
PERMITTED IN CURRENT VERSION]. Increasing -minmatch can dramatically
decrease the time required for the pairwise sequence comparisons; in
phrap, it also tends to have the effect of increasing assembly
stringency. However it may cause some significant matches to be
missed, and it may increase the risk of incorrect joins in phrap in
certain situations (by causing implied overlaps between reads with
high-quality discrepancies to be missed).

-maxmatch 30
Maximum length of matching word. For cross_match, the default value
is equal to minmatch, instead of 30.

-max_group_size 20
Group size (query file, forward strand words)

-word_raw *
Use raw rather than complexity-adjusted word length, in testing
against minmatch (N.B. maxmatch always refer to raw lengths). (The
default is to adjust word length to reflect complexity of matching
sequence).

-bandwidth 14
1/2 band width for banded SWAT searches (full width is 2 times
bandwidth + 1). Decreasing bandwidth also decreases running time at
the expense of sensitivity. Phrap assemblies of clones containing long
tandem repeats of a short repeat unit (< 30 bp) may be more accurately
assembled by decreasing -bandwidth; -bandwidth should be set such that
2 bandwidth + 1 is less than the length of a repeat unit. -bandwidth 0
can be used to find gap-free alignments.

3. Filtering of matches

The following options control which matches are displayed (cross_match)
or used in assembly (phrap).

-minscore 30
Minimum alignment score.

-vector_bound 80
Number of potential vector bases at beginning of each read. Matches
that lie entirely within this region are assumed to represent vector
matches and are ignored. For cross_match, the default value is 0
instead of 80.

-masklevel 80
(cross_match only). A match is reported only if at least (100 -
masklevel)% of the bases in its "domain" (the part of the query that
is aligned) are not contained within the domain of any higher-scoring
match.
Special cases:
-masklevel 0 report only the single highest scoring match for each query
-masklevel 100 report any match whose domain is not completely contained
within a higher scoring match
-masklevel 101 report all matches

4. Input data interpretation

The following options influence how input data (including read names)
are interpreted.

-default_qual 15
Quality value to be used for each base, when no input .qual file is
provided. Note that a quality value of 15 corresponds to an error rate
of approximately 1 in 30 bases, i.e. relatively accurate sequence. If
you are using sequence that is substantially less accurate than this
and do not have phred-generated quality values you should be sure to
decrease the value of this parameter.

-subclone_delim .
(phrap only). Subclone name delimiter: Character used to indicate end
of that part of the read name that corresponds to the subclone name

-n_delim 1
(phrap only). Indicates which occurrence of the subclone delimiter
character denotes the end of the subclone name (so for example
-subclone_delim _ -n_delim 2
means that the end of the subclone name occurs at the
second occurrence of the character '_'). Must be the same for all
reads!

-group_delim _
(phrap only). Group name delimiter: Character used to indicate end of
that part of the read name that corresponds to the group name
(relevant only if option -preassemble is used); this character must
occur before the subclone delimiter (else it has no effect, and the
read is not assigned to a group).

-trim_start 0
(phrap only). No. of bases to be removed at beginning of each read.

5. Assembly

The following options are used to control completeness and stringency
of assembly; note that the options -minmatch, -bandwidth, -penalty,
and -minscore discussed above also affect assembly stringency.

-forcelevel 0
(phrap only). Relaxes stringency to varying degree during final
contig merge pass. Allowed values are integers from 0 (most
stringent) to 10 (least stringent), inclusive.

-bypasslevel 1
(phrap only). Controls treatment of inconsistent reads in merge.
Currently allowed values are 0 (no bypasses allowed; most stringent)
and 1 (a single conflicting read may be bypassed).

-maxgap 30
(phrap only). Maximum permitted size of an unmatched region in
merging contigs, during first (most stringent) merging pass.

-repeat_stringency .95
(phrap only). Controls stringency of match required for joins. Must
be less than 1 (highest stringency), and greater than 0 (lowest
stringency).

-revise_greedy *
(phrap only). Splits initial greedy assembly into pieces at "weak
joins", and then tries to reattach them to give higher overall score.
Use of this option should correct some types of missassembly.

-shatter_greedy *
(phrap only). Breaks assembly at weak joins (as with -revise_greedy)
but does not try to reattach pieces.

-preassemble *
(phrap only). Preassemble reads within groups, prior to merging with
other groups. This is useful for example when the input data set
consists of reads from two distinct but overlapping clones, and it is
desired to assemble the reads from each clone separately before
merging in order to reduce the risk of incorrect joins due to
repeats. The preassemble merging pass is relatively stringent and not
guaranteed to merge all of the reads from a group.
Groups are indicated by the first part of the read name, up to the
character specified by -group_delim.

-force_high *
(phrap only). Causes edited high-quality discrepancies to be ignored
during final contig merge pass. This option may be useful when it is
suspected that incorrect edits are causing a misassembly.

6. Consensus sequence construction

The following options affect the weighted directed graph that is used
to find the consensus sequence. Increasing their values will reduce
the size of the graph, which should reduce memory requirements for the
phrap run (substantially in some cases) but may decrease the accuracy
of the consensus sequence that is found.

-node_seg 8
(phrap only). Minimum segment size (for purposes of traversing
weighted directed graph).

-node_space 4
(phrap only). Spacing between nodes (in weighted directed graph) .

7. Output

The following options control the creation or formatting of certain
output files, and of the standard output.

-tags *
Tag selected lines in the standard output, to facilitate
parsing. (N.B. this does NOT refer to the phrap-generated tags that
are included in the .ace file and viewable in consed.)

-screen *
(cross_match): Create a ".screen" file. This is a FASTA format
sequence file, in which any sequence region in a sequence in the first
input sequence file that matches some region in a sequence in the
second (or later) file is replaced by X's. This option is primarily
used to create "vector-masked" copies of the reads prior to phrap
assembly.
(phrap): when the -old_ace or -new_ace option is specified (see
below), this option causes parts of the read sequences that consist of
phrap-inferred sequencing vector and chimeric segments to be replaced
by X's in the .ace file. (The "phrap-inferred sequencing vector"
consists of the beginning part of the read that either does not match
any other read, or matches only the beginning parts of other reads;
"phrap- inferred chimeric segments" consist of segments of the read
that match other reads but do not match the consensus at that
location.)

-old_ace *
(phrap only). Create ".ace" file in old style format.

-new_ace *
(phrap only). Create ".ace" file in a new style format (STRONGLY
recommended over old_ace !!)

-ace *
(phrap only). Same as -new_ace.

-view *
(phrap only). Create ".view" file suitable for input to phrapview.

-qual_show 20
(phrap only). Cutoff for flagging "low_quality" regions in contig
sequence and "high quality" discrepancies between read and
contig. Bases in the .contigs and .ace file are lower case if and only
if their LLR-converted quality values are below this value.

-print_extraneous_matches *
(phrap only). Print information about non-local matches between
contigs.

-exp [None]
(gcphrap only). Name of a directory in which output experiment files
are to be placed.

-alignments *
(cross_match only). Display the alignment for each match.

-discrep_lists *
(cross_match only). Give list of discrepancies for each match.

-discrep_tables *
(cross_match only). Give table of discrepancies (by quality) for each
match (default is to display a single table, that combines results for
all matches).

8. Miscellaneous

-retain_duplicates *
(phrap only). Retain exact duplicate reads, rather than eliminating
them.

-max_subclone_size 5000
(phrap only). Maximum subclone size -- for forward-reverse read pair
consistency checks.

-trim_penalty -2
(phrap only). Penalty used for identifying degenerate sequence at
beginning & end of read.

-trim_score 20
(phrap only). Minimum score for identifying degenerate sequence at
beginning & end of read.

-trim_qual 13
(phrap only). Quality value used in to define the "high-quality" part
of a read, (the part which should overlap; this is used to adjust
qualities at ends of reads.

-confirm_length 8
(phrap only). Minimum size of confirming segment (segment starts at
3d distinct nuc following discrepancy).

-confirm_trim 1
(phrap only). Amount by which confirming segments are trimmed at
edges.

-confirm_penalty -5
(phrap only). Penalty used in aligning against "confirming" reads.

-confirm_score 30
(phrap only). Minimum alignment score for a read to be allowed to
"confirm" part of another read.

-indexwordsize 10
Size of indexing (hashing) words, used in finding word matches
between sequences. The value of this parameter has a generally minor
effect on run time and memory usage.

V. VIEWING PHRAP ASSEMBLIES WITH OTHER PROGRAMS

1. Consed

Consed is the recommended sequence editor for use with phrap. Its
development has been closely tied to the development of phrap and it
has been designed to take advantage of the information regarding the
assembly that is generated by phrap (e.g. phrap's consensus sequence
and quality values, and tags indicating potential assembly problems
and various data anomalies such as chimeras and compressions).
To use consed with phrap, ace files must be generated during the phrap
run using the flag -new_ace or -old_ace on the command line. This is done
automatically if you use the script phredPhrap to carry out all of the
data processing steps.
For additional information, consult consed documentation.

2. Phrapview

Phrapview is a graphical tool that provides a "global" view of the
phrap assembly, complementary to the "local" view provided by the
sequence editor CONSED; it will become obsolete when a similar globabl
view is added to CONSED.

a. Installation
Phrapview is written in perl/Tk; to run it you will need to have
installed on your system a recent version of perl (perl5 or later),
together with the perl/Tk perl module written by Nick Ing-Simmons. The
perl/Tk scripts can be obtained from any CPAN site in
modules/by-authors/Nick_Ing-Simmons/, for example:

ftp://ftp.perl.org/pub/CPAN/modules/by-authors/Nick_Ing-Simmons/

The latest version of perl/Tk is the most recent Tk800.xxx.tar.gz file
(where xxx is the most recent version number). The latest version of
perl can also be obtained from any CPAN site in /src, for example

ftp://ftp.perl.org/pub/CPAN/src/perl5.005_02.tar.gz

(Perl 5.005_02 was the most recent version when this was written.)

If you have problems with running phrapview, please make sure that you
have both a recent version of perl installed and a recent version of
perl/Tk. If you are unsure whether your version is recent enough, you
or your system administrator should download the latest sources and
install them.

b. Running phrapview

Phrapview is invoked with the command

phrapview filename

where filename is the name of a ".view" file that was created by
running phrap with the -view option.

N.B. The format of the .view file is still under development. To
ensure that phrapview will read correctly the .view file produced by
phrap, make sure that both programs (phrap and phrapview) were part of
the same release.

The following information can be displayed:

(i) Basic information concerning the numbers of reads, singletons,
contigs, chimeras, etc. Each contig, and each chimeric read, is
represented by a horizontal black line proportional in length to the
contig or read size. Note that although they are shown separately in
this display, chimeric reads in fact generally are incorporated by
phrap into one of the contigs, in a region that corresponds to one of
the two chimeric pieces; the location of this is indicated by the
black "chimera match" line connecting the chimera to a contig (see
below).

The following information requires pressing an appropriate "button" to
display:

(ii) Depth of coverage. Graphs (in yellow) above and below the
contig line indicate the depth of coverage (i.e. number of reads
aligned against the contig) on each strand at each position; the
horizontal orange lines correspond to a depth of 10. In addition to
the ordinary depth of coverage (which counts all reads in the contig,
and is mainly useful to indicate regions where an abnormal deficit or
excess occurs), the "reduced depth" can be displayed. This counts only
reads that are non-chimeric, have a positive LLR score against the
contig, and have no positive LLR scoring match to a read elsewhere in
the assembly -- i.e. the "confidently placed" reads. (See phrap.doc
for a description of LLR scores). Misjoins usually occur in places
where the reduced depth is 0 on both strands.
Regions of the contig where the depth is 0 on one or both strands are
indicated in red.

(iii) "Matches" and "links". "Matches" are pairwise read matches
between reads in different parts of the assembly, and are indicated by
a single (curved or straight) line connecting the midpoints of the two
matching regions. Contig matches (between two non-chimeric reads in
the assembled contigs) and chimera matches (between a chimeric read
and itself or another read) can be displayed separately. "Links"
connect the startpoints of two reads that derive from the same
subclone: forward-reverse links correspond to reads from opposite ends
of a subclone insert, while same-strand links correspond to walking or
duplicated reads in the same direction on the insert. Links will only
be indicated properly if the read naming convention assumed by phrap
is used (see section III.1 above).
Each match or link is considered to be in one of three classes,
indicated by color: "problems" (red), which indicate serious
discrepancies or a significant possibility of incorrect, incomplete,
or non-unique assembly; "ok" (black), which tend to confirm the
assembly or are otherwise consistent with it; and "grayzone" (blue or
green), which may in some cases indicate problems but are probably ok.
Links are considered "ok" if the two read starts meet the expected
spacing and orientation constraints, otherwise they are marked as
"problems" (in many cases however these reflect subclone tracking
errors rather than assembly errors). Matches are considered "ok", and
are not shown, if they are between reads assembled in the same
location, or have LLR scores below the cutoff. Matches are considered
"problems" if they have LLR scores above the cutoff, are non-local,
and involve regions where the reduced depth is 0. "Problem" matches
between or within large contigs often indicate the presence of
near-perfect repeats, which may or may not have been misassembled; or
(in some cases) a join that was missed. "Problem" matches between a
small (e.g. singleton) contig and a large one typically represent
cases where a read could not be assembled into its proper location by
phrap, due generally to some anomaly involving it (e.g. an internal
region of low quality data). Non-local matches with LLR scores above
the cutoff which do not lie in regions of reduced depth 0 are
considered to be "grayzone". Usually these involve isolated pairs of
reads each of which extends part way into a repeat, such that the
overlapping region is too small to contain high-quality discrepancies.
The fact that the reduced depth is non-zero means that there are other
reads in the same region which do not have any positive LLR scores to
the other region, so it is unlikely that there is a significant
misassembly in such cases, although it is possible that one of the two
matching reads may have been incorrectly placed.
Chimeras generally have one "ok" match (to the site where the read
was assembled) and one or more "problem" matches to the region from
which the other part of the chimera derives (sometimes these matches
all have negative LLR scores and are thus not shown with the default
LLR cutoff). Some chimeras actually represent subclones with internal
deletions; these tend to be obvious since the two pieces map to nearby
locations in some contig and are in the same orientation there.
Double-headed arrows on matches indicate that the matching regions
are on opposite strands; double-headed arrows on links indicate that
the orientation of the two reads is inconsistent with the read names
(i.e. forward-reverse pairs that point in the same direction, or same
strand pairs that point in opposite directions; these are colored as
"problems" if the reads are in the same contig).

(iv) Low quality contig bases (bases having phrap qualities that are
lower than a specified threshold), and high quality discrepancies
(positions where there is an aligned read discrepant with the contig
base and having a phrap quality that exceeds the threshold). A
horizontal green line above the contig indicates the quality
threshhold value ("qual cutoff"). Low quality contig bases are
indicated by a vertical blue line drawn from a height corresponding to
the contig quality value, up to the threshhold line; high quality
discrepancies are indicated by a vertical blue line drawn from a
height corresponding to the discrepant read's quality, down to the
threshhold line. Thus in each case longer lines are more "serious"
(reflect a larger deviation from the threshhold, in the wrong
direction).

Passing the mouse pointer (slowly!) over an item of types ii-iv
above results in its being highlighted (converted to yellow); it
remains highlighted until the pointer is moved over another item, or a
contig line. Textual information about the highlighted item is shown
in a box in the top right region of the display.
Several parameters can be changed by entering new values in
appropriate boxes at the bottom of the display. Four different
magnifications can be changed: horizontal magnification; the "spacing
magnification" which determines the vertical spacing between contigs;
and the depth and quality magnifications, which affect the scales used
for coverage depth and quality. Values of each of these are specified
as percents of the original (startup) magnification. The role of LLR
cutoff, and qual cutoff are described above. The unalign parameter
refers to the size of the largest segment involved in the pairwise
read match that is unaligned against the contig sequence. If large,
such segments may indicate a misplaced read or (occasionally) an
otherwise undetected chimera. Min fwd-rev and max fwd-rev refer to
the minimum and maximum allowed separation (in bp) between (the starts
of) forward-reverse read pairs; i.e. these represent the minimum and
maximum allowed subclone insert size. Separations outside this range
(or in the wrong orientation) are marked as problems. Min ss and max
ss are the minimum and maximum spacing between same-strand reads (i.e.
size of a walking step).
The display is updated to reflect changes made in the above
parameters only after a relevant button (Show Contig Matches, etc.) is
pressed.

Colors can be changed by altering the variable definitions that occur
near the beginning of the phrapview source file, and re-running the
program.

3. Gap

A version of phrap ("gcphrap", for "gap-compatible phrap") suitable
for use with the Staden gap4 package may be created as follows: Obtain
relevant source files from the Staden web site
(http://www.mrc-lmb.cam.ac.uk/pubseq/index.html). Put these in the
same directory as the phrap source files, and enter the command

> make gcphrap

Consult the Staden web site for instructions on using gcphrap in
conjunction with gap4, including input and output file formats. In
addition to the usual phrap command line options, gcphrap accepts an
option -exp, which is used to specify the name of a directory in which
the output experiment files will be placed.

VI. PHRAP OUTPUT

Phrap output includes the "standard output", the "standard error", and
a series of output files. The latter receive names that consist of the
input read file name, with an appropriate suffix (e.g. ".ace")
appended.

The standard output is described below. The other files
are

(i) The standard error (which is normally written to the terminal,
but may be redirected to a file). This contains information
indicating the point in the run that phrap has reached, summary
results for some of the steps, and various warning and error messages.
(ii) The .contigs file. This is a FASTA file containing the contig
sequences (without pads; bases in this file are upper case if and only
if the quality is >= qual_show). These include singleton contigs
consisting of single reads with a match to some other contig, but that
couldn't be merged consistently with it.
(iii) The .contigs.qual file. This has the phrap-generated qualities
for the contig bases.
(iv) The .singlets file. A FASTA file containing the singlet reads
(i.e. the reads with no match to any other read).
(v) The .log file. This includes various diagnostic information, and
a summary of aspects of the assembly; it is probably only of interest
to me, for trouble-shooting purposes.
(vi) The .problems file. Again this is probably only of interest to
me.
(vii) The .ace file (produced when the options -new_ace or
-old_ace is used). This file is required for viewing the assembly
using consed. It's format is described in the consed documentation.
(vii) The .view file (produced when the option -view is
used). Required for viewing the assembly using phrapview.

Standard output:

The standard output has the goal of giving as complete a summary as
possible of the final assembly, including data anomalies and possible
sites of assembly error. The information it provides includes a list
of the contigs and the reads they contain, information about the
quality of the alignments of the reads against the contig sequence,
matching regions within and between contigs, suspect reads (probable
deletions or chimeras), sites of possible errors in the assembly or
contig sequence (e.g. discrepancies between reads and contig
sequence), and results of the forward/reverse read consistency checks.

The output includes, in order:

(i) Summary information about the run conditions (command line,
phrap version number, and parameter settings) and the input data
file(s) (sequence and quality, template/read counts and read
orientations, chemistry). It is worth reviewing this information
routinely for possible problems with the input data, such as parameter
misspecification, read nomenclature that is causing faulty inferences
by phrap regarding templates or chemistry, or a missing quality file.

(ii) Low-quality regions at beginnings or ends of reads which
consist largely (>50 %) of a single nucleotide, due (usually) to
degenerate signal. These are internally converted to N's, to prevent
them from causing spurious matches which could interfere with
assembly. (Note that the .ace and .singlets files output by phrap will
contain the N's in place of the original sequence).

(iii) Exact and near-exact duplicate reads. Exact duplicates, which
probably represent duplicate entry of the same data, are excluded from
assembly (this feature may be turned off using the option
-retain_duplicates). The near-exact duplicates are only listed if they
are from different subclones (insofar as this information is available
to phrap!).

(iv) Probable unremoved sequencing vector, detected as regions at
beginnings of reads which match only the beginnings of other reads.
Phrap's method for identifying these is not foolproof and should not
be considered a substitute for screening out sequencing vector as
described above (III.4). If vector has been removed in that manner,
the "unremoved vector" detected here may simply be inaccurate vector
sequence, that did not match the correct sequence in the original
cross_match screen but does match other reads in which similar errors
occur. You may want to consider adding the (potentially erroneous)
sequence that is displayed in the phrap output to your vector file,
and redo the cross_match vector screen, to increase the likelihood
that you catch all of the vector. I would appreciate hearing of any
examples of sequence that is not vector being erroneously labelled as
such by phrap.

(v) "Internal read matches", i.e. reads which have off-diagonal same
orientation matches to themselves (due e.g. to short tandem repeats).
This information is ignored at present, but ultimately will be used to
improve assembly of tandem repeat regions. Due to the use of
complexity-corrected Smith-Waterman scores some commonly occurring
short microsatellites may not be detected here.

Node-rejected pairs.

(vi) Multi-segment reads, defined as reads whose confirmed parts
break into two or more pieces. These include chimeras and non-chimeric
"single links". They are omitted from the initial merging passes to
reduce the risk of false joins in the assembly, but are incorporated
in later merging passes; as a result, chimeric reads will generally be
assembled into a contig, but should be tagged as chimeric.

(vii) Probable deletion reads, identified as reads which match some
other read but have a gap with respect to it of >10 bp (more
specifically, read 1 is determined to have a deletion with respect to
read 2 if there are two pairs of matching segments between the reads
such that the segments in read 1 are nearly adjacent (within 20 bases)
and the separation between the segments in read 2 is at least 10bp
larger than the separation in read 1), and for which there is no other
clone that shows the same deletion (the latter condition is needed to
rule out the possibility of repeated sequences for which some copies
have deletions). For each deletion clone and each such matching read,
the apparent size of the deletion, location of the deletion and (in
parentheses) name of the matching (undeleted) read and extent of the
segment that was deleted from in it are given.
The probable deletions are not used in assembly and instead appear
as singleton contigs.

Revised quality (= phrap quality), and LLR score histograms, for two passes.

(viii) Summary of "rejected" pairwise alignments; i.e. matches which
either prematurely terminate (do not extend as far as they should,
given the apparent sequence quality), or include mismatches between
high quality bases. In such cases the reads are probably from
different repeats, and so the alignments are not used in the assembly
(N.B. at present incompletely extended matches are considered during
the assembly, since a more sensitive criterion used there will reject
them if warranted).

(ix) (N.B. This part of output needs revision). Summary of
information about accepted pairwise alignments, including no. of
confirmed reads (i.e. reads matching some other read), their average
lengths (total, confirmed, "strongly confirmed" (i.e. matching a
reverse sense read) and trimmed); a crude estimate of the clone size
(assuming all non-rejected matches are real -- if there are
near-perfect repeats, the size may be underestimated) and depth of
coverage; a histogram of depths (letting the depth of a read be the
maximum depth that occurs in it); a summary of the discrepancies
between matching reads (by strand sense, nucleotide and quality); and
a histogram of spacings between adjacent indel pairs. By sense of the
aligned pair (reads in same direction, reads in reverse directions).

Blocked reads
# reads lacking a high-quality segment
Bypassed reads

(x) Isolated singlets. These are reads having no non-vector match
(with score at least minscore) to any other read. Shown for each read
are its length, and the number of high-quality non-X bases. For most
reads, the latter number should be 0, indicating that the read is
either entirely vector (its high-quality part has been completely X'd
out) or is of very low quality; reads for which this value is positive
may be contaminants.

(xi) Contigs (including "singleton contigs" consisting of a single
read -- these are cases in which the read did have a match with some
other read(s), but could not be consistently assembled with it). The
contigs are sorted by (increasing) number of reads, so that the
singleton contigs appear first. The following information appears for
each contig: the number of reads; the total length of the contig
sequence (not including pads); the length of the trimmed sequence
(i.e. after regions consisting entirely of lower case letters, N, and
X have been removed from either end of the contig); and a list of the
reads in the contig and information about their alignments. For each
read the following information is given: a 'C' if the read is in
reverse orientation; the starting and ending positions for the read
(i.e. its full length including vector and low-quality data, not just
the part that is alignable) with respect to the contig sequence;
the read name; the score of the alignment of the read against the
contig sequence; (in parentheses) the highest score of a match of the
read against some other read in a different contig or elsewhere in the
same contig (so reads for which this number is non-zero are those
which overlap a repeat, or an incorrectly or incompletely assembled
region); the per cent mismatch, insertion, and deletion rates for the
alignment of the read against the contig sequence; the number of bases
at the beginning of the read which were not included in the alignment;
(in parentheses) the number of bases at the beginning of the read
which did not align against any other read (i.e. were not confirmed);
and similarly for the bases at the end of the read. If the
unparenthesized number is substantially greater than the parenthesized
number that accompanies it, there could be a problem with the
alignment. Note however that different penalties are used for aligning
a read against another read as opposed to aligning it against the
contig sequence.

Following the read list is information to be used in assessing the
quality of the assembly. This includes:

A description of regions in the contig sequence whose quality has
been adjusted on the basis of alignments of reads to the contig (e.g.
regions that had double-stranded confirmation in the initial pairwise
comparison may not have when the contig is assembled, due to
resolution of repeats; and regions that were not confirmed in the
initial pairwise comparisons may be confirmed in the assembled contig,
due to the use of a lower gap penalty which gives more complete
alignments).

Histogram of the qualities of the contig bases; the extents of the
leading and trailing quality 0 regions of the contig (such regions are
likely to be highly inaccurate); and a list of the bases that have
quality < qual_show.

Slack and HQ Mismatch histograms (not yet documented here).

A table showing "gaps" on each strand (i.e. regions for which there
is no aligned part of any read, so that additional data, or possibly
editing to permit alignment of existing data, is required); the
closest read that could potentially be extended or edited to cover the
gap; whether or not the inaccurate, unaligned part of that read
already covers the gap (in which case editing might be sufficient to
close it); and the length of an accurate extended read (from the same
start in the same clone) that would be required to cover the gap. The
table appears following the list of reads in the contig and their
positions. An example (from a contig in C05D11):

Gap Size Closest read (Start) Covers Read length required
now? to cover
Top strand:
left - 244 244+
1809 - 1880 72 x72c5.s1 (1382) No 499
2967 - 2993 27 ae82b07.s1 (2562) Yes 432
5392 - 5536 145 z17c5.s1 (4965) No 572
6036 - 6320 285 z13f6.s1 (5537) No 784
8137 - right 0+ z26c9.s1 (7667) No 469+

Bottom strand:
left - 0 0+ z17b4.s1 ( 563) No 563+
1138 - 1139 2 z14f3.s1 (1143) Yes 6
6940 - 7230 291 z17c10.s1 (7666) No 727
7717 - right 420+

The gaps that say "left" or "right" are the ones at the left and
right ends of the contig (the gap size in this case is the part of the
existing contig sequence at that end that is not double stranded, with
the plus indicating that there is an additional gap to the left or the
right of the contig; these gaps are always designated not covered
("No")), and the relevant read in this case is the one that should be
extended to close gaps between contigs, while the other gaps are
internal gaps and the relevant reads are the ones required to get
double-stranding. Note one problem in this example: the bottom strand
gap of size 2 lies in the inaccurate 5' part of the z14f3.s1 read, so
that would probably not be an appropriate read to edit or get more
data from. Need appropriate cutoff values (i.e. size of 5' part of
read to ignore; this obviously depends on whether the 5' end has
already been trimmed prior to inputting it to phrap). Also, note that
since the full read lengths are used by phrap, the ends of the contigs
may include some very inaccurate sequence; so the size of the end gaps
(i.e. the total amount of inaccurate sequence there) could be larger
than indicated; this isn't particularly important though since in any
case one will always want to extend those reads to try to close the
gaps between contigs.

Phrap uses a lower gap penalty (-2) in aligning reads to the contig
sequence (to allow more complete double-stranding than the -9 default
used for aligning reads to each other).

Gaps with "No" in the Covers? column will always require more data (an
extended read or walk step, depending on the required length). Those
with "Yes" may be coverable by editing of the appropriate read; but it
may be simpler to automatically get longer reads for them also,
particularly if the gap size is large, since substantial editing would
probably be required in that case to use the existing data. Walking
primers could be output directly from OSP analysis of the contig
sequence (high quality part, i.e. all upper case letters) in cases
where the gap looks too large to close with an extended read
(presumably "finish" already does this?)

Following the gap table, there is a table giving, for each quality
value, the total number of bases of that quality in the reads for the
contig; the number which are aligned (i.e. included in the SWAT
alignments of the reads against the contig); the cumulative no. of
aligned bases (for this and higher qualities); the number of bases not
included in the alignment (but potentially alignable -- i.e. not
lying before the beginning or after the end of the second sequence);
the number of discrepancies of each type (substitution, deletion,
insertion); the total # of discrepancies (for this quality level) and
their percentage (of the aligned bases of the given quality), and the
cumulative # of discrepancies and their percentage (of the cumulative
no. of aligned bases). The regions at the ends of the sequence that
consist entirely of quality 0 are reassigned quality values of -1,
since these usually correspond to the lowest quality data at the
extreme ends of reads. (Not done in the contig quality histogram.)
An 'N' in either a read or the contig sequence is always counted as a
substitution error.

Depth 0 regions.

Following this there is more specific information about possible
problems with the assembly:

Unaligned segments: parts of reads of length at least 10 that did not
align to the contig sequence (the part of the read having quality -1
is not considered).

High quality discrepancies: sites in the contig sequence for
which at least one read with a quality at least qual_show disagrees
with the contig sequence. Most errors in the contig sequence will be
found either among these sites or in the regions where the contig
quality is < qual_show.

Low quality suspects: sites in the contig sequence of quality <
qual_show (but greater than 0) for which there is a discrepant read of
equal or greater quality. Such sites are particularly error-prone.
The list gives the site position, base in the contig sequence and its
quality, and quality of the discrepant read.

[Formerly also printed out: for each of the two strands, the # of
reads on that strand having a substitution, insertion or deletion
(with respect to the contig sequence) at that position, and depth (#
of reads whose alignments extend across that position). The site
position is preceded by a '<' (resp. '=') if there is a read with a
discrepant higher (resp. equivalent) quality base at that position.]

There then is given a list of the regions that are not covered by
"unique-reads" (i.e. reads with no non-rejected matches); and of
regions that match (as detected by one or more non-rejected pairwise
alignment between reads) other contigs or other regions in the same
contig. These are likely to be either true near-perfect repeats or
reflections of an error (of omission) in the assembly. To help
distinguish these possibilities, such regions are designated spanned
(if some read in the current contig extends all the way across it and
includes sequence to either side of it; a list of all such reads is
given) or "UNSPANNED" (if no such read exists). Misassembly errors of
commission generally arise from true near-perfect unspanned repeats,
and should result in an "UNSPANNED" flag for at least one of the two
matching regions. However not all of them do -- while the fact that
repeats exist should generally be detected by phrap, the full extent
of the repeat may in some cases not be detected, and the apparent
repeat may be "spanned" even though the full (true) repeat is not.
Moreover in some cases of misassembly only one of the matching contig
regions may be flagged as unspanned. Misassembly errors of omission
should always result in "UNSPANNED" matching regions at the end of one
or both contigs.
Regions of a non-singleton contig matching an anomalous (chimeric or
deleted) read are indicated only in the output for the singleton contig
that contains the anomalous read.

(xii) Following the contig information, there is a summary of the
forward/reverse read consistency checks. For each pair of reads with
the same clone name are given the contig no. and orientation within the
contig (0 = top strand, 1 = bottom strand) for each read; an "L" (for
"link") if the reads occur in different contigs; a '*' if there is an
inconsistency (e.g. forward and reverse reads which are not pointing
towards each other, or two forward reads which are not on the same
strand); distance between read starts for forward/reverse pairs
(should correspond to clone insert size, and be positive); and
distance between starts for forward/forward or reverse/reverse pairs.

N.B. IN ALL PLACES WHERE A POSITION IS GIVEN, THE POSITION IS WITH
RESPECT TO THE UNPADDED SEQUENCE, NOT THE PADDED SEQUENCE.

VII. CROSS_MATCH OUTPUT

In addition to the standard output, the files produced include the
standard error and .log files (as for phrap); and the .screen file
(produced when the option -screen is used).

Standard output

The standard output lists matches between any sequence in the first
input file (the "query" sequences) and a sequence in the second (or
later) input file (the "subject" sequences); or, if a single input
file is provided, the matches between any two sequences in this file.
The matches that are reported are controlled by the command line
options -minscore and -masklevel, as well as by the options that
control scoring of the alignments and the band search (see section IV
above). The reported matches are ordered by query, and for each query
by the position of the start of the alignment within the query.
For each reported match, an initial output line gives summary
information:

Example:

440 2.38 1.39 0.79 hh44a1.s1 33 536 ( 0) C 00311 ( 3084) 8277 7771 *

Interpretation:
440 = smith-waterman score of the match (complexity-adjusted, by default).
2.38 = %substitutions in matching region
1.39 = %deletions (in 1st seq rel to 2d) in matching region
0.79 = %insertions (in 1st seq rel to 2d) in matching region
hh44a1.s1 = id of 1st sequence
33 = starting position of match in 1st sequence
536 = ending position of match in 1st sequence
(0) = no. of bases in 1st sequence past the ending position of match
(so 0 means that the match extended all the way to the end of
the 1st sequence)
C 00311 : match is with the Complement of sequence 00311
( 3084) : there are 3084 bases in (complement of) 2d sequence prior to
beginning of the match
8277 = starting position of match in 2d sequence (using top-strand
numbering)
7771 = ending position of match in 2d sequence
* indicates that there is a higher-scoring match whose domain partly
includes the domain of this match.

Following this line there appears (if the flag -discrep_lists is
specified) a listing of the discrepancies between the aligned regions
of the two sequences, giving for each discrepancy its type, position,
nucleotide, and quality (in the query sequence), and its position in
second sequence. This list is followed (if -discrep_tables is
specified) a table giving, for each quality value, the total number of
bases of that quality in the first sequence; the number which are
included in the SWAT alignment; the cumulative no. of aligned bases
(for this and higher qualities); the number of bases not included in
the alignment (but potentially alignable -- i.e. not lying before the
beginning or after the end of the second sequence); the number of
discrepancies of each type (substitution, deletion, insertion); the
total # of discrepancies (for this quality level) and their percentage
(of the aligned bases of the given quality), and the cumulative # of
discrepancies and their percentage (of the cumulative no. of aligned
bases). This information is useful for computing accuracy as a
function of quality, for automatically generated contig sequences.
In the discrepancy listing and table, the regions at the ends of the
sequence that consist entirely of quality 0 are reassigned quality
values of -1, since these usually correspond to the lowest quality
data at the extreme ends of reads. (Not done in the phrap output...)
An 'N' in either sequence is always counted as a substitution error.
If the flag -alignments is specified, a complete alignment for each
reported match is displayed.

VIII. EST ASSEMBLIES
(TO BE ADDED)

IX. PROBLEMS

1. Insufficient memory.

If the run stops prematurely, displaying the message

FATAL ERROR: REQUESTED MEMORY UNAVAILABLE

then you need to increase the amount of memory alloted to you by the
operating system. This can usually be done without obtaining
additional physical RAM for your computer -- it is usually enough just
to have the operating system allocate additional virtual memory to
your process. (However if you are substantially exceeding physical RAM
the run may take an exceptionally long time to complete). On Unix
systems this can be done using the "limit" command, as follows:

i. Run "limit" with no arguments to list the system resources currently
available to you. For example (this output is for a DEC alpha
computer -- the output on other computers may look somewhat
different):

> limit
cputime unlimited
filesize unlimited
datasize 131072 kbytes
stacksize 2048 kbytes
coredumpsize unlimited
memoryuse 699392 kbytes
descriptors 4096 files
addressspace 1048576 kbytes

The relevant parameter here is datasize, currently set to
about 131 megabytes.

ii. Now run limit with the option -h to see what the maximum possible
values for these parameters are:

> limit -h
cputime unlimited
filesize unlimited
datasize 1048576 kbytes
stacksize 32768 kbytes
coredumpsize unlimited
memoryuse 699392 kbytes
descriptors 4096 files
addressspace 1048576 kbytes

This shows that datasize can be increased to 1048576 kbytes, or about
1 Gb. It may be possible to increase this even further by altering
the operating system configuration (this may require rebuilding the
kernel and is best left to an expert).

iii. Run limit once again, providing the desired value for datasize, e.g.:

> limit datasize 500000

which increases datasize to 500 megabytes. Any value smaller than the
value returned in step ii can be used.

iv. Now rerun phrap. Note that the new datasize value will only apply
to the particular shell (window) in which you execute the limit
command, so you may need to repeat step iii next time you log in or if
you switch to a different window.

2. Other phrap- or cross_match-generated error messages
(TO BE ADDED)

3. Phrap- or cross_match-generated warning messages

These can generally be ignored, unless there is something obviously
wrong or incomplete about the assembly
(TO BE ADDED)

4. "Crashes" reported by operating system

True "crashes" (e.g. segmentation fault, floating point exception)
generally indicate a bug in the program. I would greatly appreciate
having any such problem reported to me, as described below.

5. Long running time.

If the run appears to be "stuck" or takes an exceptionally long time
to complete, try using a larger value for the parameter -minmatch
(above, section IV.2)

6. Misassemblies, incomplete assemblies, incorrect selection of read for
consensus sequence.
Marked discrepancy not split ...
(TO BE ADDED)

7. Reporting problems

If you have a problem that is not addressed by any of the above,
first be sure you have read sections I-IV of the documentation and
that you are using a current version of the programs. Then repeat the
run in which the problem occurred, capturing the standard error to a
file. For example, the following command, run on a Unix computer in a
C shell, captures the standard output to a file phrap.out and the
standard error to a file stderr.file:

(phrap reads.screen > phrap.out) >& stderr.file

Then send a copy of the standard error (only! -- not the standard
output), i.e. the file stderr.file in the above example, to me at the
following email address:

phg@u.washington.edu

(If the standard error is more than 1 or 2 pages you have probably
inadvertently captured the standard output.) Please do NOT send the
standard output, datasets, or any other large file without making
prior arrangements to do so with me! Large files cannot be sent to the
above address, which has a limited disk quota on a Univ. of Washington
computer.

APPENDIX: ALGORITHMS

[N.B. MUCH OF THE FOLLOWING DESCRIPTION IS SOMEWHAT OUT-OF-DATE].

Outline of phrap assembly:

0) Read in sequence & quality data, trim off any near-homopolymer runs
at ends of reads, construct read complements.

1) Find pairs of reads with matching words. Eliminate exact duplicate
reads. Do swat comparisons of pairs of reads which have matching
words, compute (complexity-adjusted) swat score.

2) Find probable vector matches and mark so they aren't used in assembly.

3) Find near duplicate reads.

4) Find reads with self-matches.

5) Find matching read pairs that are "node-rejected" i.e. do not have
"solid" matching segments.

6) Use pairwise matches to identify confirmed parts of reads; use these to
compute revised quality values.

7) Compute LLR scores for each match (based on qualities of discrepant and
matching bases).

(Iterate above two steps).

8) Find best alignment for each matching pair of reads that have more
than one significant alignment in a given region (highest LLR-scores
among several overlapping).

9) Identify probable chimeric and deletion reads (the latter are
withheld from assembly).

10) Construct contig layouts, using consistent pairwise matches in
decreasing score order (greedy algorithm). Consistency of layout is
checked at pairwise comparison level.

11) Construct contig sequence as a mosaic of the highest quality parts
of the reads.

12) Align reads to contig; tabulate inconsistencies (read / contig
discrepancies) & possible sites of misassembly. Adjust LLR-scores of
contig sequence.

Phrap adjusted quality values/error probabilities: Phrap computes
adjusted quality values for each read on the basis of read-read
confirmation information, as follows. If a read is confirmed by an
opposite-strand or different-chemistry (dye terminator vs. dye primer)
read at a given position, that position is given a quality which is
the sum of the two input read qualities; when more than one
opposite-strand read confirms a given position, only the single
highest quality from all opposite-strand matching reads is used. If
qualities are related to error probabilities as described above, this
procedure can be interpreted as computing an error probability for
each base which is the product of the error probabilities for the two
reads; since error profiles for opposite-strand or different chemistry
reads are essentially independent, this is reasonable, although it is
somewhat conservative in that it does not take into account
same-strand matches (which clearly should count for something,
although they certainly are not independent).
Contig positions are assigned quality values equal to the highest
adjusted quality of any read at that position, and then adjusted
downward to take into account any discrepancies with other reads. As a
result, when phred input qualities are used, the output phrap
qualities associated to each contig position have a natural
interpretation as (conservative) error probabilities. Such error
probabilities provide an extremely useful guide to where editing or
additional data collection is needed.
The phrap adjusted qualities are used in computing "LLR scores" for
each pairwise match between two reads. These scores take into account
the qualities of the base calls in the reads, and are (approximate)
log-likelihood ratios for comparing the hypothesis that the reads
truly overlap to the hypothesis that they are from 95% similar
repeats. The point is that discrepancies between overlapping reads are
due to base-calling errors and thus tend to occur in low-quality
bases, whereas reads from different repeats can have high-quality
discrepancies that are due to sequence differences between the
repeats; and the probability of the observed data under each
hypothesis can be quantified using the interpretation of the phrap
qualities in terms of error probabilities. A pairwise match tends to
have a positive LLR score if the two reads overlap, whereas it tends
to have a negative LLR score if the two reads are from different
repeats (unless the repeats are nearly identical).

Description of algorithms:

0) The procedure used in both phrap and cross_match to find
sequence matches is the following. First, (in phrap only) any region
at the beginning or end of a read that consists almost entirely of a
single letter is converted to 'N's; such regions are highly likely to
be of poor data quality which if not masked can lead to spurious
matches. Reads are then converted to uppercase (in order to allow
case-insensitive word matches). All matching words of length at least
minmatch between any pair of sequences are then found, by (i)
constructing a list of pointers to each position (in each sequence)
that begins a word of at least minmatch letters not containing 'N' or
'X'; (ii) sorting the list (using a modified version of quicksort,
with string comparison as the comparison function + a few tricks to
improve speed by keeping track of the parts of the words that are
known already to match); (iii) scanning the sorted list to find pairs
of matching words. For each such pair, a band of a specified width
(in the imaginary dot matrix for the two sequences), centered on the
diagonal defined by the matching words, is defined. Overlapping bands
(for the same pair of sequences) are merged. Following construction
and merging of bands, a "recursive" SWAT search of each band is used
to find matching segments with score greater than or equal to
minscore: "recursive" here means that if such a match is found, the
corresponding aligned segments in each sequence are (conceptually) X'd
out and the process repeated on the remaining portions (if any) of
each sequence. This procedure allows (at the cost of some redundant
calculation) detection of multiple matching pieces in different
locations, and will usually find most copies of repeats (since they
generally occur in separate bands).
By default, SWAT scores are complexity-adjusted (so that matches for
which the collection of matching nucleotides has biassed composition
have their scores significantly penalized.)

1) In phrap, sequence pairs with score >= minscore are considered for
possible merging. A critical issue here is the appropriate score
matrix for SWAT. [Given the generally high accuracy of reads and the
desirability of minimizing false matches due to imperfect repeats, a
relatively high penalty seems desirable. Currently I use +1 for a
match, -9 for a mismatch involving A,C,G, or T, 0 for a match or
mismatch involving N, -1 for a match or mismatch involving X, -11 for
the gap initiating penalty (the first residue in a gap), and -10 for
the gap extension penalty (each subsequent residue).] Setting the
indel penalties slightly higher than the mismatch penalties gives
better alignments by favoring mismatches in compression regions. Since
SWAT uses profiles, one has the option of distinguishing different
quality levels by use of different symbols (e.g. upper case for more
accurate calls, lower case for less accurate calls) and setting the
penalties appropriately; this would involve using the quality levels
to adjust the symbols used in the reads, which should be done AFTER
the word matching routine. (At present it does not seem particularly
useful to use differing positive scores for different nucleotides to
reflect their different frequencies).

1a) Determine "confirmed" part of each read (i.e. the part which
appears in a SWAT alignment against some other read; a read with the
same name -- up to an internal '.' if any -- is not considered
confirming). Probable chimeras are detected as reads for which the
confirmed part can be separated into two non-overlapping pieces,
separated by at most MAX_CHIMERA_GAP bp (currently 30) such that the
part confirmed by a given read lies in one or the other piece but not
both; AND such that each piece has a prematurely terminating alignment
with some other read. Chimeric reads may arise from i) chimeric
clones; ii) gel mistracking across lanes; iii) unremoved sequencing
vector; (iv) deletion clones. Reads having two non-overlapping
confirmed pieces (but failing the "premature termination" condition)
are often non-chimeras that are the only link between two
non-overlapping contigs, and thus should be permitted in the assembly.
Identify "strongly confirmed" regions in each read: having matches
to a reverse sense read.
Identify deletions.
Identify "rejected" alignments -- those which don't extend as far as
they should, or that have mismatches involving high quality bases.

2) Sort all matching pairs by decreasing LLR score, and assemble
layout by progressively merging pairs with high score. Consistency of
merge is required: the SWAT alignment implies relative offset of one
contig with respect to another, and hence a potential implied overlap.
The SWAT alignment(s) should extend over essentially the entire region
of implied overlap, apart from an allowable gap (necessary to allow
for the fact that the ends of the alignments are somewhat uncertain).
Probable deletion clones (identified as reads which have two adjacent
pieces matching two separated pieces of another read, and having no
confirming read across the breakpoint), and chimeras are not used in
any merges. [N.B. Following is obsolete: Potential chimeras are
deferred to a second pass (as well as being flagged in output), so
that true reads will assemble first.]

3) Construct contig "consensus" as a mosaic of individual reads.
Strategy: ends of alignments, and midpoints of perfectly matching
segments of sufficient length, define crosslinks - pursue crosslinks
which increase accuracy and extend read. Formally this is done by
constructing a weighted directed graph whose nodes consist of
(selected) positions in reads; there are bidirectional edges with
weight 0 between aligned bases in overlapping reads, and
unidirectional edges from 5' to 3' positions within a single read,
with weight equal to the total quality of the sequence between the two
nodes. Standard C.S. algorithms (dating back to Tarjan) permit
identification of a path with maximal weight, in time linear w.r.t.
number of nodes.
The quality values for the resulting sequence are inherited from the
read segments of which it is composed.