Whole genome analysis of segmental duplications (Dec 2005)
This project is a collaboration among:
Haixu Tang (Indiana University, Bloomington)
Pavel A. Pevzner (University of California, San Diego)
Zhaoshi Jiang and Evan E. Eichler (University of Washington, Seattle).
This analysis is based on human genome build 35, 2004.
Total input: 28856 alignments (gzipped)
Clustering the duplications into groups (overlap length 500) results in:
1635 single duplications (gzipped) and
666 duplication groups
(gzipped tar files of the alignments of each group).
The clustering procedure is as follows. We build a graph
in which every duplication alignment is represented as a node.
If a segment in one duplication alignment overlaps
a segment in another duplication alignment by at least a
certain length (500 by default), we add an
edge between the corresponding nodes.
Finally, we define the clusters (groups) of duplications
as the connected components of this graph.
The single duplications are those that
do not overlap any other duplication (isolated
nodes in the graph).
By construction, segments from different clusters of
duplications do not overlap each other (beyond the length threshold),
so each cluster can be handled separately.
We build the repeat graph for each
group and derive its sub-repeats one group at a time.
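The clustering procedure described above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the Segment record, the coordinate convention, and the function names are assumptions, and the quadratic pairwise comparison is chosen for clarity rather than speed. Connected components are found with a union-find structure.

```python
from collections import defaultdict
from typing import NamedTuple


class Segment(NamedTuple):
    # Hypothetical record; half-open coordinates are an assumed convention.
    chrom: str
    start: int
    end: int


def overlap_length(a, b):
    """Number of bases shared by two segments (0 if on different chromosomes)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def cluster_alignments(alignments, min_overlap=500):
    """Group alignments (pairs of Segments) into connected components.

    Returns (groups, singles): groups is a list of lists of alignment
    indices, singles is a list of indices of isolated alignments.
    """
    parent = list(range(len(alignments)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # Add an edge whenever any segment of one alignment overlaps any
    # segment of another by at least min_overlap bases.
    for i in range(len(alignments)):
        for j in range(i + 1, len(alignments)):
            if any(overlap_length(s, t) >= min_overlap
                   for s in alignments[i] for t in alignments[j]):
                union(i, j)

    components = defaultdict(list)
    for i in range(len(alignments)):
        components[find(i)].append(i)

    groups = [m for m in components.values() if len(m) > 1]
    singles = [m[0] for m in components.values() if len(m) == 1]
    return groups, singles
```

For the real input of 28856 alignments, sorting segments by chromosome and start position and sweeping over them would avoid comparing every pair, but the resulting groups would be the same.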
The sub-repeat lists (in gzipped tar format) for each of the above groups
can be found here. There are accordingly 666
files, each corresponding to one duplication cluster.
The repeat graph description files can be found here.
The graph files in Graphviz
format can also be found here; they can be converted
into PostScript or other image formats with the Graphviz software.
Suppose the duplication group name is something like
tmp-xxxx.aln, then the sub-repeat list file name will
be tmp-xxxx.aln.txt; the sub-repeats connectivity result
file will be tmp-xxxx.aln.bin.len.code.pair; the
graph description file will be watch.tmp-xxxx.aln.simp;
and the graphviz format graph file will be
tmp-xxxx.aln.simp.gvz.
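The naming convention above can be written as a small helper, which may be handy when scripting over all groups. This is a sketch: the function names and dictionary keys are made up for illustration, and the choice of the dot layout program (rather than another Graphviz layout tool) for rendering is an assumption.

```python
def derived_files(group_name):
    """Map a duplication group name (e.g. tmp-xxxx.aln) to its result files."""
    return {
        "sub_repeat_list": group_name + ".txt",
        "connectivity": group_name + ".bin.len.code.pair",
        "graph_description": "watch." + group_name + ".simp",
        "graphviz": group_name + ".simp.gvz",
    }


def dot_command(gvz_file, fmt="ps"):
    """Build a Graphviz command line that renders gvz_file to the given format.

    The output file name (input name plus the format suffix) is an
    arbitrary choice made for this example.
    """
    return ["dot", "-T" + fmt, gvz_file, "-o", gvz_file + "." + fmt]


if __name__ == "__main__":
    files = derived_files("tmp-xxxx.aln")
    for kind, name in files.items():
        print(kind, name)
    # Pass this list to subprocess.run(...) if Graphviz is installed.
    print(" ".join(dot_command(files["graphviz"])))
```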
Among these groups there is one giant group, containing
about half of the alignments, which has 15700 vertices and
26549 edges. The summary of tangles from its repeat graph
can be found here.
We also project the graphs onto every human chromosome.
See the analysis of each chromosome here.
Please contact Haixu Tang
for details.