Whole genome analysis of segmental duplications (Dec 2005)

This project is a collaboration between:

Haixu Tang (Indiana University, Bloomington)
Pavel A. Pevzner (University of California, San Diego)
Zhaoshi Jiang and Evan E. Eichler (University of Washington, Seattle).

This analysis is based on human genome build 35 (2004). Total input: 28856 alignments (gzipped).
Clustering the alignments into duplication groups (overlap length 500) results in:
1635 single duplications (gzipped) and
666 duplication groups (gzipped tar files of the alignments of each group).

The clustering procedure is as follows. We build a graph in which every duplication alignment is represented as a node. If a segment of one duplication alignment overlaps a segment of another duplication alignment by a certain length (500 by default), we add an edge between the corresponding nodes. We then define the clusters (groups) of duplications as the connected components of this graph. The single duplications are those that do not overlap any other duplication (the isolated nodes of the graph). Segments from different clusters therefore do not overlap one another at the chosen threshold, so the clusters can be handled separately: we build the repeat graph for each group and derive its sub-repeats one by one.
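The clustering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the analysis: each alignment is assumed to be a pair of segments of the form (chromosome, start, end) with half-open coordinates, "overlap of a certain length" is read as "at least min_overlap bases", and the names overlap_length and cluster_alignments are hypothetical. Connected components are found with a union-find structure.

```python
from collections import defaultdict


def overlap_length(a, b):
    """Number of bases shared by two segments (chrom, start, end), half-open."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def cluster_alignments(alignments, min_overlap=500):
    """Group alignments into connected components of the overlap graph.

    alignments: list of (segment, segment) pairs (assumed input layout).
    Returns (groups, singles): lists of alignment indices.
    """
    parent = list(range(len(alignments)))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            parent[ry] = rx

    # An edge joins two alignments if any segment of one overlaps any
    # segment of the other by at least min_overlap bases (assumption: >=).
    for i in range(len(alignments)):
        for j in range(i + 1, len(alignments)):
            if any(overlap_length(s, t) >= min_overlap
                   for s in alignments[i] for t in alignments[j]):
                union(i, j)

    components = defaultdict(list)
    for i in range(len(alignments)):
        components[find(i)].append(i)

    groups = [m for m in components.values() if len(m) > 1]
    singles = [m[0] for m in components.values() if len(m) == 1]
    return groups, singles
```

The pairwise loop is quadratic in the number of alignments, which keeps the sketch short; for an input of the size reported here, sorting segments by chromosome and start position and sweeping over them would likely be needed instead.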

The sub-repeat lists (in gzipped tar format) of each of the above groups can be found here. There are 666 files, each corresponding to one duplication cluster. The repeat graph description files can be found here. The graph files in graphviz format can be found here; they can be converted to PostScript or other image formats with the graphviz software. If the duplication group name is tmp-xxxx.aln, then the sub-repeat list file is named tmp-xxxx.aln.txt; the sub-repeat connectivity result file is tmp-xxxx.aln.bin.len.code.pair; the graph description file is watch.tmp-xxxx.aln.simp; and the graphviz format graph file is tmp-xxxx.aln.simp.gvz.
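The naming convention above can be captured in a small helper. This is a sketch only: the function names and dictionary keys are hypothetical, and the conversion command assumes that the .gvz files are in the DOT language read by the graphviz dot program, which the page does not state explicitly. The command is built but not executed.

```python
def derived_file_names(group_name):
    """Map a duplication group name (e.g. 'tmp-xxxx.aln') to its result files."""
    return {
        "sub_repeat_list": f"{group_name}.txt",
        "connectivity": f"{group_name}.bin.len.code.pair",
        "graph_description": f"watch.{group_name}.simp",
        "graphviz": f"{group_name}.simp.gvz",
    }


def dot_command(group_name, fmt="ps"):
    """Build a graphviz command that renders the group's graph file.

    Assumes the .gvz file is DOT input; the output file name is an
    arbitrary choice made for this example.
    """
    gvz = derived_file_names(group_name)["graphviz"]
    return ["dot", f"-T{fmt}", gvz, "-o", f"{gvz}.{fmt}"]


if __name__ == "__main__":
    # 'tmp-xxxx.aln' is the placeholder name used in the text above.
    for kind, name in derived_file_names("tmp-xxxx.aln").items():
        print(kind, name)
    print(" ".join(dot_command("tmp-xxxx.aln")))
```

The returned list can be passed to subprocess.run once graphviz is installed and the graph file has been downloaded.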

Among these groups there is one giant group that contains about half of the alignments; it has 15700 vertices and 26549 edges. The summary of tangles from its repeat graph can be found here. We also project the graphs onto every human chromosome. See the analysis of each chromosome here.

Please contact Haixu Tang for details.