Identify Segmental Duplications by Chaining Alignments Together

To define the boundaries of segmental duplications, alignments whose coordinates are monotonically increasing are chained together to form larger contiguous alignments. This compensates for short and fragmented alignments, which arise when insertion or deletion events modify paralogous copies of DNA. Since we defined segmental duplications as regions of the genome having length greater than 5000 nt, we need to remove chained alignments that do not meet this minimum length requirement.

1. Sort GFF records by subject and query coordinates.

2. For records of the same subject and query chromosome pair, if adjacent sequence alignments are separated by less than 3000 nt, chain the alignments together.

3. Remove chained alignments that are shorter than 5000 nt.
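
The three steps above can be sketched in Python as follows. This is a minimal illustration, not the protocol's own script. It assumes the GFF records have already been parsed into simple tuples, and the names Alignment and chain_alignments are hypothetical. It also assumes that the 3000-nt gap is measured on both the subject and the query, that both spans of a chain must reach 5000 nt, and that coordinates are 1-based and inclusive, as in GFF. Only alignments that are adjacent in the sorted order are considered for chaining.

```python
from typing import NamedTuple

MAX_GAP = 3000     # nt; chain alignments separated by less than this
MIN_LENGTH = 5000  # nt; discard chains shorter than this


class Alignment(NamedTuple):
    """Hypothetical parsed GFF record (1-based, inclusive coordinates)."""
    subject: str
    s_start: int
    s_end: int
    query: str
    q_start: int
    q_end: int


def chain_alignments(records, max_gap=MAX_GAP, min_length=MIN_LENGTH):
    # Step 1: sort by chromosome pair, then by subject and query coordinates.
    ordered = sorted(
        records, key=lambda r: (r.subject, r.query, r.s_start, r.q_start)
    )

    chains = []
    prev = None  # last alignment added to the current chain
    for rec in ordered:
        if prev is not None:
            chain = chains[-1]
            same_pair = (chain.subject, chain.query) == (rec.subject, rec.query)
            # Subject order is guaranteed by the sort; check the query side.
            increasing = rec.q_start >= prev.q_start
            # Assumption: the gap must be small on both subject and query.
            close = (
                rec.s_start - chain.s_end < max_gap
                and rec.q_start - chain.q_end < max_gap
            )
            if same_pair and increasing and close:
                # Step 2: extend the current chain to cover this alignment.
                chains[-1] = chain._replace(
                    s_end=max(chain.s_end, rec.s_end),
                    q_end=max(chain.q_end, rec.q_end),
                )
                prev = rec
                continue
        chains.append(rec)
        prev = rec

    # Step 3: keep chains whose subject and query spans both reach min_length.
    return [
        c for c in chains
        if min(c.s_end - c.s_start, c.q_end - c.q_start) + 1 >= min_length
    ]
```

Calling chain_alignments on a list of Alignment records returns one Alignment per retained chain, with the start and end coordinates widened to cover every chained member.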

This step concludes the identification of large regions of the genome involved in recent segmental duplications. Large segmental duplications often contain duplicate genes and/or are implicated in genomic disease and structural rearrangements; hence they are of inherent biological interest. Subheadings 3.9. and 3.10. discuss mapping genes to segmental duplications, identifying duplicate gene pairs, and characterizing gene duplications using the Gene Ontology.

