3.7. Filter GFF Records Based on Many Criteria

After the traditional BLAST alignments have been converted into GFF format, some alignments must be excluded, because not every alignment is a component of a recent segmental duplication. To retain only sequences that meet a stringent definition of a "recent segmental duplication," filter the GFF records according to the criteria described below.

3.7.1. Filter Sequence Alignments With Less Than 90 Percent Identity

Recent segmental duplications are defined as paralogous sequences that share at least 90% sequence identity. Remove GFF records whose percent identity attribute falls below this cutoff. This filtering criterion applies to both inter- and intrachromosomal sequence alignments.
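The identity filter can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code. It assumes nine tab-separated GFF columns, a ninth column written as `key=value` pairs separated by semicolons, and a hypothetical attribute name `percent_identity`; adjust these to match the GFF records produced in the conversion step. Records at exactly 90% are kept here, following the subsection heading; change the comparison if a strict cutoff is intended.

```python
MIN_PERCENT_IDENTITY = 90.0  # cutoff taken from the text


def parse_attributes(column9):
    """Parse a 'key=value;key=value' attribute column into a dict.

    The key=value layout is an assumption about how the GFF was written.
    """
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def passes_identity_cutoff(gff_line, cutoff=MIN_PERCENT_IDENTITY,
                           key="percent_identity"):
    """Return True if the record's percent identity is at least `cutoff`.

    `key` is a hypothetical attribute name. A record that lacks the
    attribute raises KeyError instead of being dropped silently.
    """
    columns = gff_line.rstrip("\n").split("\t")
    attrs = parse_attributes(columns[8])
    return float(attrs[key]) >= cutoff


def filter_by_identity(gff_lines, cutoff=MIN_PERCENT_IDENTITY):
    """Keep only the records that meet the cutoff.

    Comment lines (starting with '#') and blank lines are skipped.
    """
    kept = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        if passes_identity_cutoff(line, cutoff):
            kept.append(line)
    return kept
```

Because the same cutoff applies to inter- and intrachromosomal alignments, `filter_by_identity` can be run unchanged on both sets of records.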

3.7.2. Filter Suboptimal Sequence Alignments

A sequence alignment is suboptimal when it is redundant, that is, when both its query and subject elements are completely covered (spanned) by those of another alignment. In each such pair, remove the GFF record with the smaller span. This filtering step applies to both inter- and intrachromosomal sequence alignments.
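The containment test can be illustrated with the short Python sketch below. The `Alignment` tuple and its field names are hypothetical and stand in for the query and subject coordinates extracted from each GFF record. The sketch assumes that start is never greater than end, as GFF requires, and that two alignments are compared only when they involve the same query and subject sequences. The pairwise comparison is quadratic in the number of records, which is acceptable for an illustration but may be slow on large data sets.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    query_id: str
    query_start: int
    query_end: int
    subject_id: str
    subject_start: int
    subject_end: int


def span(a: Alignment) -> int:
    """Total number of bases covered on the query plus the subject."""
    return (a.query_end - a.query_start + 1) + (a.subject_end - a.subject_start + 1)


def covers(outer: Alignment, inner: Alignment) -> bool:
    """True if both the query and subject intervals of `inner`
    lie entirely within those of `outer` on the same sequences."""
    return (
        outer.query_id == inner.query_id
        and outer.subject_id == inner.subject_id
        and outer.query_start <= inner.query_start
        and inner.query_end <= outer.query_end
        and outer.subject_start <= inner.subject_start
        and inner.subject_end <= outer.subject_end
    )


def remove_suboptimal(alignments: List[Alignment]) -> List[Alignment]:
    """Drop every alignment that is covered by a strictly larger one.

    Exact duplicates have equal spans, so neither copy is removed here.
    """
    kept = []
    for i, candidate in enumerate(alignments):
        redundant = any(
            j != i and covers(other, candidate) and span(other) > span(candidate)
            for j, other in enumerate(alignments)
        )
        if not redundant:
            kept.append(candidate)
    return kept
```

An alignment is removed only when both of its intervals are covered; one whose query interval is covered but whose subject interval is not is retained.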

3.7.3. Filter Identical Sequence Alignments

This filtering step applies only to intrachromosomal sequence alignments. Exclude self-self matches, that is, GFF records whose subject sequence coordinates are identical to the query sequence coordinates.
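A self-self match can be detected with a direct coordinate comparison, as in the minimal Python sketch below. The `Alignment` tuple and its field names are hypothetical; how the subject coordinates are stored in the GFF record depends on the conversion step, so the extraction of these values is left out.

```python
from typing import List, NamedTuple


class Alignment(NamedTuple):
    # Hypothetical record layout; field names are illustrative only.
    query_id: str
    query_start: int
    query_end: int
    subject_id: str
    subject_start: int
    subject_end: int


def is_self_match(a: Alignment) -> bool:
    """True if the query and subject are the same interval on the same sequence."""
    return (
        a.query_id == a.subject_id
        and a.query_start == a.subject_start
        and a.query_end == a.subject_end
    )


def remove_self_matches(alignments: List[Alignment]) -> List[Alignment]:
    """Keep only the alignments that are not self-self matches."""
    return [a for a in alignments if not is_self_match(a)]
```

The sequence identifiers are compared as well as the coordinates, so an interchromosomal alignment that happens to share coordinates is never discarded by this check.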

