Download Genome of Interest

This protocol requires that the genome sequence being targeted for the identification of segmental and gene duplications be assembled and masked for repetitive elements.

Although this protocol is applicable to all eukaryotic genomes (see Note 7), the mouse genome is used here as the example. The May 2004 mouse genome assembly (referred to as mm5 by UCSC and as Build 33 by NCBI) can be downloaded from UCSC as a zip file by executing the following command:

% wget

This zip file contains the mouse genome assembly, with one FASTA file for each chromosome. Repetitive elements within each chromosome sequence have been identified with RepeatMasker and are represented in lowercase letters; nonrepeating DNA sequences are shown in uppercase letters. Once the genome has been downloaded, the zip file is uncompressed by executing the following command:

% unzip
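Once the archive has been extracted, the lowercase/uppercase masking convention can be spot-checked on any chromosome file. The following is a minimal sketch, not part of the original protocol: the function name masked_fraction is hypothetical, and it assumes a standard FASTA layout in which header lines begin with ">". It reports the fraction of sequence letters that are lowercase; uppercase N characters, if present, are counted with the uppercase letters.

```shell
# Hypothetical helper (not part of the original protocol):
# print the fraction of lowercase (soft-masked) letters in a FASTA file.
masked_fraction() {
    # Drop FASTA header lines, then count lowercase and uppercase letters.
    # LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
    grep -v '^>' "$1" | LC_ALL=C awk '
        {
            lower += gsub(/[a-z]/, "")   # gsub returns the number of matches
            upper += gsub(/[A-Z]/, "")
        }
        END {
            total = lower + upper
            if (total == 0) { print "0.0000"; exit }
            printf "%.4f\n", lower / total
        }'
}
```

For example, after defining the function in the current shell, `masked_fraction chr19.fa` prints a single number between 0 and 1. A value of 0 would suggest that the file is not soft-masked.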

Uncompressing this file will extract one FASTA file for each chromosome sequence. For the mouse genome, this should produce the files chr1.fa to chr19.fa, chrX.fa, chrY.fa, and chrM.fa (mitochondrial DNA), as well as chr1_random.fa to chr19_random.fa, chrX_random.fa, chrY_random.fa, and chrUn_random.fa (see Note 8).
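A quick way to confirm that the extraction completed is to check for each expected chromosome file. The following sketch is an illustration only: the function name missing_chromosomes is hypothetical, and it checks only the assembled chromosome files named above (chr1.fa to chr19.fa, chrX.fa, chrY.fa, and chrM.fa), not the _random files.

```shell
# Hypothetical helper (not part of the original protocol):
# print the name of every expected mouse chromosome FASTA file
# that is absent from the given directory (default: current directory).
missing_chromosomes() {
    dir=${1:-.}
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X Y M; do
        # Report the file name only when the file does not exist.
        [ -f "$dir/chr$chr.fa" ] || echo "chr$chr.fa"
    done
}
```

Running `missing_chromosomes` in the directory where the archive was extracted prints nothing if all 22 files are present; otherwise it lists the missing file names, one per line.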

