## Notes

1. Since there are many ways to evaluate conservation in the alignment, we provide here a short description of the method used by VISTA. The VISTA curve is calculated as a windowed-average identity score for the alignment. A variable-sized window (Calc Window) is moved along the alignment, and a score is calculated at each base in the coordinate (base) sequence. That is, if the Calc Window is 100 bp, then the score for every point is the percentage of exact matches between the two sequences in a 100-bp-wide window centered on that point. Owing to resolution constraints when one is visualizing large alignments, it is often necessary to condense information about a hundred or more base pairs into one display pixel. This is done by only graphing the maximal score of all the base pairs covered by that pixel.

2. Regions are classified as "conserved" by analyzing the scores for each base pair (see Note 1 for how the scores are calculated). Two parameters control this analysis: "Min Cons Width" (default value 100 bp) and "Cons Identity" (default value 70%). A region is considered conserved if its conservation is greater than or equal to Cons Identity and its length is at least Min Cons Width. After all the regions that satisfy these conditions have been found, they are adjusted using the annotation information: UTRs and exons are considered conserved on the basis of Cons Identity alone, without regard to their length, so that short, highly conserved exons and UTRs are not missed. CNS (Conserved Noncoding Sequences) are trimmed when the alignment at the edges of the region contributes little to the overall conservation score. This can sometimes yield CNS that are slightly shorter than the specified Min Cons Width.

3. Sometimes the researcher will encounter "overlaps": areas in which multiple regions from one sequence were aligned to the same area on another sequence. This can happen for a variety of reasons, such as repeats and duplications. When it does, VISTA Browser draws a "best conservation" curve: for every point in the overlapped region, it evaluates every participating alignment and plots the highest score. This gives an optimistic view of the conservation. To view each alignment separately, click on the "alignment details" button, which lets you download PDF plots of each separate alignment.

4. If you have pop-up blocking software (external, such as the Google toolbar, or built-in, in Internet Explorer 6 for example), you might need to temporarily disable it—this is usually done by holding down the CTRL key while clicking the button.

5. The GenomeVista and mVISTA servers only accept plain-text sequence files in the FASTA format. The FASTA format consists of a header that starts with the greater than symbol (>), followed by the sequence name (one word) and the sequence itself on the following lines. If you have the sequence in a Word document, make sure that when you save it, the "Save as type" field says "Plain text (*.txt)."

6. The parameters used for conserved region analysis can have a significant impact on the number and quality of conserved regions identified by VISTA. When comparing distant or very closely related species, one may want to change default parameters by varying percent identity and window size when finding conserved sequences. Figure 4 illustrates how different the same alignment can look when the visualization parameters are altered.

7. We recommend that the user carry out preliminary studies to determine the transcription factors whose binding sites are most likely to occur in a given sequence or are most interesting for the research at hand. This information can be found by reading the relevant literature or by looking up gene entries in public databases such as GenBank and Ensembl. A surprising amount of relevant information can often be found by performing a simple Google search on the gene name (try running a search on RUNX1). The new Google Scholar website at http://scholar.google.com can also be extremely useful for finding articles pertaining to your sequence of interest.

8. "Clustering" allows users to identify TFBS that occur in groups, or clusters. For an individual cluster to occur, K binding sites of a given type must occur within N base pairs; K and N can be varied for different sites. For a group cluster to occur, K binding sites of any type must occur within N base pairs.
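The two thresholds in Note 2 can be sketched as follows. This is a deliberate simplification, assuming that a conserved region is a maximal run of bases whose per-base score meets Cons Identity. The annotation-based adjustment for exons and UTRs and the trimming of CNS edges are not reproduced, and the function name is hypothetical.

```python
def conserved_regions(scores, cons_identity=70.0, min_cons_width=100):
    """Return half-open (start, end) index pairs of conserved runs.

    Simplifying assumption: a region is a maximal run of positions whose
    score is >= cons_identity, kept only if the run is at least
    min_cons_width positions long.
    """
    regions = []
    start = None
    for i, score in enumerate(scores):
        if score >= cons_identity:
            if start is None:
                start = i  # a new run begins here
        elif start is not None:
            if i - start >= min_cons_width:
                regions.append((start, i))
            start = None
    # close a run that extends to the end of the score list
    if start is not None and len(scores) - start >= min_cons_width:
        regions.append((start, len(scores)))
    return regions
```

Calling the function with `min_cons_width=1` keeps every run regardless of its length, which mimics the length-independent treatment of exons and UTRs.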
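The "best conservation" curve of Note 3 amounts to a pointwise maximum over the overlapping alignments. The sketch below assumes that each alignment's curve is given as a list over the same coordinates, with `None` marking positions that the alignment does not cover. This data layout is an assumption for illustration, not VISTA Browser's internal format.

```python
def best_conservation(curves):
    """Pointwise maximum over several score curves of equal length.

    `None` entries mark positions not covered by an alignment; a position
    covered by no alignment stays `None`.
    """
    if not curves:
        return []
    length = len(curves[0])
    if any(len(curve) != length for curve in curves):
        raise ValueError("all curves must span the same coordinates")
    best = []
    for column in zip(*curves):
        covered = [score for score in column if score is not None]
        best.append(max(covered) if covered else None)
    return best
```

Because only the highest score survives at each point, the combined curve can never fall below any single alignment's curve, which is why the resulting view is optimistic.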
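Before submitting a file, the FASTA layout described in Note 5 can be checked locally. The parser below is a minimal sketch: the function name is hypothetical, and it assumes that only the first word after `>` is used as the sequence name.

```python
def parse_fasta(text):
    """Parse plain-text FASTA into a list of (name, sequence) pairs.

    Raises ValueError if sequence data appears before the first '>' header
    or if a header carries no name.
    """
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            if not words:
                raise ValueError("header has no sequence name")
            name, chunks = words[0], []  # the name is the first word only
        else:
            if name is None:
                raise ValueError("sequence data before the first '>' header")
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records
```

A file saved from a word processor in its native format, rather than as plain text, will typically fail this check because its first line does not start with `>`.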
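The clustering rule in Note 8 can be sketched with a sliding window over sorted site positions. The code is illustrative only: the function name is hypothetical, and "within N base pairs" is interpreted here as the start coordinates of K consecutive sites fitting inside a window of N bp, which is an assumption.

```python
def find_clusters(positions, k, n):
    """Return every run of k consecutive sites whose starts fit within n bp.

    `positions` are site start coordinates. For an individual cluster, pass
    the positions of one TFBS type; for a group cluster, pass the pooled
    positions of all TFBS types.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    pos = sorted(positions)
    clusters = []
    for i in range(len(pos) - k + 1):
        # first and last site of the run lie inside an n-bp window
        if pos[i + k - 1] - pos[i] < n:
            clusters.append(tuple(pos[i:i + k]))
    return clusters
```

Overlapping runs are reported separately, so a dense stretch of sites yields several clusters. Merging them into one interval is left out to keep the sketch short.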

### Acknowledgments

The VISTA project is an ongoing collaborative effort of a large group of scientists and engineers. It has been developed and maintained in the Genomics Division of Lawrence Berkeley National Laboratory. You can find the names of all contributors at the VISTA web site http://genome.lbl.gov/vista.

The project was partially supported by the Programs for Genomic Applications (PGA) funded by the National Heart, Lung, and Blood Institute (NHLBI/NIH) and by the Office of Biological and Environmental Research, Office of Science, US Department of Energy.

