Integrating the Conservation or RP Data With cTFBS at Galaxy

1. At the Galaxy history page, the user now has the noncoding intervals with high phastCons scores, the noncoding intervals with high RP scores, and intervals with conserved GATA-1 binding sites. Select two of the results to combine, e.g., noncoding, high-RP intervals and conserved GATA-1 binding sites, by clicking on the buttons next to each query.

2. Under "Action to Perform," click on the button for "Perform operations like intersection, etc." and click "Go." This takes the user to the Query Operations page. Only the queries selected from the history page are transferred to the operations page. For a given number of queries, only a certain set of operations is allowed. Those that are not allowed are dimmed.

3. To find all the noncoding, high-RP intervals that have a conserved GATA-1 binding site in proximity to them, under "Operation," click on the button next to "Proximity" (see Note 14). After the screen refreshes, use the pull-down options to return regions from the noncoding, high-RP query results that lie less than 50 bp in either direction from a region in the query for conserved GATA-1 binding sites. Click on "Go," which returns you to the history page. The page initially returned frequently shows the new query as "running." Again, periodically click "Refresh" to obtain the results.

4. The results are the predicted CRMs, based on three criteria—they have a high RP score, they are not exons, and they are close to or encompass a conserved match to a GATA-1 binding site. To retrieve the results of a selected query, select "Get output" from the list of "Actions to Perform" and hit "Go." For viewing the results, select "UCSC Browser custom track" or "Ensembl Genome Browser custom track." For a plain text file, select "Raw result file (bed)." The desired action is taken when you click "Go." Other features can be combined, such as high phastCons scores, and other operations can be performed on the data, using the utilities at Galaxy.
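The proximity operation in step 3 can be illustrated with a minimal Python sketch that works on intervals read from the BED output. The function names (`gap`, `within_proximity`), the example coordinates, and the exact definition of distance (bases separating the two intervals, zero if they overlap) are assumptions made for illustration; Galaxy's own implementation may differ in detail.

```python
from typing import List, Tuple

# (chromosome, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def gap(a: Interval, b: Interval) -> int:
    """Bases separating two intervals on one chromosome (0 if they overlap or touch)."""
    return max(0, max(a[1], b[1]) - min(a[2], b[2]))


def within_proximity(queries: List[Interval],
                     targets: List[Interval],
                     max_dist: int = 50) -> List[Interval]:
    """Return each query interval lying less than max_dist bp from any target."""
    hits = []
    for q in queries:
        for t in targets:
            if q[0] == t[0] and gap(q, t) < max_dist:
                hits.append(q)
                break  # one nearby target is enough
    return hits


# Hypothetical example data, not taken from any real query result
high_rp = [("chr11", 5000, 5200), ("chr11", 9000, 9100), ("chr12", 5000, 5200)]
gata1_sites = [("chr11", 5230, 5240), ("chr11", 9500, 9510)]

predicted_crms = within_proximity(high_rp, gata1_sites, max_dist=50)
print(predicted_crms)
```

Only the first interval is kept in this example: it lies 30 bp from a binding site, whereas the second is 400 bp away and the third is on a different chromosome. The nested loop is adequate for a small locus; genome-wide sets would need sorted or indexed intervals.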

3.5. Retrieving High-RP or High-phastCons Intervals in Noncoding Sequences Using GALA

1. The GALA database is accessed at (Table 1) by selecting the link for "GALA." On the home page, the user finds links to GALA databases built for genomes of five different species (human, mouse, rat, chimp, and chicken), with up to three assemblies for each (see Note 15). Click on "Query page" under the appropriate species and assembly (e.g., Human July 2003 data release).

2. The query page is presented as an expandable selection of choices for categories, i.e., genes and gene predictions, expressed sequence tags and mRNA, comparative genomics, variation and repeats, expression and regulation, and mapping and sequencing, which are compatible with groups on the UCSC Genome Browser (see Note 16). Halfway down the page, you will find the query boxes for "Regulatory potential scores based on multiple alignments," with options for filtering the results by a minimum and maximum score. A good score for the minimal threshold is 0.001; leave the "less than or equal to" box blank. Alternatively, you may wish to query on the next item, PhyloHMM Cons (an earlier name for phastCons). A good score for the minimal threshold is 0.4 (18) (see Note 17).

3. Users wishing to investigate only a small genomic locus can choose the button to "Restrict search to interval" (near the bottom of the form). Otherwise, proceed to select the choice of output. "Text list" is the preferred choice when preparing datasets for use with subsequent operations.

4. Click "run query in background," so the server will save the results for 48 h. The results are returned on the GALA history page, where they can be combined with other queries (see step 6).

5. To collect exons, return to the GALA query page, and for the category "Genes and gene models," click on "Show the fields for this category" and then "Refresh" (toward the bottom of the page). The new page has many options for obtaining genes or parts of genes. Under "Protein Coding Genes, GALA's default set of genes," go to "Other gene fields" and click the box for "exons." Scroll to the bottom of the page, restrict the query to a chromosomal interval if desired, choose "text file" under "Output," and click on "run query in background."

6. Use the GALA history page to remove the exons from the high-RP intervals. Click the box next to each query on which you want to perform an operation (such as subtraction). Under "Compound queries," choose "SUBTRACTION." If you follow the steps in the order covered here, choose the option to subtract "earlier minus later query" to subtract exonic intervals from the high-RP intervals. Using the pull-down menu, specify that "only overlapping segments" should be removed. Click on "Run compound query in the background," located almost at the bottom of the page. The results are noncoding, high-RP intervals.
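The effect of subtracting exons with the "only overlapping segments" option can be sketched in Python as follows. This is an illustration under stated assumptions, not GALA's code: the function name `subtract`, the example coordinates, and the use of BED-style half-open coordinates are all hypothetical.

```python
from typing import List, Tuple

# (chromosome, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def subtract(intervals: List[Interval], exons: List[Interval]) -> List[Interval]:
    """Remove from each interval only the segments that overlap an exon."""
    result = []
    for chrom, start, end in intervals:
        pieces = [(start, end)]
        for e_chrom, e_start, e_end in exons:
            if e_chrom != chrom:
                continue
            next_pieces = []
            for s, e in pieces:
                if e_end <= s or e_start >= e:
                    next_pieces.append((s, e))  # exon does not touch this piece
                    continue
                if s < e_start:
                    next_pieces.append((s, e_start))  # part left of the exon
                if e_end < e:
                    next_pieces.append((e_end, e))  # part right of the exon
            pieces = next_pieces
        result.extend((chrom, s, e) for s, e in pieces)
    return result


# Hypothetical example data
high_rp = [("chr11", 100, 300), ("chr11", 400, 500)]
exons = [("chr11", 150, 200), ("chr11", 450, 600)]

noncoding_high_rp = subtract(high_rp, exons)
print(noncoding_high_rp)
```

In this example the first interval is split around an internal exon and the second is trimmed where an exon overlaps its end, so the noncoding portions of both are retained.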

3.6. Retrieving Conserved Matches to TFBS Using the GALA Server

1. On the GALA query page, under the category of "Expression and Regulation," go to "Transcription factor binding sites" and choose, e.g., "only binding sites conserved in hg16Mm3Rn3, cutoff used was 0.85" (see Note 18).

2. Click on the button after "To select/add factor names," which opens a new page with all the choices. Select those of interest, and press the button "add selections to main form," which is at the bottom of the selection page (see Note 19).

3. The user is returned to the GALA query page. As before, users can limit the query by entering a restricted genomic interval, or they can query the entire genome. After selecting the desired output (e.g., "text list"), the user should click on "Run query in the background." A results page appears, after which the user can go to the history page.

3.7. Integrating the Conservation or RP Data With cTFBS Data at GALA

1. The GALA history page lists the queries that have been run, such as noncoding high-RP intervals and conserved GATA-1 binding sites, along with the number of results obtained for each. To find features that are in proximity to others, scroll down the page under "Compound queries" to "Proximity."

2. Enter the appropriate query numbers in the boxes under "Proximity," specifying that the noncoding, high-RP intervals "lie within 50bp" of regions in the conserved GATA-1 binding site query (see Note 20).

3. Select the type of output (such as "text list"), and then click "Run compound query in the background."

4. The results returned are the CRMs predicted by having a high-RP score, not being exons, and being close to a GATA-1 binding site that is conserved among human, mouse, and rat. Other criteria can be applied, and other operations (such as intersections or clustering) can be used for alternative predictions.

4. Notes

1. Instead of using the "Galaxy featured datasets," the user can follow the link to the Table Browser and retrieve genomic intervals whose phastCons scores exceed a desired threshold. However, this step takes a rather long time for the entire genome (searching through about 800 million records), and it is likely to time out. Thus a user should limit this search to a specific interval (intervals of megabase size should be no problem), or one can use the preselected intervals deposited in the "featured datasets." A similar logic holds for the RP scores.

2. It is often the case that the most recent assembly is more complete and better annotated. However, it takes some time for annotations to be "lifted" onto new assemblies, and thus for some time after a new assembly is released, more information will be available on the previous assembly. As of this writing, the very extensive data on the ENCODE regions are available only for hg16, the July 2003 assembly of human.

3. Selecting the more sensitive threshold for phastCons score (≥0.2) returns a large set of intervals that does the best job of finding known CRMs in the HBB gene complex (18). However, it almost certainly returns many false positives, and for some purposes, the more stringent threshold may be more appropriate.

4. The choice of the collection of genes used is, of course, up to the user. The Known Genes track is very extensive and quite reliable, but it misses some genes. Users may prefer RefSeq, Ensembl, or other sets. Users should be aware that despite the considerable overlap in these gene sets, there are many differences, and these will affect the results of subtracting them from a set of intervals to find noncoding conserved sequences.

5. In this step, and in all steps in which the user has an option to limit a query to a particular interval, it is important to realize that the larger the interval examined, the more time it takes for the database to complete the query. Thus, searching the entire genome (approx 3000 Mb) takes considerably longer than searching the ENCODE regions (approx 30 Mb), which will take longer than a given locus (perhaps 0.3 Mb). Likewise, the number of features in the intervals searched is a major determinant of time to complete the query. phastCons and RP scores are given for every aligning nucleotide, and thus there are almost 800 million of these records to search. In contrast, the number of exons in the KnownGenes set is about 400,000, and thus a query to retrieve them takes less time. For full data on dense features like phastCons or RP, downloading files is much more efficient.

6. Users may instead wish to choose exons with an additional short interval, e.g., 10 bases, at each end. By doing so, the user will include regions that may be indirectly under selection because of their proximity to exons.

7. The Galaxy history page will load immediately, even if the query has not finished running at the Table Browser. In this case, at the end of the query, the notation "running" appears. The user should periodically click on "refresh" to see when the query has been completed and the results sent to Galaxy.

8. In this step, or any time the user is on the history page, one of the options is to edit the descriptions. Select a query, and click on "More." The screen refreshes, and now the option to "Edit query descriptions" is displayed. This editing is particularly helpful for the results of operations, for which Galaxy simply refers to the queries by number, not by content. A similar feature is implemented in GALA.

9. The time it takes for an operation to complete is determined primarily by the number of intervals that are in each query.

10. All the returned intervals can easily be viewed in the UCSC Genome Browser. On the left of the Genome Browser display is a list of all the returned intervals, which are hyperlinks to new views that show each region. The text file that can be returned is in BED format, in which the first three columns are chromosome, start position, and stop position for each interval.

11. After seeing the results, if the user decides that the genomic region selected for the queries requires optimization, e.g., it was too small or too large, return to step 5 and enlarge or reduce the coordinate distance.

12. Selecting "Regulatory potential (3way, human-mouse-dog, >0)" returns a set of 5.8 million intervals that does the best job of finding known CRMs in the HBB gene complex (18). It probably also returns some false positives. To increase the stringency of the search, users can go to the UCSC Table Browser, and select "Expression and Regulation" as the group and "3x Reg Potential" as the track (currently only available on the human July 2003 assembly). By clicking on the "create" button for "filter," the user gets to a page at which the threshold can be set higher, e.g., dataValue is ≥0.001. By clicking on "submit," this filter will be applied to the query when it is run.

13. The first set of filters is for the table of conserved binding sites, and the "name" refers to the name (or ID) of the weight matrix for a binding site. Thus one could enter a TRANSFAC ID for a particular weight matrix, such as "V$GATA1_02." Of course, this requires that the user know these IDs, which can be obtained from TRANSFAC. In the example given here, a wild card character ("*") was used to filter on "V$GATA*," which will include multiple binding sites for GATA-1 and GATA-3 (which have very similar binding sites). In order to filter based on the name of the transcription factor (not the binding site), users can take advantage of the ability of the Table Browser to filter on fields in related tables. On the filter page, choose the option to allow filtering on hg17.tfbsConsFactors, and choose "factor does match GATA-1" (or the name of the desired factor).

14. Users can find features in proximity to other features, such as described here, and the distance between them is set by the user. Alternatively, users may elect to perform a simple intersection. Note that the screen refreshes for each newly selected operation, because the parameters and choices relevant to each operation differ. In our research, we have found that using proximity has predicted some active CRMs that were missed by the intersection operation, but this is not frequent.

15. On the GALA home page, users may want to access "Annotation statistics" to see all the different types of data recorded, the number of records in each, and a partial list of fields in each table. Users can also go directly to their history page.

16. The default GALA query page lists only minimal or no choices for categories such as genes and gene models. Users who want to query on information within these should click on "Show the fields for this category" and then "Refresh" (toward the bottom of the page).

17. Users may wish to choose alignments computed between different species or filtered in various ways. These options are all under the comparative genomics section of the query page.

18. The options available for binding sites in GALA differ by the species and genome assembly. Here we selected binding sites conserved in human-mouse-rat alignments (hg16Mm3Rn3), but users can select other alignments, such as a pairwise human-chicken (hg16Gg2) or five-way human-chimp-mouse-rat-chicken (hg16Pt1Mm3Rn3Gg2). The threshold scores ("cutoff") for the matches to the weight matrices are adjusted in each case.

19. Users can select by ID for weight matrices for factors instead of by name of the factor. Queries of all binding sites (not just the conserved ones) must be limited to a chromosomal interval because of the very large number of sites in the entire genome (about 212 million for the human genome).

20. Users can elect to do intersections or other operations. Clustering is also supported, e.g., requiring that each high-RP interval have at least two conserved factor-binding sites within it. This set of operations is supported in both Galaxy and GALA.
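The clustering criterion mentioned in Note 20 (each high-RP interval must contain at least two conserved factor-binding sites) can be sketched in Python as below. The function name `with_min_sites`, the example coordinates, and the rule that a site counts only when it lies entirely inside the interval are assumptions for illustration; the servers may define containment differently.

```python
from typing import List, Tuple

# (chromosome, start, end) in BED-style half-open coordinates
Interval = Tuple[str, int, int]


def with_min_sites(intervals: List[Interval],
                   sites: List[Interval],
                   min_sites: int = 2) -> List[Interval]:
    """Keep intervals that fully contain at least min_sites binding sites."""
    kept = []
    for chrom, start, end in intervals:
        n = sum(1 for c, s, e in sites
                if c == chrom and start <= s and e <= end)
        if n >= min_sites:
            kept.append((chrom, start, end))
    return kept


# Hypothetical example data
high_rp = [("chr11", 1000, 1500), ("chr11", 3000, 3400)]
binding_sites = [
    ("chr11", 1010, 1016),
    ("chr11", 1200, 1206),
    ("chr11", 3100, 3106),
    ("chr11", 3390, 3410),  # extends past the interval end, so it is not counted
]

clustered = with_min_sites(high_rp, binding_sites, min_sites=2)
print(clustered)
```

Here only the first interval is kept, because the second fully contains just one site.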
