Gene Sets, cis Regions, and Motifs

Depending on the goal of the study, appropriate gene sets need to be generated. One needs to understand that such an in silico analysis is actually an experiment. A hypothesis must be formulated. Suitable control sets also need to be constructed. For example, if one is interested in determining whether a specific transcription factor governs the observed differential expression of a gene set, the nondifferentially expressed genes could be employed as the negative set or control set. This would also help in downstream statistical assessment of the findings.
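To illustrate how a control set supports the statistical assessment, the sketch below computes a one-sided hypergeometric (Fisher-type) p-value for the question "do differentially expressed genes carry a site more often than the control genes?". This test is one common choice and is an assumption here, not a procedure prescribed by BEARR; the function name and the counts in the usage line are purely illustrative.

```python
from math import comb


def enrichment_p_value(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided hypergeometric p-value for site enrichment.

    hits_fg / n_fg: genes with a site / total genes in the foreground
                    (e.g., differentially expressed) set.
    hits_bg / n_bg: the same counts for the control set.
    Returns P(X >= hits_fg) when n_fg genes are drawn at random
    from the pooled foreground + control genes.
    """
    total = n_fg + n_bg                # all genes considered
    total_hits = hits_fg + hits_bg     # all genes carrying a site
    max_k = min(n_fg, total_hits)      # largest possible foreground hit count
    numerator = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, max_k + 1)
    )
    return numerator / comb(total, n_fg)


# Illustrative (made-up) counts: 30 of 100 foreground genes and
# 50 of 400 control genes carry at least one predicted site.
p = enrichment_p_value(30, 100, 50, 400)
print(f"enrichment p-value: {p:.3g}")
```

A small p-value suggests that the site is over-represented in the gene set of interest relative to the control set; it does not by itself demonstrate binding.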

Also important is the definition of cis regions. Transcription factors are generally expected to bind regions upstream of the transcription start site (TSS), although in many instances they bind downstream of the TSS up to the first intron, as well as around the 3'-untranslated regions (3'-UTRs) (Fig. 2). Thus, depending on the assumptions made and the purpose of the analysis, one might opt to examine large regions (approx 10 kbp) around the two ends of the genes (which is normally done at the beginning of an analysis exercise) or conservatively probe a few thousand nucleotides around the TSS (e.g., 3 kbp upstream and 500 bp downstream of the TSS). All of these are conveniently supported by BEARR.
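As a minimal sketch of what "3 kbp upstream and 500 bp downstream" means in genomic coordinates, the function below computes the window for genes on either strand. The function name, the 1-based inclusive coordinate convention, and the example TSS position are assumptions made for illustration; they are not taken from BEARR.

```python
def cis_window(tss, strand, upstream=3000, downstream=500):
    """Return (start, end) of the region around a TSS.

    Coordinates are assumed to be 1-based and inclusive on the forward
    strand of the chromosome. "Upstream" always means 5' of the TSS in
    the direction of transcription, so it lies at LOWER coordinates for
    a forward-strand gene and at HIGHER coordinates for a reverse-strand
    gene.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    return max(1, start), end  # do not run off the start of the chromosome


# Hypothetical TSS at position 10,000 on each strand.
print(cis_window(10_000, "+"))  # (7000, 10500)
print(cis_window(10_000, "-"))  # (9500, 13000)
```

The same logic applies around the 3' terminus; only the anchor position changes.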

The final key piece is the binding site model. Binding site consensus sequences are generally available for known transcription factors, as the consensus is the most basic and easily understood model. For well-studied factors, more complex models, such as the position-weight matrix (PWM) (2), might also exist. As much information as possible on the sites should be gathered, for example: how much deviation from the consensus sequence can still be tolerated for binding, whether there are key positions that have to be conserved, and where binding is usually located. These might be found in public databases, e.g., TRANSFAC (3).

3. Working With BEARR

This section details the key components of BEARR and highlights additional features. The BEARR website ( 1.0) is publicly available for nonprofit research usage. Heavy and frequent users are advised to request the source code for a local installation within their own computing resources.

Figure 3 is a screen capture of the BEARR home page; numbered regions are key input areas. Frequent references will be made to this figure throughout the rest of the chapter. Readers are also encouraged to read and follow the tutorial provided on the website.

3.1. Understanding the Pipeline

Upon the submission of an analysis request, BEARR will:

1. Collect all the information filled in the form.

2. Extract sequences surrounding the given genes to the extent defined by the user.

3. If one or more consensus sequences have been defined, possibly with a number of acceptable nucleotide mismatches, search the sequences for matching sites.

4. If a PWM is also given, use it to score sites in the sequences.

5. Produce a summary report, detailing the sites found.
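The consensus-matching step (step 3) can be sketched as a sliding-window comparison that tolerates a fixed number of mismatches. This is only an illustration of the idea, not BEARR's actual implementation: the function name is hypothetical, only the forward strand is scanned, and ambiguity codes are not handled.

```python
def find_consensus_sites(sequence, consensus, max_mismatches=0):
    """Return (position, site, mismatches) for every window of `sequence`
    that differs from `consensus` in at most `max_mismatches` positions.

    Positions are 0-based offsets into `sequence`.
    """
    sequence = sequence.upper()
    consensus = consensus.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Hamming distance between the window and the consensus
        mismatches = sum(1 for a, b in zip(window, consensus) if a != b)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


# Toy sequences; AGGTCA is used purely as an example consensus.
print(find_consensus_sites("CCAGGTCATT", "AGGTCA"))
# [(2, 'AGGTCA', 0)]
print(find_consensus_sites("TTAGGTGATT", "AGGTCA", max_mismatches=1))
# [(2, 'AGGTGA', 1)]
```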

Thus, for an accurate analysis result, users need to pay additional attention while providing information for the following areas:

1. Sequence extraction.

a. The two key parts in sequence extraction are the gene list (Fig. 3, input area 1) and the desired cis region for analysis (Fig. 3, input area 2).

b. The GeneIDs input box accepts gene identifiers delimited with white space (space, tab, and newline characters). Strip the identifiers of extra characters (e.g., semicolons ";" and commas ",") that are not part of recognizable identifiers. BEARR will not check or fix such errors. Although BEARR currently supports three different IDs, we only recommend the use of NCBI RefSeq ID (4) for stability reasons.

c. In defining the cis regions to be investigated, make sure that upstream and downstream are correctly defined. The most common mistake is to confuse the meaning of upstream and downstream around the 3' terminus and for reverse-strand genes.

2. Consensus search.

a. Users can specify the consensus patterns as single patterns (Fig. 3, input area 3) and/or two half-sites (Fig. 3, input area 4). The second option is for the numerous tandem binding sites. In both, users might also specify the amount of acceptable mismatches (or nucleotide deviations) from the inputted consensus sequences. The two-half-site specification permits explicit control of the mutations on each half-site.

Fig. 3. BEARR main web site and interface. It consists of three clearly demarcated main parts: Sequence Extraction, Consensus Search, and Position-Weight Matrix Analysis. The input areas, which will be referred to frequently in this chapter, are numbered.

b. For pattern searching, BEARR adopts the use of Regular Expression (5), as it is highly flexible and easy to understand, but slightly adjusts it to meet the needs of nucleotide motif searches. The four main, yet less widely known, operators are dots ("."), square brackets ("[]"), curly brackets ("{}"), and brackets ("()").

c. A dot represents the wild-card character, which is the same as "N" under the IUB/IUPAC nucleic acid nomenclature; for example, .TA matches ATA, CTA, GTA, and TTA.

d. A set of nucleotides enclosed in a pair of square brackets signifies only a single position and that the listed nucleotides are allowed to appear in that position, for example, TA[TA]A will find TATA and TAAA.

e. In many instances, repeats of the same nucleotide (or set of nucleotides) are prevalent. Curly brackets indicate that the preceding character (or set of characters) is to be repeated n times (using "{n}") or at least n times and at most m times (using "{n,m}"); for example, TA{2} searches for TAA, TA{2,3} finds TAA and TAAA, and [GA][GA][GA]C[AT][AT]G can be written as [GA]{3}C[AT]{2}G.

f. Brackets are similar to square brackets, in that they define possible nucleotides at the given position. To further illustrate, TATA will only match TATA, TA[TA]A will find TATA and TAAA, and TA{1,2}T will search for TAT and TAAT.
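The pattern examples above can be checked with any standard regular-expression engine. The sketch below uses Python's `re` module as a stand-in; because BEARR slightly adjusts the regular-expression syntax, its behavior may differ in details, and the helper names `expand` and `scan` are hypothetical.

```python
import re
from itertools import product


def expand(pattern, length):
    """List every DNA string of the given length that the pattern matches in full."""
    regex = re.compile(pattern)
    candidates = ("".join(bases) for bases in product("ACGT", repeat=length))
    return sorted(s for s in candidates if regex.fullmatch(s))


def scan(pattern, sequence):
    """Return (0-based position, site) for all matches, including overlapping ones.

    The pattern is wrapped in a lookahead so that a match starting inside
    a previous match is still reported.
    """
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


print(expand(".TA", 3))        # ['ATA', 'CTA', 'GTA', 'TTA']
print(expand("TA[TA]A", 4))    # ['TAAA', 'TATA']
print(expand("TA{2}", 3))      # ['TAA']
print(expand("TA{1,2}T", 3), expand("TA{1,2}T", 4))  # ['TAT'] ['TAAT']

# The long and the compact form describe the same set of 7-mers.
print(expand("[GA][GA][GA]C[AT][AT]G", 7) == expand("[GA]{3}C[AT]{2}G", 7))  # True

print(scan("TA[TA]A", "GGTATAAA"))  # [(2, 'TATA'), (4, 'TAAA')]
```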

3. Position-weight matrix analysis.

a. The PWM (2) models the strength of protein-DNA binding at each position for each nucleotide. Such a matrix can thus be used to assess the binding likelihood of a given site. BEARR asks users to input the raw nucleotide counts or relative nucleotide frequencies at each position of the desired binding sites (Fig. 3, input area 5). This can be derived from samples of known functional sites.

b. Users have the option to use the relative frequency for scoring or to transform it into log-likelihood scores. The log-likelihood transformation is recommended. If desired, one might ask BEARR to keep only the best site for each sequence; alternatively, a cutoff threshold can be specified (Fig. 3, input area 6).
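To make the PWM steps concrete, the following sketch converts raw nucleotide counts into log-likelihood scores and scans a sequence with them. The exact transformation used by BEARR is not given in this chapter, so the formula here (pseudocount of 1, uniform background of 0.25, base-2 logarithm) is an assumption, as are the function names and the toy count matrix.

```python
import math

BASES = "ACGT"


def log_likelihood_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position nucleotide counts into log2(frequency / background).

    `counts` is a list with one dict per motif position, mapping each of
    A, C, G, T to its raw count among the known sites.
    """
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score_windows(sequence, pwm):
    """Score every window of the sequence; windows with non-ACGT letters are skipped."""
    sequence = sequence.upper()
    width = len(pwm)
    scored = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue
        score = sum(pwm[i][base] for i, base in enumerate(window))
        scored.append((start, window, score))
    return scored


# Toy counts from eight hypothetical sites that all read TAT.
counts = [
    {"A": 0, "C": 0, "G": 0, "T": 8},
    {"A": 8, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 8},
]
pwm = log_likelihood_pwm(counts)
hits = score_windows("GGTATCC", pwm)

best = max(hits, key=lambda hit: hit[2])         # keep only the best site
above_cutoff = [h for h in hits if h[2] >= 1.0]  # or apply a cutoff threshold
print(best, above_cutoff)
```

Keeping the single best window per sequence and filtering by a score cutoff correspond to the two reporting options described in item b.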