Gibbs Sampler: Identifying TF Binding Sites in DNA, Papers of Computer Science

The Gibbs Recursive Sampler is a software tool designed to locate multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences. It uses a Bayesian method to infer the number and locations of binding sites for each TF motif, as well as a background model to account for heterogeneity in DNA composition. The software includes features for palindromic, direct repeat, and concentrated alphabet models, preferred binding site locations, and a statistical significance test. It recursively sums over all possible alignments of sites in a sequence to obtain Bayesian inferences on the number of sites and their alignments and orderings.

Gibbs Recursive Sampler: finding transcription factor binding sites

William Thompson 1,*, Eric C. Rouchka 2 and Charles E. Lawrence 1,3

1 The Wadsworth Center, New York State Department of Health, Albany, NY 12201-0509, USA, 2 Department of Computer Engineering and Computer Science, University of Louisville, Louisville, KY 40292, USA and 3 Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

Received February 14, 2003; Revised and Accepted April 9, 2003

ABSTRACT

The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition. Here we describe the basic operation of the web-based version of this sampler. The sampler may be accessed at http://bayesweb.wadsworth.org/gibbs/gibbs.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/gibbs.html. An online user guide is available at http://bayesweb.wadsworth.org/gibbs/bernoulli.html and at http://www.bioinfo.rpi.edu/applications/bayesian/gibbs/manual/bernoulli.html. Solaris, Solaris.x86 and Linux versions of the sampler are available as stand-alone programs for academic and not-for-profit users. Commercial licenses are also available. The Gibbs Recursive Sampler is distributed in accordance with the ISCB level 0 guidelines and a requirement for citation of use in scientific publications.

INTRODUCTION

Transcription regulation is arguably the most important foundation of cellular function, since it exerts the most fundamental control over the abundance of virtually all of a cell’s functional macromolecules. A predominant feature of transcription regulation is the binding of regulatory proteins, transcription factors (TFs), to cognate DNA binding sites known as transcription factor binding sites (TFBS) in the genome. The computational identification of TFBS through the analysis of DNA sequence data has emerged in the last decade as a major new technology for the elucidation of transcription regulatory networks. The Gibbs Motif Sampler is a software package used to locate common elements in collections of biopolymer sequences. It has been applied to the analysis of protein sequences (1,2). Gibbs sampling has also been used extensively in the identification of TFBS (3,4) and an earlier version of this software has been available at this web location for some time. In this paper we describe a new variation, the Gibbs Recursive Sampler, designed to search for multiple TFBS simultaneously. It includes several features that are designed specifically for locating TFBS in unaligned DNA sequences. These features are based on characteristics of TF/DNA complexes or their components.

THE GIBBS RECURSIVE SAMPLER

Gibbs sampling is a Markov Chain Monte Carlo procedure that has seen wide application in the statistical community. It was first applied in bioinformatics as a tool for multiple sequence alignment in 1993 (1). Gibbs sampling techniques have subsequently seen numerous enhancements and applications (2,5). A key feature of sequence-based Gibbs sampling algorithms and related expectation maximization algorithms (6,7) is the use of motif models in the form of product multinomial models to capture sequence patterns common to the binding sites of each TF.

The recursive sampler, described here, was specifically developed for the identification of TFBS in unaligned DNA sequences. It includes several features that are unique to this software: a rigorous Bayesian method for inferring the number and the locations of the TFBS for multiple TF motifs simultaneously; a background model of the heterogeneity in the composition of non-coding nucleotide sequence and the ability to use prior information of binding motifs. In addition, it includes features to allow the use of palindromic, direct repeat and concentrated alphabet models, preferred binding site locations, and a rigorous test of the statistical significance of the results, the Wilcoxon signed-rank test.

In the following, we briefly describe how the algorithm incorporates these features and we provide instructions on the use of the algorithm and on the interpretation of its results.

*To whom correspondence should be addressed. Tel: +1 518 486 7882; Fax: +1 518 473 2900; Email: [email protected]

3580–3585 Nucleic Acids Research, 2003, Vol. 31, No. 13
DOI: 10.1093/nar/gkg608
© Oxford University Press 2003; all rights reserved


RECURSIVE DISCOVERY OF SITES AND SITE COUNTS

Multiple TFs often bind in a combinatorial fashion to regulate transcription. These collections of factors may contain multiple TFBS for any or all of the factors. Furthermore, some of the DNA sequences in a data set may contain no sites at all, or no sites for one or more of the factors involved in regulating the rest. While the total number of sites in a given input sequence often spans a specifiable and relatively short range, the exact number of sites and the number of sites corresponding to each motif in any input sequence are unknown and often vary among the input sequences. To address these unknowns, the sampler uses recursive sums over all possible alignments of 0 ≤ k ≤ Kmax sites in a sequence, to obtain Bayesian inferences on the number of sites for each motif and the total number of sites in each sequence. The recursion examines the placements of sites for k TFs, as represented by p different motifs, in each sequence. The algorithm infers for each sequence the total number of sites, the number of each of the p motifs and the alignments and orderings of these sites in the sequence. In its back sampling step it simultaneously samples sites in each sequence according to these inferences. As in previous Gibbs sampling algorithms, the widths of sites are inferred using a fragmentation algorithm (5).

The sampling process iterates over the sequences one at a time, using currently sampled values in all other sequences, to guide the sampling process toward a converged result. After all of the sequences have been examined and a set of multiple motif positions has been determined, the log of the posterior alignment probability is calculated (5). The maximum value of this probability, the MAP (maximum a posteriori probability), provides the optimal solution. The MAP value is measured relative to an empty or ‘null’ alignment, by taking the difference between the log of the probability of the alignment and the log of the probability of an empty alignment. A value greater than zero indicates that the alignment is more likely than unaligned background. The process continues until a maximal number of iterations has been executed or until a certain number of iterations, the plateau period, has occurred without an increase in the estimated MAP.

As a default we report the MAP solution. Alternatively, we provide a Bayesian inference of site frequencies based on continued sampling after convergence. These Bayesian inferences take the form of estimated probabilities for each predicted site. Both types of solution adjust inferences for the lengths of the input sequences and of sites and motifs in a Bayesian manner analogous to Webb (8).
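
The recursive sum over site placements can be illustrated with a small dynamic program. The sketch below is only a simplified illustration, not the sampler's actual implementation: it assumes a single motif of fixed width, precomputed motif/background likelihood ratios for each start position, and a user-supplied prior on the number of sites. The names `forward_sums`, `posterior_site_counts`, `ratios`, `width`, `k_max` and `prior` are hypothetical.

```python
def forward_sums(ratios, width, k_max):
    """Sum likelihood ratios over all placements of k non-overlapping sites.

    ratios[i] is an assumed, precomputed motif/background likelihood ratio
    for a site starting at position i. Returns totals[k] for k = 0..k_max.
    """
    n = len(ratios) + width - 1  # sequence length implied by the start positions
    # f[k][i]: summed weight of all ways to place k sites within the first i bases
    f = [[0.0] * (n + 1) for _ in range(k_max + 1)]
    f[0] = [1.0] * (n + 1)  # exactly one way to place zero sites
    for k in range(1, k_max + 1):
        for i in range(1, n + 1):
            f[k][i] = f[k][i - 1]      # base i is background
            start = i - width          # or a site ends exactly at base i
            if start >= 0:
                f[k][i] += f[k - 1][start] * ratios[start]
    return [f[k][n] for k in range(k_max + 1)]


def posterior_site_counts(totals, prior):
    """Posterior over the number of sites, proportional to prior[k] * totals[k]."""
    weights = [p * t for p, t in zip(prior, totals)]
    norm = sum(weights)
    return [w / norm for w in weights]


if __name__ == "__main__":
    totals = forward_sums([1.0, 1.0, 1.0, 1.0], width=2, k_max=3)
    print(totals)
    print(posterior_site_counts(totals, prior=[0.25] * 4))
```

With all ratios set to one, `totals[k]` simply counts the ways of placing k non-overlapping sites, which makes the recursion easy to check by hand. A back sampling step would walk the same table from the end of the sequence toward the start, choosing at each base between "background" and "site ends here" in proportion to the two terms of the sum.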

HETEROGENEOUS BACKGROUND

As in previous Gibbs sampling models, the probability that a particular position in the sequence is sampled as a site is calculated as the ratio of the probability of the site under motif models to the probability under a background model that describes the sequence in the absence of TFBS. In previous implementations, background models assumed homogeneity in the composition of each sequence. However, variations in sequence composition in non-coding DNA are often complex. The most common approach to this complexity has been to model the background sequence using Markov chains (9,10). Non-coding sequence is also often heterogeneous in composition, particularly in eukaryotes (Fig. 1). This variation in local base composition can adversely affect sequence alignment (11,12). In addition, since TFBS are often A-T or G-C rich, masking algorithms are often not useful for reducing the effect of background variation (13).

To address this heterogeneity, the recursive sampler uses the Bayesian segmentation algorithm (14) to produce a position-specific background model. A two-step process is employed. First, the individual input sequences are analyzed for heterogeneity in composition using the Bayesian segmentation algorithm. This algorithm returns the probabilities of observing each of the four bases at each position in a sequence, based only on that specific sequence’s compositional heterogeneity and on the uncertainty in this heterogeneity. These probabilities are then used as position-specific background models in the above ratio.
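
As a rough illustration of the ratio described above, the following sketch scores one candidate site against a position-specific background. It assumes the background probabilities have already been produced by some segmentation step and are simply passed in; the function name `site_ratio` and the data layout (one dictionary of base probabilities per column or position) are hypothetical choices made for this example.

```python
def site_ratio(seq, start, motif, background):
    """Likelihood ratio of seq[start:start+w] under the motif vs the background.

    motif[j][base]: probability of base at motif column j (product multinomial).
    background[i][base]: position-specific probability of base at sequence
    position i, assumed to be supplied by a separate segmentation step.
    """
    ratio = 1.0
    for j, column in enumerate(motif):
        base = seq[start + j]
        ratio *= column[base] / background[start + j][base]
    return ratio


if __name__ == "__main__":
    motif = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    ]
    uniform = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 4
    # An A-rich stretch at the start lowers the evidence carried by an observed A
    a_rich = [{"A": 0.55, "C": 0.15, "G": 0.15, "T": 0.15}] + uniform[1:]
    print(site_ratio("ACGT", 0, motif, uniform))
    print(site_ratio("ACGT", 0, motif, a_rich))
```

The second call shows the intended effect of a heterogeneous background: the same motif match counts for less where the surrounding sequence already favors the matching base.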

PALINDROMES AND DIRECT REPEATS

Several TFs that bind as symmetric homodimers or homomultimeric protein complexes have palindromic DNA binding motifs (6). Other regulatory TFs such as Escherichia coli PhoB and certain zinc finger proteins bind in directly repeating multimers and therefore have directly repeating binding patterns (15). Often, the spacing between the palindromic half-sites is unknown. The algorithm makes inferences in these cases, using a modified version of the fragmentation algorithm (5), by restricting fragmentation to the center positions.

In a number of cases, AT/TA and GC/CG base pairs may be indistinguishable to TF binding (16). For example, bases in the minor groove of DNA are less accessible to the protein, which limits the ability to distinguish the bases. In addition, AT/TA pairs may allow the DNA greater flexibility to facilitate the DNA bending often associated with TF binding (6). To address all of these cases the recursive sampler includes motif models that use reduced alphabets.
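
The two modeling ideas in this section can be sketched in a few lines. The code below is an illustrative assumption rather than the sampler's own method: `palindromic_counts` ties column j of a motif to the complement of column w-1-j when counting bases, and `reduce_alphabet` collapses A/T and C/G into the two-letter code W/S, which is one plausible reading of a reduced alphabet. Both function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def palindromic_counts(sites):
    """Per-column base counts with column j tied to the complement of column w-1-j."""
    width = len(sites[0])
    counts = [dict.fromkeys("ACGT", 0) for _ in range(width)]
    for site in sites:
        for j, base in enumerate(site):
            counts[j][base] += 1
            # The mirrored column sees the complementary base on the other strand
            counts[width - 1 - j][COMPLEMENT[base]] += 1
    return counts


def reduce_alphabet(site):
    """Collapse A/T to W and C/G to S (an assumed two-letter reduced alphabet)."""
    return "".join("W" if base in "AT" else "S" for base in site)


if __name__ == "__main__":
    for column in palindromic_counts(["AAGT"]):
        print(column)
    print(reduce_alphabet("AAGT"))
```

Tying mirrored columns halves the number of free parameters in a palindromic motif, so each aligned site contributes evidence to both half-sites.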

Figure 1. The compositional variation of a 500 bp region upstream of the translation start site of the YDR226W/ADK1 gene from Saccharomyces cerevisiae. The probability of each base at each position is calculated by the Bayesian segmentation algorithm (10). This algorithm returns the probabilities of observing each of the four bases at each position in the sequence.

for all advanced features, specification of alternative values on this page allows the user greater control over how the sampler searches for sites, and also allows adjustment of the program runtime parameters. For example, in a particular prokaryotic application, the user may decide to replace the default choice of palindromic sites with a non-palindromic model. A number of other parameters influence the operation of the program. Several items have been mentioned above and further information on formats and acceptable values can be obtained by clicking on the associated hyperlinks.

A FREQUENCY-BASED SOLUTION

By default, the algorithm returns the single best solution that it has found as measured by the MAP value. However, no prediction method is perfect, and not all of the predictions of this or other algorithms are equally compelling. For examination of the credibility of each prediction, a Bayesian sampling solution is also available as an option. For this solution, after the MAP solution has been determined, the algorithm continues to sample, in order to explore variations in the models. The frequencies with which sites are sampled are recorded and reported to the user. Variations in models may alter estimates of site frequencies from the values obtained with the single MAP solution. Our experience shows that some sites are strong and are sampled 100% of the time, while others are weaker and are sampled with a lower frequency. These sampling frequencies are an estimate of the probability that the cognate TF binds at each predicted site. By default, only sites that have a frequency of at least 50% are reported.
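
The bookkeeping behind such a frequency-based report can be sketched as follows. This is a minimal illustration under stated assumptions: each post-convergence iteration is represented as a set of (sequence identifier, start position) pairs, and the function name `site_frequencies` and the `min_frequency` parameter are hypothetical, with the 50% default taken from the text above.

```python
from collections import Counter


def site_frequencies(samples, min_frequency=0.5):
    """Fraction of post-convergence samples in which each site was drawn.

    samples: one collection of (sequence_id, start) pairs per sampling iteration.
    Only sites sampled in at least min_frequency of the iterations are returned.
    """
    tally = Counter(site for sample in samples for site in set(sample))
    total = len(samples)
    return {site: n / total for site, n in tally.items() if n / total >= min_frequency}


if __name__ == "__main__":
    samples = [
        {("seq1", 10), ("seq2", 5)},
        {("seq1", 10)},
        {("seq1", 10), ("seq2", 7)},
        {("seq1", 10), ("seq2", 5)},
    ]
    print(site_frequencies(samples))
```

In this toy input the site in `seq1` is drawn in every iteration, one site in `seq2` is drawn in half of them and is kept, and the other falls below the cutoff and is dropped.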

PROGRAM OUTPUT

The web-based version of Gibbs Recursive Sampler produces the same output as does the stand-alone version. Normally, the output is returned via email. The user also has the option of receiving the output online. Figure 5 illustrates typical program output. These results are for a set of 18 sequences extracted from E. coli regulatory regions. The sequences are known to contain binding sites for the protein CRP (6). The results were generated using the recursive sampler with the following parameters: a width of 16, a palindromic model, a maximum number of sites per sequence of two and heterogeneous background composition. Since we used a palindromic model, the search for sites in the reverse complement direction was turned off. The Wilcoxon signed-rank test was applied to test the statistical significance of the resulting alignment.

INTERPRETATION OF RESULTS

The first portion of the results in Figure 5 is a list of the options used for the current run. This is followed by a list of the FASTA headings for the input sequences. Next are the results from the MAP solution for each motif. The listing for each motif begins with a table having a column for each of the bases and a row for each of the positions in the motif. The numbers in the table indicate the estimated probability of occurrence of each base in each position within the motif, throughout the alignment. The last column gives the information contribution to the model of each position, with a maximum of two bits. Thus, this table gives the estimated binding pattern of the motif and a measure of the degree of conservation of each position in bits. The next portion of the results is the alignment of the motif sites, along with the probability that each site belongs to the current motif model. Immediately below the listing of motif sites is a row containing asterisks. The asterisks indicate the conserved columns of the site. In the example in Figure 5, even though the initial width of the site was entered as 16, the program fragmented the sites to a total width of 22, 16 conserved columns and six non-conserved columns. Only the conserved columns are used in the alignments. The fragmentation algorithm dynamically learns which columns are conserved. The p-value of 0.000396 from the Wilcoxon signed-rank test indicates that the solution is highly significant. In addition, the predicted sites match well with the reported CRP binding sites for these sequences (6). The MAP calculation includes terms for each component of the model: the motif model, the fragmentation, the background model and the alignment. The individual contributions of the two motif-specific terms, the motif model term and the fragmentation term are listed for each individual motif. The final full log MAP minus the log of the null MAP (MAP of the sequence with no

Figure 4. The Gibbs Recursive Sampler advanced entry page, showing modifiable parameters.

Figure 5. Sample output from Gibbs Recursive Sampler.

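
The per-position information column in the motif table described above, with its two-bit maximum, can be illustrated with the standard Shannon measure for a four-letter alphabet. The sketch below assumes a uniform reference distribution; the sampler may compute its information column differently, and the function name `column_information` is hypothetical.

```python
import math


def column_information(probs):
    """Information content of one motif column in bits (0 to 2 for DNA).

    probs maps each base to its estimated probability in the column.
    Assumes a uniform reference distribution over the alphabet.
    """
    entropy = -sum(p * math.log2(p) for p in probs.values() if p > 0)
    return math.log2(len(probs)) - entropy


if __name__ == "__main__":
    conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
    mixed = {"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    for column in (conserved, mixed, uniform):
        print(column_information(column))
```

A fully conserved column reaches the two-bit maximum, a column split evenly between two bases carries one bit, and a uniform column carries none.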