While microarray-based expression profiling has facilitated
the use of computational methods to find potential TF binding sites, few current in silico
approaches can explicitly link computationally discovered regulatory motifs
with the transcription factors that bind them.
We have thus developed a TF-centric clustering (TFCC) algorithm that may
provide such missing information through incorporation of biological knowledge
about TFs.
TFCC is a semi-supervised clustering algorithm that relies on the
assumption that the expression profiles of some TFs are related to those of
the genes under their control.
We examined this premise and found that the neighborhoods of TFs in expression
space are often enriched for the genes they regulate.
Therefore, instead of clustering genes by the mutual similarity of their
expression profiles, we used TFs as seeds to group together genes whose
expression patterns correlate with that of a particular TF. A Gibbs sampling
algorithm was then applied to search for shared cis-regulatory
elements in the promoters of the clustered genes. Our
working hypothesis was that if a TF-centric cluster indeed contains many targets
of the seeding TF, at least one of the discovered motifs would be the site bound
by the very same TF. We tested the
TFCC approach on eight TFs that regulate the cell cycle and sporulation and whose
binding sites have been previously characterized in Saccharomyces cerevisiae, and
correctly identified binding site motifs for half of them. In
addition, we made de novo predictions
for some previously unknown TF binding sites.
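
The seeding step described above can be illustrated with a minimal sketch. It is not the TFCC implementation: it assumes Pearson correlation as the similarity measure and a fixed cutoff, and the function name tf_centric_cluster, the parameter min_corr, and the toy data are all hypothetical. The semi-supervised refinement and the subsequent Gibbs sampling motif search are not shown.

```python
import numpy as np

def tf_centric_cluster(expr, gene_names, tf_name, min_corr=0.7):
    """Group genes whose expression profile correlates with a seed TF.

    expr: genes x conditions matrix (one row per gene, the TF included).
    min_corr: hypothetical Pearson cutoff; the source does not state one.
    """
    names = list(gene_names)
    seed = names.index(tf_name)
    x = np.asarray(expr, dtype=float)
    # Center each profile and scale it to unit length, so that the dot
    # product of two rows equals their Pearson correlation coefficient.
    centered = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    norms[norms == 0] = np.nan  # flat profiles have no defined correlation
    unit = centered / norms[:, None]
    corr = unit @ unit[seed]
    # Keep every gene (other than the seed) at or above the cutoff;
    # NaN correlations fail the comparison and are dropped.
    members = [(names[i], float(corr[i]))
               for i in range(len(names))
               if i != seed and corr[i] >= min_corr]
    return sorted(members, key=lambda m: -m[1])

# Toy data (invented for illustration): four profiles over four conditions.
names = ["TF1", "geneA", "geneB", "geneC"]
expr = [
    [1.0, 2.0, 3.0, 4.0],   # TF1, the seed
    [2.0, 4.0, 6.0, 8.5],   # rises with TF1
    [4.0, 3.0, 2.0, 1.0],   # anti-correlated with TF1
    [1.0, 1.0, 1.0, 1.0],   # flat, correlation undefined
]
cluster = tf_centric_cluster(expr, names, "TF1")
print([gene for gene, _ in cluster])
```

In this sketch the promoters of the returned genes would then be passed to a motif finder. Whether anti-correlated genes (possible targets of a repressor) should also be retained is a design choice left open here.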