Frequently, occasionally, rarely asked and made-up questions
(NOTE: in general, we try to stand by our commitment to "never asked questions")
Q. (a real question from a real reader...)
I have recently been reading the paper published from your lab in Nature
Genetics in October 2001. On the cell cycle combinogram you show that the
GMC containing both the motifs MCB and SFF' has a high expression coherence
(~0.6). I have interpreted this result to mean that genes which have the
MCB and SFF' motif in their promoter will preferentially be expressed during
the G1 phase of the cell cycle. I assume that this indicates a cell cycle
link between the MCB and SFF' motifs. However, I do not see a direct link
between the MCB and SFF' motif on the global motif synergy map. I would
like to know if I am interpreting the data incorrectly or if this is simply
an oversight on the map.
A. Thanks for your interest in our paper, and for a very apt question. The reason there is no
link between MCB and SFF' in the general graph is that the P-value for the hypothesis that this
is a 'synergistic' pair didn't pass the required threshold. This is a very strict threshold
that corrects for the testing of multiple hypotheses: it requires the P-value of a significant
observation to be smaller than the reciprocal of the number of hypotheses (in this case,
the number of motif pairs, ~356^2).
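To give a feel for how strict this threshold is, here is a minimal sketch that computes the cutoff implied by the numbers quoted above. The function name is ours, chosen for illustration only; the test that produces the P-value itself is the one detailed in the Methods and is not reproduced here.

```python
def multiple_testing_cutoff(n_hypotheses: int) -> float:
    """Return a P-value cutoff equal to the reciprocal of the number of hypotheses."""
    if n_hypotheses <= 0:
        raise ValueError("n_hypotheses must be positive")
    return 1.0 / n_hypotheses


n_motifs = 356               # number of motifs quoted in the answer above
n_pairs = n_motifs ** 2      # approximate number of motif pairs (~356^2)
cutoff = multiple_testing_cutoff(n_pairs)

# A motif pair is drawn on the map only if its P-value falls below this cutoff.
print(f"{n_pairs} hypotheses -> P-value must be below {cutoff:.2e}")
```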
But you are right to ask how come this pair looks really good in the combinogram
yet doesn't appear in the map. The reason is that relatively few genes
have that combination, and our P-value calculation (detailed in the Methods) takes that into
account. If there were (even slightly) more genes in that column, the pair MCB-SFF' would
have passed the threshold.
Some statisticians have begun to argue (http://www.math.tau.ac.il/~ybenja/) that such a
correction is too restrictive and gives rise to many false negatives, but we were more
concerned about false positives at the time. Incidentally, the implicitly proposed
interaction between the TFs that respectively bind these two motifs, Mbp1 and Forkhead,
was in fact observed experimentally (after the publication of our paper) in a large-scale
proteome interaction survey (Nature. 2002 Jan 10;415(6868):180-3.), and further experimental
support is cited in the paper. This suggests that our strict P-value threshold may have given
rise to a false negative in this case.
Q. What's the rationale behind your approach?
A. Identification of potential cis-regulatory motifs from un-aligned upstream regions of
potentially co-regulated genes has recently been attempted successfully by many researchers
(Ref). In most paradigms genes are first clustered based on similarity of their expression
profiles (Ref), or based on their belonging to metabolic pathways or common cellular processes
(Ref). Various motif-finding algorithms are then applied to search for motifs shared among
clustered genes. A striking observation in many of these experiments was that specific
motifs could be found that are shared among many of the genes in the set, and that many of
these motifs are known to be related to the regulation of gene expression, via transcription
factor binding or chromatin remodeling.
Despite impressive success, it is still not clear what proportion of the new motifs
discovered by these methods are really biologically functional, and there is a great need to
develop computational tools that will help prioritize candidates prior to experimental
verification.
A straightforward way to assess the biological significance of new motifs would be to
reverse the above procedure, namely to test whether the genes in which the motif occurs have
similar expression profiles and/or cellular roles. In many cases the results indicate that sets
of genes that contain a given motif are not significantly similar in either expression or
function, indicating that assigning motifs to the genes in the genome usually produces a
large number of false positives. This observation holds even for experimentally established
motifs, and is probably more severe for computationally derived motif candidates.
A highly simplified probability calculation, which assumes 7 conserved positions in a typical
motif and kb of sequence in the upstream region of each of the yeast genes, indicates that each
motif should be found in ~300 genes, while in reality it is believed that transcription
factors regulate a much smaller number of genes. In other words, in many cases the existence
of a motif in a gene's promoter is at most a necessary, but not a sufficient, condition for
determining expression.
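A back-of-the-envelope version of this kind of calculation can be sketched as follows. Everything beyond the 7 conserved positions is an assumption made for illustration: the upstream length (1000 bp here), the gene count (6000), equal base frequencies, and scanning a single strand are placeholder choices, not the values used for the estimate above.

```python
def expected_genes_with_motif(n_conserved: int, upstream_bp: int, n_genes: int) -> float:
    """Expected number of genes whose upstream region contains the motif by chance.

    Assumes independent, equally likely bases and a single scanned strand.
    """
    p_site = 0.25 ** n_conserved                 # chance that one position starts a match
    n_starts = upstream_bp - n_conserved + 1     # possible start positions per promoter
    p_gene = 1.0 - (1.0 - p_site) ** n_starts    # at least one match in the promoter
    return n_genes * p_gene


# Hypothetical inputs: 1000 bp upstream and 6000 genes are illustrative assumptions.
expected = expected_genes_with_motif(n_conserved=7, upstream_bp=1000, n_genes=6000)
print(f"expected number of genes carrying the motif by chance: {expected:.0f}")
```

Under these assumed inputs the estimate comes out at a few hundred genes, the same order of magnitude as the figure quoted above.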
A likely hypothesis is that many motifs are part of a larger cluster of motifs that together
regulate gene expression. This is an appealing possibility since it may allow the integration
of multiple signals into the activation and repression of gene activity (Ref).
The notion of combinatorial regulation of gene expression is not new, and examples of
regulatory networks that are controlled by several transcription factors are known in multiple
species (Ref). Yet, at least in the yeast genome, relatively little is known about specific
systems that manifest combinatorial regulation.
We have thus launched a systematic whole-genome search for possible regulatory networks with
multiple control sites. A Gibbs sampling strategy was used to find motif sets that likely
control gene networks. We select for particular combinations that appear to control genes with
shared cellular function and expression patterns. We demonstrate that such
motif combinations may be characteristic signatures of such functions, and how
combinatorial analyses of yeast promoters may help annotate the cellular function of genes that
cannot be annotated based on sequence similarity in their coding regions.
Q. What's the intuition behind the expression coherence score? Could it be defined only for
genes that share a given motif?
A. The expression coherence score may be defined for any gene set for which you have expression
profiles, regardless of any motif information (in fact, the genes for which expression
coherence scores are calculated don't even have to share any motif). It's just a measure of
how clustered a set of genes is in expression space. Imagine
that you have an expression profile of N time points for M genes. Then each gene can be
thought of as a point in an N-dimensional space, where the i-th coordinate is the expression
level of the gene at the i-th time point. Given a set of genes, you want to ask whether they are
clustered tightly together or spread "all over the place". One way to do that would
be to calculate the "center of mass" of the cloud of genes and then sum the
distances of each gene from it (variations would be to sum the squares of such
distances, take the standard deviation around that mean, etc.). Yet this measure has a clear
shortcoming: when the gene set is split into, say, two equally sized, very tightly
clustered subsets that are remote from each other, any deviation-from-mean measure will
rate the set as poorly clustered. But this is not quite what you want, since there's still
something very special about this
gene set: it's composed of two tight sub-clusters. The intuitive reason why this gene set is
'impressive' is that out of P=M*(M-1)*0.5 gene pairs in it, p=(M/2)*((M/2)-1) pairs are close
(*to be defined below) to each other. So the ratio p/P is a good measure of how tight the
cluster is. That's our expression coherence score.
True, that score may be defined for all the genes that have a given motif or motif combination.
That does give a measure of the extent to which the motif may influence expression. This is more
so for synergistic combinations than for single motifs, though. See the picture for a more
intuitive explanation, and the expression coherence scores of individual motifs and of pairs
of motifs.
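The p/P ratio described in this answer can be sketched in a few lines. This is an illustrative implementation, not our production code: the function name, the toy profiles and the threshold value are made up, and whether a pair exactly at the threshold counts as "close" is an arbitrary choice here (strictly below).

```python
from itertools import combinations
from math import dist


def expression_coherence(profiles, threshold):
    """Fraction of gene pairs whose Euclidean distance is below `threshold`."""
    pairs = list(combinations(profiles, 2))      # all P = M*(M-1)/2 gene pairs
    if not pairs:
        raise ValueError("need at least two expression profiles")
    close = sum(1 for a, b in pairs if dist(a, b) < threshold)   # the p close pairs
    return close / len(pairs)


# Toy gene set: two tight sub-clusters that are far from each other (M = 6, N = 3).
cluster_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0)]
cluster_b = [(5.0, 5.0, 5.0), (5.1, 5.0, 5.0), (5.0, 5.1, 5.0)]

score = expression_coherence(cluster_a + cluster_b, threshold=0.5)
print(f"expression coherence of the split set: {score:.2f}")
```

For this toy set only the within-cluster pairs are close, so p = (M/2)*((M/2)-1) = 6 out of P = 15 pairs. The set still receives a substantial score even though its members are far from their common center of mass.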
Q. What's the definition of a "short" distance between expression profiles of two genes?
A. The threshold distance, D, for a given expression experiment is defined as follows: 100
genes are randomly sampled from the entire genome and the Euclidean distances between their
normalized expression profiles for all possible 100*99*0.5 gene pairs are calculated. D is
defined as the 5th percentile of the distribution of these distances, i.e. the value below
which the shortest 5% of the distances fall.
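The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the profiles are taken to be already normalized, the percentile is computed with the simple nearest-rank rule (the exact rule is not specified above), and the function name, parameters and random toy data are hypothetical.

```python
import math
import random
from itertools import combinations


def distance_threshold(profiles, n_sample=100, percentile=5.0, seed=None):
    """Estimate D from pairwise distances among randomly sampled profiles."""
    if len(profiles) < 2:
        raise ValueError("need at least two expression profiles")
    rng = random.Random(seed)
    sample = rng.sample(profiles, min(n_sample, len(profiles)))
    # Euclidean distances for all pairs in the sample (100*99*0.5 = 4950 by default).
    distances = sorted(math.dist(a, b) for a, b in combinations(sample, 2))
    # Nearest-rank percentile: smallest distance with at least `percentile`% at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(distances)))
    return distances[rank - 1]


# Toy "genome": 500 random 10-point profiles standing in for normalized expression data.
data_rng = random.Random(0)
toy_genome = [tuple(data_rng.gauss(0.0, 1.0) for _ in range(10)) for _ in range(500)]

D = distance_threshold(toy_genome, seed=0)
print(f"threshold distance D = {D:.3f}")
```

Two genes would then be called "close" when the distance between their profiles falls below D.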
Q. Did you calculate the expression coherence score within a condition or across conditions?
A. For each condition alone.
Q. A basic question. In figure 1B, what does the x-axis mean? I was under the impression
that you treat discrete time points, so how can you get continuous lines?
A. That's right! We analyze discrete time-point series, so the x-axis is just the sequence of
time points. We (and the entire community, I
guess) got used to plotting those profiles as continuous time profiles. All we do, say in
Matlab, is take a matrix of expression levels of genes by time points and then use 'plot'
to produce profiles such as those in figure 1B (a more accurate representation would have been
discrete bars).