Frequently, occasionally, rarely asked and made-up questions

(NOTE: in general, we try to stand by our commitment to "never asked questions")

Q. (a real question from a real reader...)
I have recently been reading the paper published from your lab in Nature Genetics in October 2001. On the cell cycle combinogram you show that the GMC containing both the motifs MCB and SFF' has a high expression coherence (~0.6). I have interpreted this result to mean that genes which have the MCB and SFF' motifs in their promoters will preferentially be expressed during the G1 phase of the cell cycle. I assume that this indicates a cell cycle link between the MCB and SFF' motifs. However, I do not see a direct link between the MCB and SFF' motifs on the global motif synergy map. I would like to know if I am interpreting the data incorrectly or if this is simply an oversight on the map.

A. Thanks for your interest in our paper, and for a well-posed question. The reason there is no link between MCB and SFF' in the general graph is that the P-value for the hypothesis that this is a 'synergistic' pair did not pass the required threshold. This is a very strict threshold that corrects for the testing of multiple hypotheses: it requires the P-value of a significant observation to be smaller than the reciprocal of the number of hypotheses (in this case, the number of motif pairs, ~356^2). But you are right to ask how come this pair looks really good in the combinogram and yet doesn't appear in the map. The reason is that relatively few genes have that combination, and our P-value calculation (detailed in the Methods) takes that into account. If there were (even slightly) more genes in that column, the pair MCB-SFF' would have passed the threshold. Some statisticians have begun to argue (http://www.math.tau.ac.il/~ybenja/) that such a correction is too restrictive and gives rise to many false negatives, but we were more concerned about false positives at the time. Incidentally, the implicitly proposed interaction between the TFs that respectively bind these two motifs, Mbp1 and Forkhead, was in fact observed experimentally (after the publication of our paper) in a large-scale proteome interaction survey (Nature. 2002 Jan 10;415(6868):180-3.), and further experimental support is cited in the paper. This suggests that our strict P-value threshold may have given rise to a false negative in this case.
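
To make the cutoff concrete, here is a minimal Python sketch of the rule described in this answer: a P-value counts as significant only if it is smaller than the reciprocal of the number of hypotheses tested. The function name and the two example P-values are illustrative assumptions, not values from the paper; only the hypothesis count of ~356^2 comes from the answer above.

```python
def passes_strict_threshold(p_value, n_hypotheses):
    """Return True if p_value is below 1 / n_hypotheses.

    This is the strict multiple-hypothesis cutoff described in the text;
    the function name itself is hypothetical.
    """
    return p_value < 1.0 / n_hypotheses


# Number of motif pairs quoted in the answer (~356^2).
n_pairs = 356 ** 2
cutoff = 1.0 / n_pairs

# Two made-up P-values, one on each side of the cutoff.
print(n_pairs)                                   # 126736
print(passes_strict_threshold(1e-6, n_pairs))    # True
print(passes_strict_threshold(1e-5, n_pairs))    # False
```

With about 127,000 hypotheses, a pair with a P-value of 1e-5 would look convincing in isolation and still be left off the map, which is the kind of false negative discussed above.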

Q. What's the rationale behind your approach?
A. Identification of potential cis-regulatory motifs from unaligned upstream regions of potentially co-regulated genes has recently been attempted successfully by many researchers (Ref). In most paradigms, genes are first clustered based on similarity of their expression profiles (Ref), or based on their membership in metabolic pathways or common cellular processes (Ref). Various motif-finding algorithms are then applied to search for motifs shared among the clustered genes. A striking observation in many of these experiments was that specific motifs could be found that are shared among many of the genes in a set, and that many of these motifs are known to be related to the regulation of gene expression, via transcription factor binding or chromatin remodeling. Despite this impressive success, it is still not clear what proportion of the new motifs discovered by these methods are really biologically functional, and there is a great need for computational tools that help prioritize candidates prior to experimental verification. A straightforward way to assess the biological significance of new motifs is to reverse the above procedure, namely to test whether the genes in which a motif occurs have similar expression profiles and/or cellular roles. In many cases the results indicate that sets of genes containing a given motif are not significantly similar in expression or in function, indicating that assigning motifs to the genes in the genome usually results in a large number of false positives. This observation holds even for experimentally established motifs, and is probably more severe for computationally derived motif candidates. A highly simplified probability calculation, which assumes 7 conserved positions in a typical motif and kb of sequence in the upstream region of each of the yeast genes, indicates that each motif should be found in ~300 genes, while in reality it is believed that transcription factors regulate a much smaller number of genes.
In other words, in many cases the existence of a motif in a gene's promoter is at most a necessary, but not a sufficient, condition for determining expression. A likely hypothesis is that many motifs are part of a larger cluster of motifs that together regulate gene expression. This is an appealing possibility since it may allow the integration of multiple signals into the activation and repression of gene activity (Ref). The notion of combinatorial regulation of gene expression is not new, and examples of regulatory networks that are controlled by several transcription factors are known in multiple species (Ref). Yet, at least in the yeast genome, relatively little is known about specific systems that manifest combinatorial regulation. We have thus launched a systematic whole-genome search for possible regulatory networks with multiple control sites. A Gibbs sampling strategy was used to find motif sets that likely control gene networks. We select for particular combinations that appear to control genes with shared cellular function and expression patterns. We demonstrate that such motif combinations may be characteristic signatures of those functions, and show how such combinatorial analyses of yeast promoters may help annotate the cellular function of genes that cannot be annotated based on sequence similarity in their coding regions.

Q. What's the intuition behind the expression coherence score? Could it be defined only for genes that share a given motif?

A. The expression coherence score may be defined for any gene set for which you have expression profiles, regardless of any motif information (in fact, the genes for which expression coherence scores are calculated don't even have to share any motif). It's just a measure of how clustered a set of genes is in expression space. Imagine that you have an expression profile of N time points for M genes. Then each gene can be thought of as a point in an N-dimensional space, where the i-th dimension holds the expression level of the gene at the i-th time point. Given a set of genes, you want to ask whether they are clustered tightly together or spread "all over the place". One way to do that is to calculate the "center of mass" of the cloud of genes and then sum the distances of each gene from it (variations include summing the squares of such distances, taking the standard deviation around that mean, etc.). Yet this measure has a clear shortcoming: when the gene set is split into, say, two equally sized, very tightly clustered subsets that are remote from each other, any deviation-from-mean measure will rate the set as poorly clustered. But this is not quite what you want, since there's still something very special about this gene set: it's composed of two tight sub-clusters. The intuitive reason why this gene set is 'impressive' is that out of the P=M*(M-1)*0.5 gene pairs in it, p=(M/2)*((M/2)-1) pairs are close (*to be defined below) to each other. So the ratio p/P is a good measure of how tight the cluster is. That's our expression coherence score. True, that score may be defined for all the genes that have a given motif or motif combination. That does give a measure of the extent to which the motif may influence expression. It is more so with synergistic combinations than with single motifs, though. See picture for a more intuitive explanation, and for expression coherence scores of individual motifs and pairs of motifs.
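
The p/P ratio can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the paper: the function names, the toy two-point profiles, and the threshold value of 1.0 are all made up for the example. Distances are Euclidean, and a pair counts as "close" when its distance is below the threshold.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def expression_coherence(profiles, threshold):
    """Fraction p/P of gene pairs whose profiles are closer than threshold.

    profiles: list of expression profiles, one per gene.
    threshold: the "short distance" cutoff (hypothetical value in this demo).
    """
    pairs = list(combinations(profiles, 2))   # all P = M*(M-1)/2 pairs
    if not pairs:
        return 0.0
    close = sum(1 for a, b in pairs if euclidean(a, b) < threshold)
    return close / len(pairs)


# Toy data: M = 6 genes forming two tight, well-separated sub-clusters.
cluster_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
cluster_b = [(10.0, 10.0), (10.1, 10.0), (10.0, 10.1)]

score = expression_coherence(cluster_a + cluster_b, threshold=1.0)
print(round(score, 3))   # 0.4
```

Here P = 6*5*0.5 = 15 and p = 3*2 = 6, so the score is 0.4 even though the six genes sit far from their common center of mass.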

Q. What's the definition of a "short" distance between the expression profiles of two genes?

A. The threshold distance, D, for a given expression experiment is defined as follows: 100 genes are randomly sampled from the entire genome, and the Euclidean distances between their normalized expression profiles are calculated for all 100*99*0.5 possible gene pairs. D is defined as the 5th percentile of the distribution of these distances, so only the closest 5% of randomly chosen pairs count as "short".
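
The procedure for D can be sketched in Python as follows. Treat this as a rough illustration rather than the original implementation: the function and parameter names are hypothetical, the profiles are assumed to be normalized already (the normalization itself is not shown), and the percentile is taken with a simple nearest-rank rule, which is one of several reasonable conventions.

```python
import math
import random
from itertools import combinations


def threshold_distance(profiles, sample_size=100, percentile=5.0, seed=None):
    """Estimate the "short distance" cutoff D from randomly sampled genes.

    profiles: list of already-normalized expression profiles (assumption).
    sample_size: number of genes to sample (100 in the text).
    percentile: which percentile of pairwise distances to return (5 in the text).
    seed: optional seed for reproducibility (hypothetical convenience parameter).
    """
    rng = random.Random(seed)
    sample = rng.sample(profiles, min(sample_size, len(profiles)))

    # Euclidean distance for every pair in the sample (100*99*0.5 = 4950 pairs).
    distances = sorted(
        math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        for a, b in combinations(sample, 2)
    )

    # Nearest-rank percentile: smallest distance with at least
    # `percentile` percent of all distances at or below it.
    rank = max(1, math.ceil(percentile / 100.0 * len(distances)))
    return distances[rank - 1]


# Toy usage with 500 random 10-point profiles (made-up data).
data_rng = random.Random(0)
toy_profiles = [tuple(data_rng.gauss(0.0, 1.0) for _ in range(10))
                for _ in range(500)]
D = threshold_distance(toy_profiles, seed=1)
```

Pairs of genes whose profiles are closer than D would then be the "close" pairs counted in the expression coherence score.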

Q. Did you calculate the expression coherence score within a condition or across conditions?

A. Within each condition separately.

Q. A basic question. In figure 1B, what does the x-axis mean? I was under the impression that you treat discrete time points, so how can you get continuous lines?

A. That's right! We analyze discrete time-point series. We (and the entire community, I guess) have gotten used to plotting those profiles as continuous time profiles. All we do, say in MATLAB, is build a matrix of expression levels of genes by time points and then use 'plot' to produce profiles such as those in figure 1B, which connects consecutive time points with straight lines. (A more accurate representation would have been discrete bars.)