Supplementary web-site to BMC article

Identification of couples and triples of co-occurring transcription factor binding sites sequences from relative distance and GO filter

This site supports:

Igor Pesando¹

1 Dept. of Theoretical Physics, University of Turin, Italy and INFN

Via P. Giuria 1 - I-10125 Torino, Italy
ipesando@to.infn.it

Abstract

Background: Transcription regulation in eukaryotes is known to occur through the coordinated action of multiple transcription factors (TFs). The problem is therefore both to find the TF binding sites (TFBSs) and to understand how the TFs interact. The interaction can be either homotypic, involving the binding of a single TF to multiple sites in an ORF upstream, or heterotypic, involving more than one TF. Although many cases are heterotypic, most of the methods developed so far target the homotypic case.

Results: We propose the following approach to these problems. We have developed a distance distribution between TFBSs that also takes into account multiple occurrences of the same TFBS in a single ORF upstream. We have then verified the accuracy of this distance distribution on randomized data. Since this distribution is entropic in nature, it works well whether or not we consider a word, i.e. a sequence of bases, and its reverse complement to be the same BS. Using this distance distribution we have selected couples of non-overlapping words of a given length, as well as triples of non-overlapping words (of more limited length, say 5 to 7 bp per word, because of the computation time), with an FDR of about 0.01 or less with respect to the null model made of randomized upstreams. To these couples and triples we could associate sets of ORFs chosen according to where the given couples/triples appear. These sets have been filtered by means of their GO annotation using an algorithm by Corà et al., and only those with an FDR less than 0.01 were kept. The FDR has been established by randomly generating samples in which the fake sets have the same numbers of ORFs as the real data. In this way we could associate some putative GO terms to couples/triples.

We now explain the previous steps in more detail. The selection based on the distance distribution cannot be done by considering directly the probability of a configuration of words, since this probability depends on how many "interactions" there are between these words, i.e. on how many ORF upstreams contain them at the same time, on how many occurrences the words have in each ORF upstream and on how distant they are. We have therefore decided to take the number of interactions as the biologically significant quantity, and we have numerically determined which value of the probability corresponds to an FDR of about 0.01; this FDR value was chosen so as to obtain a reasonably reliable probability cutoff. Afterwards we have verified that restricting the couples/triples to those with lower distance probability improves the corresponding FDR for the ORF sets passing the GO filter. We have also verified that better FDRs are obtained when restricting our attention to more specific GO terms. We have then chosen the couples/triples so that the FDR for the ORF sets was less than 0.01.
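The numerical determination of a probability cutoff for a target FDR can be illustrated with a minimal sketch. It assumes that the FDR at a given cutoff is estimated as the average number of configurations from the randomized upstreams passing the cutoff, divided by the number of real configurations passing it; this is a common empirical estimate and not necessarily the exact procedure used in the article. The function names and the parameter n_null_runs are hypothetical.

```python
def empirical_fdr(real_probs, null_probs, cutoff, n_null_runs=1):
    """Estimated FDR at `cutoff`: expected null hits divided by real hits.

    real_probs: distance probabilities of configurations in the true upstreams.
    null_probs: the same quantity pooled over `n_null_runs` randomized sets.
    A configuration passes when its probability is <= cutoff.
    """
    real_hits = sum(1 for p in real_probs if p <= cutoff)
    if real_hits == 0:
        return 0.0
    null_hits = sum(1 for p in null_probs if p <= cutoff) / n_null_runs
    return min(1.0, null_hits / real_hits)


def choose_cutoff(real_probs, null_probs, target_fdr=0.01, n_null_runs=1):
    """Largest observed probability whose estimated FDR is <= target_fdr.

    Returns None when no observed value satisfies the target.
    """
    best = None
    for cutoff in sorted(set(real_probs)):
        fdr = empirical_fdr(real_probs, null_probs, cutoff, n_null_runs)
        if fdr <= target_fdr:
            best = cutoff
    return best
```

With a cutoff chosen in this way, every couple or triple whose distance probability is at or below the cutoff passes the distance filter.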

Conclusions: We have applied the previous approach to the yeast Saccharomyces cerevisiae. Using the method described above, based on the distance distribution, we have selected couples of non-overlapping words of length 6 and 7 bp, as well as triples of non-overlapping words of length 5, 6 and 7 bp, with an FDR of about 0.01 or less from S. cerevisiae 500 bp ORF upstreams. In the case of triples of 7 bp words the space of possibilities contains more than 5*10**10 configurations. The time required for the complete analysis of couples is of the order of some hours, while it is of the order of days for the triples of 6 bp words. To assess our method we have compared the putative couples and triples with Lee's ChIP-chip data and with the cooperativity analyses already performed in the literature. In the case of triples of 7 bp words, 363088 triples pass the distance filter with an FDR of 0.01. To obtain an FDR of 0.01 in the GO filter we selected the 290049 best triples, corresponding to the 80% of triples with the best distance probability. Of these triples, 599 could be assigned to at least one GO term whose size was less than 10% of the ORFs and with a P-value greater than 10**(-6). Most of these triples are associated with rRNA processing, nucleolus and similar terms. Accordingly, only 22 of the 599 triples are associated, with a Bonferroni-corrected P-value less than 0.05 and using ChIP-chip data, with TFs involved in the cell cycle. By far more interesting results are obtained considering triples of 5 bp words and couples of 6 bp words. In the latter case, starting with 4766 couples passing the distance filter, we find 294 couples after the GO filter, of which 85 are associated with at least one TF using ChIP-chip data. The strongest signals of our analysis are associated with the TFs RAP1-FHL1, the cell cycle TFs MBP1-SWI6-SWI4 and HIR1-HIR2. Nevertheless, we have associated 30 out of 107 TFs using the Bonferroni correction and about 60 using the Benjamini-Yekutieli one (which is stricter than the Benjamini-Hochberg one).
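The three multiple-testing corrections mentioned above can be compared with a short sketch. It uses the standard textbook definitions of the Bonferroni, Benjamini-Hochberg and Benjamini-Yekutieli adjusted P-values; the function names are hypothetical and the code is not taken from the article's pipeline.

```python
def bonferroni(pvalues):
    """Bonferroni adjustment: multiply each P-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def _step_up(pvalues, scale):
    """Step-up adjustment shared by the BH and BY procedures."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * scale / rank)
        adjusted[i] = running_min
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P-values."""
    return _step_up(pvalues, 1.0)


def benjamini_yekutieli(pvalues):
    """Benjamini-Yekutieli adjusted P-values (BH scaled by a harmonic sum)."""
    m = len(pvalues)
    harmonic = sum(1.0 / k for k in range(1, m + 1))
    return _step_up(pvalues, harmonic)
```

Because the Benjamini-Yekutieli adjustment multiplies the Benjamini-Hochberg one by a harmonic sum that is at least 1, it never accepts more associations than Benjamini-Hochberg at the same threshold.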

Supplementary files

Additional file 1

500bp upstreams of Saccharomyces cerevisiae ORFs

After eliminating duplicated upstreams, in order to avoid spurious statistical signals, we ended up with 6037 genes. For each gene we took as the upstream region a sequence of at most 500 bp upstream of the translation start site. If other genes were present in this region, we truncated the sequence to avoid overlaps.
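The truncation rule can be sketched as follows. The sketch assumes a hypothetical representation of genes as (start, end, strand) tuples with 0-based half-open coordinates on the forward axis of a single chromosome; the actual annotation format and tooling used to build the file are not described here. For genes on the "-" strand the extracted sequence would additionally have to be reverse complemented.

```python
def upstream_interval(gene, other_genes, chrom_len, max_len=500):
    """Half-open (start, end) interval of the upstream region of `gene`.

    The region spans at most `max_len` bp before the translation start site
    and is truncated where another gene enters it.
    """
    start, end, strand = gene
    if strand == "+":
        # Upstream lies to the left of the gene start.
        lo, hi = max(0, start - max_len), start
        for g_start, g_end, _ in other_genes:
            # A neighbour reaching into the window pushes its left edge right.
            if g_start < start and g_end > lo:
                lo = min(g_end, start)
        return lo, hi
    # "-" strand: upstream lies to the right of the gene end.
    lo, hi = end, min(chrom_len, end + max_len)
    for g_start, g_end, _ in other_genes:
        # A neighbour reaching into the window pushes its right edge left.
        if g_end > end and g_start < hi:
            hi = max(g_start, end)
    return lo, hi
```

For example, a "+" strand gene starting at position 1000 with a neighbouring gene ending at position 800 gets the 200 bp upstream interval (800, 1000) instead of the full 500 bp.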

Additional file 2

Tar-ball of the comparisons.

We checked the proposed distance distribution both against a set made of 2000 purely random upstreams of 200 bp and 4000 of 500 bp, and against a set obtained by randomizing the true upstreams.
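The two kinds of control sets can be generated with a minimal sketch like the one below. It assumes that "pure random" means independent bases drawn uniformly from A, C, G and T, and that the true upstreams are randomized by shuffling the bases of each sequence, which preserves its length and base composition; neither choice is specified above, so both are illustrative assumptions. The function names are hypothetical.

```python
import random


def random_upstreams(n, length, rng, alphabet="ACGT"):
    """n random sequences of the given length with uniformly drawn bases."""
    return ["".join(rng.choice(alphabet) for _ in range(length))
            for _ in range(n)]


def shuffle_upstream(seq, rng):
    """Shuffle one true upstream, keeping its length and base composition."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


rng = random.Random(0)  # fixed seed so the control sets are reproducible
pure_random = random_upstreams(2000, 200, rng) + random_upstreams(4000, 500, rng)
```

A randomized version of the real data set is then obtained by applying shuffle_upstream to each of the true upstream sequences.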