
Edited by Barry H. Honig, Columbia University, New York, NY, and approved November 1, 2008 (received for review July 9, 2008)

A.C., A.A.F., and G.S. contributed equally to this work.

## Abstract

ChIP-on-chip has emerged as a powerful tool to dissect the complex network of regulatory interactions between transcription factors and their targets. However, most ChIP-on-chip analysis methods use conservative approaches aimed at minimizing false-positive transcription factor targets. We present a model with improved sensitivity in detecting binding events from ChIP-on-chip data. Its application to human T cells, followed by extensive biochemical validation, reveals that 3 oncogenic transcription factors, NOTCH1, MYC, and HES1, bind to several thousand target gene promoters, up to an order of magnitude more than conventional analysis methods detect. Gene expression profiling upon NOTCH1 inhibition shows broad-scale functional regulation across the entire range of predicted target genes, establishing a closer link between occupancy and regulation. Finally, the increased sensitivity reveals a combinatorial regulatory program in which MYC cobinds to virtually all NOTCH1-bound promoters. Overall, these results suggest an unappreciated complexity of transcriptional regulatory networks and highlight the fundamental importance of genome-scale analysis to represent transcriptional programs.

Keywords: regulatory networks; T cell lymphoblastic leukemia; transcriptional regulation; systems biology

The dysregulated activity of oncogenic transcription factors (TFs) contributes to neoplastic transformation by promoting aberrant expression of target genes involved in regulating cell homeostasis. Therefore, characterization of the regulatory networks controlled by these TFs is a critical objective in understanding the molecular mechanisms of cell transformation. ChIP-on-chip (ChIP2) (1) has emerged as a promising technology in the dissection of transcriptional networks by providing high-resolution maps of genome-wide TF–chromatin interactions.

ChIP2 uses microarray technology to measure the relative abundance of genomic fragments derived from an immunoprecipitate (IP) sample, which is enriched in fragments bound by an immunoprecipitated protein (usually a TF), and a whole-cell extract (WCE) sample, containing fragments derived from a total chromatin preparation (input control) or an immunoprecipitation with a nonspecific control antibody (2). The 2 samples may either be hybridized to different arrays or labeled with different dyes and hybridized to the same array. Correct interpretation of ChIP2 data depends critically on an accurate statistical model to compute the probability that a given IP/WCE ratio is produced by a binding event rather than experimental noise.

Recently, several elegant ChIP2 analysis methods have been proposed to tackle problems such as integrating measurements from adjacent probes (3–6) or inferring binding site locations at subprobe resolution (7). However, the lower-level problem of developing an accurate error model to define meaningful statistical thresholds has received comparably little attention (see SI and Fig. 1). Thus, ChIP2 data analysis methods often use highly conservative approaches aimed at minimizing the rate of false-positive predictions. Although several studies have experimentally validated novel target collections produced at a given statistical threshold (8–12), these studies likely miss a large number of true binding events, obscuring the full complexity of transcriptional processes.

Fig. 1. Modeling errors of methods that use whole-dataset statistics for either normalization or significance detection. Blue bars represent a histogram of log2 IP/WCE probe ratio values from a MYC ChIP2 experiment. The histogram displays distinct, overlapping distributions for bound and unbound probes. The dotted red curve shows the log2 ratio values after mean centering, a common normalization technique that, for this experiment, adjusts the mean of the null distribution to be negative to compensate for the large number of high-ratio values for the bound probes. The green curve represents a Gaussian fitted to the overall distribution, demonstrating that analysis methods that fit a global error model to these data will significantly overestimate the variance of the null distribution and will incur a high false-negative rate, as shown by the black arrow, which represents 2 standard deviations from the mean of the green curve.

Using an empirically determined model of the distribution of intensity ratios for non-IP-enriched probes in ChIP2 experiments, we developed an analytical method called ChIP2 Significance Analysis (CSA). When applied to ChIP2 data from the NOTCH1, MYC, and HES1 protooncogenes in human T cell acute lymphoblastic leukemia (T-ALL) cells, CSA increased the number of detected binding sites by up to an order of magnitude compared with other routinely used methods. Both binding site analysis and biochemical validation demonstrate quantitative agreement with CSA-predicted false-positive rates. Analysis of gene expression signatures indicates functional regulation by NOTCH1 across the entire range of predicted targets. Finally, the increased sensitivity reveals that virtually all NOTCH1-bound promoters are also bound by MYC. Overall, these results highlight the power of the proposed analysis framework for the identification of transcriptional networks and provide an improved and fundamentally different picture of the transcriptional programs controlled by NOTCH1, HES1, and MYC in T-ALL.

## Results

## Probe Statistics Are Accurately Modeled by CSA.

T-ALL is a malignant tumor characterized by the aberrant activation of oncogenic TFs (13). We recently demonstrated that constitutive activation of NOTCH1 signaling due to mutations in the NOTCH1 gene activates a transcriptional network that controls leukemic cell growth (11, 14–16). These studies also demonstrated a fundamental role for HES1 and MYC as transcriptional mediators of NOTCH1 signals (15, 17). To characterize the structure of the oncogenic transcriptional network driven by activated NOTCH1 in T cell transformation, we sought to identify the direct transcriptional targets of NOTCH1, HES1, and MYC. We hypothesized that the development of an accurate statistical model would result in improved sensitivity in the identification of TF targets and a more accurate description of the individual and combinatorial regulatory programs controlled by these TFs.

We first generated an empirical model of the distribution of IP/WCE intensity ratios for probes associated with unbound fragments (see Materials and Methods), and we used it to assign a P value to each probe in the analysis of ChIP2 assays representing replicate experiments for NOTCH1, MYC, and HES1. ChIP2 assays for these TFs were performed in HPB-ALL cells, a well-characterized T-ALL cell line with high expression levels of activated NOTCH1, MYC, and HES1. For NOTCH1, ChIP2 assays were also performed in CUTLL1 cells, another NOTCH1-dependent T-ALL cell line. The magnitude versus amplitude plots (Fig. 2A) of the intensity-dependent distributions of probe-ratio values showed marked differences for the four experiments. In each case CSA accurately modeled the left tail of the probe ratio probability distribution, where the contribution from bound probes is expected to be minimal (Fig. 2 A and B). We note that if bound-probe ratios are well separated from the experimental noise, the P value distribution for all probes should be uniform between zero and one (unbound probes) with a single peak near zero (bound probes). Importantly, CSA accurately captured these statistical properties (see SI).

Fig. 2. CSA determination of ChIP2 target genes. (A) Magnitude (M) versus amplitude (A) plots with confidence intervals inferred by CSA. The x axis represents the amplitude, calculated as the average log2 intensity of the IP and WCE channels. The y axis represents the magnitude, calculated as the log2 ratio of IP/WCE. The black line represents the intensity-dependent mean of the inferred null distribution, and the colored lines represent confidence intervals corresponding to P values of 0.1, 0.01, and 0.001. Because we are only interested in positive-valued probes, confidence intervals are computed based on a 1-tail test, and statistically significant probes lie above the upper confidence interval lines. For ease of visualization we plot the lower confidence interval lines as 1 minus the corresponding P value. As shown, for all 3 TFs a large number of probes are significantly enriched in the IP channel, and MYC displays substantially more enrichment. (B) Graphic representation of the inferred distribution of P(M|A = 11). The blue curve represents the empirical conditional distribution of M computed at the particular value of A = 11. The dotted black line represents the inferred mean of the null distribution, and the dotted red line represents the inferred null distribution. (C) Magnitude versus amplitude plots with colors representing −log10 P values of the CSA-inferred null distribution. As expected, the model reveals an intensity-dependent mean and variance of the null distribution, with increased variance at low-intensity levels, as well as sometimes for extremely high-intensity levels due to saturation effects.

## Improved ChIP2 Sensitivity by CSA.

CSA then incorporates the probe significance model with an analytical method that integrates the statistics for replicate experiments and probes with nearby genomic locations (to account for ChIP2 fragmentation lengths, see Materials and Methods). We used CSA to compute the false discovery rate (FDR) associated with the most significant 500-bp region on each of the 16,697 promoters represented on the array. Analysis of NOTCH1, MYC, and HES1 promoter occupancy in T-ALL showed a larger than anticipated number of candidate target genes for these TFs. Specifically, using CSA at a conservative FDR of 0.05, the numbers of promoters on the array bound by the TFs in this study are: MYC (8,016; 48.0%), NOTCH1 in CUTLL1 (3,154; 18.9%), HES1 (3,074; 18.4%), and NOTCH1 in HPB-ALL (2,471; 14.8%) (Table 1).

Table 1. Number of predicted target genes for various methods

Because the numbers reported above are far larger than the number of predicted targets commonly reported in ChIP2 analysis studies, we also compared them against predictions from several published analysis methods with available software. One class of methods relies heavily on analyzing the shape of ratio values from multiple probes in close genomic proximity (3–7). These methods are generally not applicable to the relatively sparse arrays used in this study and produced very few predicted targets. As an initial benchmark, we compared against the Single Array Error Model (SAEM) (1, 18), the standard method packaged with the Agilent analysis software. This method models the intensity-dependent variance of probe ratio values, but then computes significance based on whole-dataset statistics. CSA predicted approximately an order of magnitude more bound promoters than SAEM (Table 1). Two published methods, ChIPOTle (19) and Chipper (8), compute significance using only probes with low ratio values, and these methods indeed predicted more targets than SAEM. However, both methods use normalization techniques that, to varying extents, rely on whole-dataset statistics, and as a result CSA predicted more targets than both. This was most apparent for MYC, which contained the largest number of high-ratio probes. For the other TFs, CSA predicted roughly the same number of targets as ChIPOTle and twice as many targets as Chipper (Table 1).

## Accuracy of CSA Predictions Is Supported by Binding Site Enrichment Analysis.

As a first test of the broad TF binding predictions generated by CSA, we evaluated the enrichment of MYC binding sites, using the TRANSFAC (20) position-specific scoring matrix M00322, in the promoters of target genes identified by CSA and other analysis methods. The DNA-binding component of NOTCH1 transcriptional complexes, CSL, is not represented in TRANSFAC or JASPAR (21), and the only HES1-associated matrix was found to be of low quality and a poor predictor of HES1 binding, independent of the algorithm used. For each analysis method, promoters were ranked by their P values computed from the MYC ChIP2 experiment, and MYC/M00322 matching sites were identified in the 600-bp fixed-length window centered on the most significant probe in the highest-scoring promoter region. The match threshold was set so that a negative set, S(−), of 3,000 fragments showing the least amount of MYC binding would produce a false-positive rate of 30%. Details on the procedure are given in the SI.

Analysis of the cumulative proportion of MYC/M00322-matching fragments as a function of their ChIP2 ranking by the corresponding method showed that fragments inferred by all methods were enriched in MYC/M00322 sites and that site enrichment was correlated with the ChIP2 ranking (Fig. 3). However, fragments identified by CSA were more likely to contain MYC binding sites than those identified by the other methods. The highest-scoring ≈2,000 sites identified by Chipper performed better than those identified by CSA; however, CSA-identified fragments beyond this ranking were significantly more enriched in MYC binding sites. Comparing nonoverlapping fragments in the top 5,000 promoters inferred by CSA versus each of the other 3 methods demonstrated a statistically significant enrichment of MYC binding sites in CSA-inferred fragments ($P < 10^{-10}$, based on the hypergeometric distribution, for each comparison).

Fig. 3. The percentages of identified sequences containing a binding site for MYC are plotted as a function of the total number of rank-ordered sequences, using a threshold that yields a 30% false-positive rate. (Inset) For bins of 100 genes ranked by CSA, we computed MYC binding site enrichment P values relative to a background of unbound promoter fragments (solid blue curve). The x axis represents the center of each bin. For each bin we also approximated the expected percent of bound genes as $(\mathrm{FDR}_r \cdot r - \mathrm{FDR}_l \cdot l)/(r - l)$, where $\mathrm{FDR}_r$ and $\mathrm{FDR}_l$ represent the CSA-inferred FDRs for the genes at the right and left edges of the bin, respectively, and $r$ and $l$ represent their ranks. The dotted red curve displays this quantity, which is in excellent agreement with the sequence-based enrichment P values.

To compare the false-positive rate predicted by CSA with the significance of MYC binding site enrichment, we binned the fragments based on their CSA rankings (100/bin) and assessed whether the MYC/M00322 motif could be successfully used to distinguish the fragments in each bin from those in the negative set, S(−). The classification P value based on binding site enrichment was in excellent agreement with the CSA-inferred false-positive rate of each bin, suggesting significant enrichment of MYC sites in the promoters of ≈7,000 genes, corresponding to the range of high-confidence targets predicted by CSA (Fig. 3 Inset). Beyond this threshold, both quantities degraded very rapidly, and for ranks greater than ≈8,000 the CSA-inferred false-positive rate reached 100% and, correspondingly, fragments showed no ability to be classified by MYC binding sites.
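The per-bin rate used in this comparison can be illustrated with a minimal sketch of the expression given in the Fig. 3 legend. It assumes the reading that FDR times rank is the expected number of false positives among the top-ranked genes, so the difference across a bin, divided by the bin width, is the expected false-positive fraction within that bin. The function name and the example numbers are hypothetical.

```python
def bin_false_positive_fraction(fdr_left, rank_left, fdr_right, rank_right):
    """Expected false-positive fraction for genes ranked between two bin edges.

    Assumes fdr * rank approximates the expected number of false positives
    among the top `rank` genes (an interpretation, not a quoted definition).
    """
    if rank_right <= rank_left:
        raise ValueError("rank_right must be greater than rank_left")
    expected_fp_in_bin = fdr_right * rank_right - fdr_left * rank_left
    return expected_fp_in_bin / (rank_right - rank_left)


# Hypothetical bin of 100 genes between ranks 1,000 and 1,100.
fraction = bin_false_positive_fraction(0.05, 1000, 0.06, 1100)
print(round(fraction, 4))
```

A small change in cumulative FDR across a bin can therefore correspond to a much larger local false-positive fraction, which is why the bin-level quantity rises quickly at high ranks.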

Overall, these results suggest that CSA has increased sensitivity in identifying a larger number of binding events and that meaningful statistical cutoffs can be determined from data.

## Experimental Validation of CSA TF Binding Predictions.

To further test the accuracy of CSA-based TF target predictions, we performed independent chromatin immunoprecipitation (ChIP) experiments for each of the 4 ChIP2 conditions and tested the IP enrichment of specific promoters by quantitative PCR (qPCR). We first analyzed 8 predicted NOTCH1 targets in HPB-ALL cells, randomly sampled at an FDR ≤20%. Seven of these 8 predicted fragments were validated as bound by NOTCH1, and only the least significant fragment failed validation (Table 2).

Table 2. Validation of predicted targets at 20% FDR

We tested an additional 12 targets each for HES1 and MYC in HPB-ALL and for NOTCH1 in CUTLL1, sampling predicted targets uniformly at an FDR of 20% (i.e., 20% expected false positives) (Table 2). In this analysis, 26 of 36 targets (72.2%) were positive by ChIP/qPCR, and 9 (25%) were negative. The remaining gene (the second least significant for MYC) could not be amplified by qPCR. Nonvalidated (false-positive) targets were, in general, at the end of the ranked lists (Table 2). The only outlier was the first-ranked fragment for HES1 (KIAA1407 gene promoter). To obtain experimental evidence on the robustness of our validation assay, we randomly selected 10 genomic regions not identified as bound by MYC and 10 not identified as bound by HES1. Nine selected regions were within promoters, and 11 were in intergenic regions. As expected, none of these 20 regions showed evidence of binding by MYC or HES1 when tested by ChIP/qPCR.

For all experiments, numerous validated genes had CSA ranks in the thousands. The lowest-ranking validated genes before encountering a false positive were as follows: 2,223 for NOTCH1 in CUTLL1; 2,958 for NOTCH1 in HPB-ALL; 4,901 for MYC; and although the top-ranking gene for HES1 failed validation, the following 7, down to rank 3,247, were positive. Notably, many of the validated targets showed subtle ChIP2 signals. For example, C6orf82, a validated HES1 target, had ChIP2 binding ratios in replicate experiments of 1.37 and 1.68 for the most significant probe in its promoter, and there was no enrichment (ratios of 0.81 and 1.15) for its adjacent probe. However, upon ChIP/qPCR validation, this region showed binding ratios of 2.69 and 4.65. ChIP/qPCR results are available in the SI.

Overall, 33 of the 44 genes (75%) selected from those with an FDR of 20% by CSA were validated by ChIP/qPCR. These biochemical validation results support our computationally derived conclusions regarding the broad range of binding for all tested TFs and demonstrate the power of CSA for reducing the false-negative rate in ChIP2 experiments.

## NOTCH1 Regulates Direct Target Genes Predicted by CSA.

To test whether CSA-predicted NOTCH1-bound genes are also functionally regulated by this TF, we treated a panel of 10 T-ALL cell lines with Compound E, a γ-secretase inhibitor that blocks an essential proteolytic cleavage step required for release of the intracellular domains of NOTCH1 from the membrane and their translocation to the nucleus (22). Genome-wide expression profiles of cells treated for 72 h with Compound E (100 nM) or vehicle only (DMSO) were measured using microarrays, and expression changes were compared with NOTCH1 promoter occupancy identified by CSA analysis of the ChIP2 data. Overall, 11,606 genes were represented on both the ChIP2 and the expression arrays. For each gene we computed: (i) the ChIP2 FDR based on the highest-scoring 500-bp region in its promoter; (ii) the log2 expression ratio of the control versus treatment, averaged over the 10 cell lines and duplicate experiments; and (iii) the number of microarray experiments in which the gene was expressed (not called absent by MAS5), considering both Compound E-treated and DMSO-treated samples (because the group of expressed genes is essentially the same for both treatments, considering expressed genes using only one subgroup does not substantially change the results).

Predicted NOTCH1-bound genes were more likely to be expressed than genes not identified as bound by NOTCH1 and showed clear down-regulation upon NOTCH1 inhibition (Fig. 4). The 2,000 most confident NOTCH1 targets (FDR < 0.058) were expressed in 83.3% of experiments, whereas the 6,000 least confident NOTCH1 targets were expressed in 38.8% of experiments ($P < 10^{-100}$). The top 2,000 targets also showed coordinated down-regulation upon NOTCH1 inhibition that was subtle in magnitude (mean = 12.2%) but extremely significant ($P < 10^{-100}$). The ChIP2 analysis predicted a rapid increase in false positives beyond the top 2,000 targets and, correspondingly, their likelihoods to be expressed and regulated by NOTCH1 decreased. However, even for genes with ChIP2 ranks between 4,000 and 5,000, there was significant enrichment for both the percent of expressed genes (59.4%; $P < 10^{-33}$) and the expression change upon NOTCH1 inhibition ($P < 10^{-15}$). These results demonstrate that, in contrast with previous analysis based on a limited number of targets (17), NOTCH1 directly contributes to the transcriptional activity of thousands of genes.

Fig. 4. Regulation of NOTCH1 target genes as a function of ChIP2 rank. Genes are ranked according to their ChIP2 FDR, plotted in green, as inferred by CSA (FDRs are displayed with a maximum value of 1). The blue curve displays the median log2 expression ratio of vehicle control compared with Compound E treatment across bins of 250 genes, with the 95% confidence interval plotted in red. Positive values indicate down-regulation upon NOTCH1 inhibition. The x axis represents the center of each bin. The heat map above the plot displays the average percent of experiments in which the genes in the corresponding bin are expressed. Expression change upon NOTCH1 inhibition and the percent of expressed genes are both highly correlated with the ChIP2 ranking, and they remain significantly enriched for more than 5,000 predicted targets.

## Interaction of NOTCH1 and MYC Regulatory Networks.

NOTCH1 and MYC operate as highly interrelated regulators of cell growth, proliferation, and survival during T cell development and transformation. In a recent study (11), we compared the regulatory networks controlled by NOTCH1 and MYC by using the ARACNE reverse engineering algorithm (23, 24) to predict 58 and 61 targets of NOTCH1 and MYC, respectively, and observed a significant overlap of 12 genes between the 2 lists ($P < 10^{-51}$). We went on to characterize a feed-forward loop in which NOTCH1 directly regulates MYC, and both TFs regulate a common set of targets promoting leukemic cell growth. Based on these findings, we sought to further investigate the relationship between the genes bound by MYC and by NOTCH1 using the much larger list of targets inferred by CSA. Strikingly, the analysis predicted that MYC bound to 1,668 of the 1,804 (92.5%; $P < 10^{-11}$) genes that were bound by NOTCH1, using a ChIP2 FDR threshold of 0.01. In agreement with the fundamental role of NOTCH1 in controlling leukemia cell growth (11), the NOTCH1-bound genes were highly enriched in gene ontology (GO) (25) categories related to cellular growth and metabolism, such as cellular metabolism ($P < 10^{-41}$), RNA metabolism ($P < 10^{-24}$), and protein biosynthesis ($P < 10^{-9}$). The complete output of the GO enrichment analysis is given in the SI.

## Discussion

We have shown that the choice of a realistic statistical model can dramatically affect the result of ChIP2 data analysis and its biological interpretation, and we have proposed the CSA algorithm to assign meaningful statistical significance scores that predict a more complete range of TF–target interactions. The method of assessing probe statistical significance relies on minimal assumptions: that the null distribution is symmetric and that bound fragments do not significantly affect the left tail of the null hypothesis statistics. As a result, it should generalize well to ChIP2 experiments performed using other platforms and cellular conditions. We used an independence model for replicate experiments and adjacent probes in the null hypothesis. Although this assumption is valid for relatively sparse arrays, denser arrays may introduce correlation between unbound nearby probes that are within the DNA fragmentation length, leading to unrealistically low FDR values if independence is assumed. We therefore recommend verifying that the independence assumption holds when analyzing denser arrays; where it does not, the CSA method may be further improved by incorporating existing, more sophisticated models for the integration of data from nearby probes (3–7). However, for the arrays used in this study, which have an average probe spacing of more than 200 nucleotides, we show that our results are in quantitatively good agreement with biochemical validation assays and that no correction seems to be required.

The analysis of ChIP2 data from 3 oncogenic TFs reveals that CSA identifies far more bound gene promoters than standard analyses. Specifically, CSA predicts that each studied TF binds to several thousand target genes, with MYC binding to roughly half of the assayed promoters, providing additional insight into the extreme pluripotency of this protooncogene (26). These predictions might still be an underestimate, because only the proximal promoter regions (−0.8 kb to +0.2 kb, relative to the transcription start site) are represented on the arrays used in this study.

CSA predictions were validated by 3 independent tests. ChIP/qPCR experiments are in excellent correspondence with CSA-inferred FDRs, especially considering that ChIP/qPCR itself has a 10–20% false-negative rate (9–12). Computational validation by sequence analysis further indicates that CSA-inferred FDRs are in agreement with MYC binding site enrichments. Finally, gene expression analysis after NOTCH1 inhibition both provides further support for the CSA predictions and indicates a stronger than expected association between bound and regulated genes. We found that NOTCH1 binds to a large number of promoters (>2,000) and that the set of corresponding genes is consistently, albeit weakly, regulated upon NOTCH1 inhibition. These results are highly consistent with a previous study performed in yeast (27) that also observed correspondence of ChIP2 results with both binding site enrichment and expression changes for a large number of genes.

GO enrichment analysis shows that NOTCH1 subtly regulates a large number of genes involved in the cellular growth machinery. These results add an additional layer of regulation to the effects of NOTCH1 signaling in promoting cell growth, with important implications for understanding the role of NOTCH1 signaling in development and transformation. Thus, in addition to the established role of NOTCH1 in promoting growth through its interaction with MYC (17) and the PI3K-AKT (15) signaling pathway, NOTCH1 also has a direct effect in promoting cell growth. This irreversibly couples the developmental programs involved in stem cell homeostasis and lineage commitment activated upon NOTCH1 activation with the metabolic pathways needed for the expansion of stem cells and T cell progenitors.

Finally, the availability of a more complete repertoire of bound promoters allows us to truly assess the extent of a TF's regulatory program and the combinatorial overlap between independent programs. Our analysis shows that 92.5% of the promoters bound by NOTCH1 are also bound by MYC. Indeed, it appears that NOTCH1 coregulates a specific subset of the MYC regulatory program. Although this was previously hinted at by the similarity of the regulatory programs inferred for the 2 TFs by expression analysis (17), the true extent of this overlap can only be grasped after resolving a more complete map of NOTCH1 and MYC targets. While contributing to our understanding of transcriptional regulation at the genome scale, our findings suggest an even greater than expected complexity of transcriptional networks.

## Materials and Methods

## CSA Algorithm.

The CSA algorithm takes as input probe intensity measurements after background subtraction and correction for factors such as spatial position on the array and print tip variability. In this study we used the standard Agilent procedure to obtain background-corrected probe intensity values, and we determined the statistical significance of binding regions by using the procedure described below.

## Single-Probe Significance Analysis.

For each probe, the statistical significance of a measurement is inferred by computing the conditional probability of the magnitude (M) given the amplitude (A), $P_{\mathrm{null}}(M|A)$, where $M = \log_2(\mathrm{IP}/\mathrm{WCE})$ and $A = [\log_2(\mathrm{IP}) + \log_2(\mathrm{WCE})]/2$, under the null hypothesis (i.e., no enrichment in the IP compared with the WCE channel). Here, IP and WCE represent, respectively, the probe intensity measurements for the IP and WCE channels. The dependency of M on A is illustrated in Fig. 2A.
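The two quantities defined above can be computed per probe as in the following minimal sketch. The function name is hypothetical, and the inputs are assumed to be positive, background-corrected intensities.

```python
import math


def magnitude_amplitude(ip, wce):
    """Return (M, A) for one probe from background-corrected intensities.

    M = log2(IP / WCE) and A = [log2(IP) + log2(WCE)] / 2, as defined in the text.
    """
    if ip <= 0 or wce <= 0:
        raise ValueError("intensities must be positive to take logarithms")
    magnitude = math.log2(ip / wce)
    amplitude = (math.log2(ip) + math.log2(wce)) / 2
    return magnitude, amplitude


# Hypothetical probe with a 4-fold IP/WCE ratio.
m, a = magnitude_amplitude(1024.0, 256.0)
print(m, a)
```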

The method begins by estimating the joint probability distribution, $P(M,A)$, using a bivariate Gaussian kernel density estimator (28). The kernel width of the estimator is calculated using the AMISE criterion (29). Conditioning on A yields the conditional distribution $P(M|A) = P(M,A)/P(A)$, where $P(A)$ is calculated using a univariate Gaussian kernel. For a particular average intensity value, $A_0$, the conditional mean of the null distribution is inferred as

The conditional null distribution given $A = A_0$ is inferred by projecting $P(M|A = A_0)$ across $\hat{\mu}_{M|A_0}$ for $M < \hat{\mu}_{M|A_0}$. This procedure is used to calculate $P_{\mathrm{null}}(M|A)$ for an evenly spaced grid of A and M values, excluding the 1% of probes with the lowest A values (which are assigned a P value of 1). In this work we used step sizes of 0.05 and 0.01 for the A and M values, respectively. The complete conditional null distribution, $P_{\mathrm{null}}(M|A)$, is computed using 2-dimensional linear interpolation. For each probe, statistical significance is assessed using a 1-tailed test with reference to this distribution. Because the distribution is empirical, there is a limit to the inferable minimum P value, which depends on the number of arrayed probes. For the arrays used in this study, we set the minimum P value to $10^{-5}$, which is roughly 1 divided by the number of probes on the array. We stress the importance of using an empirical distribution because we have observed that the empirical data generally display significantly non-Gaussian tails.
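The reflection idea can be illustrated with a deliberately simplified sketch. It is not the kernel-density procedure described above: it assumes a single amplitude slice, works on raw M values instead of a smoothed density, and takes the null center `mu` as an input supplied by the caller. The function name and the sample values are hypothetical.

```python
from bisect import bisect_left, bisect_right


def reflected_null_pvalue(m, left_sorted, mu, min_p=1e-5):
    """One-tailed P value of m under a null built by mirroring values below mu.

    left_sorted: sorted M values with M < mu (assumed to be unbound probes).
    The null sample is left_sorted plus its mirror image 2*mu - x.
    """
    n = len(left_sorted)
    if n == 0:
        raise ValueError("need at least one value below mu")
    # A mirrored value 2*mu - x is >= m exactly when x <= 2*mu - m.
    mirrored_ge = bisect_right(left_sorted, 2 * mu - m)
    # Original left-side values that are >= m (nonzero only when m < mu).
    left_ge = n - bisect_left(left_sorted, m)
    p = (mirrored_ge + left_ge) / (2 * n)
    # Floor the P value, since an empirical null cannot resolve below ~1/N.
    return max(p, min_p)


left = sorted([-0.4, -0.3, -0.2, -0.1])  # hypothetical left-tail values
print(reflected_null_pvalue(0.25, left, mu=0.0))
```

Using only the left tail keeps bound probes, which inflate the right tail, from widening the estimated null, which is the failure mode illustrated in Fig. 1.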

## Combining Replicates.

We use Fisher's method (30) to compute an aggregate P value for each probe based on measurements from replicate experiments, under the null hypothesis that the probe is unbound in all replicates. Let $p_{ij}$ denote the P value computed for the $i$th probe in the $j$th replicate experiment. Assuming that replicates are independent under the null hypothesis, a test statistic for evaluating the probability of a joint observation of P values across experiments is the product of the individual P values, $\bar{s}_i = \prod_{j=1}^{M} p_{ij}$, where $\bar{s}_i$ is the test statistic and $M$ is the number of replicate experiments. If modeled correctly, P values under the null hypothesis should be uniformly distributed (see SI). It is useful to log-transform this equation such that we evaluate $-\log(\bar{s}_i) = -\sum_{j=1}^{M} \log(p_{ij})$. Because the negative logarithm of a uniformly distributed variable is exponentially distributed with mean 1, this expression is a sum of $M$ exponentially distributed random variables, which is a gamma-distributed random variable with unit scale and $M$ degrees of freedom (shape parameter $M$). Thus, significance can be evaluated from $\Gamma^{\mathrm{CDF}}_{M}\left(-\sum_{j=1}^{M} \log(p_{ij})\right)$, where $\Gamma^{\mathrm{CDF}}_{M}$ is the gamma cumulative distribution function with unit scale and $M$ degrees of freedom; the combined P value is the corresponding upper-tail probability, $1 - \Gamma^{\mathrm{CDF}}_{M}$.
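The combination step can be sketched with the standard library alone. The sketch relies on the closed-form upper tail of a unit-scale gamma distribution with integer shape (the Erlang distribution), which is equivalent to the usual chi-square formulation of Fisher's method. The function names are hypothetical.

```python
import math


def gamma_upper_tail(x, shape):
    """Upper-tail probability of a unit-scale gamma with integer shape.

    Uses the Erlang identity: exp(-x) * sum_{k=0}^{shape-1} x**k / k!.
    """
    term = 1.0
    total = 1.0
    for k in range(1, shape):
        term *= x / k
        total += term
    return math.exp(-x) * total


def fisher_combine(p_values):
    """Combine independent P values: statistic is -sum(log p), gamma shape is the count."""
    if not p_values:
        raise ValueError("need at least one P value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("P values must lie in (0, 1]")
    statistic = -sum(math.log(p) for p in p_values)
    return gamma_upper_tail(statistic, len(p_values))


# Two hypothetical replicates, each with P = 0.01 for the same probe.
print(fisher_combine([0.01, 0.01]))
```

With a single replicate the function returns the input P value unchanged, which is a convenient sanity check on the upper-tail convention.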

## Combining Locations.

Because of sonication, the signal derived from a binding event may be detected by multiple probes in close genomic proximity to the binding site. To compute a combined statistic representing the probability of a binding event within the region spanned by multiple probes, we adapt a commonly used strategy (19) of using a fixed-size sliding window and integrating the values of probes falling within this window. Based on published measurements of fragmentation lengths (7), in this work we used a 500-bp window and a step size of 150 bp. Assuming that measurements from adjacent probes are independent under the null hypothesis, Fisher's method can again be applied to integrate the values from nearby probes. That is, let W represent the set of probes falling within a given 500-bp window. The integrated probability for this region is then calculated as $\Gamma^{\mathrm{CDF}}_{|W|}\left(-\sum_{i \in W} \log(p_i)\right)$, where $p_i$ is the aggregate P value of probe i and $|W|$ is the number of probes in the window.

To compute the probability that any region within a gene's promoter is bound, we consider the most significant window, controlling for multiple tests using Bonferroni correction based on the number of probes in the promoter. This correction is not exact, for two reasons that act in opposite directions. The number of tests (i.e., the number of windows containing unique subsets of probes) is likely greater than the number of probes in a promoter, causing underestimation of the corrected P value, and the tests are not independent (i.e., the same probe may fall within multiple windows), causing overestimation of the corrected P value. However, because the number of probes in each promoter (and therefore the number of probes within each window) is relatively small, especially for the arrays used in this study, we expect this simplification to have little impact on the calculated statistics. For very dense arrays, a more sophisticated multiple-test correction procedure, such as those described in (31), may yield more accurate results.
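
A minimal sketch of the window scan for a single promoter is given below. The 500-bp window, 150-bp step, and Bonferroni factor (number of probes in the promoter) come from the text; the choice to anchor the first window at the leftmost probe, the half-open window boundaries, the function names, and the toy coordinates are assumptions made for illustration.

```python
import math


def fisher_combine(p_values):
    """Fisher's method: upper tail of a gamma(shape=n, scale=1) distribution."""
    x = -sum(math.log(p) for p in p_values)
    term, total = 1.0, 1.0
    for k in range(1, len(p_values)):
        term *= x / k
        total += term
    return math.exp(-x) * total


def promoter_p_value(probes, window=500, step=150):
    """Bonferroni-corrected P value of the most significant window.

    probes: list of (position_bp, p_value) pairs for one promoter.
    Windows are half-open intervals [start, start + window), with the first
    window anchored at the leftmost probe (an assumption of this sketch).
    """
    if not probes:
        return 1.0
    positions = [pos for pos, _ in probes]
    start, last = min(positions), max(positions)
    best = 1.0
    while start <= last:
        in_window = [p for pos, p in probes if start <= pos < start + window]
        if in_window:
            best = min(best, fisher_combine(in_window))
        start += step
    # Bonferroni correction by the number of probes, capped at 1.
    return min(1.0, best * len(probes))


# Toy promoter: two nearby probes with weak signal, one distant probe without.
toy = [(0, 0.05), (200, 0.05), (2000, 0.9)]
corrected = promoter_p_value(toy)
```

In the toy promoter the two adjacent probes share the first window, so their combined evidence is stronger than that of either probe alone before the correction is applied.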

## FDR Calculation.

After computing a corrected P value for each gene representing the probability that the most significant region on the gene's promoter is bound, we control for multiple tests across genes and compute a false discovery rate using the Benjamini–Hochberg procedure (32). Let $p_k$ represent the corrected P value computed for gene k, let $r_k$ represent the rank of gene k when genes are sorted in ascending order of their ChIP2 P values, and let G represent the total number of genes on the array; then, the false discovery rate for gene k is computed as $\mathrm{FDR}_k = G \cdot p_k / r_k$.
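
The formula can be applied as in the following sketch. The function name and toy P values are illustrative. Two details are assumptions of the sketch rather than statements from the text: values are capped at 1, and tied P values receive consecutive ranks in input order. The monotonicity adjustment found in many Benjamini–Hochberg implementations is deliberately left out because the formula above does not include it.

```python
def fdr_per_gene(p_values):
    """Compute FDR_k = G * p_k / r_k for every gene.

    p_values: corrected per-gene P values, in any order.
    Returns a list of FDR values in the same order as the input.
    """
    g = len(p_values)
    # Indices of genes sorted by ascending P value; rank 1 is most significant.
    order = sorted(range(g), key=lambda k: p_values[k])
    fdr = [0.0] * g
    for rank, k in enumerate(order, start=1):
        fdr[k] = min(1.0, g * p_values[k] / rank)  # cap at 1 (assumption)
    return fdr


# Four hypothetical genes.
fdr = fdr_per_gene([0.01, 0.04, 0.03, 0.5])
```

Genes would then be called bound when their FDR falls below a chosen threshold; the threshold itself is a separate analysis decision.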

## Acknowledgments

A.A.M. was supported by an IBM PhD fellowship. This work was supported by National Cancer Institute Grants R01CA109755 (to A.C.) and R01CA120196 (to A.A.F.), National Institute of Allergy and Infectious Diseases Grant R01AI066116, the Alex Lemonade Stand Foundation (T.P.), The Cancer Research Institute, the WOLF Foundation, the National Centers for Biomedical Computing National Institutes of Health Roadmap Initiative (U54CA121852), and the Leukemia and Lymphoma Society (Grants 1287-08 and 6237-08). A.A.F. is a Leukemia and Lymphoma Scholar.

## Footnotes

3To whom correspondence may be addressed. E-mail: gustavo{at}us.ibm.com, califano{at}c2b2.columbia.edu, or af2196{at}columbia.edu

Author contributions: A.A.M., P.S., A.C., A.A.F., and G.S. designed research; A.A.M., T.P., and P.S. performed research; T.P. and A.A.F. contributed new reagents/analytic tools; A.A.M. and P.S. analyzed data; and A.A.M., T.P., P.S., A.C., A.A.F., and G.S. wrote the paper.

1Present address: The Broad Institute of MIT and Harvard, 7 Cambridge Center, Cambridge, MA 02142.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The microarray data have been deposited in the Gene Expression Omnibus (GEO) Database, www.ncbi.nlm.nih.gov/geo (accession no. GSE12868). ChIP2 data are available at http://wiki.c2b2.columbia.edu/califanolab/PNASAM2009/.

This article contains supporting information online at www.pnas.org/cgi/content/full/0806445106/DCSupplemental.

Freely available online through the PNAS open access option.

© 2008 by The National Academy of Sciences of the USA

## References

1. Ren B, et al. (2000) Genome-wide location and function of DNA binding proteins. Science 290:2306–2309.
2. Buck MJ, Lieb JD (2004) ChIP-chip: Considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83:349–360.
3. Glynn EF, et al. (2004) Genome-wide mapping of the cohesin complex in the yeast Saccharomyces cerevisiae. PLoS Biol 2:E259.
4. Kim TH, et al. (2005) A high-resolution map of active promoters in the human genome. Nature 436:876–880.
5. Li W, Meyer CA, Liu XS (2005) A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences. Bioinformatics 21(Suppl 1):i274–i282.
6. Zheng M, Barrera LO, Ren B, Wu YN (2007) ChIP-chip: Data, model, and analysis. Biometrics 63:787–796.
7. Qi Y, et al. (2006) High-resolution computational models of genome binding events. Nat Biotechnol 24:963–970.
8. Gibbons FD, Proft M, Struhl K, Roth FP (2005) Chipper: Discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization. Genome Biol 6:R96.
9. Lee TI, et al. (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298:799–804.
10. Polo JM, et al. (2007) Transcriptional signature with differential expression of BCL6 target genes accurately identifies BCL6-dependent diffuse large B cell lymphomas. Proc Natl Acad Sci USA 104:3207–3212.
11. Palomero T, et al. (2006) Transcriptional regulatory networks downstream of TAL1/SCL in T-cell acute lymphoblastic leukemia. Blood 108:986–992.
12. Odom DT, et al. (2004) Control of pancreas and liver gene expression by HNF transcription factors. Science 303:1378–1381.
13. Ferrando AA, et al. (2002) Gene expression signatures define novel oncogenic pathways in T cell acute lymphoblastic leukemia. Cancer Cell 1:75–87.
14. Palomero T, et al. (2006) Activating mutations in NOTCH1 in acute myeloid leukemia and lineage switch leukemias. Leukemia 20:1963–1966.
15. Palomero T, et al. (2007) Mutational loss of PTEN induces resistance to NOTCH1 inhibition in T-cell leukemia. Nat Med 13:1203–1210.
16. Weng AP, et al. (2004) Activating mutations of NOTCH1 in human T cell acute lymphoblastic leukemia. Science 306:269–271.
17. Palomero T, et al. (2006) NOTCH1 directly regulates c-MYC and activates a feed-forward-loop transcriptional network promoting leukemic cell growth. Proc Natl Acad Sci USA 103:18261–18266.
18. Hughes TR, et al. (2000) Functional discovery via a compendium of expression profiles. Cell 102:109–126.
19. Buck MJ, Nobel AB, Lieb JD (2005) ChIPOTle: A user-friendly tool for the analysis of ChIP-chip data. Genome Biol 6:R97.
20. Matys V, et al. (2003) TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Res 31:374–378.
21. Sandelin A, et al. (2004) JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 32:D91–D94.
22. Miele L (2006) Notch signaling. Clin Cancer Res 12:1074–1079.
23. Basso K, et al. (2005) Reverse engineering of regulatory networks in human B cells. Nat Genet 37:382–390.
24. Margolin AA, et al. (2006) ARACNE: An algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7(Suppl 1):S7.
25. Ashburner M, et al. (2000) Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29.
26. Pelengaris S, Khan M (2003) The many faces of c-MYC. Arch Biochem Biophys 416:129–136.
27. Tanay A (2006) Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res 16:962–972.
28. Beirlant J, Dudewicz E, Gyorfi L, van der Meulen E (1997) Nonparametric entropy estimation: An overview. Int J Math Stat Sci 6:17–39.
29. Sheather SJ, Jones MC (1991) A reliable data-based bandwidth selection method for kernel density estimation. J R Stat Soc Ser B 53:683–690.
30. Mosteller F, Fisher RA (1948) Questions and answers. Am Stat 2:30–31.
31. Keles S, van der Laan MJ, Dudoit S, Cawley SE (2006) Multiple testing methods for ChIP-Chip high density oligonucleotide array data. J Comput Biol 13:579–613.
32. Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300.