Correlation signature of the macroscopic states of the gene


Edited by H. Eugene Stanley, Boston University, Boston, MA, and approved January 16, 2009 (received for review November 6, 2008)

In This Issue - Mar 17, 2009


Although cancer types differ substantially, many cancers share common gene expression signatures. Consistent with this observation, we find convergent and representative distributions and correlation vectors that are distinct in cancer and noncancer ensembles. These differences originate in many genes, but comparatively few genes account for the major differences. We identify genes with different combinatorial regulation in cancer and noncancer as indicated by significant differences in their correlation vectors. Among the identified genes are many established oncogenes and apoptotic genes (such as members of the Bcl-2, the MAPK, and the Ras families) and new candidate oncogenes. Our findings expand and complement the tumorigenic role of up- and down-regulation of these genes by emphasizing cancer-specific changes in their couplings and correlation patterns at the genome-wide level that are independent of their mean levels of expression in cancer cells. Given the central role of these genes in defining the cancerous state, it may be worth investigating them and the differences in their combinatorial regulation for developing wide-spectrum anticancer drugs.

Keywords: canonical distributions, combinatorial regulation, correlation vectors

Analysis and clustering of gene expression datasets have identified numerous molecular events accompanying malignant transformation (1, 2). Many of the transformation events are specific to subsets of tissues and cancer types (3, 4). Indeed, gene expression in cancer cell lines reflects their ostensible tissues of origin (5). Furthermore, gene expression profiles differ significantly in different cancers (6, 7) and help identify and subdivide even cancers previously assigned to the same histopathological type (3, 4).

However, many cancer types are believed to share a common gene expression signature (8, 9). If indeed this is true, it suggests some underlying "near-universal" cellular dysfunction that leads to cancer. The cancer signatures that are common to many cancer types might reflect either convergent evolution (due to selection of proliferative and metastatic phenotypes) or a transition to one of many predefined genetic programs [attractor states (10) of the gene regulatory network] found in embryonic developmental processes, which, occurring in the wrong context, bestows a malignant phenotype upon the cell. The latter possibility would imply that the cancerous transformation is not just a random sequence of mutations selected based on their proliferative and metastatic advantages but a regulated process leading to hardwired cellular phenotypes (10). If this hypothesis is correct, the identification and characterization of gene expression signatures common to many cancer types might suggest an approach for effectively altering the malignant phenotype.

The methods developed for identifying such signatures include comparison of gene expression levels in cancer and noncancer, using a variety of techniques such as machine learning and classification approaches, TSP and TSPG (9). Still, features common to many cancer types are neither easily nor reliably detected by classification approaches (11, 12) or direct clustering (13) of expression data. This may be in part due to the techniques becoming swamped by numerous differences (rather than finding the commonalities) among cancer types and the limited number of analyzed datasets (11, 14). Furthermore, differences between cancer types can be incidental (idiopathic) mutations that arise because of the intrinsic genomic instability of cancer cells. Such mutations are highly variable between different cancer types and irrelevant to proliferation and metastasis processes themselves.

To avoid these difficulties, we have studied the pairwise gene-gene correlations (and their organization) computed by averaging across thousands of gene expression datasets representing many cancer types. Such averaging integrates thousands of expression datasets and emphasizes trends common to cancer types while at the same time canceling (averaging out) inconsequential differences and features specific to individual cancer types. To go beyond the simplest pairwise correlations and look for cancer-specific correlation signatures, we compared the correlation vectors and the clusters of correlation vectors in separate subensembles of data drawn from cancer and noncancer ensembles. This approach allows us to identify cancer-specific correlations (and their organization at multiple scales) that may not be evident in the changes of the expression levels of individual genes.


We divided the National Center for Biotechnology Information (NCBI) gene expression profiles from the HG-U133A gene microarray into 2 groups: (i) noncancer, 2,512 expression datasets; (ii) cancer, 2,239 expression datasets. (Details are given in Materials and Methods.) These 2 groups constituted our 2 ensembles over which all subsequent averages were taken. We then calculated pairwise (Pearson) correlations among all (N = 22,283) reported U133A probes* by averaging across cells from many tissue types. For the ith and the jth genes with expression vectors $x_i$ and $x_j$ the correlation is $\rho_{ij} \equiv (\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle)/(\sigma_{x_i} \sigma_{x_j})$, with $\sigma_{x_i} \equiv \sqrt{\langle (x_i - \langle x_i \rangle)^2 \rangle}$. Here and throughout the article angular brackets denote the arithmetic average, $\langle x \rangle = (1/M) \sum_{i=1}^{M} x_i$, where M is the number of observations across which the averaging is done to compute the correlations. The 2 distributions of pairwise correlations for the 2 ensembles (Fig. 1A) converged to 2 highly reproducible probability density functions. (The convergence process is illustrated in Fig. 1B.) Very similar convergence to these stable distributions is observed when the expression datasets are (randomly) assembled into bootstrap† subensembles (thus allowing overlap between subensembles) and when the expression datasets are subdivided into orthogonal subensembles without overlap. The results shown below are for subensembles containing 1,000 expression datasets. (This size offers a good compromise between convergence, overlap, and reproducibility. Smaller samples (500 datasets) give noisier but otherwise very similar results.) Given this convergence, it seems likely that these 2 distributions (Fig. 1A) contain canonical information on differences between cancer and noncancer in system-wide gene regulation.‡ Large-scale differences between them, if they can be analyzed, would imply some unifying concepts for cancer itself.
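The Pearson correlation defined above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions: the names `pearson` and `correlation_matrix` and the toy expression values are hypothetical and are not taken from the article, and an analysis of 22,283 probes would call for an array library rather than nested lists.

```python
import math

def pearson(x, y):
    """Pearson correlation: (<xy> - <x><y>) / (sigma_x * sigma_y)."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum(a * b for a, b in zip(x, y)) / m - mean_x * mean_y
    sigma_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / m)
    sigma_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / m)
    return cov / (sigma_x * sigma_y)

def correlation_matrix(expression):
    """expression[g] holds the levels of gene g across the M datasets."""
    n = len(expression)
    return [[pearson(expression[i], expression[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical toy data: 3 genes measured in 5 datasets.
toy = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with gene 0
    [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as gene 0 rises
]
rho = correlation_matrix(toy)
```

Row i of `rho` is the set of pairwise correlations of gene i with every gene in the toy table, which is the object the later analysis treats as a correlation vector.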

Fig. 1.

Distributions of pairwise correlations ($\rho_{ij}$) for subensembles from cancer, noncancer, and fully randomized expression data (A) and convergence of the distributions (B).

The ordered set of pairwise correlations between the ith and the other N genes (represented on the U133A gene microarray) may also be thought of as a correlation vector $v_i(\{j\}) \equiv (\rho_{i1}, \ldots, \rho_{ij}, \ldots, \rho_{iN})$, denoted as $v_i^{(n)}$ and $v_i^{(c)}$ for noncancer and cancer subensembles, respectively. In biological terms, $v_i$ captures a combinatorial pattern of covariation between the ith gene and all other genes that may reflect synthetic (synergistic and/or antagonistic) genetic interactions. It transpires that the differences between cancer and noncancer are highlighted more strongly by these correlation vectors and related quantities. The first such quantity is the length of $v_i$, $\|v_i\|^2 \equiv \sum_{j=1}^{N} \rho_{ij}^2$, which reflects the overall strength of correlations or couplings of the ith gene to all other genes. In Fig. 2A we see that for noncancer subensembles the distribution ($\bar{\rho}_c$ = 0.89 ± 0.01) of $\|v_i\|^2$ is shifted to higher values compared with the distribution for cancer subensembles. (The quantification of the reproducibility of this and all subsequent results is described in Methods.) This shift can indicate either that genes in cancer are less coupled to all other genes or that the cancer types are more variable, for example because of genomic instability. Subsequent results (based on the collinearity and proximity between the ith cancer vectors for different cancer subensembles) suggest that the difference in coupling is likely to be a consequence of gene regulatory couplings in addition to genome instability. To quantify the difference in the coupling of the ith gene in cancer and in noncancer we define the fractional change in coupling: $\Delta C_i = (v_i^{(n)} \cdot v_i^{(n)} - v_i^{(c)} \cdot v_i^{(c)})/(v_i^{(c)} \cdot v_i^{(c)})$; here and throughout the article the dot product of vectors $\vec{x}$ and $\vec{y}$ is $\vec{x} \cdot \vec{y} \equiv x^T y \equiv \sum_i x_i y_i$. The distribution ($\bar{\rho}_c$ = 0.84 ± 0.02) of $\Delta C$ (Fig. 2B) possesses a long tail to higher values of $\Delta C$. Genes belonging to this tail are coupled much more strongly in noncancer compared with cancer tissues.
For example, among the genes with $\Delta C \geq 1$ (meaning that their coupling to all other genes is at least 2 times stronger in noncancer compared with cancer cells) there is a diverse set of highly overrepresented gene ontology (GO) terms;§ that is, genes with these GO functions are much more commonly represented than expected in an equal-size, randomly assembled set of genes. Such overrepresented GO terms include multicellular organismal process, cell–cell signaling, response to stimulus, signal transduction, cell proliferation, and cell death (Dataset S1). This set of genes includes many receptors (such as epidermal growth factor receptors, insulin-like growth factor receptors, chemokine receptors, tumor necrosis factor receptors, colony stimulating factor receptors) mediating cell growth, differentiation, and proliferation signals. Another prominent group of genes in this set are members of the melanoma antigen family and other oncogenes. See SI Appendix for a full list of the genes and the highly enriched GO terms. The correlation vectors, $v_i = (\rho_{i1}, \ldots, \rho_{iN})$, and their distributions can be analyzed further. For example, the normalized projection of a correlation vector on the sum of unit vectors corresponding to all other genes, $\bar{v}_i = (1/N) \sum_{j=1}^{N} \rho_{ij}$, has an interesting bimodal distribution ($\bar{\rho}_c$ = 0.92 ± 0.01). Thus, we find in both the cancer and noncancer subensembles (see Fig. 2C) 2 clearly defined peaks. As demonstrated in the section on $v_i$ clusters, the smaller peak corresponds to a large cluster ℂ of highly positively correlated genes whose correlation vectors are close to each other (in the Euclidean sense). The implication is that those genes are correlated to all others in a fairly similar manner. In turn this suggests a large-scale universal modular machinery, shared by all cell types, with genes of noncancer cells more strongly correlated to this module. We also find (Fig. 2D) that within the noncancer ensemble many more genes have correlation vectors with higher variances relative to the cancer ensemble, suggesting that noncancer cells possess more differentiated and distinctly regulated gene-gene correlations. Again, this might point toward a connection between cancer and forms of system-level dysregulation.
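The per-gene quantities discussed above (squared length, fractional change in coupling, mean projection, and variance of a correlation vector) can be illustrated as follows. This is a sketch under assumptions: the function names and the toy vectors are hypothetical, and the variance is taken as the mean of squares minus the squared mean of the vector components.

```python
def vector_stats(v):
    """Squared length, mean projection, and variance of one correlation vector."""
    n = len(v)
    length_sq = sum(r * r for r in v)          # ||v_i||^2 = sum_j rho_ij^2
    mean_proj = sum(v) / n                     # (1/N) sum_j rho_ij
    variance = length_sq / n - mean_proj ** 2  # <rho^2> - <rho>^2
    return length_sq, mean_proj, variance

def fractional_coupling_change(v_noncancer, v_cancer):
    """Delta C_i = (v_n . v_n - v_c . v_c) / (v_c . v_c)."""
    nn = sum(r * r for r in v_noncancer)
    cc = sum(r * r for r in v_cancer)
    return (nn - cc) / cc

# Hypothetical correlation vectors of one gene in the two ensembles.
# Every noncancer correlation is twice the cancer one.
v_n = [0.8, -0.6, 0.4, 0.2]
v_c = [0.4, -0.3, 0.2, 0.1]

delta_c = fractional_coupling_change(v_n, v_c)
length_sq, mean_proj, variance = vector_stats(v_n)
```

In this toy case the squared length is four times larger in noncancer, so `delta_c` is close to 3 and the gene would sit in the long tail of Fig. 2B.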

Fig. 2.

Correlation vectors. Error bars correspond to the standard deviations from 10 subensembles. (A) Length (norm). (B) Fractional difference in coupling, $\Delta C_i$. (C) Projection on the body diagonal (mean). (D) Variance.

So far we have identified differences between cancer and noncancer by focusing on aggregate statistics (distributions of all correlations, and couplings, projections, and variances for the correlation vectors) calculated within the 2 ensembles. To further differentiate the 2 ensembles while emphasizing gene identities, we explore measures that more directly compare corresponding $\rho_{ij}$ correlations in the cancer and noncancer ensembles. For example, the similarity (or rather dissimilarity) between the ith correlation vectors (and therefore their corresponding correlations) of 2 subensembles ($s_1$ and $s_2$) can be measured by the "correlation angle" $\rho_i(s_1,s_2)$ and the Euclidean distance $D_i(s_1,s_2)$ between the correlation vectors $v_i^{(s_1)}$ and $v_i^{(s_2)}$: $\rho_i(s_1,s_2) \equiv (\langle v_i^{(s_1)} v_i^{(s_2)} \rangle - \langle v_i^{(s_1)} \rangle \langle v_i^{(s_2)} \rangle)/(\sigma_{v_i^{(s_1)}} \sigma_{v_i^{(s_2)}})$, and $D_i(s_1,s_2) \equiv (v_i^{(s_1)} - v_i^{(s_2)}) \cdot (v_i^{(s_1)} - v_i^{(s_2)})$. Recall that each such vector corresponds to a specific gene (the vector components being its correlations within a given ensemble), and the distance or angle between vectors associated with the same gene (the 2 separate vectors being calculated in the cancer and noncancer ensembles) is therefore a measure of the differences between the correlations ($\rho_{ij}$) of this gene in cancer and noncancer. In biological terms, differences between $v_i^{(c)}$ and $v_i^{(n)}$ (quantified by the angle and the distance between the vectors) reflect different combinatorial regulation of the ith gene in cancer and in noncancer.
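The two comparison measures can be sketched as below. The function names and the toy vectors are hypothetical. The distance follows the definition as printed in the text (the dot product of the difference vector with itself, with no square root taken), which is an interpretive choice on our part.

```python
import math

def correlation_angle(v1, v2):
    """Pearson correlation between two correlation vectors of the same gene."""
    n = len(v1)
    m1, m2 = sum(v1) / n, sum(v2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(v1, v2)) / n
    s1 = math.sqrt(sum((a - m1) ** 2 for a in v1) / n)
    s2 = math.sqrt(sum((b - m2) ** 2 for b in v2) / n)
    return cov / (s1 * s2)

def distance(v1, v2):
    """(v1 - v2) . (v1 - v2), following the definition in the text."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

# Hypothetical correlation vectors of one gene in three subensembles.
cancer_a = [0.50, -0.20, 0.10, 0.70]
cancer_b = [0.45, -0.25, 0.15, 0.65]   # second cancer subensemble
noncancer = [-0.30, 0.60, 0.20, -0.10]

within = correlation_angle(cancer_a, cancer_b)
between = correlation_angle(cancer_a, noncancer)
```

With these toy numbers `within` is close to 1 while `between` is negative, mimicking the pattern in which vectors agree inside an ensemble and diverge across ensembles.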

In Fig. 3A we see a very pronounced difference (as evidenced by a quite different distribution of angles) between cancer and noncancer. Thus, most correlation vectors are collinear within either the cancer or noncancer ensembles, but have different directions when comparing cancer and noncancer subensembles. Indeed, gene-gene correlations are typically different in cancer and noncancer subensembles, pointing toward very different macroscopic behaviors. Furthermore, given that the gene-gene correlations for different subensembles within the cancer ensemble are very similar, it seems unlikely that cancer-induced genomic instability is the only origin of the difference between cancer and noncancer. That is, if the differences arose from an intrinsic tendency of cancer cells to rearrange their genomes, then we might expect different cancers to lead to different rearrangements, and then $\rho_{ij}$ correlations within different cancer subensembles would also be quite different.

Fig. 3.

Comparisons of correlation vectors. (A and B) Distributions of angles/correlations (A) and Euclidean distances (B) between $v_i$ calculated with all genes within and between ensembles. (C and D) Decrease in the angle (C) and distance (D) between $v_i^{(n)}$ and $v_i^{(c)}$ with the gradual and systematic removal of genes whose $v_i$ are most distinct between cancer and noncancer.

On the contrary, the similarity within cancer and the differences between cancer and noncancer subensembles suggest the possibility that there is a large-scale system-level difference in gene-gene correlations differentiating cancer and noncancer phenotypes.¶ Very similar points may be made by use of Euclidean distances between the correlation vectors, rather than the angles. Results are presented in Fig. 3B. It is interesting to explore how widely distributed throughout the system these underlying differences between cancer and noncancer really are. That is, we seek to understand how many genes contribute to the large distance and angle between $v_i^{(c)}$ and $v_i^{(n)}$. One way to approach this question is to remove genes for which $v_i^{(c)}$ and $v_i^{(n)}$ are most different (and thereby potentially most relevant in defining cancer-noncancer differences) and recalculate the correlation vectors for the remaining genes until the angles and distances approach the ones for vectors coming from the same ensemble (Fig. 3 C and D). Such systematic removal of genes leads to a surprising conclusion. Distinctions between $v_i^{(c)}$ and $v_i^{(n)}$ persist until the subensembles contain only several thousand genes. Evidently, a large number of genes contribute to the angle and the distance between $v_i^{(c)}$ and $v_i^{(n)}$. Therefore, at a macroscopic level the differences between cancer and noncancer are system (and genome) wide. This conclusion is also supported by the high participation ratio‖ of the difference vectors, $d_i \equiv v_i^{(n)} - v_i^{(c)}$: 6,819 ± 2,044 compared with 122 ± 2 for the null model. (The null model is for completely randomized expression data.)

There is an interesting subtlety here: despite the observation of reproducible system-wide macroscopic differences, the correlations of some genes (from the long tails of the distributions of $\rho(s_1,s_2)$ and $D(s_1,s_2)$) consistently account for the largest differences between cancer and noncancer. Again, it is found that certain biological functions and processes are overrepresented in these sets of genes. Such overrepresented processes (the probability of observing such enrichment for the corresponding GO terms by chance alone is smaller than $10^{-18}$) include apoptosis, development, generation of precursor metabolites and energy, protein synthesis, regulatory processes, and biopolymer/macromolecular metabolic process. Highlighted groups of genes include caspases and many members of the Bcl-2, Ras, MAPK, and TNF families. The up- and down-regulation of these genes has been shown to affect strongly the proliferation and survival of cancer cells, and now we demonstrate cancer-specific changes in their combinatorial regulation that are independent of their mean levels (average up- or down-regulation) in the cancer ensemble. In addition, we find very significant changes in the combinatorial regulation of ribosomal genes (of the 47 genes with the largest $D_i(s_1,s_2)$, 17 correspond to ribosomal proteins) and enzymes from the central biosynthetic and energy-generating metabolism that likely reflect the high energy and biosynthetic demands of aggressively proliferating cancers. Among the metabolic enzymes, 42 are dehydrogenases, which is a very likely indication of dysregulation of the redox state of cancer cells. This conclusion is bolstered by the fact that among the genes with $D_i(s_1,s_2) > 30$, 12 genes are cytochrome c oxidases and reductases. Such changes in the combinatorial regulation of key metabolic and oxidation/reduction enzymes likely point to the molecular origins of aerobic glycolysis (the so-called Warburg effect), which is one of the hallmarks of cancer (15). For a full list of the genes and the overrepresented GO terms, see Dataset S1.

From the distributions in Fig. 3 we see that most genes participate a little in the differences between cancer and noncancer, with a few genes contributing the most to that difference. To explore further the distribution of participation, let us define $\tilde{j}_i$ and $\breve{j}_i$ as the genes (that is, the "dimensions") along which $v_i^{(c)}$ and $v_i^{(n)}$ are respectively most** positively and negatively separated when the 2 vectors are plotted in the space of genes. Then, considering the 2 ensembles, the frequency of $\tilde{j}$ is equal to the number of $v_i^{(c)} - v_i^{(n)}$ pairs that are farthest apart along the dimension of $\tilde{j}$. We plot the distributions of $\tilde{j}_i$ and $\breve{j}_i$ in Fig. 4, and these show quantitatively that a few genes maximize the separation between $v_i^{(c)}$ and $v_i^{(n)}$ for many genes, whereas many genes maximize the separation between $v_i^{(c)}$ and $v_i^{(n)}$ for a few genes. A small number of genes are very frequent (and therefore often participate in the largest difference between correlation vectors in cancer and noncancer), whereas most genes (79%) have zero frequencies (Fig. 4). For reasons not yet clear to us, the frequency distributions of $\tilde{j}$ and $\breve{j}$ appear to follow power laws with exponents of −2.1 and −2.4, respectively (Fig. 4). In other words, if we think of the most distinct correlations between cancer and noncancer as edges (links) in a functional network and the genes as vertices (nodes), the degree distribution of this network follows an almost perfect power law with a relatively limited dynamical range.
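The counting procedure behind Fig. 4 can be sketched as follows. This sketch rests on assumptions: the precise definition of "most separated" is given in a footnote that is not reproduced here, so we take the componentwise difference $v_i^{(c)} - v_i^{(n)}$ and use its largest and smallest components. The function name and the toy matrices are hypothetical.

```python
from collections import Counter

def separation_frequencies(corr_cancer, corr_noncancer):
    """Count how often each gene is the dimension of maximal separation.

    Row i of each matrix is the correlation vector of gene i in that ensemble.
    Assumption: separation along dimension j is v_c[j] - v_n[j].
    """
    most_positive = Counter()
    most_negative = Counter()
    for v_c, v_n in zip(corr_cancer, corr_noncancer):
        diff = [c - n for c, n in zip(v_c, v_n)]
        dims = range(len(diff))
        most_positive[max(dims, key=diff.__getitem__)] += 1
        most_negative[min(dims, key=diff.__getitem__)] += 1
    return most_positive, most_negative

# Hypothetical 3-gene correlation matrices for the two ensembles.
corr_c = [[1.0, 0.9, 0.1],
          [0.9, 1.0, 0.2],
          [0.1, 0.2, 1.0]]
corr_n = [[1.0, 0.1, 0.3],
          [0.1, 1.0, 0.8],
          [0.3, 0.8, 1.0]]

pos_freq, neg_freq = separation_frequencies(corr_c, corr_n)
```

A histogram of the counter values over all genes would give the frequency distributions whose tails are described above; genes absent from a counter have zero frequency.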

Fig. 4.

Frequency of genes maximizing the separation between vi(n) and vi(c).

So far we have analyzed the differences between $v_i^{(c)}$ and $v_i^{(n)}$ (the correlation vectors corresponding to the same gene in different ensembles) by comparing the gene-gene correlations and their vectors in cancer and noncancer. However, this tells us little about the underlying cooperative biological processes in cells. To identify such cooperative units, we now seek to select sets of genes that are strongly coupled to each other, and that are therefore expected to operate in a coherent manner. Thus, we seek to group correlation vectors into clusters (we used hierarchical clustering based on group average Euclidean distance between correlation vectors) for each of the 2 ensembles (Fig. 5). The arrays have as their axes the full list of gene array probes, and the strength of the correlation between genes is illustrated in colors (red, high positive; green, high negative). The gene array probe labels are then permuted so that gene correlation vectors that are most similar to each other are placed adjacent to each other. The outcome is coherent patches of gene array labels corresponding to groups or clusters of genes that are correlated to all other genes in a rather similar manner to each other (Fig. 5 A and B). Most clusters of correlation vectors (cones of closely grouped vectors) are large, well-defined, and reproducible between bootstrap subensembles of an ensemble.
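Group-average hierarchical clustering of correlation vectors can be sketched with a small, deliberately naive implementation. The function name, the stopping rule (a fixed target number of clusters), and the toy vectors are assumptions made for illustration. The article does not state how the dendrogram was cut, and a dataset of realistic size would need an optimized library routine.

```python
import math

def average_linkage(vectors, n_clusters):
    """Agglomerative clustering with group-average Euclidean distance.

    Repeatedly merges the two clusters whose members are, on average,
    closest to each other until n_clusters remain.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        total = sum(math.dist(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-component correlation vectors for 5 genes.
toy_vectors = [
    [0.90, 0.80],
    [0.80, 0.90],
    [-0.70, -0.80],
    [-0.80, -0.70],
    [0.85, 0.85],
]
groups = average_linkage(toy_vectors, 2)
```

Reordering the probe labels so that members of each group are adjacent would produce the block structure visible in a clustered correlation matrix.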

Fig. 5.

Clusters of correlation vectors. (A) Clustered correlation matrix for noncancer. (B) Clustered correlation matrix for cancer. (C) Overlap among clusters of correlation vectors between cancer and noncancer.

We now compare the composition of the clusters in cancer and noncancer to further understand the biologically cooperating modules that distinguish these 2 ensembles. Each cluster in each ensemble was assigned an N-dimensional indicator vector, each element corresponding to a gene probe on the array: if the x cluster contains the ith gene, $x_i = 1$; otherwise $x_i = 0$. We now consider the overlap between the indicator vectors (x and y) for 2 clusters in the 2 ensembles by calculating the Pearson correlation between their corresponding vectors, $\rho_{ovp}$. (The overlap estimated by the ratio ($r_{xy}$) of common to total genes in clusters x and y results in a similar estimate, $r_{xy} = (x \cdot y)/\sum_i (x_i + y_i)$.) Of the 50 best-defined clusters, 49 are nonoverlapping (contain different genes) in cancer and noncancer (Fig. 5C). Only the largest cluster ℂ (the red square in Fig. 5C) contains many genes common to both cancer and noncancer subensembles, ≈4,200 common genes out of ≈5,800 genes for each cluster. All other clusters in cancer and noncancer have rather small overlap (Fig. 5C), suggesting that the cooperating clusters of genes in the 2 ensembles are very different.
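Both overlap measures can be illustrated on toy clusters. The helper names and the example clusters are hypothetical. The ratio is implemented exactly as printed, with the summed sizes of both clusters in the denominator, so two identical clusters give 0.5 rather than 1 under this reading.

```python
import math

def indicator(cluster, n_probes):
    """N-dimensional 0/1 vector marking which probes belong to the cluster."""
    members = set(cluster)
    return [1 if i in members else 0 for i in range(n_probes)]

def overlap_pearson(x, y):
    """Pearson correlation between two indicator vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def overlap_ratio(x, y):
    """r_xy = (x . y) / sum_i (x_i + y_i), as printed in the text."""
    common = sum(a * b for a, b in zip(x, y))
    return common / sum(a + b for a, b in zip(x, y))

# Hypothetical clusters on an array of 10 probes.
x = indicator([0, 1, 2, 3], 10)   # cluster from one ensemble
y = indicator([2, 3, 4, 5], 10)   # cluster from the other ensemble

rho_ovp = overlap_pearson(x, y)
r_xy = overlap_ratio(x, y)
```

Evaluating either measure for every pair of clusters drawn from the two ensembles yields an overlap matrix of the kind summarized in Fig. 5C.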

We investigated further the origin of the intriguing ℂ cluster, the only cluster conserved between cancer and noncancer and whose presence is also manifested in the smaller peak of the distribution of $v_i$ projections on the main body diagonal (Fig. 2C). One possibility is that the genes in ℂ have very similar correlation vectors in the 2 ensembles. Another possibility is that the correlation vectors of these genes are different between ensembles but consistently similar within an ensemble. To distinguish between those 2 possibilities, we compared the distributions of correlation angles $\rho_i(s_1,s_2)$ and the Euclidean distances $D_i(s_1,s_2)$ for ℂ genes and for genes outside of the ℂ cluster. We find that for ℂ genes $v_i^{(c)}$ and $v_i^{(n)}$ are separated by only slightly smaller angles and distances than for genes not belonging to ℂ. Therefore, the genes in ℂ form a remarkably conserved module present both in cancer and in noncancer but correlated very differently to the rest of the genes.

As before, it is interesting to ask which GO term functions are associated with ℂ genes more frequently than expected by chance. Among the most overrepresented functions are development, cell–cell signaling, second messenger signaling, cell differentiation, and regulation (see Dataset S1). The genes corresponding to these functions are coregulated both in cancer and in noncancer cells. Still, even though these genes preserve their cohesiveness as a module, their $v_i$ are coupled differently to genes that do not belong to ℂ. This finding lends further support to the hypothesis that the proliferative and migratory phenotypes associated with cancer result from distinct regulation (as reflected by the cancer-specific state of ℂ genes associated with development and regulation) rather than random mutations alone.

Up to now, we have reported similarities within each ensemble and differences between the 2 ensembles (cancer and noncancer) at many levels, from distributions of pairwise correlations to clusters of $v_i$. It is interesting to ask whether we can find such distinctive features between the 2 ensembles by using more conventional methods. For example, how do our results on clustering $v_i$ compare to clustering directly all (M = 4,751) gene expression datasets? To address this question, we tried to group cell types and physiological conditions by applying the same agglomerative hierarchical clustering algorithms to the expression datasets (rather than the correlation vectors). Each cluster identified by the agglomerative clustering algorithm contained comparable fractions of cancer and noncancer datasets. Thus, clustering the physiological conditions based on their gene expression levels fails to distinguish cancer from noncancer reliably. This result is consistent with previous reports (5) and in stark contrast to the reproducible clustering of the cancer correlation vectors. Therefore, the phenomena we have observed are not merely due to the unusually large dataset we have analyzed. Rather, the clustering of correlation vectors outperforms the clustering of gene expression data as a means of identifying distinctive cancer signatures.


One possible explanation of these observations is as follows. Let us conceive, for the moment, of the cell as a nonlinear dynamical system whose variables $\vec{x}$ are the concentrations of biomolecules (including mRNAs) and whose global attractors (or macroscopic basins) correspond to different differentiated states or cell types. Each cell type thus represents a distinct (kth) macroscopic state characterized by a basin-specific set of gene-gene correlations, $\langle x_i x_j \rangle_k$. This conjecture is supported by the clustering of gene expression data (5, 3, 10). Evidently, extended regions of these macroscopic basins could have common features, with differences manifested only in smaller numbers of directions in the high-dimensional space. We find strong evidence for such common features between basins, as many $\langle x_i x_j \rangle_k$ correlations appear to be shared between different cell types. Because we calculate correlations by averaging gene expression levels across cell types, the correlations analyzed in this article are consequently a superposition (weighted by the number of datasets, $n_k$, from the kth basin) of all $\langle x_i x_j \rangle_k$ correlations††: $\langle x_i x_j \rangle = (\sum_k n_k \langle x_i x_j \rangle_k)/\sum_k n_k$. Therefore, if $\langle x_i x_j \rangle_k$ changed sign and magnitude between basins, $\langle x_i x_j \rangle$ would be rather small. This outcome is in stark contrast with the large number of strong pairwise correlations in Fig. 1, reflected also in the large magnitudes of the correlation vectors (Fig. 2); these strong correlations must have the same sign and sufficiently large magnitudes in most tissue types. It therefore seems likely that the strong $\langle x_i x_j \rangle$ correlations arise from those gene-gene correlations, and the regions of dynamical space in which they hold, that are conserved across macroscopic states.
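The cancellation argument above can be checked with simple arithmetic. In this sketch the function name and all numerical values are hypothetical; it only evaluates the weighted superposition formula for a case where the basin-specific values flip sign and a case where they are conserved.

```python
def pooled_moment(basin_moments, basin_counts):
    """<x_i x_j> = sum_k n_k <x_i x_j>_k / sum_k n_k."""
    weighted = sum(n * m for n, m in zip(basin_counts, basin_moments))
    return weighted / sum(basin_counts)

# Hypothetical basin-specific values for one gene pair.
# Case 1: the sign flips between two equally sampled basins.
flipping = pooled_moment([0.8, -0.8], [100, 100])

# Case 2: the value is conserved (same sign, similar size) in three basins.
conserved = pooled_moment([0.8, 0.7, 0.9], [100, 50, 50])
```

The first case averages to zero while the second stays near 0.8, which is why strong pooled correlations point to gene pairs whose coupling is shared across tissue types.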

It is noteworthy that our analysis identified oncogenes that are frequently overexpressed in many cancers and whose overexpression triggers or enhances tumorigenesis in animal cancer models. Yet, our findings are not only a reiteration of established knowledge; rather, our findings extend and complement the role of mere overexpression of those oncogenes by revealing cancer-specific changes in their couplings and correlation patterns to the whole genome. For example, if the only cancer-related abnormality of Ras members were their overexpression (increased mean level of expression in cancers), their Pearson correlations to the rest of the genes would not change, because the Pearson correlation is not influenced by the mean of the correlated variables (the mean is subtracted). Thus, the change in couplings and the correlation pattern reveal different regulation rather than simply overexpression. More specifically, we find that in cancer some genes (including many growth factor receptors and tumor necrosis factor receptors) are significantly less coupled (compared with noncancer) to all other genes, as indicated by the long tail of $\Delta C$ toward high values (Fig. 2B). Furthermore, not only is the overall strength of the couplings different but also the pattern of correlations is altered very significantly, as demonstrated by the large angles between the cancer and noncancer correlation vectors of many genes, including members of the Bcl-2 and Ras families. These findings point to cancer-specific regulatory programs for oncogenes like RAS. Such programs may vary significantly across different cell and cancer types, but they clearly share much in common, as demonstrated by the collinearity of correlation vectors in different cancer subensembles. Our analysis provides the stepping stones to understanding the cancer-specific regulatory programs by revealing their characteristic correlation patterns.
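The statement that overexpression alone leaves Pearson correlations unchanged can be verified numerically. The gene profiles below are hypothetical, and overexpression is modeled as a constant additive shift of the mean, which is an assumption made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation with the means subtracted explicitly."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

# Hypothetical expression levels across 5 datasets.
oncogene = [1.0, 3.0, 2.0, 5.0, 4.0]
partner = [2.0, 1.0, 4.0, 3.0, 5.0]

# Overexpression modeled as a higher mean with unchanged fluctuations.
overexpressed = [level + 10.0 for level in oncogene]

# A changed covariation pattern at the original mean level.
rewired = [5.0, 4.0, 1.0, 3.0, 2.0]

rho_normal = pearson(oncogene, partner)
rho_over = pearson(overexpressed, partner)
rho_rewired = pearson(rewired, partner)
```

Here `rho_over` matches `rho_normal`, whereas `rho_rewired` differs even though `rewired` has the same mean as `oncogene`, so a change in a correlation vector signals altered coupling rather than altered abundance.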

The described correlation vector analysis can be generalized to identify common features among different physiological conditions and tissue types. This approach is particularly suitable for integrating and analyzing large datasets, exploring common topological structure in different basins of attraction of the cellular network, and emphasizing distinct topological structures of correlations. The main strength of our approach is in characterizing the macroscopic states of the cellular network and thus paving the way for more in-depth microscopic characterization of the attractor states and dynamics of living cells.

We may speculate that there would be practical applications of the ideas discussed in this article. For example, one important result is that the differences between cancer and noncancer are system wide. The implications could be significant. Conventional anti-cancer therapies target 1 or a few biomolecules, and thereby may affect only limited parts of the system. Such therapies can be successful if the gene regulatory network has paths of directed edges (biochemical reactions and regulatory interactions) from the targeted molecules to all other genes whose levels and correlations have to change for transitioning from one basin to another. If such paths do not exist and cancers are indeed separate attractor basins, however, one suspects that it would be necessary to push the regulatory network away from a cancerous basin through a high-dimensional separatrix toward a healthy, nonproliferative basin. In turn this casts doubt on the efficacy of drugs affecting limited targets, and points more toward therapies that target highly specific groups of genes in a cooperative manner that can restore the system to normal functioning. The key to triggering such transitions may then be the identification of the phase-space trajectories that are most suitable to take the network away from cancer toward its normal, noncancerous basin. As a first step in this direction, we have identified the genes that contribute the most to the macroscopic differences between cancer and noncancer: those are the genes whose couplings decrease the most (Fig. 2B and Dataset S1) and whose correlation vectors differ the most in cancer and noncancer (Fig. 3 and Dataset S1). Future work should identify the regulatory mechanisms at the microscopic level, and thus provide the mechanistic understanding for rational cancer therapies.

Materials and Methods

Data Sampling and Bootstrapping.

All datasets (4,751) were downloaded as raw data (Affymetrix CEL files) from the GEO of NCBI and converted into mRNA levels using the Affymetrix MAS5 algorithm. Datasets were classified as cancers if their source description contained any of the words: neuroblastoma, pheochromocytoma, adenocarcinoma, leukemia, sarcoma, myeloma, melanoma, hepatoma, carcinoma, lymphoma, cancer, and tumor. The remaining datasets (2,512) were classified as noncancers. Orthogonal bootstrap subensembles (samples) were assembled by choosing datasets (with equal probability) without replacement. This method has the advantage of not including a dataset in 2 independent bootstrap samples, but it allows only a limited number of resamplings for a given sample size. To overcome this limitation, we also resampled datasets (again with equal probability) with replacement, thus allowing for an unlimited number of resamplings at the expense of some overlap between subensembles.
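The classification and the two sampling schemes can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the keyword match is assumed to be a case-insensitive substring test, and sampling "with replacement" is assumed to mean that every draw is independent, so a dataset may also repeat within a single subensemble.

```python
import random

# Keywords listed in the text for labeling a dataset as cancer.
CANCER_WORDS = (
    "neuroblastoma", "pheochromocytoma", "adenocarcinoma", "leukemia",
    "sarcoma", "myeloma", "melanoma", "hepatoma", "carcinoma",
    "lymphoma", "cancer", "tumor",
)


def is_cancer(description):
    """True if the source description contains any keyword.

    Assumption: case-insensitive substring matching.
    """
    text = description.lower()
    return any(word in text for word in CANCER_WORDS)


def orthogonal_subensembles(items, size, rng):
    """Disjoint samples: shuffle once, then cut into chunks of `size`.

    No item appears in two samples, so at most len(items) // size
    samples are available.
    """
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_samples = len(shuffled) // size
    return [shuffled[i * size:(i + 1) * size] for i in range(n_samples)]


def resampled_subensembles(items, size, n_samples, rng):
    """Samples drawn with replacement: any number of samples, but
    samples may overlap (and, under the assumption above, an item
    may repeat within a sample)."""
    pool = list(items)
    return [rng.choices(pool, k=size) for _ in range(n_samples)]
```

The disjoint scheme trades the number of available samples for independence between them; the scheme with replacement does the opposite.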

Reproducibility and Cross Correlations.

The reproducibility of distributions is quantified by the standard deviations (plotted as error bars) of the distribution frequencies. The standard deviation $\sigma_u$ for the frequency $u$ is calculated across the bootstrap subensembles, $\sigma_u \equiv \sqrt{\langle (u - \langle u \rangle)^2 \rangle}$. Although $\sigma_u$ measures the reproducibility of distributions, it does not quantify the reproducibility of the results for individual genes. To quantify how similar the results for the $i$th gene are across all bootstrap subensembles, we used cross correlations, $\bar{\rho}_c$, computed from the covariances and standard deviations of pairs of result vectors. Here, $n$ is the number of bootstrap subensembles, and $R_k$ and $R_l$ are the vectors with results for all genes from the $k$th and the $l$th bootstrap subensembles, $R_k, R_l \in \mathbb{R}^N$. The averaging in computing the covariances and the standard deviations is across all ($N$ = 22,283) gene probes on the arrays.


We thank Mario A. Blanco for insightful discussions and useful advice. This work was supported by Irish Research Council for Science Engineering and Technology, Science Foundation Ireland Research Frontiers Programme (Arrested Matter), and European Union Marie Curie Research Training Network Grant MRTN-CT-2003-504712.


1To whom correspondence should be addressed. E-mail: nslavov{at}

Author contributions: N.S. and K.A.D. designed research; N.S. performed research; N.S. and K.A.D. contributed new reagents/analytic tools; N.S. and K.A.D. analyzed data; and N.S. and K.A.D. wrote the paper.

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

This article contains supporting information online at

↵* Even though several microarray probes can correspond to one gene, we use gene and gene probe interchangeably. Because the subensembles used in this article contain hundreds of expression datasets, all genes had sufficiently large mean and variance for computing meaningful correlations. Using only the genes having variance above the median variance gives very similar results.

↵† To increase statistical confidence, we use bootstrapping throughout the article. Bootstrapping is a statistical technique based on multiple resamplings and recalculations of the quantities of interest to test convergence and establish confidence intervals. For more details, see Materials and Methods.

↵‡ A trivial explanation of those differences between cancer and noncancer (and of all the differences described in the following analysis) could be systematic experimental errors (such as batch effects) that are very common in one ensemble and much rarer or absent in the other. Given the large number of different experimental groups contributing the experimental Affymetrix data, however, such ensemble-specific biases are very unlikely, which is a noteworthy advantage of our analysis. Furthermore, any systematic errors and biases (if present) in the data of experimental groups contributing both cancer and noncancer datasets are likely to be found in both ensembles rather than be ensemble specific.

↵§ For many GO terms, the probability of observing such overrepresentation by chance alone is $<10^{-10}$. This estimate is Bonferroni corrected for multiple hypothesis testing and based on the hypergeometric distribution.
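A hypergeometric enrichment test with Bonferroni correction can be sketched as below. This is a generic illustration of the named technique, not the authors' pipeline: the function names and parameters (N genes in total, K annotated with the GO term, n genes selected, k of them annotated) are assumptions for the example.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k): chance of seeing at least k annotated genes when n
    genes are drawn without replacement from N genes, K of which
    carry the annotation."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def bonferroni(p_value, n_tests):
    """Bonferroni correction: multiply by the number of hypotheses
    tested and cap the result at 1."""
    return min(1.0, p_value * n_tests)
```

Exact integer binomials avoid floating-point underflow in the tail sum; only the final ratio is converted to a float.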

↵¶ We can again establish convergence of the quantities by studying subensembles (s1 and s2) containing different expression datasets only from cancer or only from noncancer. The errors implied by the subensemble fluctuations are reflected in the error bars in Fig. 3A. The reproducibility of the results for individual genes is also excellent, $\bar{\rho}_c = 0.91 \pm 0.03$.

↵‖ The number of principal components ($N_{pc}$) of a vector $\vec{x}$ is the inverse of the vector's inverse participation ratio (IPR), $N_{pc} = 1/\mathrm{IPR}$, with $\mathrm{IPR} = \frac{1}{(\vec{x} \cdot \vec{x})^2} \sum_i x_i^4$.
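The definition above translates directly into a few lines of code. The function name is hypothetical; the formula is the one stated in the footnote.

```python
def n_principal_components(x):
    """N_pc = 1 / IPR, with IPR = sum_i x_i^4 / (x . x)^2."""
    norm_sq = sum(v * v for v in x)
    ipr = sum(v ** 4 for v in x) / norm_sq ** 2
    return 1.0 / ipr
```

A vector with a single nonzero entry gives a value of 1, and a vector with m entries of equal magnitude gives m, so the quantity counts how many components carry appreciable weight.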

↵** Only the maximum distance was used for defining $\tilde{j}_i$ and $\breve{j}_i$, because the lower ranks (e.g., the next gene along which $v_i^{(c)}$ and $v_i^{(n)}$ are farthest apart) are less reproducible between bootstrap subensembles.

↵†† This expression is exactly correct only for nonnormalized correlations and has to be corrected with the standard deviations and the means to hold for the Pearson correlations used in the article. However, the overall trend and significance are likely to be the same.

