Impossibility of successful classification when useful features are rare and weak


Communicated by David L. Donoho, Stanford University, Stanford, CA, April 9, 2009 (received for review February 21, 2009)


Abstract

We study a two-class classification problem with a large number of features, out of which many are useless and only a few are useful, but we do not know which ones they are. The number of features is large compared with the number of training observations. Calibrating the model with 4 key parameters (the number of features, the size of the training sample, the fraction of useful features, and the strength of useful features), we identify a region in parameter space where no trained classifier can reliably separate the two classes on fresh data. The complement of this region, where successful classification is possible, is also briefly discussed.

higher criticism | phase diagram | region of impossibility | region of possibility | threshold feature selection

An overwhelming trend in modern research activity is the tendency to gather very large databases and use them to search for good data-based classifier rules. For example, many research teams in the medical sciences currently gather and study gene expression microarray data in hopes of obtaining empirical rules that separate healthy patients from those affected by a disease, thus allowing for automatic diagnosis.

Much of the recent surge of enthusiasm for such studies stems from the advent of high-throughput methods that automatically make measurements of a very large number of features on each subject. In genomics, proteomics, and metabolomics it is now common to take several thousand automatic measurements per study subject. The opportunity to survey so many features at once is thought to be valuable: optimists will say that "surely somewhere among these many features will be a few useful ones allowing for successful classification!"

Advocates of the optimistic viewpoint must contend with the growing awareness in at least some fields that many published associations fail to replicate; i.e., the published classification rules simply do not work when applied to fresh data. Such failure has been the focus of meetings and special publications (1). Although there may be many reasons for failure to replicate (2, 3), we focus here on one specific cause: there may simply be too many useless features being produced by high-throughput devices, so that, even where there really are decisive features to be found in the high-throughput measurements, they simply cannot be reliably identified.

In fact, we establish in this article a specific "region of impossibility" for feature selection in classifier design. We identify settings with large numbers of measurements, some useful, some useless, where the subset of useful measurements, if only it were known a priori, would allow for training of a successful classifier; however, when the subset of useful features is not known, we show that no classifier-training procedure can be effective.

Specifically, we study a model problem introduced in refs. 4 and 5 where there are a large number of features, many of which are useless and a few of which are useful. In this model we consider a two-class classification problem where there are parameters controlling the fraction of useful features, the strength of the useful features, and the ratio between the number of observational units (e.g., patients) and the number of measured features (e.g., gene expression measurements). We identify a region in parameter space where, with prior knowledge of at least some useful features, success is possible, but absent such prior knowledge about the subset of useful features, no classifier built from the dataset is likely to separate the two classes on fresh data.

In companion work (5), we show that in the complement of this region a specific method for classifier training, Higher Criticism threshold feature selection (4), does work, and so the results here are definitive.

Classification When Features Are Rare and Weak

Consider a two-class classification setting where we have a set of labeled training samples (Yi,Xi), i = 1,2,…,n. Each label Yi = 1 if Xi comes from class 1 and Yi = −1 if Xi comes from class 2, and each feature vector Xi ∈ Rp. For simplicity, we suppose that the training set contains equal numbers of samples from each of the two classes, and that the feature vectors obey Xi ∼ N(Yiμ, Ip), i = 1,2,…,n, for an unknown mean contrast vector μ ∈ Rp. Also, we suppose that the feature covariance matrix is the identity matrix. Extension to correlated cases is possible if side information about the feature covariance is available (see ref. 6, for example).

Following the two companion papers (4, 5), we consider the following rare/weak feature model (RW model), in which the vector μ is nonzero in only an ɛ fraction of coordinates, and the nonzero coordinates of μ share a common amplitude μ0. Formally speaking, let I1,I2,…,Ip be samples from Bernoulli(ɛ), and let

μ(j) = μ0 · Ij, j = 1,2,…,p.

Let Z denote the vector of z-scores corresponding to the training set: Z(j) = (1/√n) ∑_{i=1}^{n} Yi · Xi(j). The jth z-score arises in a formal normal-theory test of whether the jth feature is useless or useful. Under our assumptions, Z ∼ N(√n μ, Ip); thus each coordinate of Z has expectation either 0 or τ = √n μ0.

We assume p ≫ n, ɛ is small, and τ is either small or moderately large (e.g., p = 10,000, n = 100, ɛ = 0.01, τ = 2). Because zero coordinates of μ are entirely noninformative for classification, the useful features are those with nonzero coordinates in μ. The parameters ɛ and τ can be set to make such useful features arbitrarily rare (by setting ɛ close to 0) and weak (by setting τ small); we denote an instance of the rare/weak model by RW(ɛ,τ;n,p).
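
To make the setup concrete, here is a minimal simulation sketch of the RW model and the training-set z-scores; the function name and the default values are ours and purely illustrative.

```python
import numpy as np

def simulate_rw(p=10000, n=100, eps=0.01, tau=2.0, rng=None):
    """Draw a training set from RW(eps, tau; n, p) and compute the z-score vector Z."""
    rng = np.random.default_rng(rng)
    mu0 = tau / np.sqrt(n)                   # per-feature strength, since tau = sqrt(n) * mu0
    useful = rng.random(p) < eps             # I_j ~ Bernoulli(eps)
    mu = mu0 * useful                        # mu(j) = mu0 * I_j
    Y = np.repeat([1, -1], n // 2)           # balanced class labels
    X = Y[:, None] * mu[None, :] + rng.standard_normal((n, p))   # X_i ~ N(Y_i * mu, I_p)
    Z = (Y[:, None] * X).sum(axis=0) / np.sqrt(n)                # Z(j) = n^(-1/2) * sum_i Y_i X_i(j)
    return X, Y, Z, mu

X, Y, Z, mu = simulate_rw()
print("mean |Z| over useful features:", np.abs(Z[mu > 0]).mean())
print("mean |Z| over useless features:", np.abs(Z[mu == 0]).mean())
```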

Formally, our goal is to use the training data to design a classifier for use on fresh data. If we are given a new unlabeled feature vector X, we must then label it with a class prediction, i.e., attach a label Ŷ = 1 or Ŷ = −1. We hope that our predicted label Ŷ is typically correct. The central problem is: for which combinations (ɛ,τ,n,p) is it possible to train a classifier that labels accurately, and for which combinations is it not possible to do so?

Linking Rarity and Weakness to Number of Features.

We now adopt an asymptotic viewpoint. We let the number of features p be the driving problem-size descriptor, and for the purposes of calculation, we let p tend to infinity and let the other quantities vary with p. We have checked that our asymptotic calculations are descriptive of actual classifier performance in realistic finite-sized problems, say with p in the few thousands, as is now common in genomics and proteomics. Other problem parameters (the fraction and strength of useful features, the sample size n) will depend on p as follows. Fixing parameters (β,r) ∈ (0,1)², let

ɛp = p^−β, τp = √(2r log p).

As p → ∞, the useful features become increasingly rare; they make up an asymptotically negligible fraction of the components of the vector Z. The parameters (β,r) describe the linkage between rareness and weakness of the entries in the parameter vector; they have been used before in classification studies (5) and more generally in detection studies (7–9). The domain (β,r) ∈ (0,1)² has been shown in earlier work to have an interesting two-phase structure; one can show there is a curve such that certain procedures succeed asymptotically when (β,r) lies above the curve and fail when (β,r) lies below the curve. We call a depiction of this domain and its phases a phase diagram. In this article, we exhibit a phase diagram such that, in the failure phase, every sequence of classification rules must fail for large p.
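
As a small illustration of this calibration (our own sketch, not from the article), one can tabulate how rare and weak the useful features become for given (β, r, p):

```python
import math

def arw_calibration(beta, r, p):
    """Rare/weak calibration: eps_p = p**(-beta), tau_p = sqrt(2 * r * log(p))."""
    eps_p = p ** (-beta)
    tau_p = math.sqrt(2.0 * r * math.log(p))
    return eps_p, tau_p, p * eps_p    # fraction useful, strength, expected number of useful features

for p in (10**4, 10**5, 10**6):
    print(p, arw_calibration(beta=0.65, r=0.15, p=p))
```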

Linking Number of Observations to Number of Features.

In classical statistical theory, one held p fixed and let n increase indefinitely. However, in modern scientific practice it seems the reverse is happening: one forms the impression that n stays fixed or grows very weakly while p grows dramatically (as high-throughput devices measure ever more features).

The phase diagram depends on the relationship between the number of features p and the number of study units n. Again, in our work it is convenient to make p the driving variable, and so we write n = np.

We can identify three regimes for the linkage between n and p: n = np can have no growth, slow growth, or regular growth. Our labels for these regimes and their definitions are listed in Table 1.

Table 1. Regimes and their definitions

The case of slow growth in our setting was previously studied in ref. 5; there, the focus was on the performance of a specific classifier-training procedure. Here, we study several types of linkage between n and p, and we also briefly discuss the case where n grows irregularly (see below). We are interested in limits that all classifier-training procedures must obey.

Asymptotic Rare/Weak Model (ARW).

Combining the two linkages we have just discussed gives us the asymptotic rare/weak model ARW(β,r,np). For each linkage type np we seek to identify ranges of (β,r) where successful classification is possible and impossible, respectively.

Impossibility of Classification.

We will show that in each of the three growth regimes there is a curve r = ρ*(β) (* = N,S,R) which partitions the β − r plane into two components: a region of impossibility below the curve and a region of possibility above it. In detail, define the standard phase boundary function

ρ(β) = 0 for 0 < β ≤ 1/2, ρ(β) = β − 1/2 for 1/2 < β ≤ 3/4, ρ(β) = (1 − √(1 − β))² for 3/4 < β < 1. [1]

The function ρ has appeared before in determining phase boundaries in a seemingly unrelated problem of multiple hypothesis testing (7–9). Define

ρN(β) = [n/(n+1)] ρ(β), ρS(β) = ρ(β), ρR(β) = (1 − θ) ρ(β/(1 − θ)), 0 < β < 1 − θ.

See Fig. 1. Note that in the definition of ρR(β), we limit β to the range (0, 1 − θ). Note also that, for all three cases,

ρ*(β) ≤ β, 0 < β < 1. [2]
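
A small numerical sketch (ours, for illustration) of these boundaries and of the corresponding critical strength τ* = √(2ρ*(β) log p) used in the numerical examples below:

```python
import math

def rho(beta):
    """Standard phase boundary function rho(beta), 0 < beta < 1 (form of Eq. 1 above)."""
    if beta <= 0.5:
        return 0.0
    if beta <= 0.75:
        return beta - 0.5
    return (1.0 - math.sqrt(1.0 - beta)) ** 2

def rho_star(beta, regime, n=None, theta=None):
    """Boundary for no growth ('N', n fixed), slow growth ('S'), and regular growth ('R', n = p**theta)."""
    if regime == 'N':
        return n / (n + 1.0) * rho(beta)
    if regime == 'S':
        return rho(beta)
    if regime == 'R':
        assert beta < 1.0 - theta, "rho_R is defined only for 0 < beta < 1 - theta"
        return (1.0 - theta) * rho(beta / (1.0 - theta))

def tau_star(beta, p, regime, **kw):
    """Critical strength tau* = sqrt(2 * rho_*(beta) * log(p))."""
    return math.sqrt(2.0 * rho_star(beta, regime, **kw) * math.log(p))

print(round(tau_star(0.65, 1e5, 'N', n=2), 2))        # no growth with n = 2
print(round(tau_star(0.65, 1e5, 'S'), 2))             # slow growth
print(round(tau_star(0.55, 1e5, 'R', theta=0.4), 2))  # regular growth with theta = 0.4
```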

Fig. 1. Display of r = ρ*(β), the boundary separating the region of impossibility from the region of possibility, for the three types of growth of n: no growth (blue), slow growth (red), and regular growth (green). Also included is the diagonal line (magenta dashed) that illustrates the relationship in Eq. 2.

Definition 1: Fix (β,r) ∈ (0,1)². Fix one of the three types of growth of n by choosing * ∈ {S,N,R}. We say (β,r) falls in the region of impossibility of the ARW if r < ρ*(β).

Theorem 1: Fix a growth regime np and fix a point (β,r) in the region below the corresponding graph (β,ρ*(β)). Consider the sequence of problems ARW(β,r,np) for increasing p, and a sequence of classifier-training methods, perhaps also dependent on p. The misclassification error rate of the resulting sequence of trained classifiers tends to 1/2 with increasing p.

In this region, the measurements are effectively noninformative, and random guessing does almost as well. However, note that there are useful features among the p features, and if we only knew which features they were, we could reliably separate the classes! Indeed, simply summing the coordinates known to be useful features and taking the signum would do the trick. In this sense, the region of impossibility is precisely the region where the effect mentioned in the introduction shows up: the attempt to find the useful features among many useless ones is simply doomed. Note that Fan and Fan (10) studied a closely related setting and identified a different region of impossibility; see details therein.

Conversely, one can show that, fixing (β,r) in the region of possibility, successful classifier training is possible, and there is a sequence of trained classifiers whose misclassification probability → 0 as p → ∞. However, that is beyond the scope of this research announcement; we refer the reader to the author's related papers. See Fig. 2 for a display of the region of impossibility and the region of possibility.

Fig. 2. Display of the region of impossibility (cyan) and the region of possibility (white and yellow) in the case of slow growth [only the range β ∈ (1/2,1) is shown]. In the certainty region, it is not only possible to have successful classification, but also possible to identify nearly all useful features.

Proof of Theorem 1: To understand the role of the training data, we compare the problem of classification with Z and that without Z. When Z is not available, the test feature vector X contains ≈ pɛp useful features, each of which has a strength of ±τp/√n (with sign "+" if X is from class 1 and "−" otherwise). In this case, the classification problem reduces to the testing problem studied in our previous work (8), and it is possible to classify successfully if and only if r/n > ρ(β) (7, 8).

When Z is available to us, the picture is very different. For any 1 ≤ i ≤ p, the probability that the ith coordinate of X contains a useful feature is no longer ɛp, but instead the posterior probability ηi = η(Zi;p), where

η(z;p) = ɛp φ(z − τp) / [(1 − ɛp) φ(z) + ɛp φ(z − τp)], [3]

with φ being the density of N(0,1). This is a monotone function, ≈ 0 for small z and ≈ 1 for large z. Intuitively, a large coordinate amplitude in Z suggests a useful feature, and a moderate or small amplitude suggests a useless one. Seemingly, this is a different model from the previous case, and studying it requires a different analytic technique.
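
A short sketch of this posterior weight (our notation, not the authors' code):

```python
import numpy as np
from scipy.stats import norm

def posterior_useful(z, eps, tau):
    """eta(z) = eps*phi(z - tau) / [(1 - eps)*phi(z) + eps*phi(z - tau)]:
    the posterior probability that a coordinate is useful, given its training z-score."""
    num = eps * norm.pdf(z - tau)
    den = (1.0 - eps) * norm.pdf(z) + num
    return num / den

z = np.linspace(-2.0, 6.0, 9)
print(np.round(posterior_useful(z, eps=0.01, tau=2.0), 4))  # ~0 for small z, rising toward 1 for large z
```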

First, denote the density of N(0,Ip) by f0(p) = f0(p)(x1,x2,…,xp). Second, for k = 1,2, denote the conditional density of (X|Z) when X ∼ Class k by fk(p)(·|Z), and denote the corresponding conditional density of (X1|Z1) by fk(1)(·|Z1). Here, X1 and Z1 are the first coordinates of X and Z, respectively. Finally, for two density functions f and g, define the Hellinger affinity by H(f,g) = ∫ √(f(x)g(x)) dx. Let the (conditional) Hellinger affinity between f0(p) and f1(p) be H(f0(p),f1(p);Z,ɛp,τp,np,p), and that between f0(1) and f1(1) be H(f0(1),f1(1);Z1,ɛp,τp,np,p). We have the following lemma.

Lemma 1. Fix np and (β,r) ∈ (0,1)² in the ARW(β,r,np) model. For any classifier T = T(X,Z;p), the misclassification probability is bounded below in terms of the conditional Hellinger affinity H(f0(p),f1(p);Z,ɛp,τp,np,p) [inequality omitted].

We omit the proof of Lemma 1. Relationships between classification error rate and Hellinger affinity are well known. In this case, the added wrinkle is to condition on the training data Z. Besides that feature, the argument is standard; see ref. 11 for example.

The following lemma is elementary; we omit the proof.

Lemma 2. E[H(f0(p),f1(p);Z,ɛp,τp,np,p)] = (E[H(f0(1),f1(1);Z1,ɛp,τp,np,p)])^p.

The heart of the proof of Theorem 1 is the following lemma. Its proof is relatively long, so we leave it to later sections.

Lemma 3. Fix one of the three growth types; for any fixed parameters (β,r) in the corresponding region of impossibility of the ARW(β,r,np), E[H(f0(1),f1(1);Z1,ɛp,τp,np,p)] = 1 + o(1/p), p → ∞.

Combining these lemmas gives Theorem 1.

Extension to Cases Where n Grows Irregularly.

So far, we have considered np growing with p according to one of three specific regimes: (N), (S), (R). However, the conclusion of Theorem 1 can be obtained in a much broader range of cases, where n grows somewhat irregularly.

Lemma 4 below says that E[H(f0(p),f1(p);Z,ɛ,τ,n,p)] is a monotone function of the sample size n. This implies that, if np is eventually sandwiched between two sequences obeying one of our growth regimes, then its behavior is also sandwiched between the results for those two regular regimes.

Lemma 4. Fixing p, τ > 0, and ɛ ∈ (0,1) in the RW(ɛ,τ;n,p) model, E[H(f0(p),f1(p);Z,ɛ,τ,n,p)] is a monotone increasing function of n.

Proof: By Lemma 2, it suffices to show that E[H(f0(1),f1(1);Z1,ɛ,τ,n,p)] is a monotone increasing function of n. Note that f0(1) is the density of N(0,1), and that f1(1)(x) = (1 − η(Z1))φ(x) + η(Z1)φ(x − μp), with η(Z1) = η(Z1;p) as defined in Eq. 3, and μp = τp/√n. By direct calculations,

H(f0(1),f1(1);Z1,ɛ,τ,n,p) = E0[(1 − η(Z1) + η(Z1) e^{μpX1 − μp²/2})^{1/2}], [4]

where we have suppressed parameter dependencies and E0 denotes the expectation with respect to the law of X1 ∼ N(0,1). Observe that H(f0(1),f1(1);Z1,ɛ,τ,n,p) depends on n only through μp, and that μp is monotone decreasing in n. It is therefore sufficient to show that for any numbers η ∈ (0,1) and μ′ ≥ μ,

E0[(1 − η + η e^{μ′X1 − (μ′)²/2})^{1/2}] ≤ E0[(1 − η + η e^{μX1 − μ²/2})^{1/2}]. [5]

Toward this end, write μ′X1 = μU + δW, where U and W are iid samples from N(0,1) and δ² = (μ′)² − μ². Inserting this into the left-hand side of Eq. 5 gives

E[(1 − η + η e^{μU + δW − (μ′)²/2})^{1/2}],

where the expectation is over (U,W). It follows from Jensen's inequality, applied to the expectation over W under the concave square-root function, that this is no greater than

E[(1 − η + η e^{μU − μ²/2})^{1/2}],

where the right-hand side reduces to that of Eq. 5. This gives the claim.
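
As a quick numerical sanity check of this monotonicity (our own illustration, not part of the proof), the coordinate-wise affinity in Eq. 4 can be estimated by Monte Carlo for several sample sizes:

```python
import numpy as np

def coordinate_affinity(eps, tau, n, n_mc=200_000, seed=0):
    """Monte Carlo estimate of E[H(f0(1), f1(1); Z1)] in the RW model.

    Z1 ~ (1-eps) N(0,1) + eps N(tau,1); given Z1, the affinity is
    E0[(1 - eta + eta * exp(mu*X - mu^2/2))^(1/2)] with mu = tau/sqrt(n)
    and eta the posterior weight of Eq. 3."""
    rng = np.random.default_rng(seed)
    mu = tau / np.sqrt(n)
    useful = rng.random(n_mc) < eps
    z = rng.standard_normal(n_mc) + tau * useful
    phi = lambda t: np.exp(-0.5 * t * t) / np.sqrt(2.0 * np.pi)
    eta = eps * phi(z - tau) / ((1.0 - eps) * phi(z) + eps * phi(z - tau))
    x = rng.standard_normal(n_mc)            # X ~ N(0,1), independent of Z1
    h = np.sqrt(1.0 - eta + eta * np.exp(mu * x - 0.5 * mu**2))
    return h.mean()

for n in (2, 10, 100):
    print(n, round(coordinate_affinity(eps=0.01, tau=2.0, n=n), 5))  # increases toward 1 with n
```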

Gaussian Assumption.

We now discuss the Gaussian assumption on Z (the assumption on Xi has a less important effect). When n is relatively large (e.g., n ≥ 20), the assumption is reasonable. When n is very small, it might be better to assume that the Z(j) are t-distributed. Because the tail of the t distribution is heavier than that of the Gaussian, it is harder to classify successfully in the t-error model than in the Gaussian model, given the same parameters (p,n,ɛ,τ) in the two settings. Therefore, the region of impossibility continues to be valid in the t setting. Of course, the exact boundary separating the region of possibility from the region of impossibility depends on the specific tail behavior of the marginal density of Z(j), and may differ from that in this note.

Numerical Examples.

In companion articles (4, 5), we show that threshold feature selection achieves an optimal region of classification performance for the ARW model when the threshold t is chosen ideally. Such a classifier has the form L(X;t) = ∑_{j=1}^{p} sgn(Zj) 1{|Zj| ≥ t} X(j). Therefore, we can use L(X;t) with an ideal choice of t to investigate the minimum misclassification error that one can achieve. Call this minimum the ideal error and the minimizing t the ideal threshold; both quantities can be conveniently calculated assuming that we know (ɛ,τ).
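
A sketch of this threshold classifier, with the ideal threshold approximated by a grid search against Monte Carlo test data (our own illustration; in the article the ideal error and ideal threshold are computed directly from (ɛ,τ)). It reuses the simulate_rw sketch given earlier.

```python
import numpy as np

def threshold_classifier(Z, t):
    """Weight vector of L(X;t): w_j = sgn(Z_j) * 1{|Z_j| >= t}."""
    return np.sign(Z) * (np.abs(Z) >= t)

def error_rate(w, mu, n_test=100_000, rng=None):
    """Monte Carlo misclassification rate of sgn(w'X) on fresh data X ~ N(+/- mu, I_p).

    Uses the fact that w'X = Y*(w'mu) + N(0, ||w||^2), so only the scalar score is simulated."""
    rng = np.random.default_rng(rng)
    signal = float(w @ mu)
    noise_sd = float(np.linalg.norm(w))
    if noise_sd == 0.0:
        return 0.5                       # empty classifier: no better than guessing
    Y = rng.choice([-1, 1], size=n_test)
    score = Y * signal + noise_sd * rng.standard_normal(n_test)
    Yhat = np.where(score >= 0, 1, -1)
    return float(np.mean(Yhat != Y))

X, Y, Z, mu = simulate_rw(p=10000, n=100, eps=0.01, tau=2.0, rng=1)
grid = np.linspace(0.5, 4.0, 15)
errors = [error_rate(threshold_classifier(Z, t), mu, rng=2) for t in grid]
print("approximate ideal threshold:", grid[int(np.argmin(errors))], "error:", min(errors))
```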

Fix p = 10^5 and let n = 2, 10, and 100, representing the three types of growth (N), (S), and (R) [with parameter θ = 0.4 in (R)]. When n = 2, 10, let β = 0.55, 0.65, 0.75, so that ɛ = p^−β ≈ 178/56/18 × 10^−5. When n = 100, let β = 0.35, 0.45, 0.55, so that ɛ = p^−β ≈ 1,778/562/178 × 10^−5. For each triplet (p,n,ɛ), let τ range from 0.5 to 3 in increments of 0.1, and calculate the ideal error. Define τ◇ = τ◇(p,n,ɛ) as the largest τ such that the ideal error is ≥ 40% (say); this can be thought of as the critical value below which successful classification is quasi-impossible. (Note: the choice of 40% for the critical value is arbitrary; other choices produce quantitatively similar but not identical results.) When n = 2 (and β takes the corresponding values above), τ◇ ≈ 0.9/1.5/2.0; when n = 10, τ◇ ≈ 1.3/1.9/2.4; when n = 100, τ◇ ≈ 0.7/1.3/1.9.

We compare τ◇ with the asymptotic critical value τ* ≡ √(2ρ*(β) log p) as in Theorem 1. Given p = 10^5, when n = 2 (and β takes the corresponding values above), τ* ≈ 0.88/1.5/1.96; when n = 10, τ* ≈ 1.07/1.86/2.40; when n = 100, τ* ≈ 1.07/1.86/2.64. Both critical values, τ◇ and τ*, are close to each other, especially when n = 2, 10. When n = 100, the differences between the two critical values are larger but still become smaller for larger p. This suggests that the asymptotic separating boundary r = ρ*(β) is valid already for p = 10^5.
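
Using the tau_star sketch from earlier, the asymptotic critical values quoted above can be reproduced directly (illustrative only):

```python
p = 1e5
print([round(tau_star(b, p, 'N', n=2), 2) for b in (0.55, 0.65, 0.75)])        # approx. 0.88, 1.52, 1.96
print([round(tau_star(b, p, 'S'), 2) for b in (0.55, 0.65, 0.75)])             # approx. 1.07, 1.86, 2.40
print([round(tau_star(b, p, 'R', theta=0.4), 2) for b in (0.35, 0.45, 0.55)])  # approx. 1.07, 1.86, 2.64
```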

Last, for each combination (p,n,ɛ,τ), we calculate the ideal threshold t* = t*(p,n,ɛ,τ) and apply L(X;t*) to samples generated according to RW(ɛ,τ;n,p). In Table 2, we report the average (empirical) misclassification errors across 1,000 independent repetitions. (To save space, only part of the results are reported.) Cells in boldface/nonboldface correspond to τ's that fall below/above τ*, respectively. As predicted by Theorem 1, most boldface numbers are ≥ 40%, and most nonboldface numbers are < 40% and become increasingly smaller as τ increases. The results also suggest that in the region of impossibility, L(X;t) performs poorly even with the ideal threshold.

Table 2. Misclassification errors for L(X;t) with ideal thresholds (p = 10^5, ɛ = p^−β)

Relation to Higher Criticism.

We briefly discuss the region of possibility. In the interior of this region, it is possible to train classifiers whose misclassification probability on fresh data → 0 under the ARW model. Such a classifier can be trained by adopting the recently introduced notion of Higher Criticism (HC) threshold feature selection.

HC was first introduced in our previous work (8) as follows. Given a collection π(1) ≤ π(2) ≤ … ≤ π(p) of sorted P values, one calculates the HC objective values

HC(k) = √p [k/p − π(k)] / √(π(k)(1 − π(k))), 1 ≤ k ≤ p.

The HC statistic is the maximum of the objective function. It can be used to assess the significance of the whole body of P values. Given the feature vector X as in the ARW (but not any class labels), test whether the mean vector μ = 0 identically, or whether μ contains a fraction ɛp > 0 of nonzero coordinates, each of them equal to an unknown parameter τp. This testing problem is a modification of the classification problem we study in this note, in which the training set is not available. Similarly, the testing problem was shown in refs. 7 and 8 to have a phase diagram in (β,r) with a region of impossibility and a region of possibility. In fact, if we express τp = √(2r′ log p) and ɛp = p^−β, then the region of impossibility is the range of (β,r) satisfying r′ < ρ(β) and 0 < β < 1, where ρ is the standard phase boundary function introduced in Eq. 1 and r′ = r′(r,*) is the calibration of r appropriate to growth regime * ∈ {N,S,R}. The region of possibility is r′ > ρ(β) and 0 < β < 1. In the whole region of possibility, HC was shown in ref. 8 to yield a successful test: the sum of type I and type II errors of the test → 0 as p → ∞. See ref. 8 (and also refs. 7 and 9) for details.
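
A compact sketch of the HC objective as written above (our notation; restricting the maximization to the smallest α0 fraction of P values anticipates the thresholding recipe described next):

```python
import numpy as np
from scipy.stats import norm

def hc_statistic(pvalues, alpha0=0.10):
    """Higher Criticism: maximize sqrt(p)*(k/p - pi_(k)) / sqrt(pi_(k)*(1 - pi_(k)))
    over the smallest alpha0 fraction of the sorted P values. Returns (HC value, maximizing k)."""
    p = len(pvalues)
    pi = np.sort(pvalues)
    k = np.arange(1, p + 1)
    obj = np.sqrt(p) * (k / p - pi) / np.sqrt(pi * (1.0 - pi))
    kmax = max(1, int(np.ceil(alpha0 * p)))
    kstar = int(np.argmax(obj[:kmax])) + 1
    return obj[kstar - 1], kstar

# Example: two-sided P values from a z-score vector with a few elevated coordinates.
rng = np.random.default_rng(0)
Z = rng.standard_normal(10000)
Z[:100] += 2.5
pv = 2 * norm.sf(np.abs(Z))
print(hc_statistic(pv))
```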

HC can be used to select thresholds for feature selection (4). One maximizes the HC objective over the interval 1 ≤ k ≤ α0 · p. The P value π(k*) at the maximizer can be converted into a two-sided z-score, say z(k*). Select all features whose z-scores exceed z(k*) in absolute value. The trained classifier is the weighted sum of the standardized feature values, with weights obtained by thresholding the training-set z-scores at the HC threshold (HCT) z(k*). The concept, numerical performance, and practical features of HCT are reported in ref. 4, and an idealized HCT was carefully studied in ref. 5 in the slow-growth regime (S). It was shown there that, in the slow-growth regime (S), ideal HCT works throughout the possibility region of the phase diagram. The idea of component-wise thresholding is closely related to that in ref. 10 (see also refs. 12 and 13, where the focus is on hypothesis testing and dimension reduction, respectively).
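
Putting the pieces together, here is a sketch of the resulting HC-thresholding classifier; it reuses the simulate_rw, error_rate, and hc_statistic sketches above and is our illustration of the recipe, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def hct_classifier(Z, alpha0=0.10):
    """HC threshold feature selection: two-sided P values from the training z-scores,
    maximize the HC objective over k <= alpha0*p, convert the P value at the maximizer
    to a two-sided z threshold, and keep sgn(Z_j) for features with |Z_j| above it."""
    pv = 2 * norm.sf(np.abs(Z))
    _, kstar = hc_statistic(pv, alpha0)
    z_hct = norm.isf(np.sort(pv)[kstar - 1] / 2.0)    # two-sided P value -> |z| threshold
    return np.sign(Z) * (np.abs(Z) >= z_hct), z_hct

X, Y, Z, mu = simulate_rw(p=10000, n=100, eps=0.01, tau=2.0, rng=3)
w, z_hct = hct_classifier(Z)
print("HCT threshold:", round(float(z_hct), 2), "features kept:", int(np.sum(w != 0)))
print("test error:", error_rate(w, mu, rng=4))
```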

HCT Achieves Separation Throughout the Possibility Region.

First, we show that if we perform ideal feature selection with an oracle threshold, i.e., if an oracle tells us the unknown parameters (β,r), then the resulting trained classifier yields successful classification throughout the region of possibility. Second, we show that the HCT converges to the oracle threshold asymptotically (but does not require help from any oracle).

HC can also be used directly for classification (14), without feature selection (for a comparison with the method above, see ref. 5).

In this article, we have attempted to draw the attention of working scientists to a basic phenomenon that might affect many high-throughput studies. The mathematical scientist interested in this phenomenon will want to know that, independently, Ingster, Pouet, and Tsybakov (15) have analyzed a setting more general than the present one and identified similar phase-transition phenomena.

Proof of Lemma 3: We now show Lemma 3 for the case of no growth and the case of regular growth. Once these are proved, the case of slow growth follows from the monotonicity result in Lemma 4 and the way ρ*(β) is defined.

The case of no growth (N). In this case, n is a fixed integer. It suffices to show that, for fixed (β,r) with r < [n/(n+1)]ρ(β),

E[H(f0(1),f1(1);Z1,ɛp,τp,np,p)] = 1 + o(1/p). [6]

Note that in the ARW, the density of Z1 is (1 − ɛp)φ(x) + ɛpφ(x − τp). By Eq. 4 and direct calculations,

[Eq. 7 omitted]

where

[Eq. 8 omitted]

and E0,0 is the expectation with respect to the law under which X and Z are iid samples from N(0,1). Write for short E0 = E0,0 whenever there is no confusion. Introduce ap = 1/(1 − ɛp),

V1(θ,ζ) = [(1 + apθζ)/(1 + apθ)]^{1/2} − 1 − (1/2)apθζ + (1/2)apθ,

and

V2(θ,ζ) = [(1 + apθζ)/(1 + apθ)]^{1/2} − 1.

Note that

[equation omitted]

It follows from direct calculations that

[equation omitted]

and, similarly,

[equation omitted]

Combining Eqs. 7–10, to show Eq. 6 it is sufficient to show

[Eq. 11 omitted]

and

[Eq. 12 omitted]

Toward this end, introduce the auxiliary functions ψ1 and ψ2:

[equation omitted]

We need the following lemma, the proof of which is elementary, so we omit it.

Lemma 5. For sufficiently large p, there is a constant C > 0 such that for any θ > 0 and ζ > 0, |V1(θ,ζ)| ≤ C[ψ1(θζ) + (1 + ζ)ψ1(θ)] and |V2(θ,ζ)| ≤ C(1 + ζ)ψ2(θ).

We now show Eqs. 11 and 12. Consider Eq. 11 first. Denote

W = (τpZ + μpX)/σp.

As X and Z are iid samples from N(0,1), W ∼ N(0,1). Using Lemma 5,

[equation omitted]

[Eq. 15 omitted]

The second term in Eq. 15 is no greater than the first term. To see this, writing ψ1(ɛp e^{σpW − σp²/2}) = ψ1(ɛp e^{τpZ − τp²/2} e^{μpX − μp²/2}), it follows from Eq. 8, the convexity of ψ1, and Jensen's inequality that

[equation omitted]

which validates the aforementioned point. Combining this with Eq. 15 gives

[Eq. 16 omitted]

We now analyze E0[ψ1(ɛp e^{σpW − σp²/2})]. Introduce r0 = [(n+1)/n] r. Recall that τp² = 2r log p and that μp² = (2/n) r log p. It follows from the definitions of σp and r0 that σp = √(2r0 log p). In addition, we introduce two thresholds tp = tp(r,β) = [(β + r)/(2r)] τp and tp0 = tp0(r0,β) [definition omitted]. By these definitions and basic algebra,

[Eq. 18 omitted]

Combining Eq. 18 with the definition of ψ1,

[Eq. 19 omitted]

where Φ is the cdf of N(0,1).

In addition, by the assumption r < [n/(n+1)]ρ(β), we have r0 < ρ(β). In view of the definitions of tp, σp, and ρ(β), it follows from basic algebra that

[Eq. 20 omitted]

Note also that

[Eq. 21 omitted]

Combining Eqs. 19–21 gives

[Eq. 22 omitted]

Recall that r0 < ρ(β). It follows from the definition of ρ(β) that (β + r0)²/(4r0) > 1 and that 2β − 2r0 > 1. As a result,

[equation omitted]

Inserting Eq. 22 into Eq. 16 gives Eq. 11.

Next, consider Eq. 12. Similarly, by Lemma 5, Eq. 8, and the independence of X and Z,

[equation omitted]

Similarly, since ɛp e^{τpZ + τp²/2} > 1 if and only if Z > tp − τp, by the definition of ψ2 and elementary calculus,

[equation omitted]

[equation omitted]

Combining Eq. 25 with Eqs. 20 and 21,

[Eq. 26 omitted]

Now, since r < ρ(β), we have (β − r)²/(4r) > 1 − β and β − 2r > 1 − β. Inserting these into Eq. 26 gives Eq. 12 and concludes the proof for the case of no growth.

The case of regular growth (R). By Eq. 2, we can limit (β,r) to the range 0 < r < β and 0 < r < 1. Define

[Eq. 27 omitted]

Basic algebra shows that the assumption r < (1 − θ)ρ(β/(1 − θ)) is equivalent to 2δ(r,β) < θ. Recall that the sample size is n = p^θ. It is sufficient to show that, for fixed (β,r,θ) satisfying 2δ(r,β) < θ,

E[H(f0(1),f1(1);Z1,ɛp,τp,np,p)] = 1 + o(1/p). [28]

Rewrite H(f0,f1;Z) = E0[(1 + ηp(Z)(e^{μpX − μp²/2} − 1))^{1/2}]. Since |√(1 + x) − 1 − x/2| ≤ Cx² for any x > −1,

[Eq. 29 omitted]

Recalling n = p^θ and μp = τp/√n, direct calculations show that

[Eq. 30 omitted]

[Eq. 31 omitted]

Combining Eqs. 30 and 31 with Eq. 29 gives |H(f0,f1;Z) − 1| ≤ C log(p) p^−θ ηp²(Z). If we can show that, for any (β,r) in the range 0 < r < β and 0 < β < 1,

[Eq. 32 omitted]

then |E[H(f0,f1)] − 1| ≤ E[|H(f0,f1) − 1|] ≤ C log(p) p^{−1 + 2δ(r,β) − θ}, and Eq. 28 follows from the assumption 2δ(r,β) < θ.

We now show Eq. 32. Write E[η²(Z)] = I + II, where

[equation omitted]

with E0 being the expectation with respect to the law of Z ∼ N(0,1). It is sufficient to show that

[Eq. 34 omitted]

Consider the first claim of Eq. 34. Recall that τp = √(2r log p) and tp = tp(r,β) = [(β + r)/(2r)] τp, and note that for sufficiently large p, (1 − ɛp) + ɛp e^{τpZ − τp²/2} ≥ 1/2. Combining this with the way η(Z) is defined, η(Z) ≤ 2ɛp e^{τpZ − τp²/2} when Z ≤ tp, and η(Z) ≤ 1 otherwise. It follows from these inequalities and elementary calculus that

[Eq. 35 omitted]

By arguments similar to those in the proof for the case of no growth,

[Eq. 36 omitted]

and

[Eq. 37 omitted]

Inserting Eqs. 36 and 37 into Eq. 35 gives the claim.

Consider the second claim of Eq. 34. By a similar argument, II ≤ C[ɛp³ e^{3τp²} Φ(tp − 3τp) + ɛp Φ(−(tp − τp))],

[equation omitted]

and ɛp Φ(−(tp − τp)) ≤ C p^{2r − 2β − (β − 3r)²/(4r)}. Since 6r − 3β < 2r − 2β when r ≤ β/5, combining these results gives the second claim of Eq. 34 and concludes the proof for the case of regular growth.

Acknowledgments

I thank David Donoho and anonymous referees for numerous helpful pointers and comments that improved the manuscript, and the Newton Institute for hospitality during the program Statistical Challenges of High-Dimensional Data. This work was supported in part by National Science Foundation Grant DMS-0908613.

Footnotes

1E-mail: jiashun{at}stat.cmu.edu

Author contributions: J.J. designed research, performed research, and wrote the paper.

The author declares no conflict of interest.

Freely available online through the PNAS open access option.

References

1. NCI-NHGRI Working Group on Replication in Association Studies (2007) Replicating genotype-phenotype associations. Nature 447:655–660.
2. Ioannidis JPA (2005) Why most published research findings are false. PLoS Med 2:e124.
3. Ioannidis JP, Ntzani EE, Trikalinos TA, Contopoulos-Ioannidis DG (2001) Replication validity of genetic association studies. Nat Genet 29:306–309.
4. Donoho D, Jin J (2008) Higher Criticism thresholding: Optimal feature selection when useful features are rare and weak. Proc Natl Acad Sci USA 105:14790–14795.
5. Donoho D, Jin J (2008) Feature selection by Higher Criticism thresholding: Optimal phase diagram. arXiv:0812.2263v1 [math.ST].
6. Hall P, Jin J (2009) Innovated Higher Criticism for detecting sparse signals in correlated noise. arXiv:0902.3837v1 [math.ST].
7. Ingster YI (1997) Some problems of hypothesis testing leading to infinitely divisible distribution. Math Methods Stat 6:47–69.
8. Donoho D, Jin J (2004) Higher criticism for detecting sparse heterogeneous mixtures. Ann Stat 32:962–994.
9. Jin J (2003) Detecting and estimating sparse mixtures. PhD thesis (Department of Statistics, Stanford Univ, Stanford, CA).
10. Fan J, Fan Y (2008) High-dimensional classification using features annealed independence rules. Ann Stat 36:2605–2637.
11. Le Cam L, Yang G (2000) Asymptotics in Statistics (Springer, New York).
12. Fan J (1996) Test of significance based on wavelet thresholding and Neyman's truncation. J Am Stat Assoc 91:674–688.
13. Fan J, Song R (2009) Sure independence screening in generalized linear models with NP-dimensionality. arXiv:0903.5255 [math.ST].
14. Hall P, Pittelkow Y, Ghosh M (2008) Theoretical measures of relative performance of classifiers for high dimensional data with small sample sizes. J R Stat Soc B 70:158–173.
15. Ingster YI, Pouet C, Tsybakov AB (2009) Sparse classification boundaries. arXiv:0903.4807 [math.ST].