Edgeworth Expansion

An Edgeworth expansion of the distribution of Wn modifies the standard normal approximation such that the first r cumulants (typically 3 or 4) of the approximating distribution match those of Wn.

From: Philosophy of Statistics , 2011

Normal Approximations

Robert J. Boik , in Philosophy of Statistics, 2011

B.5 Theorem 11: Edgeworth Expansion

The Edgeworth expansion is based on properties of Hermite polynomials, defined in §8, and properties of characteristic and cumulant generating functions, defined in §B.1. Severini [2005, Theorem 3.8] verified that if Y is a random variable whose characteristic function, CF Y (t) satisfies

(17) | CF ( t ) | dt < , then the pdf of Y i s f Y ( y ) = 1 2 π e ity CF Y ( t ) dt .

Suppose that Z ∼ N(0,1). Then the characteristic function of Z is CF Z (t) = exp{− t 2/2}. It follows from (17) that

φ ( z ) = 1 2 π e itz e t 2 / 2 dt ,

where φ(z) = φ(z,0,1). Differentiating both sides of the above equality and using the definition of Hermite polynomials (see §8) reveals that

(18) d r φ ( z ) ( d z ) r = ( 1 ) r φ ( z ) H r ( z ) = 1 2 π e itz ( it ) r e 1 2 t 2 dt .

Also, if r ≥ 1, then

(19) H r ( u ) φ ( u ) du = H r 1 ( z ) φ ( z ) because H r ( u ) φ ( u ) du = ( 1 ) r φ ( u ) d r φ ( u ) ( du ) r φ ( u ) du = ( 1 ) r d r φ ( u ) ( du ) r du = ( 1 ) r d r 1 φ ( u ) ( du ) r 1 | z = ( 1 ) r ( 1 ) r 1 H r 1 ( u ) φ ( u ) | z = H r 1 ( z ) φ ( z ) .

To justify Theorem 11, first use (11) and (13) to expand the cumulant generatingfunction of W n . The result is

CGF W n ( t ) = it n ω 1 + ( it ) 2 2 ( 1 + ω 2 n ) + ( it ) 3 6 n ω 3 + ( it ) 4 24 n ω 4 + o ( n 1 ) .

Second, use the inversion formula (17) and the above expansion to obtain

f n ( w ) = 1 2 π e itw CF W n ( t ) dt = 1 2 π e itw exp { CGF W n ( t ) } dt = 1 2 π e itw exp { i t n ω 1 + ( i t ) 2 2 ( 1 + ω 2 n ) + ( i t ) 3 6 n ω 3 + ( i t ) 4 24 n ω 4 + o ( n 1 ) } dt .

Using (it)2 = − t 2 and expanding the exponential function yields

f n ( w ) = 1 2 π e itw e 1 2 t 2 exp { it n ω 1 + ( it ) 2 2 n ω 2 + ( it ) 3 6 n ω 3 + ( it ) 4 24 n ω 4 + o ( n 1 ) } dt = 1 2 π e itw e 1 2 t 2 [ 1 + it n ω 1 + ( it ) 2 2 n ω 2 + ( it ) 3 6 n ω 3 + ( it ) 4 24 n ω 4 + ( it ) 2 2 n ω 1 2 + ( it ) 4 6 n ω 1 ω 3 + ( it ) 6 72 n ω 3 2 + o ( n 1 ) ] dt = 1 2 π e itw e 1 2 t 2 [ 1 + it n ω 1 + ( it ) 2 2 n ( ω 2 + ω 1 2 ) + ( it ) 3 6 n ω 3 + ( it ) 4 24 n ( ω 4 + 4 ω 1 ω 3 ) + ( it ) 6 72 n ω 3 2 + o ( n 1 ) ] dt .

Lastly, use (18) to integrate term by term. The result is as follows:

f n ( w ) = φ ( w ) [ 1 + H 1 ( z ) n ω 1 + H 2 ( z ) 2 n ( ω 2 + ω 1 2 ) + H 3 ( z ) 6 n ω 3 + H 4 ( z ) 24 n ( ω 4 + 4 ω 1 ω 3 ) + H 6 ( z ) 72 n ω 3 2 + o ( n 1 ) ] .

The Edgeworth expansion for the cdf, F n (w) is obtained by using (19) to integrate the pdf expansion term by term.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444518620500320

Asymptotic Expansions of the Distributions of the Least Squares Estimators in Factor Analysis and Structural Equation Modeling

Haruhiko Ogasawara , in Handbook of Statistics, 2012

Abstract

Asymptotic distributions of the least squares estimators in factor analysis and structural equation modeling are derived using the Edgeworth expansions up to order O(1/ n) under nonnormality. The estimators dealt with in this chapter are those for unstandardized variables by normal theory generalized least squares, simple or scale-free least squares, least squares with powers of diagonals and unweighted least squares, and those by unweighted least squares for standardized variables. It is shown that the formulas also hold for the corresponding estimators by maximum likelihood. Simulations are performed to see the accuracy of the formulas in factor analysis. The case of the normal theory Studentized statistics under nonnormality is discussed.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444518750000075

Handbook of Statistics

P.K. Pathak , C.R. Rao , in Handbook of Statistics, 2013

5 Second-order correctness of the sequential bootstrap

The proof of the second-order correctness of the sequential bootstrap requires the Edgeworth expansion for dependent random variables. Along the lines of the Hall-Mammen work (1994), we first outline an approach based on cumulants. This approach assumes that a formal Edgeworth expansion is valid for pivot under the sequential bootstrap.

Let N i denote the number of times the ith observation x i from the original sample appears in the sequential bootstrap sample, 1 i n . Then

(59) N = N 1 + N 2 + + N n

in which N 1 , N 2 , are exchangeable random variables.

The probability distribution of N is given by

(60) P ( N = k ) = n - 1 m Δ m ( x n ) k

for k m , and in which Δ is the difference operator with unit increment. The moment generating function of N is given by

(61) M ( t ) = E ( e tN ) = n - 1 m Δ m n ( n - xe t ) .

The second-order correctness of the sequential bootstrap for linear statistics such as the sample sum is closely related to the behavior of the moments of the random variables { N i : 1 i n } . Among other things, the asymptotic distribution of each N i is Poisson with mean 1. In fact, it can be shown that

(62) E ( N 1 - 1 ) k 1 ( N i - 1 ) k i = j = 1 i ( e Δ - 1 - Δ ) ( x - 1 ) k j + O 1 n .

It follows from (62) that to order O ( n - 1 ) , the random variables { N i : 1 i n } are asymptotically independent. This implies that the Hall-Mammen-type (1994) conditions for the second-order correctness of the sequential bootstrap hold. This approach is based on the tacit assumption that formal Edgeworth-type expansions go through for the sequential bootstrap. A rigorous justification of such an approach is unavailable in the literature at the present time. Another approach which bypasses this difficulty altogether entails a slight modification of the sequential bootstrap. It is based on the observation that each N i in Eq. (59) is approximately a Poisson variate subject to the constraint:

(63) I ( N 1 > 0 ) + I ( N 2 > 0 ) + + I ( N n > 0 ) = m n ( 1 - e - 1 )

i.e., there are exactly m non-zero N i s , 1 i n . This observation enables us to modify the sequential bootstrap so that existing techniques on the Edgeworth expansion, such as those of Babu and Bai (1996), Bai and Rao (1991,1992), Babu and Singh (1989), and others, can be employed. We refer to this modified version as the Poisson bootstrap.

The Poisson Bootstrap: The original sample ( x 1 , , x n ) is assumed to be from R k for greater flexibility. Let α 1 , , α n denote n independent observations from P ( 1 ) , the Poisson distribution with unit mean. If there are exactly m = [ n ( 1 - e - 1 ) ] non-zero values among α 1 , , α n , take

(64) S ^ N = { ( x 1 , α 1 ) , , ( x n , α n ) }

otherwise reject the α s and repeat the procedure. This is the conceptual definition. The sample size N of the Poisson bootstrap admits the representation:

(65) N = α 1 + α 2 + + α n

in which α 1 , , α n are IID Poisson variates with mean = 1 and with the added restriction that exactly m of the n α s are non-zero, i.e., I ( α 1 > 0 ) + + I ( α n > 0 ) = m .

A simple way to implement the Poisson bootstrap in practice is to first draw a simple random sample without replacement (SRSWOR) of size m from the set of unit-indices { 1 , 2 , , n } , say ( i 1 , i 2 , , i m ) . Then assign respectively to these α i 1 , α i 2 , , α i m values independently drawn from the truncated Poisson distribution with λ = 1 and left-truncated at x = 0 (R-syntax: qpois (runif (m, dpois (0,1), 1), 1)) and set α = 0 for the remaining α s .

It can be shown that the moment generating function M N ( t ) of N = α 1 + α 2 + + α n is (Theorem 2.1 in Babu et al. (1999)):

(66) M N ( t ) = e ( e t - 1 ) - e - 1 ( 1 - e - 1 ) m

so that the distribution of N can be viewed as that of m IID random variables with a common moment generating function:

(67) m ( t ) = e ( e t - 1 ) - e - 1 ( 1 - e - 1 ) .

It is clear that m ( t ) is the moment generating function of the Poisson distribution with location parameter λ = 1 and truncated at x = 0 .

This modification of the sequential bootstrap enables us to develop a rigorous proof of the second-order correctness in the sequential case. Now let ( X 1 , , X n ) be IID random variables with mean μ and variance σ 2 . We assume that X 1 is strongly non-lattice, i.e., it satisfies Cramér's condition:

(68) lim sup | t | | E ( exp ( itX 1 ) ) | < 1 .

Let { Y j : j 1 } be a sequence of IID Poisson random variables with mean 1. We now state three main results, furnishing a rigorous justification for the second-order correctness of the sequential bootstrap. These results follow from conditional Edgeworth expansions for weighted means of multivariate random vectors (cf Babu et al., 1999).

Theorem 4

Suppose that E X 1 5 < and that the characteristic function of X 1 satisfies Cramér's condition (68). If m - n ( 1 - e - 1 ) is bounded, then

(69) P 1 N i = 1 n ( X i - X ¯ ) Y i xs n | T n = m ; X 1 , , X n - P 1 n i = 1 n ( X i - E ( X 1 ) ) x σ

= O p ( n - 1 )

uniformly in x, given T n i = 1 n I ( Y i > 0 ) = m .

Smooth Functional Model: An extension of Theorem 4 to the multivariate case goes through to statistics which can be expressed as smooth functions of multivariate means. Now let X 1 , , X n be a sequence of IID random vectors with mean μ and dispersion matrix Σ . Let Σ n denote the corresponding sample dispersion matrix. Then the following results hold.

Theorem 5

Suppose that X 1 is strongly non-lattice and E X 1 4 < . Let H be a three-times continuously differentiable function in a neighborhood of μ . Let l ( y ) denote the vector of first-order partial derivatives at y and suppose that l ( μ ) 0 . If m - n ( 1 - e - 1 ) is bounded, then for almost all sample sequences { X i : 1 i n } , we have

(70) n P N ( H ( N - 1 i = 1 n X i Y i ) - H ( X ¯ n ) ) l ( X ¯ n ) Σ n l ( X ¯ n ) x | T n = m ; X 1 , , X n

- P n ( H ( X ¯ n ) - H ( μ ) ) l ( μ ) Σ l ( μ ) x = o ( 1 )

in which | · | denotes the sup-norm over x .

The following result is well suited for applications to studentized statistics.

Theorem 6

Let { X i : 1 i n } satisfy the conditions of Theorem 5. Suppose that the function H is three-times continuously differentiable in the neighborhood of the origin and H ( 0 ) = 0 . If m - n ( 1 - e - 1 ) is bounded, then for almost all sample sequences { X i : 1 i n } , we have

(71) n P N ( H ( N - 1 i = 1 n ( X i - X ¯ ) Y i ) ) l ( 0 ) Σ n l ( 0 ) z | T n = m ; X 1 , , X n

- P n ( H ( X ¯ n - μ ) ) l ( 0 ) Σ l ( 0 ) z = o ( 1 )

For example, an immediate consequence of Theorem 6 is the second-order correctness of the following sequential bootstrap pivot:

(72) π ˆ N = N ( i = 1 n ( X i - X ¯ ) Y j ) / s n

given that T n i = 1 n I ( Y i > 0 ) = m .

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/B9780444538598000011

Blind Source Separation: The Sparsity Revolution

Jerome Bobin , ... Mohamed Jalal Fadili , in Advances in Imaging and Electron Physics, 2008

C The Algorithmic Viewpoint

a. Approximating Independence . In the ICA setting, the mixing matrix is square and invertible. Solving a BSS problem is equivalent to looking for a demixing matrix B that maximizes the independence of the estimated sources: S ˜ = BX . In that setting, maximizing the independence of the sources (with respect to the KL divergence) is equivalent to maximizing the non-Gaussianity of the sources. Since the seminal article by Comon (1994), a variety of ICA algorithms have been proposed. They all merely differ in the way they devise assessable quantitative measures of independence. Some popular approaches that have given "measures" of independence are presented below:

Information maximization (see Bell and Sejnowski, 1995; Nadal and Parga, 1994): Bell and Sejnowski showed that maximizing the information of the sources is equivalent to minimizing the measure of independence based on the KL divergence in Eq.(5).

Maximum likelihood: Maximum likelihood (ML) has also been proposed to solve the BSS issue. The ML approach (Cardoso, 1997; Parra and Pearlmutter, 1997; Pham et al., 1992) has been showed to be equivalent to information maximization (InfoMax) in the ICA framework.

Higher-order statistics: As noted previously, maximizing the independence of the sources is equivalent to maximizing their non-Gaussianity under a strict decorrelation constraint. Because Gaussian random variables have vanishing higher-order cumulants, devising a separation algorithm based on higher-order cumulants should provide a way of accounting for the non-Gaussianity of the sources. A wide range of algorithms have been proposed based on the use of higher-order statistics (Hyvarinen et al., 2001; Belouchrani et al., 1997; Cardoso, 1999, and references therein). Historical papers (see Comon, 1994 ) proposed ICA algorithms that use approximations of the KL divergence (based on truncated edgeworth expansions). Interestingly, those approximations explicitly involve higher-order statistics.

Lee et al. (1998) showed that most ICA-based algorithms are similar in theory and in practice.

b. Limits of ICA . Despite its theoretical strength and elegance, ICA has several limitations:

Probability density assumption: Even implicit, ICA algorithm requires information on the sources distribution. As stated in Lee et al. (1998), whatever the contrast function to minimize (mutual information, ML, higher-order statistics), most ICA algorithms can be equivalently restated in a natural gradient form (Amari, 1999; Amari and Cardoso, 1996). In such a setting, the "demixing" matrix B is estimated iteratively: B  B  + μΔB where the natural gradient of B is given by:

(8) Δ B I h S ˜ S ˜ T B ,

where the function h is applied elementwise: h ( S ˜ ) = [ h ( s ˜ i j ) ] and S ˜ is the current estimate of S : S ˜ = BX . Interestingly, the so-called score function h in Eq.(8) is closely related to the assumed pdf of the sources (see Amari and Cardoso, 1996; Amari and Cichocki, 2002). Assuming that all the sources are generated from the same probability density function f S , the so-called score function h is defined as follows:

(9) h S ˜ = log f S S ˜ S ˜ .

As expected, the way the "demixing" matrix (and thus the sources) is estimated closely depends on the way the sources are modeled (from a statistical point of view). For instance, separating platykurtic (distribution with negative kurtosis) or leptokurtic (distribution with positive kurtosis) sources requires completely different score functions. Even if ICAis shown in Amari and Cardoso to be quite robust to "mismodeling," the choice of the score function is crucial with respect to the convergence (and rate of convergence) of ICA algorithms. Some ICA-based techniques (see Koldovsky and Oja, 2006) emphasized adapting the popular FastICA algorithm to adjust the score function to the distribution of the sources. They particularly emphasize modeling sources the distribution of which belongs to specific parametric classes of distributions such as generalized Gaussian: f S (S)     ij   exp(− μ|s ij | θ ). 1

Noisy ICA: Only a few works have already investigated the problem of noisy ICA (see Davies, 2004; Koldovsky and Tichavsky, 2006). As pointed out by Davies (2004), noise clearly degenerates the ICA model: it is not fully identifiable. In the case of additive Gaussian noise as stated in Eq.(2), using higher-order statistics yields an efficient estimate of the mixing matrix A   =   B   1 (higher-order statistics are blind to additive Gaussian noise; this property does not hold for non-Gaussian noise). Further, in the noisy ICA setting, applying the demixing matrix to the data does not yield an efficient estimate of the sources. Furthermore, most ICA algorithms assume the mixing matrix A to be square. When there are more observations than sources (m  > n), a dimension reduction step is preprocessed. When noise perturbs the data, this subspace projection step can dramatically deteriorate the performance of the separation stage.

The next section introduces a new way of modeling the data to avoid most of the aforementioned limitations of ICA.

1 Sparsity in Blind Source Separation

In the above paragraph, we pointed out that BSS is overwhelmingly a question of contrast and diversity. Indeed, devising a source separation technique consists of finding an effective way of disentangling between the sources. From this viewpoint, statistical independence is a kind of "measure" of diversity between signals. Within this paradigm, we can wonder if independence is a natural way of differentiating between signals.

As a statistical property, independence is a non-sense in a non-asymptotic study. In practice, one must deal with finite-length signals, sometimes with a few samples. Furthermore, most real-world data are modeled by stationary stochastic processes. Let us consider the images in Figure 1.

Figure 1. Examples of natural images.

Natural pictures are clearly nonstationary. As these pictures are slightly correlated, independence fails in differentiating between them. Hopefully, the human eye (more precisely the different levels of the human visual cortex) is able to distinguish between those two images. Then, what makes the eye so effective in discerning between visual "signals"?

The answer may come from neurosciences. Indeed, for a decades, many researchers (Barlow, 1961; Field, 1999; Hubel and Wiesel, 1981; 2 Olshausen and Field, 2006; Simoncelli and Olshausen, 2001, and references therein) in this field have endeavored to provide some exciting answers: the mammalian visual cortex seems to have learned via the natural selection of individuals, an effective way of coding the information in natural scenes. Indeed, the first level of the mammalian visual cortex (termed V1) seems to verify several interesting properties: (1) it tends to "decorrelate" the responses of visual receptive fields (following Simoncelli and Olshausen, 2001; an efficient coding cannot duplicate information in more than one neuron), (2) owing to a kind of "economy/compression principle," saving neurons' activity yields a sparse activation of neurons for a given stimulus (this property can be considered as a way of compressing information).

Furthermore, the primary visual cortex is sensitive to particular stimuli (visual features) that surprisingly look like oriented Gabor-like wavelets (see Field, 1999). It gives support to the crucial part played by contours in natural scenes. Furthermore, each stimulus tends to be coded by a few neurons. Such a way of coding information is often referred to as sparse coding. These few elements of neuroscience motivate the use of sparsity as an effective way of compressing signal's information, thus extracting its very essence.

Inspired by the behavior of our visual cortex, seeking a sparse code may provide an effective way of differentiating between "different" signals. Here, "different" signals are signals with different sparse representations.

a. A Pioneering Work in Sparse BSS . The seminal paper of Zibulevsky and Pearlmutter (2001) introduced sparsity as an alternative to standard contrast functions in ICA. In this work, the authors proposed to estimate the mixing matrix A and the sources S in a fully Bayesian framework. Each source {si } i  =   1,…,n is assumed to be sparsely represented in the basis Ф:

(10) i = 1 , , n ; s i = k = 1 t α i k ϕ k .

As the sources are assumed to be sparse, the distribution of their coefficients in Ф is a "sparse" (i.e., leptokurtic) prior distribution:

(11) f S α i k e μ i g γ α i k ,

where gγ (αi [k])   =   |αi [k]| γ with γ    1. 3 Zibulevsky proposed to estimate A and S via a maximum a posteriori (MAP) estimator. The optimization task is then run using a Newton-like algorithm: the relative newton algorithm (RNA; see Zibulevski, 2003 for more details). This new sparsity-based method paved the way for the use of sparsity in BSS. Note that several other works emphasized the use of sparsity in a parametric Bayesian approach (Hyvarinen et al., 2001 and references therein). Recently, sparsity has emerged as an effective tool for solving underdetermined source separation issues (Bronstein et al., 2005; Georgiev et al., 2005; Li et al., 2006; Vincent, 2007 and references therein). This chapter concentrates on overdetermined BSS (m  n). Inspired by the work of Zibulevsky, we present a novel sparsity-based source separation framework providing new insights into BSS.

Read full chapter

URL:

https://www.sciencedirect.com/science/article/pii/S1076567008006058

Higher-order approximations for interval estimation in binomial settings

Ana-Maria Staicu , in Journal of Statistical Planning and Inference, 2009

To obtain the Edgeworth expansion for the coverage probability of the one-sided r * confidence intervals it requires a two-step approach. First, we write the stochastic expansion of r * in powers of the score statistic, Z n . Then we expand the coverage probability of the corresponding upper limit confidence intervals, by using the expansion for the cumulative distribution function of Z n ; see the Appendix for more details. Alternatively one could use the expansion of r * with respect to W ˜ n found in Section 2.1, followed by the corresponding Edgeworth expansion of the distribution of W ˜ n . Continuing the theorems presented in Brown et al. (2002) and Cai (2005) we state the following proposition.

Proposition 2.1

The coverage probability of the 100 ( 1 - α ) % upper limit confidence interval CI r * u satisfies

P r * ( θ ) = Pr ( θ CI r * u ) = ( 1 - α ) + Q 1 ( θ , z r * u ) σ - 1 φ ( κ ) n - 1 / 2 + { 1 3 ( 1 - 2 θ ) Q 1 ( θ , z r * u ) - Q 2 ( θ , z r * u ) } σ - 1 κ φ ( κ ) n - 1 + O ( n - 3 / 2 ) ,

where σ 2 = θ ( 1 - θ ) , z r * u is defined by (A.3) in the Appendix, Q 1 ( θ , z ) = g ( θ , z ) - 1 2 and Q 2 ( θ , z ) = 1 2 g 2 ( θ , z ) - 1 2 g ( θ , z ) + 1 12 . Here g ( θ , z ) denotes the fractional part of n θ + n 1 / 2 σ z and we assume that n θ + n 1 / 2 σ z r * u is not an integer.

Read full article

URL:

https://www.sciencedirect.com/science/article/pii/S037837580900086X