Correlation coefficient

Pearson product-moment correlation coefficient

In statistics, the Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables X and Y, giving a value between +1 and −1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. It was developed by Karl Pearson from a similar but slightly different idea introduced by Francis Galton in the 1880s.[1][2] The correlation coefficient is sometimes called "Pearson's r".

[Figure: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center has a slope of 0, but in that case the correlation coefficient is undefined because the variance of Y is zero.]
Definition
Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations:

\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}

The above formula defines the population correlation coefficient, commonly represented by the Greek letter ρ (rho). Substituting estimates of the covariances and variances based on a sample gives the sample correlation coefficient, commonly denoted r:

r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \, \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
An equivalent expression gives the correlation coefficient as the mean of the products of the standard scores. Based on a sample of paired data (X_i, Y_i), the sample Pearson correlation coefficient is

r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s_X} \right) \left( \frac{Y_i - \bar{Y}}{s_Y} \right)

where (X_i - \bar{X})/s_X, \bar{X}, and s_X are the standard score, sample mean, and sample standard deviation, respectively.
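As a concrete illustration of these formulas, here is a minimal Python sketch that computes the sample correlation coefficient directly from the definition (the helper name pearson_r is ours, chosen for this example):

import math

def pearson_r(x, y):
    """Sample Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: sums of squared deviations.
    # (The 1/(n-1) factors in the covariance and variances cancel in the ratio.)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)

# The perfectly correlated data from the example later in this article:
print(pearson_r([1, 2, 3, 5, 8], [0.11, 0.12, 0.13, 0.15, 0.18]))  # ≈ 1.0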
Mathematical properties
The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation). The Pearson correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).
A key mathematical property of the Pearson correlation coefficient is that it is invariant to separate changes in location and scale in the two variables. That is, we may transform X to a + bX and transform Y to c + dY, where a, b, c, and d are constants with b, d > 0, without changing the correlation coefficient (this fact holds for both the population and sample Pearson correlation coefficients). Note that more general linear transformations do change the correlation: see a later section for an application of this.
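A minimal NumPy check of this invariance (the data vectors here are arbitrary illustrative values):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0])

r_original = np.corrcoef(x, y)[0, 1]
# Location-scale transform with positive scale factors: correlation is unchanged.
r_shifted = np.corrcoef(10.0 + 2.0 * x, 3.0 + 0.5 * y)[0, 1]
# A negative scale factor, by contrast, flips the sign of the correlation.
r_flipped = np.corrcoef(x, -y)[0, 1]

print(r_original, r_shifted, r_flipped)  # r, r, and -r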
The Pearson correlation can be expressed in terms of uncentered moments. Since \mu_X = E(X), \sigma_X^2 = E[(X - E(X))^2] = E(X^2) - E^2(X), and likewise for Y, and since

E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y),

the correlation can also be written as

\rho_{X,Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)} \, \sqrt{E(Y^2) - E^2(Y)}}
Alternative formulae for the sample Pearson correlation coefficient are also available:

r = \frac{n \sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{n \sum X_i^2 - \left(\sum X_i\right)^2} \, \sqrt{n \sum Y_i^2 - \left(\sum Y_i\right)^2}}
The above formula conveniently suggests a single-pass algorithm for calculating sample correlations, but, depending on the numbers involved, it can sometimes be numerically unstable.
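To make the single-pass idea concrete, here is a Python sketch that accumulates the five running sums the formula requires in one loop over the data. As the text warns, the subtractions at the end can suffer catastrophic cancellation when the means are large relative to the spread of the data, so this demonstrates the formula rather than a recommended production implementation:

import math

def pearson_r_single_pass(pairs):
    """Compute r in one pass over (x, y) pairs, using the uncentered-sum formula."""
    n = 0
    sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0.0
    for x, y in pairs:
        n += 1
        sum_x += x
        sum_y += y
        sum_xx += x * x
        sum_yy += y * y
        sum_xy += x * y
    # These differences are where numerical cancellation can occur.
    num = n * sum_xy - sum_x * sum_y
    den = math.sqrt(n * sum_xx - sum_x ** 2) * math.sqrt(n * sum_yy - sum_y ** 2)
    return num / den

print(pearson_r_single_pass([(1, 0.11), (2, 0.12), (3, 0.13), (5, 0.15), (8, 0.18)]))  # ≈ 1.0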
Interpretation
The correlation coefficient ranges from −1 to 1. A value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases. A value of −1 implies that all data points lie on a line for which Y decreases as X increases. A value of 0 implies that there is no linear correlation between the variables.
More generally, note that (X_i - \bar{X})(Y_i - \bar{Y}) is positive if and only if X_i and Y_i lie on the same side of their respective means. Thus the correlation coefficient is positive if X_i and Y_i tend to be simultaneously greater than, or simultaneously less than, their respective means. The correlation coefficient is negative if X_i and Y_i tend to lie on opposite sides of their respective means.
Geometric interpretation
[Figure: Regression lines for y = g_x(x) (red) and x = g_y(y) (blue).]

For uncentered data, the correlation coefficient corresponds with the cosine of the angle between both possible regression lines, y = g_x(x) and x = g_y(y).

For centered data (i.e., data which have been shifted by the sample mean so as to have an average of zero), the correlation coefficient can also be viewed as the cosine of the angle between the two vectors of samples drawn from the two random variables (see below).

Some practitioners prefer an uncentered (non-Pearson-compliant) correlation coefficient. See the example below for a comparison.
As an example, suppose five countries are found to have gross national products of 1, 2, 3, 5, and 8 billion dollars, respectively. Suppose these same five countries (in the same order) are found to have 11%, 12%, 13%, 15%, and 18% poverty. Then let x and y be ordered 5-element vectors containing the above data: x = (1, 2, 3, 5, 8) and y = (0.11, 0.12, 0.13, 0.15, 0.18).

By the usual procedure for finding the angle between two vectors (see dot product), the uncentered correlation coefficient is:

\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{2.93}{\sqrt{103} \, \sqrt{0.0983}} = 0.920814711

Note that the above data were deliberately chosen to be perfectly correlated: y = 0.10 + 0.01x. The Pearson correlation coefficient must therefore be exactly one. Centering the data (shifting x by E(x) = 3.8 and y by E(y) = 0.138) yields x = (−2.8, −1.8, −0.8, 1.2, 4.2) and y = (−0.028, −0.018, −0.008, 0.012, 0.042), from which

\cos\theta = \frac{x \cdot y}{\|x\| \, \|y\|} = \frac{0.308}{\sqrt{30.8} \, \sqrt{0.00308}} = 1 = \rho_{xy},

as expected.
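The two cosines in this example can be verified with a short NumPy computation:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([0.11, 0.12, 0.13, 0.15, 0.18])

def cosine(u, v):
    """Cosine of the angle between two vectors (the uncentered correlation)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(x, y))                        # 0.9208... (uncentered)
print(cosine(x - x.mean(), y - y.mean()))  # 1.0 (centered = Pearson's r)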
Interpretation of the size of a correlation

Correlation   Negative        Positive
None          −0.09 to 0.0    0.0 to 0.09
Small         −0.3 to −0.1    0.1 to 0.3
Medium        −0.5 to −0.3    0.3 to 0.5
Large         −1.0 to −0.5    0.5 to 1.0
Several authors[3] have offered guidelines for the interpretation of a correlation coefficient. Cohen (1988)[3] has observed, however, that all such criteria are in some ways arbitrary and should not be observed too strictly. The interpretation of a correlation coefficient depends on the context and purposes. A correlation of 0.9 may be very low if one is verifying a physical law using high-quality instruments, but may be regarded as very high in the social sciences, where there may be a greater contribution from complicating factors.
Inference
[Figure: A graph showing the minimum value of Pearson's correlation coefficient that is significantly different from zero at the 0.05 level, for a given sample size.]

Statistical inference based on Pearson's correlation coefficient often focuses on one of the following two aims. One aim is to test the null hypothesis that the true correlation coefficient is ρ, based on the value of the sample correlation coefficient r. The other aim is to construct a confidence interval around r that has a given probability of containing ρ.
Randomization approaches
Permutation tests provide a direct approach to performing hypothesis tests and constructing confidence intervals. A permutation test for Pearson's correlation coefficient involves the following two steps: (i) using the original paired data (x_i, y_i), randomly redefine the pairs to create a new data set (x_i, y_{i′}), where the i′ are a permutation of the set {1, ..., n}. The permutation i′ is selected randomly, with equal probabilities placed on all n! possible permutations. This is equivalent to drawing the i′ randomly "without replacement" from the set {1, ..., n}. A closely related and equally justified (bootstrapping) approach is to separately draw the i and the i′ "with replacement" from {1, ..., n}; (ii) construct a correlation coefficient r from the randomized data. To perform the permutation test, repeat steps (i) and (ii) a large number of times. The p-value for the permutation test is the proportion of the r values generated in step (ii) that are larger than the Pearson correlation coefficient that was calculated from the original data. Here "larger" can mean either that the value is larger in magnitude, or larger in signed value, depending on whether a two-sided or one-sided test is desired.
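A Python sketch of this two-step procedure, in its two-sided (magnitude-comparing) form; the function name is ours for this illustration:

import numpy as np

def permutation_test_pearson(x, y, n_perm=10000, rng=None):
    """Two-sided permutation test for Pearson's correlation coefficient."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, float), np.asarray(y, float)
    r_obs = np.corrcoef(x, y)[0, 1]
    count = 0
    for _ in range(n_perm):
        # Step (i): randomly re-pair the data by permuting y.
        y_perm = rng.permutation(y)
        # Step (ii): recompute r on the randomized pairing.
        r_perm = np.corrcoef(x, y_perm)[0, 1]
        if abs(r_perm) >= abs(r_obs):
            count += 1
    return r_obs, count / n_perm  # observed r and permutation p-value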
The bootstrap can be used to construct confidence intervals for Pearson's correlation coefficient. In the "non-parametric" bootstrap, n pairs (x_i, y_i) are resampled "with replacement" from the observed set of n pairs, and the correlation coefficient r is calculated based on the resampled data. This process is repeated a large number of times, and the empirical distribution of the resampled r values is used to approximate the sampling distribution of the statistic. A 95% confidence interval for ρ can be defined as the interval spanning from the 2.5th to the 97.5th percentile of the resampled r values.
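The corresponding nonparametric bootstrap, again as a sketch in Python:

import numpy as np

def bootstrap_ci_pearson(x, y, n_boot=10000, rng=None):
    """95% percentile-bootstrap confidence interval for Pearson's r."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        # Resample n pairs with replacement (the x-y pairing is preserved).
        idx = rng.integers(0, n, size=n)
        rs[b] = np.corrcoef(x[idx], y[idx])[0, 1]
    return np.percentile(rs, [2.5, 97.5])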
Approaches based on mathematical approximations
For approximately Gaussian data, the sampling distribution of Pearson's correlation coefficient approximately follows Student's t-distribution with n − 2 degrees of freedom. Specifically, if the underlying variables have a bivariate normal distribution, the variable

t = r \sqrt{\frac{n-2}{1-r^2}}

has a Student's t-distribution in the null case (zero correlation).[4] This also holds approximately even if the observed values are non-normal, provided sample sizes are not very small.[5] For constructing confidence intervals and performing power analyses, the inverse of this transformation is also needed:

r = \frac{t}{\sqrt{n - 2 + t^2}}
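The test based on this transformation can be sketched in a few lines of Python, using SciPy's Student-t distribution for the tail probability:

import math
from scipy import stats

def pearson_t_test(r, n):
    """Two-sided p-value for H0: rho = 0, via the t-transformation of r."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # survival function = 1 - CDF
    return t, p

print(pearson_t_test(0.3, 50))  # t ≈ 2.18, p ≈ 0.034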
Alternatively, large-sample approaches can be used.
Early work on the distribution of the sample correlation coefficient was carried out by R. A. Fisher[6][7] and A. K. Gayen.[8] Another early paper[9] provides graphs and tables for general values of ρ, for small sample sizes, and discusses computational approaches.
Fisher transformation
In practice, confidence intervals and hypothesis tests relating to ρ are usually carried out using the Fisher transformation:

F(r) = \tfrac{1}{2} \ln \frac{1+r}{1-r} = \operatorname{artanh}(r)

If F(r) is the Fisher transformation of r, and n is the sample size, then F(r) approximately follows a normal distribution with mean

F(\rho) = \operatorname{artanh}(\rho)

and standard error

\mathrm{SE} = \frac{1}{\sqrt{n-3}}

Thus, a z-score is

z = \frac{F(r) - F(\rho_0)}{\mathrm{SE}} = \left( F(r) - F(\rho_0) \right) \sqrt{n-3}

under the null hypothesis that ρ = ρ₀, given the assumption that the sample pairs are independent and identically distributed and follow a bivariate normal distribution. Thus an approximate p-value can be obtained from a normal probability table. For example, if z = 2.2 is observed and a two-sided p-value is desired to test the null hypothesis that ρ = 0, the p-value is 2·Φ(−2.2) = 0.028, where Φ is the standard normal cumulative distribution function.
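A short Python sketch of this z-test (math.atanh implements artanh, and Φ is computed from the error function):

import math

def fisher_z_test(r, n, rho0=0.0):
    """Approximate two-sided p-value for H0: rho = rho0 via Fisher's transformation."""
    z = (math.atanh(r) - math.atanh(rho0)) * math.sqrt(n - 3)
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2)))  # standard normal CDF
    return z, 2 * phi(-abs(z))

print(fisher_z_test(0.3, 50))  # z ≈ 2.12, p ≈ 0.034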
Confidence intervals
To obtain a confidence interval for ρ, we first compute a confidence interval for F(ρ):

\operatorname{artanh}(r) \pm \frac{z_{\alpha/2}}{\sqrt{n-3}}

The inverse Fisher transformation brings the interval back to the correlation scale.
For example, suppose we observe r = 0.3 with a sample size of n=50, and we wish to obtain a 95% confidence interval for ρ. The transformed value is artanh(r) = 0.30952, so the confidence interval on the transformed scale is 0.30952 ± 1.96/√47, or (0.023624, 0.595415). Converting back to the correlation scale yields (0.024, 0.534).
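This worked example can be reproduced in a few lines of Python:

import math

def fisher_ci(r, n, z=1.96):
    """Approximate 95% confidence interval for rho via the Fisher transformation."""
    center = math.atanh(r)
    half_width = z / math.sqrt(n - 3)
    return math.tanh(center - half_width), math.tanh(center + half_width)

print(fisher_ci(0.3, 50))  # (0.0236..., 0.5338...), i.e. roughly (0.024, 0.534)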
