Calculating Confidence Intervals Using the Bootstrap

Hi everyone,

In this article, I will try to explain how to find a confidence interval using the Bootstrap Method. Some knowledge of statistics and Python will help you follow along.

Before diving into the method, let’s review some statistical concepts.

Variance: It is the sum of the squared distances between each data point and the mean, divided by the number of data points. For the sample variance, the sum is divided by the number of data points minus one.

Sample variance

Standard Deviation: It is a measure of how far our data points spread out from the mean. It is obtained by taking the square root of the variance.

Sample standard deviation
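The two definitions above can be checked with a few lines of NumPy. This is only a minimal sketch: the five heights are made-up values for illustration, not taken from the data set used later.

```python
import numpy as np

# Hypothetical heights in centimeters, made up for illustration.
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

mean = heights.mean()
squared_distances = (heights - mean) ** 2

# Dividing by n gives the population variance,
# dividing by n - 1 gives the sample variance.
population_variance = squared_distances.sum() / len(heights)
sample_variance = squared_distances.sum() / (len(heights) - 1)

# The standard deviation is the square root of the variance.
sample_std = np.sqrt(sample_variance)

# NumPy exposes the same choice through the ddof argument.
print(population_variance, np.var(heights, ddof=0))
print(sample_variance, np.var(heights, ddof=1))
print(sample_std, np.std(heights, ddof=1))
```

Note that np.var and np.std divide by n by default (ddof=0), so ddof=1 has to be passed explicitly to get the sample versions.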

Cumulative Distribution Function: It can be used for any kind of variable X (discrete, continuous, etc.). For each value x, it gives the probability that X is less than or equal to x under a given probability distribution.

Empirical Cumulative Distribution Function: Also known as the Empirical Distribution Function. The only difference between the CDF and the ECDF is that the former shows the hypothetical distribution of a given population, while the latter is based on our observed data.

For example, how can we interpret the ECDF of the data shown on the chart above? We can say that 40% of heights are less than or equal to 160 cm. Likewise, the percentage of people with heights less than or equal to 180 cm is 99.3%.
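An ECDF like the one described above can be computed by sorting the observations and assigning each one the fraction of data points at or below it. The function name ecdf and the five heights are assumptions made for this sketch; they do not come from the original notebook.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values and, for each one, the fraction of
    observations that are less than or equal to it."""
    x = np.sort(data)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical heights in centimeters, made up for illustration.
heights = np.array([172.0, 158.0, 165.0, 181.0, 160.0])

x, y = ecdf(heights)
for value, fraction in zip(x, y):
    print(f"{fraction:.0%} of heights are <= {value} cm")

# The same reading for an arbitrary threshold:
fraction_at_most_160 = np.mean(heights <= 160.0)
print(fraction_at_most_160)
```

Plotting y against x as points gives the usual ECDF chart.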

Probability Density Function: It shows us the distribution of a continuous variable. The area under the curve over an interval gives the probability of falling in that interval, so the total area under the curve must always be equal to 1.

Normal Distribution: Also known as the Gaussian Distribution. It is the most important probability distribution in statistics; it is bell-shaped and symmetric.

Normal (Gaussian) Distribution
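To make the link between area and probability concrete, the sketch below evaluates the normal density on a fine grid and approximates the area with a simple sum. The mean of 170 and standard deviation of 10 are assumed values chosen only for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution with mean mu and std sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Assumed parameters, chosen only for illustration.
mu, sigma = 170.0, 10.0

# Fine grid covering 8 standard deviations on each side of the mean.
x = np.linspace(mu - 8 * sigma, mu + 8 * sigma, 20001)
dx = x[1] - x[0]
density = normal_pdf(x, mu, sigma)

# Approximate the area under the curve with a Riemann sum.
total_area = density.sum() * dx

# Area between mu - sigma and mu + sigma, i.e. P(160 <= X <= 180).
within_one_sigma = (x >= mu - sigma) & (x <= mu + sigma)
area_within_one_sigma = density[within_one_sigma].sum() * dx

print(total_area)             # close to 1
print(area_within_one_sigma)  # close to 0.68
```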

Confidence Interval: It is a range, estimated from the original sample, that is likely to contain the population value we are interested in (the mean height, for example). It is usually defined at 95% confidence, but other levels can be used. Consider the figure below, which indicates a 95% confidence interval. The lower and upper limits of the confidence interval are the values corresponding to the 2.5th and 97.5th percentiles.

Image by author

What is the Bootstrap Method?

The Bootstrap Method is a resampling method that is commonly used in Data Science. It was introduced by Bradley Efron in 1979. In short, it consists of resampling our original sample with replacement (a Bootstrap Sample) and generating Bootstrap replicates by applying a summary statistic to each resample.

Confidence interval of people’s heights

In this article, we are going to work with one of the datasets in . It is the Weight-Height data set. It contains the height (in inches) and weight (in pounds) of 10,000 people, separated by gender.

If you would like to see the whole code, you can find the IPython notebook via this .

We are going to use the heights of only 500 randomly selected people and compute a 95% confidence interval using the Bootstrap Method.

Let’s start by importing the libraries that we will need.

The first five rows of the DataFrame look like the following.

The heights are in inches, so let’s convert them to centimeters and store the result in a new column, Height(cm).

As we can see above, the minimum and maximum heights in the data set are 137.8 cm and 200.6 cm respectively.

We can use the sample method of pandas.DataFrame to select 500 heights at random. After that, we will print the summary statistics.
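The conversion and sampling steps could look roughly like the sketch below. Because the original file is not available here, a synthetic DataFrame stands in for the real data: the column name Height, the normal parameters used to fake the heights, and the random seeds are all assumptions. Only the Height(cm) column name and the sample size of 500 come from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Weight-Height data (assumption: a "Height"
# column in inches; the real data set would be read from a file instead).
rng = np.random.default_rng(42)
df = pd.DataFrame({"Height": rng.normal(loc=66.0, scale=4.0, size=10_000)})

# Convert inches to centimeters (1 inch = 2.54 cm) in a new column.
df["Height(cm)"] = df["Height"] * 2.54

# Draw 500 heights at random, without replacement (the default).
# random_state makes the draw reproducible.
sample = df["Height(cm)"].sample(n=500, random_state=42)

# Summary statistics: count, mean, std, min, quartiles, max.
print(sample.describe())
```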

According to the output, our sample has a minimum height of 145 cm and a maximum height of 198 cm.

Let’s see what the ECDF and PDF look like.

ECDF, Image by author

The empirical CDF shows that 50% of the people in our sample are 162 cm tall or shorter.

What about the PDF?

PDF, Image by author

The PDF shows us that the distribution of heights is very close to the normal distribution. Do not forget that the area under the curve gives us the probability.

Now, take a moment to think. We have only 500 observations in our sample, but there are billions of people in the world whose heights we cannot measure. Therefore, a single sample cannot tell us by itself how far its mean may be from the population mean. If we took the same measurements for different samples again and again, what would the mean height be?

For instance, assume that we took the same measurements with the same number of people (500) 1000 times and plotted the ECDF of each sample on top of the ECDF of the first one. It would look like the following.

ECDF, Image by author
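The thought experiment of repeating the measurement 1000 times can be imitated with simulated data. In this sketch the population is assumed to be normal with a mean of 168 cm and a standard deviation of 10 cm; both numbers are invented for illustration and are not estimates from the data set.

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed population parameters, invented for illustration.
population_mean, population_std = 168.0, 10.0
n_experiments, n_people = 1000, 500

# Each row is one repetition of the experiment: 500 measured heights.
samples = rng.normal(population_mean, population_std,
                     size=(n_experiments, n_people))

# One mean per experiment. The means differ from run to run,
# but they stay within a narrow range around the population mean.
sample_means = samples.mean(axis=1)

print(sample_means.min(), sample_means.max())
print(sample_means.std())  # roughly population_std / sqrt(n_people)
```

The spread of these means is what a confidence interval describes. In practice we have only one sample, which is the gap the Bootstrap Method fills.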

As we can see above, we got different heights each time, but the points clearly spread within a specific range. That range is the confidence interval we want to find.

You may say that it is impossible to repeat the experiment so many times, and you are not wrong. That is exactly why we use the Bootstrap Method. It helps us simulate the same experiment thousands or even billions of times.

How?


In fact, the Bootstrap Method is quite straightforward and easy to understand. First, it generates bootstrap samples by randomly drawing observations from our original sample. After that, it applies a summary statistic such as the variance, standard deviation, mean, and so forth to each bootstrap sample to get replicates. We will use the mean to generate our bootstrap replicates.

To understand the method, let’s apply it to a small sample that contains only 5 heights. We can generate our bootstrap samples as follows. Do not forget that we can choose any observation more than once (resampling with replacement).

Resampling, Image by author

As we can see above, we create 4 bootstrap samples and then calculate their means. We will call these means our bootstrap replicates. Instead of the mean, we could choose the variance, standard deviation, median, or any other statistic.
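The small example can be reproduced in code. The five heights below are made up, since the values from the original figure are not available, and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small sample of 5 heights in centimeters.
small_sample = np.array([158.0, 163.0, 170.0, 175.0, 181.0])

bs_replicates = []
for i in range(4):
    # Resample with replacement: the bootstrap sample has the same size
    # as the original, and any observation may appear more than once.
    bs_sample = rng.choice(small_sample, size=len(small_sample), replace=True)

    # The summary statistic of the bootstrap sample is one replicate.
    bs_replicates.append(bs_sample.mean())
    print(f"bootstrap sample {i + 1}: {bs_sample}, mean = {bs_sample.mean():.1f}")
```

Swapping bs_sample.mean() for np.median(bs_sample) or bs_sample.std(ddof=1) would give replicates of a different statistic.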

Let’s come back to our project. As the next step, we are going to generate a bootstrap sample from our original sample and apply the mean to it to get a bootstrap replicate. We will repeat this process 15,000 times (draws) in a for loop and store the replicates in an array. To do this, we can define a function like the following.
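The original function is not included in this copy of the article, so the following is only a sketch of what such a function might look like. The name draw_bs_replicates, its parameters, and the synthetic heights used to exercise it are all assumptions.

```python
import numpy as np

def draw_bs_replicates(data, func, size, rng=None):
    """Draw `size` bootstrap replicates of the statistic `func` from `data`."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)

    # Array that will hold one replicate per draw.
    bs_replicates = np.empty(size)

    for i in range(size):
        # Bootstrap sample: same length as the data, drawn with replacement.
        bs_sample = rng.choice(data, size=len(data), replace=True)
        # Bootstrap replicate: the summary statistic of that sample.
        bs_replicates[i] = func(bs_sample)

    return bs_replicates

# Synthetic stand-in for the 500 sampled heights (assumed values).
rng = np.random.default_rng(1)
heights = rng.normal(168.0, 10.0, size=500)

bs_replicates = draw_bs_replicates(heights, np.mean, 15_000, rng)

# Compare the mean of the original sample with the mean of the replicates.
print(heights.mean(), bs_replicates.mean())
```

Passing the statistic as func keeps the function reusable: np.median or np.std can be passed in without changing the loop.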

After getting 15,000 replicates by calling the function, we can compare the mean of the original sample with the mean of the bootstrap replicates.

Their means are very close.

So, how do we calculate a 95% confidence interval?

After obtaining the bootstrap replicates, the rest is simple. As we know, our lower and upper limits are the values corresponding to the 2.5th and 97.5th percentiles.
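As a sketch of this last step, the percentiles can be read off with np.percentile. The heights here are again synthetic (a normal sample with assumed mean 168 cm and standard deviation 10 cm), so the printed interval is illustrative and is not the result from the original data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the 500 sampled heights (assumed values).
heights = rng.normal(168.0, 10.0, size=500)

# 15,000 bootstrap samples at once: each row is one resample with
# replacement, and each row mean is one bootstrap replicate.
bs_samples = rng.choice(heights, size=(15_000, heights.size), replace=True)
bs_replicates = bs_samples.mean(axis=1)

# The 2.5th and 97.5th percentiles of the replicates bound the
# 95% confidence interval for the mean.
lower, upper = np.percentile(bs_replicates, [2.5, 97.5])

print(f"sample mean: {heights.mean():.2f} cm")
print(f"95% confidence interval: [{lower:.2f}, {upper:.2f}] cm")
```

For a different confidence level, only the percentiles change: a 99% interval would use [0.5, 99.5].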

Published: 2024-09-20 22:40:14

Link: https://www.17tex.com/xueshu/516458.html
