What is the confidence interval for? Samples and Confidence Intervals

One of the methods for solving statistical problems is calculating a confidence interval. It is used as a preferred alternative to a point estimate when the sample size is small. The calculation itself is rather complicated, but Excel's tools simplify it somewhat. Let's find out how this is done in practice.

This method is used for interval estimation of various statistical quantities. The main task of this calculation is to get rid of the uncertainties of the point estimate.

In Excel, there are two main options for performing calculations using this method: when the variance is known, and when it is unknown. In the first case, the CONFIDENCE.NORM function is used, and in the second, CONFIDENCE.T.

Method 1: the CONFIDENCE.NORM function

The CONFIDENCE.NORM operator, which belongs to the statistical group of functions, first appeared in Excel 2010. Earlier versions of the program use its counterpart, CONFIDENCE. The task of this operator is to calculate the confidence interval for the mean of the general population, assuming a normal distribution.

Its syntax is as follows:

CONFIDENCE.NORM(alpha; standard_dev; size)

"Alpha" is an argument indicating the significance level, from which the confidence level is calculated. The confidence level, in percent, is equal to the following expression:

(1 - "Alpha") * 100

"Standard_dev" is the standard deviation of the population, which the function assumes to be known.

"Size" is an argument defining the sample size.

All arguments to this operator are required.

The CONFIDENCE function has exactly the same arguments and capabilities as the previous one. Its syntax is as follows:

CONFIDENCE(alpha; standard_dev; size)

As you can see, the only difference is the name of the operator. This function is retained in Excel 2010 and newer versions for compatibility reasons, in a special "Compatibility" category. In Excel 2007 and earlier, it is in the main group of statistical operators.

The boundaries of the confidence interval are determined using a formula of the following form:

X ± CONFIDENCE.NORM

where X is the sample mean, which lies in the middle of the interval.

Now let's look at how to calculate the confidence interval using a specific example. Twelve tests were carried out, and their results are listed in the table. This is our sample. The standard deviation is 8. We need to calculate the confidence interval at a 97% confidence level.

Select the cell where the result of data processing will be displayed. Click on the button "Insert function".

The Function Wizard appears. Go to the "Statistical" category and highlight the name "CONFIDENCE.NORM". Then click the "OK" button.

The arguments window opens. Its fields correspond to the names of the arguments.
Set the cursor in the first field, "Alpha". Here we should indicate the significance level. As we remember, our confidence level is 97%. The significance level follows from it in this way:
(100 - confidence level) / 100

That is, substituting the value, we get:
(100 - 97) / 100
By simple calculation, we find that the "Alpha" argument is equal to 0.03. Enter this value in the field.

By condition, the standard deviation is 8. Therefore, simply enter this number in the "Standard_dev" field.

In the "Size" field you need to enter the number of tests performed. As we remember, there are 12 of them. But in order to automate the formula and not edit it every time a new test is carried out, let's set this value not as an ordinary number but with the COUNT operator. So, place the cursor in the "Size" field, and then click the triangle located to the left of the formula bar.

A list of recently used functions appears. If the COUNT operator has been used recently, it should be on this list; in that case, just click its name. Otherwise, go to the "More Functions..." item.

The already familiar Function Wizard appears. Go to the "Statistical" group again, highlight the name "COUNT" and click the "OK" button.

The arguments window for this operator appears. The function counts the cells in the specified range that contain numeric values. Its syntax is as follows:
COUNT(value1; value2; ...)

The "Value" arguments are references to the ranges in which you want to count the cells filled with numeric data. There can be up to 255 such arguments in total, but in our case we need only one.

Place the cursor in the "Value1" field and, holding down the left mouse button, select the range on the sheet that contains our sample. Its address will then be displayed in the field. Click the "OK" button.

After that, the application performs the calculation and displays the result in the cell containing the formula. In our particular case, the formula turned out to be:
=CONFIDENCE.NORM(0.03; 8; COUNT(B2:B13))

The calculation result was 5.011609.
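To see what the function does with its three arguments, here is a minimal Python sketch of the same calculation. It assumes the value is the two-sided normal critical value multiplied by the standard deviation and divided by the square root of the sample size; the name `confidence_norm` is our own illustrative helper, not part of Excel.

```python
from math import sqrt
from statistics import NormalDist


def confidence_norm(alpha, standard_dev, size):
    """Half-width of a normal-theory confidence interval for the mean.

    Illustrative helper (hypothetical name) mirroring the three arguments
    described above: significance level, standard deviation, sample size.
    """
    # Two-sided critical value: alpha is split equally between both tails.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * standard_dev / sqrt(size)


# The example from the text: 97% confidence, standard deviation 8, 12 tests.
margin = confidence_norm(0.03, 8, 12)
print(margin)
```

The printed value should be close to the 5.011609 reported above.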

But that is not all. As we remember, the boundaries of the confidence interval are calculated by adding the result of CONFIDENCE.NORM to the sample mean and subtracting it from the sample mean. This gives the right and left boundaries of the confidence interval, respectively. The sample mean itself can be calculated using the AVERAGE operator.
This operator calculates the arithmetic mean of the selected range of numbers. It has the following rather simple syntax:

AVERAGE(number1; number2; ...)

The "Number" argument can be either a single numeric value or a reference to cells or even entire ranges that contain them.

So, select the cell in which the calculation of the average value will be displayed, and click on the button "Insert function".

The Function Wizard opens. Go to the "Statistical" category again and select the name "AVERAGE" from the list. As always, click the "OK" button.

The arguments window opens. Place the cursor in the "Number1" field and, holding down the left mouse button, select the entire range of values. After the coordinates are displayed in the field, click the "OK" button.

After that, AVERAGE displays the result of the calculation in the cell.

Now we calculate the right boundary of the confidence interval. To do this, select a separate cell, type the "=" sign and add the contents of the cells that hold the results of the AVERAGE and CONFIDENCE.NORM functions. To perform the calculation, press Enter. In our case, the result is as follows:
Calculation result: 6.953276

In the same way, we calculate the left boundary of the confidence interval, only this time we subtract the result of CONFIDENCE.NORM from the result of AVERAGE. For our example, the result is as follows:
Calculation result: -3.06994

We have tried to describe all the steps for calculating the confidence interval in detail, so we went through each formula separately. But all the actions can be combined in one formula. The calculation of the right boundary of the confidence interval can be written as follows:
=AVERAGE(B2:B13)+CONFIDENCE.NORM(0.03; 8; COUNT(B2:B13))

A similar calculation of the left boundary looks like this:
=AVERAGE(B2:B13)-CONFIDENCE.NORM(0.03; 8; COUNT(B2:B13))

Method 2: the CONFIDENCE.T function

In addition, Excel has one more function related to the calculation of the confidence interval: CONFIDENCE.T. It has been available only since Excel 2010. This operator calculates the confidence interval for the population mean using Student's t-distribution. It is very convenient when the variance and, accordingly, the standard deviation are unknown. The syntax of the operator is as follows:

CONFIDENCE.T(alpha; standard_dev; size)

As you can see, the names of the arguments remain unchanged in this case.

Let's see how to calculate the boundaries of the confidence interval with an unknown standard deviation, using the same sample that we considered in the previous method. The confidence level, like last time, is 97%.

Select the cell in which the calculation will be made. Click on the button "Insert function".

In the Function Wizard that opens, go to the "Statistical" category, select the name "CONFIDENCE.T" and click the "OK" button.

The arguments window for the specified operator opens.
In the "Alpha" field, given that the confidence level is 97%, we enter the number 0.03. We will not dwell on the principles of calculating this parameter a second time.

After that, place the cursor in the "Standard_dev" field. This time the value is unknown to us and has to be calculated. This is done using a special function, STDEV.S. To open the window of this operator, click the triangle to the left of the formula bar. If the desired name is not in the list that opens, go to the "More Functions..." item.

The Function Wizard starts. Go to the "Statistical" category and select the name "STDEV.S" in it. Then click the "OK" button.

The arguments window opens. The task of the STDEV.S operator is to determine the standard deviation of a sample. Its syntax looks like this:
STDEV.S(number1; number2; ...)

It is not hard to guess that the "Number" argument is the address of a sample item. If the sample is placed in a single array, you can use just one argument and give a reference to this range.

Place the cursor in the "Number1" field and, as always, holding down the left mouse button, select the sample. After the coordinates appear in the field, do not rush to press the "OK" button, since the result would be incorrect. First, we need to return to the arguments window of the CONFIDENCE.T operator to fill in the last argument. To do this, click the corresponding name in the formula bar.

The arguments window of the already familiar function opens again. Place the cursor in the "Size" field. Again, click the familiar triangle to go to the choice of operators. As you have understood, we need the name "COUNT". Since we used this function in the previous method, it is present in this list, so just click it. If you do not find it, proceed according to the algorithm described in the first method.

Once in the COUNT arguments window, place the cursor in the "Value1" field and, holding down the mouse button, select the sample. Then click the "OK" button.

After that, the program calculates and displays the half-width of the confidence interval.

To determine the boundaries, we again need to calculate the sample mean. But since the calculation with the AVERAGE formula is the same as in the previous method, and even the result has not changed, we will not dwell on it in detail a second time.

Adding the results of AVERAGE and CONFIDENCE.T, we get the right boundary of the confidence interval.

Subtracting the result of CONFIDENCE.T from the result of AVERAGE, we get the left boundary of the confidence interval.

If the calculation is written in one formula, the right boundary in our case looks like this:
=AVERAGE(B2:B13)+CONFIDENCE.T(0.03; STDEV.S(B2:B13); COUNT(B2:B13))

Accordingly, the formula for the left boundary looks like this:
=AVERAGE(B2:B13)-CONFIDENCE.T(0.03; STDEV.S(B2:B13); COUNT(B2:B13))
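For reference, the quantity these two formulas add and subtract can be written out explicitly. The sketch below reflects our understanding of the function rather than a quotation from the Excel documentation; the symbols are our own notation: $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $t_{1-\alpha/2,\,n-1}$ the two-sided critical value of Student's distribution with $n-1$ degrees of freedom.

```latex
\text{margin} = t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}},
\qquad
\text{boundaries} = \bar{x} \pm t_{1-\alpha/2,\; n-1} \cdot \frac{s}{\sqrt{n}}
```

In the example, $\alpha = 0.03$ and $n = 12$, so the critical value is taken with 11 degrees of freedom.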

As you can see, Excel tools make it much easier to calculate the confidence interval and its boundaries. For these purposes, separate operators are used for samples for which the variance is known and unknown.

Sample statistics are all estimates of their theoretical counterparts, which could be obtained if the whole general population were available rather than a sample. But alas, studying the whole population is very expensive and often impossible.

Understanding interval estimation

Any sample estimate has some scatter, since it is a random variable that depends on the values in a particular sample. Therefore, for more reliable statistical conclusions, one should know not only the point estimate but also an interval that covers the estimated parameter θ (theta) with a high probability γ (gamma).

Formally, these are two values (statistics) $T_1(X)$ and $T_2(X)$, with $T_1 < T_2$, for which at a given probability level γ the following condition is met:

$$P\bigl(T_1(X) < \theta < T_2(X)\bigr) \ge \gamma$$

In short, with probability γ or more, the true value lies between the points $T_1(X)$ and $T_2(X)$, which are called the lower and upper bounds of the confidence interval.

One of the requirements for a confidence interval is that it be as narrow as possible. This desire is quite natural, because the researcher tries to localize the desired parameter as accurately as possible.

It follows that the confidence interval should cover the region of highest probability of the distribution, with the estimate itself at the center.

That is, the probability of deviation (of the true indicator from the assessment) upward is equal to the probability of deviation downward. It should also be noted that for asymmetric distributions, the interval on the right is not equal to the interval on the left.

The figure above clearly shows that the greater the confidence level, the wider the interval - a direct relationship.

This was a small introduction to the theory of interval estimation of unknown parameters. Let's move on to finding the confidence bounds for the mathematical expectation.

Confidence interval for expected value

If the original data are normally distributed, then the mean will also be normally distributed. This follows from the rule that a linear combination of normal values also has a normal distribution. Therefore, to calculate the probabilities, we could use the mathematical apparatus of the normal distribution law.

However, this requires knowing two parameters, the expectation and the variance, which are usually not known. You can, of course, use estimates instead of the parameters (the arithmetic mean and the sample variance), but then the distribution of the mean will not be exactly normal: it will be slightly flatter, with heavier tails. This fact was cleverly noted by William Gosset, who worked in Ireland and published his discovery in the March 1908 issue of Biometrika. For reasons of secrecy, Gosset signed himself as Student. This is how Student's t-distribution appeared.

However, normally distributed data, the model C. F. Gauss used when analyzing the errors of astronomical observations, are extremely rare in earthly life, and normality is rather difficult to establish (high precision requires about 2 thousand observations). Therefore, it is best to discard the normality assumption and use methods that do not depend on the distribution of the original data.

The question arises: what is the distribution of the arithmetic mean if it is calculated from data with an unknown distribution? The answer is given by the central limit theorem (CLT), well known in probability theory. In mathematics there are several variants of it (the formulations were refined over the years), but all of them, roughly speaking, boil down to the statement that the sum of a large number of independent random variables obeys the normal distribution law.

Calculating the arithmetic mean involves a sum of random variables. Hence the arithmetic mean has an approximately normal distribution, whose mean is the mean of the original data and whose variance is the variance of the original data divided by the sample size, $\sigma^2/n$.

Smart people know how to prove the CLT, but we will be convinced of this with the help of an experiment conducted in Excel. Let's simulate a sample of 50 uniformly distributed random variables (using the Excel function RANDBETWEEN). Then we will make 1000 such samples and calculate the arithmetic mean for each. Let's look at their distribution.

It is seen that the distribution of the mean is close to the normal law. If the volume of samples and their number are made even larger, then the similarity will be even better.
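The same experiment can be repeated outside Excel. The following is a minimal sketch in Python, assuming continuous uniform values on [0, 1) instead of the integers that RANDBETWEEN produces; the seed and variable names are our own choices. It compares the observed spread of the 1000 sample means with the value the CLT predicts, and checks what share of the means falls within 1.96 standard errors of the true mean.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

random.seed(1)  # fixed seed so that the run is repeatable

SAMPLE_SIZE = 50   # values per sample, as in the text
N_SAMPLES = 1000   # number of samples, as in the text

# Arithmetic mean of each sample of uniform [0, 1) values.
sample_means = [
    mean(random.random() for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

sigma = sqrt(1 / 12)                       # standard deviation of a uniform [0, 1) variable
predicted_se = sigma / sqrt(SAMPLE_SIZE)   # spread of the means predicted by the CLT
observed_se = stdev(sample_means)          # spread actually observed

# If the means are roughly normal, about 95% of them lie within 1.96 SE of 0.5.
z = NormalDist().inv_cdf(0.975)
share_inside = sum(abs(m - 0.5) <= z * predicted_se for m in sample_means) / N_SAMPLES

print(f"predicted SE: {predicted_se:.4f}, observed SE: {observed_se:.4f}")
print(f"share of means within 1.96 SE: {share_inside:.3f}")
```

The two standard errors should come out close to each other, and the share should be near 0.95.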

Now that we have seen for ourselves that the CLT holds, we can use the normal distribution to calculate confidence intervals for the arithmetic mean, which cover the true mean, or mathematical expectation, with a given probability.

To establish the upper and lower bounds, you need to know the parameters of the normal distribution. As a rule, they are not known, so estimates are used: the arithmetic mean and the sample variance. Again, this method gives a good approximation only for large samples. When the samples are small, it is often recommended to use Student's t-distribution. Do not believe it! The Student's distribution for the mean arises only when the original data have a normal distribution, that is, almost never. Therefore, it is better to set a minimum bar for the amount of required data right away and use asymptotically correct methods. They say that 30 observations are enough. Take 50 and you can't go wrong.

$$T_{1,2} = \bar{x} \mp c_\gamma \frac{s_0}{\sqrt{n}}$$

where

$T_{1,2}$ are the lower and upper limits of the confidence interval;

$\bar{x}$ is the sample arithmetic mean;

$s_0$ is the sample standard deviation (unbiased);

$n$ is the sample size;

$\gamma$ is the confidence level (usually 0.9, 0.95 or 0.99);

$c_\gamma = \Phi^{-1}((1 + \gamma)/2)$ is the value of the inverse standard normal distribution function. In simple terms, this is the number of standard errors from the arithmetic mean to the lower or upper bound (the three probabilities indicated correspond to the values 1.64, 1.96 and 2.58).

The essence of the formula is that the arithmetic mean is taken and then a certain number ($c_\gamma$) of standard errors ($s_0/\sqrt{n}$) is laid off from it on either side. Everything is known: take it and calculate.
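As a quick check of the multipliers quoted above, here is a small Python sketch. The function names `c_gamma` and `half_width` are our own illustrative helpers, and the summary figures in the last lines are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def c_gamma(gamma):
    """Number of standard errors for confidence level gamma (two-sided)."""
    return NormalDist().inv_cdf((1 + gamma) / 2)


def half_width(s0, n, gamma):
    """c_gamma standard errors: the amount laid off on each side of the mean."""
    return c_gamma(gamma) * s0 / sqrt(n)


# Multipliers for the three usual confidence levels.
multipliers = {gamma: round(c_gamma(gamma), 2) for gamma in (0.90, 0.95, 0.99)}
print(multipliers)

# Hypothetical summary data: unbiased standard deviation 12, sample of 50.
print(half_width(12, 50, 0.95))
```

The rounded multipliers reproduce the 1.64, 1.96 and 2.58 mentioned in the text.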

Before personal computers became widespread, special statistical tables were used to obtain values of the normal distribution function and its inverse. They are still used, but it is more efficient to turn to ready-made Excel formulas. All the elements of the formula above (the mean, the standard deviation and the normal quantile) can be easily calculated in Excel. But there is also a ready-made function for calculating the confidence interval: CONFIDENCE.NORM. Its syntax is as follows.

CONFIDENCE.NORM(alpha; standard_dev; size)

alpha is the significance level, which in the above notation is equal to 1 - γ, i.e. the probability that the mathematical expectation will be outside the confidence interval. At a confidence level of 0.95, alpha is 0.05, and so on.

standard_dev is the standard deviation of the sample data. You don't need to calculate the standard error; Excel will divide by the square root of n itself.

size is the sample size (n).

The result of the CONFIDENCE.NORM function is the second term from the formula for calculating the confidence interval, i.e. half-interval. Accordingly, the lower and upper points are the mean ± the obtained value.

Thus, it is possible to build a universal algorithm for calculating confidence intervals for the arithmetic mean that does not depend on the distribution of the initial data. The price of this universality is that the method is asymptotic, i.e. it requires relatively large samples. However, in the age of modern technology, collecting the required amount of data is usually not difficult.

Testing Statistical Hypotheses Using Confidence Intervals

One of the main tasks solved in statistics is hypothesis testing. Its essence is briefly as follows. It is assumed, for example, that the expected value of the general population is equal to some value. Then the distribution of sample means that could be observed under this expectation is constructed. Next, one looks at where the actual mean falls in this conditional distribution. If it goes beyond the permissible limits, then the appearance of such a mean is very unlikely, and in a single run of the experiment almost impossible, which contradicts the hypothesis, so the hypothesis is rejected. If the mean does not go beyond the critical level, the hypothesis is not rejected (but not proven either!).

So, using confidence intervals, in our case for the expectation, you can also test some hypotheses. It's very easy to do. Suppose the arithmetic mean of a certain sample is equal to 100. The hypothesis being tested is that the expectation is equal to, say, 90. Put primitively, the question is: can it be that, with a true expectation of 90, the observed mean turned out to be 100?

To answer this question, you additionally need the standard deviation and the sample size. Let's say the standard deviation is 30 and the number of observations is 64 (so that the root is easy to extract). Then the standard error of the mean is 30/8, or 3.75. To calculate the 95% confidence interval, you need to lay off two standard errors (more precisely, 1.96) on either side of the mean. The confidence interval is approximately 100 ± 7.5, or from 92.5 to 107.5.

The reasoning then goes as follows. If the tested value falls within the confidence interval, it does not contradict the data, since it fits within the limits of random fluctuations (with a probability of 95%). If the tested value lies outside the confidence interval, the probability of such an event is very small, at least below the acceptable level. Hence the hypothesis is rejected as contradicting the observed data. In our case, the hypothesized expectation lies outside the confidence interval (the tested value of 90 is not within 100 ± 7.5), so the hypothesis should be rejected. Answering the primitive question above, one should say: no, it cannot; in any case, this happens extremely rarely. Often the specific probability of erroneously rejecting the hypothesis (the p-value) is reported rather than the preset level at which the confidence interval was built, but more on that another time.
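The arithmetic of this example is easy to reproduce. Below is a minimal Python sketch using the figures from the text (mean 100, standard deviation 30, 64 observations, tested value 90); the variable names are our own, and the exact multiplier 1.96 is used instead of the rounded 2, so the margin comes out slightly below 7.5.

```python
from math import sqrt
from statistics import NormalDist

sample_mean = 100.0    # observed arithmetic mean
sample_sd = 30.0       # standard deviation
n = 64                 # number of observations
hypothesized = 90.0    # expectation under the tested hypothesis

standard_error = sample_sd / sqrt(n)     # 30 / 8
z = NormalDist().inv_cdf(0.975)          # about 1.96 for a 95% interval
margin = z * standard_error

lower, upper = sample_mean - margin, sample_mean + margin

# The hypothesis is rejected when the tested value lies outside the interval.
reject = not (lower <= hypothesized <= upper)

print(f"standard error: {standard_error}")
print(f"95% interval: from {lower:.2f} to {upper:.2f}")
print(f"reject the hypothesis that the expectation is {hypothesized}: {reject}")
```

With these numbers the tested value of 90 falls below the lower bound, so the sketch reaches the same conclusion as the text.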

As you can see, it is not difficult to construct a confidence interval for the mean (or mathematical expectation). The main thing is to grasp the essence, and then things will go. In practice, in most cases, a 95% confidence interval is used, which extends approximately two standard errors on either side of the mean.

That's all for now. All the best!

Instructions

Note that an interval (l1, l2) centered on the estimate l*, which contains the true value of the parameter with probability alpha, is called a confidence interval, and alpha is the corresponding confidence probability. The value l* itself is a point estimate. For example, suppose that from the sample values (x1, x2, ..., xn) of a random variable X we need to estimate an unknown parameter l on which the distribution depends. Obtaining the estimate l* means assigning a certain parameter value to each sample, that is, constructing a function Q* of the observation results whose value is taken as the estimate of the parameter: l* = Q*(x1, x2, ..., xn).

Note that any function of the observations is called a statistic. Moreover, if it fully describes the parameter (phenomenon) under consideration, it is called a sufficient statistic. And because the observation results are random, l* is also a random variable. The statistic should be chosen taking into account criteria of its quality. Here it must be kept in mind that the distribution law of the estimate is fully determined once the probability density W(x, l) is known.

The confidence interval can be calculated fairly simply if the distribution law of the estimate is known. Consider, for example, the confidence interval for the estimate of the mathematical expectation (the mean value of a random variable): mx* = (1/n) * (x1 + x2 + … + xn). This estimate is unbiased, that is, its mathematical expectation equals the true value of the parameter (M(mx*) = mx).

It can be shown that the variance of the estimate of the mathematical expectation is σx*^2 = Dx / n. Based on the central limit theorem, we can conclude that the distribution law of this estimate is Gaussian (normal). Therefore, for calculations you can use the probability integral Ф(z). Choose the length of the confidence interval to be 2ld; then alpha = P(mx - ld < mx* < mx + ld) = 2Ф(ld / sqrt(Dx / n)) - 1 (using the property of the probability integral Ф(-z) = 1 - Ф(z)).

To construct the confidence interval for the estimate of the mathematical expectation:

1. find the value of (alpha + 1) / 2;

2. select from the probability integral table the value of ld / sqrt(Dx / n) that corresponds to it;

3. take the estimate of the true variance: Dx* = (1/n) * ((x1 - mx*)^2 + (x2 - mx*)^2 + ... + (xn - mx*)^2);

4. determine ld;

5. find the confidence interval according to the formula: (mx* - ld, mx* + ld).
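These steps can be sketched in Python as follows. This is an illustration under stated assumptions: the inverse normal function from the standard library stands in for the probability integral table, alpha here is the confidence probability (as in this section), and the observations are hypothetical. The function name `expectation_interval` is our own.

```python
from math import sqrt
from statistics import NormalDist


def expectation_interval(xs, alpha):
    """Confidence interval (mx* - ld, mx* + ld) built by the steps above."""
    n = len(xs)
    mx = sum(xs) / n                                # point estimate mx*
    # Table lookup replaced by the inverse of the normal distribution
    # function: the z for which F(z) = (alpha + 1) / 2.
    z = NormalDist().inv_cdf((alpha + 1) / 2)
    dx = sum((x - mx) ** 2 for x in xs) / n         # variance estimate Dx* (divides by n)
    ld = z * sqrt(dx / n)                           # half-length of the interval
    return mx - ld, mx + ld


# Hypothetical observations, for illustration only.
observations = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
low, high = expectation_interval(observations, 0.95)
print(low, high)
```

Note that the variance estimate divides by n, exactly as in the formula above, so for small samples the interval is slightly narrower than one based on the unbiased estimate.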

Often an appraiser has to analyze the real estate market of the segment in which the appraised property is located. If the market is developed, it can be difficult to analyze the entire set of listed properties, so a sample is used for the analysis. This sample is not always homogeneous; sometimes it has to be cleared of extremes, that is, offers that are too high or too low. The confidence interval is used for this purpose. The purpose of this study is to compare two methods of calculating the confidence interval and to choose the better option for working with different samples in the estimatica.pro system.

A confidence interval is an interval of values of a characteristic, calculated from a sample, which contains the estimated parameter of the general population with a known probability.

The meaning of calculating the confidence interval is to construct, based on the sample data, such an interval so that it can be asserted with a given probability that the value of the estimated parameter is in this interval. In other words, the confidence interval contains, with a certain probability, the unknown value of the estimated value. The wider the interval, the higher the inaccuracy.

There are different methods for determining the confidence interval. In this article, we will consider 2 ways:

through the median and standard deviation;
through the critical value of t-statistics (Student's coefficient).

Stages of the comparative analysis of the different CI calculation methods:

1. we form a sample of data;

2. we process it by statistical methods: calculate the mean, median, variance, etc .;

3. we calculate the confidence interval in two ways;

4. we analyze the cleaned samples and the obtained confidence intervals.

Stage 1. Data sampling

The sample was formed using the estimatica.pro system. It included 91 offers for the sale of one-room apartments in the 3rd price zone with the "Khrushchev" layout type.

Table 1. Initial sample

	Price per 1 sq. m, CU

Fig. 1. Initial sample

Stage 2. Processing of the original sample

Processing the sample by statistical methods requires calculating the following values:

1. Arithmetic mean

2. Median - a number characterizing the sample such that half of the sample is greater than it and the other half is less (for a sample with an odd number of values, it is the middle value of the ordered sample)

3. Range - the difference between the maximum and minimum values in the sample

4. Variance - used for a more accurate estimate of the variation in the data

5. Sample standard deviation (hereinafter SD) - the most common indicator of the dispersion of values around the arithmetic mean

6. Coefficient of variation - reflects the degree of dispersion of the values

7. Coefficient of oscillation - reflects the relative fluctuation of the extreme price values in the sample around the mean

Table 2. Statistical indicators of the original sample

The coefficient of variation, which characterizes the homogeneity of the data, is 12.29%, but the coefficient of oscillation is too large. Thus, we can conclude that the original sample is not homogeneous, so let's move on to calculating the confidence interval.

Stage 3. Calculation of the confidence interval

Method 1. Calculation through the median and standard deviation.

The confidence interval is determined as follows: the lower bound is the median minus the standard deviation; the upper bound is the median plus the standard deviation.

Thus, the confidence interval is (CU 47179; CU 60689).

Fig. 2. Values that fall within confidence interval 1.
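The cleaning rule of Method 1 can be sketched in Python like this. The prices are hypothetical, the function name `trim_by_median_sd` is our own, and the standard deviation is assumed to be the sample estimate (dividing by n - 1).

```python
from statistics import median, stdev


def trim_by_median_sd(values):
    """Keep only the values inside [median - SD, median + SD]."""
    med = median(values)
    sd = stdev(values)  # assumed: sample standard deviation (n - 1)
    lower, upper = med - sd, med + sd
    kept = [v for v in values if lower <= v <= upper]
    return lower, upper, kept


# Hypothetical prices per sq. m (CU), for illustration only.
prices = [48000, 51500, 53200, 54000, 55100, 56800, 58300, 61000, 72500]
lower, upper, kept = trim_by_median_sd(prices)

print(f"interval: ({lower:.0f}; {upper:.0f})")
print(f"kept {len(kept)} of {len(prices)} offers")
```

Because the interval is centered on the median rather than the mean, an extreme offer does not shift the center of the interval and is itself removed.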

Method 2. Construction of the confidence interval through the critical value of the t-statistics (Student's coefficient)

S.V. Gribovsky, in his book "Mathematical Methods for Assessing the Value of Property", describes a method for calculating the confidence interval through the Student's coefficient. When calculating by this method, the appraiser must set the significance level α, which determines the probability with which the confidence interval will be constructed. Significance levels of 0.1, 0.05 and 0.01 are commonly used; they correspond to confidence probabilities of 0.9, 0.95 and 0.99. With this method, the true values of the mathematical expectation and variance are assumed to be unknown (which is almost always the case when solving practical appraisal problems).

Confidence Interval Formula:

n is the sample size;

the critical value of the t-statistic (Student's distribution) at significance level α with n - 1 degrees of freedom, which is determined from special statistical tables or using MS Excel ("Statistical" category → TINV);

α is the significance level; we take α = 0.01.

Fig. 3. Values that fall within confidence interval 2.

Stage 4. Analysis of different methods of calculating the confidence interval

The two methods of calculating the confidence interval, through the median and through the Student's coefficient, led to different interval values. Accordingly, we obtained two different cleaned samples.

Table 3. Statistical indicators for three samples.

Indicator	Initial sample	Option 1	Option 2
Mean


Variance

Coefficient of variation
Coefficient of oscillation
Number of excluded objects, pcs.

Based on the calculations performed, we can say that the confidence intervals obtained by the different methods overlap, so either calculation method can be used at the appraiser's discretion.

However, we believe that when working in the estimatica.pro system, it is advisable to choose a method for calculating the confidence interval depending on the degree of market development:

if the market is undeveloped, apply the calculation through the median and standard deviation, since the number of excluded objects in this case is small;
if the market is developed, apply the calculation through the critical value of the t-statistic (Student's coefficient), since it is possible to form a large initial sample.

In preparing the article, the following were used:

1. Gribovsky S.V., Sivets S.A., Levykina I.A. Mathematical methods for assessing the value of property. Moscow, 2014

2. Data of the estimatica.pro system

Estimating Confidence Intervals

Learning objectives

Statistics considers the following two main tasks:

We have some estimate based on sample data, and we want to make some probabilistic statement about where the true value of the parameter being estimated is.

We have a specific hypothesis that needs to be tested based on sample data.

In this topic, we consider the first task. We also introduce the definition of the confidence interval.

The confidence interval is an interval that is built around the estimated parameter value and shows where the true value of the estimated parameter is located with a priori given probability.

Having studied the material on this topic, you:

find out what the confidence interval of the estimate is;

learn to classify statistical tasks;

master the technique of constructing confidence intervals, both according to statistical formulas and using software tools;

learn to determine the sample sizes required to achieve a given accuracy of statistical estimates.

Distributions of sample characteristics

T-distribution

As discussed above, the distribution of the random variable $Z = (\bar{X} - \mu) / (\sigma / \sqrt{n})$ is close to the standardized normal distribution with parameters 0 and 1. Since we do not know the value of σ, we replace it with an estimate s. The resulting quantity $t = (\bar{X} - \mu) / (s / \sqrt{n})$ has a different distribution, namely Student's t-distribution, which is determined by the parameter n - 1 (the number of degrees of freedom). This distribution is close to the normal distribution (the larger n is, the closer the two distributions are).

Fig. 95 shows Student's distribution with 30 degrees of freedom. As you can see, it is very close to the normal distribution.

By analogy with the functions NORMDIST and NORMINV for the normal distribution, there are functions for working with the t-distribution: TDIST and TINV. An example of using these functions can be found in the TDIST.XLS file (template and solution) and in Fig. 96.
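The closeness of the two distributions can be checked numerically. The following is a minimal sketch in Python rather than Excel, using only the standard library; the helper names t_pdf and max_density_gap are illustrative and do not come from the TDIST.XLS workbook. It compares the density of the t-distribution with the standard normal density on a grid of points.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)

std_normal = NormalDist(0, 1)

def max_density_gap(df):
    """Largest |t density - normal density| over the grid -4.0, -3.9, ..., 4.0."""
    points = [i / 10 for i in range(-40, 41)]
    return max(abs(t_pdf(x, df) - std_normal.pdf(x)) for x in points)

# The gap shrinks as the number of degrees of freedom grows.
for df in (3, 10, 30, 100):
    print(df, round(max_density_gap(df), 4))
```

For 30 degrees of freedom the largest gap on this grid is well below 0.01, which agrees with the visual impression described above.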

Distributions of other characteristics

As we already know, to determine the accuracy of the estimate of the mathematical expectation, we need the t-distribution. To estimate other parameters, such as the variance, different distributions are required. Two of them are the F-distribution and the $\chi^2$-distribution.

Confidence interval for mean

A confidence interval is an interval that is built around the estimated parameter value and shows where the true value of the parameter lies with a probability specified in advance.

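The confidence interval for the mean is constructed as follows: $\bar{X} \pm t_{cr} \, s / \sqrt{n}$, where $\bar{X}$ is the sample mean, s is the sample estimate of the standard deviation, n is the sample size, and $t_{cr}$ is the critical value of the t-distribution with n - 1 degrees of freedom for the chosen confidence level.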

Example

The fast food restaurant is planning to expand its assortment with a new type of sandwich. In order to assess the demand for it, the manager plans to randomly select 40 visitors from those who have already tried it and ask them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected number of points that the new product will receive and build a 95% confidence interval for this estimate. How can this be done? (See the file SANDWICH1.XLS (template and solution).)

Solution

To solve this problem, you can use StatPro / Statistical Inference / One-Sample Analysis. The results are shown in Fig. 97.
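The same calculation can be sketched outside Excel. The Python code below is a minimal illustration under stated assumptions: the 40 ratings are invented (the data in SANDWICH1.XLS are not reproduced here), and the helper t_critical is a hypothetical stand-in for Excel's TINV that finds the critical value numerically with the standard library only.

```python
import math
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), by Simpson's rule on [0, |x|] plus the symmetry of the density."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(confidence, df):
    """Two-sided critical value, found by bisection.
    Assumes the answer lies in [0, 100], which holds for df >= 2 at usual levels."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def mean_confidence_interval(sample, confidence=0.95):
    """Interval: sample mean +/- t_cr * s / sqrt(n), with n - 1 degrees of freedom."""
    n = len(sample)
    center = mean(sample)
    half = t_critical(confidence, n - 1) * stdev(sample) / math.sqrt(n)
    return center - half, center + half

# Hypothetical ratings from 40 visitors (invented for illustration).
scores = [7, 5, 8, 6, 9, 4, 7, 8, 6, 7] * 4
lower, upper = mean_confidence_interval(scores)
print(round(lower, 3), round(upper, 3))
```

With real data, only the scores list needs to change; the interval is always centered on the sample mean.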

Confidence interval for cumulative value

Sometimes, based on sample data, it is required to estimate not the mathematical expectation but the total sum of values. For example, an auditor may be interested not in the average value of an account but in the sum of all accounts.

Let N be the total number of elements, n the sample size, $T_s$ the sum of the values in the sample, and $T'$ the estimate of the sum over the entire population. Then $T' = N \cdot T_s / n = N \bar{X}$, and the confidence interval is calculated by the formula $T' \pm t_{cr} \cdot N s / \sqrt{n}$, where s is the sample estimate of the standard deviation and $\bar{X}$ is the sample estimate of the mean.

Example

Let's say a tax office wants to estimate the total tax refunds for 10,000 taxpayers. Each taxpayer either receives a refund or pays additional taxes. Find the 95% confidence interval for the refund amount, assuming the sample size is 500 people (see the file RETURNS SUM.XLS (template and solution)).

Solution

There is no special procedure in StatPro for this case; however, note that the bounds can be obtained from the bounds for the mean using the above formulas (Fig. 98).
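As a rough sketch of that scaling, the Python code below multiplies the estimate and the margin for the mean by N. The sample mean and standard deviation are invented, since the workbook data are not reproduced here. The normal critical value is used in place of $t_{cr}$, which is a close approximation for n = 500, and no finite-population correction is applied; both are assumptions of this sketch.

```python
import math
from statistics import NormalDist

def total_confidence_interval(N, n, sample_mean, sample_sd, confidence=0.95):
    """Interval for the population total: N * mean +/- z_cr * N * s / sqrt(n)."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    total = N * sample_mean
    half = z_cr * N * sample_sd / math.sqrt(n)
    return total - half, total, total + half

# Hypothetical sample statistics: average refund 300, standard deviation 900.
lower, estimate, upper = total_confidence_interval(10_000, 500, 300.0, 900.0)
print(round(lower), round(estimate), round(upper))
```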

Confidence interval for proportion

Let p be the expected share of customers and $\hat{p}$ the estimate of this share obtained from a sample of size n. It can be shown that for sufficiently large n the distribution of the estimate is close to normal with mean p and standard deviation $\sqrt{p(1-p)/n}$. The standard error of the estimate in this case is expressed as $\sqrt{\hat{p}(1-\hat{p})/n}$, and the confidence interval as $\hat{p} \pm z_{cr} \sqrt{\hat{p}(1-\hat{p})/n}$.

Example

The fast food restaurant is planning to expand its assortment with a new type of sandwich. In order to estimate the demand for it, the manager randomly selected 40 visitors from those who have already tried it and asked them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected share of customers who rate the new product at 6 points or more (he expects these customers to be the consumers of the new product).

Solution

First, we create a new column that contains 1 if the client's score was 6 points or more and 0 otherwise (see the file SANDWICH2.XLS (template and solution)).

Method 1

By counting the number of 1s, we estimate the share, and then we use the formulas.

The value $z_{cr}$ is taken from special tables of the normal distribution (for example, 1.96 for the 95% confidence interval).

Using this approach and the specific data to construct a 95% interval, we obtain the following results (Fig. 99). The critical value $z_{cr}$ is 1.96. The standard error of the estimate is 0.077. The lower limit of the confidence interval is 0.475. The upper limit of the confidence interval is 0.775. Thus, the manager can assume with 95% confidence that the percentage of customers who rate the new product at 6 points or higher will be between 47.5 and 77.5.
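Method 1 can be sketched in a few lines of Python. The count of 25 favourable ratings out of 40 is an assumption: it is not stated in the text but is the count that reproduces the reported figures (standard error 0.077, limits 0.475 and 0.775). The function name is illustrative.

```python
import math
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Interval: p_hat +/- z_cr * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_cr * se, p_hat + z_cr * se, se

# Assumed count: 25 of the 40 visitors gave 6 points or more.
lower, upper, se = proportion_confidence_interval(25, 40)
print(round(se, 3), round(lower, 3), round(upper, 3))
```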

Method 2

This task can be solved using standard StatPro tools. To do this, it is enough to note that the share in this case coincides with the average value of the Type column. Next, let's apply StatPro / Statistical Inference / One-Sample Analysis to build the confidence interval of the mean (estimate of the expected value) for the Type column. The result obtained in this case will be very close to the result of the 1st method (Fig. 99).

Confidence interval for standard deviation

As an estimate of the standard deviation, s is used (the formula is given in Section 1). The quantity $(n-1)s^2/\sigma^2$ has a chi-square distribution, which, like the t-distribution, has n - 1 degrees of freedom. The functions CHIDIST and CHIINV are provided for working with this distribution.

The confidence interval in this case is no longer symmetric. A schematic of its boundaries is shown in Fig. 100.
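For reference, the usual textbook form of this interval is sketched below. It is not reproduced from the source; it assumes normally distributed data and uses $\chi^2_{q,\,n-1}$ to denote the value with right-tail probability q, which is the convention of CHIINV(q, n-1).

```latex
\sqrt{\frac{(n-1)\,s^2}{\chi^2_{\alpha/2,\;n-1}}}
\;\le\; \sigma \;\le\;
\sqrt{\frac{(n-1)\,s^2}{\chi^2_{1-\alpha/2,\;n-1}}}
```

Because the chi-square distribution is skewed, the two bounds lie at different distances from s, which is why the interval is asymmetric.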

Example

The machine must produce parts with a diameter of 10 cm. However, due to various circumstances, errors occur. The quality inspector is concerned about two things: first, the average should be 10 cm; second, even if it is, large deviations mean that many parts will be rejected. Every day he takes a sample of 50 parts (see the file QUALITY CONTROL.XLS (template and solution)). What conclusions can be drawn from such a sample?

Solution

Construct the 95% confidence intervals for the mean and standard deviation using StatPro / Statistical Inference / One-Sample Analysis (Fig. 101).

Next, assuming that the diameters are normally distributed, we calculate the proportion of defective products, setting a maximum deviation of 0.065. Using a data table (the two-parameter case), we construct the dependence of the share of rejects on the mean and standard deviation (Fig. 102).
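The two-parameter table can be imitated with a short Python sketch. The grid of means and standard deviations below is invented for illustration and does not come from the workbook; the function name defect_share is likewise an assumption.

```python
from statistics import NormalDist

def defect_share(mean, sd, target=10.0, tolerance=0.065):
    """Share of parts whose diameter falls outside target +/- tolerance,
    assuming diameters are normally distributed with the given mean and sd."""
    diameters = NormalDist(mean, sd)
    below = diameters.cdf(target - tolerance)
    above = 1 - diameters.cdf(target + tolerance)
    return below + above

# Hypothetical grid: one row per standard deviation, one column per mean.
means = [9.99, 10.00, 10.01]
sds = [0.02, 0.03, 0.04]
for sd in sds:
    print(sd, [round(defect_share(m, sd), 4) for m in means])
```

The share of rejects grows both when the standard deviation increases and when the mean drifts away from 10 cm, which is why the inspector needs intervals for both parameters.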

Confidence interval for the difference between two means

This is one of the most important applications of statistical methods. Examples of situations.

A clothing store manager would like to know how much more or less an average female shopper spends in a store than a man.

The two airlines fly similar routes. The consumer organization would like to compare the difference between the average expected flight delays for both airlines.

The company sends coupons for certain types of goods in one city and does not send in another. Managers want to compare the average purchase volumes of these items over the next two months.

The car dealer often deals with married couples at presentations. Couples are often interviewed separately to understand their personal reactions to a presentation. The manager wants to assess the difference in ratings reported by men and women.

Independent Samples Case

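The standardized difference between the sample means has a t-distribution with $n_1 + n_2 - 2$ degrees of freedom. The confidence interval for $\mu_1 - \mu_2$ is expressed by the relation:

$(\bar{X}_1 - \bar{X}_2) \pm t_{cr} \, s_p \sqrt{1/n_1 + 1/n_2}$, where $s_p^2 = ((n_1 - 1)s_1^2 + (n_2 - 1)s_2^2) / (n_1 + n_2 - 2)$ is the pooled estimate of the variance.

This task can be solved not only with the above formula but also with standard StatPro tools. To do this, it is enough to apply StatPro / Statistical Inference / Two-Sample Analysis.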
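A minimal Python sketch of the independent-samples interval follows. It assumes the pooled-variance form that corresponds to $n_1 + n_2 - 2$ degrees of freedom, takes the critical value $t_{cr}$ as an input (for example, from a t-table or from TINV), and uses invented spending figures.

```python
import math
from statistics import mean, variance

def pooled_difference_interval(sample1, sample2, t_cr):
    """Interval for mu1 - mu2: (mean1 - mean2) +/- t_cr * s_p * sqrt(1/n1 + 1/n2)."""
    n1, n2 = len(sample1), len(sample2)
    diff = mean(sample1) - mean(sample2)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return diff - t_cr * se, diff + t_cr * se

# Hypothetical purchase amounts for two groups of five shoppers each.
women = [52.0, 61.5, 48.0, 70.2, 58.3]
men = [45.1, 50.0, 39.9, 62.4, 47.6]
# 2.306 is the tabulated 95% two-sided value for 5 + 5 - 2 = 8 degrees of freedom.
print(pooled_difference_interval(women, men, t_cr=2.306))
```

If the resulting interval contains zero, the data do not show a clear difference between the two means at the chosen confidence level.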

Confidence interval for the difference between proportions

Let $p_1$ and $p_2$ be the expected values of the shares, and let $\hat{p}_1$ and $\hat{p}_2$ be their sample estimates constructed from samples of size $n_1$ and $n_2$, respectively. Then $\hat{p}_1 - \hat{p}_2$ is the estimate of the difference $p_1 - p_2$. The confidence interval for this difference is therefore expressed as:

$(\hat{p}_1 - \hat{p}_2) \pm z_{cr} \cdot SE$

Here $z_{cr}$ is the value obtained from the normal distribution according to special tables (for example, 1.96 for the 95% confidence interval).

The standard error of the estimate is expressed in this case by the relation:

$SE = \sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}$

Example

The store has undertaken the following market research in preparation for the big sale. The 300 best buyers were selected and then randomly divided into two groups of 150 members each. All of the selected buyers were sent invitations to participate in the sale, but only the invitations to members of the first group were accompanied by a coupon entitling them to a 5% discount. During the sale, the purchases of all 300 selected buyers were recorded. How can a manager interpret the results and draw a conclusion about the effectiveness of the coupons? (See the file COUPONS.XLS (template and solution).)

Solution

In our particular case, of the 150 buyers who received a discount coupon, 55 made a purchase at the sale, while of the 150 who did not receive a coupon only 35 made a purchase (Fig. 103). The sample proportions are therefore 0.3667 and 0.2333, and the sample difference between them is 0.1333. For a 95% confidence interval, we find $z_{cr} = 1.96$ from the normal distribution table. The standard error of the sample difference is 0.0524. Finally, the lower limit of the 95% confidence interval is 0.0307 and the upper limit is 0.2359. The results can be interpreted to mean that for every 100 customers who receive a discount coupon, we can expect from 3 to 23 additional buyers. However, it should be borne in mind that this conclusion by itself does not establish that the coupons are effective (since, by providing a discount, we lose profit!). Let us demonstrate this using specific data. Suppose that the average purchase size is 400 rubles, of which 50 rubles is the store's profit. Then the expected profit per 100 buyers who did not receive the coupon is:

50 × 0.2333 × 100 = 1166.50 rubles.

Similar calculations for 100 buyers who received the coupon give:

30 × 0.3667 × 100 = 1100.10 rubles.

The average profit per purchase falls to 30 rubles because customers who received the coupon use the 5% discount and pay, on average, 380 rubles per purchase.

Thus, the final conclusion speaks of the ineffectiveness of using such coupons in this particular situation.

Comment. This task can also be solved using standard StatPro tools. To do this, it suffices to reduce the problem to that of estimating the difference between two means by coding each buyer's outcome as 1 or 0, and then to apply StatPro / Statistical Inference / Two-Sample Analysis to build a confidence interval for the difference between the two mean values.
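The figures of the coupon example can be checked with a short Python sketch. The counts (55 and 35 out of 150), the critical value 1.96, and the profit figures of 50 and 30 rubles come from the text above; the function name is illustrative. Small differences from the printed profit values arise because the text multiplies rounded proportions.

```python
import math

def proportion_difference_interval(x1, n1, x2, n2, z_cr=1.96):
    """Interval for p1 - p2 using the unpooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, se, diff - z_cr * se, diff + z_cr * se

diff, se, lower, upper = proportion_difference_interval(55, 150, 35, 150)
print(round(diff, 4), round(se, 4), round(lower, 4), round(upper, 4))

# Expected profit per 100 invited buyers, with and without the coupon.
profit_without_coupon = 50 * (35 / 150) * 100
profit_with_coupon = 30 * (55 / 150) * 100
print(round(profit_without_coupon, 2), round(profit_with_coupon, 2))
```

The interval for the difference lies entirely above zero, yet the expected profit is lower with the coupon, which is the point the example makes.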

Confidence interval length control

The length of the confidence interval depends on the following factors:

the data themselves (standard deviation);

significance level;

sample size.

Sample size for estimating the mean

First, consider the problem in the general case. Let B denote the given half-length of the confidence interval (Fig. 104). We know that the confidence interval for the mean of a random variable X is expressed as $\bar{X} \pm t_{cr} \, s_{\bar{X}}$, where $s_{\bar{X}} = s / \sqrt{n}$. Setting

$B = t_{cr} \, s / \sqrt{n}$

and solving for n, we get $n = (t_{cr} \, s / B)^2$.

Unfortunately, we do not know the exact value of the variance of the random variable X. In addition, we do not know the value of $t_{cr}$, since it depends on n through the number of degrees of freedom. In this situation, we can proceed as follows. Instead of the standard deviation s, we use an estimate $\sigma_{est}$ based on any available realizations of the random variable under study. Instead of $t_{cr}$, we use the value $z_{cr}$ for the normal distribution. This is quite acceptable, since the density functions of the normal and t-distributions are very close (except for small n). Thus, the required formula takes the form:

$n = (z_{cr} \, \sigma_{est} / B)^2$

Since the formula generally gives non-integer results, the required sample size is obtained by rounding the result up.
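A minimal Python sketch of this rule, $n = (z_{cr} \sigma_{est} / B)^2$ rounded up, is given below. The value 1.5 used for the standard deviation is purely hypothetical; in practice it would come from earlier data or a pilot sample.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma_est, half_width, confidence=0.95):
    """Smallest n for which z_cr * sigma_est / sqrt(n) does not exceed half_width."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_cr * sigma_est / half_width) ** 2)

# Hypothetical planning value sigma_est = 1.5, target half-width B = 0.3.
print(sample_size_for_mean(sigma_est=1.5, half_width=0.3))
```

Halving B roughly quadruples the required sample size, because B enters the formula squared.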

Example

The fast food restaurant is planning to expand its assortment with a new type of sandwich. In order to assess the demand for it, the manager plans to randomly select a certain number of visitors from those who have already tried it and ask them to rate the new product on a scale from 1 to 10. The manager wants to estimate the expected number of points that the new product will receive and build a 95% confidence interval for this estimate. At the same time, he wants the half-width of the confidence interval not to exceed 0.3. How many visitors should he interview?

The formula for determining the sample size when estimating a proportion is as follows:

$n = z_{cr}^2 \, p_{est}(1 - p_{est}) / B^2$

Here $p_{est}$ is the estimate of the proportion p, and B is the given half-length of the confidence interval. An upper bound for n can be obtained by using $p_{est} = 0.5$. In this case, the half-length of the confidence interval will not exceed the given value B for any true value of p.

Example

Let the manager from the previous example plan to estimate the proportion of customers who preferred a new type of product. He wants to construct a 90% confidence interval half the length of which does not exceed 0.05. How many clients should be included in the random sample?

Solution

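In our case, $z_{cr} = 1.645$. Taking the estimate of the proportion equal to 0.5, the required sample size is calculated as $n = 1.645^2 \cdot 0.5 \cdot 0.5 / 0.05^2 \approx 270.6$, which rounds up to 271.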

If the manager had reason to believe that the true value of p is, for example, about 0.3, then, substituting this value into the above formula, we would get a smaller sample size, namely 228.
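Both cases can be checked with a short Python sketch of the rule $n = z_{cr}^2 \, p(1-p) / B^2$, rounded up. The function name is illustrative; the inputs (90% confidence, half-length 0.05, proportions 0.5 and 0.3) are those of the example.

```python
import math
from statistics import NormalDist

def sample_size_for_proportion(half_width, confidence, p_est=0.5):
    """Sample size for a proportion; p_est = 0.5 gives the most conservative n."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z_cr ** 2 * p_est * (1 - p_est) / half_width ** 2)

print(sample_size_for_proportion(0.05, 0.90))             # conservative case
print(sample_size_for_proportion(0.05, 0.90, p_est=0.3))  # prior guess p = 0.3
```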

The formula for determining the sample size (for each of the two samples) in the case of the difference between two means is written as:

$n = 2 z_{cr}^2 \, \sigma_{est}^2 / B^2$

Example

A computer company has a customer service center. Recently, the number of customer complaints about poor quality of service has increased. The service center mainly employs two types of staff: those who do not have much experience but have completed special training courses, and those who have extensive practical experience but have not completed such courses. The company wants to analyze customer complaints over the past six months and compare their average numbers for each of the two groups of employees. It is assumed that the sample sizes for both groups will be the same. How many employees should be included in each sample to get a 95% interval with a half-length of no more than 2?

Solution

Here $\sigma_{est}$ is an estimate of the standard deviation of both random variables, under the assumption that the two are close. Thus, in our task, we need to obtain this estimate somehow. This can be done, for example, as follows. Looking at the data on customer complaints over the past six months, a manager may notice that each employee generally receives from 6 to 36 complaints. Knowing that for a normal distribution almost all values lie within three standard deviations of the mean, he can reasonably assume that:

$6 \sigma_{est} = 36 - 6 = 30$, whence $\sigma_{est} = 5$.

Substituting this value into the formula, we get $n = 2 \cdot 1.96^2 \cdot 5^2 / 2^2 \approx 48.02$, which rounds up to 49 employees in each group.
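The calculation for two means can be sketched in Python as follows, assuming the rule $n = 2 z_{cr}^2 \sigma_{est}^2 / B^2$ with equal sample sizes and a common standard deviation. The range-based estimate (maximum minus minimum, divided by 6) mirrors the manager's reasoning; the function names are illustrative.

```python
import math
from statistics import NormalDist

def sigma_from_range(low, high):
    """Rough estimate: almost all normal values lie within 3 sigma of the mean."""
    return (high - low) / 6

def sample_size_two_means(sigma_est, half_width, confidence=0.95):
    """Size of EACH sample when both groups share sigma_est and have equal n."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(2 * (z_cr * sigma_est / half_width) ** 2)

sigma_est = sigma_from_range(6, 36)
print(sigma_est, sample_size_two_means(sigma_est, half_width=2))
```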

The formula for determining the sample size (for each of the two samples) in the case of estimating the difference between two proportions is:

$n = z_{cr}^2 \, (p_{1est}(1 - p_{1est}) + p_{2est}(1 - p_{2est})) / B^2$

Example

A certain company has two factories producing similar products. A company manager wants to compare the proportion of defective products in both factories. According to the available information, the scrap rate at both factories is between 3 and 5%. It is supposed to build a 99% confidence interval with half the length of no more than 0.005 (or 0.5%). How many items should be taken from each factory?

Solution

Here $p_{1est}$ and $p_{2est}$ are estimates of the two unknown scrap rates at the 1st and 2nd factories. If we put $p_{1est} = p_{2est} = 0.5$, then we get an overestimated value for n. But since in our case we have some a priori information about these shares, we take their upper estimate, namely 0.05. With $z_{cr} = 2.576$ for the 99% interval, we get $n = 2.576^2 \cdot 2 \cdot 0.05 \cdot 0.95 / 0.005^2$, which is about 25,200 items from each factory.
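A minimal Python sketch of this calculation, assuming the rule $n = z_{cr}^2 (p_1(1-p_1) + p_2(1-p_2)) / B^2$ with equal sample sizes, is shown below. The exact result depends slightly on how many digits of the critical value are used, so it may differ by a few units from a hand calculation with a rounded table value.

```python
import math
from statistics import NormalDist

def sample_size_two_proportions(p1_est, p2_est, half_width, confidence):
    """Size of EACH sample for a given half-length of the interval for p1 - p2."""
    z_cr = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    spread = p1_est * (1 - p1_est) + p2_est * (1 - p2_est)
    return math.ceil(z_cr ** 2 * spread / half_width ** 2)

# Upper estimate of the scrap rate (0.05) at both factories, 99% confidence.
print(sample_size_two_proportions(0.05, 0.05, 0.005, 0.99))
# Conservative choice p = 0.5 for comparison: a much larger sample.
print(sample_size_two_proportions(0.5, 0.5, 0.005, 0.99))
```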

When some parameters of a population are estimated from sample data, it is useful to give not only a point estimate of the parameter, but also to indicate a confidence interval that shows where the exact value of the estimated parameter may be located.

In this chapter, we also got acquainted with the quantitative ratios that allow us to construct such intervals for various parameters; learned how to control the length of the confidence interval.

Note also that the problem of estimating the sample size (the problem of planning an experiment) can be solved using the standard StatPro tools, namely StatPro / Statistical Inference / Sample Size Selection.