General and sample populations

Modern mathematical statistics develops methods for determining the number of trials required before a study begins (design of experiments) and during it (sequential analysis), and solves many other problems. It is defined as the science of decision making under uncertainty.

Thus, the general task of mathematical statistics is to create methods for collecting and processing statistical data in order to obtain scientific and practical conclusions.

Suppose we need to study a set of homogeneous objects with respect to some qualitative or quantitative characteristic. For example, for a batch of parts, conformity of a part to the standard can serve as a qualitative characteristic, and the controlled dimension of the part as a quantitative one.

Sometimes a complete survey is carried out, i.e. every object in the population is examined for the characteristic of interest. In practice, however, a complete survey is used relatively rarely. If a population contains a very large number of objects, surveying all of them is physically impossible. If examining an object destroys it or is very costly, a complete survey makes practically no sense. In such cases, a limited number of objects is randomly selected from the entire population and studied.

A sample population, or simply a sample, is a collection of randomly selected objects.

The general population is the collection of objects from which the sample is drawn.

The size (volume) of a population, sample or general, is the number of objects in it. For example, if 100 parts are selected for examination out of 1000, then the size of the general population is N = 1000 and the sample size is n = 100.

When a sample is compiled, there are two ways to proceed: after an object has been selected and observed, it either is or is not returned to the population. Accordingly, samples are divided into repeated and non-repeated.

A repeated sample (sampling with replacement) is one in which the selected object is returned to the population before the next one is selected.

A non-repeated sample (sampling without replacement) is one in which the selected object is not returned to the population.

In practice, non-repeated random sampling is usually used.

To judge the characteristic of interest in the population with sufficient confidence from sample data, the objects in the sample must represent the population correctly. In other words, the sample must reproduce the proportions of the population. This requirement is stated briefly as follows: the sample must be representative. Personal preferences and other conscious or unconscious factors must therefore play no part in selecting objects. Selection must be strictly random, so that every object has the same probability of being included in the sample.

In practice, various methods of selection are used. In principle, they fall into two types:

1. Selection that does not require dividing the general population into parts. These include:

a) simple random non-repetitive selection;

b) simple random repeated selection.

2. Selection, in which the population is divided into parts. These include:

a) typical selection;

b) mechanical selection;

c) serial selection.

Simple random selection is selection in which objects are drawn one at a time from the entire population. It can be carried out in various ways. For example, to draw n objects from a general population of size N, the numbers from 1 to N are written on cards, the cards are thoroughly shuffled, and one card is drawn at random. The object with the same number as the drawn card is examined; the card is then returned to the pack and the process is repeated: the cards are shuffled, one is drawn at random, and so on. After n draws we obtain a simple random repeated sample of size n.

If the drawn cards are not returned to the pack, the sample is a simple random non-repeated one.
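The card procedure described above can be sketched in Python. This is only an illustration and not part of the original lecture: the function name `simple_random_sample` and the seed value are assumptions, and the standard `random` module stands in for the shuffled pack of cards.

```python
import random


def simple_random_sample(population, n, replace, seed=None):
    """Draw n objects from population, with or without replacement."""
    rng = random.Random(seed)  # seeded generator so the draw can be reproduced
    if replace:
        # Repeated selection: the "card" goes back into the pack after each
        # draw, so the same object may be selected more than once.
        return [rng.choice(population) for _ in range(n)]
    # Non-repeated selection: drawn cards are not returned, so all selected
    # objects are distinct.
    return rng.sample(population, n)


parts = list(range(1, 1001))  # objects numbered 1..N with N = 1000
repeated = simple_random_sample(parts, 100, replace=True, seed=1)
non_repeated = simple_random_sample(parts, 100, replace=False, seed=1)
print(len(repeated), len(set(non_repeated)))
```

With non-repeated selection the 100 drawn numbers are always distinct, while a repeated sample of the same size may contain duplicates.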

If the general population is divided into typical parts (for example, cartridges are divided by caliber), selection is made not from all cartridges together but separately within each caliber. This is called typical selection.

Selection of the form “every fifth object in the population is taken” is called mechanical.

Serial selection is selection in which objects are taken from the general population not one at a time but in “series”, each of which is surveyed completely. For example, if products are manufactured by a large group of automatic machines, the output of only a few machines is examined in full. Serial selection is used when the characteristic under study varies only slightly from series to series.

In practice, combined selection, which combines the methods above, is often used. For example, the population is sometimes divided into series of equal size, several series are then chosen by simple random selection, and finally individual objects are drawn from each chosen series by simple random selection.
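A rough sketch of typical, mechanical and serial selection is given below. The population of 12 objects, the `type` and `machine` fields and the function names are hypothetical and serve only to illustrate the three procedures.

```python
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility

# Hypothetical population: each object has a type (e.g. a caliber) and the
# machine (series) that produced it.
population = [
    {"id": i, "type": "A" if i % 2 else "B", "machine": i % 3}
    for i in range(1, 13)
]


def mechanical(pop, step):
    """Mechanical selection: every step-th object from a random start."""
    start = rng.randrange(step)
    return pop[start::step]


def typical(pop, per_type):
    """Typical selection: draw per_type objects separately within each type."""
    sample = []
    for t in sorted({obj["type"] for obj in pop}):
        group = [obj for obj in pop if obj["type"] == t]
        sample.extend(rng.sample(group, per_type))
    return sample


def serial(pop, n_series):
    """Serial selection: pick whole series at random and survey them fully."""
    series = sorted({obj["machine"] for obj in pop})
    chosen = rng.sample(series, n_series)
    return [obj for obj in pop if obj["machine"] in chosen]


print(len(mechanical(population, 4)), len(typical(population, 2)),
      len(serial(population, 1)))
```

Combined selection can be obtained by chaining these functions, for example by calling `serial` first and then drawing a simple random sample inside each chosen series.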

Lecture 6. Elements of mathematical statistics

Questions for review and consolidation of the previous lecture

1. Define a random variable.

2. Write the formulas for the mathematical expectation and variance of discrete and continuous random variables.

3. State Laplace's local and integral limit theorems.

4. Write formulas that define the binomial distribution, hypergeometric distribution, Poisson distribution, uniform distribution and normal distribution.

Goal: To study the basic concepts of mathematical statistics

1. Population and sample

2. Statistical distribution of the sample. Polygon. Histogram.

3. Estimates of parameters of the general population based on its sample

4. General and sample averages. Methods for their calculation.

5. General and sample variances.

6. Questions for review and consolidation of the lecture

We now turn to the elements of mathematical statistics, which develops scientifically grounded methods for collecting and processing statistical data.

1. General population and sample. Suppose we need to study a set of homogeneous objects (this set is called a statistical population) with respect to some qualitative or quantitative characteristic. For example, for a batch of parts, conformity of a part to the standard can serve as a qualitative characteristic, and the controlled dimension of the part as a quantitative one.

A complete survey, in which every object is examined, would be best. In most cases, however, it cannot be carried out for various reasons: the objects may be too numerous or inaccessible. If, for example, we need to know the average depth of the crater left by the explosion of a shell from an experimental batch, a complete survey would destroy the entire batch.

If a complete survey is not possible, a portion of the objects is selected from the entire population for study.

The statistical population from which the objects are selected is called the general population. The set of objects randomly selected from the general population is called a sample.

The numbers of objects in the general population and in the sample are called the size of the general population and the size of the sample, respectively.

Example 10.1. The fruits of one tree (200 pieces) are examined for the presence of a taste specific to this variety. For this purpose, 10 pieces are selected. Here 200 is the size of the population, and 10 is the size of the sample.

If objects are selected one at a time and each is returned to the general population after it has been examined, the sample is called repeated. If the selected objects are not returned to the general population, the sample is called non-repeated.

In practice, non-repeated sampling is used more often. If the sample size is a small fraction of the population size, the difference between repeated and non-repeated samples is negligible.

The properties of the objects in the sample must correctly reflect the properties of the objects in the general population, or, as they say, the sample must be representative. A sample is considered representative if all objects in the general population have the same probability of being included in it, i.e. if selection is random. For example, to estimate a future harvest, one can take a sample from the general population of fruits that have not yet ripened and examine their characteristics (weight, quality, etc.). If the entire sample is taken from one tree, it will not be representative. A representative sample should consist of randomly selected fruits from randomly selected trees.

2. Statistical distribution of the sample. Polygon. Histogram. Let a sample be drawn from the general population in which the value $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, where $n_1 + n_2 + \dots + n_k = n$ is the sample size. The observed values $x_1, x_2, \dots, x_k$ are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations $n_1, n_2, \dots, n_k$ are called frequencies, and their ratios to the sample size, $w_1 = n_1/n$, $w_2 = n_2/n$, ..., $w_k = n_k/n$, are called relative frequencies. Note that the relative frequencies sum to one: $w_1 + w_2 + \dots + w_k = 1$.
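As an illustration, the variation series, frequencies and relative frequencies of a small sample can be tabulated as follows. The observations are hypothetical; relative frequencies are computed as the ratio n_i / n of each frequency to the sample size.

```python
from collections import Counter

sample = [2, 5, 2, 7, 5, 2, 7, 7, 5, 2]  # hypothetical observations
n = len(sample)                          # sample size

counts = Counter(sample)
variants = sorted(counts)                     # variation series x_1 < ... < x_k
frequencies = [counts[x] for x in variants]   # n_i
relative = [n_i / n for n_i in frequencies]   # n_i / n

for x, n_i, w_i in zip(variants, frequencies, relative):
    print(x, n_i, w_i)

print(sum(frequencies) == n)  # the frequencies add up to the sample size
```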

The statistical distribution of a sample is the list of variants together with their frequencies or relative frequencies. It can also be given as a sequence of intervals with their corresponding frequencies (a continuous distribution); the frequency of an interval is the sum of the frequencies of the variants that fall within it. Polygons and histograms are used to display a statistical distribution graphically.

To construct a polygon, the variants $x_i$ are plotted on the $Ox$ axis and the frequencies $n_i$ (or relative frequencies $n_i/n$) on the $Oy$ axis; the resulting points are joined by line segments.

Example 10.2. Fig. 10.1 shows the polygon of the following distribution.

The polygon is usually used when the number of variants is small. When the number of variants is large, or when the characteristic is distributed continuously, a histogram is more often constructed. To do this, the interval containing all observed values of the characteristic is divided into several partial intervals of length $h$, and for each partial interval the sum $n_i$ of the frequencies of the variants falling in the $i$-th interval is found. Then rectangles with heights $n_i/h$ (or $n_i/(nh)$, where $n$ is the sample size) are built on these intervals as bases.

The area of the $i$-th partial rectangle is $h \cdot n_i/h = n_i$ (or $h \cdot n_i/(nh) = n_i/n$).

Consequently, the area of the histogram equals the sum of all frequencies (or relative frequencies), i.e. the sample size (or one).
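The statement about the area of the histogram can be checked numerically. The sketch below assumes hypothetical interval frequencies and a hypothetical interval length h, and builds rectangle heights n_i / h (or n_i / (n·h) for relative frequencies); the function name is illustrative.

```python
def histogram_heights(interval_frequencies, h, relative=False):
    """Heights of histogram rectangles built on partial intervals of length h."""
    n = sum(interval_frequencies)  # sample size
    if relative:
        return [n_i / (n * h) for n_i in interval_frequencies]
    return [n_i / h for n_i in interval_frequencies]


freqs = [10, 30, 40, 20]  # hypothetical interval frequencies n_i, n = 100
h = 5                     # hypothetical length of each partial interval

heights = histogram_heights(freqs, h)
area = sum(height * h for height in heights)

rel_heights = histogram_heights(freqs, h, relative=True)
rel_area = sum(height * h for height in rel_heights)

print(heights, area, rel_area)
```

The frequency histogram has total area equal to the sample size, and the relative-frequency histogram has total area one, whatever interval length is chosen.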

Example 10.3. Fig. 10.2 shows the histogram of a continuous distribution of size n = 100 given in the following table.

A statistical population is a set of units characterized by mass scale, typicality, qualitative homogeneity and the presence of variation.

A statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is the object of statistical study.

A unit of the population is each individual unit of a statistical population.

The same statistical population can be homogeneous with respect to one characteristic and heterogeneous with respect to another.

Qualitative homogeneity is the similarity of all units of the population in some respect and their dissimilarity in all others.

In a statistical population, the differences between one unit and another are most often quantitative. Quantitative changes in the value of a characteristic from one unit of the population to another are called variation.

Variation of a characteristic is the quantitative change in a characteristic (for a quantitative characteristic) in passing from one unit of the population to another.

A characteristic is a property or other feature of units, objects and phenomena that can be observed or measured. Characteristics are divided into quantitative and qualitative. The diversity and variability of the value of a characteristic across individual units of a population is called variation.

Attributive (qualitative) characteristics cannot be expressed numerically (for example, composition of the population by sex). Quantitative characteristics are expressed numerically (for example, composition of the population by age).

An indicator is a generalizing quantitative and qualitative description of some property of units or of the population as a whole under specific conditions of time and place.

A system of indicators is a set of indicators that comprehensively reflects the phenomenon under study.

For example, when wages are studied:
  • Characteristic: wages
  • Statistical population: all employees
  • Unit of the population: each employee
  • Qualitative homogeneity: accrued wages
  • Variation of the characteristic: a series of numbers

Population and sample from it

The starting point is a set of data obtained by measuring one or more characteristics. The set of objects actually observed, represented statistically by a series of observations of a random variable, is the sample; the hypothetically existing (conjectured) set is the general population. The general population may be finite (number of observations N = const) or infinite (N = ∞), whereas a sample from it is always the result of a limited number of observations. The number of observations forming a sample is called the sample size. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when a one-dimensional random variable is measured, the sample size does not exceed 30 (n ≤ 30), or, when several (k) characteristics are measured simultaneously in a multidimensional space, the ratio of n to k does not exceed 10 (n/k < 10). The sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are arranged in ascending order (ranked); the values of the characteristic are then called variants.

Example. One and the same randomly selected set of objects, the commercial banks of one administrative district of Moscow, can be regarded as a sample from the general population of all commercial banks in that district, as a sample from the general population of all commercial banks in Moscow, as a sample from the commercial banks of the whole country, and so on.

Basic methods of organizing sampling

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it reproduces the properties of the general population with respect to which it can be considered representative. The statistical properties of a population can be studied in two ways: by complete or by partial observation. Complete observation examines all units of the population under study, whereas partial (sample) observation examines only part of them.

There are five main ways to organize sample observation:

1. simple random selection, in which objects are drawn at random from the general population (for example, using a table of random numbers or a random number generator), every possible sample being equally probable. Such samples are called properly random;

2. simple selection by a regular procedure, carried out using some mechanical component (for example, date, day of the week, apartment number, letter of the alphabet); samples obtained in this way are called mechanical;

3. stratified selection, in which the general population of size $N$ is divided into subpopulations, or strata, of sizes $N_1, N_2, \dots, N_k$ such that $N_1 + N_2 + \dots + N_k = N$. Strata are groups of objects that are homogeneous in their statistical characteristics (for example, the population is divided into strata by age group or social class, enterprises by industry). The resulting samples are called stratified (also layered, typical or regionalized);

4. serial selection, used to form serial or cluster (nest) samples. It is convenient when a “block” or series of objects has to be surveyed at once (for example, a consignment of goods, the products of a particular series, or the population of a territorial and administrative unit of the country). The series can be selected purely at random or mechanically. Each selected consignment of goods or territorial unit (a residential building or a block) is then surveyed in full;

5. combined (multistage) selection, which can combine several selection methods at once (for example, stratified and random, or random and mechanical); such a sample is called combined.

Types of selection

By type, selection is divided into individual, group and combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection is a combination of the first two types.

By method, selection is divided into repeated and non-repeated.

Non-repeated selection is selection in which a unit that has entered the sample is not returned to the original population and takes no part in further selection; the number of units N in the general population therefore decreases during selection. In repeated selection, a unit that has entered the sample is returned to the general population after it has been recorded and thus keeps the same chance as the other units of being drawn again; the number of units N in the general population remains unchanged (this method is rarely used in socio-economic research). However, for large N (N → ∞) the formulas for non-repeated selection approach those for repeated selection, and the latter are more often used in practice (N = const).

Basic characteristics of the parameters of the general and sample population

The statistical conclusions of a study are based on the distribution of a random variable; the observed values ($x_1, x_2, \dots, x_n$) are called realizations of the random variable $X$ ($n$ is the sample size). The distribution of the random variable in the general population is theoretical and idealized, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at every point of the space of possible values of the random variable. For a sample, the distribution function is difficult and sometimes impossible to determine, so the parameters are estimated from empirical data and then substituted into the analytical expression that describes the theoretical distribution. The assumption (or hypothesis) about the type of distribution may be statistically correct or erroneous. In either case, the empirical distribution reconstructed from the sample characterizes the true one only approximately. The most important distribution parameters are the mathematical expectation and the variance.

By their nature, distributions are continuous or discrete. The best-known continuous distribution is the normal distribution. The sample analogues of its parameters $\mu$ and $\sigma^2$ are the mean value $\bar{x}$ and the empirical variance $s^2$. Among discrete distributions, the one used most often in socio-economic research is the alternative (dichotomous) distribution. Its mathematical expectation expresses the relative share of units in the population that possess the characteristic under study (denoted by $p$); the share of the population that does not possess the characteristic is denoted by $q$ ($q = 1 - p$). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting units, the characteristics of the distribution parameters are calculated differently. The main ones for theoretical and empirical distributions are given in Table 9.1.

The sampling fraction $k_n$ is the ratio of the number of units in the sample to the number of units in the general population:

$k_n = n/N$.

The sample share $w$ is the ratio of the number $n_x$ of sample units possessing the characteristic $x$ under study to the sample size $n$:

$w = n_x/n$.

Example. For a consignment of goods containing 1000 units and a 5% sample, the sample size in absolute terms is 50 units ($n = N \cdot 0.05$); if 2 defective items are found in this sample, the sample defect rate $w$ is 0.04 ($w = 2/50 = 0.04$, or 4%).
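The arithmetic of this example can be reproduced in a few lines. The function names below are illustrative only; the numbers are those of the example (1000 units, a 5% sample, 2 defective items).

```python
def sampling_fraction(n, N):
    """Ratio of the sample size n to the size N of the general population."""
    return n / N


def sample_share(n_x, n):
    """Share of sample units that possess the characteristic under study."""
    return n_x / n


N = 1000             # size of the consignment
n = round(N * 0.05)  # 5% sample
k_n = sampling_fraction(n, N)
w = sample_share(2, n)  # 2 defective items found in the sample

print(n, k_n, w)
```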

Since the sample population is different from the general population, there are sampling errors.

Table 9.1 Main parameters of the general and sample populations

Sampling errors

In any observation, complete or sample, two types of error can occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors arise from many different uncontrollable causes, are unintentional and usually cancel each other out (for example, changes in instrument readings caused by temperature fluctuations in the room).

Systematic errors are biased, because they arise when the rules for selecting objects into the sample are violated (for example, deviations in measurements after the settings of the measuring instrument have been changed).

Example. To assess the social situation of the population of a city, it is planned to survey 25% of families. If every fourth apartment is selected by its number, there is a danger of selecting apartments of only one type (for example, only one-room apartments), which introduces a systematic error and distorts the results; choosing the apartment numbers by lot is preferable, since the error will then be random.

Representativeness errors are inherent only in sample observation; they cannot be avoided, and they arise because the sample does not reproduce the general population completely. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained by complete observation).

The sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative characteristic it equals $|\mu - \bar{x}|$, where $\mu$ is the general mean and $\bar{x}$ the sample mean, and for the share (alternative characteristic) it equals $|p - w|$, where $p$ is the general share and $w$ the sample share.

Sampling errors are inherent only in sample observation. The larger they are, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution are random variables, so sampling errors are random variables too: they can take different values for different samples, and it is therefore customary to calculate the average error.

The average sampling error is the standard deviation of the sample mean from the mathematical expectation. Provided that selection is random, it depends primarily on the sample size and on the degree of variation of the characteristic: the larger $n$ and the smaller the variation of the characteristic (and hence $\sigma^2$), the smaller the average sampling error. The relationship between the variance $\sigma^2$ of the general population and the variance $s^2$ of the sample is expressed by the formula

$\sigma^2 = s^2 \cdot \dfrac{n}{n-1}$,

i.e. for sufficiently large $n$ we can assume that $\sigma^2 \approx s^2$. The average sampling error shows the possible deviations of a sample parameter from the corresponding parameter of the general population. Table 9.2 gives expressions for calculating the average sampling error under different methods of organizing observation.

Table 9.2 Average error (m) of sample mean and proportion for different types of samples

where $\overline{s^2}$ is the average of the within-group sample variances for a continuous characteristic;

$\overline{w(1-w)}$ is the average of the within-group variances of the share;

$r$ is the number of selected series and $R$ is the total number of series;

$\delta_{\bar{x}}^2 = \dfrac{\sum_{i=1}^{r} (\bar{x}_i - \bar{x})^2}{r}$,

where $\bar{x}_i$ is the mean of the $i$-th series;

$\bar{x}$ is the overall mean for the entire sample for a continuous characteristic;

$\delta_w^2 = \dfrac{\sum_{i=1}^{r} (w_i - \bar{w})^2}{r}$,

where $w_i$ is the share of the characteristic in the $i$-th series;

$\bar{w}$ is the overall share of the characteristic in the entire sample.
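The body of Table 9.2 is not available in this text. As a hedged illustration, the sketch below uses the standard expressions for simple random selection: the square root of σ²/n for repeated selection, and the same quantity multiplied by (1 − n/N) under the root for non-repeated selection. Whether these coincide symbol for symbol with the table is an assumption, and the function names are hypothetical.

```python
import math


def mean_error(variance, n, N=None):
    """Average error of the sample mean under simple random selection.

    N=None means repeated selection (or an unknown population size);
    otherwise the finite-population factor (1 - n/N) is applied.
    """
    m_squared = variance / n
    if N is not None:
        m_squared *= 1 - n / N
    return math.sqrt(m_squared)


def share_error(w, n, N=None):
    """Average error of the sample share; the variance of a share is w(1 - w)."""
    return mean_error(w * (1 - w), n, N)


print(mean_error(100, 25))         # repeated selection
print(mean_error(100, 25, N=100))  # non-repeated selection, 25% of the population
print(share_error(0.5, 100))
```

The non-repeated error is never larger than the repeated one, and the two nearly coincide when n is a small fraction of N.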

However, the magnitude of the average error can be judged only with a certain probability P (P ≤ 1). A. M. Lyapunov proved that the distribution of sample means, and hence of their deviations from the general mean, approximately follows the normal law for sufficiently large $n$, provided that the general population has a finite mean and bounded variance.

Mathematically, this statement is expressed for the mean as

$P\left(|\bar{x} - \mu| \le \Delta\right) = \Phi(t)$,   (1)

and for the share expression (1) takes the form

$P\left(|w - p| \le \Delta_w\right) = \Phi(t)$,   (2)

where $\mu$ is the general mean, $p$ is the general share, and

$\Delta = t \cdot m$   (3)

is the marginal sampling error, a multiple of the average sampling error $m$; the multiplier $t$ is Student's criterion (the "confidence coefficient"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are tabulated.

The values of the function $\Phi(t)$ for some values of $t$ are:

$\Phi(1) = 0.683$; $\Phi(2) = 0.954$; $\Phi(3) = 0.997$.

Therefore, expression (3) can be read as follows: with probability P = 0.683 (68.3%) the difference between the sample mean and the general mean will not exceed one average error m (t = 1); with probability P = 0.954 (95.4%) it will not exceed two average errors (t = 2); with probability P = 0.997 (99.7%) it will not exceed three average errors (t = 3). Thus, the probability that this difference exceeds three times the average error is the error level and is no more than 0.3%.
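The probabilities quoted here can be reproduced from the normal law. The sketch assumes that Ф(t) denotes the probability that a standard normal variable falls within ±t, which equals erf(t/√2); the function name is illustrative.

```python
import math


def confidence_probability(t):
    """P(|Z| <= t) for a standard normal variable Z."""
    return math.erf(t / math.sqrt(2))


for t in (1, 2, 3):
    print(t, round(confidence_probability(t), 3))

# Probability of exceeding three average errors (the error level for t = 3)
print(round(1 - confidence_probability(3), 4))
```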

Table 9.3 gives formulas for calculating the marginal sampling error.

Table 9.3 Marginal error (Δ) of the sample for the mean and the share (p) for different types of sample observation

Generalization of sample results to the population

The ultimate goal of sample observation is to characterize the general population. With small samples, the empirical estimates of the parameters (the sample mean $\bar{x}$ and the sample share $w$) may deviate considerably from their true values (the general mean $\mu$ and the general share $p$). It is therefore necessary to establish the limits within which the true values ($\mu$ and $p$) lie for given sample values of the parameters ($\bar{x}$ and $w$).

The confidence interval of a parameter θ of the general population is a random range of values of this parameter that contains its true value with a probability close to 1 (the reliability).

The marginal sampling error Δ makes it possible to determine the limiting values of the characteristics of the general population and their confidence intervals:

$\bar{x} - \Delta \le \mu \le \bar{x} + \Delta$,   $w - \Delta_w \le p \le w + \Delta_w$.

The lower limit of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper limit by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula

$\bar{x} - t \cdot m \le \mu \le \bar{x} + t \cdot m$.

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean lies in the range from $\bar{x} - \Delta$ to $\bar{x} + \Delta$, and the true value of the share in the range from $w - \Delta_w$ to $w + \Delta_w$.

When the confidence interval is calculated for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is taken from the table of Student's distribution in the Appendix according to the number of degrees of freedom. If the sample is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error determines the limiting values of the characteristics of the general population and their confidence intervals: $\bar{x} \pm \Delta$ for the mean and $w \pm \Delta_w$ for the share.
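For large samples the values of t quoted above are quantiles of the standard normal distribution, so they can be recovered with `statistics.NormalDist` from the Python standard library. The helper names and the numbers in the usage line are hypothetical; for small samples the Student table mentioned in the text should be used instead.

```python
from statistics import NormalDist


def t_for_confidence(P):
    """Large-sample multiplier t such that P(|Z| <= t) = P for standard normal Z."""
    return NormalDist().inv_cdf((1 + P) / 2)


def confidence_interval(estimate, m, P):
    """Interval estimate ± t·m, where m is the average sampling error."""
    delta = t_for_confidence(P) * m  # marginal error
    return estimate - delta, estimate + delta


for P in (0.95, 0.99, 0.999):
    print(P, round(t_for_confidence(P), 2))

# Hypothetical usage: sample mean 50.0 with average error 2.0 at P = 0.95
low, high = confidence_interval(50.0, 2.0, 0.95)
print(round(low, 2), round(high, 2))
```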

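Extending the results of sample observation to the general population has its own features in socio-economic research, since it requires all types and groups of the population to be fully represented. The possibility of such an extension is assessed by calculating the relative error:

$\Delta_\% = \dfrac{\Delta}{\bar{x}} \cdot 100\%$ for the mean and $\Delta_\% = \dfrac{\Delta_w}{w} \cdot 100\%$ for the share,

where $\Delta_\%$ is the relative marginal sampling error, $\bar{x}$ is the sample mean and $w$ is the sample share.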

There are two main methods of extending sample observation to the general population: direct recalculation and the coefficient method.

Direct recalculation consists in multiplying the sample mean $\bar{x}$ by the size of the general population.

Example. Suppose the average number of nursery-age children per young family in a city has been estimated by the sampling method as $\bar{x} = 1.2$. If there are 1000 young families in the city, the number of places required in municipal nurseries is obtained by multiplying this average by the size of the general population N = 1000, i.e. 1.2 · 1000 = 1200 places.

The coefficient method is advisable when sample observation is carried out in order to refine the data of a complete observation.

The following formula is used:

where all variables are the population size:

Required sample size

Table 9.4 Required sample size (n) for different types of sample observation organization

When a sample observation is planned with a predetermined permissible sampling error, the required sample size must be estimated correctly. It can be determined from the permissible error and a given probability that guarantees this error level (taking into account the method of organizing the observation). Formulas for the required sample size n are easily obtained directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error

$\Delta = t \sqrt{\dfrac{\sigma^2}{n}}$

the sample size n follows directly:

$n = \dfrac{t^2 \sigma^2}{\Delta^2}$.

This formula shows that as the marginal sampling error Δ decreases, the required sample size, which is proportional to the variance and to the square of Student's t, increases substantially.
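How quickly the required sample size grows can be seen from a short calculation. The sketch assumes the repeated-selection relation in which the marginal error equals t·σ/√n, so that n = t²σ²/Δ²; the function name and the numbers are hypothetical.

```python
import math


def required_n_repeated(t, sigma, delta):
    """Smallest n for which t * sigma / sqrt(n) does not exceed delta."""
    return math.ceil(t ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: t = 2 (P = 0.954) and standard deviation 10
for delta in (4, 2, 1):
    print(delta, required_n_repeated(2, 10, delta))
```

Halving the permissible error quadruples the required sample size, because n is inversely proportional to the square of Δ.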

For a specific method of organizing observation, the required sample size is calculated according to the formulas given in table. 9.4.

Practical calculation examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlements with creditors, a random sample of 10 payment documents was taken at a bank. The settlement times (in days) were: 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

With probability P = 0.954, determine the marginal error Δ of the sample mean and the confidence limits of the mean settlement time.

Solution. The mean is calculated using the formula from Table 9.1 for the sample:

$\bar{x} = \dfrac{\sum x_i}{n} = \dfrac{120}{10} = 12.0$ days.

The variance is calculated using the formula from Table 9.1:

$\sigma^2 = \dfrac{\sum (x_i - \bar{x})^2}{n - 1} = \dfrac{478}{9} \approx 53.1$.

The standard deviation is $\sigma = \sqrt{53.1} \approx 7.3$ days.

The average error is calculated using the formula

$m = \dfrac{\sigma}{\sqrt{n}} = \dfrac{7.3}{\sqrt{10}} \approx 2.3$ days,

i.e. the mean is $\bar{x} \pm m = 12.0 \pm 2.3$ days.

The reliability of the mean is

$t_{\bar{x}} = \dfrac{\bar{x}}{m} = \dfrac{12.0}{2.3} \approx 5.2$.

The marginal error is calculated using the formula from Table 9.3 for repeated sampling, since the size of the general population is unknown, for the confidence level P = 0.954:

$\Delta = t \cdot m = 2 \cdot 2.3 = 4.6$ days.

Thus, the mean is $\bar{x} \pm \Delta = \bar{x} \pm 2m = 12.0 \pm 4.6$, i.e. its true value lies in the range from 7.4 to 16.6 days.

The table of Student's t in the Appendix shows that, for n − 1 = 10 − 1 = 9 degrees of freedom, the value obtained is reliable at the significance level α ≤ 0.001, i.e. the mean differs significantly from 0.
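The figures of Example 1 can be checked with the Python standard library. The sketch assumes that the variance is computed with the divisor n − 1 (as `statistics.variance` does), which reproduces the average error of 2.3 days stated in the solution; variable names are illustrative.

```python
import math
import statistics

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]  # settlement times from Example 1
n = len(days)

mean = statistics.mean(days)          # sample mean
variance = statistics.variance(days)  # divisor n - 1 (assumption, see lead-in)
std_dev = math.sqrt(variance)
m = std_dev / math.sqrt(n)            # average error, repeated sampling

t = 2                                 # multiplier for P = 0.954
delta = t * m                         # marginal error
lower, upper = mean - delta, mean + delta

print(mean, round(variance, 1), round(m, 1), round(delta, 1))
print(round(lower, 1), round(upper, 1))
```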

Example 2. Estimation of probability (general share) p.

During a mechanical sampling method of surveying the social status of 1000 families, it was revealed that the proportion of low-income families was w = 0.3 (30%)(sample was 2% , i.e. n/N = 0.02). Required with confidence level p = 0.997 determine the indicator R low-income families throughout the region.

Solution. From the tabulated values of the function Φ(t) we find that the confidence level P = 0.997 corresponds to t = 3 (see formula 3). The marginal error of the share w is determined by the formula from Table 9.3 for non-repeated sampling (mechanical sampling is always non-repeated):

$\Delta_w = t\sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)} = 3\sqrt{\frac{0.3 \cdot 0.7}{1000}(1 - 0.02)} \approx 3 \cdot 0.0143 \approx 0.043$

The marginal relative sampling error in % is:

$\frac{\Delta_w}{w} \cdot 100\% = \frac{0.043}{0.3} \cdot 100\% \approx 14.3\%$

The probability (general share) of low-income families in the region is $p = w \pm \Delta_w$, and the confidence limits of p are calculated from the double inequality:

$w - \Delta_w \le p \le w + \Delta_w$, i.e. the true value of p lies within:

$0.3 - 0.043 \le p \le 0.3 + 0.043$, namely from 25.7% to 34.3%.

Thus, with a probability of 0.997 it can be stated that the share of low-income families among all families in the region ranges from 25.7% to 34.3%.
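A minimal Python sketch of this calculation follows. Table 9.3 is not reproduced here, so the sketch assumes the usual formula for the marginal error of a share under non-repeated sampling, Δ = t·sqrt(w(1 − w)/n · (1 − n/N)); the variable names are illustrative. Note that the standard error alone is about 0.014, and multiplying it by t = 3 gives about 0.043.

```python
import math

w = 0.3        # sample share of low-income families
n = 1000       # sample size
frac = 0.02    # sampling fraction n / N
t = 3          # confidence coefficient for P = 0.997

# Standard error of the share with the finite-population correction (assumed formula).
mu = math.sqrt(w * (1 - w) / n * (1 - frac))
delta = t * mu                      # marginal error
lower, upper = w - delta, w + delta

print(f"standard error = {mu:.4f}, marginal error = {delta:.3f}")
print(f"share between {lower:.1%} and {upper:.1%}")
```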

Example 3. Calculation of the mean value and confidence interval for a discrete characteristic specified by an interval series.

Table 9.5 gives the distribution of production orders by the time the enterprise took to complete them.

Table 9.5 Distribution of observations by time of appearance

Solution. The average time for completing orders is calculated using the formula:

$\bar{x} = \frac{\sum_i x_i n_i}{\sum_i n_i}$

The average period is:

$\bar{x}$ = (3*20 + 9*80 + 24*60 + 48*20 + 72*20)/200 = 23.1 months.

We get the same answer if we use the shares $p_i$ from the penultimate column of Table 9.5, using the formula:

$\bar{x} = \sum_i x_i p_i$

Note that the midpoint of the last (open-ended) interval is found by artificially closing it with the width of the preceding interval, equal to 60 - 36 = 24 months.
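The two ways of computing the mean can be compared in a short Python sketch. The midpoints and frequencies are the ones used in the calculation above; the shares are derived here from the frequencies, since Table 9.5 itself is not reproduced.

```python
# Interval midpoints (months) and frequencies from the calculation above.
midpoints = [3, 9, 24, 48, 72]
freqs = [20, 80, 60, 20, 20]

n = sum(freqs)                                    # total number of orders, 200

# Mean weighted by frequencies n_i.
mean_by_freq = sum(x * f for x, f in zip(midpoints, freqs)) / n

# Mean weighted by shares p_i = n_i / n.
shares = [f / n for f in freqs]
mean_by_share = sum(x * p for x, p in zip(midpoints, shares))

print(f"{mean_by_freq:.1f} months, {mean_by_share:.1f} months")
```

Both weightings give the same value because the shares are just the frequencies divided by n.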

The variance is calculated from the deviations of the interval midpoints $x_i$ from the mean, where $x_i$ is the midpoint of the i-th interval of the series.

Therefore $\sigma^2 = \frac{20^2 + 14^2 + 1^2 + 25^2 + 49^2}{4} \approx 905.8$, and the standard deviation is $\sigma \approx 30.1$ months.

The standard error of the mean is m ≈ 13.4 months, i.e. the mean is $\overline{x} \pm m = 23.1 \pm 13.4$ months.

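We calculate the marginal error using the formula from Table 9.3 for repeated selection, since the population size is unknown, for the 0.954 confidence level (t = 2):

$\Delta = t m = 2 \cdot 13.4 = 26.8$ months.

So the mean is:

$\bar{x} \pm \Delta = 23.1 \pm 26.8$ months,

i.e. its true value lies in the range from 0 to 50 months (the negative lower limit is replaced by 0, since a completion time cannot be negative).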

Example 4. To determine the speed of settlements with creditors of the N = 500 enterprises of a corporation, a sample study is to be conducted in a commercial bank using random non-repeated selection. Determine the required sample size n so that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, if trial estimates showed that the standard deviation s was 10 days.

Solution. To determine the number of required observations n, we use the formula for non-repeated selection from Table 9.4:

$n = \frac{t^2 s^2 N}{\Delta_x^2 N + t^2 s^2}$

In it, the value of t is determined by the confidence level P = 0.954 and is equal to 2. The standard deviation is s = 10, the population size is N = 500, and the marginal error of the mean is $\Delta_x = 3$. Substituting these values into the formula, we get:

$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 40.8$,

i.e. a sample of 41 enterprises is sufficient to estimate the required parameter, the speed of settlements with creditors.
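The sample-size calculation can be sketched in Python as follows. Table 9.4 is not reproduced here, so the function assumes the standard formula for non-repeated selection, n = t²s²N / (Δ²N + t²s²); the function name is illustrative.

```python
import math

def required_sample_size(t, s, N, delta):
    """Sample size for non-repeated selection (assumed standard formula)."""
    return t**2 * s**2 * N / (delta**2 * N + t**2 * s**2)

n_exact = required_sample_size(t=2, s=10, N=500, delta=3)
n = math.ceil(n_exact)    # round up so the error does not exceed the limit

print(f"n = {n_exact:.1f}, rounded up to {n}")
```

Rounding up rather than to the nearest integer is the cautious choice, because a smaller sample would give a marginal error above the required 3 days.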

Mathematical statistics is a branch of mathematics that studies approximate methods for finding distribution laws and numerical characteristics from experimental results.

Population is the set of all conceivable observations (objects), homogeneous with respect to some attribute, that could be made.

Sample is a collection of observations (objects) randomly selected from the general population for direct study.

Statistical distribution is the set of variants $x_i$ and their corresponding frequencies $n_i$.

Frequency histogram is a stepped figure consisting of adjacent rectangles constructed on the same straight line, whose bases are identical and equal to the class width, and whose heights are equal to either the frequency $n_i$ of falling into the interval or the relative frequency $n_i/n$. The width of the interval i can be determined according to the Sturges formula:

$i = \frac{x_{\max} - x_{\min}}{1 + 3.32 \lg n}$,

where $x_{\max}$ is the maximum and $x_{\min}$ is the minimum value of the variant, their difference is called the range of variation, and n is the sample size.
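As an illustration, the Sturges width can be computed in Python. The input values below (minimum 5.05, maximum 5.85, sample size 100) are illustrative, and the function name is hypothetical.

```python
import math

def sturges_width(x_min, x_max, n):
    """Class width by the Sturges formula: range / (1 + 3.32 * lg n)."""
    return (x_max - x_min) / (1 + 3.32 * math.log10(n))

width = sturges_width(x_min=5.05, x_max=5.85, n=100)
print(f"class width = {width:.3f}")
```

In practice the result is usually rounded to a convenient value, for example 0.1 here.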

Frequency polygon is a broken line whose segments connect the points with coordinates $(x_i, n_i)$.

5. Characteristics of position (mode, median, sample mean) and dispersion (sample variance and sample standard deviation).

Mode ($M_o$) is a variant such that the preceding and following values have lower frequencies of occurrence.

For unimodal distributions, a mode is the most frequently occurring variant in a given population.

To determine the mode of an interval series, use the formula:

$M_o = x_{lower} + i \cdot \frac{n_2 - n_1}{2n_2 - n_1 - n_3}$,

where $x_{lower}$ is the lower boundary of the modal class, i.e. the class with the highest frequency of occurrence $n_2$; $n_2$ is the frequency of the modal class; $n_1$ is the frequency of the class preceding the modal one; $n_3$ is the frequency of the class following the modal one; i is the width of the class interval.
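A minimal Python sketch of the interval-series mode follows. It assumes the standard form of the formula, in which the denominator is (n2 − n1) + (n2 − n3) = 2·n2 − n1 − n3; the class boundaries and frequencies are hypothetical.

```python
def interval_mode(x_lower, width, n1, n2, n3):
    """Mode of an interval series.

    x_lower: lower boundary of the modal class
    width:   class width i
    n1, n2, n3: frequencies of the preceding, modal and following classes
    """
    return x_lower + width * (n2 - n1) / (2 * n2 - n1 - n3)

# Hypothetical modal class [10, 15) with frequency 12, neighbours 8 and 6.
mode = interval_mode(x_lower=10, width=5, n1=8, n2=12, n3=6)
print(mode)
```

With these numbers the mode lies closer to the lower boundary of the modal class, because the preceding class is more populated than the following one.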

Median ($M_e$) is the value of the attribute that divides the distribution series into two parts of equal volume.

Sample mean is the arithmetic mean of the variants of the statistical series:

$\bar{x}_V = \frac{1}{n}\sum_i x_i n_i$

Sample variance is the arithmetic mean of the squared deviations of the variants from their mean:

$S_V^2 = \frac{1}{n}\sum_i (x_i - \bar{x}_V)^2 n_i$

Standard deviation is the square root of the sample variance:

$S_V = \sqrt{S_V^2}$
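The three characteristics can be computed from a frequency table with a short Python sketch. The variants and frequencies are hypothetical, and the variance uses the n denominator, following the definition above as an arithmetic mean of squared deviations.

```python
import math

# Hypothetical statistical series: variants x_i and their frequencies n_i.
variants = [1, 2, 3, 4]
freqs = [1, 3, 4, 2]

n = sum(freqs)                                             # sample size
mean = sum(x * f for x, f in zip(variants, freqs)) / n     # sample mean
variance = sum(f * (x - mean) ** 2 for x, f in zip(variants, freqs)) / n
std = math.sqrt(variance)                                  # standard deviation

print(f"mean = {mean:.2f}, variance = {variance:.2f}, std = {std:.2f}")
```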

6. Estimation of the parameters of the general population based on its sample (point and interval). Confidence interval and confidence probability.

Numerical values characterizing the population are called parameters.

Statistical estimation can be performed in two ways:

1) point estimation: the estimate is given by a single point (number);

2) interval estimation: based on the sample data, an interval is estimated in which the true value lies with a given probability.

A point estimate is an estimate given by a single number, which is determined from the sample.

A point estimate is called consistent if, as the sample size increases, the sample characteristic tends to the corresponding characteristic of the general population.

A point estimate is called efficient if it has the smallest variance of the sampling distribution compared to other similar estimates.

A point estimate is called unbiased if its mathematical expectation is equal to the estimated parameter for any sample size.

An unbiased estimate of the general mean (mathematical expectation) is the sample mean $\bar{x}_V$:

$\bar{x}_V = \frac{1}{n}\sum_i x_i n_i$,

where $x_i$ are the sample variants; $n_i$ is the frequency of occurrence of the variant $x_i$; n is the sample size.

An interval estimate is a numerical interval defined by two numbers, the boundaries of the interval, that contains an unknown parameter of the general population.

Confidence interval is an interval in which an unknown parameter of the population lies with a predetermined probability.

Confidence probability p is a probability such that an event of probability (1 − p) can be considered practically impossible. α = 1 − p is the significance level. Typically, probabilities close to 1 are used as confidence probabilities, so that the event that the interval covers the parameter is practically certain. These are p ≥ 0.95, p ≥ 0.99, p ≥ 0.999.

For a small sample (n < 30) of a normally distributed quantitative characteristic x, the confidence interval may take the form:

$\bar{x}_V - mt \le \bar{x}_G \le \bar{x}_V + mt$ (p ≥ 0.95),

where $\bar{x}_G$ is the general mean; $\bar{x}_V$ is the sample mean; t is the normalized value of the Student distribution with (n − 1) degrees of freedom, determined by the probability p that the general parameter falls into the given interval; m is the standard error of the sample mean.
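A small-sample interval of this form can be sketched in Python. The data are the ten settlement times from the first worked example; the value t = 2.262 is the two-sided Student value for p = 0.95 and 9 degrees of freedom, hard-coded here from a standard table rather than computed, and m = s/√n with the n − 1 denominator is an assumption.

```python
import math
import statistics

sample = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]   # settlement times, days

n = len(sample)
mean = statistics.mean(sample)
m = statistics.stdev(sample) / math.sqrt(n)     # standard error of the mean
t = 2.262   # Student value for p = 0.95, n - 1 = 9 degrees of freedom (from a table)

lower, upper = mean - t * m, mean + t * m
print(f"{lower:.1f} <= general mean <= {upper:.1f}")
```

The interval is wider than the one obtained with t = 2, because the Student coefficient accounts for the extra uncertainty of a small sample.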

Mathematical statistics is a science that, based on the methods of probability theory, deals with the systematization and processing of statistical data to obtain scientific and practical conclusions.

Statistical data is information about the number of objects that have certain characteristics.

A group of objects united according to some qualitative or quantitative characteristic is called a statistical population. The objects included in a population are called its elements, and their total number is its volume.

General population is the set of all conceivable observations that could be made under a given real set of conditions; more strictly, the general population is the random variable ξ together with the associated probability space $(\Omega, \mathcal{F}, P)$.

The distribution of the random variable ξ is called the population distribution (one speaks, for example, of a normally distributed or simply normal population).

For example, if a series of independent measurements of a random variable ξ is made, the general population is theoretically infinite (i.e. the general population is an abstract, conventional mathematical concept); if the number of defective products in a batch of N products is checked, then this batch is considered a finite general population of volume N.

In the case of socio-economic research, the general population of volume N can be the population of a city, region or country, and the measured characteristics can be income, expenses or the amount of savings of an individual person. If some attribute is of a qualitative nature (for example, gender, nationality, social status, occupation, etc.), but belongs to a finite set of options, then it can also be encoded as a number (as is often done in questionnaires).

If the number of objects N is large enough, then it is difficult and sometimes physically impossible to conduct a comprehensive survey (for example, check the quality of all cartridges). Then a limited number of objects are randomly selected from the entire population and subjected to study.

Sample population, or simply a sample of volume n, is a sequence $x_1, x_2, \ldots, x_n$ of independent identically distributed random variables, the distribution of each of which coincides with the distribution of the random variable ξ.

For example, the results of the first n measurements of a random variable ξ are customarily regarded as a sample of size n from an infinite population. The data obtained are called observations of the random variable ξ, and one also says that the random variable ξ "takes on the values" $x_1, x_2, \ldots, x_n$.


The main task of mathematical statistics is to draw scientifically based conclusions about the distribution of one or more unknown random variables or about their relationship with each other. The method in which conclusions about the numerical characteristics and the distribution law of a random variable (general population) are drawn from the properties and characteristics of a sample is called the sampling method.

In order for the characteristics of a random variable obtained by the sampling method to be objective, the sample must be representative, i.e. it must represent the studied quantity sufficiently well. By virtue of the law of large numbers, it can be argued that the sample will be representative if it is drawn randomly, i.e. all objects in the population have the same probability of being included in the sample. There are different types of sample selection for this purpose.

1. Simple random sampling is a selection in which objects are selected one at a time from the entire population.

2. Stratified (layered) selection consists in dividing the original population of volume N into subsets (strata) of volumes $N_1, N_2, \ldots, N_k$, so that $N_1 + N_2 + \ldots + N_k = N$. Once the strata are determined, a simple random sample of volume $n_1, n_2, \ldots, n_k$ is drawn from each of them. A special case of stratified selection is typical selection, in which objects are selected not from the entire population, but from each typical part of it.

Combined selection combines several types of selection at once, forming different phases of a sample survey. There are other sampling methods.
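Stratified selection as described above can be sketched in Python. The strata, their sizes and the per-stratum sample volumes are all hypothetical; the text does not prescribe how the volumes n_1, ..., n_k are chosen.

```python
import random

rng = random.Random(1)   # fixed seed so the sketch is reproducible

# Hypothetical population of N = 100 numbered objects split into three strata.
strata = {
    "small": list(range(0, 60)),
    "medium": list(range(60, 90)),
    "large": list(range(90, 100)),
}
# Hypothetical sample volumes n_1, n_2, n_3 for the strata.
sizes = {"small": 6, "medium": 3, "large": 1}

# A simple random sample without replacement is drawn inside each stratum.
sample = {name: rng.sample(units, sizes[name]) for name, units in strata.items()}

N = sum(len(units) for units in strata.values())
n = sum(len(chosen) for chosen in sample.values())
print(N, n)
```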

The sample is called repeated if the selected object is returned to the population before the next one is selected. The sample is called non-repeated if the selected object is not returned to the population. For a finite population, random selection without return makes the individual observations dependent at each step, while random equally probable selection with return makes the observations independent. In practice, we usually deal with non-repeated samples. However, when the population size N is many times larger than the sample size n (for example, hundreds or thousands of times), the dependence of the observations can be neglected.
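The difference between the two kinds of sample can be shown with Python's standard `random` module; the population of 20 numbered objects is hypothetical.

```python
import random

rng = random.Random(0)              # fixed seed so the sketch is reproducible
population = list(range(1, 21))     # hypothetical population, N = 20
n = 5

# Repeated sample: each object is returned before the next draw,
# so the same object may be selected more than once.
repeated = rng.choices(population, k=n)

# Non-repeated sample: objects are not returned, so all are distinct.
non_repeated = rng.sample(population, n)

print(repeated, non_repeated)
```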

Thus, a random sample $x_1, x_2, \ldots, x_n$ is the result of sequential and independent observations of a random variable ξ representing the general population, and all elements of the sample have the same distribution as the original random variable ξ.

We will call the distribution function $F_\xi(x)$ and the numerical characteristics of the random variable ξ theoretical, in contrast to sample characteristics, which are determined from the results of observations.

Let the sample be the result of n independent observations of a random variable ξ in which the value $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, so that $\sum_{i=1}^{k} n_i = n$ is the sample size. The number $n_i$, showing how many times the value $x_i$ appeared in n observations, is called the frequency of that value, and the ratio $n_i/n = w_i$ is its relative frequency. Obviously the numbers $w_i$ are rational and $\sum_{i=1}^{k} w_i = 1$.
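Frequencies and relative frequencies can be tabulated in Python as follows; the observations are hypothetical. Exact fractions are used so that the relative frequencies are visibly rational and sum to exactly 1.

```python
from collections import Counter
from fractions import Fraction

observations = [2, 3, 2, 5, 3, 2, 7, 3, 2, 5]   # hypothetical sample

n = len(observations)
freq = Counter(observations)                     # frequencies n_i
rel_freq = {x: Fraction(c, n) for x, c in sorted(freq.items())}   # w_i = n_i / n

for x, w in rel_freq.items():
    print(x, freq[x], w)
```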

A statistical population arranged in ascending order of a characteristic is called a variation series. Its members are denoted $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$ and are called variants. The variation series is called discrete if its members take specific isolated values. The statistical distribution of a sample of a discrete random variable ξ is the list of variants and their corresponding relative frequencies $w_i$. The resulting table is called a statistical series.

x_i    x_(1)    x_(2)    ...    x_(k)
w_i    w_1      w_2      ...    w_k

The smallest and largest values of the variation series are denoted by $x_{\min}$ and $x_{\max}$ and are called the extreme members of the variation series.

If a continuous random variable is studied, then grouping consists of dividing the interval of observed values into k partial intervals of equal length h and counting the number of observations that fall into each of them. The resulting numbers are taken as the frequencies $n_i$ (for some new, already discrete random variable). The midpoints of the intervals are usually taken as the new values of the variants $x_i$ (or the intervals themselves are indicated in the table). According to the Sturges formula, the recommended number of partition intervals is $k \approx 1 + \log_2 n$, and the length of each partial interval is $h = (x_{\max} - x_{\min})/k$. It is assumed that the entire interval has the form $[x_{\min}, x_{\max}]$.

Graphically, statistical series can be presented in the form of a polygon, a histogram or a graph of accumulated frequencies.

A frequency polygon is a broken line whose segments connect the points $(x_1, n_1), (x_2, n_2), \ldots, (x_k, n_k)$. A polygon of relative frequencies is a broken line whose segments connect the points $(x_1, w_1), (x_2, w_2), \ldots, (x_k, w_k)$. Polygons usually serve to represent a sample in the case of discrete random variables (Fig. 7.1.1).

Fig. 7.1.1.

A relative frequency histogram is a stepped figure consisting of rectangles whose bases are the partial intervals of length h and whose heights are equal to $w_i/h$.

A histogram is usually used to depict a sample in the case of continuous random variables. The area of the histogram is equal to one (Fig. 7.1.2). If the midpoints of the upper sides of the rectangles of a relative frequency histogram are connected, the resulting broken line forms a polygon of relative frequencies. Therefore, a histogram can be viewed as a graph of the empirical (sample) distribution density $f_n(x)$. If the theoretical distribution has a finite density, then the empirical density is an approximation of the theoretical one.

The graph of accumulated frequencies is a figure constructed similarly to a histogram, with the difference that the heights of the rectangles are given not by the simple relative frequencies but by the accumulated ones, i.e. the quantities $\sum_{j=1}^{i} w_j$. These values do not decrease, and the graph of accumulated frequencies has the form of a stepped "staircase" (from 0 to 1).

The graph of accumulated frequencies is used in practice to approximate the theoretical distribution function.

Task. A sample of 100 small enterprises in the region is analyzed. The purpose of the survey is to measure the ratio of borrowed to equity funds ($x_i$) at each i-th enterprise. The results are presented in Table 7.1.1.

Table 7.1.1. Ratios of debt to equity capital of enterprises.

5.56 5.45 5.48 5.45 5.39 5.37 5.46 5.59 5.61 5.31
5.46 5.61 5.11 5.41 5.31 5.57 5.33 5.11 5.54 5.43
5.34 5.53 5.46 5.41 5.48 5.39 5.11 5.42 5.48 5.49
5.36 5.40 5.45 5.49 5.68 5.51 5.50 5.68 5.21 5.38
5.58 5.47 5.46 5.19 5.60 5.63 5.48 5.27 5.22 5.37
5.33 5.49 5.50 5.54 5.40 5.58 5.42 5.29 5.05 5.79
5.79 5.65 5.70 5.71 5.85 5.44 5.47 5.48 5.47 5.55
5.67 5.71 5.73 5.05 5.35 5.72 5.49 5.61 5.57 5.69
5.54 5.39 5.32 5.21 5.73 5.59 5.38 5.25 5.26 5.81
5.27 5.64 5.20 5.23 5.33 5.37 5.24 5.55 5.60 5.51

Construct a histogram and graph of accumulated frequencies.

Solution. Let us build a grouped series of observations:

1. In the sample we find $x_{\min} = 5.05$ and $x_{\max} = 5.85$;

2. We divide the entire range into k equal intervals: $k \approx 1 + \log_2 100 = 7.64$; taking k = 8, the length of the interval is $h = (5.85 - 5.05)/8 = 0.1$.

Table 7.1.2. Grouped series of observations

No.  Interval     Midpoint x_i  w_i    Accumulated w_i  f_n(x)
1    5.05-5.15    5.1           0.05   0.05             0.5
2    5.15-5.25    5.2           0.08   0.13             0.8
3    5.25-5.35    5.3           0.12   0.25             1.2
4    5.35-5.45    5.4           0.20   0.45             2.0
5    5.45-5.55    5.5           0.26   0.71             2.6
6    5.55-5.65    5.6           0.15   0.86             1.5
7    5.65-5.75    5.7           0.10   0.96             1.0
8    5.75-5.85    5.8           0.04   1.00             0.4

Figs. 7.1.3 and 7.1.4, built from the data in Table 7.1.2, show the histogram and the graph of accumulated frequencies. The curves correspond to the density and the distribution function of the normal distribution fitted to the data.
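The columns of Table 7.1.2 can be reproduced with a short Python sketch. It starts from the relative frequencies listed in the table rather than recounting the raw data, and it assumes the number of intervals is obtained by rounding the Sturges value up.

```python
import math
from itertools import accumulate

n = 100
x_min, x_max = 5.05, 5.85

k = math.ceil(1 + math.log2(n))   # 7.64 rounded up to 8 intervals
h = (x_max - x_min) / k           # interval length, about 0.1

# Relative frequencies w_i as listed in Table 7.1.2.
w = [0.05, 0.08, 0.12, 0.20, 0.26, 0.15, 0.10, 0.04]

cumulative = [round(c, 2) for c in accumulate(w)]   # heights of the accumulated-frequency graph
density = [round(wi / h, 1) for wi in w]            # histogram heights f_n(x) = w_i / h
area = sum(wi / h * h for wi in w)                  # total area of the histogram

print(k, round(h, 2))
print(cumulative)
print(density)
```

The total area equals the sum of the relative frequencies, which is why the histogram can be read as an empirical density.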

Thus, the sample distribution is some approximation of the population distribution.
