What the coefficients of the regression equation show. The regression equation and the multiple regression equation

What is regression?

Consider two continuous variables x = (x1, x2, ..., xn) and y = (y1, y2, ..., yn).

Let's place the points on a two-dimensional scatter plot and say that we have a linear relation if the data are approximated by a straight line.

If we believe that y depends on x, and that changes in y are caused precisely by changes in x, we can determine the regression line (the regression of y on x), which best describes the linear relationship between these two variables.

The statistical use of the word regression comes from the phenomenon known as regression to the mean, attributed to Sir Francis Galton (1889).

He showed that although tall fathers tend to have tall sons, the average height of sons is shorter than that of their tall fathers. The average height of sons "regressed" and "moved backward" towards the average height of all fathers in the population. Thus, on average, tall fathers have shorter (but still quite tall) sons, and short fathers have taller (but still quite short) sons.

Regression line

A mathematical equation that estimates a simple (pairwise) linear regression line:

Y = a + bx

x is called the independent variable or predictor.

Y is the dependent variable or response variable. This is the value we expect for y (on average) if we know the value of x, i.e. it is the "predicted value of y".

  • a is the free term (intercept) of the estimated line; this is the value of Y when x = 0 (Fig. 1).
  • b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.
  • a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

Pairwise linear regression can be extended to include more than one independent variable; in this case it is known as multiple regression.

Fig.1. Linear regression line showing the intercept a and the slope b (the amount Y increases as x increases by one unit)
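As a quick illustration of these definitions (not part of the original example), here is a minimal Python sketch that fits a simple linear regression; the x and y values are made up.

```python
import numpy as np

# Made-up sample data (for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

# Least-squares estimates of the slope b and intercept a in y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print("predicted values:", a + b * x)
```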

Least squares method

We perform regression analysis using a sample of observations, where a and b are sample estimates of the true (population) parameters α and β, which determine the linear regression line in the population.

The simplest method for determining the coefficients a and b is the least squares method (OLS).

The fit is assessed by looking at the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals (vertical dotted lines) shown for each point.

Linear Regression Assumptions

So, for each observed value, the residual is equal to the difference between the observed value and the corresponding predicted value. Each residual can be positive or negative.

You can use residuals to test the following assumptions behind linear regression:

  • there is a linear relationship between x and y;
  • the residuals are normally distributed with a mean of zero;
  • the variance of the residuals is constant (the same for all values of x).

If the assumptions of linearity, normality, and/or constant variance are questionable, we can transform x or y and calculate a new regression line for which these assumptions are satisfied (for example, use a logarithmic transformation, etc.).
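A minimal sketch of how these residual checks might be done in Python, assuming the made-up x and y arrays from the earlier sketch; it is illustrative rather than a complete diagnostic workflow.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)                      # observed y minus predicted y
print("mean of residuals:", residuals.mean())    # close to zero for OLS

# Shapiro-Wilk test for normality of the residuals
stat, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")
```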

Anomalous values (outliers) and influential points

An "influential" observation, if omitted, changes one or more model parameter estimates (ie, slope or intercept).

An outlier (an observation that is inconsistent with the majority of values ​​in a data set) can be an "influential" observation and can be easily detected visually by inspecting a bivariate scatterplot or residual plot.

For both outliers and "influential" observations (points), the model is fitted with and without them, and attention is paid to the changes in the estimates (regression coefficients).

When conducting an analysis, you should not automatically discard outliers or influential points, since simply ignoring them can affect the results obtained. Always study the reasons for these outliers and analyze them.

Linear regression hypothesis

When constructing linear regression, the null hypothesis is tested that the general slope of the regression line β is equal to zero.

If the slope of the line is zero, there is no linear relationship between x and y: a change in x does not affect y.

To test the null hypothesis that the true slope is zero, you can use the following algorithm:

Calculate the test statistic equal to the ratio t = b / SE(b), which follows a t-distribution with n − 2 degrees of freedom, where the standard error of the coefficient b is

SE(b) = s_res / sqrt( Σ (xi − x̄)² ),

and s_res² = Σ (yi − ŷi)² / (n − 2) is the estimate of the dispersion (variance) of the residuals.

Typically, if the achieved significance level is p < 0.05, the null hypothesis is rejected.

The 95% confidence interval for the slope is

b ± t* × SE(b),

where t* is the percentage point of the t-distribution with n − 2 degrees of freedom which gives a two-sided probability of 0.05.

This is the interval that contains the general slope with a probability of 95%.

For large samples, say n > 100, we can approximate t* with a value of 1.96 (that is, the test statistic will tend to be normally distributed).
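A hedged Python sketch of this test for the slope, using the same made-up x, y data as above; scipy provides the t-distribution quantiles.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

s_res = np.sqrt(np.sum(residuals ** 2) / (n - 2))      # residual standard deviation
se_b = s_res / np.sqrt(np.sum((x - x.mean()) ** 2))    # standard error of the slope

t_stat = b / se_b                                      # H0: beta = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)

t_crit = stats.t.ppf(0.975, df=n - 2)                  # two-sided 95% point
ci = (b - t_crit * se_b, b + t_crit * se_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, 95% CI for slope: {ci}")
```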

Assessing the quality of linear regression: coefficient of determination R²

Because of the linear relationship between x and y, we expect that y changes as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it is the quantity r², the square of the correlation coefficient); it allows you to subjectively assess the quality of the regression equation.

The difference 100% − R² represents the percentage of variance that cannot be explained by the regression.

There is no formal test for R²; we must rely on subjective judgment to determine the goodness of fit of the regression line.
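A short sketch of computing R² directly from its definition, again on the made-up data used above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

ss_res = np.sum((y - y_hat) ** 2)       # residual (unexplained) sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f} ({r2 * 100:.1f}% of variance explained)")
```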

Applying a Regression Line to Forecast

You can use a regression line to predict a value of y from a value of x within the observed range (never extrapolate beyond these limits).

We predict the mean value of y for observations that have a particular value of x by plugging that value of x into the equation of the regression line.

So, if we predict y as ŷ = a + bx, we use this predicted value and its standard error to estimate a confidence interval for the true mean value of y in the population.

Repeating this procedure for different values of x allows you to construct confidence limits for the whole line. This is the band or area that contains the true line, for example at the 95% confidence level.
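A sketch, under the usual simple-regression assumptions, of the confidence interval for the mean of y at a chosen x0 inside the observed range; x0 and the data are made up.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
s_res = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))

x0 = 3.5                       # must lie within the observed range of x
y0 = a + b * x0                # predicted mean of y at x0

# Standard error of the estimated mean of y at x0
se_mean = s_res * np.sqrt(1 / n + (x0 - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2))
t_crit = stats.t.ppf(0.975, df=n - 2)
print(f"predicted mean at x0={x0}: {y0:.2f} ± {t_crit * se_mean:.2f} (95% CI)")
```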

Simple regression designs

Simple regression designs contain one continuous predictor. If there are 3 observations with values of the predictor P equal to, say, 7, 4 and 9, and the design includes a first-order effect of P, then the design matrix X will be

X =
  [ 1  7 ]
  [ 1  4 ]
  [ 1  9 ]

and the regression equation using P for X1 is

Y = b0 + b1·P

If a simple regression design contains a higher-order effect of P, for example a quadratic effect, then the values in column X1 of the design matrix will be raised to the second power:

X =
  [ 1  49 ]
  [ 1  16 ]
  [ 1  81 ]

and the equation will take the form

Y = b0 + b1·P²

Sigma-restricted and overparameterized coding methods do not apply to simple regression designs and other designs containing only continuous predictors (since there are simply no categorical predictors). Regardless of the coding method chosen, the values of the continuous variables are entered as-is and used as the values of the X variables; no recoding is performed. In addition, when describing regression designs, you can omit consideration of the design matrix X and work only with the regression equation.
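A small numpy sketch of these two design matrices for the predictor values 7, 4 and 9 (the column of ones represents the intercept).

```python
import numpy as np

P = np.array([7.0, 4.0, 9.0])          # predictor values from the example

# First-order design matrix: a column of ones (intercept) and P itself
X_linear = np.column_stack([np.ones_like(P), P])

# Quadratic design: the predictor column is raised to the second power
X_quadratic = np.column_stack([np.ones_like(P), P ** 2])

print(X_linear)
print(X_quadratic)
```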

Example: Simple Regression Analysis

This example uses the data presented in the table:

Fig. 3. Table of initial data.

The data were compiled from a comparison of the 1960 and 1970 censuses for 30 randomly selected counties. County names are used as observation names. Information about each variable is presented below:

Fig. 4. Table of variable specifications.

Research problem

For this example, we will analyze the correlates of poverty, that is, the variables that predict the percentage of families below the poverty line. Therefore, we will treat variable 3 (Pt_Poor) as the dependent variable.

We can put forward a hypothesis: changes in population size and the percentage of families that are below the poverty line are related. It seems reasonable to expect that poverty leads to out-migration, so there would be a negative correlation between the percentage of people below the poverty line and population change. Therefore, we will treat variable 1 (Pop_Chng) as a predictor variable.

View results

Regression coefficients

Fig. 5. Regression coefficients of Pt_Poor on Pop_Chng.

At the intersection of the Pop_Chng row and the Param. column, the unstandardized coefficient for the regression of Pt_Poor on Pop_Chng is -0.40374. This means that for every one-unit decrease in population change there is an increase in the poverty rate of .40374. The upper and lower (default) 95% confidence limits for this unstandardized coefficient do not include zero, so the regression coefficient is significant at the p < .05 level. Note the standardized coefficient, which for simple regression designs is also the Pearson correlation coefficient: it equals -.65, which means that for every standard deviation decrease in population change there is an increase of .65 standard deviations in the poverty rate.
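A hedged sketch of how this regression might be reproduced in Python with statsmodels; the file name poverty.csv and the exact column layout are assumptions, not part of the original example.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file; the original example uses census data for 30 counties
# with the variables Pop_Chng (predictor) and Pt_Poor (dependent variable).
data = pd.read_csv("poverty.csv")

X = sm.add_constant(data["Pop_Chng"])        # intercept + predictor
model = sm.OLS(data["Pt_Poor"], X).fit()

print(model.params)       # unstandardized coefficients (intercept, slope)
print(model.conf_int())   # 95% confidence limits for the coefficients

# Standardized coefficient: for a simple regression it equals the Pearson r
beta = model.params["Pop_Chng"] * data["Pop_Chng"].std() / data["Pt_Poor"].std()
print("standardized coefficient:", beta)
```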

Variable distribution

Correlation coefficients can become significantly overestimated or underestimated if large outliers are present in the data. Let's study the distribution of the dependent variable Pt_Poor by district. To do this, let's build a histogram of the variable Pt_Poor.

Fig. 6. Histogram of the Pt_Poor variable.

As you can see, the distribution of this variable differs markedly from the normal distribution. However, although two counties (the two rightmost columns) have a higher percentage of families below the poverty line than would be expected under a normal distribution, they appear to be "within the range".

Fig. 7. Histogram of the Pt_Poor variable.

This judgment is somewhat subjective. A rule of thumb is to treat observations as outliers if they do not fall within the interval (mean ± 3 standard deviations). In this case, it is worth repeating the analysis with and without the outliers to make sure that they do not have a major effect on the correlation between population change and the poverty rate.
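A small sketch of this rule of thumb in Python; the Pt_Poor values below are made up for illustration.

```python
import numpy as np

# Hypothetical values of Pt_Poor; replace with the real column from the data set
pt_poor = np.array([12.5, 8.3, 22.1, 15.0, 9.7, 30.2, 11.4, 13.8])

mean, std = pt_poor.mean(), pt_poor.std(ddof=1)
lower, upper = mean - 3 * std, mean + 3 * std

outliers = pt_poor[(pt_poor < lower) | (pt_poor > upper)]
print("interval:", (round(lower, 1), round(upper, 1)))
print("flagged outliers:", outliers)
```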

Scatterplot

If there is an a priori hypothesis about the relationship between the given variables, it is useful to test it on the corresponding scatterplot.

Fig. 8. Scatter diagram.

The scatterplot shows a clear negative correlation (-.65) between the two variables. It also shows the 95% confidence interval for the regression line, i.e., there is a 95% probability that the regression line lies between the two dotted curves.

Significance criteria

Fig. 9. Table containing the significance criteria.

The test for the Pop_Chng regression coefficient confirms that Pop_Chng is strongly related to Pt_Poor, p < .001.

Bottom line

This example showed how to analyze a simple regression design. Interpretations of unstandardized and standardized regression coefficients were also presented. The importance of studying the response distribution of a dependent variable is discussed, and a technique for determining the direction and strength of the relationship between a predictor and a dependent variable is demonstrated.

Calculating Regression Equation Coefficients

The system of equations (7.8) based on the available experimental data (ED) cannot be solved unambiguously, since the number of unknowns is always greater than the number of equations. To overcome this problem, additional assumptions are needed. Common sense dictates that the coefficients of the polynomial should be chosen so as to ensure a minimum error of approximation of the ED. Various measures can be used to evaluate approximation errors; the root-mean-square error is the most widely used. On its basis, a special method for estimating the coefficients of regression equations has been developed - the method of least squares (OLS). This method gives maximum likelihood estimates of the unknown coefficients of the regression equation when the errors are normally distributed, but it can also be applied for any other distribution of the factors.

OLS is based on the following assumptions:

· the errors and the factors are independent, and therefore uncorrelated, i.e. it is assumed that the mechanisms generating the noise are not related to the mechanisms generating the factor values;

· the mathematical expectation of the error ε must be equal to zero (the constant component is included in the coefficient a0); in other words, the error is a centered quantity;

· the sample estimate of the error variance should be minimal.

Let's consider the use of OLS for linear regression of standardized values. For centered quantities u_j the coefficient a0 is equal to zero, and the linear regression equation takes the form

ŷ = a2·u2 + a3·u3 + … + at·ut .  (7.9)

A special sign “^” has been introduced here to denote the values ​​of the indicator calculated using the regression equation, in contrast to the values ​​obtained from observational results.

Using the least squares method, the values of the coefficients of the regression equation are chosen so as to provide an unconditional minimum of the expression

w = Σ (yi − ŷi)² .  (7.10)

The minimum is found by equating to zero all partial derivatives of expression (7.10) taken with respect to the unknown coefficients, and solving the system of equations

∂w/∂a_j = 0,  j = 2, …, t.  (7.11)

Consistently carrying out the transformations and using the previously introduced estimates of the correlation coefficients, we obtain

r_{y,j} = a2·r_{j,2} + a3·r_{j,3} + … + at·r_{j,t},  j = 2, …, t.  (7.12)

So, we have obtained t − 1 linear equations, which allow us to uniquely calculate the values a2, a3, …, at.
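Assuming the system has the correlation form sketched above, here is a small numpy example that solves it; the correlation values are made up.

```python
import numpy as np

# Hypothetical correlations among three standardized predictors u2, u3, u4
R = np.array([[1.00, 0.30, 0.20],
              [0.30, 1.00, 0.40],
              [0.20, 0.40, 1.00]])
# Hypothetical correlations of the indicator y with each predictor
r_y = np.array([0.60, 0.50, 0.35])

# Solve R @ a = r_y for the standardized regression coefficients a2, a3, a4
a = np.linalg.solve(R, r_y)
print("standardized coefficients:", np.round(a, 3))
```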

If the linear model is inaccurate or the parameters are measured inaccurately, the least squares method still allows us to find the values of the coefficients for which the linear model describes the real object best in the sense of the chosen root-mean-square criterion.

When there is only one parameter, the linear regression equation becomes

ŷ = a2·u2 .

The coefficient a2 is found from the equation

r_{y,2} = a2·r_{2,2} .

Then, given that r_{2,2} = 1, the required coefficient is

a2 = r_{y,2} .  (7.13)

Relationship (7.13) confirms the previously stated statement that the correlation coefficient is a measure of the linear relationship between two standardized parameters.

Substituting the found value of the coefficient a2 into the expression for w and taking into account the properties of centered and normalized quantities, we obtain the minimum value of this function, equal to 1 − r²_{y,2}. The value 1 − r²_{y,2} is called the residual variance of the random variable y relative to the random variable u2. It characterizes the error that arises when the indicator is replaced by the function of the parameter ŷ = a2·u2. Only when |r_{y,2}| = 1 is the residual variance equal to zero, and hence there is no error in approximating the indicator by a linear function.

Moving from the centered and normalized values of the indicator and the parameter back to the original values, we can obtain

ŷ = ȳ + r_{y,2}·(σ_y / σ_2)·(x2 − x̄2) .  (7.14)

This equation is also linear with respect to the correlation coefficient. It is easy to see that centering and normalization for linear regression makes it possible to reduce the dimension of the system of equations by one, i.e. simplify the solution to the problem of determining the coefficients, and give the coefficients themselves a clear meaning.

The use of least squares for nonlinear functions is practically no different from the scheme considered (only the coefficient a0 in the original equation is not equal to zero).

For example, suppose it is necessary to determine the coefficients of the parabolic regression

ŷ = a0 + a1·x + a2·x² .

The sample error variance is

w = Σ (yi − a0 − a1·xi − a2·xi²)² .

Based on it, equating the partial derivatives of w with respect to a0, a1 and a2 to zero, we can obtain the following system of equations. After transformations, the system takes the form

Σ yi = n·a0 + a1·Σ xi + a2·Σ xi² ,
Σ xi·yi = a0·Σ xi + a1·Σ xi² + a2·Σ xi³ ,
Σ xi²·yi = a0·Σ xi² + a1·Σ xi³ + a2·Σ xi⁴ .

Taking into account the properties of the moments of standardized quantities, this system can be rewritten in terms of the corresponding moments of the parameters.

The determination of nonlinear regression coefficients is based on solving a system of linear equations. To do this, you can use universal packages of numerical methods or specialized packages for processing statistical data.
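A minimal numpy sketch of fitting such a second-degree (parabolic) regression by solving the corresponding linear least-squares problem; the data are invented.

```python
import numpy as np

# Hypothetical data for a second-degree (parabolic) regression
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 3.1, 5.4, 9.2, 14.8, 22.1])

# Build the linear system for a0, a1, a2 and solve it by least squares
A = np.vander(x, N=3, increasing=True)          # columns: 1, x, x^2
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares solution
a0, a1, a2 = coeffs
print(f"y ≈ {a0:.3f} + {a1:.3f}·x + {a2:.3f}·x²")
```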

As the degree of the regression equation increases, so does the order of the distribution moments of the parameters used to determine the coefficients. Thus, to determine the coefficients of a second-degree regression equation, moments of the parameter distribution up to the fourth order inclusive are used. It is known that the accuracy and reliability of moment estimates from a limited sample of ED decrease sharply as their order increases, so using polynomials of degree higher than two in regression equations is usually inappropriate.

The quality of the resulting regression equation is assessed by how closely the observed values of the indicator agree with the values predicted by the regression equation at the given points of the parameter space. If they are close, the problem of regression analysis can be considered solved. Otherwise, you should change the regression equation (choose a different degree of polynomial or a different type of equation altogether) and repeat the calculations for estimating the parameters.

If there are several indicators, the problem of regression analysis is solved independently for each of them.

Analyzing the essence of the regression equation, the following points should be noted. The considered approach does not provide separate (independent) assessment of coefficients - a change in the value of one coefficient entails a change in the values ​​of others. The obtained coefficients should not be considered as the contribution of the corresponding parameter to the value of the indicator. The regression equation is just a good analytical description of the existing ED, and not a law describing the relationship between the parameters and the indicator. This equation is used to calculate the values ​​of the indicator in a given range of parameter changes. It is of limited suitability for calculations outside this range, i.e. it can be used for solving interpolation problems and, to a limited extent, for extrapolation.



The main reason for the inaccuracy of the forecast is not so much the uncertainty of extrapolating the regression line as the significant variation of the indicator due to factors not taken into account in the model. The forecasting ability is limited by the requirement that the parameters not taken into account in the model, and the nature of the influence of the factors that are taken into account, remain stable. If the external environment changes sharply, the compiled regression equation loses its meaning. You cannot substitute into the regression equation values of the factors that differ significantly from those represented in the ED. It is recommended not to go beyond one third of the range of variation of the parameter for both the maximum and minimum values of the factor.

The forecast obtained by substituting the expected value of the parameter into the regression equation is a point forecast. The probability that such a forecast will be realized exactly is negligible, so it is advisable to determine the confidence interval of the forecast. For individual values of the indicator, the interval should take into account both the error in the position of the regression line and the deviations of individual values from this line. The average error of predicting the indicator y for factor x will be

σ_ŷ = sqrt( σ²_ȳ(x_k) + σ²_ε ) ,

where σ²_ȳ(x_k) is the average error of the position of the regression line in the population at x = x_k;

σ²_ε is the estimate of the variance of the deviations of the indicator from the regression line in the population;

x_k is the expected value of the factor.

The confidence limits of the forecast, for example for the regression equation (7.14), are determined by the expression ŷ(x_k) ± t_α·σ_ŷ, where t_α is the Student coefficient for the chosen confidence level.

A negative free term a0 in the regression equation for the original variables means that the domain of existence of the indicator does not include zero values of the parameters. If a0 > 0, then the domain of existence of the indicator includes zero values of the parameters, and the coefficient itself characterizes the average value of the indicator in the absence of the parameters' influence.

Problem 7.2. Construct a regression equation for channel capacity based on the sample specified in Table 7.1.

Solution. In relation to the specified sample, the construction of the analytical dependence was mainly carried out within the framework of correlation analysis: the throughput depends only on the signal-to-noise ratio parameter. It remains to substitute the previously calculated parameter values ​​into expression (7.14). The equation for capacity will take the form

ŷ = 26.47 − 0.93 × 41.68 × 5.39/6.04 + 0.93 × 5.39/6.04 × X = −8.121 + 0.830X.

The calculation results are presented in Table 7.5.

Table 7.5

N  | Channel capacity Y | Signal-to-noise ratio X | Function value ŷ | Error ε
1  | 26.37 | 41.98 | 26.72 | -0.35
2  | 28.00 | 43.83 | 28.25 | -0.25
3  | 27.83 | 42.83 | 27.42 | 0.41
4  | 31.67 | 47.28 | 31.12 | 0.55
5  | 23.50 | 38.75 | 24.04 | -0.54
6  | 21.04 | 35.12 | 21.03 | 0.01
7  | 16.94 | 32.07 | 18.49 | -1.55
8  | 37.56 | 54.25 | 36.90 | 0.66
9  | 18.84 | 32.70 | 19.02 | -0.18
10 | 25.77 | 40.51 | 25.50 | 0.27
11 | 33.52 | 49.78 | 33.19 | 0.33
12 | 28.21 | 43.84 | 28.26 | -0.05
13 | 28.76 | 44.03 |  |
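As a hedged illustration, the fitted equation can be wrapped in a small Python function to reproduce the predicted values and errors in the table (the prediction for the last row was not given in the original).

```python
# The regression equation for channel capacity obtained above
def capacity_estimate(x):
    """Predicted channel capacity for a given signal-to-noise ratio x."""
    return -8.121 + 0.830 * x

observed_y = 28.76
x_value = 44.03
predicted = capacity_estimate(x_value)
print(f"predicted: {predicted:.2f}, error: {observed_y - predicted:.2f}")
```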

Fundamentals of data analysis.

A typical problem that arises in practice is identifying dependencies or relationships between variables. In real life, variables are related to each other. For example, in marketing, the amount of money spent on advertising affects sales; in medical research, the dose of a drug affects the effect; in textile production, the quality of fabric dyeing depends on temperature, humidity and other parameters; in metallurgy, the quality of steel depends on special additives, etc. Finding dependencies in data and using them for your own purposes is the task of data analysis.

Let's say you observe the values ​​of a pair of variables X and Y and want to find the relationship between them. For example:

X - number of visitors to the online store, Y - sales volume;

X - plasma panel diagonal, Y - price;

X is the purchase price of the share, Y is the sale price;

X is the cost of aluminum on the London Stock Exchange, Y is sales volumes;

X - the number of breaks in oil pipelines, Y - the amount of losses;

X is the “age” of the aircraft, Y is the cost of its repair;

X - sales area, Y - store turnover;

X is income, Y is consumption, etc.

Variable X is usually called an independent variable, variable Y is called a dependent variable. Sometimes variable X is called a predictor, variable Y is called a response.



We want to determine exactly how Y depends on X, or to predict what the values of Y will be for given values of X. In this case, we observe the values of X and the corresponding values of Y. The task is to build a model that allows one to determine Y for values of X different from the observed ones. In statistics, such problems are solved within the framework of regression analysis.

There are various regression models, determined by the choice of the function f(x1, x2, …, xm):

1) Simple linear regression: f(x) = b0 + b1·x

2) Multiple regression: f(x1, x2, …, xm) = b0 + b1·x1 + b2·x2 + … + bm·xm

3) Polynomial regression: f(x) = b0 + b1·x + b2·x² + … + bk·x^k

The coefficients b0, b1, … are called regression parameters.

The main feature of regression analysis: with its help, you can obtain specific information about what form and nature the relationship between the variables under study has.

Sequence of stages of regression analysis

1. Problem formulation. At this stage, preliminary hypotheses about the dependence of the phenomena under study are formed.

2. Definition of dependent and independent (explanatory) variables.

3. Collection of statistical data. Data must be collected for each of the variables included in the regression model.

4. Formulation of a hypothesis about the form of connection (simple or multiple, linear or nonlinear).

5. Determination of the regression function (consists in calculating the numerical values ​​of the parameters of the regression equation)

6. Assessing the accuracy of regression analysis.

7. Interpretation of the results obtained. The obtained results of regression analysis are compared with preliminary hypotheses. The correctness and credibility of the results obtained are assessed.

8. Prediction of unknown values ​​of the dependent variable.

Using regression analysis, it is possible to solve the problem of forecasting and classification. Predicted values ​​are calculated by substituting the values ​​of explanatory variables into the regression equation. The classification problem is solved in this way: the regression line divides the entire set of objects into two classes, and that part of the set where the function value is greater than zero belongs to one class, and the part where it is less than zero belongs to another class.

The main tasks of regression analysis: establishing the form of the dependence, determining the regression function, estimating the unknown values ​​of the dependent variable.

Linear regression

Linear regression reduces to finding an equation of the form

ŷ = a + bx, or y = a + bx + e .  (1.1)

x is called the independent variable or predictor.

Y is the dependent or response variable. This is the value we expect for y (on average) if we know the value of x, i.e. it is the "predicted value of y".

· a is the free term (intercept) of the estimated line; this is the value of Y when x = 0 (Fig. 1).

· b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit.

· a and b are called the regression coefficients of the estimated line, although this term is often used only for b.

· e represents unobservable random variables with mean 0, also called observation errors; it is assumed that the errors are not correlated with each other.

Fig.1. Linear regression line showing the intercept a and the slope b (the amount Y increases as x increases by one unit)

An equation of this form allows one, for given values of the factor X, to obtain the theoretical values of the resultant characteristic by substituting the actual values of the factor X into it. On the graph, the theoretical values lie on the regression line.

In most cases (if not always) there is a certain scatter of observations relative to the regression line.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main tendency, of the relationship.

An important stage of regression analysis is determining the type of function with which the dependence between characteristics is characterized. The main basis for choosing the type of equation should be a meaningful analysis of the nature of the dependence being studied and its mechanism.

To find the parameters a and b of the regression equation we use the least squares method (OLS). When applying OLS to find the function that best fits the empirical data, it is required that the sum of the squared deviations (residuals) of the empirical points from the theoretical regression line be minimal.

The fit is assessed by looking at the residuals (the vertical distance of each point from the line, i.e. residual = observed y − predicted y, Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals (vertical dotted lines) shown for each point.

After simple transformations we obtain the system of normal equations of the least squares method for determining the values of the parameters a and b of the linear correlation equation from empirical data:

Σy = n·a + b·Σx,
Σxy = a·Σx + b·Σx² .  (1.2)

Solving this system of equations for b, we get the following formula for this parameter:

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² ,  (1.3)

where x̄ and ȳ are the average values of x and y.

The value of the parameter a is obtained by dividing both sides of the first equation of this system by n:

a = ȳ − b·x̄ .

Parameter b in the equation is called the regression coefficient. In the presence of a direct correlation, the regression coefficient is positive, and in the case of an inverse correlation, the regression coefficient is negative.

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

The regression coefficient shows by how much, on average, the value of the resulting characteristic y changes when the factor characteristic X changes by one unit; geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the X axis (for the equation ŷ = a + bx).

Because of the linear relationship between x and y, we expect that y changes as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

A quantitative characteristic of the degree of linear dependence between the random variables X and Y is the correlation coefficient r (an indicator of the closeness of the relationship between two characteristics).

The correlation coefficient:

r = ( n·Σxy − Σx·Σy ) / sqrt( (n·Σx² − (Σx)²)·(n·Σy² − (Σy)²) ) ,

where x is the value of the factor characteristic;

y is the value of the resulting characteristic;

n is the number of data pairs.
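A small numpy sketch of this computational formula, checked against numpy's built-in corrcoef; the data are made up.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1])
n = len(x)

r = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / np.sqrt(
    (n * np.sum(x ** 2) - np.sum(x) ** 2) * (n * np.sum(y ** 2) - np.sum(y) ** 2)
)
print(f"r = {r:.3f}")
print("check with numpy:", np.corrcoef(x, y)[0, 1])
```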


Fig. 3 - Options for the location of the “cloud” of points

If the correlation coefficient r = 1, then there is a functional linear relationship between X and Y, and all the points (xi, yi) lie on a straight line.

If the correlation coefficient r = 0 (r ≈ 0), then X and Y are said to be uncorrelated, i.e. there is no linear relationship between them.

The relationship between characteristics (on the Chaddock scale) can be strong, medium or weak. The closeness of the relationship is determined by the value of the correlation coefficient, which can take values from −1 to +1 inclusive. The criteria for assessing the closeness of the relationship are shown in Fig. 4.

Fig. 4. Quantitative criteria for assessing the closeness of the relationship.

Any relationship between variables has two important properties: magnitude and reliability. The stronger the relationship between two variables, the greater the magnitude of the relationship and the easier it is to predict the value of one variable from the value of the other variable. The magnitude of dependence is easier to measure than reliability.

The reliability of the dependence is no less important than its magnitude. This property is related to the representativeness of the sample under study. The reliability of a relationship characterizes how likely it is that this relationship will be found again on other data.

As the magnitude of the dependence of variables increases, its reliability usually increases.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it is the quantity r², the square of the correlation coefficient); it allows you to subjectively assess the quality of the regression equation.

The coefficient of determination measures the proportion of variance around the mean that is “explained” by the constructed regression. The coefficient of determination ranges from 0 to 1. The closer the coefficient of determination is to 1, the better the regression “explains” the dependence in the data; a value close to zero means the poor quality of the constructed model. The coefficient of determination can be as close as possible to 1 if all predictors are different.

The difference 100% − R² represents the percentage of variance that cannot be explained by the regression.

Multiple regression

Multiple regression is used in situations where, from the many factors influencing the effective attribute, it is impossible to single out one dominant factor and it is necessary to take into account the influence of several factors. For example, the volume of production is determined by the size of fixed and working capital, the number of personnel, the level of management, etc., the level of demand depends not only on the price, but also on the funds available to the population.

The main goal of multiple regression is to build a model with several factors and determine the influence of each factor separately, as well as their joint impact on the indicator being studied.

Multiple regression is a relationship equation with several independent variables:

y = b0 + b1·x1 + b2·x2 + … + bm·xm + ε .

The regression coefficient is the absolute value by which, on average, the value of one characteristic changes when another, related characteristic changes by a specified unit of measurement. The relationship between y and x determines the sign of the regression coefficient b (if b > 0 the relationship is direct, otherwise it is inverse). The linear regression model is the most commonly used and most studied in econometrics.
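A minimal sketch of estimating such a multiple regression by least squares in numpy; the factors and values are invented.

```python
import numpy as np

# Hypothetical data: output volume y explained by two factors x1 and x2
x1 = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 23.0])
x2 = np.array([3.0, 4.0, 4.5, 5.0, 6.0, 6.5])
y  = np.array([25.0, 29.0, 34.0, 40.0, 44.0, 50.0])

X = np.column_stack([np.ones_like(x1), x1, x2])   # design matrix with intercept
b, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares estimates b0, b1, b2
print("b0, b1, b2 =", np.round(b, 3))
```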

1.4. Approximation error. The quality of the regression equation can be evaluated using the mean absolute approximation error, A = (1/n)·Σ |(y − ŷ)/y|·100%. The predicted values of the factors are substituted into the model to obtain predictive point estimates of the indicator being studied.

Regression coefficient

Let us now consider problem 1 of the regression analysis tasks given on pp. 300-301. One of the mathematical results of linear regression theory says that the least squares estimator is the unbiased estimator with the minimum variance in the class of all linear unbiased estimators. For example, you can calculate the average number of colds at certain values of the average monthly air temperature in the autumn-winter period.

Regression line and regression equation

The regression sigma is used to construct a regression scale, which reflects the deviation of the values of the resulting characteristic from its average value plotted on the regression line. The values x1, x2, x3 and the corresponding average values y1, y2, y3, as well as the smallest (y − σ_ry/x) and largest (y + σ_ry/x) values of y, are used to construct the regression scale. Conclusion: within the limits of the calculated values of body weight, the regression scale thus makes it possible to determine it for any other value of height, or to assess the individual development of the child.

In matrix form, the regression equation (RE) is written as Y = BX + U, where U is the error matrix.



The correlation field (scatterplot) method is used to visualize the form of the connection between the studied economic indicators. Based on the correlation field, we can hypothesize (for the general population) that the relationship between all possible values of X and Y is linear.

Reasons for the existence of a random error: 1) failure to include significant explanatory variables in the regression model; 2) aggregation of variables. In our example, the connection is direct. To predict the dependent (resultant) attribute, it is necessary to know the predicted values of all the factors included in the model.

Comparison of correlation and regression coefficients

With a probability of 95% we can guarantee that the values of Y for an unlimited number of observations will not go beyond the found intervals. If the calculated value of the criterion with (n − m − 1) degrees of freedom is greater than the tabulated value at the given significance level, then the model is considered significant. This ensures that there is no correlation between the deviations and, in particular, between adjacent deviations.

Regression coefficients and their interpretation

In most cases, positive autocorrelation is caused by the directional constant influence of some factors not taken into account in the model. Negative autocorrelation essentially means that a positive deviation is followed by a negative one and vice versa.


2. Inertia. Many economic indicators (inflation, unemployment, GNP, etc.) have a certain cyclical nature associated with the undulation of business activity. In many production and other areas, economic indicators respond to changes in economic conditions with a delay (time lag).


The linear regression equation has the form y = bx + a + ε, where ε is a random error (deviation, disturbance). If the approximation error exceeds 15%, it is not advisable to use the equation as a regression. By substituting the appropriate x values into the regression equation, we can determine the aligned (predicted) values of the performance indicator y(x) for each observation.

Regression coefficients show the intensity of the influence of the factors on the performance indicator. If preliminary standardization of the factor indicators is carried out, then b0 is equal to the average value of the effective indicator in the aggregate, and the coefficients b1, b2, …, bn show by how many units the level of the effective indicator deviates from its average value if the corresponding factor indicator deviates from its mean by one standard deviation. Thus, regression coefficients characterize the degree of significance of individual factors for increasing the level of the performance indicator. Specific values of the regression coefficients are determined from empirical data using the least squares method (as a result of solving systems of normal equations).

The regression line is the line that most accurately reflects the distribution of the experimental points on a scatter diagram and whose slope characterizes the relationship between two interval variables.

The regression line is most often sought in the form of a linear function (linear regression) that best approximates the desired curve. This is done using the least squares method, in which the sum of the squared deviations of the actually observed values yi from their estimates ŷi is minimized (meaning estimates obtained using a straight line that claims to represent the desired regression relationship):

S = Σ (yi − ŷi)² → min,  i = 1, …, M

(M is the sample size). This approach is based on the known fact that the sum appearing in the above expression takes on its minimum value precisely when the least squares estimates of the line are used.
57. Main tasks of correlation theory.

Correlation theory is an apparatus that evaluates the closeness of connections between phenomena that are not only in cause-and-effect relationships. Using correlation theory, stochastic, but not causal, relationships are assessed. The author, together with M. L. Lukatskaya, made an attempt to obtain estimates for causal relationships. However, the question of the cause-and-effect relationships of phenomena, of how to identify cause and effect, remains open, and it seems that at the formal level it is fundamentally unsolvable.

Correlation theory and its application to production analysis.

Correlation theory, which is one of the branches of mathematical statistics, allows one to make reasonable assumptions about the possible limits within which, with a certain degree of reliability, the parameter under study will be located if other statistically related parameters receive certain values.

In correlation theory, it is customary to distinguish two main tasks.

The first task of correlation theory is to establish the form of the correlation, i.e. the type of the regression function (linear, quadratic, etc.).

The second task of correlation theory is to assess the closeness (strength) of the correlation.

The closeness of the correlation (dependence) of Y on X is assessed by the amount of scatter of the Y values around the conditional mean. A large scatter indicates a weak dependence of Y on X, a small scatter indicates a strong dependence.
58. Correlation table and its numerical characteristics.

In practice, as a result of independent observations of the quantities X and Y, one deals, as a rule, not with the entire set of all possible pairs of values of these quantities, but only with a limited sample from the general population; the sample size n is defined as the number of pairs available in the sample.

Let the quantity X take in the sample the values x1, x2, …, xm, where m is the number of distinct values of this quantity, and in the general case each of them can be repeated in the sample. Let the quantity Y take in the sample the values y1, y2, …, yk, where k is the number of distinct values of this quantity, and in the general case each of them can also be repeated in the sample. In this case, the data are entered into a table taking the frequency of occurrence into account. Such a table with grouped data is called a correlation table.

The first stage of statistical processing of the results is the compilation of a correlation table.

Y\X    x1     x2     …     xm     n_y
y1     n11    n21    …     nm1    n_y1
y2     n12    n22    …     nm2    n_y2
…      …      …      …     …      …
yk     n1k    n2k    …     nmk    n_yk
n_x    n_x1   n_x2   …     n_xm   n

The first row of the main part of the table lists, in ascending order, all the values of the quantity X found in the sample. The first column likewise lists, in ascending order, all the values of the quantity Y found in the sample. At the intersection of the corresponding rows and columns are the frequencies n_ij (i = 1, 2, …, m; j = 1, 2, …, k), equal to the number of occurrences of the pair (x_i; y_j) in the sample. For example, the frequency n_12 is the number of occurrences of the pair (x_1; y_2) in the sample.

Also, n_xi = Σj n_ij (1 ≤ i ≤ m) is the sum of the elements of the i-th column, n_yj = Σi n_ij (1 ≤ j ≤ k) is the sum of the elements of the j-th row, and Σi n_xi = Σj n_yj = n.
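A quick sketch of building such a correlation (frequency) table with pandas; the paired observations are made up.

```python
import pandas as pd

# Hypothetical paired observations of X and Y (repeated values allowed)
x = [1, 2, 2, 3, 3, 3, 1, 2]
y = [10, 10, 20, 20, 20, 10, 10, 20]

# Correlation (frequency) table: rows are Y values, columns are X values,
# with marginal totals n_x, n_y and the overall sample size n
table = pd.crosstab(index=pd.Series(y, name="Y"),
                    columns=pd.Series(x, name="X"),
                    margins=True, margins_name="n")
print(table)
```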

Analogues of the formulas, obtained from the correlation table data, have the form:

x̄ = (1/n)·Σi n_xi·xi ,  ȳ = (1/n)·Σj n_yj·yj ,

r_B = ( Σi Σj n_ij·xi·yj − n·x̄·ȳ ) / ( n·σ_x·σ_y ).


59. Empirical and theoretical regression lines.

The theoretical regression line can in this case be calculated from the results of individual observations. To solve the system of normal equations, we need the sums of the same data: x, y, xy and x². We have data on the volume of cement production and the volume of fixed production assets in 1958. The task is to study the relationship between the volume of cement production (in physical terms) and the volume of fixed assets. [1]

The less the theoretical regression line (calculated from the equation) deviates from the actual (empirical) one, the smaller the average approximation error.

The process of finding the theoretical regression line involves fitting the empirical regression line using the least squares method.

The process of finding a theoretical regression line is called alignment of the empirical regression line and consists of selecting and justifying the type of curve and calculating the parameters of its equation.

Empirical regression is built according to analytical or combinational grouping data and represents the dependence of the group average values ​​of the result trait on the group average values ​​of the factor trait. The graphical representation of empirical regression is a broken line made up of points, the abscissas of which are the group average values ​​of the factor trait, and the ordinates are the group average values ​​of the result trait. The number of points is equal to the number of groups in the grouping.

The empirical regression line reflects the main trend of the relationship under consideration. If the empirical regression line approaches a straight line in appearance, then we can assume the presence of a linear correlation between the characteristics. And if the connection line approaches the curve, then this may be due to the presence of a curvilinear correlation relationship.
60. Sample correlation and regression coefficients.

If the dependence between the characteristics on the graph indicates a linear correlation, the correlation coefficient r is calculated; it allows you to assess the closeness of the relationship between the variables and also to find out what proportion of the changes in a characteristic is due to the influence of the main characteristic and what part is due to the influence of other factors. The coefficient varies from −1 to +1. If r = 0, then there is no connection between the characteristics. The equality r = 0 only indicates the absence of a linear correlation dependence, but not the absence of a correlation altogether, much less of a statistical dependence. If r = ±1, then this means the presence of a complete (functional) connection. In this case, all observed values lie on the regression line, which is a straight line.
The practical significance of the correlation coefficient is determined by its squared value, called the coefficient of determination.
Regression is approximated (approximately described) by a linear function y = kX + b. For the regression of Y on X, the regression equation is ŷ_x = r_yx·X + b (1). The slope r_yx of the straight-line regression of Y on X is called the regression coefficient of Y on X.

If equation (1) is found using sample data, it is called the sample regression equation. Accordingly, r_yx is the sample regression coefficient of Y on X, and b is the sample free term (intercept) of the equation. The regression coefficient measures the variation in Y per unit variation in X. The parameters of the regression equation (the coefficients r_yx and b) are found using the least squares method.
61. Assessing the significance of the correlation coefficient and the closeness of the correlation in the general population

The significance of correlation coefficients is checked using Student's test:

t = r / m_r ,

where m_r is the root-mean-square error of the correlation coefficient, which is determined by the formula:

m_r = sqrt( (1 − r²) / (n − 2) ) .

If the calculated value is higher than the tabulated one, we can conclude that the value of the correlation coefficient is significant. The tabulated values of t are found from the table of Student's t-test values. In this case, the number of degrees of freedom (V = n − 1) and the confidence level (in economic calculations, usually 0.05 or 0.01) are taken into account. In our example, the number of degrees of freedom is n − 1 = 40 − 1 = 39. At the confidence level p = 0.05, t = 2.02. Since the actual value is in all cases higher than the tabulated t, the relationship between the resultant and factor indicators is reliable, and the magnitude of the correlation coefficients is significant.

The estimate of the correlation coefficient calculated from a limited sample almost always differs from zero. But this does not mean that the correlation coefficient of the population is also different from zero. It is required to evaluate the significance of the sample value of the coefficient or, in accordance with the formulation of statistical hypothesis testing, to test the hypothesis that the correlation coefficient is equal to zero. If the hypothesis H0 that the correlation coefficient is equal to zero is rejected, then the sample coefficient is significant and the corresponding values are related by a linear relationship. If the hypothesis H0 is accepted, then the coefficient estimate is not significant, and the values are not linearly related to each other (if, for physical reasons, the factors could be related, it is better to say that this relationship has not been established from the available ED). Testing the hypothesis about the significance of the correlation coefficient estimate requires knowledge of the distribution of this random variable. The distribution of the value ρ_jk has been studied only for the special case when the random variables U_j and U_k are distributed according to the normal law.

As the criterion for testing the null hypothesis H0, the random variable t = ρ̂_jk·sqrt(n − 2) / sqrt(1 − ρ̂²_jk) is applied. If the modulus of the correlation coefficient is relatively far from unity, then, if the null hypothesis is true, the value t is distributed according to Student's law with n − 2 degrees of freedom. The competing hypothesis H1 corresponds to the statement that the value ρ_jk is not equal to zero (greater or less than zero). Therefore, the critical region is two-sided.
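A hedged sketch of this significance check in Python with made-up values of r and n; scipy supplies the critical value of Student's t.

```python
import numpy as np
from scipy import stats

r = 0.65        # hypothetical sample correlation coefficient
n = 40          # hypothetical sample size

t_stat = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
t_table = stats.t.ppf(0.975, df=n - 2)       # two-sided 5% critical value

print(f"t = {t_stat:.2f}, critical value = {t_table:.2f}")
print("significant" if abs(t_stat) > t_table else "not significant")
```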
62. Calculation of the sample correlation coefficient and construction of the sample straight line regression equation.

The sample correlation coefficient is found by the formula

r_B = Σ (xi − x̄)(yi − ȳ) / (n·σ_x·σ_y) ,

where σ_x and σ_y are the sample standard deviations of the values of x and y.

The sample correlation coefficient shows the closeness of the linear relationship between x and y: the closer |r_B| is to unity, the stronger the linear relationship between x and y.

Simple linear regression finds a linear relationship between one input variable and one output variable. To do this, a regression equation is determined - a model that reflects the dependence of the values of Y (the dependent variable) on the values of x (the independent variable), described by the equation:

Y = A0 + A1·x ,

where A0 is the free term of the regression equation;

A1 is the coefficient of the regression equation.

Then a corresponding straight line is constructed, called a regression line. Coefficients A0 and A1, also called model parameters, are selected in such a way that the sum of the squared deviations of the points corresponding to real data observations from the regression line is minimal. The coefficients are selected using the least squares method. In other words, simple linear regression describes a linear model that best approximates the relationship between one input variable and one output variable.
