Conditions necessary to determine regression coefficients. Regression in Excel: equation, examples. Linear regression

Regression analysis is a statistical research method that shows how a particular parameter depends on one or more independent variables. In the pre-computer era its use was quite difficult, especially with large volumes of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in a couple of minutes. Below are specific examples from the field of economics.

Types of Regression

The concept itself was introduced into mathematics in 1886 (by Francis Galton). Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • exponential with an arbitrary base (y = a·b^x);
  • logarithmic.

Example 1

Let's consider the problem of determining the dependence of the number of employees who quit on the average salary at six industrial enterprises.

Task. At six enterprises, the average monthly salary and the number of employees who quit of their own free will were analyzed. In tabular form we have:

Number of people who quit | Salary

… | 30,000 rubles
… | 35,000 rubles
… | 40,000 rubles
… | 45,000 rubles
… | 50,000 rubles
… | 55,000 rubles
… | 60,000 rubles

For the task of determining the dependence of the number of quitting workers on the average salary at the six enterprises, the regression model has the form of the equation Y = a0 + a1x1 + … + akxk, where the xi are the influencing variables, the ai are the regression coefficients, and k is the number of factors.

For this problem, Y is the indicator of employees who quit, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the Excel spreadsheet processor

Excel has built-in functions that can be applied to existing tabular data for regression analysis. However, for these purposes it is better to use the very useful "Analysis ToolPak" add-in. To activate it you need:

  • from the “File” tab go to the “Options” section;
  • in the window that opens, select the “Add-ins” line;
  • click on the “Go” button located below, to the right of the “Manage” drop-down list;
  • check the box next to the name “Analysis ToolPak” and confirm your actions by clicking “OK”.

If everything is done correctly, the desired button will appear on the right side of the “Data” tab, located above the Excel worksheet.


Now that we have all the necessary tools at hand to carry out econometric calculations, we can begin to solve our problem. To do this:

  • click on the “Data Analysis” button;
  • in the window that opens, click on the “Regression” button;
  • in the tab that appears, enter the range of values for Y (the number of quitting employees) and for X (their salaries);
  • confirm the action by clicking “OK”.

As a result, the program will automatically fill a new worksheet with the regression analysis output. Note! Excel allows you to set the output location manually: for example, it could be the same sheet where the Y and X values are located, or even a new workbook specifically designed for storing such data.

Analysis of regression results for R-squared

In Excel, the output obtained for the example under consideration has the following form:

First of all, pay attention to the R-squared value, the coefficient of determination. In this example R-squared = 0.755 (75.5%), i.e., the calculated parameters of the model explain 75.5% of the dependence between the parameters under consideration. The higher the value of the coefficient of determination, the more suitable the selected model is for the specific task. A model is considered to describe the real situation correctly when the R-squared value is above 0.8; if R-squared < 0.5, then such a regression analysis in Excel cannot be considered reasonable.

Analysis of Coefficients

The number 64.1428 shows what the value of Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors that are not described in the specific model.

The next coefficient, −0.16285 (located in cell B18), shows the weight of the influence of variable X on Y. Within the model under consideration, the average monthly salary affects the number of quitters with a weight of −0.16285, i.e., the degree of its influence is quite small. The “−” sign indicates that the coefficient is negative. This is expected, since everyone knows that the higher the salary at the enterprise, the fewer people express a desire to terminate their employment contract or quit.
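
The same fit can be reproduced outside Excel in a few lines. Below is a minimal Python sketch; the salary values come from the table above, while the quit counts are hypothetical placeholders (the second column of the source table is not reproduced here), so the printed coefficients will not match 64.1428 and −0.16285 exactly:

```python
import numpy as np

salary = np.array([30, 35, 40, 45, 50, 55, 60])  # thousands of rubles, from the table above
quits = np.array([60, 59, 58, 57, 55, 56, 54])   # hypothetical counts, NOT from the source

# Least-squares line Y = a0 + a1*X, the same model Excel's Regression tool fits
a1, a0 = np.polyfit(salary, quits, 1)

# Coefficient of determination R^2 for a one-factor linear model
r = np.corrcoef(salary, quits)[0, 1]
print(f"intercept a0 = {a0:.4f}, slope a1 = {a1:.5f}, R^2 = {r**2:.3f}")
```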

Multiple regression

This term refers to a relationship equation with several independent variables of the form:

y = f(x1, x2, …, xm) + ε, where y is the resultant characteristic (dependent variable), and x1, x2, …, xm are factor characteristics (independent variables).

Parameter Estimation

For multiple regression (MR), parameter estimation is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b1x1 + … + bmxm + ε, we construct the system of normal equations:

Σy = na + b1Σx1 + … + bmΣxm,
Σx1y = aΣx1 + b1Σx1² + … + bmΣx1xm,
…
Σxmy = aΣxm + b1Σxmx1 + … + bmΣxm².

To understand the principle of the method, consider the two-factor case, described by the formula y = a + b1x1 + b2x2 + ε. From here we obtain the relation between the ordinary and the standardized coefficients,

bi = βi·(σy / σxi),

where σ is the standard deviation of the corresponding feature indicated by the index.

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation

ty = β1·tx1 + β2·tx2 + … + βm·txm + ε,

in which ty, tx1, …, txm are standardized variables with mean 0 and standard deviation 1, and the βi are the standardized regression coefficients.

Please note that all the βi in this case are standardized and centered, therefore their comparison with each other is considered correct and acceptable. In addition, it is customary to screen out factors by discarding those with the smallest βi values.

Problem Using Linear Regression Equation

Suppose we have a table of the price dynamics of a specific product N over the past 8 months. It is necessary to decide on the advisability of purchasing a batch of it at a price of 1850 rubles/t.

Month number | Month name | Price of product N

1 | … | 1750 rubles per ton
2 | … | 1755 rubles per ton
3 | … | 1767 rubles per ton
4 | … | 1760 rubles per ton
5 | … | 1770 rubles per ton
6 | … | 1790 rubles per ton
7 | … | 1810 rubles per ton
8 | … | 1840 rubles per ton

To solve this problem in Excel, use the “Data Analysis” tool, already known from the example presented above. Next, select the “Regression” section and set the parameters. Remember that in the “Input Y Range” field a range of values must be entered for the dependent variable (in this case, the product prices in specific months of the year), and in the “Input X Range” field the independent variable (the month number). Confirm the action by clicking “OK”. On a new sheet (if so specified) we obtain the regression output.

From this output we build a linear equation of the form y = ax + b, where a is the coefficient in the row labeled with the name of the variable (month number) and b is the coefficient in the “Y-intercept” row of the regression results sheet. Thus, the linear regression (LR) equation for this task is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54

Analysis of results

To decide whether the resulting linear regression equation is adequate, the coefficients of multiple correlation (MCC) and determination are used, as well as the Fisher test and the Student t test. In the Excel spreadsheet with regression results, they are called multiple R, R-squared, F-statistic and t-statistic, respectively.

The MCC R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong connection between the variables “month number” and “price of product N in rubles per 1 ton”. However, the nature of this relationship remains unknown.

The coefficient of determination R² (the square of the MCC) is a numerical characteristic of the proportion of the total scatter of the experimental data, i.e., of the values of the dependent variable, that is accounted for by the linear regression equation. In the problem under consideration this value equals 84.8%, i.e., the statistical data are described with a high degree of accuracy by the obtained LR equation.

The F-statistic, also called the Fisher test, is used to evaluate the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student’s test) helps to evaluate the significance of the coefficient of the unknown and of the free term of the linear relationship. If the value of the t-test > tcr, then the hypothesis that the corresponding term of the linear equation is insignificant is rejected.

In the problem under consideration, Excel gives for the free term t = 169.20903 and p = 2.89E-12, i.e., the probability that the hypothesis of the insignificance of the free term is wrongly rejected is practically zero. For the coefficient of the unknown, t = 5.79405 and p = 0.001158; in other words, the probability of wrongly rejecting the hypothesis of the insignificance of this coefficient is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.
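
The whole calculation can also be checked outside Excel. A minimal Python sketch, assuming (consistently with the coefficients 11.714 and 1727.54 above) that the eight prices correspond to month numbers 1 through 8:

```python
import numpy as np

month = np.arange(1, 9)                                              # month numbers 1..8
price = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840])   # rubles per ton

slope, intercept = np.polyfit(month, price, 1)                       # ~11.714 and ~1727.54
r = np.corrcoef(month, price)[0, 1]
print(f"y = {slope:.3f}*x + {intercept:.2f}, R^2 = {r**2:.3f}")      # R^2 ~ 0.848

# Extrapolated price for the next (ninth) month
print(f"forecast for month 9: {slope*9 + intercept:.2f} rubles/t")
```

Under this assumption the extrapolated price for the ninth month is about 1833 rubles/t, which is the figure to weigh against the offered 1850 rubles/t.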

The problem of the feasibility of purchasing a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Let's consider a specific application problem.

The management of the NNN company must decide on the advisability of purchasing a 20% stake in MMM JSC. The cost of the package (SP) is 70 million US dollars. NNN specialists have collected data on similar transactions. It was decided to evaluate the value of the block of shares according to such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover volume (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter of the enterprise's wage arrears (VZP), in thousands of US dollars, is used.

Solution using Excel spreadsheet processor

First of all, you need to create a table of the source data. Then:

  • call the “Data Analysis” window;
  • select the “Regression” section;
  • in the “Input Y Range” field, enter the range of values of the dependent variable from column G;
  • click on the red arrow icon to the right of the “Input X Range” field and select on the sheet the range of all values from columns B, C, D, F.

Mark the “New worksheet” option and click “OK”.

We obtain the regression analysis output for the given problem.

Study of results and conclusions

We “assemble” the regression equation from the rounded data presented above on the Excel worksheet:

SP = 0.103·SOF + 0.541·VO − 0.031·VK + 0.405·VD + 0.691·VZP − 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103·x1 + 0.541·x2 − 0.031·x3 + 0.405·x4 + 0.691·x5 − 265.844

Data for MMM JSC are presented in the table:

Substituting them into the regression equation gives a figure of 64.72 million US dollars. This means that the shares of MMM JSC are not worth purchasing, since their asking price of 70 million US dollars is quite inflated.
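
The substitution step is easy to script. A sketch with placeholder inputs — the MMM JSC data table is not reproduced here, so the values below are hypothetical (the actual data, per the text, yield 64.72):

```python
def sp_estimate(sof: float, vo: float, vk: float, vd: float, vzp: float) -> float:
    """Value of the stake from the fitted equation (millions of USD; vzp in thousands of USD)."""
    return 0.103*sof + 0.541*vo - 0.031*vk + 0.405*vd + 0.691*vzp - 265.844

# Hypothetical inputs for illustration only
print(f"estimated SP = {sp_estimate(sof=250.0, vo=400.0, vk=80.0, vd=120.0, vzp=30.0):.2f} mln USD")
```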

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The Excel examples discussed above will help you solve practical problems in the field of econometrics.

Concept of regression. The dependence between the variables x and y can be described in different ways. In particular, any form of connection can be expressed by a general equation y = f(x), where y is treated as the dependent variable, or function, of the other, independent variable x, called the argument. The correspondence between an argument and a function can be specified by a table, a formula, a graph, etc. A change in a function depending on a change in one or more arguments is called regression. All the means used to describe correlations constitute the content of regression analysis.

To express regression, one uses correlation equations (regression equations), empirical and theoretically calculated regression series, their graphs (called regression lines), as well as linear and nonlinear regression coefficients.

Regression indicators express the correlation relationship bilaterally, taking into account the change in the average values of the characteristic Y as the values xi of the characteristic X change and, conversely, showing the change in the average values of the characteristic X for changed values yi of the characteristic Y. The exception is time series (dynamics series), which show changes in characteristics over time; the regression of such series is one-sided.

There are many different forms and types of correlations. The task comes down to identifying the form of the connection in each specific case and expressing it with the appropriate correlation equation, which allows us to anticipate possible changes in one characteristic Y based on known changes in another, X, related to the first correlationally.

12.1 Linear regression

Regression equation. The results of observations carried out on a particular biological object with respect to the correlated characteristics x and y can be represented by points on a plane in a system of rectangular coordinates. The result is a scatter diagram that allows one to judge the form and closeness of the relationship between the varying characteristics. Quite often this relationship looks like a straight line or can be approximated by a straight line.

A linear relationship between the variables x and y is described by the general equation ŷ = a + bx1 + cx2 + dx3 + …, where a, b, c, d, … are parameters of the equation that determine the relationships between the arguments x1, x2, x3, …, xm and the function.

In practice, not all possible arguments are taken into account, but only some; in the simplest case, only one:

ŷ = a + bx. (1)

In the linear regression equation (1), a is the free term, and the parameter b determines the slope of the regression line relative to the rectangular coordinate axes. In analytical geometry this parameter is called the slope, and in biometrics the regression coefficient. A visual representation of this parameter and of the position of the regression lines of Y on X and of X on Y in the rectangular coordinate system is given in Fig. 1.

Fig. 1. Regression lines of Y on X and X on Y in a system of rectangular coordinates

The regression lines, as shown in Fig. 1, intersect at the point O(x̄, ȳ), corresponding to the arithmetic mean values of the mutually correlated characteristics Y and X. When constructing regression graphs, the values of the independent variable X are plotted along the abscissa axis, and the values of the dependent variable, or function, Y along the ordinate axis. The line AB passing through the point O(x̄, ȳ) corresponds to a complete (functional) relationship between the variables Y and X, when the correlation coefficient r = ±1. The stronger the connection between Y and X, the closer the regression lines are to AB; conversely, the weaker the connection between these quantities, the more distant the regression lines are from AB. If there is no connection between the characteristics, the regression lines are at right angles to each other and r = 0.

Since regression indicators express the correlation relationship bilaterally, regression equation (1) should be written in two forms:

Yx = a_yx + b_yx·X  and  Xy = a_xy + b_xy·Y.

The first formula determines the average values of Y as the characteristic X changes by one unit of measure; the second determines the average values of X per unit change of the characteristic Y.

Regression coefficient. The regression coefficient shows how much, on average, the value of one characteristic y changes when the measure of another characteristic X, correlated with Y, changes by one unit. This indicator is determined by the formula

b_yx = r_xy·(s_y / s_x).

Here the values of s are multiplied by the size of the class interval λ if they were found from variation series or correlation tables.

The regression coefficient can also be calculated without first computing the standard deviations s_y and s_x, using the formula

b_yx = (Σxy − (Σx·Σy)/n) / (Σx² − (Σx)²/n).

If the correlation coefficient is unknown, the regression coefficient is determined directly from the data:

b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)².

Relationship between regression and correlation coefficients. Comparing formulas (11.1) (topic 11) and (12.5), we see that their numerators are the same, which indicates a connection between these indicators. This relationship is expressed by the equality

r_xy = √(b_yx · b_xy). (6)

Thus, the correlation coefficient is equal to the geometric mean of the coefficients b_yx and b_xy. Formula (6) allows, firstly, the correlation coefficient r_xy to be determined from the known values of the regression coefficients b_yx and b_xy and, secondly, the correctness of the calculation of this correlation indicator r_xy between the varying characteristics X and Y to be checked.

Like the correlation coefficient, the regression coefficient characterizes only a linear relationship and carries a plus sign for a positive relationship and a minus sign for a negative one.

Determination of linear regression parameters. It is known that the sum of the squared deviations of the observations xi from their mean is a minimum, i.e., Σ(xi − x̄)² = min. This theorem forms the basis of the least squares method. With respect to linear regression [see formula (1)], the requirement of this theorem is satisfied by a certain system of equations called normal:

Σy = na + bΣx,
Σxy = aΣx + bΣx².

A joint solution of these equations for the parameters a and b leads to the following results:

b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²);

a = ȳ − b·x̄.

Considering the two-way nature of the relationship between the variables Y and X, the formula for determining the parameter a should be expressed in two forms:

a_yx = ȳ − b_yx·x̄  and  a_xy = x̄ − b_xy·ȳ. (7)

The parameter b, or regression coefficient, is determined by the formulas given above, for example b_yx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and, symmetrically, b_xy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)².

Construction of empirical regression series. In the presence of a large number of observations, regression analysis begins with the construction of empirical regression series. An empirical regression series is formed by calculating, for the values of one varying characteristic X, the average values of the other characteristic Y correlated with X. In other words, the construction of empirical regression series comes down to finding group averages of Y from the corresponding values of the characteristic X.

An empirical regression series is a double series of numbers that can be represented by points on a plane; by connecting these points with straight-line segments, an empirical regression line is obtained. Empirical regression series, and especially their graphs, called regression lines, give a clear idea of the form and closeness of the correlation between the varying characteristics.

Alignment of empirical regression series. Graphs of empirical regression series turn out, as a rule, to be broken rather than smooth lines. This is explained by the fact that, along with the main causes that determine the general pattern in the variability of the correlated characteristics, their magnitude is affected by numerous secondary causes that produce random fluctuations in the nodal points of the regression. To reveal the main tendency (trend) of the conjugate variation of the correlated characteristics, the broken lines must be replaced with smooth regression lines. This process of replacing broken lines with smooth ones is called alignment of empirical series and of regression lines.

Graphic alignment method. This is the simplest method, requiring no computational work. Its essence boils down to the following: the empirical regression series is depicted as a graph in a rectangular coordinate system; then the midpoints of the regression are visually estimated, and a solid line is drawn through them with a ruler or French curve. The disadvantage of this method is obvious: it does not exclude the influence of the researcher's individual traits on the result of the alignment. Therefore, in cases where higher accuracy is needed when replacing broken regression lines with smooth ones, other methods of aligning empirical series are used.

Moving average method. The essence of this method comes down to the sequential calculation of arithmetic means over two or three adjacent terms of the empirical series. It is especially convenient when the empirical series has many terms, so that the loss of the two extreme ones, inevitable with this method of alignment, does not noticeably affect its structure.
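
A minimal sketch of such smoothing with a three-term window (the series here is the product-N price series used earlier; the window size is an illustrative choice):

```python
def moving_average(series, window=3):
    """Sequentially average each run of `window` adjacent terms of the series."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# The smoothed series is shorter by (window - 1) terms, i.e. the extremes are lost
print(moving_average([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840]))
```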

Least squares method. This method was proposed at the beginning of the 19th century by A. M. Legendre and, independently of him, by C. Gauss. It allows the most accurate alignment of empirical series. As shown above, it is based on the requirement that the sum of the squared deviations of the observations xi from their mean be a minimum; hence the name of the method, which is used not only in ecology but also in technology. The least squares method is objective and universal; it is used in a wide variety of cases when finding empirical equations for regression series and determining their parameters.

The requirement of the least squares method is that the theoretical points of the regression line must be obtained in such a way that the sum of the squared deviations of the empirical observations yi from these points is minimal, i.e.

Σ(yi − ŷi)² = min.

By calculating the minimum of this expression in accordance with the principles of mathematical analysis and transforming it in a certain way, one obtains a system of so-called normal equations, in which the unknowns are the required parameters of the regression equation, and the known coefficients are determined by the empirical values of the characteristics, usually the sums of their values and of their cross products.

Multiple linear regression. A relationship between several variables is usually expressed by a multiple regression equation, which can be linear or nonlinear. In its simplest form, multiple regression is expressed as an equation with two independent variables (x, z):

ŷ = a + bx + cz, (10)

where a is the free term of the equation, and b and c are its parameters. To find the parameters of equation (10) by the least squares method, the following system of normal equations is used:

Σy = na + bΣx + cΣz,
Σxy = aΣx + bΣx² + cΣxz,
Σzy = aΣz + bΣxz + cΣz².
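
In practice this system need not be solved by hand. A sketch of the same two-factor fit using NumPy's least-squares solver (the observations are made-up illustrative numbers):

```python
import numpy as np

# Made-up observations of y together with two factors x and z
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
z = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([4.1, 4.9, 7.2, 7.8, 10.1, 10.9])

# Design matrix [1, x, z] => coefficients [a, b, c] of y = a + b*x + c*z
A = np.column_stack([np.ones_like(x), x, z])
(a, b, c), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```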

Dynamics series. Alignment of series. Changes in characteristics over time form so-called time series, or dynamics series. A characteristic feature of such series is that the independent variable X here is always the time factor, and the dependent variable Y is a changing characteristic. In contrast to ordinary regression series, the relationship between the variables X and Y is one-sided, since the time factor does not depend on the variability of the characteristics. Despite these features, dynamics series can be likened to regression series and processed by the same methods.

Like regression series, empirical dynamics series are influenced not only by the main factors but also by numerous secondary (random) ones that obscure the main tendency in the variability of the characteristics, which in the language of statistics is called the trend.

Analysis of time series begins with identifying the shape of the trend. To do this, the time series is depicted as a line graph in a rectangular coordinate system, with time points (years, months and other units of time) plotted along the abscissa axis and the values of the dependent variable Y along the ordinate axis. If there is a linear relationship between the variables X and Y (a linear trend), the least squares method is the most appropriate for aligning the time series, using a regression equation written in the form of deviations of the terms of the series from the arithmetic means:

ŷ = ȳ + b·(x − x̄),

where b is the linear regression parameter.

Numerical characteristics of dynamics series. The main generalizing numerical characteristics of dynamics series include the geometric mean and an arithmetic mean close to it. They characterize the average rate at which the value of the dependent variable changes over certain periods of time.

The variability of the members of a dynamics series is assessed by the standard deviation. When choosing regression equations to describe time series, the shape of the trend is taken into account, which can be linear (or reducible to linear) or nonlinear. The correctness of the choice of regression equation is usually judged by the closeness of the empirically observed and the calculated values of the dependent variable. A more accurate solution to this problem is the method of regression analysis of variance (topic 12, paragraph 4).

Correlation of time series. It is often necessary to compare the dynamics of parallel time series related to each other by certain general conditions, for example, to find out the relationship between agricultural production and the growth of livestock numbers over a certain period of time. In such cases, the characteristic of the relationship between the variables X and Y is the correlation coefficient r_xy (in the presence of a linear trend).

It is known that the trend of time series is, as a rule, obscured by fluctuations of the series of the dependent variable Y. This gives rise to a twofold problem: measuring the dependence between the compared series without excluding the trend, and measuring the dependence between neighboring members of the same series with the trend excluded. In the first case, the indicator of the closeness of the connection between the compared time series is the correlation coefficient (if the relationship is linear); in the second, the autocorrelation coefficient. These indicators have different meanings, although they are calculated using the same formulas (see topic 11).

It is easy to see that the value of the autocorrelation coefficient is affected by the variability of the series members of the dependent variable: the less the series members deviate from the trend, the higher the autocorrelation coefficient, and vice versa.

Fundamentals of data analysis.

A typical problem that arises in practice is identifying dependencies or relationships between variables. In real life, variables are related to each other. For example, in marketing, the amount of money spent on advertising affects sales; in medical research, the dose of a drug influences the effect; in textile production, the quality of fabric dyeing depends on temperature, humidity and other parameters; in metallurgy, the quality of steel depends on special additives, and so on. Finding dependencies in data and using them for one's own purposes is the task of data analysis.

Let's say you observe the values ​​of a pair of variables X and Y and want to find the relationship between them. For example:

X - number of visitors to the online store, Y - sales volume;

X - plasma panel diagonal, Y - price;

X is the purchase price of the share, Y is the sale price;

X is the cost of aluminum on the London Stock Exchange, Y is sales volumes;

X - the number of breaks in oil pipelines, Y - the amount of losses;

X is the “age” of the aircraft, Y is the cost of its repair;

X - sales area, Y - store turnover;

X is income, Y is consumption, etc.

Variable X is usually called an independent variable, variable Y is called a dependent variable. Sometimes variable X is called a predictor, variable Y is called a response.



We want to determine exactly how Y depends on X, or to predict what the values of Y will be for given values of X. Here we observe values of X and the corresponding values of Y. The task is to build a model that allows one to determine Y for values of X different from those observed. In statistics, such problems are solved within the framework of regression analysis.

There are various regression models, determined by the choice of the function f(x1, x2, …, xm):

1) Simple linear regression: f(x) = b0 + b1x

2) Multiple regression: f(x1, …, xm) = b0 + b1x1 + … + bmxm

3) Polynomial regression: f(x) = b0 + b1x + b2x² + … + bmx^m

The coefficients bi are called regression parameters.

The main feature of regression analysis is that with its help one can obtain specific information about the form and nature of the relationship between the variables under study.

Sequence of stages of regression analysis

1. Problem formulation. At this stage, preliminary hypotheses about the dependence of the phenomena under study are formed.

2. Definition of dependent and independent (explanatory) variables.

3. Collection of statistical data. Data must be collected for each of the variables included in the regression model.

4. Formulation of a hypothesis about the form of connection (simple or multiple, linear or nonlinear).

5. Determination of the regression function (calculating the numerical values of the parameters of the regression equation).

6. Assessing the accuracy of regression analysis.

7. Interpretation of the results obtained. The obtained results of regression analysis are compared with preliminary hypotheses. The correctness and credibility of the results obtained are assessed.

8. Prediction of unknown values ​​of the dependent variable.

Using regression analysis, one can solve problems of forecasting and classification. Predicted values are calculated by substituting the values of the explanatory variables into the regression equation. The classification problem is solved as follows: the regression line divides the whole set of objects into two classes; the part of the set where the function value is greater than zero belongs to one class, and the part where it is less than zero to the other.

The main tasks of regression analysis: establishing the form of the dependence, determining the regression function, estimating the unknown values ​​of the dependent variable.

Linear regression

Linear regression reduces to finding an equation of the form

Y = a + bx + ε, or ŷ = a + bx. (1.1)

x is called the independent variable, or predictor.

Y is the dependent, or response, variable; ŷ is the value we expect for y (on average) if we know the value of x, i.e., it is the "predicted value of y".

· a is the free term (intercept) of the fitted line; it is the value of Y when x = 0 (Fig. 1).

· b is the slope, or gradient, of the fitted line; it represents the amount by which Y increases on average if we increase x by one unit.

· a and b are called the regression coefficients of the fitted line, although this term is often used only for b.

· e denotes unobservable random variables with mean 0, also called observation errors; it is assumed that the errors are not correlated with each other.

Fig. 1. Linear regression line showing the intercept a and the slope b (the amount by which Y increases as x increases by one unit)

The equation ŷ = a + bx allows one, for given values of the factor X, to obtain theoretical values of the resultant characteristic by substituting actual values of the factor into it. On the graph, the theoretical values form the regression line.

In most cases (if not always) there is a certain scatter of observations relative to the regression line.

The theoretical regression line is the line around which the points of the correlation field are grouped and which indicates the main direction, the main tendency, of the connection.

An important stage of regression analysis is determining the type of function with which the dependence between characteristics is characterized. The main basis for choosing the type of equation should be a meaningful analysis of the nature of the dependence being studied and its mechanism.

To find the parameters a and b of the regression equation, we use the least squares method (LSM). When applying this method to find the function that best fits the empirical data, it is required that the sum of the squared deviations (residuals) of the empirical points from the theoretical regression line be a minimum.

The fit is assessed by looking at the residuals (the vertical distance of each point from the line: residual = observed y − predicted y; Fig. 2).

The line of best fit is chosen so that the sum of the squares of the residuals is minimal.

Fig. 2. Linear regression line with the residuals shown (vertical dotted lines) for each point.

After simple transformations, the least squares method yields a system of normal equations for determining the values of the parameters a and b of the linear correlation equation from empirical data:

na + bΣx = Σy,
aΣx + bΣx² = Σxy. (1.2)

Solving this system of equations for b, we obtain the following formula for determining this parameter:

b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², (1.3)

where x̄ and ȳ are the mean values of x and y.

The value of the parameter a is obtained by dividing both sides of the first equation of this system by n:

a = ȳ − b·x̄.

The parameter b in the equation is called the regression coefficient. In a direct correlation the regression coefficient has a positive value, and in an inverse relationship the regression coefficient is negative.
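
A from-scratch sketch of formulas (1.2)–(1.3) on made-up data, with no libraries:

```python
def fit_line(x, y):
    """Least-squares estimates: b by formula (1.3), then a = y_mean - b*x_mean."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) \
        / sum((xi - x_mean) ** 2 for xi in x)
    a = y_mean - b * x_mean
    return a, b

a, b = fit_line([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])  # illustrative data
print(f"y = {a:.3f} + {b:.3f}*x")
```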

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable will be positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

The regression coefficient shows by how much, on average, the value of the resulting characteristic y changes when the factor characteristic X changes by one unit. Geometrically, the regression coefficient is the slope of the straight line depicting the correlation equation relative to the X axis (for the equation ŷ = a + bx).

Because of the linear relationship, we expect ŷ to change as x changes, and we call this the variation that is due to, or explained by, the regression. The residual variation should be as small as possible.

If this is true, then most of the variation will be explained by regression, and the points will lie close to the regression line, i.e. the line fits the data well.

A quantitative characteristic of the degree of linear dependence between the random variables X and Y is the correlation coefficient r (an indicator of the closeness of the relationship between the two characteristics).

The correlation coefficient is

r = (nΣxy − Σx·Σy) / √[(nΣx² − (Σx)²)·(nΣy² − (Σy)²)],

where x is the value of the factor characteristic, y is the value of the resulting characteristic, and n is the number of data pairs.
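
A direct transcription of this formula into Python (a sketch; the sums are exactly those appearing in the expression above):

```python
from math import sqrt

def correlation(x, y):
    """Pearson correlation coefficient from the raw-sums formula."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx, syy = sum(v * v for v in x), sum(v * v for v in y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))

print(correlation([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]))  # close to +1
```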


Fig. 3. Options for the location of the "cloud" of points

If the correlation coefficient r = 1, then there is a functional linear relationship between X and Y, and all points (xi, yi) lie on a straight line.

If the correlation coefficient r = 0 (r ≈ 0), then X and Y are said to be uncorrelated, i.e., there is no linear relationship between them.

The relationship between the characteristics (on the Chaddock scale) can be strong, medium or weak. The closeness of the connection is determined by the value of the correlation coefficient, which can take values from −1 to +1 inclusive. The criteria for assessing the closeness of the connection are shown in Fig. 4.

Fig. 4. Quantitative criteria for assessing the closeness of the connection

Any relationship between variables has two important properties: magnitude and reliability. The stronger the relationship between two variables, the greater the magnitude of the relationship and the easier it is to predict the value of one variable from the value of the other variable. The magnitude of dependence is easier to measure than reliability.

The reliability of the dependence is no less important than its magnitude. This property is related to the representativeness of the sample under study. The reliability of a relationship characterizes how likely it is that this relationship will be found again on other data.

As the magnitude of the dependence of variables increases, its reliability usually increases.

The proportion of the total variance that is explained by the regression is called the coefficient of determination, usually expressed as a percentage and denoted R² (in paired linear regression it is the quantity r², the square of the correlation coefficient); it allows one to assess the quality of the regression equation.

The coefficient of determination measures the proportion of variance around the mean that is "explained" by the constructed regression. It ranges from 0 to 1: the closer the coefficient of determination is to 1, the better the regression "explains" the dependence in the data, while a value close to zero means poor quality of the constructed model. The coefficient of determination can be made very close to 1 provided all the predictors are different.

The difference 1 − R² represents the proportion of the variance that cannot be explained by the regression.
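
A sketch of the coefficient of determination computed exactly as defined here, from observed values and model predictions (the numbers are illustrative):

```python
def r_squared(y, y_pred):
    """Coefficient of determination: 1 minus residual variation over total variation."""
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

y = [2.1, 3.9, 6.2, 7.8, 10.1]
y_pred = [2.0, 4.0, 6.0, 8.0, 10.0]  # predictions of some fitted model
print(f"R^2 = {r_squared(y, y_pred):.3f}")
```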

Multiple regression

Multiple regression is used in situations where, out of the many factors influencing the effective characteristic, it is impossible to single out one dominant factor and several factors must be taken into account. For example, the volume of output is determined by the size of fixed and working capital, the number of personnel, the level of management, etc.; the level of demand depends not only on the price but also on the funds available to the population.

The main goal of multiple regression is to build a model with several factors and determine the influence of each factor separately, as well as their joint impact on the indicator being studied.

Multiple regression is a relationship equation with several independent variables:

y = f(x1, x2, …, xm) + ε.

During their studies, students very often encounter a variety of equations. One of them, the regression equation, is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type of equality is used in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity that describes the dependence of the average value of one set of data on the values of another quantity. The regression equation shows, as a function of a particular characteristic, the average value of another characteristic. In its simplest form, the regression function is the equation y = f(x), in which y acts as the dependent variable and x as the independent variable (factor characteristic).

What are the types of relationships between variables?

In general, there are two opposing types of relationships: correlation and regression.

The first is characterized by the equality of conditional variables. In this case, it is not reliably known which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to construct a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse and log-linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. A hyperbolic equation has the form of a regular hyperbola, y = c + m/x + E. A log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

Two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, …, xc) + E. In this situation, y acts as the dependent variable and the x's as explanatory variables. The variable E is stochastic; it absorbs the influence of factors not included in the equation. The nonlinear regression equation is somewhat controversial: on the one hand, it is not linear with respect to the indicators included in it, but on the other hand, in the role of evaluating indicators, it is linear.

Inverse and paired types of regressions

An inverse regression is a type of function that needs to be converted to a linear form. In the most traditional application programs it has the form y = 1/(c + m·x + E). A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient, whose value fluctuates within the interval [−1, +1]. A negative value indicates the presence of an inverse relationship, a positive value a direct one. If the coefficient takes a value equal to 0, then there is no relationship. The closer the value is to ±1, the stronger the relationship between the parameters; the closer to 0, the weaker it is.

Methods

Parametric correlation methods can assess the strength of the relationship. They are based on distribution estimates and are used to study parameters that obey the law of normal distribution.

The parameters of the linear regression equation are needed to identify the type of dependence, define the function of the regression equation and evaluate the indicators of the chosen relationship formula. The correlation field is used as a method of identifying a connection: all available data are depicted graphically in a rectangular two-dimensional coordinate system, forming a correlation field. The values of the describing factor are marked along the abscissa axis, while the values of the dependent factor are marked along the ordinate axis. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of connection. A value between 30% and 70% indicates connections of medium closeness. A 100% indicator is evidence of a functional connection.

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation indicator. It speaks of the closeness of the relationship between the presented set of indicators and the characteristic being studied; it can also speak of the nature of the influence of the parameters on the result. The multiple regression equation is estimated using this indicator.

In order to calculate the multiple correlation indicator, it is necessary to calculate its index.

Least squares method

This method is a way of estimating regression factors. Its essence is to minimize the sum of squared deviations between the observed values and those given by the function.

A pairwise linear regression equation can be estimated using such a method. This type of equations is used when a paired linear relationship is detected between indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the final indicator of the function y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, the function equals the parameter c. If the variable x is not zero, the factor c carries no economic sense. The only influence on the function is the sign in front of the factor c: a minus indicates that the result changes more slowly than the factor, while a plus indicates an accelerated change in the result.

Each parameter of the regression equation can be expressed through an equation. For example, the factor c has the form c = ȳ − m·x̄.

Grouped data

There are task conditions in which all information is grouped by the characteristic x, but for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depending on x changes. Thus, the grouped information helps to find the regression equation, which is used as an analysis of the relationship. However, this method has its drawbacks: average indicators are often subject to external fluctuations, which do not reflect the pattern of the relationship but merely mask it with "noise". Averages show the pattern of the relationship much worse than a linear regression equation; still, they can be used as a basis for finding the equation. By multiplying the size of an individual group by the corresponding average, one obtains the sum of y within the group; adding up all these sums gives the total sum of y. It is a little more difficult to compute the sum xy. If the intervals are small, the x indicator can conditionally be taken as the same for all units within a group; it is multiplied by the group sum of y to obtain the sum of the products of x and y within the group, after which all the group sums are added together and the total sum xy is obtained.
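
The bookkeeping described in this paragraph is easy to express in code. A sketch on made-up grouped data (the representative x values, group sizes, and group means of y are all illustrative):

```python
# Each group: (representative x value, number of units n, group mean of y)
groups = [(10, 5, 3.2), (20, 8, 4.1), (30, 7, 5.0)]  # illustrative data

sum_y = sum(n * y_mean for _, n, y_mean in groups)       # group size * group mean, summed
sum_xy = sum(x * n * y_mean for x, n, y_mean in groups)  # x taken as constant within a group
n_total = sum(n for _, n, _ in groups)
print(sum_y, sum_xy, n_total)
```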

Multiple regression equation: assessing the importance of a relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, …, xm) + E. Most often, such an equation is used to solve the problem of supply and demand for a product, of interest income on repurchased shares, and to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the microeconomic level it is used somewhat less frequently.

The main task of multiple regression is to build a model from data containing a huge amount of information in order to determine what influence each of the factors, individually and in their totality, has on the indicator to be modeled and on its coefficients. The regression equation can take on a wide variety of values. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

The linear function is depicted in the form of the following relationship: y = a0 + a1x1 + a2x2 + … + amxm. In this case, a1, a2, …, am are considered "pure" regression coefficients. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, under the condition of stable values of the other indicators.

Nonlinear equations have, for example, the form of the power function y = a·x1^b1·x2^b2·…·xm^bm. In this case, the indicators b1, b2, …, bm are called elasticity coefficients: they show how the result will change (by what percentage) when the corresponding indicator x increases (decreases) by 1%, with the other factors held stable.

What factors need to be taken into account when constructing multiple regression

In order to build multiple regression correctly, it is necessary to find out which factors deserve special attention.

It is necessary to have some understanding of the nature of the relationships between the economic factors and the quantity being modeled. The factors to be included must meet the following criteria:

  • They must be quantitatively measurable. To use a factor that describes the quality of an object, it should in any case be given a quantitative form.
  • There should be no intercorrelation between factors, and no functional relationship between them. Such conditions most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliable and blurred estimates.
  • In the case of a huge correlation between factors, there is no way to determine their isolated influence on the final result of the indicator, so the coefficients become uninterpretable.

Construction methods

There are a huge number of methods and techniques that explain how factors can be selected for the equation. However, all of them are based on the selection of coefficients using a correlation indicator. Among them are:

  • Elimination method.
  • Inclusion method.
  • Stepwise regression analysis.

The first method involves filtering out coefficients from the total set. The second involves introducing additional factors. The third is the elimination of factors that were previously introduced into the equation. Each of these methods has a right to exist; they have their pros and cons, but they can all solve the problem of eliminating unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Multivariate analysis methods

Such methods for determining factors are based on the consideration of individual combinations of interrelated characteristics. These include discriminant analysis, pattern recognition, principal component analysis and cluster analysis. There is also factor analysis, which appeared with the development of the component method. All of them apply in certain circumstances, subject to certain conditions and factors.

Regression coefficients show the intensity of the influence of the factors on the performance indicator. If preliminary standardization of the factor indicators has been carried out, then b0 equals the average value of the effective indicator in the aggregate. The coefficients b1, b2, …, bn show by how many units the level of the effective indicator deviates from its average value if the value of the corresponding factor indicator deviates from its (zero) average by one standard deviation. Thus, the regression coefficients characterize the degree of significance of individual factors for increasing the level of the performance indicator. The specific values of the regression coefficients are determined from empirical data by the least squares method (as a result of solving systems of normal equations).
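
A sketch of such standardized (beta) coefficients on made-up data: each variable is scaled to mean 0 and standard deviation 1, then fitted by ordinary least squares (after centering, the free term b0 vanishes, consistent with the text above):

```python
import numpy as np

def standardized_coefficients(X, y):
    """Beta coefficients: OLS fit after scaling every variable to mean 0, std 1."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

X = np.array([[1.0, 7.0], [2.0, 5.0], [3.0, 6.0], [4.0, 2.0], [5.0, 3.0]])  # two factors
y = np.array([3.0, 5.0, 6.5, 9.0, 10.0])
print(standardized_coefficients(X, y))  # magnitudes are directly comparable
```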

The regression line is the line that most accurately reflects the distribution of the experimental points on a scatter diagram, and the steepness of its slope characterizes the relationship between two interval variables.

The regression line is most often sought in the form of a linear function (linear regression) that best approximates the desired curve. This is done using the least squares method, by minimizing the sum of the squared deviations of the actually observed values from their estimates (meaning estimates obtained with a straight line claiming to represent the desired regression relationship):

Σ (y_k − ŷ_k)², k = 1, …, M

(M is the sample size). This approach is based on the well-known fact that the sum appearing in this expression takes its minimum value precisely when the coefficients of the line are the least squares estimates.
57. Main tasks of correlation theory.

Correlation theory is an apparatus that evaluates the closeness of connections between phenomena that are not necessarily in cause-and-effect relationships. Using correlation theory, stochastic, but not causal, relationships are assessed. The author, together with M. L. Lukatskaya, made an attempt to obtain estimates for causal relationships. However, the question of the cause-and-effect relationships of phenomena, of how to identify cause and effect, remains open, and it seems that at the formal level it is fundamentally unsolvable.

Correlation theory and its application to production analysis.

Correlation theory, one of the branches of mathematical statistics, allows one to make reasonable assumptions about the possible limits within which, with a certain degree of reliability, the parameter under study will lie if other statistically related parameters take certain values.

In correlation theory, it is customary to distinguish two main tasks.

The first task of correlation theory is to establish the form of the correlation, i.e., the type of regression function (linear, quadratic, etc.).

The second task of correlation theory is to assess the closeness (strength) of the correlation connection.

The closeness of the correlation connection (dependence) of Y on X is assessed by the amount of scatter of the Y values around the conditional mean. Large scatter indicates a weak dependence of Y on X; small scatter indicates the presence of a strong dependence.
58. Correlation table and its numerical characteristics.

In practice, as a result of independent observations of the quantities X and Y, one deals, as a rule, not with the entire set of all possible pairs of values of these quantities, but only with a limited sample from the population, and the sample size n is defined as the number of pairs available in the sample.

Let the value X in the sample take the values x1, x2, …, xm, where m is the number of values that differ from each other, and in the general case each of them can be repeated in the sample. Let the value Y in the sample take the values y1, y2, …, yk, where k is the number of different values of this quantity, and in the general case each of them can also be repeated in the sample. In this case, the data are entered into a table taking into account the frequencies of occurrence. Such a table of grouped data is called a correlation table.

The first stage of statistical processing of the results is the compilation of a correlation table.

Y\X | x1  | x2  | … | xm  | n_y
y1  | n11 | n21 | … | nm1 | ny1
y2  | n12 | n22 | … | nm2 | ny2
…   |     |     |   |     |
yk  | n1k | n2k | … | nmk | nyk
n_x | nx1 | nx2 | … | nxm | n

The first row of the main part of the table lists, in ascending order, all the values of the quantity X found in the sample. The first column similarly lists, in ascending order, all the values of the quantity Y found in the sample. At the intersection of the corresponding rows and columns stand the frequencies nij (i = 1, 2, …, m; j = 1, 2, …, k), equal to the number of occurrences of the pair (xi; yj) in the sample. For example, the frequency n12 is the number of occurrences of the pair (x1; y2) in the sample.

Here n_xi = Σj nij (1 ≤ i ≤ m) is the sum of the elements of the i-th column, n_yj = Σi nij (1 ≤ j ≤ k) is the sum of the elements of the j-th row, and Σ n_xi = Σ n_yj = n.

The analogues of the formulas, obtained from the correlation table data, have the form:

x̄ = (Σ n_xi·xi) / n,  ȳ = (Σ n_yj·yj) / n.


59. Empirical and theoretical regression lines.

The theoretical regression line can be calculated in this case from the results of individual observations. To solve the system of normal equations, we need the same data: Σx, Σy, Σxy and Σx². We have data on the volume of cement production and the volume of fixed production assets in 1958. The task is to study the relationship between the volume of cement production (in physical terms) and the volume of fixed assets. [1]

The less the theoretical regression line (calculated from the equation) deviates from the actual (empirical) one, the smaller the average approximation error.

The process of finding the theoretical regression line involves fitting the empirical regression line using the least squares method.

The process of finding a theoretical regression line is called alignment of the empirical regression line and consists of choosing and justifying the type of curve and calculating the parameters of its equation.

Empirical regression is built from analytical or combinational grouping data and represents the dependence of the group average values of the result characteristic on the group average values of the factor characteristic. The graphical representation of empirical regression is a broken line made up of points whose abscissas are the group average values of the factor characteristic and whose ordinates are the group average values of the result characteristic. The number of points equals the number of groups in the grouping.

The empirical regression line reflects the main trend of the relationship under consideration. If the empirical regression line is close in appearance to a straight line, we can assume the presence of a linear correlation between the characteristics. If the connection line is closer to a curve, this may be due to the presence of a curvilinear correlation relationship.
60. Sample correlation and regression coefficients.

If the dependence between the characteristics on the graph indicates a linear correlation, the correlation coefficient r is calculated; it allows one to assess the closeness of the relationship between the variables and also to find out what proportion of the changes in a characteristic is due to the influence of the main characteristic and what part to the influence of other factors. The coefficient varies from −1 to +1. If r = 0, there is no linear relationship between the characteristics: the equality r = 0 indicates only the absence of a linear correlation dependence, not the absence of a correlation altogether, much less of a statistical dependence. If r = ±1, this means the presence of a complete (functional) connection; in this case, all observed values lie on the regression line, which is a straight line.
The practical significance of the correlation coefficient is determined by its squared value, called the coefficient of determination.
The regression is approximated (approximately described) by a linear function y = kX + b. For the regression of Y on X, the regression equation is ŷx = ryx·X + b. (1) The slope ryx of the line of direct regression of Y on X is called the regression coefficient of Y on X.

If equation (1) is found from sample data, it is called the sample regression equation. Accordingly, ryx is the sample regression coefficient of Y on X, and b is the sample free term of the equation. The regression coefficient measures the variation in Y per unit variation in X. The parameters of the regression equation (the coefficients ryx and b) are found by the least squares method.
61. Assessing the significance of the correlation coefficient and the closeness of the correlation in the general population

The significance of a correlation coefficient is checked using Student's test:

t = r / m_r,

where m_r is the root mean square error of the correlation coefficient, determined by the formula

m_r = √((1 − r²) / (n − 2)).

If the calculated value is higher than the table value, we can conclude that the value of the correlation coefficient is significant. The table values of t are found from the table of Student's t-test values, taking into account the number of degrees of freedom (v = n − 2) and the significance level (in economic calculations, usually 0.05 or 0.01). In our example, the number of degrees of freedom is n − 2 = 40 − 2 = 38; at the significance level p = 0.05, t = 2.02. Since the actual value in all cases is higher than the tabulated one, the relationship between the resultant and factor indicators is reliable, and the magnitudes of the correlation coefficients are significant.
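
A sketch of this significance check; the r and n values are illustrative, and scipy's Student-t quantile replaces the printed table:

```python
from math import sqrt
from scipy.stats import t as student_t

def r_significant(r, n, alpha=0.05):
    """Test r against zero: t = r / m_r with m_r = sqrt((1 - r^2) / (n - 2))."""
    m_r = sqrt((1 - r**2) / (n - 2))
    t_calc = r / m_r
    t_crit = student_t.ppf(1 - alpha / 2, df=n - 2)   # two-sided critical value
    return t_calc, t_crit, abs(t_calc) > t_crit

print(r_significant(r=0.62, n=40))  # illustrative values
```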

The estimate of the correlation coefficient, calculated from a limited sample, practically always differs from zero. But this does not mean that the correlation coefficient of the population is also different from zero. It is required to evaluate the significance of the sample value of the coefficient or, in accordance with the formulation of the tasks of testing statistical hypotheses, to test the hypothesis that the correlation coefficient is equal to zero. If the hypothesis H0 that the correlation coefficient equals zero is rejected, then the sample coefficient is significant and the corresponding values are related by a linear relationship. If the hypothesis H0 is accepted, then the coefficient estimate is not significant and the values are not linearly related to each other (if, for physical reasons, the factors can be related, it is better to say that this relationship has not been established from the available experimental data). Testing the hypothesis about the significance of the correlation coefficient estimate requires knowledge of the distribution of this random variable. The distribution of the value ρik has been studied only for the special case when the random variables Uj and Uk are distributed according to the normal law.

As a criterion for testing the null hypothesis H0, the random variable t = ρik·√(n − 2) / √(1 − ρik²) is applied. If the modulus of the correlation coefficient is relatively far from unity, then, when the null hypothesis is true, the value t is distributed according to Student's law with n − 2 degrees of freedom. The competing hypothesis H1 corresponds to the statement that ρik is not equal to zero (greater or less than zero). Therefore, the critical region is two-sided.
62. Calculation of the sample correlation coefficient and construction of the sample straight line regression equation.

The sample correlation coefficient is found by the formula

r = Σ(xi − x̄)(yi − ȳ) / (n·s_x·s_y),

where s_x and s_y are the sample standard deviations of the values x and y.

The sample correlation coefficient shows the closeness of the linear relationship between x and y: the closer |r| is to unity, the stronger the linear relationship between x and y.

Simple linear regression finds a linear relationship between one input variable and one output variable. To do this, a regression equation is determined, i.e., a model reflecting the dependence of the values of Y, the dependent variable, on the values of x, the independent variable, described by the equation

ŷ = A0 + A1·x,

where A0 is the free term of the regression equation,

and A1 is the coefficient of the regression equation.

A corresponding straight line, called the regression line, is then constructed. The coefficients A0 and A1, also called model parameters, are selected in such a way that the sum of the squared deviations of the points corresponding to the real observations from the regression line is minimal. The coefficients are selected by the least squares method. In other words, simple linear regression describes the linear model that best approximates the relationship between one input variable and one output variable.
