Correlation coefficient analysis. Correlation and regression analysis in Excel: execution instructions

The article discusses the definitions of correlation, correlation analysis and the correlation coefficient, together with their main characteristics.

Researchers are often interested in how two or more variables are related to each other in one or more study samples. For example, such a relationship can be observed between the error in the hardware processing of experimental data and the magnitude of network voltage surges. Another example is the relationship between data-link capacity and signal-to-noise ratio.

In 1886, the English naturalist Francis Galton coined the term "correlation" to describe the nature of this kind of interaction. Later, his student Karl Pearson developed a mathematical formula that makes it possible to quantify the correlation between characteristics.

Dependencies between quantities (factors, characteristics) are divided into two types: functional and statistical.

With a functional dependence, each value of one variable corresponds to a definite value of another variable. A functional connection between two factors is possible only if the second quantity depends solely on the first and on nothing else. If a quantity depends on many factors, a functional connection is possible only when it depends on no factors other than those included in the specified set.

With a statistical dependence, a change in one of the quantities entails a change in the distribution of other quantities, which with certain probabilities take on certain values.

Of much greater interest is a special case of statistical dependence in which the values of some random variables are related to the average values of others, with the peculiarity that in each individual case any of the interrelated quantities can take on different values.

This kind of dependence between variables is called correlation dependence, or simply correlation.

Correlation analysis is a method that makes it possible to detect a relationship between several random variables.

Correlation analysis solves two main problems:

  • The first task is to determine the form of the relationship, i.e. to establish the mathematical form in which it is expressed. This is very important, because the final result of studying the relationship between characteristics depends on the right choice of this form.
  • The second task is to measure the closeness (tightness) of the relationship between the characteristics in order to establish the degree of influence of a given factor on the result. It is solved mathematically by determining the parameters of the correlation equation.

The results obtained are then assessed and analyzed using special indicators of the correlation method (coefficients of determination, linear and multiple correlation coefficients, etc.), and the significance of the relationship between the characteristics being studied is checked.

The following tasks are solved using correlation analysis methods:

  1. Relationship. Is there a relationship between the parameters?
  2. Forecasting. If the behavior of one parameter is known, then the behavior of another parameter that correlates with the first can be predicted.
  3. Classification and identification of objects. Correlation analysis helps to select a set of independent features for classification.

Correlation is a statistical relationship between two or more random variables (or values that can be considered as such with some acceptable degree of accuracy). Its essence lies in the fact that when the value of one variable changes, a natural change (decrease or increase) of the other variable occurs.

The correlation coefficient is used to determine whether there is a relationship between two properties.

The correlation coefficient ρ for the population is, as a rule, unknown, so it is estimated from experimental data: a sample of n pairs of values (x_i, y_i) obtained by jointly measuring two characteristics X and Y. The correlation coefficient determined from sample data is called the sample correlation coefficient (or simply the correlation coefficient). It is usually denoted by the symbol r.

The main properties of the correlation coefficient include:

  1. Correlation coefficients can characterize only linear relationships, i.e. those expressed by the equation of a linear function. If the relationship between the varying characteristics is nonlinear, other indicators of connection should be used.
  2. The values of the correlation coefficient are abstract numbers ranging from −1 to +1, i.e. −1 ≤ r ≤ 1.
  3. With independent variation of the characteristics, when there is no connection between them, r = 0.
  4. With a positive, or direct, relationship, when the values of one characteristic increase as the values of the other increase, the correlation coefficient has a positive (+) sign and lies between 0 and +1, i.e. 0 < r ≤ 1.
  5. With a negative, or inverse, relationship, when the values of one characteristic decrease as the values of the other increase, the correlation coefficient carries a negative (−) sign and lies between 0 and −1, i.e. −1 ≤ r < 0.
  6. The stronger the connection between the characteristics, the closer the correlation coefficient is to |1|. If r = ±1, the correlation becomes functional, i.e. each value of characteristic X corresponds to one or more strictly defined values of characteristic Y.
  7. The reliability of the correlation between characteristics cannot be judged by the magnitude of the correlation coefficient alone. It also depends on the number of degrees of freedom k = n − 2, where n is the number of correlated pairs of values of X and Y. The larger n, the higher the reliability of the relationship at the same value of the correlation coefficient.

The correlation coefficient is calculated using the following formula:

r = (n·Σxy − Σx·Σy) / √[(n·Σx² − (Σx)²)·(n·Σy² − (Σy)²)],

where x is the value of the factor characteristic; y is the value of the resulting characteristic; n is the number of data pairs.
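As an illustration, here is a minimal Python sketch of this computational formula (NumPy is an assumption on my part; the article itself does not use any particular software at this point). The variable names x, y and n follow the notation above.

```python
import numpy as np

def pearson_r(x, y):
    """Sample correlation coefficient from the computational formula above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)                                  # number of data pairs
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) *
                  (n * np.sum(y**2) - np.sum(y)**2))
    return num / den

# toy data: y grows roughly linearly with x, so r should be close to +1
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(pearson_r(x, y), 3))
```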

The correlation is studied on the basis of experimental data: the measured values (x_i, y_i) of two characteristics x and y. If there is relatively little experimental data, the two-dimensional empirical distribution is represented as a double series of values x_i, y_i. The correlation dependence between the characteristics can then be described in different ways: the correspondence between an argument and a function can be given by a table, a formula, a graph, etc.

When the correlation between quantitative traits whose values can be accurately measured in units of metric scales is studied, a bivariate normally distributed population model is very often adopted. Such a model displays the relationship between the variables x and y graphically as a set of points in a rectangular coordinate system. This graphical representation is called a scatterplot, or correlation field.

This model of a bivariate normal distribution (correlation field) allows a clear graphical interpretation of the correlation coefficient, because the distribution depends on five parameters:

  • the mathematical expectations E[x], E[y] of the values x, y;
  • the standard deviations σx, σy of the random variables x, y;
  • the correlation coefficient ρ, which is a measure of the relationship between the random variables x and y.

Let us give examples of correlation fields.

If ρ = 0, the values (x_i, y_i) obtained from the bivariate normal population lie on the graph within an area bounded by a circle. In this case there is no correlation between the random variables x and y, and they are said to be uncorrelated. For a bivariate normal distribution, uncorrelatedness also means independence of the random variables x and y.

If ρ = 1 or ρ = −1, we speak of complete correlation: there is a linear functional dependence between the random variables x and y.

When ρ = 1, the values (x_i, y_i) determine points lying on a straight line with a positive slope (as x_i increases, the values of y_i also increase).

In intermediate cases, when −1 < ρ < 1, the points determined by the values (x_i, y_i) fall within an area bounded by an ellipse. For ρ > 0 the correlation is positive (as x increases, the values of y generally tend to increase); for ρ < 0 the correlation is negative. The closer ρ is to ±1, the narrower the ellipse and the more tightly the experimental points cluster around a straight line.
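A small simulation makes this picture concrete. The sketch below (Python with NumPy, an assumption rather than anything used in the article) draws samples from a bivariate normal distribution for several values of ρ and prints the sample correlation coefficient, which tracks the theoretical value more and more tightly as ρ approaches ±1:

```python
import numpy as np

rng = np.random.default_rng(0)

for rho in (0.0, 0.5, 0.9, -0.99):
    cov = [[1.0, rho], [rho, 1.0]]        # unit variances, correlation rho
    x, y = rng.multivariate_normal([0, 0], cov, size=5000).T
    r = np.corrcoef(x, y)[0, 1]           # sample correlation coefficient
    print(f"rho = {rho:+.2f}   sample r = {r:+.3f}")
```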

Note that the line along which the points are grouped need not be straight; it can have any other shape: a parabola, a hyperbola, etc. In such cases we speak of a nonlinear correlation.

The correlation dependence between features can be described in different ways, in particular, any form of connection can be expressed by a general equation y=f(x), where feature y is a dependent variable, or a function of an independent variable x, called an argument.

Thus, visual analysis of the correlation field helps to determine not only the presence of a statistical relationship (linear or nonlinear) between the characteristics under study, but also its closeness and shape.

When studying a correlation connection, an important area of ​​analysis is to assess the degree of closeness of the connection. The concept of the degree of closeness of the connection between two characteristics arises due to the fact that in reality many factors influence the change in the resulting characteristic. In this case, the influence of one of the factors can be expressed more noticeably and clearly than the influence of other factors. As conditions change, the role of the decisive factor may shift to another feature.

When statistically studying relationships, as a rule, only the main factors are taken into account. Also, taking into account the degree of closeness of the connection, the need for a more detailed study of this particular connection and the significance of its practical use are assessed.

In general, knowledge of a quantitative assessment of the closeness of the correlation allows us to solve the following group of questions:

  • the need for an in-depth study of this relationship between signs and the feasibility of its practical application;
  • the degree of differences in the manifestation of the connection in specific conditions (comparing the assessment of the closeness of the connection for different conditions);
  • identification of major and minor factors in given specific conditions by sequential consideration and comparison of a trait with various factors.

Indicators of connection closeness must satisfy a number of basic requirements:

  • the value of the indicator of the closeness of the connection should be equal to or close to zero if there is no connection between the characteristics (processes, phenomena) being studied;
  • if there is a functional connection between the studied characteristics, the value of the indicator of the closeness of the connection should be equal to one;
  • if there is a correlation between the characteristics, the absolute value of the indicator of the closeness of the connection should be expressed as a proper fraction, which is larger in value, the closer the connection between the studied characteristics (tends to unity).

The correlation dependence is determined by various parameters, among which the most widely used are paired indicators characterizing the relationship between two random variables: the covariance coefficient (correlation moment) and the linear correlation coefficient (Pearson's correlation coefficient).

The strength of the connection is determined by the absolute value of the connection tightness indicator and does not depend on the direction of the connection.

Depending on the absolute value of the correlation coefficient ρ, correlations between characteristics are divided by strength as follows:

  • strong, or tight (ρ > 0.70);
  • average (0.50 ≤ ρ ≤ 0.69);
  • moderate (0.30 ≤ ρ ≤ 0.49);
  • weak (0.20 ≤ ρ ≤ 0.29);
  • very weak (ρ ≤ 0.19).
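This verbal scale is easy to encode. The helper below is a sketch that mirrors the thresholds listed above (boundary values are assigned to the stronger category, a detail the article leaves unspecified):

```python
def correlation_strength(r):
    """Classify |r| using the strength scale given above."""
    a = abs(r)
    if a >= 0.70:
        return "strong (tight)"
    if a >= 0.50:
        return "average"
    if a >= 0.30:
        return "moderate"
    if a >= 0.20:
        return "weak"
    return "very weak"

print(correlation_strength(-0.72))  # strong (tight)
print(correlation_strength(0.25))   # weak
```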

The form of the correlation relationship can be linear or nonlinear.

For example, the relationship between a student’s level of training and final certification grades can be linear. An example of a nonlinear relationship is the level of motivation and the effectiveness of completing a given task. (As motivation increases, the efficiency of completing a task first increases, then, at a certain level of motivation, maximum efficiency is achieved; but a further increase in motivation is accompanied by a decrease in efficiency.)

In direction, the correlation relationship can be positive (direct) and negative (inverse).

With a positive linear correlation, higher values ​​of one characteristic correspond to higher values ​​of another, and lower values ​​of one characteristic correspond to lower values ​​of another. With a negative correlation, the relationships are reversed.

The sign of the correlation coefficient depends on the direction of the correlation: with a positive correlation, the correlation coefficient has a positive sign, with a negative correlation, it has a negative sign.


Correlation analysis

Correlation is a statistical relationship between two or more random variables (or variables that can be considered as such with some acceptable degree of accuracy), in which changes in one or more of these quantities are accompanied by a systematic change in the other quantity or quantities. The mathematical measure of the correlation between two random variables is the correlation coefficient.

Correlation can be positive or negative (it is also possible that there is no statistical relationship at all, for example, for independent random variables). A negative correlation is one in which an increase in one variable is associated with a decrease in another, and the correlation coefficient is negative. A positive correlation is one in which an increase in one variable is associated with an increase in another, and the correlation coefficient is positive.

Autocorrelation - statistical relationship between random variables from the same series, but taken with a shift, for example, for a random process - with a time shift.

Let X, Y be two random variables defined on the same probability space. Then their correlation coefficient is given by the formula

ρ(X, Y) = cov(X, Y) / √(D[X]·D[Y]),

where cov denotes the covariance and D the variance, or, equivalently,

ρ(X, Y) = E[(X − E[X])(Y − E[Y])] / √(E[(X − E[X])²]·E[(Y − E[Y])²]),

where the symbol E denotes the mathematical expectation.

To graphically represent such a relationship, you can use a rectangular coordinate system with axes that correspond to both variables. Each pair of values ​​is marked with a specific symbol. This graph is called a “scatterplot.”

The method for calculating the correlation coefficient depends on the type of scale to which the variables belong. For variables measured on interval and quantitative scales, the Pearson correlation coefficient (product-moment correlation) is used. If at least one of the two variables is on an ordinal scale or is not normally distributed, Spearman's rank correlation or Kendall's τ (tau) must be used. When one of the two variables is dichotomous, a point-biserial correlation is used, and if both variables are dichotomous, a four-field correlation. Calculating the correlation coefficient between two non-dichotomous variables makes sense only when the relationship between them is linear (unidirectional).
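In code, all three coefficients mentioned here are available in SciPy. The sketch below is illustrative (the article does not prescribe any particular library, and the toy data are made up) and computes them on the same sample:

```python
from scipy import stats

x = [12, 15, 17, 20, 23, 25, 28]
y = [30, 33, 31, 40, 45, 44, 52]

r_pearson, p_pearson = stats.pearsonr(x, y)   # interval/quantitative, normal data
rho_spearman, p_s = stats.spearmanr(x, y)     # ordinal or non-normal data
tau_kendall, p_k = stats.kendalltau(x, y)     # ordinal data, rank agreement

print(f"Pearson r = {r_pearson:.3f}, Spearman rho = {rho_spearman:.3f}, "
      f"Kendall tau = {tau_kendall:.3f}")
```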

Kendall correlation coefficient

Used to measure the mutual disorder (discordance) of two rankings.

Spearman correlation coefficient

Properties of the correlation coefficient

If the covariance is taken as the scalar product of two random variables, then the norm of a random variable X is equal to √D[X], and a consequence of the Cauchy-Bunyakovsky inequality is |ρ(X, Y)| ≤ 1. Moreover, ρ(X, Y) = ±1 if and only if X and Y are related linearly, Y = kX + b, and in this case the signs of ρ and k coincide.
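A quick numerical check of these properties is given below (a sketch; the coefficients k and b are arbitrary values chosen for illustration, and NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

for k in (2.5, -0.7):
    y = k * x + 3.0                            # exact linear dependence y = kx + b
    r = np.corrcoef(x, y)[0, 1]
    print(f"k = {k:+.1f}  ->  r = {r:+.3f}")   # r = +1 or -1, sign matches k
```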

Correlation analysis

Correlation analysis is a method of processing statistical data that consists in studying the correlation coefficients between variables. Correlation coefficients between one pair or many pairs of characteristics are compared in order to establish statistical relationships between them.

The goal of correlation analysis is to provide some information about one variable with the help of another variable. When this goal can be achieved, the variables are said to be correlated. In the most general form, accepting the hypothesis of a correlation means that a change in the value of variable A will occur simultaneously with a proportional change in the value of B: if both variables increase, the correlation is positive; if one variable increases while the other decreases, the correlation is negative.

Correlation reflects only the linear dependence of quantities, not their functional connectedness. For example, if the correlation coefficient between the quantities A = sin(x) and B = cos(x) is calculated, it will be close to zero, i.e. there is no linear dependence between the quantities. Meanwhile, A and B are obviously related functionally by the law sin²(x) + cos²(x) = 1.
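This can be verified directly: over a whole number of periods the sample correlation between sin(x) and cos(x) is numerically close to zero even though the two quantities are functionally linked (a sketch, assuming NumPy):

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 10_000)   # one full period
a, b = np.sin(x), np.cos(x)

print(np.corrcoef(a, b)[0, 1])          # ~ 0: no *linear* dependence
print(np.max(np.abs(a**2 + b**2 - 1)))  # ~ 0: yet sin^2(x) + cos^2(x) = 1 holds exactly
```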

Limitations of Correlation Analysis

Figure: graphs of distributions of pairs (x, y) with the corresponding correlation coefficient for each of them. The correlation coefficient reflects a linear relationship (top row), but does not describe the curve of a dependence (middle row), and is not at all suitable for describing complex, nonlinear relationships (bottom row).

  1. Application is possible only if there is a sufficient number of cases for study: depending on the type of correlation coefficient, this ranges from 25 to 100 pairs of observations.
  2. The second limitation follows from the hypothesis underlying correlation analysis, which assumes a linear dependence of the variables. In many cases, when it is reliably known that a relationship exists, correlation analysis may not yield results simply because the relationship is nonlinear (expressed, for example, as a parabola).
  3. The mere fact of correlation does not provide grounds for asserting which of the variables precedes or causes changes, or that the variables are generally causally related to each other, for example, due to the action of a third factor.

Application area

This method of processing statistical data is very popular in economics and social sciences (in particular in psychology and sociology), although the scope of application of correlation coefficients is extensive: quality control of industrial products, metallurgy, agrochemistry, hydrobiology, biometrics and others.

The popularity of the method is due to two factors: correlation coefficients are relatively easy to calculate, and their use does not require special mathematical training. Combined with its ease of interpretation, the ease of application of the coefficient has led to its widespread use in the field of statistical data analysis.

False correlation

Often, the tempting simplicity of correlation research encourages the researcher to make false intuitive conclusions about the presence of a cause-and-effect relationship between pairs of characteristics, while correlation coefficients establish only statistical relationships.

In the modern quantitative methodology of the social sciences there has, in effect, been an abandonment of attempts to establish cause-and-effect relationships between observed variables by empirical methods. Therefore, when researchers in the social sciences speak of establishing relationships between the variables being studied, either a general theoretical assumption or a statistical dependence is implied.


COURSE WORK

Topic: Correlation analysis

Introduction

1. Correlation analysis

1.1 The concept of correlation

1.2 General classification of correlations

1.3 Correlation fields and the purpose of their construction

1.4 Stages of correlation analysis

1.5 Correlation coefficients

1.6 Normalized Bravais-Pearson correlation coefficient

1.7 Spearman's rank correlation coefficient

1.8 Basic properties of correlation coefficients

1.9 Checking the significance of correlation coefficients

1.10 Critical values ​​of the pair correlation coefficient

2. Planning a multifactorial experiment

2.1 Condition of the problem

2.2 Determination of the center of the plan (basic level) and the level of factor variation

2.3 Construction of the planning matrix

2.4 Checking the homogeneity of dispersion and equivalence of measurement in different series

2.5 Regression equation coefficients

2.6 Reproducibility variance

2.7 Checking the significance of regression equation coefficients

2.8 Checking the adequacy of the regression equation

Conclusion

Bibliography

INTRODUCTION

Experimental planning is a mathematical and statistical discipline that studies methods for the rational organization of experimental research - from the optimal choice of the factors under study and the design of the actual experimental plan in accordance with its purpose, to methods for analyzing the results. Experimental planning began with the work of the English statistician R. Fisher (1935), who showed that rational experimental planning provides no less significant a gain in the accuracy of estimates than optimal processing of the measurement results. In the 1960s the modern theory of experimental planning emerged. Its methods are closely related to the theory of function approximation and mathematical programming. Optimal plans have been constructed and their properties studied for a wide class of models.

Experimental planning is the selection of an experimental plan that meets specified requirements, a set of actions aimed at developing an experimentation strategy (from obtaining a priori information to obtaining a workable mathematical model or determining optimal conditions). This is purposeful control of an experiment, implemented under conditions of incomplete knowledge of the mechanism of the phenomenon being studied.

In the process of measurements, subsequent data processing, as well as formalization of the results in the form of a mathematical model, errors arise and some of the information contained in the original data is lost. The use of experimental planning methods makes it possible to determine the error of the mathematical model and judge its adequacy. If the accuracy of the model turns out to be insufficient, then the use of experimental planning methods makes it possible to modernize the mathematical model with additional experiments without losing previous information and with minimal costs.

The purpose of planning an experiment is to find such conditions and rules for conducting experiments under which it is possible to obtain reliable and trustworthy information about an object with the least expenditure of labor, and to present this information in a compact and convenient form with a quantitative assessment of accuracy.

Among the main planning methods used at different stages of the study are:

Planning a screening experiment, the main meaning of which is the selection from the entire set of factors of a group of significant factors that are subject to further detailed study;

Experimental design for ANOVA, i.e. drawing up plans for objects with qualitative factors;

Planning a regression experiment that allows you to obtain regression models (polynomial and others);

Planning an extreme experiment in which the main task is experimental optimization of the research object;

Planning when studying dynamic processes, etc.

The purpose of studying the discipline is to prepare students for production and technical activities in their specialty using methods of planning theory and modern information technologies.

Objectives of the discipline: study of modern methods of planning, organizing and optimizing scientific and industrial experiments, conducting experiments and processing the results obtained.

1. CORRELATION ANALYSIS

1.1 The concept of correlation

A researcher is often interested in how two or more variables are related to each other in one or more samples being studied. For example, can height affect a person's weight, or can blood pressure affect product quality?

This kind of dependence between variables is called correlation dependence, or correlation. A correlation is a consistent change in two characteristics, reflecting the fact that the variability of one characteristic is in accord with the variability of the other.

It is known, for example, that on average there is a positive relationship between people's height and weight: the greater the height, the greater the weight. However, there are exceptions to this rule, when relatively short people are overweight and, conversely, tall people of asthenic build are light. The reason for such exceptions is that each biological, physiological or psychological trait is determined by the influence of many factors: environmental, genetic, social, and so on.

Correlation connections are probabilistic in nature and can be studied only on representative samples using the methods of mathematical statistics. Both terms - correlation connection and correlation dependence - are often used interchangeably. Dependence implies influence; connection means any coordinated changes, which can be explained by hundreds of causes. Correlation connections cannot be regarded as evidence of a cause-and-effect relationship; they only indicate that changes in one characteristic are usually accompanied by certain changes in another.

A correlation dependence is one in which changes in the values of one characteristic alter the probability of occurrence of different values of another characteristic.

The task of correlation analysis comes down to establishing the direction (positive or negative) and form (linear, nonlinear) of the relationship between varying characteristics, measuring its closeness, and, finally, checking the level of significance of the obtained correlation coefficients.

Correlation connections vary in form, direction and degree (strength).

The form of a correlation relationship can be linear or curvilinear. For example, the relationship between the number of training sessions on a simulator and the number of correctly solved problems in the control session may be linear. The relationship between the level of motivation and the effectiveness of completing a task, by contrast, may be curvilinear (Figure 1). As motivation increases, the effectiveness of completing the task first increases, then the optimal level of motivation is reached, which corresponds to maximum effectiveness; a further increase in motivation is accompanied by a decrease in effectiveness.

Figure 1 - Relationship between the effectiveness of problem solving and the strength of motivational tendencies

In direction, the correlation relationship can be positive (“direct”) and negative (“inverse”). With a positive linear correlation, higher values ​​of one characteristic correspond to higher values ​​of another, and lower values ​​of one characteristic correspond to low values ​​of another (Figure 2). With a negative correlation, the relationships are inverse (Figure 3). With a positive correlation, the correlation coefficient has a positive sign, with a negative correlation, it has a negative sign.

Figure 2 – Direct correlation

Figure 3 – Inverse correlation


Figure 4 – No correlation

The degree, strength or closeness of the correlation is determined by the value of the correlation coefficient. The strength of the connection does not depend on its direction and is determined by the absolute value of the correlation coefficient.

1.2 General classification of correlations

Depending on the correlation coefficient, the following correlations are distinguished:

Strong, or close with a correlation coefficient r>0.70;

Average (at 0.50

Moderate (at 0.30

Weak (at 0.20

Very weak (at r<0,19).

1.3 Correlation fields and the purpose of their construction

Correlation is studied on the basis of experimental data: the measured values (x_i, y_i) of two characteristics. If there is little experimental data, the two-dimensional empirical distribution is represented as a double series of values x_i and y_i. The correlation dependence between the characteristics can then be described in different ways: the correspondence between an argument and a function can be given by a table, a formula, a graph, etc.

Correlation analysis, like other statistical methods, is based on the use of probabilistic models that describe the behavior of the characteristics under study in a certain general population from which the experimental values (x_i, y_i) are obtained. When studying the correlation between quantitative characteristics whose values can be accurately measured in units of metric scales (meters, seconds, kilograms, etc.), a two-dimensional normally distributed population model is very often adopted. Such a model displays the relationship between the variables x_i and y_i graphically as a set of points in a rectangular coordinate system. This graphical representation is also called a scatterplot, or correlation field.
This model of a two-dimensional normal distribution (correlation field) allows a clear graphical interpretation of the correlation coefficient, because the distribution depends on five parameters: μ_x, μ_y - the mean values (mathematical expectations); σ_x, σ_y - the standard deviations of the random variables X and Y; and ρ - the correlation coefficient, which is a measure of the relationship between the random variables X and Y.
If ρ = 0, the values (x_i, y_i) obtained from the two-dimensional normal population lie on the graph, in x, y coordinates, within an area bounded by a circle (Figure 5, a). In this case there is no correlation between the random variables X and Y, and they are called uncorrelated. For a two-dimensional normal distribution, uncorrelatedness also means independence of the random variables X and Y.

In scientific research, there is often a need to find a connection between outcome and factor variables (the yield of a crop and the amount of precipitation, the height and weight of a person in homogeneous groups by sex and age, heart rate and body temperature, etc.).

The first are the resultant (outcome) characteristics; the second are the factor characteristics that contribute to changes in those associated with them (the first).

The concept of correlation analysis

There are many definitions of this concept. Based on the above, we can say that correlation analysis is a method used to test the hypothesis of a statistical relationship between two or more variables when the researcher can measure them but not change them.

There are other definitions of the concept in question. Correlation analysis is a method of processing data that involves studying the correlation coefficients between variables; coefficients between one pair or many pairs of characteristics are compared in order to establish statistical relationships between them. Correlation analysis can also be described as a method for studying statistical dependence between random variables that need not have a strictly functional character, in which a change in one random variable leads to a change in the mathematical expectation of another.

The concept of false correlation

When conducting correlation analysis, it must be borne in mind that it can be carried out on any set of characteristics, even ones that are absurd in relation to each other; sometimes they have no causal connection with each other at all.

In this case we speak of a false (spurious) correlation.

Problems of correlation analysis

Based on the above definitions, the following tasks of the described method can be formulated: obtain information about one of the sought variables using another; determine the closeness of the relationship between the studied variables.

Correlation analysis involves determining the relationship between the characteristics being studied, and therefore the tasks of correlation analysis can be supplemented with the following:

  • identification of factors that have the greatest impact on the resulting characteristic;
  • identification of previously unexplored causes of connections;
  • construction of a correlation model with its parametric analysis;
  • study of the significance of communication parameters and their interval assessment.

Relationship between correlation analysis and regression

The method of correlation analysis is often not limited to finding the closeness of the relationship between the quantities under study. It is sometimes supplemented by regression equations, obtained by means of regression analysis, which describe the correlation dependence between the resulting characteristic and the factor characteristic(s). Together with the analysis under consideration, this constitutes the method of correlation-regression analysis.

Conditions for using the method

Resultant indicators depend on one or several factors. The method of correlation analysis can be used when there is a large number of observations of the values of the resultant and factor indicators, and the factors under study must be quantitative and reflected in specific sources of information. If the data follow the normal law, the result of the correlation analysis is the Pearson correlation coefficient; if the characteristics do not obey this law, the Spearman rank correlation coefficient is used.

Rules for selecting correlation analysis factors

When applying this method, it is necessary to determine the factors that influence the performance indicators. They are selected with the requirement that there be cause-and-effect relationships between the indicators. When creating a multifactor correlation model, the factors that have a significant impact on the resulting indicator are selected; it is preferable not to include in the model interdependent factors with a pair correlation coefficient above 0.85, or factors whose relationship with the resultant parameter is non-linear or functional in character. A sketch of how this filtering rule could be applied is shown below.
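One possible way to apply the 0.85 rule in practice is sketched below: it drops one factor from every pair whose absolute pairwise correlation exceeds the threshold. The column names and data are hypothetical, and pandas/NumPy are assumptions rather than tools named by the article.

```python
import numpy as np
import pandas as pd

def drop_collinear(factors: pd.DataFrame, threshold: float = 0.85) -> pd.DataFrame:
    """Remove one factor from each pair with |pairwise r| above the threshold."""
    corr = factors.corr().abs()
    # keep only the upper triangle so each pair is inspected once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return factors.drop(columns=to_drop)

# hypothetical factor data: x3 nearly duplicates x1
rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = rng.normal(size=200)
df["x3"] = df["x1"] * 0.98 + rng.normal(scale=0.05, size=200)

print(drop_collinear(df).columns.tolist())   # x3 is dropped: ['x1', 'x2']
```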

Displaying results

The results of correlation analysis can be presented in text and graphic forms. In the first case they are presented as a correlation coefficient, in the second - in the form of a scatter diagram.

In the absence of correlation between the parameters, the points on the diagram are located chaotically; an average degree of connection is characterized by greater order, with the plotted points lying at a roughly uniform distance from the median line. A strong connection tends toward a straight line, and at r = 1 the points lie exactly on a straight line. An inverse correlation is distinguished by a graph running from the upper left to the lower right corner, a direct correlation by one running from the lower left to the upper right corner.

3D representation of a scatter plot

In addition to the traditional 2D scatter plot display, a 3D graphical representation of correlation analysis is now used.

A scatterplot matrix is ​​also used, which displays all paired plots in a single figure in a matrix format. For n variables, the matrix contains n rows and n columns. The chart located at the intersection of the i-th row and the j-th column is a plot of the variables Xi versus Xj. Thus, each row and column is one dimension, a single cell displays a scatterplot of two dimensions.
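In Python such a matrix can be drawn in one call with pandas (an illustrative sketch; matplotlib is assumed to be installed, and the data are made up):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
n = 300
df = pd.DataFrame({"x1": rng.normal(size=n)})
df["x2"] = 0.8 * df["x1"] + rng.normal(scale=0.6, size=n)   # correlated with x1
df["x3"] = rng.normal(size=n)                               # independent of both

# n variables -> an n-by-n grid; the cell at row i, column j is the plot of Xi vs Xj
pd.plotting.scatter_matrix(df, figsize=(6, 6), diagonal="hist")
plt.show()
```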

Assessing the tightness of the connection

The closeness of the correlation connection is determined by the correlation coefficient (r): strong - r = ±0.7 to ±1, medium - r = ±0.3 to ±0.699, weak - r = 0 to ±0.299. This classification is not strict. The figure shows a slightly different diagram.

An example of using the correlation analysis method

An interesting study was undertaken in the UK. It is devoted to the connection between smoking and lung cancer, and was carried out through correlation analysis. This observation is presented below.

Initial data for correlation analysis: smoking and lung-cancer mortality indicators for 25 professional groups (the numeric values are not reproduced here). The professional groups were:

  • Farmers, foresters and fishermen
  • Miners and quarry workers
  • Manufacturers of gas, coke and chemicals
  • Manufacturers of glass and ceramics
  • Workers of furnaces, forges, foundries and rolling mills
  • Electrical and electronics workers
  • Engineering and related professions
  • Woodworking industries
  • Leatherworkers
  • Textile workers
  • Manufacturers of work clothes
  • Workers in the food, drink and tobacco industries
  • Paper and print manufacturers
  • Manufacturers of other products
  • Builders
  • Painters and decorators
  • Drivers of stationary engines, cranes, etc.
  • Workers not elsewhere included
  • Transport and communications workers
  • Warehouse workers, storekeepers, packers and filling machine workers
  • Office workers
  • Sellers
  • Sports and recreation workers
  • Administrators and managers
  • Professionals, technicians and artists

We begin correlation analysis. For clarity, it is better to start the solution with a graphical method, for which we will construct a scatter diagram.

It demonstrates a direct connection. However, it is difficult to draw an unambiguous conclusion based on the graphical method alone. Therefore, we will continue to perform correlation analysis. An example of calculating the correlation coefficient is presented below.

Using software (MS Excel, described below, is taken as an example), we determine the correlation coefficient, which is 0.716, indicating a strong connection between the parameters under study. We then check the statistical reliability of the obtained value using the corresponding table: subtracting 2 from the 25 pairs of values gives 23 degrees of freedom, and in that row of the table we find the critical r for p = 0.01 (a stricter level, since these are medical data; in other cases p = 0.05 is sufficient), which is 0.51 for this analysis. Since the calculated r exceeds the critical r, the value of the correlation coefficient is considered statistically reliable.
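The same check can be reproduced numerically. The sketch below derives the critical r for 23 degrees of freedom at p = 0.01 from the t-distribution and compares it with the coefficient quoted above (SciPy is an assumption; 0.716 is simply the value cited in the example):

```python
from scipy import stats

r_calc = 0.716          # correlation coefficient quoted in the example
n = 25                  # pairs of observations
dof = n - 2             # degrees of freedom = 23
alpha = 0.01            # stricter level used for medical data

# critical r from the critical t value: r_crit = t / sqrt(t^2 + dof)
t_crit = stats.t.ppf(1 - alpha / 2, dof)
r_crit = t_crit / (t_crit**2 + dof) ** 0.5
print(f"r_crit = {r_crit:.2f}")                  # about 0.51, as in the table
print("significant" if abs(r_calc) > r_crit else "not significant")
```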

Using software when conducting correlation analysis

The described type of statistical data processing can be carried out using software, in particular MS Excel. Correlation analysis here involves calculating the following parameters using built-in functions:

1. The correlation coefficient is determined using the CORREL(array1; array2) function, where array1 and array2 are the cell ranges of the values of the resultant and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient; accordingly, starting with Excel 2007, the PEARSON function can be used with the same arrays.

Graphical display of correlation analysis in Excel is done using the “Charts” panel with the “Scatter Plot” selection.

After specifying the initial data, we get a graph.

2. Assessing the significance of the pairwise correlation coefficient using Student's t-test. The calculated value of the t-criterion is compared with the tabulated (critical) value of this indicator, taking into account the specified significance level and the number of degrees of freedom. The critical value is obtained with the TINV(probability; degrees_of_freedom) function.

3. Matrix of pairwise correlation coefficients. The analysis is carried out using the Data Analysis tool, in which Correlation is selected. The statistical assessment of the pairwise correlation coefficients is made by comparing their absolute values with the tabulated (critical) value. When a calculated pairwise correlation coefficient exceeds the critical one, we can say, at the given level of probability, that the linear relationship between the corresponding characteristics is statistically significant.
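For readers working outside Excel, roughly equivalent steps in Python look like this (a sketch; the column names and data are hypothetical, and pandas/SciPy are assumptions rather than tools used by the article):

```python
import pandas as pd
from scipy import stats

df = pd.DataFrame({
    "factor": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "result": [2.3, 2.9, 4.1, 4.8, 6.2, 6.9],
})

# 1. Correlation coefficient (analogue of CORREL / PEARSON)
r, p_value = stats.pearsonr(df["factor"], df["result"])

# 2. Significance of the pairwise coefficient via Student's t-test
#    (pearsonr already returns the two-sided p-value)
print(f"r = {r:.3f}, p = {p_value:.4f}")

# 3. Matrix of pairwise correlation coefficients (analogue of Data Analysis -> Correlation)
print(df.corr())
```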

Finally

The use of correlation analysis in scientific research makes it possible to determine the relationship between various factors and performance indicators. It must be borne in mind that a high correlation coefficient can also be obtained from an absurd pair or set of data, so this type of analysis must be carried out on a sufficiently large data set.

After obtaining the calculated value of r, it is advisable to compare it with the critical r to confirm the statistical reliability of the result. Correlation analysis can be carried out manually using the formulas or with software, in particular MS Excel. A scatter diagram can also be constructed to represent visually the relationship between the studied factors and the resulting characteristic.

Definition of Correlation Analysis

When solving problems of an economic nature, in particular forecasting, correlation analysis is often used. It is based on the concept of a random variable: a variable that depends on chance and can take on particular values with certain probabilities, with the corresponding distribution law showing the frequency of specific values in their totality. Correlation analysis in statistics relies on stochastic dependence when studying the relationship between economic indicators.

Types of correlation analysis

Correlation analysis works with both functional (complete) dependencies and dependencies distorted by other factors (incomplete). An example of the first type (functional dependence) is the output and consumption of finished products under conditions of shortage. An incomplete relationship can be seen, for example, between labor productivity and workers' length of service: greater experience generally raises productivity, but under the influence of other factors (health or education) this dependence is distorted.

Using correlation analysis in statistics

Correlation analysis is widely used in mathematical statistics.

Its main task is to determine the closeness and character of the connection between independent (factor) and dependent (resultant) characteristics in a process or phenomenon. Correlation is revealed only through a mass comparison of factors. Its closeness can be determined by a specially calculated correlation coefficient lying in the interval [-1; +1], and the character of the relationship between the indicators can be judged from the correlation field. If Y is the dependent characteristic and X the independent one, then, taking each case as X(j), the correlation field has coordinates (x_j; y_j).

Correlation analysis in economics

The economic activity of business entities depends on a huge number of different factors, which must be considered as a complex, since no single factor can determine the phenomenon under study in its entirety. Only the set of factors in their close interrelation gives a clear idea of the object under study. Multivariate correlation analysis consists of several stages. First, the factors that have the maximum impact on the indicator under study are identified, and the most significant of them are selected for the analysis. The second stage involves collecting and assessing the initial information needed for the correlation analysis. In the third, the character of the relationship between the final indicators and the other factors is studied and modeled; in other words, a mathematical equation is constructed and justified that most accurately expresses the essence of the dependence being analyzed. The last stage involves evaluating the results of the correlation analysis and applying them in practice.
