Kendall rank correlation coefficient. Worked example with solution

To calculate Kendall's rank correlation coefficient r_k, rank the data by one of the features in ascending order and write down the corresponding ranks of the second feature. Then, for each rank of the second feature, count the number of subsequent ranks greater than it, and find the sum of these numbers.

Kendall's rank correlation coefficient is given by

r_k = 4(R_1 + R_2 + … + R_{n−1}) / (n(n−1)) − 1,

where R_i is the number of ranks of the second variable, among positions i+1, …, n, whose magnitude is greater than the magnitude of the i-th rank of this variable.

Percentage points of the distribution of r_k are tabulated, which allows one to test the hypothesis that the correlation coefficient is significant.

For large sample sizes, the critical values of r_k are not tabulated and have to be obtained from approximate formulas, based on the fact that under the null hypothesis H_0: r_k = 0 and large n the random variable

z = r_k √( 9n(n−1) / (2(2n+5)) )

is distributed approximately according to the standard normal law.

40. Relationship between features measured in nominal or ordinal scales

Often there is a problem of checking the independence of two features measured on a nominal or ordinal scale.

Let two features X and Y, with r and s levels respectively, be measured on some set of objects. It is convenient to present the results of such observations in the form of a table called a contingency table.

In the table, u_i (i = 1, …, r) and v_j (j = 1, …, s) are the values taken by the features, and n_ij is the number of objects, out of the total number n, for which feature X took the value u_i and feature Y took the value v_j.

We introduce the following quantities:

n_{i·} = Σ_j n_ij — the number of objects for which X took the value u_i;

n_{·j} = Σ_i n_ij — the number of objects for which Y took the value v_j.

In addition, there are the obvious equalities

Σ_i n_{i·} = Σ_j n_{·j} = Σ_i Σ_j n_ij = n.

Discrete random variables X and Y are independent if and only if

P(X = u_i, Y = v_j) = P(X = u_i) P(Y = v_j)

for all pairs i, j.

Therefore, the hypothesis about the independence of the discrete random variables X and Y can be written as

H_0: p_ij = p_{i·} p_{·j} for all i, j,

where p_ij = P(X = u_i, Y = v_j), p_{i·} = P(X = u_i), p_{·j} = P(Y = v_j). As the alternative, as a rule, one uses the hypothesis

H_1: p_ij ≠ p_{i·} p_{·j} for at least one pair i, j.

The validity of the hypothesis H_0 should be judged on the basis of the sample frequencies n_ij of the contingency table. According to the law of large numbers, as n → ∞ the relative frequencies are close to the corresponding probabilities:

n_ij / n ≈ p_ij, n_{i·} / n ≈ p_{i·}, n_{·j} / n ≈ p_{·j}.

To test the hypothesis H_0 one uses the statistic

χ² = Σ_i Σ_j (n_ij − n_{i·} n_{·j} / n)² / (n_{i·} n_{·j} / n),

which, when the hypothesis is valid, has the χ² distribution with rs − (r + s − 1) = (r − 1)(s − 1) degrees of freedom.

The χ² independence criterion rejects the hypothesis H_0 at significance level α if

χ² > χ²_α((r − 1)(s − 1)),

where χ²_α((r − 1)(s − 1)) is the α-percentage point of the χ² distribution.
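For reference, here is a minimal sketch of this test in Python; the counts in the table are hypothetical, invented purely for illustration.

```python
# A minimal sketch of the chi-square independence test for an r x s
# contingency table. The counts n_ij below are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[20, 15, 25],    # row: X = u_1
                  [10, 30, 20]])   # row: X = u_2

chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2 = {chi2:.3f}, df = {dof}, p = {p_value:.4f}")
# dof = rs - (r + s - 1) = (r - 1)(s - 1) = 2 here;
# H0 (independence) is rejected at level alpha if p_value < alpha.
```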
41. Regression analysis. Basic concepts of regression analysis

For the mathematical description of statistical relationships between the studied variables, the following tasks should be solved:

• choose a class of functions in which it is advisable to look for the best (in a certain sense) approximation of the dependence of interest;

• find estimates of the unknown parameters entering the equation of the desired dependence;

• establish the adequacy of the obtained equation of the desired dependence;

• identify the most informative input variables.

Together, these tasks constitute the subject of regression analysis.

The regression function (or regression) is the dependence of the mathematical expectation of one random variable on the value taken by another random variable, which forms a two-dimensional system of random variables with the first one.

Let there be a system of random variables (X, Y); then the regression function of Y on X is

f(x) = M[Y | X = x],

and the regression function of X on Y is

φ(y) = M[X | Y = y].

The regression functions f(x) and φ(y) are not mutually inverse, unless the relationship between X and Y is functional.

For an n-dimensional vector with coordinates X_1, X_2, …, X_n one can consider the conditional mathematical expectation of any component. For example, for X_1

f(x_2, …, x_n) = M[X_1 | X_2 = x_2, …, X_n = x_n],

called the regression of X_1 on X_2, …, X_n.

To fully define the regression function, it is necessary to know the conditional distribution of the output variable for fixed values ​​of the input variable.

Since in a real situation such information is not available, one usually restricts oneself to the search for a suitable approximating function f_a(x) for f(x), based on statistical data of the form (x_i, y_i), i = 1, …, n. These data are the result of n independent observations y_1, …, y_n of the random variable Y at the values x_1, …, x_n of the input variable; in regression analysis it is assumed that the values of the input variable are specified exactly.

The problem of choosing the best approximating function f_a(x) is the central one in regression analysis, and it has no formalized procedures for its solution. Sometimes the choice is made on the basis of an analysis of the experimental data, more often from theoretical considerations.

If it is assumed that the regression function is sufficiently smooth, then the approximating function f_a(x) can be represented as a linear combination of some set of linearly independent basis functions ψ_k(x), k = 0, 1, …, m−1, i.e., in the form

f_a(x; θ) = Σ_{k=0}^{m−1} θ_k ψ_k(x),

where m is the number of unknown parameters θ_k (in the general case an unknown quantity, refined in the course of model building).

Such a function is linear in parameters, therefore, in the case under consideration, one speaks of a regression function model linear in parameters.

Then the problem of finding the best approximation to the regression line f(x) reduces to finding the parameter values θ for which f_a(x; θ) is most adequate to the available data. One of the methods for solving this problem is the method of least squares.

42. Least squares method

Let the set of points (x_i, y_i), i = 1, …, n, lie on a plane along some straight line.

Then, as the function f_a(x) approximating the regression function f(x) = M[Y | x], it is natural to take a linear function of the argument x:

f_a(x; θ) = θ_0 + θ_1 x.

That is, we have chosen here the basis functions ψ_0(x) ≡ 1 and ψ_1(x) ≡ x. Such a regression is called simple linear regression.

If the set of points (x_i, y_i), i = 1, …, n, lies along some curve, then as f_a(x) it is natural to try, for example, the family of power functions (generalized parabolas)

f_a(x; θ) = θ_0 x^{θ_1}.

This function is non-linear in the parameters θ_0 and θ_1; however, by a functional transformation (in this case, taking logarithms) it can be reduced to a new function f′_a(x), linear in the parameters:

ln f_a(x; θ) = ln θ_0 + θ_1 ln x.
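A sketch of this linearization in Python, under the power-law family assumed above; the data are synthetic, generated only for illustration.

```python
# Sketch: fitting f_a(x) = theta0 * x**theta1 by taking logarithms,
# ln y = ln(theta0) + theta1 * ln(x), which is linear in the parameters.
# The data below are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 30)
y = 2.0 * x**1.5 * np.exp(rng.normal(0.0, 0.05, x.size))  # noisy power law

# Ordinary least squares on the log-transformed data.
theta1, log_theta0 = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"theta0 ~ {np.exp(log_theta0):.3f}, theta1 ~ {theta1:.3f}")
```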
43. Simple Linear Regression

The simplest regression model is the simple (univariate, one-factor, paired) linear model, which has the form:

y_i = a + b x_i + ε_i, i = 1, …, n,

where the ε_i are uncorrelated random variables (errors) with zero mathematical expectation and the same variance σ², and a and b are constant coefficients (parameters) to be estimated from the measured response values y_i.

To find the estimates â and b̂ of the parameters of the linear regression, which determine the straight line that best fits the experimental data,

ŷ = â + b̂x,

the method of least squares is applied.

According to least squares, the parameter estimates â and b̂ are found from the condition of minimizing the sum of squared vertical deviations of the values y_i from the sought regression line:

D = Σ_{i=1}^{n} (y_i − a − b x_i)² → min.

To minimize D, equate to zero the partial derivatives with respect to a and b:

∂D/∂a = −2 Σ (y_i − a − b x_i) = 0,

∂D/∂b = −2 Σ x_i (y_i − a − b x_i) = 0.



As a result, we get the following system of equations for finding the estimates â and b̂:

n a + b Σ x_i = Σ y_i,

a Σ x_i + b Σ x_i² = Σ x_i y_i.

Solving these two equations gives:

b̂ = (n Σ x_i y_i − Σ x_i Σ y_i) / (n Σ x_i² − (Σ x_i)²), â = ȳ − b̂ x̄.

The expressions for the parameter estimates â and b̂ can also be represented as:

b̂ = Σ (x_i − x̄)(y_i − ȳ) / Σ (x_i − x̄)², â = ȳ − b̂ x̄.

Then the empirical equation of the regression line of Y on X can be written as:

ŷ = â + b̂x.

An unbiased estimate of the variance σ² of the deviations of the values y_i from the fitted regression line is given by

s_0² = Σ (y_i − ŷ_i)² / (n − 2).

Example. Let there be ten observations of a random variable Y at fixed values of the variable X. Calculate the parameters of the regression equation, write down the fitted regression line, and estimate the variance of the deviations of the values y_i from it.
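A sketch of such a computation on hypothetical data (ten (x_i, y_i) pairs invented for illustration), following the closed-form formulas above:

```python
# Least-squares estimates for simple linear regression, computed
# directly from the formulas above; the ten data pairs are hypothetical.
import numpy as np

x = np.arange(1.0, 11.0)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
n = x.size

b_hat = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
a_hat = y.mean() - b_hat * x.mean()
y_fit = a_hat + b_hat * x
s0_sq = ((y - y_fit)**2).sum() / (n - 2)   # unbiased estimate of sigma^2
print(f"a = {a_hat:.3f}, b = {b_hat:.3f}, s0^2 = {s0_sq:.4f}")
```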

44. Checking the Significance of the Regression Line

The found estimate b̂ ≠ 0 may be a realization of a random variable whose mathematical expectation is zero, i.e., it may turn out that there is actually no regression dependence.

To deal with this situation, one should test the hypothesis H_0: b = 0 against the competing hypothesis H_1: b ≠ 0.

The significance of the regression line can be tested using analysis of variance.

Consider the following identity:

y_i − ŷ_i = (y_i − ȳ) − (ŷ_i − ȳ).

The value y_i − ŷ_i = ε_i is called the residual and is the difference between two quantities:

• the deviation of the observed value (response) from the overall mean of the responses;

• the deviation of the predicted response value ŷ_i from the same mean.

The above identity can be written as

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i).

By squaring both sides and summing over i, we get (the cross term vanishes):

Σ (y_i − ȳ)² = Σ (ŷ_i − ȳ)² + Σ (y_i − ŷ_i)²,
where the quantities are named as follows:

SC_n = Σ (y_i − ȳ)² — the full (total) sum of squares, equal to the sum of squared deviations of the observations from the mean of the observations;

SC_p = Σ (ŷ_i − ȳ)² — the sum of squares due to regression, equal to the sum of squared deviations of the values of the regression line from the mean of the observations;

SC_0 = Σ (y_i − ŷ_i)² — the residual sum of squares, equal to the sum of squared deviations of the observations from the values of the regression line;

so that SC_n = SC_p + SC_0.

Thus, the spread of the y values about their mean can be attributed, to some extent, to the fact that not all observations lie on the regression line. If all of them did, the sum of squares about the regression line, SC_0, would be zero. It follows that the regression is significant if the sum of squares SC_p is large compared with the sum of squares SC_0.

Calculations to test the significance of the regression are usually arranged in the following analysis-of-variance table:

Source of variation | Sum of squares | Degrees of freedom | Mean square
Regression          | SC_p           | 1                  | s_p² = SC_p
Residual            | SC_0           | n − 2              | s_0² = SC_0 / (n − 2)
Total               | SC_n           | n − 1              |

If the errors ε_i are distributed according to the normal law, then, when the hypothesis H_0: b = 0 is true, the statistic

F = s_p² / s_0² = SC_p / (SC_0 / (n − 2))

is distributed according to Fisher's law with 1 and n − 2 degrees of freedom.

The null hypothesis is rejected at significance level α if the calculated value of the statistic F is greater than the α-percentage point f_{1; n−2; α} of the Fisher distribution.
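A sketch of this F-test, reusing the same hypothetical data as in the sketch for section 43:

```python
# ANOVA F-test for the significance of the regression (H0: b = 0),
# on the same hypothetical data as in the sketch for section 43.
import numpy as np
from scipy.stats import f as fisher_f

x = np.arange(1.0, 11.0)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
n = x.size
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
y_fit = y.mean() + b * (x - x.mean())

sc_p = ((y_fit - y.mean())**2).sum()   # sum of squares due to regression
sc_0 = ((y - y_fit)**2).sum()          # residual sum of squares
F = sc_p / (sc_0 / (n - 2))
f_crit = fisher_f.ppf(1 - 0.05, 1, n - 2)  # alpha = 0.05 percentage point
print(f"F = {F:.2f}, f_crit = {f_crit:.2f}, significant: {F > f_crit}")
```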

45. Checking the adequacy of the regression model. Residual method

The constructed regression model is considered adequate if no other model provides a significant improvement in predicting the response.

If all response values are obtained at different values of x, i.e., there are no repeated response values at the same x_i, then only a limited test of the adequacy of the linear model can be carried out. The basis for such a test is the residuals

d_i = y_i − ŷ_i, i = 1, …, n,

i.e., the deviations from the established pattern.

Since X is a one-dimensional variable, the points (x_i, d_i) can be depicted on a plane in the form of a so-called residual plot. Such a representation sometimes makes it possible to detect some regularity in the behavior of the residuals. In addition, analysis of the residuals allows one to examine the assumption about the law of error distribution.

In the case when the errors are distributed according to the normal law and there is an a priori estimate of their variance σ² (an estimate obtained from previously performed measurements), a more accurate assessment of the adequacy of the model is possible.

Using the Fisher F-test, one can check whether the residual variance s_0² differs significantly from the a priori estimate. If it is significantly larger, there is inadequacy and the model should be revised.

If there is no a priori estimate of σ², but the response measurements Y are repeated two or more times at the same values of X, then these repeated observations can be used to obtain another estimate of σ² (the first being the residual variance). Such an estimate is said to represent "pure" error, because if x is the same for two or more observations, only random changes can affect the results and create scatter between them.

The resulting estimate turns out to be a more reliable estimate of the variance than the estimate obtained by other methods. For this reason, when planning experiments, it makes sense to set up experiments with repetitions.

Suppose there are m different values of X: x_1, x_2, …, x_m, and for each value x_i there are n_i observations of the response Y. The total number of observations is

n = Σ_{i=1}^{m} n_i.

Then the simple linear regression model can be written as:

y_ij = a + b x_i + ε_ij, i = 1, …, m; j = 1, …, n_i.

Let us find the variance of the "pure" errors. This variance is the pooled estimate of σ² obtained by treating the response values y_ij at x = x_i as a sample of size n_i. As a result, the variance of the "pure" errors equals

s_e² = Σ_i Σ_j (y_ij − ȳ_i)² / (n − m),

where ȳ_i is the mean of the n_i observations at x_i.
This variance serves as an estimate of σ² regardless of whether the fitted model is correct.

Let us show that the sum of squares of the "pure" errors is part of the residual sum of squares (the sum of squares entering the expression for the residual variance). The residual for the j-th observation at x_i can be written as

y_ij − ŷ_i = (y_ij − ȳ_i) + (ȳ_i − ŷ_i).

If we square both sides of this equation and then sum over j and over i, we get:

Σ_i Σ_j (y_ij − ŷ_i)² = Σ_i Σ_j (y_ij − ȳ_i)² + Σ_i n_i (ȳ_i − ŷ_i)².

On the left side of this equation is the residual sum of squares. The first term on the right side is the sum of squares of the "pure" errors; the second can be called the sum of squares of inadequacy (lack of fit). The latter sum has m − 2 degrees of freedom, hence the inadequacy variance is

s_lof² = Σ_i n_i (ȳ_i − ŷ_i)² / (m − 2).

The statistic for testing the hypothesis H_0 (the simple linear model is adequate) against the hypothesis H_1 (the simple linear model is inadequate) is the random variable

F = s_lof² / s_e².

If the null hypothesis is true, F has the Fisher distribution with m − 2 and n − m degrees of freedom. The hypothesis of linearity of the regression should be rejected at significance level α if the obtained value of the statistic is greater than the α-percentage point of the Fisher distribution with m − 2 and n − m degrees of freedom.
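A hedged sketch of this lack-of-fit test; the replicated data below (n_i repeated responses at each of m distinct x values) are hypothetical.

```python
# Lack-of-fit test with replicated observations: pure-error variance
# vs. inadequacy (lack-of-fit) variance. Data are hypothetical.
import numpy as np
from scipy.stats import f as fisher_f

groups = {1.0: [2.0, 2.2], 2.0: [3.1, 2.9, 3.0], 3.0: [4.2], 4.0: [4.9, 5.1]}
x = np.array([xi for xi, ys in groups.items() for _ in ys])
y = np.array([yi for ys in groups.values() for yi in ys])
n, m = y.size, len(groups)

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
a = y.mean() - b * x.mean()

ss_pe = sum(((np.array(ys) - np.mean(ys))**2).sum() for ys in groups.values())
ss_res = ((y - (a + b * x))**2).sum()
ss_lof = ss_res - ss_pe                      # sum of squares of inadequacy

F = (ss_lof / (m - 2)) / (ss_pe / (n - m))
f_crit = fisher_f.ppf(0.95, m - 2, n - m)    # alpha = 0.05
print(f"F = {F:.2f}, f_crit = {f_crit:.2f}, inadequate: {F > f_crit}")
```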

46. Checking the adequacy of the regression model (see 45). Analysis of variance

47. Checking the adequacy of the regression model (see 45). Determination coefficient

Sometimes, to characterize the quality of the regression, the sample coefficient of determination R² is used, showing what part (share) of the total sum of squares SC_n is accounted for by the sum of squares due to regression SC_p:

R² = SC_p / SC_n.

The closer R² is to one, the better the regression approximates the experimental data and the more closely the observations adjoin the regression line. If R² = 0, the changes in the response are entirely due to unaccounted-for factors, and the regression line is parallel to the x-axis. In the case of simple linear regression, the coefficient of determination R² is equal to the square of the correlation coefficient, r².

The maximum value R² = 1 can be achieved only when the observations were carried out at all-different values of x. If the data contain repeated experiments, the value of R² cannot reach unity, no matter how good the model is.
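A short sketch computing R² on the same hypothetical data as before, with the r² check for simple linear regression:

```python
# The coefficient of determination R^2 = SC_p / SC_n.
import numpy as np

x = np.arange(1.0, 11.0)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean())**2).sum()
y_fit = y.mean() + b * (x - x.mean())

r_sq = ((y_fit - y.mean())**2).sum() / ((y - y.mean())**2).sum()
r = np.corrcoef(x, y)[0, 1]
print(f"R^2 = {r_sq:.4f}, r^2 = {r**2:.4f}")  # equal for simple regression
```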

48. Confidence intervals for simple linear regression parameters

Just as the sample mean is an estimate of the true (population) mean, the sample parameters of the regression equation, â and b̂, are nothing more than estimates of the true regression coefficients. Different samples give different estimates of the mean, just as different samples will give different estimates of the regression coefficients.

Assuming that the distribution of the errors ε_i is described by the normal law, the parameter estimate b̂ will have a normal distribution with parameters:

M[b̂] = b, D[b̂] = σ² / Σ (x_i − x̄)².

Since the parameter estimate â is a linear combination of independent normally distributed variables, it will also have a normal distribution, with mean and variance:

M[â] = a, D[â] = σ² (1/n + x̄² / Σ (x_i − x̄)²).
In this case, the (1 − α) confidence interval for the variance σ², taking into account that the ratio (n − 2)s_0²/σ² is distributed according to the χ² law with n − 2 degrees of freedom, is determined by the expression

(n − 2)s_0² / χ²_{n−2; α/2} < σ² < (n − 2)s_0² / χ²_{n−2; 1−α/2}.
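A sketch of these confidence intervals for b, a and σ² on the same hypothetical data, following the distributions stated above:

```python
# (1 - alpha) confidence intervals for b, a and sigma^2 in simple
# linear regression; the data pairs are hypothetical.
import numpy as np
from scipy.stats import t, chi2

x = np.arange(1.0, 11.0)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
n, alpha = x.size, 0.05

sxx = ((x - x.mean())**2).sum()
b = ((x - x.mean()) * (y - y.mean())).sum() / sxx
a = y.mean() - b * x.mean()
s0_sq = ((y - (a + b * x))**2).sum() / (n - 2)

t_q = t.ppf(1 - alpha / 2, n - 2)
b_ci = (b - t_q * np.sqrt(s0_sq / sxx), b + t_q * np.sqrt(s0_sq / sxx))
se_a = np.sqrt(s0_sq * (1 / n + x.mean()**2 / sxx))
a_ci = (a - t_q * se_a, a + t_q * se_a)
var_ci = ((n - 2) * s0_sq / chi2.ppf(1 - alpha / 2, n - 2),
          (n - 2) * s0_sq / chi2.ppf(alpha / 2, n - 2))
print(b_ci, a_ci, var_ci)
```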
49. Confidence intervals for the regression line. Confidence interval for dependent variable values

Usually we do not know the true values ​​of the regression coefficients a and b. We only know their estimates. In other words, the true regression line can go higher or lower, be steeper or flatter than that built on the sample data. We calculated confidence intervals for the regression coefficients. You can also calculate the confidence region for the regression line itself.

Suppose that for a simple linear regression we need to construct a (1 − α) confidence interval for the mathematical expectation of the response Y at the value x = x_0. This mathematical expectation is a + bx_0, and its estimate is

ŷ(x_0) = â + b̂x_0.

Since â = ȳ − b̂x̄, it follows that ŷ(x_0) = ȳ + b̂(x_0 − x̄).

The obtained estimate of the mathematical expectation is a linear combination of uncorrelated normally distributed quantities and therefore also has a normal distribution, centered at the true value of the conditional mathematical expectation, with variance

D[ŷ(x_0)] = σ² (1/n + (x_0 − x̄)² / Σ (x_i − x̄)²).

Therefore, the confidence interval for the regression line at each value x_0 can be represented as

ŷ(x_0) ± t_{n−2; α/2} s_0 √( 1/n + (x_0 − x̄)² / Σ (x_i − x̄)² ).
As can be seen, the confidence interval is narrowest when x_0 equals the mean value x̄ and widens as x_0 moves away from the mean in either direction.

To obtain a set of joint confidence intervals valid for the entire regression function, over its whole length, in the above expression t_{n−2; α/2} must be replaced by √(2 f_{2; n−2; α}), where f_{2; n−2; α} is the α-percentage point of the Fisher distribution with 2 and n − 2 degrees of freedom (the Working–Hotelling band).
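A sketch of the pointwise interval at an arbitrary x_0, on the same hypothetical data:

```python
# Pointwise (1 - alpha) confidence interval for the regression line
# at x0; the interval widens as x0 moves away from the mean of x.
import numpy as np
from scipy.stats import t

x = np.arange(1.0, 11.0)
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8, 10.1, 10.9])
n, alpha = x.size, 0.05

sxx = ((x - x.mean())**2).sum()
b = ((x - x.mean()) * (y - y.mean())).sum() / sxx
a = y.mean() - b * x.mean()
s0 = np.sqrt(((y - (a + b * x))**2).sum() / (n - 2))

x0 = 7.5                                   # arbitrary point of interest
half = (t.ppf(1 - alpha / 2, n - 2) * s0
        * np.sqrt(1 / n + (x0 - x.mean())**2 / sxx))
print(f"M[Y|x0] in [{a + b * x0 - half:.3f}, {a + b * x0 + half:.3f}]")
```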

The rank correlation coefficient characterizes the general character of a nonlinear dependence: an increase or decrease of the resultant feature as the factor feature increases. It is an indicator of the tightness of a monotonic nonlinear relationship.


The coefficient proposed by Kendall is built on relations of the "more–less" type, the validity of which was established when the scales were constructed.
Let us single out a pair of objects and compare their ranks by one feature and by the other. If the ranks form a direct order with respect to a feature (i.e., the order of the natural series), the pair is assigned +1; if the reverse order, −1. For the selected pair, the corresponding plus-or-minus ones (by feature X and by feature Y) are multiplied. The result is obviously +1 if the ranks of the pair for both features are in the same order, and −1 if they are in reverse order.
If the rank orders are the same for all pairs with respect to both features, then the sum of the ones assigned to all pairs of objects is maximal and equal to the number of pairs, C_N² = N(N−1)/2. If the rank orders of all pairs are reversed, the sum is −C_N². In the general case, C_N² = P + Q, where P is the number of positive and Q the number of negative ones assigned to pairs when comparing their ranks with respect to both features.
The value

τ = (P − Q) / C_N² = 2(P − Q) / (N(N−1))

is called the Kendall coefficient.
It can be seen from the formula that the coefficient τ is the difference between the proportion of pairs of objects that have the same order with respect to both features (relative to the number of all pairs) and the proportion of pairs that do not.
For example, a coefficient value of 0.60 means that 80% of pairs have the same order of objects and 20% do not (80% + 20% = 100%; 0.80 − 0.20 = 0.60). That is, τ can be interpreted as the difference between the probabilities of coincidence and non-coincidence of the orders with respect to both features for a randomly selected pair of objects.
In the general case, the calculation of τ (more precisely, of P or Q), even for N of the order of 10, is cumbersome.
Let us show how to simplify the calculations.


Example. The relationship between the volume of industrial output and investments in fixed capital in 10 regions of one of the federal districts of the Russian Federation in 2003 is characterized by the following data:


Calculate the Spearman and Kendall rank correlation coefficients. Check their significance at α=0.05. Formulate a conclusion about the relationship between the volume of industrial production and investments in fixed assets in the regions of the Russian Federation under consideration.

Solution. Assign ranks to the feature Y and factor X.


Let us sort the data by X.
In the Y series, to the right of the rank 3 there are 7 ranks greater than 3; therefore, 3 contributes a term 7 to P.
To the right of 1 there are 8 ranks greater than 1 (these are 2, 4, 6, 9, 5, 10, 7, 8), i.e., 8 enters P, and so on. As a result, P = 37, Q = 8, and by the formulas we have

τ = 2(P − Q) / (n(n−1)) = 2(37 − 8) / (10 · 9) ≈ 0.644.

  X      Y      rank X, d_x   rank Y, d_y    P    Q
 18.4   5.57        1             3          7    2
 20.6   2.88        2             1          8    0
 21.5   4.12        3             2          7    0
 35.7   7.24        4             4          6    0
 37.1   9.67        5             6          4    1
 39.8  10.48        6             9          1    3
 51.1   8.58        7             5          3    0
 54.4  14.79        8            10          0    2
 64.6  10.22        9             7          1    0
 90.6  10.45       10             8          0    0
 Sum                                        37    8


Simplified formulas:

τ = 4P / (n(n−1)) − 1 = 4 · 37 / (10 · 9) − 1 ≈ 0.644,

or, equivalently, τ = 1 − 4Q / (n(n−1)) = 1 − 4 · 8 / 90 ≈ 0.644.

To test the null hypothesis that Kendall's general rank correlation coefficient is zero against the competing hypothesis H_1: τ ≠ 0 at significance level α, it is necessary to calculate the critical point

T_kp = z_kp √( 2(2n+5) / (9n(n−1)) ),

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Ф(z_kp) = (1 − α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the features is not significant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the features.
Let us find the critical point z_kp:
Ф(z_kp) = (1 − α)/2 = (1 − 0.05)/2 = 0.475, whence z_kp = 1.96.

Let us find the critical point:

T_kp = 1.96 √( 2(2·10 + 5) / (9·10·9) ) = 1.96 √(50/810) ≈ 0.49.

Since τ ≈ 0.644 > T_kp ≈ 0.49, we reject the null hypothesis: the rank correlation between the volume of industrial output and investment in fixed capital is significant.
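As a cross-check, Kendall's τ for this example can be computed with SciPy; the X and Y values are taken from the table above.

```python
# Cross-check of the worked example: Kendall's tau for the ten regions
# (X = industrial output, Y = investment); there are no tied ranks.
from scipy.stats import kendalltau

X = [18.4, 20.6, 21.5, 35.7, 37.1, 39.8, 51.1, 54.4, 64.6, 90.6]
Y = [5.57, 2.88, 4.12, 7.24, 9.67, 10.48, 8.58, 14.79, 10.22, 10.45]

tau, p_value = kendalltau(X, Y)
print(f"tau = {tau:.3f}, p = {p_value:.4f}")   # tau ~ 0.644
```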

Example. Using data on the volume of construction and installation work performed in-house and the number of employees in 10 construction companies in one of the cities of the Russian Federation, determine the relationship between these features using the Kendall coefficient.

Solution. Assign ranks to the feature Y and the factor X.
Let us arrange the objects so that their ranks in X form the natural series. Since the marks assigned to each pair of this series are positive, the values "+1" entering P will be generated only by those pairs whose ranks in Y form a direct order.
They are easy to count by sequentially comparing the ranks of each object in the Y series with the remaining ones.
Kendall coefficient:

τ = 2(P − Q) / (n(n−1)),

or, in simplified form,

τ = 4P / (n(n−1)) − 1 = 1 − 4Q / (n(n−1)).

Solution.
Let us sort the data by X.
In the Y series, to the right of the rank 2 there are 8 ranks greater than 2, so 2 contributes a term 8 to P.
To the right of 4 there are 6 ranks greater than 4 (these are 7, 5, 6, 8, 9, 10), i.e., 6 enters P, and so on. As a result, P = 29, Q = 16, and by the formulas we have

τ = 2(P − Q) / (n(n−1)) = 2(29 − 16) / (10 · 9) ≈ 0.289.

  X     Y     rank X, d_x   rank Y, d_y    P    Q
  38   292        1             2          8    1
  50   302        2             4          6    2
  52   366        3             7          3    4
  54   312        4             5          4    2
  59   359        5             6          3    2
  61   398        6             8          2    2
  66   401        7             9          1    2
  70   298        8             3          1    1
  71   283        9             1          1    0
  73   413       10            10          0    0
 Sum                                      29   16


Simplified formulas:

τ = 4P / (n(n−1)) − 1 = 4 · 29 / 90 − 1 ≈ 0.289,

τ = 1 − 4Q / (n(n−1)) = 1 − 4 · 16 / 90 ≈ 0.289.

To test the null hypothesis that Kendall's general rank correlation coefficient is zero against the competing hypothesis H_1: τ ≠ 0 at significance level α, it is necessary to calculate the critical point

T_kp = z_kp √( 2(2n+5) / (9n(n−1)) ),

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Ф(z_kp) = (1 − α)/2.
If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the features is not significant. If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation.
Let us find the critical point z_kp:
Ф(z_kp) = (1 − α)/2 = (1 − 0.05)/2 = 0.475; according to the Laplace table, z_kp = 1.96.
Let us find the critical point:

T_kp = 1.96 √( 2(2·10 + 5) / (9·10·9) ) ≈ 0.49.

Since τ ≈ 0.289 < T_kp ≈ 0.49, there is no reason to reject the null hypothesis: the rank correlation between the volume of work and the number of employees is not significant.

Brief theory

The Kendall correlation coefficient is used when the variables are represented by two ordinal scales, provided that there are no tied ranks. The calculation of the Kendall coefficient involves counting the numbers of matches and inversions.

This coefficient varies within [−1, 1] and is calculated by the formula

τ = (P − Q) / (n(n−1)/2).

For the calculation, all units are ranked by one feature; in the series of the other feature, for each rank one counts the number of subsequent ranks exceeding the given one (denote their sum by P) and the number of subsequent ranks below the given one (denote their sum by Q).

It can be shown that

P + Q = n(n−1)/2,

and Kendall's rank correlation coefficient can be written as

τ = 4P / (n(n−1)) − 1 = 1 − 4Q / (n(n−1)).

To test the null hypothesis that Kendall's general rank correlation coefficient is zero against the competing hypothesis H_1: τ ≠ 0 at significance level α, it is necessary to calculate the critical point

T_kp = z_kp √( 2(2n+5) / (9n(n−1)) ),

where n is the sample size and z_kp is the critical point of the two-sided critical region, found from the table of the Laplace function by the equality Ф(z_kp) = (1 − α)/2.

If |τ| < T_kp, there is no reason to reject the null hypothesis: the rank correlation between the features is not significant.

If |τ| > T_kp, the null hypothesis is rejected: there is a significant rank correlation between the features.

Problem solution example

The task

When hiring seven candidates for vacant positions, two tests were offered. The test results (in points) are shown in the table:

Candidate:   1    2    3    4    5    6    7
Test 1:     31   82   25   26   53   30   29
Test 2:     21   55    8   27   32   42   26

Calculate the Kendall rank correlation coefficient between the results of the two tests and assess its significance at the given level.

The solution of the problem

Let us calculate the Kendall coefficient.

The ranks of the factor feature are arranged strictly in ascending order, and the corresponding ranks of the resultant feature are written alongside. For each rank, among the ranks following it, one counts the number of ranks greater than it (entered in the column P) and the number of ranks smaller than it (entered in the column Q).

 rank X   rank Y    P    Q
   1        1       6    0
   2        4       3    2
   3        3       3    1
   4        6       1    2
   5        2       2    0
   6        5       1    0
   7        7       0    0
 Sum               16    5

Hence P = 16, Q = 5, and

τ = 2(P − Q) / (n(n−1)) = 2(16 − 5) / (7 · 6) ≈ 0.524.
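The same matches/inversions count can be reproduced with SciPy on the candidates' raw scores:

```python
# Kendall's tau for the seven candidates' test scores (no tied ranks).
from scipy.stats import kendalltau

test1 = [31, 82, 25, 26, 53, 30, 29]
test2 = [21, 55, 8, 27, 32, 42, 26]

tau, p_value = kendalltau(test1, test2)
print(f"tau = {tau:.3f}, p = {p_value:.4f}")   # tau = (16 - 5)/21 ~ 0.524
```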

One of the factors limiting the application of criteria based on the assumption of normality is the sample size. As long as the sample is large enough (for example, 100 or more observations), you can assume that the sampling distribution is normal, even if you are not sure that the distribution of the variable in the population is normal. However, if the sample is small, these criteria should be used only if there is confidence that the variable really is normally distributed; and there is no way to test this assumption on a small sample.

The use of criteria based on the assumption of normality is also limited by the scale of measurement (see the chapter Elementary concepts of data analysis). Statistical methods such as t-test, regression, etc. assume that the original data is continuous. However, there are situations where the data is simply ranked (measured on an ordinal scale) rather than accurately measured.

A typical example is the ratings of sites on the Internet: the first position is occupied by the site with the maximum number of visitors, the second by the site with the maximum number of visitors among the remaining sites (with the first site removed), and so on. Knowing the ratings, we can say that the number of visitors to one site is greater than the number of visitors to another, but we cannot say by how much. Imagine you have 5 sites, A, B, C, D, E, which occupy the top 5 places. Suppose that in the current month the arrangement was A, B, C, D, E, and in the previous month D, E, A, B, C. The question is whether there have been significant changes in the site ratings or not. In this situation, we obviously cannot use the t-test to compare these two sets of data, and we move into the realm of specific probabilistic calculations (and any statistical test contains a probabilistic calculation!). We reason approximately as follows: how likely is it that the difference between the two arrangements is due to purely random causes, or is this difference too large to be explained by pure chance? In these arguments we use only the ranks, or permutations, of the sites and do not use any specific form of the distribution of the number of visitors to them.

For the analysis of small samples and for data measured on poor scales, non-parametric methods are used.

Brief overview of nonparametric procedures

Essentially, for every parametric criterion, there is at least one non-parametric alternative.

In general, these procedures fall into one of the following categories:

  • difference criteria for independent samples;
  • difference criteria for dependent samples;
  • assessment of the degree of dependence between variables.

In general, the approach to statistical criteria in data analysis should be pragmatic and not burdened with unnecessary theoretical considerations. With the STATISTICA system at your disposal, you can easily apply several criteria to your data. Knowing about some of the pitfalls of the methods, you will choose the right solution through experimentation. The progression is quite natural: if you need to compare the values of two variables, you use the t-test; however, it should be remembered that it is based on the assumptions of normality and equality of variances in each group. Departing from these assumptions leads to nonparametric tests, which are especially useful for small samples.

The development of the t-test leads to analysis of variance, which is used when the number of compared groups is greater than two. The corresponding development of nonparametric procedures leads to nonparametric analysis of variance, although it is much poorer than classical analysis of variance.

To assess the dependence, or, to put it somewhat grandiloquently, the degree of closeness of the connection, the Pearson correlation coefficient is calculated. Strictly speaking, its use has limitations associated, for example, with the type of scale on which the data are measured and with non-linearity of the dependence; therefore, as an alternative, nonparametric, or so-called rank, correlation coefficients are also used, which are applied, for example, to ranked data. If the data are measured on a nominal scale, it is natural to present them in contingency tables, to which Pearson's chi-square test with various variations and corrections for accuracy is applied.

So, in essence, there are only a few types of criteria and procedures that you need to know and be able to use, depending on the specifics of the data. You need to determine which criterion should be applied in a particular situation.

Nonparametric methods are most appropriate when the sample size is small. If there is a lot of data (for example, n > 100), it often makes no sense to use nonparametric statistics.

If the sample size is very small (for example, n = 10 or less), then the significance levels for those nonparametric tests that use the normal approximation can only be considered as rough estimates.

Differences between independent groups. If there are two samples (eg men and women) that need to be compared with respect to some mean value, such as mean blood pressure or white blood cell count, then an independent sample t-test can be used.

Nonparametric alternatives to this test are the Wald–Wolfowitz runs test and the Mann–Whitney U test.

Geometric mean

The geometric mean is calculated as G = (x_1 · x_2 · … · x_n)^(1/n) = exp( Σ ln(x_i) / n ), where x_i is the i-th value and n is the number of observations. If the variable contains negative values or zero (0), the geometric mean cannot be calculated.

Harmonic mean

The harmonic mean is sometimes used to average frequencies. It is calculated by the formula H = n / Σ(1/x_i), where H is the harmonic mean, n is the number of observations, and x_i is the value of observation number i. If the variable contains zero (0), the harmonic mean cannot be calculated.
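A quick sketch of both means with SciPy (all values must be positive):

```python
# Geometric and harmonic means with SciPy.
from scipy.stats import gmean, hmean

data = [2.0, 4.0, 8.0]
print(gmean(data))  # (2 * 4 * 8) ** (1/3) = 4.0
print(hmean(data))  # 3 / (1/2 + 1/4 + 1/8) ~ 3.4286
```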

Variance and standard deviation

Sample variance and standard deviation are the most commonly used measures of variability (variation) in data. The variance is calculated as the sum of the squared deviations of the values ​​of the variable from the sample mean, divided by n-1 (but not by n). The standard deviation is calculated as the square root of the variance estimate.

Range

The range of a variable is a measure of variability, calculated as the maximum minus the minimum.

Quartile range

The quartile range, by definition, is the upper quartile minus the lower quartile (the 75th percentile minus the 25th percentile). Since the 75th percentile (upper quartile) is the value to the left of which 75% of the observations lie, and the 25th percentile (lower quartile) is the value to the left of which 25% of the observations lie, the quartile range is the interval around the median that contains 50% of the observations (values of the variable).

Skewness

Skewness is a characteristic of the shape of a distribution. The distribution is skewed to the left if the skewness is negative, and to the right if it is positive. The skewness of the standard normal distribution is 0. Skewness is related to the third moment and is defined as

skewness = n · M_3 / [(n−1)(n−2) s³],

where M_3 = Σ (x_i − x̄)³, s³ is the standard deviation raised to the third power, and n is the number of observations.

Kurtosis

Kurtosis is a characteristic of the distribution shape, namely a measure of the sharpness of its peak (relative to the normal distribution, whose kurtosis is 0). As a rule, distributions with a sharper peak than the normal distribution have positive kurtosis, and distributions with a flatter peak have negative kurtosis. Kurtosis is related to the fourth moment and is determined by the formula

kurtosis = [n(n+1) M_4 − 3 M_2² (n−1)] / [(n−1)(n−2)(n−3) s⁴],

where M_j = Σ (x_i − x̄)^j, s⁴ is the standard deviation raised to the fourth power, and n is the number of observations.
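A sketch of these two bias-corrected formulas, checked against pandas, which (to my knowledge) uses the same adjusted estimators; the data are hypothetical.

```python
# Adjusted skewness and kurtosis per the formulas above, cross-checked
# against pandas .skew() / .kurt(). Data are hypothetical.
import numpy as np
import pandas as pd

x = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 8.0])
n, s = x.size, x.std(ddof=1)
m2 = ((x - x.mean())**2).sum()
m3 = ((x - x.mean())**3).sum()
m4 = ((x - x.mean())**4).sum()

skew = n * m3 / ((n - 1) * (n - 2) * s**3)
kurt = (n * (n + 1) * m4 - 3 * m2**2 * (n - 1)) / ((n - 1) * (n - 2) * (n - 3) * s**4)
print(skew, pd.Series(x).skew())   # should agree
print(kurt, pd.Series(x).kurt())   # should agree
```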

The Kendall correlation coefficient is used when the variables are represented by two ordinal scales, provided that there are no tied ranks. The calculation of the Kendall coefficient involves counting the numbers of matches and inversions. Let us consider this procedure using the previous problem as an example.

The algorithm for solving the problem is as follows:

    1. We rearrange the data of Table 8.5 so that one of the rows (in this case the row x_i) is ranked. In other words, we arrange the pairs (x, y) in the proper order and enter the data in columns 1 and 2 of Table 8.6.

Table 8.6

x_i | y_i | coincidences | inversions

2. Determine the "degree of ranking" of the second row (y_i). This procedure is carried out in the following sequence:

a) take the first value of the unranked series, "3". Count how many of the ranks below this number are greater than the compared value. There are 9 such values (the numbers 6, 7, 4, 9, 5, 11, 8, 12 and 10); enter the number 9 in the "coincidence" column. Then count how many values are less than three. There are 2 such values (ranks 1 and 2); enter the number 2 in the "inversion" column.

b) discard the number 3 (we have already worked with it) and repeat the procedure for the next value, "6": the number of coincidences is 6 (ranks 7, 9, 11, 8, 12 and 10), and the number of inversions is 4 (ranks 1, 2, 4 and 5). Enter 6 in the "coincidence" column and 4 in the "inversion" column.

c) the procedure is repeated in a similar way until the end of the series; each "worked out" value is excluded from further consideration (only the ranks lying below this number are counted).

Note

To avoid mistakes in the calculations, keep in mind that with each "step" the sum of coincidences and inversions decreases by one; this is understandable, since each time one value is excluded from consideration.

3. The sum of coincidences (P) and the sum of inversions (Q) are calculated; the data are entered into one of the three interchangeable formulas for the Kendall coefficient (8.10), and the corresponding calculations are carried out:

τ = (P − Q) / (n(n−1)/2) = 4P/(n(n−1)) − 1 = 1 − 4Q/(n(n−1)).   (8.10)

In our case:

τ = (P − Q) / (n(n−1)/2) = 36/66 ≈ 0.55.

Table XIV of the Appendix gives the critical values of the coefficient for this sample size: τ_cr = 0.45 and 0.59. The empirically obtained value is compared with the tabulated one.

Conclusion

τ = 0.55 > τ_cr = 0.45. The correlation is statistically significant at the first level (β_1 = 0.95).

Note:

If necessary (for example, in the absence of a table of critical values), the statistical significance of Kendall's τ can be determined by the following formula:

z = S* / √( n(n−1)(2n+5) / 18 ),   (8.11)

where S* = P − Q + 1 if P < Q, and S* = P − Q − 1 if P > Q.

The values of z for the corresponding significance level correspond to the Pearson measure and are found from the corresponding tables (not included in the Appendix). For the standard significance levels, z_cr = 1.96 (for β_1 = 0.95) and 2.58 (for β_2 = 0.99). The Kendall correlation coefficient is statistically significant if z > z_cr.

In our case S* = P − Q − 1 = 35 and z = 2.40, i.e., the initial conclusion is confirmed: the correlation between the features is statistically significant at the first significance level.
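A sketch of this check in Python; note that n, P and Q are inferred here from the worked numbers in the text (n = 12 ranks, S* = 35, hence P − Q = 36 and, with P + Q = 66, P = 51 and Q = 15), since they are not given explicitly.

```python
# The normal approximation (8.11) for the significance of Kendall's tau.
# n, P and Q are inferred from the worked numbers in the text.
import math

n, P, Q = 12, 51, 15
S_star = (P - Q - 1) if P > Q else (P - Q + 1)
z = S_star / math.sqrt(n * (n - 1) * (2 * n + 5) / 18)
print(f"z = {z:.2f}")  # ~2.40 > z_cr = 1.96, significant at the 0.95 level
```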
