Mathematical modeling: an example of a multiple regression model

1) Multivariate regression equation:

Economic interpretation of the resulting model:

Apartments in area A are 15.5% cheaper than in area B. An increase in the total area by 1 unit raises the cost of an apartment by $1.25 thousand. An increase in the living area by 1 unit raises the cost by $0.2 thousand. An increase in the kitchen area by 1 unit raises the cost by $0.8 thousand. For the floor factor (middle versus outermost floors), the cost increases by $0.05 thousand. For the building-type factor (brick versus panel), the cost increases by $24.8 thousand. Each additional month in the term of delivery reduces the cost of the apartment by $0.4 thousand.

Minimum sample size:

That is, to obtain a statistically significant model, 45 apartments must be selected and the necessary data collected for each of them.

  • 2) The model uses one dummy variable, the district, because two districts, "a" and "b", take part in building the model; they are assigned the numeric values "1" and "2", respectively.
  • 3) Check the factors for multicollinearity:

This condition is satisfied for several pairs of factors, so multicollinear factors are found.

To eliminate multicollinearity, the variable elimination method is used. To decide which variable to exclude, it is necessary to know how strongly each factor feature is associated with the resulting feature Y. This dependence is reflected in the last row of the pair correlation matrix. In each multicollinear pair, the factor with the smaller correlation with Y is excluded.

Consider the first pair of multicollinear factors. The larger of the two correlations with Y is 0.899, so the corresponding factor is kept.

For the second pair, 0.885 > 0.690. Therefore, the factor with the correlation 0.885 is included in the model, because its connection with the resulting feature is stronger than that of the other factor. The remaining pairs are considered in the same way.
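The elimination rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the cut-off of 0.8 for the correlation between two factors, the factor names `x1`, `x2`, `x3`, and the data are all hypothetical, since the source does not preserve its threshold or its factor symbols.

```python
from math import sqrt

def pearson(u, v):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def drop_multicollinear(factors, y, threshold=0.8):
    """Return the factor names kept after pairwise elimination.

    factors: dict name -> list of values; y: list of response values.
    threshold is an assumed cut-off for |r| between two factors.
    """
    kept = list(factors)
    # Correlation of every factor with the response (last row of the matrix).
    r_y = {name: abs(pearson(col, y)) for name, col in factors.items()}
    changed = True
    while changed:
        changed = False
        for i, a in enumerate(kept):
            for b in kept[i + 1:]:
                if abs(pearson(factors[a], factors[b])) > threshold:
                    # Exclude the factor that is less correlated with y.
                    kept.remove(a if r_y[a] < r_y[b] else b)
                    changed = True
                    break
            if changed:
                break  # restart the scan on the reduced factor list
    return kept

# Hypothetical data: x2 nearly duplicates x1, x3 is unrelated to both.
factors = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "x3": [5, 1, 4, 2, 6, 3],
}
y = [2.0, 4.1, 6.0, 8.2, 9.9, 12.1]
kept = drop_multicollinear(factors, y)
print(kept)
```

In this made-up data set `x1` and `x2` form a multicollinear pair, and `x2` is dropped because its correlation with `y` is slightly weaker.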

The subject of regression analysis is the study of the dependence of a random variable on a set of random and non-random variables. On the basis of sample observations, regression analysis makes it possible to build a mathematical model of the dependence of the resulting feature on the factor features.

Depending on the number of factor features, a regression model can be paired or multivariate. In general form, the dependence of the resulting feature $y$ on the joint and simultaneous influence of the factor features $x_1, x_2, \ldots, x_m$ ($m$ is the number of factor features) is written as

$y = f(x_1, x_2, \ldots, x_m) + \varepsilon$, (3.28)

where $f(x_1, x_2, \ldots, x_m)$ is the regression function, which expresses the objective regular dependence of the resulting feature on the joint influence of the factor features; $\varepsilon$ is a random variable expressing the influence of uncontrolled and unaccounted-for factors, as well as measurement errors.

From expression (3.28) we have

$\varepsilon = y - f(x_1, x_2, \ldots, x_m)$, (3.29)

i.e. $\varepsilon$ is the deviation of the resulting feature from the average value calculated from the regression function.

The estimate of the regression function is the regression equation

$\hat{y} = \hat{f}(x_1, x_2, \ldots, x_m)$. (3.30)

For paired linear regression, expression (3.28) has the form

$y = \beta_0 + \beta_1 x + \varepsilon$, (3.31)

where $\beta_0, \beta_1$ are the parameters of the regression function. The regression equation for this case is

$\hat{y} = b_0 + b_1 x$, (3.32)

where $b_0, b_1$ are estimates of the parameters of the regression function, called the parameters of the regression equation or simply the regression parameters.

The technique for obtaining paired linear regression equations is given in paragraphs 3.7 and 3.10.
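For orientation, the standard least-squares estimates for the paired linear case are shown below. The referenced sections are not reproduced here and may use different notation, so the symbols are assumptions: $b_0$ and $b_1$ are the intercept and slope, $(x_i, y_i)$ are the $n$ observations, and $\bar{x}$, $\bar{y}$ are the sample means.

```latex
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The second formula shows that the fitted line always passes through the point of means $(\bar{x}, \bar{y})$.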

Pairwise Nonlinear Regression Analysis

Suppose the shape of the correlation field suggests a non-linear dependence of the resulting feature on the factor feature. In general form, the pairwise non-linear regression equation is

$\hat{y} = f(x; b_0, b_1, \ldots, b_k)$. (3.33)

The regression parameters are to be determined by the least squares method, which is written mathematically as

$F = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \to \min$.

The minimization is carried out in a spreadsheet using the Solver add-in ("Search for a solution").

The layout of the data on the worksheet when determining the regression parameters of example 3.5 with the Solver add-in is shown in Table 3.15.

Table 3.15. Information placement

Objective function value

F2: =SUMXMY2(E4:E18; D4:D18);

E4: =SUMPRODUCT(A4:C4; $A$2:$C$2);

H2: =CORREL(D4:D18; E4:E18); I2: =AVERAGE(D4:D18).
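The same calculation can be sketched outside the spreadsheet. The sketch below is not the procedure of example 3.5: the data are hypothetical, and the three-column layout is assumed to hold the terms 1, x and x² of a quadratic model, which the source does not state. Because such a model is linear in its parameters, the sketch solves the normal equations directly instead of iterating as the add-in does.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                k = m[r][col] / m[col][col]
                m[r] = [u - k * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(rows, y):
    """Parameters minimizing sum((y_i - rows_i . b)^2), via the normal equations."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated exactly from y = 1 + 2x + 0.5x^2.
x = [0, 1, 2, 3, 4, 5]
y = [1 + 2 * v + 0.5 * v * v for v in x]

rows = [[1, v, v * v] for v in x]   # analogue of columns A:C (assumed terms)
b = least_squares(rows, y)          # analogue of the parameter cells A2:C2
# Analogue of column E: model values as a sum of products per row.
y_hat = [sum(c * bi for c, bi in zip(r, b)) for r in rows]
# Analogue of cell F2: the sum of squared differences (objective function).
F = sum((yh - yi) ** 2 for yh, yi in zip(y_hat, y))
# Analogue of cell I2: the average of the observed values.
y_mean = sum(y) / len(y)
print([round(v, 6) for v in b], round(F, 9), y_mean)
```

Since the made-up data lie exactly on a parabola, the recovered parameters are close to 1, 2 and 0.5 and the objective function is close to zero.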

The solution results are presented in Table 3.16.

Table 3.16. Calculation results

Analysis of calculation results. The result of the calculation is:

      pairwise non-linear regression equation


Figure 3.7 shows the pairwise non-linear regression equation obtained by constructing a trend line. An analysis of the equations confirms their identity. Comparison of the calculation results in paired linear and non-linear regression analysis shows that they differ insignificantly, i.e. for the features under consideration, a linear regression model can be adopted.

Fig. 3.7. Trendline equation

Multivariate Linear Regression Analysis

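The generalized mathematical model of the multivariate linear regression function (3.28) has the form

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_m x_m + \varepsilon$,

where $m$ is the number of factor features; $y$ is the resulting feature; $\varepsilon$ is the deviation; $\beta_0, \beta_1, \ldots, \beta_m$ are the parameters of the regression function.

The multivariate linear regression equation for this case is

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_m x_m$.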

The requirement for factor features included in the mathematical model: the factors must be independent of each other. Violation of this condition is called multicollinearity.

The coefficients of the regression equation are obtained with the "Regression" tool of the spreadsheet's Analysis ToolPak.

The quality of the resulting model is analyzed in the same way as for pairwise linear regression.
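As a reminder of what such a tool computes, the least-squares solution in matrix form is given below. The notation is an assumption, not the source's: $X$ is the $n \times (m+1)$ matrix of observations with a leading column of ones, $\mathbf{y}$ is the vector of observed values of the resulting feature, and $\mathbf{b}$ is the vector of estimated coefficients.

```latex
\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\mathbf{b} = (X^{\top} X)^{-1} X^{\top} \mathbf{y},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The formula also shows why multicollinearity is harmful: when factor features are almost linearly dependent, $X^{\top} X$ is close to singular, and the estimates $\mathbf{b}$ become unstable.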

The multiple linear regression methods we discuss can be very useful, but also very dangerous if they are misused or misinterpreted. Before embarking on a large task using multiple regression methods, it makes sense, as far as possible, to pre-plan all the work in relation to a specific goal and outline the control activities carried out along the way. Such planning will be the subject of this chapter. First, however, we will discuss the three main types of mathematical models commonly used in science:

1. Functional model.

2. Control model.

3. Predictive model.

FUNCTIONAL MODEL

If the “true” functional relationship between the response and the predictors is known in some problem, the experimenter is able to understand and predict the response, and even control it. However, situations in which such a model can be proposed are rare in practice. Even in these cases, the functional equations are usually very complex, difficult to understand and apply, and most often non-linear. In the most complex cases, numerical integration of such equations may be required. Examples of nonlinear models were mentioned in Chap. 5, and their construction will be discussed in Chap. 10. For such models, linear regression methods are either not applicable or applicable only for approximating the true models in iterative estimation procedures.

CONTROL MODEL

The functional model, even if it is fully known, is not always suitable for controlling the output variable (response). For example, in a problem about steam used in a factory, one of the most important variables is the outside temperature, which is

nothing better, you can choose a line of behavior for further experimentation, specifying important variables, and, which is very useful, weeding out irrelevant variables.

However, the use of multiple regression requires special care to avoid misunderstandings and incorrect conclusions. The organization of a scheme for solving problems using the methods of multiple regression analysis is not only useful, but also necessary.

Fig. 8.1. Flowchart of the model building procedure

This chapter is only a blueprint, and any use of the proposed or similar scheme will require special "tuning" to the specific situation.

Although the plan below is intended to develop a predictive mathematical model, it is quite general; it can be used in the construction of both functional and control models. We will pay special attention to tasks with “unmanaged data”. The scheme is divided into three stages - planning, development and use. The block diagram is shown in fig. 8.1 and will be discussed in detail later on.

Let some stochastic object be given, the input and output coordinates of which X and Y are random variables.

Y is affected not only by the input coordinate X but also by random interference Z (instability of the object's operating mode, stochastic environmental influences, errors in measuring Y, etc.). Therefore, one cannot speak of a functional dependence of Y on X. In such cases one speaks of a stochastic relationship between the variables X and Y of a static object.

Random variables X and Y are dependent if the probability distribution law of one of them depends on the value taken by the other. The dependence is described by:

$F(y \mid x)$, the conditional cumulative distribution function;

$p(y \mid x)$, the conditional probability density.

Suppose we can establish that $X = x$; then the behavior of the random variable Y is fully characterized by the conditional probability density $p(y \mid x)$.

Let us denote the conditional numerical characteristics of Y:

$m_{y|x}$, the expected value;

$D_{y|x}$, the variance.

Y is not a function of x, but the parameters of the density function, $m_{y|x}$ and $D_{y|x}$, depend on the value x taken by X. The dependence of $m_{y|x}$ on x is called regression.

$m_{y|x} = f(x)$ is the regression dependence; it shows how the average value of Y changes when X changes. If the points $(x, m_{y|x})$ are connected by a smooth line, we obtain the regression line. This line is the static characteristic of the object.
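The idea of a regression line as a curve of conditional averages can be illustrated with a small sketch. It assumes, hypothetically, that several values of Y were recorded at each of a few repeated input levels x; the function name and the data are illustrative only.

```python
from collections import defaultdict

def conditional_means(xs, ys):
    """Return sorted (x, mean of y at that x) pairs: an empirical regression line."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    return [(x, sum(v) / len(v)) for x, v in sorted(groups.items())]

# Hypothetical observations: three outputs y recorded at each input level x.
xs = [1, 1, 1, 2, 2, 2, 3, 3, 3]
ys = [2.0, 2.5, 1.5, 4.0, 3.5, 4.5, 6.5, 5.5, 6.0]

line = conditional_means(xs, ys)
print(line)  # [(1, 2.0), (2, 4.0), (3, 6.0)]
```

Connecting these averaged points gives the empirical counterpart of the regression line; the scatter of the individual values around each average reflects the conditional variance.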

The regression equation is the function f(x) that describes the regression line. Regression equations are classified as linear or non-linear. The passive identification method is widely used to build a regression model of an object.

This method is used in the study of object statics and noise equations, and also in cases where applying test disturbances to the input of the object is unacceptable. The passive identification method is based on collecting statistical information about the object during its normal operation. The recorded realizations of the input x and the output y are then processed to determine the regression model

$\hat{y} = f(x, \mathbf{b})$, where $\mathbf{b}$ is the vector of model coefficients.

Determining the regression equation consists of two steps:

1. Choice of the type of regression equation. The type is chosen either empirically, from the shape of the correlation field between the input and output values, or by a theoretical study of the physical process that gives rise to the stochastic relationship between these values. Sometimes both approaches are combined.

2. Calculation of the coefficients of the regression equation. This is most often done by the least squares method.

It should be noted that the passive statistical method has a number of significant disadvantages compared with the active method:

1. The resulting model of the object is valid only within the range of the experimental statistical material used.

2. It is difficult to separate the effects of input values of a multidimensional object that are correlated with one another.

3. Individual regression coefficients do not have any physical meaning.

4. Information about the experimental error is not obtained.

5. A large amount of experimental data must be collected and labor-intensive calculations performed.

These shortcomings greatly reduce the value of the model obtained by the passive method. This method is used only when other methods cannot be used.

Preliminary analysis of the experimental statistical material is the main task of correlation analysis in the identification of a stochastic object. The essence of correlation analysis is to estimate the strength of the stochastic relationship between the random variables X and Y and to establish the form of this relationship as a regression equation. To determine in advance whether there is a characteristic relationship between X and Y, the experimental points $(x_i, y_i)$ are plotted on a graph, which gives the correlation field.


a: strong negative correlation

b: strong positive correlation

c: weak positive correlation

d, e: no correlation

The strength of the correlation can be judged from how tightly the points are grouped around a straight line.

The correlation field characterizes the type of relationship between X and Y, i.e. whether the dependence is linear or non-linear.

There are 3 types of correlation:

1) linear;

2) non-linear;

3) multiple.

With linear correlation, the regression is approximated by the equation of a straight line; with non-linear correlation, by the equation of a curve. Multiple correlation describes the relationship between several quantities and uses a multiple regression equation. Linear correlation is the most common. The concept of correlation makes it possible to judge how closely the experimental points lie to the constructed regression curve.

If the regression determines the expected relationship between variables, then the correlation shows how well this relationship reflects reality.

The identification problem for a stochastic object is posed as follows: from a sample of size n, estimate the strength (tightness) of the correlation between X and Y, find the regression equation, and estimate the allowable error.
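For the linear case, these three steps can be sketched as follows. The data are hypothetical, and using the residual standard error with n - 2 degrees of freedom as the error measure is an assumption; the source does not say which error estimate it has in mind.

```python
from math import sqrt

def paired_regression(xs, ys):
    """Return (r, b0, b1, s) for the line y = b0 + b1*x fitted by least squares.

    r is the sample Pearson correlation coefficient; s is the residual
    standard error with n - 2 degrees of freedom (an assumed error measure).
    """
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # slope
    b0 = my - b1 * mx                   # intercept
    r = sxy / sqrt(sxx * syy)           # tightness of the linear relationship
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = sqrt(sse / (n - 2))             # spread of the points around the line
    return r, b0, b1, s

# Hypothetical sample of n = 5 observations.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

r, b0, b1, s = paired_regression(xs, ys)
print(round(r, 4), round(b0, 3), round(b1, 3), round(s, 3))
```

A value of r close to 1 or -1 indicates a tight linear relationship, while s gives the typical deviation of the observed values from the fitted line in the units of Y.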

A computer