
Linear Regression

What is regression and linear regression?

One of the topics we hear a lot about, especially when reading AI articles, is regression, and in particular linear regression. Lexically, the word regression means a going back or a return.

From a statistical point of view, the term refers to regression toward the mean: to be more precise, some phenomena tend to move back toward an average value over time.

What is regression?

In statistical modeling, regression analysis is a statistical process for estimating the relationships between variables. It includes many techniques for modeling and analyzing several variables, with the focus on the relationship between a dependent variable and one or more independent variables.

Regression analysis helps us understand how the value of the dependent variable changes when one of the independent variables is varied while the other independent variables are held fixed.

Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are held fixed.

Less commonly, the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the goal is to estimate a function of the independent variables called the regression function.

Regression analysis also characterizes the variation of the dependent variable around the regression function, which can be described by a probability distribution. Regression analysis is widely used for prediction.

Regression analysis is also used to identify which independent variables are related to the dependent variable, and to explore the forms of these relationships.

In restricted circumstances, this analysis can be used to infer causal relationships between the independent and dependent variables. However, this can lead to false or spurious relationships, so caution is recommended.

Many techniques have been developed to perform regression analysis. Familiar methods such as linear regression and ordinary least squares are parametric, in that the regression function is defined in terms of a finite number of unknown parameters estimated from the data. Nonparametric regression refers to methods that allow the regression function to lie in a specified set of functions, which may have infinitely many parameters.

In short, regression analysis is a statistical technique for examining and modeling the relationship between variables. It is needed for estimation and prediction in almost every field, including engineering, physics, economics, management, the life sciences, biology, and the social sciences.

What is linear regression?

Linear regression is one of the methods of regression analysis. Regression is a type of statistical model for predicting a variable from one or more other variables.

Linear regression uses a linear prediction function: the dependent variable, the variable to be predicted, is modeled as a linear combination of the independent variables. Each independent variable is multiplied by a coefficient obtained during the estimation process, and the final prediction is the sum of these products plus a constant (the intercept), which is also obtained during estimation.
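To make this concrete, here is a minimal NumPy sketch of the prediction function; the coefficient and intercept values are made up for illustration and would normally come from the estimation process.

```python
import numpy as np

# Hypothetical values, made up for illustration; in practice these
# come out of the estimation process.
intercept = 1.5                       # the fixed value added to the sum
coefficients = np.array([2.0, -0.5])  # one coefficient per independent variable

# One observation with two independent variables.
x = np.array([3.0, 4.0])

# The prediction: sum of coefficient * variable products, plus the intercept.
y_hat = intercept + coefficients @ x
print(y_hat)  # 1.5 + 2.0*3.0 + (-0.5)*4.0 = 5.5
```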

The simplest type of linear regression is simple linear regression, which, unlike multiple linear regression, has only one independent variable.
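For the simple (single-variable) case, the least-squares estimates have a well-known closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept then follows from the sample means. A small sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)  # true slope 2, intercept 1

# Closed-form least-squares estimates for one independent variable.
slope = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)  # close to 2.0 and 1.0
```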

Another type of linear regression is multivariate linear regression, in which several dependent variables are predicted instead of predicting one dependent variable.
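As a sketch, NumPy's least-squares solver handles the multivariate case directly: when the target is a matrix with one column per dependent variable, it fits a separate column of coefficients for each.

```python
import numpy as np

rng = np.random.default_rng(5)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])

# Two dependent variables, each a different linear function of the same inputs.
W_true = np.array([[ 1.0, -2.0],
                   [ 2.0,  0.5],
                   [-1.0,  3.0]])
Y = X @ W_true + rng.normal(scale=0.1, size=(100, 2))

# np.linalg.lstsq fits one column of coefficients per dependent variable.
W_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
print(W_hat)  # close to W_true
```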

The estimation process tries to select the coefficients of the linear regression model so that the model is consistent with the available data, i.e., so that the predictions are close to the values observed in the data. Minimizing the difference between the two is one of the central problems in linear regression, and there are several ways to solve it.

In probabilistic approaches, linear regression models estimate the conditional probability distribution of the dependent variable (rather than the joint probability distribution) and use a statistic of that distribution as the final prediction.

The most commonly used statistic is the mean, although other statistics, such as the median or the mode, are also used.
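The choice of statistic corresponds to the loss being minimized: the mean comes from squared error, the median from absolute error. The sketch below contrasts the two on the same data with a few outliers; using scipy.optimize.minimize with Nelder-Mead is just one convenient way to fit the absolute-error model.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 2.0 + rng.normal(size=100)
y[::10] += 30  # a few large outliers pull the mean fit, but barely move the median fit

def fit(loss):
    # Minimize the chosen loss over (slope, intercept), starting from (0, 0).
    return minimize(lambda p: loss(y - (p[0] * x + p[1])),
                    x0=[0.0, 0.0], method="Nelder-Mead").x

mean_fit = fit(lambda r: np.sum(r ** 2))       # squared error -> conditional mean
median_fit = fit(lambda r: np.sum(np.abs(r)))  # absolute error -> conditional median
print(mean_fit, median_fit)  # the median fit stays closer to slope 3, intercept 2
```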

Another common estimation method is least squares, which minimizes the sum of squared differences between the predictions and the observed data. Its closed-form solution requires inverting the product of the matrix of all independent data with its transpose, a process that can be costly for large matrices and ill-conditioned when data are scarce. Hence, alternative methods such as stochastic gradient descent are commonly used.
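As a sketch of both approaches, the code below solves the normal equations X^T X w = X^T y directly (using a linear solve rather than an explicit inverse) and then reaches roughly the same coefficients with stochastic gradient descent; the learning rate and epoch count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X = np.hstack([np.ones((200, 1)), X])  # column of ones for the intercept
true_w = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=200)

# Closed form: solve the normal equations X^T X w = X^T y.
# np.linalg.solve avoids explicitly inverting X^T X.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Stochastic gradient descent on the squared-error loss.
w_sgd = np.zeros(4)
lr = 0.01  # illustrative learning rate
for epoch in range(200):
    for i in rng.permutation(200):
        grad = 2 * (X[i] @ w_sgd - y[i]) * X[i]  # gradient of (x_i . w - y_i)^2
        w_sgd -= lr * grad

print(w_closed)  # both should be close to [1.0, 2.0, -1.0, 0.5]
print(w_sgd)
```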

Usually, several presuppositions are made when using linear regression. If we call the difference between the dependent variable and the model prediction the "error" or "residual," then the following assumptions are usually made in linear regression modeling (a small diagnostics sketch follows the list):

The residuals follow a normal distribution. This assumption means that the conditional distribution of the dependent variable is normal. It underlies least-squares-based inference, but it can be relaxed in, for example, quantile or median regression.

The residuals are independent of each other.

This assumption considers the residuals (and consequently the dependent variables) to be independent of each other. Some methods, such as generalized least squares, can handle correlated residuals, although more data is usually required unless some form of regularization is used. Bayesian linear regression is a general way to address this problem.

The variance of the residuals is constant (homoscedasticity). This assumes that the residuals (and consequently the dependent variables) have constant variance. In practice, this assumption is often violated and the residuals are heteroscedastic; it can be relaxed in, for example, weighted or quantile regression.

There is no multicollinearity among the independent variables. This assumption implies that the matrix of independent variables has full rank. If this condition is not met, some independent variables are a linear combination of one or more of the other variables.

Having too little data can also violate this assumption, especially when the number of observations is smaller than the number of model parameters (the number of regression coefficients).

The relationship between the mean of the dependent variable and the independent variables is linear.

This assumption means that the mean of the dependent variable is a linear combination of the parameters (regression coefficients) and the independent variables. It is not very restrictive, because linearity is a constraint on the parameters, not on the independent variables themselves.
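As a rough illustration of how these assumptions might be checked on a fitted model, the sketch below computes a normality test, a lag-1 autocorrelation, a split-variance comparison, and a rank check; these are quick heuristics, not a substitute for formal diagnostics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
X = np.hstack([np.ones((300, 1)), rng.normal(size=(300, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.3, size=300)

w = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares fit
residuals = y - X @ w
fitted = X @ w

# 1. Normality of residuals: Shapiro-Wilk test (high p-value -> no evidence against).
print("normality p-value:", stats.shapiro(residuals).pvalue)

# 2. Independence: lag-1 autocorrelation of the residuals (should be near 0).
print("lag-1 autocorrelation:", np.corrcoef(residuals[:-1], residuals[1:])[0, 1])

# 3. Constant variance: compare residual variance at low vs. high fitted values.
order = np.argsort(fitted)
low, high = residuals[order[:150]], residuals[order[150:]]
print("variance ratio:", np.var(high) / np.var(low))  # should be near 1

# 4. No multicollinearity: the design matrix should have full column rank.
print("full rank:", np.linalg.matrix_rank(X) == X.shape[1])
```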

In generalized forms of linear regression, new variables can be created from combinations of the independent variables; in simple polynomial regression, for instance, the dependent variable is modeled as a polynomial function of the independent variables. In such expanded models, regularization is usually needed to control model complexity and avoid over-fitting.
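One common way to build such a model in practice is to expand the inputs into polynomial features and fit a regularized linear model on them. Here is a minimal sketch using scikit-learn's PolynomialFeatures and Ridge; the degree and regularization strength are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=(100, 1))
y = 0.5 * x[:, 0] ** 3 - x[:, 0] + rng.normal(scale=0.5, size=100)  # cubic relationship

# Expand the single independent variable into polynomial terms (x, x^2, x^3, ...),
# then fit a linear model on the expanded features with L2 regularization
# to keep the coefficients small and discourage over-fitting.
model = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=1.0))
model.fit(x, y)

print(model.predict(np.array([[2.0]])))  # roughly 0.5*8 - 2 = 2.0
```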

Linear regression is widely used in the life, behavioral, and social sciences, in finance and economics, and in environmental science. Linear regression and its derivatives are also among the best-known and most widely used tools in machine learning.

Despite the widespread use of linear regression in various sciences, this method has some limitations.

Many research problems in the social sciences do not fit the regression mold and have no output variable; these call for other methods, such as cluster analysis, which reveals coherent groups in data.

Also, linear regression is not a suitable tool for establishing causality between independent and dependent variables.