Regression is a branch of statistics that has major applicability in predictive analytics. Regression analysis is used to measure the relationship between a dependent variable and one or more predictor variables. The goal of regression analysis is to predict the value of the dependent variable given the values of the predictor variables. Regression finds a mathematical model that best fits the given data by minimising the discrepancy between the model's predictions and the observed values.
Regression is an integral part of predictive modelling and is one of the supervised machine learning techniques. In simple terms, regression fits a line or curve through the datapoints on an X-Y plot such that the vertical distance between the line and the datapoints is as small as possible. The size of these distances indicates whether the model has captured a strong relationship, known as correlation. Thus, a 'best-fit' model is one that captures a strong relationship with minimal residual variance, and regression analysis is the standard approach for finding it.
Regression analysis is mainly used for:
a. Causal analysis
b. Forecasting the impact of change
c. Forecasting trends
All these applications make it useful for market research, sales and stock prediction, and more. Depending on the number of independent variables and the relationship between the dependent and independent variables, there are different types of regression techniques. Some of the most widely used techniques are explained here.
1. Simple Linear regression
This is the most fundamental regression model and must be understood to grasp the basics of regression analysis. When one predictor variable x is linearly related to one dependent (response) variable y, the model is called a simple linear regression model. When more than one predictor is present, the model is called a multiple linear regression model. The relationship is defined by the equation y = ax + b + e
a = slope of the line
b = intercept
e = error term
The line that best fits the model is determined by the values of parameters a and b. The x-coefficient and intercept are estimated by least squares i.e. giving them values that minimise the sum of squared errors within the sample of data.
The difference between the observed outcome y and the predicted outcome ŷ is known as the prediction error. Hence, the values of a and b should be chosen so that they minimise the sum of the squared prediction errors.
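The least-squares estimates described above can be sketched in a few lines of NumPy; the slope is the ratio of the sample covariance of x and y to the variance of x, and the intercept follows from the means. The data here are made up purely for illustration:

```python
import numpy as np

# Hypothetical data: y is roughly 2x + 1 plus noise
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least-squares estimates: a = cov(x, y) / var(x), b = mean(y) - a * mean(x)
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - a * x.mean()

predictions = a * x + b
sse = np.sum((y - predictions) ** 2)  # sum of squared prediction errors
```

No other choice of a and b can produce a smaller `sse` on this sample, which is exactly what "least squares" means.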
Maximum likelihood estimation is also a technique to predict the values of regression line parameters under the assumption that the prediction error has a normal distribution.
The simple linear model does not perform well on large amounts of data, as it is sensitive to outliers, multicollinearity and cross-correlation. For multiple regression, the best-fit assumptions remain similar; however, the prediction error, which previously depended on a single fixed predictor value, now depends on a fixed set of predictor values.
2. Logistic regression
This is a special case of generalized linear regression that has applications where the response variable is categorical or discrete in nature – winner or loser, pass or fail, 0 or 1, etc. The relationship between the dependent and independent variable(s) is measured by estimating probabilities using the logit function.
The error may not be Gaussian white noise (normally distributed) but will instead follow a logistic distribution. The logit function predicts the probabilities of the outcomes, so the values are restricted to the interval (0, 1), giving an S-shaped (sigmoidal) curve. The regression coefficients are estimated using the iteratively reweighted least squares (IRLS) method or maximum likelihood estimation rather than ordinary least squares, and the method works better with large sample sizes.
After transforming the response variable using the logit function, the model can be approximated by linear regression. Logistic regression will not always have response variables with binary outcomes. In case of three or more categories, it is called nominal or multinomial logistic regression and if the categories have ordered levels with unequal intervals, it is called ordinal logistic regression.
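A short sketch of binary logistic regression using scikit-learn, on a hypothetical hours-studied vs. pass/fail dataset (both the data and the scenario are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied vs. exam outcome (0 = fail, 1 = pass)
hours = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(hours, passed)

# predict_proba returns [P(fail), P(pass)]; probabilities lie strictly in (0, 1)
p_pass = model.predict_proba([[3.0]])[0, 1]
```

Because the sigmoid squashes the linear predictor, `p_pass` is a probability in (0, 1) rather than an unbounded fitted value.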
3. Ridge regression
It is a more robust version of linear regression that is less subject to overfitting. The model puts a constraint, or penalty, on the sum of squares of the regression coefficients. The least squares method gives unbiased estimates of these parameters with the least variance among linear unbiased estimators (to be precise). However, when the predictor variables are highly correlated (when predictors A and B change in a similar manner), a small amount of bias is introduced to alleviate the problem.
A bias matrix is added to the least squares equation, and the minimisation of the sum of squares is then performed to obtain low-variance parameter estimates. Hence, large parameters are penalised. This bias matrix is essentially an identity matrix multiplied by a scalar, whose optimum value needs to be selected.
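The bias-matrix idea can be sketched directly with NumPy; `lam * I` below plays the role of the bias matrix added to the least squares equation. The data are synthetic, with two nearly identical predictors to provoke the multicollinearity problem:

```python
import numpy as np

# Synthetic data: two highly correlated predictors (multicollinearity)
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.01, size=50)  # nearly identical to x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=50)

lam = 1.0                 # penalty strength (to be tuned in practice)
I = np.eye(X.shape[1])    # identity matrix; lam * I is the bias matrix

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)              # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * I, X.T @ y)  # ridge estimate
```

With near-duplicate predictors the OLS coefficients can blow up in opposite directions, while the ridge coefficients stay small and share the effect between the correlated predictors.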
4. LASSO regression
LASSO (Least Absolute Shrinkage and Selection Operator) is an alternative to Ridge regression; the key difference is that it penalizes the absolute size of the regression coefficients. By penalizing the absolute values, the estimated coefficients shrink towards exactly zero, which is not possible with ridge regression. This makes the method useful for feature selection, where a subset of variables is picked for model construction. LASSO keeps the relevant features and zeroes out the irrelevant ones, which avoids overfitting and also makes learning faster. Hence, LASSO is both a feature selection model and a regularization model.
ElasticNet is a hybrid of LASSO and Ridge regression that linearly combines their L1 and L2 penalties, and it is preferred over either method in many applications.
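LASSO's feature-selection behaviour can be sketched with scikit-learn on synthetic data where only two of five features actually drive the response; ElasticNet is fitted alongside for comparison (the penalty strengths are illustrative, not tuned):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

# Synthetic data: only the first two of five features matter
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)                      # L1 penalty only
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2
```

The L1 penalty drives the coefficients of the three irrelevant features to exactly zero, while the relevant coefficients are shrunk slightly below their true values — the "bias for variance" trade made by all penalized estimators.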
5. Polynomial Regression
Polynomial regression is similar to multiple linear regression. However, in this type of regression the relationship between the X and Y variables is defined by taking a k-th degree polynomial in X. Polynomial regression fits a non-linear curve to the data, but as an estimator it is a linear model, since it is linear in the coefficients. Polynomial models are also fitted using the least squares method, but the estimates can be slightly difficult to interpret, as the individual monomials can be highly correlated. The estimated value of the dependent variable y is modelled with the equation (for the k-th order polynomial): y = b0 + b1x + b2x^2 + … + bkx^k + e
The line that passes through the points will not be straight but curved, depending on the powers of X included. High-degree polynomials are observed to induce more oscillations in the fitted curve and have poor interpolation properties. In modern approaches, polynomial regression is often not performed directly on the data but is used as a kernel in Support Vector Machine algorithms.
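A quick sketch of a degree-2 polynomial fit using NumPy's least-squares `polyfit`, on synthetic data drawn from a known quadratic; note the model is still solved by ordinary least squares because it is linear in the coefficients:

```python
import numpy as np

# Synthetic noisy samples from a known quadratic: y = 2x^2 - 3x + 1
x = np.linspace(-3, 3, 30)
y = 2 * x**2 - 3 * x + 1 + np.random.default_rng(2).normal(scale=0.2, size=30)

# Degree-2 polynomial fit, solved by least squares on the monomials 1, x, x^2
coeffs = np.polyfit(x, y, deg=2)  # highest power first: [b2, b1, b0]
```

With only 30 points and modest noise the recovered coefficients land close to the true values 2, -3 and 1; pushing `deg` much higher on the same data would start fitting the noise instead.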
6. Bayesian linear regression
Bayesian regression uses Bayes' theorem of posterior probability to determine the regression coefficients. In techniques like maximum likelihood and least squares we try to find a single optimal value for the model parameters, whereas this method yields a posterior distribution over the parameters. Bayes' theorem is applied to a prior assumption about the parameters, i.e. posterior ∝ likelihood × prior
This method is also a penalized likelihood estimator, much like ridge regression, and is more stable than the ordinary linear model.
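A minimal sketch using scikit-learn's `BayesianRidge` (one implementation of Bayesian linear regression) on synthetic data; because the output is a posterior distribution rather than a point estimate, each prediction comes with an uncertainty:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

# Synthetic data with known coefficients [1.5, -2.0, 0.5]
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=80)

model = BayesianRidge().fit(X, y)

# return_std=True gives the posterior standard deviation of each prediction
mean, std = model.predict(X[:1], return_std=True)
```

The posterior standard deviation is what distinguishes this from plain ridge regression: the model reports not just a prediction but how confident it is in it.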
Apart from the above, there are techniques such as Quantile Regression, which offers an alternative to the least squares method; Stepwise Regression; JackKnife Regression, which uses resampling; ElasticNet Regression; and Ecological Regression, among others not covered in this article. Often, dimension reduction or a Box-Cox transformation is performed before applying a regression method. The dimensionality of the data and the nature of the dependent variable (discrete or continuous) are some of the factors that determine which regression model is suitable.
A good knowledge of the above will help you get started with understanding predictive analytics and data modelling. Statistical software such as Minitab, MATLAB, STATA or R can be very helpful for understanding these techniques in practice.