Linear regression is a statistical method for modelling the relationship between one dependent variable and one or more independent variables. Regression analysis is an important part of statistical work, used to explore and model how variables relate to one another.

The variable we are predicting is called the dependent variable and is denoted by Y, while the variables we base our predictions on are known as predictors or independent variables, denoted by X. Regression analysis helps in predicting the value of the dependent variable from the values of the independent variables. In a simple linear model the regression line takes the form Y = a + bX + error, where b is the slope of the line and a is the intercept. The errors around the line are the residuals, which are assumed to be normally distributed.
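The slope and intercept of a simple linear model can be estimated by ordinary least squares: b is the covariance of X and Y divided by the variance of X, and a is then fixed by the means. A minimal sketch in Python (the data values here are made up purely for illustration):

```python
# Fit Y = a + bX by ordinary least squares (illustrative data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope b = cov(X, Y) / var(X); intercept a = mean(Y) - b * mean(X)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# residuals are the gaps between observed Y and the fitted line
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(a, 3), round(b, 3))
```

For this data the fitted line is roughly Y = 0.05 + 1.99X; the residuals are what the normality assumption below refers to.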

**Pre-Analysis Checks:**

A few common assumptions should be checked before performing a regression analysis. If any of them is violated, the results may be unreliable or the model may be a poor description of the data.

- The dependent variable should be a continuous variable. Examples of continuous variables include height, salary and age.
- The independent variables should be continuous or categorical variables (sometimes referred to as nominal variables). For example, gender is a categorical variable with two categories, male and female.
- The data should not contain outliers. This can be checked using various methods, including histograms and boxplots. In addition, the residuals of the regression should be normally distributed.
- The relationship between the dependent and independent variables should be linear. A scatterplot can be used to verify the linearity of the data.
- The data should show homoscedasticity, meaning the variance around the regression line is the same for all values of the independent variables.
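The boxplot check for outliers mentioned above corresponds to flagging points outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A small sketch with hypothetical height data:

```python
# Boxplot-style outlier check: flag points outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR
import statistics

data = [152, 160, 161, 163, 165, 168, 170, 171, 174, 240]  # 240 looks suspect

q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
print(outliers)
```

Here only the value 240 falls outside the whiskers and would warrant a closer look before fitting the model.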

Once all the above checks pass, we are ready to run the regression analysis. The coefficient of determination, denoted R², is the key output any statistician or analyst examines after running a regression. It is the proportion of the variance in the dependent variable that is predictable from the independent variables. It ranges from 0 to 1, and values closer to 1 indicate a better-fitting regression model.
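R² can be computed from the residuals as 1 − SS_res / SS_tot, where SS_tot is the total sum of squares about the mean of Y. A minimal sketch with made-up observed and fitted values:

```python
# R^2 = 1 - SS_res / SS_tot, with SS_tot taken about the mean of Y
ys      = [2.0, 4.0, 6.0, 8.0]
y_preds = [2.2, 3.8, 6.1, 7.9]  # hypothetical fitted values

mean_y = sum(ys) / len(ys)
ss_res = sum((y - yp) ** 2 for y, yp in zip(ys, y_preds))  # unexplained variation
ss_tot = sum((y - mean_y) ** 2 for y in ys)                # total variation
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 4))
```

With these values R² is 0.995, i.e. the fitted values account for 99.5 percent of the variance in Y.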

**The Importance of the Intercept**

The intercept (often labelled the constant) is the point where the regression line crosses the y-axis. In some analyses, the regression model only becomes significant when we remove the intercept, and the regression line reduces to Y = bX + error.

There is a misconception among analysts that the intercept can be removed in order to make the model significant, since doing so leads to a higher R² and F-ratio. However, a regression without a constant forces the regression line through the origin, the point where the dependent variable and the independent variables are all equal to zero. In the figure shown, the dashed line is the regular regression line with the intercept retained, while the line in bold has its intercept removed. By removing the intercept we are therefore forcing the fitted line to pass through the origin.

Let us see this by running a multiple linear regression analysis in R. The purpose is to model the relationship between the API score for the year 2000 (y) and nine independent variables: English language learners (x1), pct free meals (x2), year round school (x3), pct 1st year in school (x4), avg class size k-3 (x5), avg class size 4-6 (x6), pct full credential (x7), pct emergency credential (x8) and number of students (x9).

The function lm() fits a linear model, with the dependent variable on the left-hand side of ~ and the independent variables on the right:

```
> model1 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9, data = api)
```

Note that R offers alternative functions, such as glm() and rlm(), for similar analyses. The summary of the model fitted with lm() is:

```
> summary(model1)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9,
    data = api)

Residuals:
     Min       1Q   Median       3Q      Max
-189.280  -41.179    0.655   41.047  160.349

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 703.40189   70.72442   9.946  < 2e-16 ***
x1           -1.65482    0.21953  -7.538 3.45e-13 ***
x2          -94.86795    6.93161 -13.686  < 2e-16 ***
x3          -22.31043   10.43706  -2.138   0.0332 *
x4           -2.42024    0.47928  -5.050 6.84e-07 ***
x5            3.29630    2.52143   1.307   0.1919
x6            2.00408    0.90486   2.215   0.0274 *
x7            0.79240    0.53396   1.484   0.1386
x8           -0.99575    0.67968  -1.465   0.1437
x9            0.04524    0.01845   2.452   0.0146 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 63.79 on 385 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.8171,    Adjusted R-squared:  0.8128
F-statistic: 191.1 on 9 and 385 DF,  p-value: < 2.2e-16
```

The estimate for the model intercept is 703.40189, and the coefficients for x1, x2, x3, x4, x6 and x9 are significant at the 0.05 level. Now we remove the constant by adding -1 to the formula:

```
> model2 <- lm(y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 - 1, data = api)
```

The output of this model is:

```
> summary(model2)

Call:
lm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 -
    1, data = api)

Residuals:
     Min       1Q   Median       3Q      Max
-228.364  -46.954    4.289   50.233  164.909

Coefficients:
    Estimate Std. Error t value Pr(>|t|)
x1  -1.82243    0.24508  -7.436 6.77e-13 ***
x2 -73.18996    7.36745  -9.934  < 2e-16 ***
x3 -37.76303   11.55589  -3.268  0.00118 **
x4  -3.29621    0.52750  -6.249 1.09e-09 ***
x5  15.74214    2.45095   6.423 3.94e-10 ***
x6   4.66512    0.96784   4.820 2.07e-06 ***
x7   4.40405    0.43831  10.048  < 2e-16 ***
x8   3.12829    0.60299   5.188 3.44e-07 ***
x9   0.06161    0.02057   2.995  0.00292 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 71.42 on 386 degrees of freedom
  (5 observations deleted due to missingness)
Multiple R-squared:  0.9874,    Adjusted R-squared:  0.9871
F-statistic: 3353 on 9 and 386 DF,  p-value: < 2.2e-16
```

As you can see, by removing the intercept almost all the variables become significant, with p-values less than 0.05, and, most strikingly, R² jumps from 0.8171 to 0.9874. Taken at face value, an R² of 0.81 means that 81 percent of the variance in Y is predictable from the independent variables, and an R² of 0.98 that 98 percent is. However, the two values are not comparable: when the intercept is removed, R computes the total sum of squares about zero rather than about the mean of Y, which mechanically inflates R².
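The inflation can be demonstrated numerically. For a no-intercept model, software such as R reports R² based on the uncentered total sum of squares (deviations about zero) rather than the usual centered one (deviations about the mean of Y). A sketch with hypothetical data whose true intercept is far from zero:

```python
# R^2 for a through-origin fit, computed with both SS_tot conventions
xs = [1.0, 2.0, 3.0, 4.0]
ys = [11.0, 12.0, 13.0, 14.0]  # roughly y = 10 + x, so the origin fit is poor

b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)  # through-origin slope
residuals = [y - b * x for x, y in zip(xs, ys)]
ss_res = sum(r ** 2 for r in residuals)

mean_y = sum(ys) / len(ys)
ss_tot_centered = sum((y - mean_y) ** 2 for y in ys)  # about the mean (usual definition)
ss_tot_uncentered = sum(y ** 2 for y in ys)           # about zero (no-intercept convention)

r2_centered = 1 - ss_res / ss_tot_centered
r2_uncentered = 1 - ss_res / ss_tot_uncentered
print(round(r2_centered, 3), round(r2_uncentered, 3))
```

Here the uncentered R² is about 0.89 and looks impressive, while the centered R² is actually negative: the through-origin line fits worse than simply predicting the mean of Y. This is why the jump in R² after dropping the intercept should not be read as a better model.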

Many times, the intercept makes no practical sense. For example, suppose we use rainfall to predict the quantity of wheat produced. Practically, with no rain there would be no production, yet the fitted regression line may cross the y-axis at some value other than zero, so the intercept has no meaningful interpretation. Even so, the intercept is important for calculating predicted values, especially in industries such as analytics and market research, and it is advised not to drop it from the analysis.

**Conclusion**

Chasing a high R² value tempts us to exclude the intercept and to add more variables in an attempt to explain the unexplainable. This misleads the results, reduces the significance of the analysis and sabotages its predictive value. There are, however, times when a regression without an intercept is appropriate, i.e. when the underlying process is known to have a zero intercept. Regression analysis is a powerful statistical technique for making predictions, but we need to use it wisely, without manipulating the results, to get the most out of our data.
