Introduction to Logistic Regression

September 1, 2017

Logistic regression is used to predict a discrete outcome from predictor variables that may be discrete, continuous or mixed. It is therefore a commonly used technique when the dependent variable takes two or more discrete values. Given a set of independent variables, the outcome could be of the form Yes/No, 1/0, True/False or High/Low.

Let’s first understand how logistic regression is used in the business world. It has a wide array of applications; here are a few from real-world situations.

Marketing: A marketing consultant wants to predict whether a subsidiary of his company will make a profit, make a loss or just break even, depending on the characteristics of the subsidiary's operations.

Human Resources: The HR manager of a company wants to predict the absenteeism pattern of its employees based on their individual characteristics.

Finance: A bank wants to predict whether its customers will default based on their previous transactions and history.

Types of logistic regression

If the response variable is dichotomous (two categories), then it is called binary logistic regression. If you have more than two categories within the response variable, then there are two possible logistic regression models.

  1. If the response variable is nominal, you fit a nominal logistic regression model.
  2. If the response variable is ordinal, you fit an ordinal logistic regression model.

Logistic regression model

The plot shows a model of the relationship between a continuous predictor X and the probability of an event or outcome. A linear model clearly does not fit if this is the true relationship between X and the probability, because a straight line can predict values below 0 or above 1. To model this relationship directly, you must use a nonlinear function. The plot displays one such function, whose S-shape is known as a sigmoid.
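To make the contrast concrete, here is a minimal sketch in R. The predictor values and the straight line are made up for illustration; only the sigmoid formula itself is standard.

```r
# Sigmoid (logistic) function: maps any real number into the interval (0, 1).
sigmoid <- function(z) 1 / (1 + exp(-z))

x <- seq(-6, 6, by = 2)          # hypothetical predictor values
p_sigmoid <- sigmoid(x)          # S-shaped probabilities
p_linear  <- 0.5 + 0.15 * x      # hypothetical straight line, for contrast

# Side-by-side view: the sigmoid stays inside (0, 1), the line does not.
print(round(data.frame(x, p_sigmoid, p_linear), 3))
print(range(p_sigmoid))
print(range(p_linear))

# Base R ships the same function as plogis().
print(all.equal(p_sigmoid, plogis(x)))
```

Calling plot(x, p_sigmoid, type = "l") on a finer grid of x values draws the familiar S-shaped curve.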

Logit transformation
A logistic regression model applies a logit transformation to the probabilities. The logit is the natural log of the odds.
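As a small numeric illustration of "the natural log of the odds", the sketch below converts a few made-up probabilities to odds and logits and back again. The probabilities are assumptions chosen for the example; qlogis() is base R's built-in logit.

```r
p <- c(0.1, 0.5, 0.8)              # hypothetical event probabilities

odds  <- p / (1 - p)               # odds of the event
logit <- log(odds)                 # natural log of the odds

print(data.frame(p, odds, logit))

# qlogis() computes the same logit directly.
print(all.equal(logit, qlogis(p)))

# The transformation is invertible: the sigmoid maps a logit back to p.
p_back <- 1 / (1 + exp(-logit))
print(all.equal(p_back, p))
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0; probabilities above 0.5 give positive logits and those below give negative ones.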

 

 

 

 

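$$\operatorname{logit}(p) = \ln\left(\frac{p}{1 - p}\right)$$

where

p is the probability of the event

ln is the natural log (to the base e)

The logit is also written as ln(odds), since p / (1 - p) is the odds of the event.

So, the final logistic regression model formula is

$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k$$

where $X_1, \dots, X_k$ are the predictors and $\beta_0, \dots, \beta_k$ are the coefficients to be estimated.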

Unlike in linear regression, the errors are not normally distributed and their variance is not constant, because the outcome is discrete. Logistic regression therefore requires a more computationally intensive estimation method, the method of Maximum Likelihood (ML), to estimate the parameters. ML finds the model coefficients, relating the predictors to the target, under which the observed data are most likely. The method is iterative: starting from an initial estimate, the coefficients are updated repeatedly until the log likelihood (LL) no longer changes significantly.
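The idea can be sketched on simulated data. Everything in this example (the sample size, the single predictor x and the "true" coefficients) is an assumption made for illustration. It writes out the log likelihood by hand, maximises it with the general-purpose optimiser optim(), and compares the result with glm(), which maximises the same likelihood using its own iterative algorithm.

```r
set.seed(42)                                  # reproducible simulated data
n <- 500
x <- rnorm(n)                                 # one hypothetical predictor
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))     # simulated 0/1 outcome

# Negative log likelihood of a logistic model with coefficients beta.
neg_loglik <- function(beta) {
  eta <- beta[1] + beta[2] * x                # linear predictor (the logit)
  -sum(y * eta - log1p(exp(eta)))             # Bernoulli log likelihood, negated
}

# Iterative numerical optimisation, starting from beta = (0, 0).
fit_optim <- optim(c(0, 0), neg_loglik, method = "BFGS")

# glm() fits the same model by iteratively reweighted least squares.
fit_glm <- glm(y ~ x, family = binomial(link = "logit"))

# Coefficients and maximised log likelihood from both approaches.
print(rbind(optim = fit_optim$par, glm = unname(coef(fit_glm))))
print(c(optim = -fit_optim$value, glm = as.numeric(logLik(fit_glm))))
```

The two sets of estimates should agree to several decimal places, which is a quick check that glm() is doing nothing more mysterious than maximising this likelihood.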

Using R

R makes it very easy to fit a logistic regression model. The function to call is glm(), and the fitting process is similar to the one used in linear regression. In this post, I will discuss binary logistic regression with an example. The overall workflow for multinomial logistic regression is much the same, although it needs a different fitting function (for example, multinom() from the nnet package), since glm() does not fit multinomial models.

The data used is Bankloan. The dataset has 850 rows and 9 columns (age, education, employment, address, income, debtinc, creddebt, othdebt, default). The dependent variable is default (Defaulted and Not Defaulted).

Let’s first load the data and check its head.

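# stringsAsFactors = TRUE keeps default as a factor, which glm() needs (R >= 4.0 reads text columns as character)
bankloan <- read.csv("bankloan.csv", stringsAsFactors = TRUE)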

head(bankloan)

Now, take a subset of the first 700 rows of the data.

mod_bankloan <- bankloan[1:700,]

Set a seed of 500, so that the random sample drawn below is reproducible.

set.seed(500)

Let’s draw a random sample of 500 of the 700 row numbers, without replacement, to pick the training rows.

train <- sample(1:700, 500, replace = FALSE)

Create the training data from the sampled rows and the testing data from the remaining 200 rows.

trainingdata <- mod_bankloan[train, ]
testingdata <- mod_bankloan[-train, ]
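If you want to convince yourself that the negative index really gives the complement of the training rows, here is a small self-contained sketch. The data frame toy is a made-up stand-in for the 700-row subset, so the sampled rows will not match those drawn from the real file.

```r
set.seed(500)

# Hypothetical stand-in for the 700-row subset (not the real Bankloan data).
toy <- data.frame(id = 1:700, income = round(runif(700, 15, 150)))

train <- sample(1:700, 500, replace = FALSE)    # 500 distinct row numbers
trainingdata <- toy[train, ]                    # the sampled rows
testingdata  <- toy[-train, ]                   # every row not sampled

print(dim(trainingdata))                        # 500 rows
print(dim(testingdata))                         # 200 rows

# No row ends up in both sets.
print(length(intersect(trainingdata$id, testingdata$id)))   # 0
```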

Now, let’s fit the model. Be sure to specify the parameter family=binomial in the glm() function.

model1 <- glm(default ~ ., family = binomial(link = "logit"), data = trainingdata)

summary(model1)

The summary also includes the significance level of each variable. If the p-value is less than 0.05, the variable is considered statistically significant. We can remove the insignificant variables, which makes our model simpler and may improve its accuracy on new data.

In our model, only age, employment, address and creddebt seem to be significant. So, let’s build another model with only these variables.

model12 <- glm(default ~ age + employ + address + creddebt, family = binomial(link = "logit"), data = trainingdata)
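Rather than reading the p-values off the printed summary, they can also be pulled out programmatically. The sketch below does this on simulated data; the data frame sim and its columns x1, x2 and noise are hypothetical and exist only for this example.

```r
set.seed(1)
n <- 400

# Simulated data: y depends on x1 and x2, while noise is unrelated to y.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, 1, plogis(0.3 + 1.5 * sim$x1 - 1.0 * sim$x2))

full <- glm(y ~ ., family = binomial(link = "logit"), data = sim)

# Coefficient table: estimate, standard error, z value and p-value.
coef_table <- coef(summary(full))
p_values <- coef_table[, "Pr(>|z|)"]
print(round(p_values, 4))

# Predictors (intercept excluded) that are significant at the 5% level.
keep <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
print(keep)

# Refit the model using only those predictors.
reduced <- glm(reformulate(keep, response = "y"),
               family = binomial(link = "logit"), data = sim)
print(coef(reduced))
```

A fixed 0.05 threshold is a convention, not a rule: an unrelated variable will still pass it about one time in twenty, so it is worth checking the reduced model on held-out data as well.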

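Let’s now use the model to predict on the training data.

pred1 <- predict(model12, newdata = trainingdata, type = "response")

Now, convert the predicted probabilities into classes using a cutoff of 0.5. For a factor response, glm() models the probability of the second factor level (here Not Defaulted, assuming R's default alphabetical level order), so a probability below 0.5 is classified as Defaulted.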

predicted_class <- ifelse(pred1 < 0.5, "Defaulted", "Not Defaulted")

Creating a table to see the same.

table(trainingdata$default, predicted_class)

This is also known as a confusion matrix. It is a tabular representation of actual vs predicted values, which helps us find the accuracy or error of the model. Comparing the error on the training and testing data also helps us spot overfitting.

There are 64 customers who actually defaulted and our model also predicted the same. However, 72 customers defaulted but the model predicted them as Not Defaulted. Also, 36 customers did not default, but the model predicted them as Defaulted. Let’s now find out the error rate.

err_rate <- 1 - sum(trainingdata$default == predicted_class) / 500
err_rate
0.216

Which is about 22%, that is (72 + 36) / 500.
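The error rate can also be worked out directly from the counts quoted above. In the sketch below, 64, 72 and 36 come from the text, while 328 (customers who did not default and were predicted as such) is derived as 500 - 64 - 72 - 36.

```r
# Confusion matrix rebuilt from the counts quoted in the text.
# matrix() fills column by column: first the "predicted Defaulted" column.
cm <- matrix(c(64, 36, 72, 328), nrow = 2,
             dimnames = list(actual    = c("Defaulted", "Not Defaulted"),
                             predicted = c("Defaulted", "Not Defaulted")))
print(cm)

accuracy    <- sum(diag(cm)) / sum(cm)     # correct predictions / all predictions
error_rate  <- 1 - accuracy                # (72 + 36) / 500
sensitivity <- cm["Defaulted", "Defaulted"] / sum(cm["Defaulted", ])
specificity <- cm["Not Defaulted", "Not Defaulted"] / sum(cm["Not Defaulted", ])

print(c(accuracy = accuracy, error_rate = error_rate,
        sensitivity = sensitivity, specificity = specificity))
```

This gives an error rate of 0.216. Note that comparing labels with == is fragile: if a predicted label is misspelled (say "Defaluted"), the 64 correctly predicted defaulters no longer match and are counted as errors too, which gives (64 + 72 + 36) / 500 = 0.344 instead. The low sensitivity also shows that the model catches fewer than half of the actual defaulters at the 0.5 cutoff.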

Going ahead, let’s test the model on the testing data.

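pred2 <- predict(model12, newdata = testingdata, type = "response")
predicted_class2 <- ifelse(pred2 < 0.5, "Defaulted", "Not Defaulted")
table(testingdata$default, predicted_class2)
err_rate <- 1 - sum(testingdata$default == predicted_class2) / 200
err_rate

With the labels spelled consistently, this is the share of the 200 testing customers that land in the off-diagonal cells of the table. A run with the label misspelled as "Defaluted" gave 0.31 (31%), but that figure counts every correctly predicted defaulter as an error, so the actual testing error rate is lower.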

Now, we can plot the Receiver Operating Characteristic curve (commonly known as the ROC curve). In R, this can be done by installing a package called ROCR. An output of the plot is given below.

The ROC curve plots the true positive rate against the false positive rate as the prediction probability cutoff is lowered from 1 to 0. For a good model, lowering the cutoff should mark more of the actual 1’s as positives while marking few of the actual 0’s as 1’s. The area under the curve (AUC), also known as the index of accuracy, is a performance metric for the curve: the higher the area under the curve, the better the predictive power of the model.
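To show what the curve is made of, here is a minimal sketch that builds the ROC points and the AUC by hand in base R, on simulated data. This is a swapped-in illustration and not the ROCR workflow itself; the data, the model and all variable names are assumptions made for the example.

```r
set.seed(7)
n <- 300
x <- rnorm(n)                                   # hypothetical predictor
y <- rbinom(n, 1, plogis(-0.3 + 1.4 * x))       # simulated 0/1 outcome

fit   <- glm(y ~ x, family = binomial(link = "logit"))
score <- predict(fit, type = "response")        # predicted probabilities

# Sweep the cutoff from high to low; Inf and -Inf add the end points (0,0) and (1,1).
cutoffs <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)

# True positive rate and false positive rate at each cutoff.
tpr <- sapply(cutoffs, function(k) sum(score >= k & y == 1) / sum(y == 1))
fpr <- sapply(cutoffs, function(k) sum(score >= k & y == 0) / sum(y == 0))

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

Calling plot(fpr, tpr, type = "l") followed by abline(0, 1, lty = 2) draws the curve against the diagonal that a random guess would produce. An AUC of 0.5 corresponds to that diagonal, and an AUC of 1 to a perfect classifier.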

Conclusion

Logistic regression is a widely used supervised machine learning technique. It is one of the most popular tools used by statisticians, researchers and data scientists in predictive analytics. It shares some assumptions with multiple regression, such as independent observations and little multicollinearity among the predictors, but it does not require normally distributed errors or constant variance, and the dependent variable should be discrete. Many data science students struggle to learn this technique, which is why I am pleased to present a basic introduction to help you grasp the topic. As I always say, “the sky is the limit”, and the internet is your best friend 😊, so go ahead and start your learning journey. All the best.
