What Is Imbalanced Data and How Can You Address It?

Understand classification problems with imbalanced data and the techniques for handling them

Classification problems are very common in machine learning. Anyone who spends even a little time in machine learning and data science will encounter an imbalanced class distribution. This is a situation where the number of observations belonging to one class is significantly lower than the number belonging to the other classes.

Imbalanced data refers to datasets where the classes of the target variable are unequally represented, i.e., one class label has a very large number of observations and the other has a very small number. This is easier to understand with an example. Assume a bank named XYZ issues credit cards to its customers. The bank is concerned that some fraudulent transactions are taking place, and after analyzing the data it finds that for every 2000 transactions only 30 are recorded as fraud. So fraud makes up less than 2% of transactions (1.5%, to be exact), and more than 98% of transactions are "No Fraud" in nature. Here, the class "No Fraud" is called the majority class, and the much smaller "Fraud" class is called the minority class.
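The arithmetic in the bank example can be checked with a few lines of Python. This is only a sketch: the label list is made up to mirror the 30-in-2000 figure, and the variable names are illustrative.

```python
from collections import Counter

# Hypothetical labels mirroring the bank example: 30 frauds in 2000 transactions.
labels = ["Fraud"] * 30 + ["No Fraud"] * 1970

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the whole dataset
shares = {label: count / total for label, count in counts.items()}
for label, share in shares.items():
    print(f"{label}: {counts[label]} transactions ({share:.1%})")

# How many majority observations there are for each minority observation
imbalance_ratio = counts["No Fraud"] / counts["Fraud"]
print(f"Imbalance ratio: {imbalance_ratio:.1f} to 1")
```

Counting the classes like this before any modelling is a quick way to see how severe the imbalance is.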

Imbalanced datasets are a common and important problem in the real world wherever anomaly detection is crucial, for example electricity pilferage, fraudulent bank transactions, identification of rare diseases, customer churn prediction, and natural disasters. In these scenarios, a predictive model built with conventional machine learning algorithms can be biased and inaccurate. This is because machine learning algorithms are generally designed to improve accuracy by reducing the overall error, so they do not take the class distribution, or balance of classes, into account. Some class imbalance is quite normal in classification problems. In particular cases, however, the imbalance is acute, with the majority class far outnumbering the minority class.

The Complication with Imbalanced Data Classification

The central complication with an imbalanced dataset is how accurately we are predicting both the majority and the minority class. Let's take disease diagnosis as an example. Assume someone wants to predict a disease from an existing dataset where, for every 100 records, only 5 patients are diagnosed with the disease. So the majority class is 95% with no disease and the minority class is only 5% with the disease. Now assume the model predicts that all 100 out of 100 patients have no disease. Its accuracy would be 95%, yet it would fail to identify a single one of the 5 patients who have the disease. So the traditional approach of judging a classifier by its accuracy is not useful for imbalanced datasets.
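The disease example can be reproduced in a short sketch. The labels below are made up to match the 5-in-100 figure. Recall (the share of actual disease cases the model finds) is shown next to accuracy as one possible complementary measure; the text above does not name a specific metric.

```python
# Hypothetical labels mirroring the example: 5 diseased patients out of 100.
y_true = [1] * 5 + [0] * 95   # 1 = disease, 0 = no disease
y_pred = [0] * 100            # a model that always predicts "no disease"

# Accuracy: share of all predictions that are correct
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: share of diseased patients the model found
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy = {accuracy:.2f}")  # prints accuracy = 0.95
print(f"recall   = {recall:.2f}")    # prints recall   = 0.00
```

The model looks strong by accuracy alone while missing every patient with the disease, which is exactly the failure described above.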

Approaches to Dealing with the Imbalanced Dataset Problem

In certain cases, such as fraud detection or disease prediction, it is crucial to identify the minority class accurately. So the model should not be biased toward detecting only the majority class; the minority class should be given equal weight or importance. Here are a few suggested techniques that can deal with this problem. There is no right or wrong method here; different techniques work well with different problems.

Re-Sampling Technique: In this technique, we focus on balancing the classes in the training data (data pre-processing) before supplying the data as input to the machine learning algorithm. The main objective of balancing classes is to either increase the frequency of the minority class or decrease the frequency of the majority class. This is done to obtain approximately the same number of instances for both classes. Below are a few resampling techniques:

Random Under-Sampling – Random under-sampling tries to balance the class distribution by randomly removing majority class examples. This is done until the majority and minority class instances are balanced out.
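A minimal sketch of random under-sampling, using only the Python standard library. The function name `random_undersample`, the `seed` parameter, and the toy data are assumptions made for illustration, not part of any particular library.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)

    # Group the rows by class label
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is cut down to the size of the smallest class
    target = min(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        kept = rng.sample(members, target)  # sampling without replacement
        out_rows.extend(kept)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data: 5 minority rows (label 1) and 95 majority rows (label 0)
rows = [[i] for i in range(100)]
labels = [1] * 5 + [0] * 95

X_bal, y_bal = random_undersample(rows, labels)
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Note that 90 of the 95 majority rows are discarded here, so under-sampling trades a balanced training set for a loss of majority class information.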

Random Over-Sampling – Random over-sampling increases the number of instances in the minority class by randomly replicating them, giving the minority class a higher representation in the sample.
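Random over-sampling can be sketched the same way, again with the standard library only. The function name `random_oversample`, the `seed` parameter, and the toy data are illustrative assumptions.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Randomly replicate rows of smaller classes until every class matches the largest."""
    rng = random.Random(seed)

    # Group the rows by class label
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    # Every class is grown to the size of the largest class
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing rows with replacement from the existing ones
        extra = [rng.choice(members) for _ in range(target - len(members))]
        out_rows.extend(members + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels

# Toy data: 5 minority rows (label 1) and 95 majority rows (label 0)
rows = [[i] for i in range(100)]
labels = [1] * 5 + [0] * 95

X_over, y_over = random_oversample(rows, labels)
print(len(X_over), y_over.count(0), y_over.count(1))
```

Every added row is an exact copy of an existing minority row, so no information is lost, but no new information is gained either.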

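Cluster-Based Over-Sampling – In this technique, the K-means clustering algorithm is applied independently to the minority and majority class instances in order to identify clusters in the dataset. Each cluster is then over-sampled so that all clusters of the same class have an equal number of instances and all classes end up the same size.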

Algorithmic Ensemble Techniques

The section above dealt with handling imbalanced data by resampling the original data to provide balanced classes. The main objective of ensemble methodology, by contrast, is to improve the performance of single classifiers. The approach involves constructing several two-stage classifiers from the original data and then aggregating their predictions. Below are a few examples of this technique.

Bagging-Based Techniques for Imbalanced Data – Bagging is an abbreviation of Bootstrap Aggregating. The algorithm generates 'n' different bootstrap training samples, drawn with replacement, trains the algorithm on each bootstrapped sample separately, and then aggregates the predictions at the end. Bagging is used to reduce overfitting and so create strong learners that generate accurate predictions.
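The bagging steps described above (bootstrap samples, one model per sample, aggregated predictions) can be sketched with a very simple base learner. The one-feature threshold rule `train_stump`, the function names, and the toy data are all assumptions chosen to keep the example short; any classifier could take the stump's place.

```python
import random
from collections import Counter

def train_stump(points):
    """Fit a one-feature threshold rule: predict 1 when x >= threshold."""
    best_threshold, best_errors = None, None
    for threshold in sorted({x for x, _ in points}):
        errors = sum((x >= threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold

def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement
        sample = [rng.choice(points) for _ in points]
        models.append(train_stump(sample))
    return models

def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(int(x >= threshold) for threshold in models)
    return votes.most_common(1)[0][0]

# Toy data: (feature, label) pairs where the label is 1 when the feature is 8 or more
points = [(x, int(x >= 8)) for x in range(10)] * 3

models = bagging_fit(points)
print(bagging_predict(models, 9), bagging_predict(models, 1))
```

An odd number of models is used so that the majority vote between two classes can never tie.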

Boosting-Based Techniques for Imbalanced Data – Boosting is an ensemble technique that combines weak learners to create a strong learner that can make accurate predictions. Boosting begins with a base classifier (a weak classifier) trained on the training data. Unlike bagging, which trains its learners independently on bootstrapped samples, boosting trains learners one after another, with each new learner giving more weight to the instances that earlier learners misclassified.
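To make the idea of combining weak learners concrete, here is a small sketch that uses AdaBoost as one specific boosting algorithm; the text above does not name a particular one, so this choice is an assumption. The weak learner is a one-feature threshold rule, labels are encoded as +1 and -1, and all names and the toy data are illustrative.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak learner: predict polarity when x >= threshold, otherwise the opposite."""
    return polarity if x >= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best threshold rule."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for w, x, y in zip(weights, xs, ys)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost_fit(xs, ys, n_rounds):
    weights = [1.0 / len(xs)] * len(xs)  # start with equal weight on every point
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # this learner's say in the vote
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points and lower the rest
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def adaboost_predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, threshold, polarity)
                for alpha, threshold, polarity in ensemble)
    return 1 if score >= 0 else -1

# Toy data no single threshold can separate: positives sit in the middle
xs = list(range(10))
ys = [-1, -1, -1, 1, 1, 1, 1, -1, -1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
train_accuracy = sum(adaboost_predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)
print(train_accuracy)
```

No single threshold rule gets more than 7 of these 10 points right, but the weighted vote of three rules classifies all of them correctly, which illustrates how weak learners combine into a stronger one.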

SMOTE

Synthetic Minority Oversampling Technique (SMOTE) is another technique for oversampling the minority class. Simply adding duplicate records of the minority class often doesn't add any new information to the model. In SMOTE, new instances are synthesized from the existing data. Put simply, SMOTE takes a minority class instance, finds its k nearest minority class neighbors, picks one of them at random, and creates a synthetic instance at a random point between the two in feature space.
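The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration rather than a full implementation: it picks the starting instance at random instead of looping over every minority instance, and the function name `smote`, its parameters, and the toy points are assumptions. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the line segment from base to neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class with two features
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]

new_points = smote(minority, n_synthetic=10, k=3)
for point in new_points[:3]:
    print([round(value, 3) for value in point])
```

Because each synthetic point lies between two existing minority points, the new instances stay inside the region the minority class already occupies instead of being exact copies.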