Unstructured data such as text, images and videos contains a goldmine of information. However, because this data is complex to analyse and process, organisations often refrain from investing extra time and effort in these unstructured sources. Understanding the language in unstructured data has always been difficult; however, plenty of work is being done to bring language into the field of artificial intelligence in the form of Natural Language Processing (NLP). Sitting at the intersection of computer science, artificial intelligence and linguistics, NLP aims to enable computers to process or understand unstructured human language in order to perform tasks like language translation and question answering.
With the rise of chatbots and voice interfaces, NLP, a critical component of AI, is one of the most important technologies of the information age. Fully understanding and representing the meaning of human language is an extremely difficult goal, because language is ambiguous and depends heavily on context. Nevertheless, NLP is taking giant steps towards that goal.
Decoding the Human Language
To understand how NLP decodes human language, let’s consider the text snippet below, taken from a customer review of a fictional auto insurance company called Dash Auto Insurance:
“The customer service of Dash Insurance is terrible. I have to call the call center multiple times before I get a decent reply. The call center officers are extremely totally ignorant and extremely rude. Last month I called them with a request to update my correspondence address from Houston to Dallas. I spoke with a change in address to about a dozen representatives –Antonio Parker, Emma Jones, Renee Stevenson to name a few. Even after drafting multiple emails and filling out numerous forms, the address is not been updated. Even my agent Nicole is useless. The policy details she gave me were wrong and the only good thing about the company is the pricing. The premium is reasonable as compared to the other insurance companies which are their competitors. Dash Auto Insurance has not increased my premium significantly since 2015.”
Let’s look at five common techniques for extracting information from the text above:
1. Named Entity Recognition
Extracting the entities in a text is one of the most basic tasks in NLP. Named Entity Recognition (NER) highlights the fundamental concepts and references in a document by identifying entities such as organizations, dates, people and locations. The NER output for the sample customer review above will typically be:
• Person: Antonio Parker, Emma Jones, Renee Stevenson, Nicole
• Location: Houston, Dallas
• Date: Last month, 2015
• Organization: Dash
NER is typically based on supervised models, grammar rules, or a combination of the two. Toolkits such as Apache OpenNLP also ship with pre-trained NER models.
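To make the rule-based side of NER concrete, here is a minimal sketch that tags entities using hand-written name lists (gazetteers) and a regular expression for date-like phrases. The gazetteer contents, the date pattern and the function name tag_entities are assumptions made for illustration; a production system would rely on a trained model rather than fixed lists.

```python
import re

# Hypothetical gazetteers; a real system would learn entities from annotated data.
GAZETTEER = {
    "Location": ["Houston", "Dallas"],
    "Organization": ["Dash Auto Insurance", "Dash Insurance"],
    "Person": ["Antonio Parker", "Emma Jones", "Renee Stevenson", "Nicole"],
}

# Illustrative grammar rule for dates: relative phrases and four-digit years.
DATE_PATTERN = re.compile(
    r"\b(?:last|next) (?:week|month|year)\b|\b(?:19|20)\d{2}\b",
    re.IGNORECASE,
)


def tag_entities(text):
    """Return a list of (start_offset, label, surface_text), sorted by offset."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                found.append((match.start(), label, match.group()))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), "Date", match.group()))
    return sorted(found)


review = (
    "Last month I called them with a request to update my correspondence "
    "address from Houston to Dallas. Even my agent Nicole is useless. "
    "Dash Auto Insurance has not increased my premium significantly since 2015."
)

entities = tag_entities(review)
for _, label, surface in entities:
    print(f"{label}: {surface}")
```

Fixed lists only find names they already know, which is why supervised models are usually preferred for people and organizations, while simple rules remain useful for regular patterns such as dates.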
2. Sentiment Analysis
Sentiment analysis is the most widely used NLP technique. It is deployed in cases such as customer review analysis, social media monitoring and customer surveys, where customers express their opinions and feedback. The simplest output of sentiment analysis is a three-point scale: positive, negative or neutral. In more complex cases the output may be a numeric score which can be bucketed into as many categories as required.
In the sample review above, the customer clearly expresses different sentiments in different parts of the text, so a single overall label is not very useful. Instead, the sentiment of each sentence can be computed, separating the negative parts of the review from the positive ones. The sentiment score can also help pick out the most negative and most positive parts of the review, as shown below:
• Most negative comment: The call center officers are extremely totally ignorant and extremely rude.
• Sentiment Score: -1.3058402
• Most positive comment: The premium is reasonable as compared to the other insurance companies which are their competitors.
• Sentiment Score: 0.2542809
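Sentence-level scoring of this kind can be sketched with a small word-polarity dictionary. The lexicon, the intensifier weights and the function names below are hypothetical, so the scores this sketch produces will not match the figures quoted above, which come from a different scorer.

```python
import re

# Hypothetical polarity lexicon; real lexicons contain thousands of scored words.
LEXICON = {
    "terrible": -1.0, "ignorant": -0.8, "rude": -0.9, "useless": -0.9,
    "wrong": -0.6, "decent": 0.4, "good": 0.6, "reasonable": 0.5,
}
# Hypothetical intensifiers that scale the next sentiment word.
INTENSIFIERS = {"extremely": 1.5, "totally": 1.3, "very": 1.3}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score_sentence(sentence):
    """Sum word polarities, boosting words that follow intensifiers."""
    score, boost = 0.0, 1.0
    for word in re.findall(r"[a-z']+", sentence.lower()):
        if word in INTENSIFIERS:
            boost *= INTENSIFIERS[word]
        elif word in LEXICON:
            score += boost * LEXICON[word]
            boost = 1.0
        else:
            boost = 1.0  # an intensifier only affects the word right after it
    return score


review = (
    "The customer service of Dash Insurance is terrible. "
    "I have to call the call center multiple times before I get a decent reply. "
    "The call center officers are extremely totally ignorant and extremely rude. "
    "Even my agent Nicole is useless. "
    "The premium is reasonable as compared to the other insurance companies "
    "which are their competitors."
)

scored = [(score_sentence(s), s) for s in split_sentences(review)]
most_negative = min(scored)
most_positive = max(scored)
print("Most negative:", most_negative[1])
print("Most positive:", most_positive[1])
```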
Sentiment analysis can be undertaken with supervised and unsupervised techniques. The most popular supervised model for sentiment analysis is naive Bayes. The naive Bayes algorithm requires a training corpus with sentiment labels; the model is trained on this labelled corpus and then used to predict the sentiment of new text. Other machine learning techniques such as random forest or gradient boosting can also be used. The unsupervised techniques, also known as lexicon-based methods, require a dictionary of words with their associated sentiment polarity. The sentiment score of a sentence is calculated using the polarities of the words in that sentence.
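As a rough illustration of the supervised route, the sketch below trains a multinomial naive Bayes classifier with add-one smoothing on a tiny, made-up labelled corpus. The training sentences, labels and function names are assumptions for the example; a usable model needs far more data and proper tokenisation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labelled corpus; a real one needs many more examples.
TRAINING_DATA = [
    ("the staff were rude and unhelpful", "negative"),
    ("terrible service and a useless agent", "negative"),
    ("my claim was handled badly and slowly", "negative"),
    ("the premium is reasonable and fair", "positive"),
    ("good pricing and a helpful agent", "positive"),
    ("quick and friendly service", "positive"),
]


def train(examples):
    """Count word occurrences per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    """Return the label with the highest posterior log-probability."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label in label_counts:
        log_prob = math.log(label_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            numerator = word_counts[label][word] + 1
            log_prob += math.log(numerator / (total_words + len(vocabulary)))
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


model = train(TRAINING_DATA)
print(predict("the agent is rude and useless", model))
print(predict("reasonable premium and good service", model))
```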
3. Text Summarization
As the name suggests, text summarization is the NLP technique that condenses large chunks of text. It is mainly used for material such as research papers and news articles. Extraction and abstraction are the two broad approaches to text summarization. Extraction methods create a summary by selecting parts of the original text, whereas abstraction methods generate fresh text that conveys the gist of the original. Various algorithms can be deployed for text summarization, including TextRank, Latent Semantic Analysis and LexRank. LexRank, for example, ranks sentences by their similarity to one another: a sentence is ranked higher when it is similar to many other sentences, which are in turn similar to other sentences.
Using LexRank, the sample review text is summarized as: I have to call the call center multiple times before I get a decent reply. The premium is reasonable as compared to the other insurance companies which are their competitors.
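The ranking idea behind LexRank can be sketched in a few dozen lines. The version below is a simplified, continuous variant: sentences are compared with TF-IDF cosine similarity and scored by a PageRank-style power iteration. The smoothed IDF formula, the damping factor and all function names are assumptions chosen for the example, so the sentences it selects may differ from the summary quoted above.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    return re.findall(r"[a-z]+", sentence.lower())


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(sentences)
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; other weightings are possible).
        vectors.append({w: tf[w] * math.log(1 + n / doc_freq[w]) for w in tf})
    return vectors


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over the similarity graph."""
    vectors = tfidf_vectors(sentences)
    n = len(sentences)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if out_weight[j]:
                    rank += sim[j][i] / out_weight[j] * scores[j]
                else:
                    rank += scores[j] / n  # isolated sentence: spread evenly
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = lexrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The customer service of Dash Insurance is terrible.",
    "I have to call the call center multiple times before I get a decent reply.",
    "The call center officers are extremely totally ignorant and extremely rude.",
    "Even my agent Nicole is useless.",
    "The premium is reasonable as compared to the other insurance companies.",
]
summary = summarize(sentences)
print(" ".join(summary))
```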
4. Aspect Mining
Aspect mining identifies the different aspects discussed in a text. When used in conjunction with sentiment analysis, it extracts complete information from the text. One of the easiest methods of aspect mining is part-of-speech tagging. When aspect mining and sentiment analysis are applied to the sample text, the output conveys the complete intent of the review, as shown below:
Aspects & Sentiments:
• Customer service – Negative
• Call center – Negative
• Agent – Negative
• Pricing/Premium – Positive
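A pairing of aspects and sentiments like the one above can be approximated with a short sketch. Note the substitution: instead of running a part-of-speech tagger to find candidate noun phrases, this example uses a hand-written aspect vocabulary, and it attaches opinion words to an aspect when both occur in the same clause. The vocabulary, the opinion lexicon and the function name mine_aspects are all hypothetical.

```python
import re

# Hypothetical aspect vocabulary standing in for the noun phrases
# that a part-of-speech tagger would normally extract.
ASPECT_TERMS = {
    "Customer service": ["customer service"],
    "Call center": ["call center"],
    "Agent": ["agent"],
    "Pricing/Premium": ["pricing", "premium"],
}
# Hypothetical opinion-word lexicon (+1 positive, -1 negative).
OPINION_WORDS = {
    "terrible": -1, "ignorant": -1, "rude": -1, "useless": -1, "wrong": -1,
    "good": 1, "reasonable": 1,
}


def mine_aspects(text):
    """Map each aspect found in the text to an overall sentiment label."""
    totals = {}
    # Split into clauses at sentence ends and at the conjunction "and".
    for clause in re.split(r"[.!?]|\band\b", text.lower()):
        words = re.findall(r"[a-z]+", clause)
        polarity = sum(OPINION_WORDS.get(w, 0) for w in words)
        for aspect, terms in ASPECT_TERMS.items():
            if any(term in clause for term in terms):
                totals[aspect] = totals.get(aspect, 0) + polarity
    labels = {}
    for aspect, total in totals.items():
        if total > 0:
            labels[aspect] = "Positive"
        elif total < 0:
            labels[aspect] = "Negative"
        else:
            labels[aspect] = "Neutral"
    return labels


review = (
    "The customer service of Dash Insurance is terrible. "
    "The call center officers are extremely totally ignorant and extremely rude. "
    "Even my agent Nicole is useless. "
    "The policy details she gave me were wrong and the only good thing "
    "about the company is the pricing. "
    "The premium is reasonable as compared to the other insurance companies."
)

aspects = mine_aspects(review)
for aspect, label in aspects.items():
    print(f"{aspect}: {label}")
```

Splitting at "and" keeps "wrong" (about the policy details) from cancelling out "good" (about the pricing), a crude stand-in for the syntactic analysis a real aspect miner would perform.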
5. Topic Modeling
Topic modeling is one of the more complicated NLP methods and is used to identify the natural topics in a text. Its main advantage is that it is an unsupervised technique: it does not require a labelled training dataset. Common topic modeling algorithms include:
• Latent Semantic Analysis (LSA)
• Probabilistic Latent Semantic Analysis (PLSA)
• Latent Dirichlet Allocation (LDA)
• Correlated Topic Model (CTM)
Using the sample text and assuming two inherent topics, the topic modeling output identifies the most characteristic words of each topic. In the text above, the first theme is the customer complaining about the call centre and the work not being done, while the second revolves around the fact that the premium is reasonable. The main words for topic 1 include call, center and service. The main words for topic 2 include price, premium and reasonable. This implies that topic 1 corresponds to customer service and topic 2 corresponds to pricing.
The techniques discussed above are just a few of those used in natural language processing. Once the important information has been extracted from unstructured text using these methods, it can be consumed directly as insights or used as input to clustering exercises and machine learning models to enhance their performance and accuracy.
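For readers curious about how LDA assigns words to topics, here is a minimal collapsed Gibbs sampling sketch. The pre-tokenised documents, the hyperparameters alpha and beta, and the function name lda_gibbs are assumptions for illustration. With so little data the topics found can vary with the random seed, so the output should not be expected to reproduce the two themes described above exactly.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return the count tables."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per doc/topic
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Hypothetical preprocessing: review sentences reduced to content words.
docs = [
    ["customer", "service", "terrible"],
    ["call", "call", "center", "reply"],
    ["call", "center", "officers", "rude"],
    ["premium", "reasonable", "competitors"],
    ["pricing", "good", "premium"],
    ["premium", "increased"],
]

doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, counts in enumerate(topic_word, start=1):
    top_words = [w for w, c in counts.most_common(3) if c > 0]
    print(f"Topic {k}: {', '.join(top_words)}")
```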