Real-Time Processing – A Big Data Use Case

by October 19, 2019

Attention – who doesn’t need it? Here, though, we mean user attention. Organizations spend millions of dollars to ensure their products draw that much-needed attention. They offer the best possible personalization experiences so that customers build a long-lasting relationship and a personal connection with the product.

What is hyper-personalization, and how do we measure its effectiveness? It is the process of understanding a customer’s needs through effective interaction between the product and the customer, resulting in enhanced customer satisfaction. This may increase the likelihood that the customer repurchases the product and motivates others to do the same.


Use Case

E-commerce giants sell millions of products through their portals and mobile apps, with options to customize design, packaging, quantity and more. It is necessary to give every customer a sense of ownership and to encourage a change in behavior. This has to be backed by price control and product availability.

The portal/app offers various products, and every sale or search generates an enormous amount of data along with opportunities to understand customer behavior. Hence there is a need for a solution that manages a large volume of data and delivers prescriptive and predictive recommendations in real time. Real-time analytics should make it possible to search across properties such as price, customization, availability, and personalization, so that the business can draw conclusions that have a positive impact on sales.


Solution Requirements

The proposed solution should have all the necessary components, with automated data loading and governance processes. Along with the predictive solution, there should be provision for an analytics datamart with robust data management and an automated data quality process. The datamart should contain the components needed to build dashboards and insights. The solution should include a production-ready ML platform in the cloud that supports CI/CD and gives data scientists an environment to build, train and deploy models. The production-scale deployment must be able to scale the environment up or down based on real-time usage.

The requirement for real-time ingestion together with analytical needs can be met by several architecture patterns. A hybrid solution that is a variant of the Kappa architecture works well for this kind of scenario. The Lambda architecture also satisfies the requirement, but it maintains separate batch and streaming paths, so we would end up with both code and processing overhead. The Kappa architecture consists only of a streaming layer and a serving layer. Our challenge was to include both real-time analytics and an offline datamart in the serving layer for the streaming data. Hence the choice was a variant of the Kappa architecture.
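
To make the idea concrete, here is a minimal sketch of such a Kappa variant, under stated assumptions. The names `Event`, `process` and `ServingLayer` are hypothetical and do not come from the actual implementation, and a plain Python list stands in for the durable event log that a streaming platform would provide. The point it illustrates is that one processing function feeds both the real-time view and the offline datamart, so there is no second batch code path to maintain.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    """Hypothetical portal/app event; field names are illustrative only."""
    product_id: str
    action: str      # e.g. "search" or "sale"
    amount: float


def process(event: Event) -> dict:
    """The single processing code path shared by both serving targets."""
    is_sale = event.action == "sale"
    return {
        "product_id": event.product_id,
        "is_sale": is_sale,
        "revenue": event.amount if is_sale else 0.0,
    }


class ServingLayer:
    """Serving layer holding a real-time view and an offline datamart."""

    def __init__(self) -> None:
        self.realtime_sales = defaultdict(int)       # updated on every event
        self.datamart_revenue = defaultdict(float)   # rebuilt by replaying the log

    def on_event(self, event: Event) -> None:
        """Real-time path: update the live sales counter as events arrive."""
        record = process(event)
        if record["is_sale"]:
            self.realtime_sales[record["product_id"]] += 1

    def rebuild_datamart(self, log: Iterable[Event]) -> None:
        """Offline path: replay the same log through the same process()."""
        self.datamart_revenue.clear()
        for event in log:
            record = process(event)
            self.datamart_revenue[record["product_id"]] += record["revenue"]
```

Replaying the log through the same function is what distinguishes this from a Lambda-style design, where the datamart would be filled by separately written batch code.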

The implementation was done in a cloud environment. To stay technology-neutral, the solution is kept generic and offers both Azure and AWS, with their respective components, as options. One of the two was shortlisted after a deep dive into the solution, based on the outcome of the proof of concept.

The success of any real-time architecture implementation depends on the following factors:

•  Minimum hops before the information hits the serving layer

•  In-memory validation and transformations

•  A robust and automated rule engine

•  Automated end-to-end processing

•  Identification and separation of real-time, near-real-time and non-real-time use cases
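
Two of the factors above, in-memory validation and an automated rule engine, can be sketched as follows. This is only an illustration under assumptions: the rule names, the record fields (`product_id`, `price`, `quantity`) and the `validate` function are hypothetical and not taken from the actual solution. Rules are plain data, so they can be added or removed without touching the processing code, and each record is checked in memory without an extra hop to external storage.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Tuple[str, Callable[[Record], bool]]

# Hypothetical rules: each is a name plus a predicate that must hold.
RULES: List[Rule] = [
    ("price_is_positive",
     lambda r: isinstance(r.get("price"), (int, float)) and r["price"] > 0),
    ("has_product_id",
     lambda r: bool(r.get("product_id"))),
    ("quantity_is_non_negative",
     lambda r: isinstance(r.get("quantity"), int) and r["quantity"] >= 0),
]


def validate(record: Record, rules: List[Rule] = RULES) -> List[str]:
    """Return the names of all rules the record violates (empty if valid)."""
    return [name for name, check in rules if not check(record)]
```

For example, `validate({"product_id": "p1", "price": 9.99, "quantity": 2})` returns an empty list, while a record with an empty `product_id` and a negative price is reported against both corresponding rules.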


The Solution

The solution begins by extracting data from the various source applications. Any existing or custom-developed APIs can be used to extract the application data into the ingest layer.

How, and to what extent, validation and data cleaning are incorporated depends on the use case. The solution needs a well-defined process that manages all validations and rules without any user intervention. An algorithm that produces a score and a confidence level decides the fate of the data to be fixed.

The transform layer ensures that the data is integrated and the business logic is applied before the data is pushed to the store layer. In many use cases the ingest and transform layers are merged, although this depends on the complexity and the needs. Most real-time solutions keep data validation, data quality (DQ) checks and transformation to the minimum required.

The non-transformed data residing in the store layer is pushed to the analyze layer, where it is used by the predictive solution, whereas the transformed data is pushed to real-time reporting. Another dataset, generated from the transformed data, is moved to the datamart store for specific batch reporting.
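
As a rough sketch of how a confidence level might decide the fate of a data fix without user intervention, consider the following. Everything here is an assumption made for illustration: the thresholds, the three outcomes, and the names `FixSuggestion`, `decide` and `apply_suggestion` are hypothetical, and the actual algorithm and its scoring are not described in this article.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed thresholds; real values would be tuned per use case.
APPLY_THRESHOLD = 0.9
FLAG_THRESHOLD = 0.5


@dataclass(frozen=True)
class FixSuggestion:
    """Hypothetical output of the data quality algorithm for one field."""
    field: str
    suggested_value: object
    confidence: float   # between 0.0 and 1.0


def decide(suggestion: FixSuggestion) -> str:
    """Map the confidence level to one of three automated outcomes."""
    if suggestion.confidence >= APPLY_THRESHOLD:
        return "apply_fix"
    if suggestion.confidence >= FLAG_THRESHOLD:
        return "keep_flagged"
    return "reject"


def apply_suggestion(record: dict, suggestion: FixSuggestion) -> Optional[dict]:
    """Return the record to pass downstream, or None if it is rejected."""
    outcome = decide(suggestion)
    if outcome == "apply_fix":
        return {**record, suggestion.field: suggestion.suggested_value}
    if outcome == "keep_flagged":
        return {**record, "dq_flag": suggestion.field}
    return None
```

The function returns a new dictionary rather than mutating its input, so the non-transformed record remains available for the analyze layer.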

The following table lists the component choices available to a solution, based on its needs and use cases.

The choice of cloud provider depends on various factors, and these play an important role in designing the solution for each use case.

If you have any queries or comments, please reach out to me on