A big data engagement is about building a data solution that addresses analytical needs. The scope of a typical engagement is to extract, transform, and load datasets so that users get the reports and analytics they require. Accurate reports and analytics help the business make key decisions. In simple terms, the business expects a single version of the truth.
During the initiation phase, the solution provider engages business users to gather information about the vision, key asks, challenges, and success criteria. Users can provide vital information about their datasets and the level of confidence they have in the data. Many users, however, underestimate how poor the condition of their datasets is when describing their accuracy. Relying on such assurances is like sitting on a bomb and hoping it will not explode.
So, what is the workaround? Data quality checks on the datasets. These checks give us the data points required to determine the state of the datasets. In many data engagements, a discrepancy appears trivial, but in-depth analysis unearths many important and unresolved issues. Technical data issues can be found with little effort; issues related to business logic, rules, validations, and compliance procedures are much harder to find. In insurance, for example, it is difficult to anticipate discrepancies such as a mismatch between premium and general ledger. Such issues can be found only with the help of business users or proficient analysts. If they are not checked and corrected, they lead to the loss of important data points and incorrect analysis, and may have financial implications.
The first move in Data Quality
Over time, organizations have accumulated massive amounts of data, and this data has become an asset. These datasets generate insights and meaningful information for decision making, and the value of data has grown manyfold compared with a few years ago. However, the way data was entered into the systems has led to problems such as duplicate, incomplete, and inconsistent records, which in turn has led users to stop relying on the information delivered to them. Hence, there is a need to address these issues and improve the consistency, accuracy, and completeness of data across the organization by implementing data quality initiatives.
Data auditing, or profiling, is the first step towards resolving data-related issues. It is the way to find out what is happening with datasets across various applications. Auditing provides the ability to analyze data, including big data, in a systematic and continuous process: a methodical, repeatable, consistent, and metrics-based means of evaluating the data. Data auditing has three primary methods: column, rule/logic-based, and dependency/relationship. Column auditing looks for inconsistencies in individual columns, such as malformed dates and numbers. This article's prime focus is on the latter two.
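Although column auditing is not the focus here, a small sketch may help make "metrics-based" concrete. The function below is only an illustration: the name `profile_column`, the chosen metrics, and the assumed date format `%Y-%m-%d` are assumptions for this example, not part of any specific profiling tool.

```python
from datetime import datetime

def profile_column(values, date_format="%Y-%m-%d"):
    """Return basic column-level audit metrics for a list of raw values."""
    # Treat None and blank strings as missing.
    non_missing = [str(v).strip() for v in values
                   if v is not None and str(v).strip() != ""]
    invalid_dates = 0
    for text in non_missing:
        try:
            datetime.strptime(text, date_format)
        except ValueError:
            # The value does not parse as a date in the assumed format.
            invalid_dates += 1
    return {
        "total": len(values),
        "missing": len(values) - len(non_missing),
        "distinct": len(set(non_missing)),
        "invalid_dates": invalid_dates,
    }

# Hypothetical policy creation dates, including a 99/99/99 placeholder.
sample = ["2021-03-01", "99/99/99", "", None, "2021-03-01"]
profile = profile_column(sample)
```

Running the same function on every load gives repeatable metrics that can be compared over time.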
Basics before the deep dive
Standardization is an important auditing activity: it ensures that data across the organization conforms to a common set of rules.
In the above figure, the address is written in several different ways, although every variant refers to the same address. The standardization process corrects and streamlines these values so that the organization has one unified version. It is essential to examine dimensions such as Customer and Product for possible standardization. Many organizations face serious problems with these dimensions because many reports depend directly on their accuracy. Inconsistencies here lead to serious issues and can create a trust deficit in the reporting systems.
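A rule-based standardization step of the kind described above could be sketched as follows. The abbreviation table and the function name `standardize_address` are hypothetical; a real engagement would draw its rules from the organization's own address standards.

```python
import re

# Hypothetical abbreviation rules; extend these to match the organization's standard.
ABBREVIATIONS = {
    "st": "street",
    "rd": "road",
    "ave": "avenue",
    "apt": "apartment",
}

def standardize_address(raw):
    """Lowercase, strip commas and periods, and expand known abbreviations."""
    tokens = re.sub(r"[,.]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)

variants = ["12 Main St.", "12, MAIN Street", "12 main st"]
standardized = {standardize_address(v) for v in variants}
```

All three variants collapse to a single value, which is the "unified version" the standardization step aims for.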
Data is collected through multiple channels with no consistency or standards in the attributes, which leads to duplicates across customers, products, and many other dimensions. The problem becomes severe when the same data is pulled into analytics, where it distorts the results. The biggest hit is on customer and product data, which affects customer communication and results in missed sales opportunities.
In the above figure, the same customer appears in many versions because the values were keyed in inconsistently.
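One simple way to surface such keying variants is to group records by a normalized key. This is a minimal sketch: the normalization rules (lowercase, drop punctuation, collapse spaces) and the function names are assumptions chosen for illustration.

```python
import re
from collections import defaultdict

def normalize_name(name):
    """Lowercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())

def group_duplicates(names):
    """Return only the normalized keys that have more than one raw variant."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return {key: raw for key, raw in groups.items() if len(raw) > 1}

customers = ["John Smith", "john  smith", "JOHN SMITH.", "Jane Doe"]
duplicates = group_duplicates(customers)
```

Exact-key grouping only catches formatting differences; spelling variants need the fuzzy matching discussed later.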
Because of the nature and requirements of the various applications in an organization, the same data gets keyed into multiple applications. Records spread across these applications should be consolidated, or merged, into a single record to achieve data completeness.
In the above figure, the three different entities are merged into a single record to create one dataset.
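The merge step can be sketched as below. The survivorship rule used here (the first non-empty value per field wins) and the three source records are assumptions for illustration; real projects often rank sources by trust or recency instead.

```python
def merge_records(records):
    """Merge dicts describing one entity; the first non-empty value per field wins."""
    merged = {}
    for record in records:
        for field, value in record.items():
            if value not in (None, "") and field not in merged:
                merged[field] = value
    return merged

# Hypothetical records for the same customer held in three applications.
crm = {"customer_id": "C1", "name": "John Smith", "email": ""}
billing = {"customer_id": "C1", "email": "john@example.com", "phone": None}
support = {"customer_id": "C1", "phone": "555-0100"}

golden_record = merge_records([crm, billing, support])
```

The order of the input list acts as the source priority, so the most trusted application should come first.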
Enhancement is the process of adding value to existing datasets by collecting additional, related, and reference information and integrating it with the base dataset so that each entity's record is complete.
Matching is the process of linking records from various applications through a unique key. Matching uses several attributes to identify duplicates in the respective datasets. In the de-duplication and merging step, two or more identical or duplicate records are merged into one. As part of the data governance process, an exhaustive rule engine is established to match the data based on various rules.
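As a rough illustration of attribute-based matching, the sketch below combines an exact comparison on one attribute with a fuzzy comparison on another, using `difflib.SequenceMatcher` from the Python standard library. The field names, the choice of postcode as the exact attribute, and the 0.85 threshold are all assumptions.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    """Similarity in [0, 1] between two names, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def is_match(rec_a, rec_b, threshold=0.85):
    """Treat two records as one customer if postcodes agree and names are close."""
    same_postcode = rec_a["postcode"] == rec_b["postcode"]
    close_names = name_similarity(rec_a["name"], rec_b["name"]) >= threshold
    return same_postcode and close_names

# Hypothetical records from two applications.
crm = {"name": "John Smith", "postcode": "10001"}
billing = {"name": "Jon Smith", "postcode": "10001"}
other = {"name": "Jane Doe", "postcode": "10001"}
```

Here `is_match(crm, billing)` links the two spellings of the same customer, while `is_match(crm, other)` does not. The threshold trades missed duplicates against false merges and should be tuned on real data.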
Another important aspect of data auditing is the identification of noisy data. In data science, noise is any unwanted or meaningless data that cannot be interpreted to derive meaningful insights. Noisy data often misleads algorithms into generating error-prone patterns. Outliers are one example of noisy data. They can be detected in three ways: distance-based, density-based, and clustering-based outlier detection methods.
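To illustrate the distance-based family only, the sketch below flags a value as an outlier when its distance to its k-th nearest neighbour exceeds a threshold. The function name, the one-dimensional data, and the values of `k` and `threshold` are assumptions; density-based and clustering-based methods would need a dedicated library.

```python
def knn_distance_outliers(values, k=2, threshold=10.0):
    """Flag values whose distance to their k-th nearest neighbour exceeds threshold."""
    if len(values) <= k:
        raise ValueError("need more than k values")
    outliers = []
    for i, value in enumerate(values):
        # Distances from this value to every other value, smallest first.
        distances = sorted(abs(value - other)
                           for j, other in enumerate(values) if j != i)
        if distances[k - 1] > threshold:
            outliers.append(value)
    return outliers

# Hypothetical daily claim counts with one suspicious spike.
claim_counts = [100, 102, 98, 101, 99, 250]
flagged = knn_distance_outliers(claim_counts)
```

This brute-force version compares every pair of values, so it suits small samples; large datasets call for indexed or approximate neighbour search.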
Rule validation is probably the most difficult exercise of all the techniques mentioned above. It requires someone with a broad understanding of the business, so an analyst who understands the domain should be part of the data quality initiative. Before the auditing activities commence, all the business use cases, together with their validations, need to be prepared.
For example, in the insurance domain:
- Missing policy details for the claims data
- Premium data which is generated from the policy administration system should match with the general ledger
- Accuracy and a single version of policy data like premium and related information like benefits, contract terms
- A sudden jump or slump in premium collection or claims count compared with average values
- The claim date cannot be earlier than the policy inception date
- The claim date should be at least 30 days after the policy inception date
- The policy creation date cannot be 99/99/99
- A mismatch between policy management and general ledger application for the Premium data
Apart from the above-mentioned use cases, additional scenarios must be considered as well. At this juncture, participation from business users who understand the business, the applications, and the data is key to the success of the engagement.
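A few of the insurance rules listed above can be expressed as code. This is a hedged sketch: the record layout, the field names (`policy_id`, `claim_date`, `inception_date`), the 30-day waiting period parameter, and the ledger tolerance are assumptions that a business analyst would need to confirm.

```python
from datetime import date, timedelta

def validate_claim(claim, policies, waiting_days=30):
    """Return a list of rule violations for one claim (empty if clean)."""
    policy = policies.get(claim["policy_id"])
    if policy is None:
        return ["missing policy details"]
    violations = []
    inception = policy["inception_date"]
    if claim["claim_date"] < inception:
        violations.append("claim date earlier than policy inception date")
    elif claim["claim_date"] < inception + timedelta(days=waiting_days):
        violations.append("claim date within waiting period")
    return violations

def premium_matches_ledger(policy_premium, ledger_premium, tolerance=0.01):
    """Compare the policy administration premium with the general ledger amount."""
    return abs(policy_premium - ledger_premium) <= tolerance

# Hypothetical policy and claims data.
policies = {"P1": {"inception_date": date(2021, 1, 1)}}
claims = [
    {"policy_id": "P1", "claim_date": date(2020, 12, 15)},
    {"policy_id": "P1", "claim_date": date(2021, 1, 10)},
    {"policy_id": "P1", "claim_date": date(2021, 3, 1)},
    {"policy_id": "P9", "claim_date": date(2021, 3, 1)},
]
results = [validate_claim(claim, policies) for claim in claims]
```

Returning a list of named violations, rather than a single pass/fail flag, makes it easier to report which business rule each record broke.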
Where to address
Once profiling is complete and a detailed analysis of all the data-related issues is ready, the best place to fix these issues is in the source application. Fixing them there ensures that they are fixed permanently, with less scope for recurrence. In many cases, however, we do not have the liberty or the scope to change the applications. In that case, the best place to address the issues is the staging area of the architecture.
Fixing data quality issues is an ongoing, multi-stage exercise because the datasets change constantly. Addressing these issues on an ongoing basis requires a framework that not only fixes them but also monitors the data continuously. The rule engine is one important component of such a data quality framework.
The rule engine is a repository where all the business rules and algorithms are stored. The rules are defined by business users or derived during the auditing activity. Rules are dynamic because the data changes over time, and they can be triggered as needed, depending on the required latency. Along with simple rules, it is vital to define complex, domain-specific rules, which set the stage for resolving some of the key and complex data issues. The effectiveness of the rule engine depends on how comprehensive the rules are: as we add more rules covering patterns, trends, scenarios, complex logic, and basic mathematical functions, the share of issues corrected automatically increases. Although it is not practically possible to address every issue automatically, a robust rule engine ensures that the largest possible share of issues is resolved.
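The idea of a rule repository with automated correction can be sketched in a few lines. Everything here is hypothetical: the `Rule` and `RuleEngine` classes, the optional `fix` callback, and the two example rules only show one possible shape, not a particular product's design.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]                 # True when the record passes
    fix: Optional[Callable[[dict], dict]] = None  # optional automated correction

class RuleEngine:
    def __init__(self):
        self.rules = []

    def register(self, rule):
        self.rules.append(rule)

    def apply(self, record):
        """Return the (possibly corrected) record and the names of unresolved rules."""
        unresolved = []
        for rule in self.rules:
            if rule.check(record):
                continue
            if rule.fix is not None:
                record = rule.fix(record)
                if rule.check(record):
                    continue
            unresolved.append(rule.name)
        return record, unresolved

engine = RuleEngine()
engine.register(Rule(
    name="country_code_uppercase",
    check=lambda r: r["country"].isupper(),
    fix=lambda r: {**r, "country": r["country"].upper()},
))
engine.register(Rule(
    name="premium_positive",
    check=lambda r: r["premium"] > 0,  # no safe automated fix for this one
))

fixed, unresolved = engine.apply({"country": "us", "premium": -50})
```

In this run the country code is corrected automatically, while the negative premium has no automated fix and is reported as unresolved for manual handling.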
Data that fails the data quality stage is directed to manual handling. A data specialist validates the dataset at various stages before deciding on the final outcome.
The error log holds two sets of data:
- Data that the rule engine attempted to fix but left unresolved because of low confidence scores.
- Data with issues for which the existing engine has no rule.
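The routing between automated fixes and the two error log categories above might look like the following sketch. The function name, the confidence map, the issue types, and the 0.8 cut-off are assumptions made for illustration.

```python
def route_issue(issue_type, rule_confidence, min_confidence=0.8):
    """Decide where an issue goes, given the confidence of each available rule."""
    if issue_type not in rule_confidence:
        # The engine has no rule for this kind of issue.
        return "error_log:no_rule"
    if rule_confidence[issue_type] < min_confidence:
        # A rule exists, but its fix is not trusted enough to apply.
        return "error_log:low_confidence"
    return "auto_fix"

# Hypothetical confidence scores for the fixes the engine knows about.
rule_confidence = {"address_format": 0.95, "duplicate_customer": 0.60}

routes = {
    issue: route_issue(issue, rule_confidence)
    for issue in ["address_format", "duplicate_customer", "ledger_mismatch"]
}
```

Keeping the two error log categories separate helps the data specialist tell apart issues that need a better rule from issues that need a new one.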
Continuous monitoring of all the data is another aspect of the rule engine. The treated data finally resides in the data lake. From the data lake, it is loaded into two different storage areas: the data mart and the analytical data store.
The current quality level of the data can be shown using a scoring engine, which gives users a clear indication of how much confidence to place in the datasets. The data score can be generated using many algorithms. It is vital that the algorithm uses the right set of parameters, such as:
- Audit records
- Total records
- Risk level
These factors produce a score that highlights the areas needing immediate attention, where the risk or impact is greatest. The same parameters and algorithm can also be used to predict the data score.
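The article does not prescribe a scoring formula, so the following is only one possible sketch built on the three parameters above. It assumes that "audit records" means the number of records flagged by the audit, and the risk weights are invented for illustration.

```python
# Hypothetical weights: the same failure rate costs more at a higher risk level.
RISK_WEIGHTS = {"low": 0.5, "medium": 1.0, "high": 2.0}

def data_score(audit_records, total_records, risk_level):
    """Score from 0 to 100; lower means the dataset needs more attention."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    failure_rate = audit_records / total_records
    # Cap the penalty at 1 so the score never drops below 0.
    penalty = min(1.0, RISK_WEIGHTS[risk_level] * failure_rate)
    return round(100 * (1 - penalty), 1)

# The same 5% failure rate scores lower when the dataset is high risk.
high_risk_score = data_score(50, 1000, "high")
low_risk_score = data_score(50, 1000, "low")
```

Weighting by risk level is what pushes high-impact datasets to the top of the attention list even when their raw failure rate is modest.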
As mentioned, data quality checks are an important activity and should be well planned and executed. The complexity increases as we incorporate big data, especially unstructured data. A well-organized data quality check ensures the accuracy, availability, validity, and completeness of the data asset. Timely identification of data issues, and timely action on them, builds trust and confidence in the datasets, which supports organizational growth.
This article's prime focus is data quality; another important topic that needs attention is master data. I will discuss master data in detail in a separate article.
If you have any queries or comments, or need further details, please reach out to me by email: firstname.lastname@example.org