Understanding the Role and Attributes of Data Access Governance in Data Science & Analytics

Data scientists and business analysts need not only to query data in various repositories to answer their questions, but also to transform that data in order to build sophisticated analyses and models. Read and write operations are at the heart of the data science process and are essential to making quick, well-informed decisions. Supporting both operations is also imperative for data infrastructure teams, which are tasked with democratizing data while complying with privacy and industry regulations.

Meeting the needs of both groups requires a data governance platform that can accelerate data sharing to satisfy the unique requirements of data consumers, while ensuring that the organization as a whole remains in compliance with regulations such as GDPR, CCPA, LGPD, and HIPAA.

Data is the raw material for any type of analytics, whether that is the historical analysis business analysts present in reports and dashboards or the predictive analysis in which data scientists build a model to anticipate an event or behavior that has not yet occurred. To be truly useful, raw information must first be converted into data that is ready for consumption, so that business analysts can create the reports, dashboards, and visualizations that paint a picture of the overall health of the organization.

Data scientists benefit from converted data as well: they can use it to build and train statistical models with techniques such as linear regression, logistic regression, clustering, and time series analysis. The output of these models can then be used to automate decision-making through machine learning.

But this task is becoming increasingly difficult because of the growing number of compliance regulations such as GDPR, CCPA, LGPD, and HIPAA, and because organizations need to secure sensitive data across multiple cloud services. In fact, according to Gartner's Hype Cycle for Privacy, 2021 report^[1], "By year-end 2023, 75% of the world's population will have its personal data covered under modern privacy regulations, up from 25% today" and "before year-end 2023, more than 80% of companies worldwide will be facing at least one privacy-focused data protection regulation".

Because data analytics is an exploratory exercise, data consumers such as business analysts and data scientists must analyze large bodies of data to reveal the patterns, behaviors, or insights that inform a decision. Machine learning, more specifically, attempts to identify the features with the biggest influence on the target variable. Both require access to large amounts of data that may contain sensitive elements, including personally identifiable information (PII) such as a person's age, social security number, or address.

In many instances, this data is owned by different business units and is subject to strict data sharing agreements. This presents infrastructure teams with a unique challenge: giving data consumers access to enterprise data at the required granularity while complying with privacy regulations and with the requirements set by the data owners themselves. Another major challenge for the data infrastructure team is keeping up with the data science team's rapidly growing demand for data for its analytics and innovation projects.

Data science requires not only reading data but also updating it during the transformation steps described above. Put simply, data science is by nature a read- and write-intensive activity. To address this, data infrastructure teams usually create sandbox instances for data consumers whenever they start a new project. However, these sandboxes also require robust data access governance so that no sensitive or confidential data is exposed during data exploration.

According to the same Gartner Hype Cycle for Privacy, 2021 report, "through 2024, privacy-driven spending on data protection and compliance technology will breakthrough to more than $15 billion worldwide". To support a company's growing data science activities, data infrastructure teams need to implement a unified data access governance platform that has four important attributes:

Encrypt Data: The first requirement is the ability to encrypt data as it is extracted from source systems, while it is in transit to the sandbox instance, and at rest. Closely related is the ability to mask a column that contains sensitive elements, which lets the organization give data consumers access to PII such as social security numbers for analysis, governed by predefined rules and permissions. Masking also lets data infrastructure teams redact or partially mask data at the individual customer level to further protect customer privacy.
Implement Read & Write Access Control: The data access governance platform must also be able to natively enforce both read and write access control for on-prem and cloud services. For example, suppose a business analyst needs to add a column to a table to reflect a change in sales territories. To perform this operation, the analyst needs permission from the administrator both to read data from the table and to write new data into it. Read and write operations are critical for data science, yet some products can enforce only read access policies.
Identify, Classify & Tag Sensitive Data Elements: The data access governance solution should also be able to identify, classify, and tag sensitive elements found in the data itself. This capability is often delivered through a combination of mechanisms such as data dictionaries, pattern matching, and models. Combining these mechanisms is generally more effective at detecting sensitive data than relying on any one of them alone.
Provide Visibility: Finally, the data access governance solution must give IT administrators visibility into the analytics process itself. This requires real-time capabilities that tell administrators who requested access to what data, whether access was granted or denied, and which policy was in effect when the decision was made. The ability to log the status of every access request is critical for demonstrating compliance with privacy and industry regulations during internal and external audits.
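
To make the four attributes more concrete, the following is a minimal Python sketch of how they might fit together. It is an illustration under stated assumptions, not a description of any particular product: the names classify_columns, mask_ssn, Policy, and Governor, the role and table names, and the sample data are all hypothetical. Partial masking stands in for the data protection step, and real encryption of data in transit and at rest is out of scope here.

```python
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical pattern for US social security numbers (illustrative only).
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def classify_columns(rows):
    """Tag a column as 'PII' if every non-empty value matches the SSN pattern."""
    tags = {}
    for column in rows[0]:
        values = [row[column] for row in rows if row[column]]
        if values and all(SSN_PATTERN.match(str(v)) for v in values):
            tags[column] = "PII"
    return tags


def mask_ssn(value):
    """Partially mask an SSN, keeping only the last four digits."""
    return "***-**-" + value[-4:]


@dataclass
class Policy:
    """Hypothetical policy: which actions a role may take on a table."""
    role: str
    table: str
    actions: frozenset  # subset of {"read", "write"}
    unmask_pii: bool = False


@dataclass
class Governor:
    policies: list
    audit_log: list = field(default_factory=list)

    def check(self, user, role, table, action):
        """Return the matching policy or None, logging the decision either way."""
        match = next(
            (p for p in self.policies
             if p.role == role and p.table == table and action in p.actions),
            None,
        )
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "table": table,
            "action": action,
            "granted": match is not None,
            "policy": match,  # which policy was in effect, if any
        })
        return match

    def read(self, user, role, table, rows, tags):
        """Return rows with PII columns masked unless the policy allows unmasking."""
        policy = self.check(user, role, table, "read")
        if policy is None:
            raise PermissionError(f"{user} may not read {table}")
        if policy.unmask_pii:
            return [dict(row) for row in rows]
        return [
            {c: mask_ssn(v) if tags.get(c) == "PII" else v for c, v in row.items()}
            for row in rows
        ]


# Usage with made-up sample data: an analyst holding a read-only policy.
rows = [
    {"name": "Ana", "ssn": "123-45-6789", "territory": "West"},
    {"name": "Ben", "ssn": "987-65-4321", "territory": "East"},
]
tags = classify_columns(rows)
gov = Governor([Policy("analyst", "customers", frozenset({"read"}))])
masked = gov.read("alice", "analyst", "customers", rows, tags)
write_policy = gov.check("alice", "analyst", "customers", "write")
```

In this sketch the analyst can read the table with social security numbers partially masked, while the write request is denied because the policy grants only read access. Both decisions are recorded in the audit log, which is the kind of record an administrator would need for an audit.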

Enterprises can thrive in this economy only if data can flow to the far reaches of the organization to inform decisions that improve the company's profitability and competitive position. However, every company must share data with proper guardrails in place so that only authorized personnel can access the required data. An ever-increasing list of privacy regulations mandates this, and it is also necessary to preserve the trust that customers have placed in the company. To securely extract insights from their data, companies need a data governance solution that supports both read and write operations, automates the identification and classification of sensitive data, protects that data by encrypting it, and provides visibility into the company's data ecosystem.

About the Author

Balaji Ganesan is CEO and co-founder of both Privacera, the cloud data governance and security leader, and XA Secure, which was acquired by Hortonworks. He is an Apache Ranger committer and a member of its project management committee (PMC). To learn more, visit www.privacera.com or follow the company on Twitter.
