Five Important Considerations For Data Scientists

Data science is growing in prominence as a strategic business tool. It is therefore unsurprising that a recent study by Deloitte found that many businesses are planning to triple their data science teams over the next 24 months. However, with the advent of GDPR and increasing scrutiny of data privacy, data scientists bear greater responsibility for how they process and model data. Here are the five most important considerations for data scientists over the coming months.

1. Explainability and transparency

As we know, last May saw the introduction of GDPR, which changed the way that global organisations collect, manage and process the data of citizens residing in the European Union. Its impact on data science has been far reaching, as it governs what data can be used for modelling and how transparent models need to be. Under GDPR, organisations must be able to explain how they have arrived at data-driven decisions. This means that organisations must have a very firm handle on the provenance of their data and be sure that any customer data has the appropriate consent. It is also thought that harsher regulations relating to ePrivacy could be introduced later this year, which will further restrict what data can be used. In an increasingly permissioned landscape, data architecture will have to become a greater priority for data scientists.

2. Version control

Data version control is closely related to GDPR and ePrivacy. Keeping track of changes that you or your collaborators make to data and software is a critical part of any project, because explaining an outcome from a model at a given point in time may require you to reference or retrieve an older iteration. This is especially important if you have built models that are frequently retrained, or partially retrained, on the latest data: you need to be sure that historic versions of both the model and the training data are available should an audit be required.

The same is true if you frequently iterate on and develop models. Model development is often an iterative process, with new techniques and packages becoming available all the time. It is good for businesses to pay attention to their full suite of models, not just their new ones, but versioning must be implemented to ensure compliance (and best practice). Whether you rely on manual control, use Git or turn to commercial solutions, one thing is clear: version control will have to be a priority for every data scientist, or they risk the wrath of the Information Commissioner and its hefty fine book.
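As an illustration of the kind of record that makes an older iteration retrievable, here is a minimal sketch in Python. It is not a prescribed method: the function names, the record fields and the sample bytes are all hypothetical, and a real project would more likely rely on Git or a dedicated tool. The idea is simply to fingerprint the training data and the model together, so that an audit can confirm which data produced which model.

```python
import hashlib
import json
from datetime import datetime, timezone


def fingerprint(payload: bytes) -> str:
    """Return a SHA-256 hex digest that identifies this exact content."""
    return hashlib.sha256(payload).hexdigest()


def make_version_record(training_data: bytes, model_artifact: bytes, note: str) -> dict:
    """Build an audit record linking one model version to its training data."""
    return {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "data_sha256": fingerprint(training_data),
        "model_sha256": fingerprint(model_artifact),
        "note": note,
    }


# Hypothetical stand-ins for a real training file and a serialised model.
training_data = b"customer_id,age,approved\n1,34,1\n2,51,0\n"
model_artifact = b"serialised-model-bytes"

record = make_version_record(training_data, model_artifact, "monthly retrain")
print(json.dumps(record, indent=2))
```

Storing a record like this alongside the archived data and model means that any later change to either one shows up as a different digest.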

3. Data as the new IP

Our theory of data becoming the new IP (you can read more about that here) holds that the training data is now often as important as the code when creating proprietary models. As the standard of open source packages rises and the price of computing resources drops, many more organisations are able to build high quality models without a large budget. The differentiator between models is often the volume and quality of training data available. This is true both for fast moving industries, where models are frequently retrained to adapt to new market conditions, and for slower, static industries, where data is sparse. As a result, we believe that training data is fast becoming the new intellectual property and a major source of competitive advantage: look at Amazon and Google.

4. Data bias

Automated model retraining is all well and good. The problem, however, is that human bias (the very thing that machine learning and algorithms are supposed to eliminate) can be passed on to a machine during training if the underlying data reflects that bias. In the financial industry, for example, biased data may produce results that breach the Equal Credit Opportunity Act (fair lending). As we've already mentioned, under GDPR consumers have the right to understand how a decision (in this case, perhaps a credit card rejection) has been arrived at, and if the data is biased this could be difficult to explain. We have seen a number of incidents where image recognition models have returned racist results due to skewed training data, and speech recognition is famously poor at recognising regional accents.
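One simple way to start looking for this kind of skew is to compare outcome rates between groups. The sketch below is illustrative only: the group labels and decisions are made up, the function names are hypothetical, and a low ratio is a prompt for further investigation rather than proof of unlawful bias.

```python
from collections import defaultdict


def approval_rates(decisions):
    """Return the share of approved applications for each group.

    `decisions` is an iterable of (group, approved) pairs.
    """
    totals = defaultdict(int)
    approved = defaultdict(int)
    for group, was_approved in decisions:
        totals[group] += 1
        if was_approved:
            approved[group] += 1
    return {group: approved[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Ratio of the lowest approval rate to the highest (1.0 means parity)."""
    highest = max(rates.values())
    if highest == 0:
        return 1.0  # nobody was approved in any group, so the rates are equal
    return min(rates.values()) / highest


# Made-up credit decisions for two hypothetical groups, "A" and "B".
sample = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

rates = approval_rates(sample)
ratio = rate_ratio(rates)
print(rates, round(ratio, 2))
```

Running the same check on both the training data and the model's own decisions can help show whether a model has inherited a skew from its inputs.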

5. Data aggregation

Under GDPR, customer data must be aggregated to a specific group size to ensure anonymity. While this may seem restrictive, we believe it is an opportunity to think more creatively about how models are built, what outcomes they produce and how they benefit the consumer. Innovative techniques in feature generation and clustering mean that you can establish previously unseen patterns in data. Rather than dragging your heels and merely complying with GDPR, seize the opportunity to reset your thinking and approach problems in a new, customer centric way.
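To make the idea of a minimum group size concrete, here is a small sketch that averages a value per group and suppresses any group with too few members. The threshold of 3, the field names and the sample records are assumptions chosen purely for illustration. The appropriate group size depends on your own legal and privacy guidance, and suppressing small groups does not by itself guarantee anonymity.

```python
from collections import defaultdict


def aggregate_with_min_group_size(records, key, value, min_size):
    """Average `value` per `key`, dropping groups smaller than `min_size`."""
    groups = defaultdict(list)
    for row in records:
        groups[row[key]].append(row[value])
    return {
        group: {"count": len(vals), "mean": sum(vals) / len(vals)}
        for group, vals in groups.items()
        if len(vals) >= min_size  # suppress groups that are too small to report
    }


# Made-up customer records; "region" and "spend" are hypothetical fields.
customers = [
    {"region": "North", "spend": 120.0},
    {"region": "North", "spend": 80.0},
    {"region": "North", "spend": 100.0},
    {"region": "South", "spend": 300.0},
]

summary = aggregate_with_min_group_size(customers, "region", "spend", min_size=3)
print(summary)
```

Here the single customer in "South" is left out of the summary, because reporting an average for a group of one would reveal that individual's spend.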

Data science is at an extremely exciting stage of its development. Every day, new breakthroughs are being made in what the discipline makes possible. However, at the centre of these advancements must be an appreciation of data privacy and a recognition of the responsibility that data scientists have to the consumers whose data they use to create and train machine-learned applications.
