Data science methodologies draw heavily on machine learning, and both are closely tied to mathematics, statistics, algorithms and data wrangling. Data scientists build data models that must run in production environments, and most DevOps practices apply directly to production-oriented data science applications, yet these practices are typically overlooked in data science training.
Many organizations are not ready to invest in a data science platform, or they have small data science teams handling only basic operations. In these cases, rather than selecting and orchestrating a platform, companies can apply DevOps best practices to their data science teams. Many of the agile and DevOps paradigms used by software development teams can be applied to data science workflows, though with some significant adjustments.
DevOps encompasses infrastructure provisioning, configuration management, continuous integration and deployment, testing and monitoring. DevOps teams work closely with development teams to manage the application lifecycle efficiently.
Applying DevOps to Data Science
Data science teams add extra responsibilities to DevOps. Data engineering, a niche domain that deals with multifaceted pipelines for transforming data, demands close collaboration between data science teams and DevOps. Operators are also expected to provide highly available clusters of Apache Hadoop, Apache Kafka, Apache Spark and Apache Airflow to handle data extraction and transformation.
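Such pipelines typically follow an extract-transform-load (ETL) pattern. A minimal sketch in plain Python, standing in for tasks that an orchestrator such as Apache Airflow would schedule (the function names and sample records here are invented for illustration):

```python
# Minimal extract-transform-load (ETL) sketch. In production each step
# would be a separate task scheduled by an orchestrator such as Airflow.

def extract():
    # Stand-in for pulling raw records from a source system (e.g. Kafka).
    return [{"user": "a", "amount": "10.5"}, {"user": "b", "amount": "3.2"}]

def transform(records):
    # Clean and type-convert the raw records.
    return [{"user": r["user"], "amount": float(r["amount"])} for r in records]

def load(records, sink):
    # Append transformed records to a destination store (here, a list).
    sink.extend(records)
    return len(records)

warehouse = []
loaded = load(transform(extract()), warehouse)
print(loaded)  # number of records loaded
```

An orchestrator adds what this sketch lacks: scheduling, retries, dependency tracking between steps, and monitoring, which is precisely where the DevOps team's cluster expertise comes in.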
Data scientists explore transformed data to uncover insights and correlations. They embrace a diverse set of tools such as Jupyter Notebooks, Pandas, Tableau and Power BI to visualize data. DevOps teams are therefore expected to support data scientists by creating environments for data exploration and visualization.
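The first pass over such an environment is often simple summary statistics. A minimal sketch using only the Python standard library (tools like Pandas would normally do this at scale; the sample dataset is invented):

```python
import statistics

# Hypothetical transformed dataset: daily revenue figures.
revenue = [120.0, 135.5, 99.0, 150.25, 142.0]

# The kind of quick summary a data scientist computes before visualizing.
summary = {
    "count": len(revenue),
    "mean": statistics.mean(revenue),
    "stdev": statistics.stdev(revenue),
    "min": min(revenue),
    "max": max(revenue),
}
print(summary)
```

The DevOps team's job is to make sure an environment where this kind of exploration runs, with the right libraries and data access, is reproducible for every data scientist.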
Begin by Delivering Assistance to Data Scientists
Data scientists, like application developers, are primarily focused on solving problems: they are interested in configuring their tools but often have little interest in configuring infrastructure. They may also lack the experience and background that software developers have to fully configure their own development workflows. This gives DevOps engineers an opportunity to treat data scientists as customers, help them define their requirements, and take ownership of delivering solutions.
A DevOps engineer can also help select and standardize a development environment, whether a traditional workstation or a virtualized desktop. Replicating data scientists' applications and configurations in that environment is usually the first step for DevOps engineers. Afterward, they should review where data scientists store their code, how the code is versioned, and how it is packaged for deployment.
Many data scientists are relatively new to version control tools like Git; they may use a code repository but have not automated any integrations. Setting up continuous integration is therefore an important second step for DevOps engineers, as it establishes standards and eliminates some of the manual work involved in testing new algorithms.
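Concretely, a CI pipeline can run automated checks against model-preparation code on every commit. A hedged sketch, where the `normalize` function and its expected behavior are invented for illustration:

```python
# A small piece of model-preparation code that CI can test on every commit.
def normalize(values):
    """Scale values to the [0, 1] range; constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Assertions like these run automatically in the CI job (e.g. via pytest),
# so a regression in the preprocessing step is caught before deployment.
assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]
assert normalize([5.0, 5.0]) == [0.0, 0.0]
```

Once such checks run on every push, the manual verification burden on the data scientist drops, which is exactly the standardization benefit described above.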
Moreover, developing machine learning models is fundamentally different from traditional application development. Once a fully trained machine learning model is available, DevOps teams are expected to host it in a scalable environment. They can also take advantage of orchestration engines like Apache Mesos or Kubernetes to scale model deployment.
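In practice, "hosting a model" often means serializing the trained artifact and loading it inside a stateless service that the orchestrator can replicate. A minimal sketch using only the standard library (the toy linear model and its coefficients are invented; real teams would use a framework-specific serialization format):

```python
import pickle

class LinearModel:
    """Toy trained model: predict(x) = weight * x + bias."""
    def __init__(self, weight, bias):
        self.weight = weight
        self.bias = bias

    def predict(self, x):
        return self.weight * x + self.bias

# Training produces an artifact; serialize it for the serving environment.
artifact = pickle.dumps(LinearModel(weight=2.0, bias=1.0))

# Each serving replica (e.g. a container managed by Kubernetes or Mesos)
# deserializes the artifact at startup and answers prediction requests.
model = pickle.loads(artifact)
print(model.predict(3.0))  # 7.0
```

Because each replica is stateless apart from the loaded artifact, the orchestrator can scale the number of serving instances up or down without coordination between them.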