Editor’s Note: Countless companies fail to implement data management and advanced analytics properly — and that’s understandable, given the changing data landscape, its complexity, the rapidly increasing amount of data, and the accompanying integration challenges. In this piece from Data Management at Scale, Principal Architect Piethein Strengholt provides principles, observations, best practices, and patterns to overcome these challenges.
Advanced analytics focuses on projecting future trends, events, and behaviors. It is the most complex form of value creation because it relies on statistical models and newer technologies, such as machine learning and artificial intelligence.
While it is getting easier to train and develop accurate models, deploying them into production — especially at scale — is a major challenge. The reasons for this include:
- Models strongly depend on data pipelines. Working with an offline, manually prepared dataset from the data preparation environment is easy, but in production everything must be automated: data quality must be guaranteed, models must be automatically retrained and deployed, and human approval must be required to validate accuracy after a model has been retrained on fresh data.
- Many models are built in an isolated data-science sandbox environment, without scalability in mind. Different frameworks, languages, libraries pulled from the internet, and custom code are often all mixed and combined. In an organization with different teams and no proper “handover,” it is difficult to integrate everything in production.
- Managing models in production is a different ballgame because in production everything needs to be continuously monitored, evaluated, and audited.
With these challenges in mind, I want to lay out a reference architecture and share some basic principles, supported by a number of components that make the architecture manageable.
Standard Infrastructure for Automated Deployments
The first principle is to use preconfigured and isolated platforms for data science experiments, model development, and model testing. These must be exactly the same as the production platforms.
Although you can go for virtual machines, the most popular choice these days is containers.
The big benefit of containers is that a model will behave exactly the same in development, testing, and production. Alternatively, you can go serverless: popular cloud vendors offer low-code/no-code services that let you use machine learning without a lot of platform maintenance.
The next principle is that model runtime platforms are stateless: they don’t persist or hold data, although they can create temporary data or hold reference data.
Any data a model needs comes from an input folder, and any data it produces is written back to an output folder.
This is important because it means you can easily replace models without moving or migrating the data. This principle applies to all of the model serving patterns, which brings us to the next principle.
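The stateless principle can be sketched in a few lines of Python. Everything here is illustrative: `score_record` stands in for a real trained model, and the folder names are hypothetical. The point is that the runner keeps no state of its own, so the container or process serving it can be replaced at any time without touching the data.

```python
import csv
from pathlib import Path


def score_record(record: dict) -> dict:
    """Hypothetical model: flag records whose 'amount' exceeds a threshold."""
    record["prediction"] = "high" if float(record["amount"]) > 100.0 else "low"
    return record


def run_batch(input_dir: Path, output_dir: Path) -> None:
    """Stateless batch runner: all data comes from the input folder and is
    written back to the output folder; nothing is persisted in between."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for input_file in input_dir.glob("*.csv"):
        with input_file.open(newline="") as src:
            rows = [score_record(r) for r in csv.DictReader(src)]
        if not rows:
            continue
        with (output_dir / input_file.name).open("w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
```

Because the runner owns no data, swapping in a new model version is just a redeployment; the input and output folders stay exactly where they are.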
Prescripted and Configured Workbenches
Rather than having data scientists spend a lot of their time and effort setting up environments, I recommend developing and standardizing on data science workbenches: preconfigured software with standard technologies, such as languages and libraries, that allow data scientists to work efficiently.
Some large enterprises such as Uber are finding success with deployed versions of Jupyter Server, VSCode Server, and RStudio Server. Giving data scientists direct access to these tools, with preconfigured project folders for input data, output data, and code, will really help accelerate their work.
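A minimal sketch of what "preconfigured project folders" could look like in practice. The layout and the pinned dependency versions are assumptions, not a prescription; the idea is that every project a data scientist opens starts from the same standardized structure.

```python
from pathlib import Path

# Hypothetical standard layout: folders for input data, output data, and code.
PROJECT_LAYOUT = ["input", "output", "code"]


def scaffold_project(root: Path, name: str) -> Path:
    """Create a preconfigured project folder so every data scientist
    starts from the same standardized structure."""
    project = root / name
    for folder in PROJECT_LAYOUT:
        (project / folder).mkdir(parents=True, exist_ok=True)
    # Pin the workbench's standard libraries (versions are illustrative).
    (project / "requirements.txt").write_text(
        "pandas==2.2.0\nscikit-learn==1.4.0\n"
    )
    return project
```

A workbench such as a hosted Jupyter or RStudio server would run a scaffold like this when a new project is created, so environment setup never lands on the data scientist's desk.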
Standardize on Model Integration Patterns
The next step is to acknowledge there are different integration patterns on which you can standardize:
Model as batch: In this pattern, models consume inputs and produce outputs in mini-batches or batches. The input and output format is typically a set of files, such as CSV.
Model as stream: Models react to event streams, from which they can also generate and publish new events.
Model as an API: Models are deployed as a web service so other applications and processes can use them. To get predictions, the caller makes an API call.
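The stream pattern is the least intuitive of the three, so here is a minimal sketch. The model and event fields are hypothetical; in production the iterators would be backed by a real broker such as Kafka, but the shape of the contract is the same: consume events, publish new enriched events.

```python
from typing import Iterator


def score_event(event: dict) -> dict:
    """Hypothetical model applied to a single event."""
    return {"id": event["id"], "fraud": event["amount"] > 1000}


def model_as_stream(events: Iterator[dict]) -> Iterator[dict]:
    """Model-as-stream: react to each incoming event and emit a new
    event carrying the prediction."""
    for event in events:
        yield score_event(event)
```

The same `score_event` function could sit behind the batch runner or an API endpoint, which is exactly why standardizing on these integration patterns pays off.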
Depending on how much you want to standardize, you can link the different approaches to specific workbenches or languages. Processing large volumes of data in the batch pattern, for example, is preferably done with Spark.
For advanced analytics, a well-designed data pipeline is a prerequisite, so a large part of your focus should be on automation.
This is also the most difficult work: to be successful, you need to stitch everything together. For orchestrating data pipeline steps, I highly recommend Apache Airflow. For continuous delivery, look at GoCD; for continuous integration, consider Jenkins, CircleCI, or Bamboo.
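To make "stitching everything together" concrete, here is a framework-free sketch of what an orchestrator automates: named pipeline steps with declared dependencies, executed in order. The step names and bodies are toy placeholders; a tool like Airflow adds scheduling, retries, and monitoring on top of exactly this kind of dependency walk (and, unlike this minimal version, detects cycles).

```python
def run_pipeline(steps: dict, dependencies: dict) -> list:
    """Run each named step after its dependencies (a minimal DAG walk)."""
    executed = []

    def run(name: str) -> None:
        if name in executed:
            return
        for upstream in dependencies.get(name, []):
            run(upstream)
        steps[name]()
        executed.append(name)

    for name in steps:
        run(name)
    return executed


# Toy pipeline: ingest raw data, validate it, then train on the result.
results = {}
steps = {
    "ingest":   lambda: results.update(raw=[3, 1, 2]),
    "validate": lambda: results.update(clean=sorted(results["raw"])),
    "train":    lambda: results.update(model=sum(results["clean"])),
}
order = run_pipeline(steps, {"validate": ["ingest"], "train": ["validate"]})
```

Each step only runs once its upstream steps have finished, which is the core guarantee you need before layering automation such as retraining and redeployment on top.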
For source code repositories, the most popular option is any of the frameworks based on Git. You could also rely on cloud or service providers.
The big ones offer Machine Learning as a Service (MLaaS), integrated environments that work very well for both development and operationalizing in production. Databricks, Qubole, Azure Machine Learning, AWS SageMaker and Google AI Platform are popular options.
The last focus area for setting principles is productionizing models with metadata. You will want versioning to track which data have been used to train which models.
DVC is a popular open source version control system for machine learning projects. For versioning the model itself, consider serialization: store the serialized model as an artifact and version it with the same framework you use for versioning the data. Consider storing all of this metadata in a central code repository.
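A minimal sketch of serialization-based model versioning, under stated assumptions: the "model" here is a plain dictionary standing in for a real trained artifact, the hash-prefix version tag is one possible convention rather than a standard, and the `data_version` string is a hypothetical reference to a DVC revision.

```python
import hashlib
import pickle


def version_model(model, data_version: str) -> dict:
    """Serialize the model and derive a version tag from its bytes,
    recording which data snapshot it was trained on."""
    blob = pickle.dumps(model)
    return {
        "model_version": hashlib.sha256(blob).hexdigest()[:12],
        "data_version": data_version,
        "artifact_bytes": len(blob),
    }


# Trivial stand-in for a trained model (hypothetical):
model = {"weights": [0.4, 0.6], "intercept": -1.2}
metadata = version_model(model, data_version="dvc:rev-a1b2c3")
```

Because the version tag is derived from the serialized bytes, the same model always maps to the same tag, and linking it to the data version answers the "which data trained which model" question directly.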
Instead of versioning your data, you can also version your data pipelines, which allows you to regenerate the same data over and over again. This only makes sense, however, if you can guarantee that every pipeline run produces exactly the same outcome; deterministic pipelines are a prerequisite for this approach.
You should also capture information about the modeling frameworks, containers, and model techniques for every model.
For example, which models use Monte Carlo or Random Forest methods? You might also want to add a classification indicating whether a model is fully transparent in how it makes decisions or acts more as a black box. Regulators might ask what types of models are used, so it is important to be prepared for this.
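The per-model metadata described above could be captured in a simple record like the following. Every field name and example value is illustrative; the point is that questions such as "which models are black boxes?" or "which models use Random Forest?" become catalog queries instead of archaeology.

```python
from dataclasses import dataclass, asdict


@dataclass
class ModelRecord:
    """Per-model metadata so governance and regulator questions can be
    answered directly from a central catalog (fields are illustrative)."""
    name: str
    framework: str          # e.g. "scikit-learn"
    container_image: str    # image the model is deployed in
    technique: str          # e.g. "random_forest", "monte_carlo"
    transparency: str       # "transparent" or "black_box"


record = ModelRecord(
    name="churn-predictor",
    framework="scikit-learn",
    container_image="registry.example.com/models/churn:1.4",
    technique="random_forest",
    transparency="transparent",
)
```

Stored centrally alongside the data and model versions, records like this give you the audit trail that production model management requires.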
Let’s connect the dots to make a consistent, repeatable, and reliable process for building and automatically deploying analytical models.
Your architecture, as you learned, must support both batch and streaming data pipelines and contain the capabilities needed to develop models.
Finally, it must offer tooling to move a model through the stages of development and release, supported by logging, monitoring, and metadata capabilities to oversee everything. When everything is brought together, we get the outline as pictured in Figure 8–8.
This reference architecture is just an example of how an automated model development and management process can be built.
In this model, I have made the data science development environment and versioned training snapshot data part of the data consumer’s data environment.
This is because the requirements and context vary between consumers. The supporting capabilities, such as image repositories and catalogs, are positioned centrally for stricter governance and control.
Preconfigured images, for example, should be provided centrally because for scalability you want to standardize on the most-used frameworks and languages.
The architecture we have built throughout this article helps manage advanced analytics, such as machine learning, at scale.
You also learned that to achieve a faster time to value, it’s important to automate manual efforts such as versioning, monitoring, and deployment.
My views on data management and integration differ from others’. If you are curious, I encourage you to have a look at my book Data Management at Scale.
Piethein Strengholt is passionate about technology, innovation, and data. He likes to solve problems at scale and is very familiar with topics such as data management, data integration, and cloud.
Piethein is Chief Data Architect for a large enterprise, where he oversees data strategy and its impact on the organization. Prior to this role he worked as a strategy consultant, designing many architectures and participating in large data management programs, and as a freelance application developer. He lives in the Netherlands with his family.