Exploring the Essentials of Machine Learning Operations (MLOps)
Chapter 1: Introduction to MLOps
In recent years, I have been involved in deploying Machine Learning systems within real-world settings for clients in the Consumer Packaged Goods (CPG) and Healthcare sectors. While deciphering business needs, designing models, and working with raw data present significant challenges, the most captivating aspect has been the process of industrializing these projects.
A recurring question is which practices to adopt to maintain a machine learning system in production. A recent study by Kreuzberger et al. [1] examines Machine Learning Operations (MLOps) in depth, detailing its foundational principles, architectural components, necessary roles, and potential system architectures. It also offers an academic perspective on best practices in the field.
Section 1.1: Core Principles of MLOps
To grasp the essence of MLOps and its objectives, it's vital to recognize several key principles that a project should adhere to:
- P1 - CI/CD Automation: This principle facilitates rapid building, testing, and deploying of code, enhancing overall team productivity.
- P2 - Workflow Orchestration: Essential for coordinating the various steps in an ML workflow, such as processing raw data, training/testing models, and deployment.
- P3 - Reproducibility: The ability to replicate past experiments (both code and models) is crucial.
- P4 - Versioning: Tracking versions of code, data, and models is imperative for reproducibility.
- P5 - Collaboration: Effective communication between technical teams and business stakeholders is essential to align on objectives and expectations.
- P6 - Continuous ML Training & Evaluation: The system should facilitate regular retraining and evaluation of models.
- P7 - ML Metadata Tracking and Logging: Each model should be associated with metadata, such as evaluation metrics and code versions, so that it can be managed effectively in production.
- P8 - Continuous Monitoring: Monitoring is vital to assess model performance and determine when retraining or new models are needed.
- P9 - Feedback Loops: The iterative nature of the process allows for continuous improvement based on feedback.
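To make Reproducibility (P3), Versioning (P4), and Metadata Tracking (P7) more concrete, here is a minimal sketch of a run record that ties a training run to the exact code and data it used. It is only an illustration under my own assumptions: the names `fingerprint` and `RunRecord`, the commit hash, and the metric value are hypothetical and do not come from Kreuzberger et al.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


def fingerprint(payload: bytes) -> str:
    """Return a short, deterministic content hash used as a version id."""
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass(frozen=True)
class RunRecord:
    """Metadata that links one training run to the inputs it used."""
    code_version: str  # e.g. a Git commit hash (hypothetical value below)
    data_version: str  # content hash of the training data
    params: dict       # hyperparameters used for the run
    metrics: dict      # evaluation metrics produced by the run

    def to_json(self) -> str:
        # sort_keys keeps the serialized record stable across runs
        return json.dumps(asdict(self), sort_keys=True)


# Toy training data; in a real project this would be a file or a table.
raw_data = b"store_id,units_sold\n1,20\n2,35\n"

record = RunRecord(
    code_version="abc1234",               # hypothetical commit hash
    data_version=fingerprint(raw_data),   # changes whenever the data changes
    params={"learning_rate": 0.1},
    metrics={"rmse": 4.2},                # illustrative number, not a real result
)
print(record.to_json())
```

Because the data version is a content hash, any change to the training data produces a different identifier, which is what makes a past experiment traceable.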
Subsection 1.1.1: Technical Components Supporting MLOps Principles
Each principle is supported by specific technical components:
- C1 - CI/CD Component: Enables continuous integration and delivery while supporting ongoing ML training and evaluation.
- C2 - Source Code Repository: Essential for collaboration and versioning.
- C3 - Workflow Orchestration Component: Facilitates pipeline orchestration, ensuring reproducibility and continuous training.
- C4 - Feature Store System: Consists of an offline database that supplies features for training and an online database that supplies features for production predictions.
- C5 - Model Training Infrastructure: Necessary resources (CPUs, GPUs) for continual model training and evaluation.
- C6 - Model Registry: Stores trained models and their associated metadata, aiding in deployment and distribution.
- C7 - ML Metadata Stores: Manages the diverse metadata generated from various components.
- C8 - Model Serving Component: Responds to requests, typically via REST API, and should be scalable.
- C9 - Monitoring Component: Tracks model performance, enabling feedback loops.
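As an illustration of the Feature Store System (C4), the sketch below keeps the full feature history for training (offline) and only the latest values per entity for predictions (online). It is a toy in-memory version under my own assumptions: the class `InMemoryFeatureStore` and its method names are hypothetical, and a real feature store would sit on top of actual databases.

```python
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy feature store: offline side keeps history, online side keeps the latest row."""

    def __init__(self) -> None:
        self._offline = defaultdict(list)  # entity_id -> [(timestamp, features), ...]
        self._online = {}                  # entity_id -> (timestamp, features)

    def write(self, entity_id: str, timestamp: int, features: dict) -> None:
        """Append to the offline history and refresh the online row if newer."""
        self._offline[entity_id].append((timestamp, dict(features)))
        current = self._online.get(entity_id)
        if current is None or timestamp >= current[0]:
            self._online[entity_id] = (timestamp, dict(features))

    def get_online(self, entity_id: str) -> dict:
        """Latest features for one entity, as a serving component would request them."""
        return dict(self._online[entity_id][1])

    def get_offline(self, entity_id: str, as_of: int) -> list:
        """Feature rows up to `as_of`, oldest first, as a training job would request them."""
        rows = sorted(self._offline.get(entity_id, []), key=lambda row: row[0])
        return [features for ts, features in rows if ts <= as_of]


store = InMemoryFeatureStore()
store.write("store-1", timestamp=1, features={"avg_units": 10.0})
store.write("store-1", timestamp=3, features={"avg_units": 14.0})
store.write("store-1", timestamp=2, features={"avg_units": 12.0})  # late-arriving row

print(store.get_online("store-1"))
print(store.get_offline("store-1", as_of=2))
```

The `as_of` argument is there so that a training job only sees features that existed at a given point in time, which avoids leaking later information into the training set.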
Section 1.2: Roles in an MLOps Project
An MLOps project can be complex to implement, but clearly delineated engineering roles support agile ways of working, with rapid experimentation and iteration. Key roles include:
- R1 - Business Stakeholder: Defines the business goals and handles communication with the rest of the business.
- R2 - Solution Architect: Determines the technologies to be employed.
- R3 - Data Scientist: Converts business requirements into analytical needs and develops ML models.
- R4 - Data Engineer: Constructs and manages data and feature pipelines.
- R5 - Software Engineer: Applies best practices in software design to the overall project.
- R6 - DevOps Engineer: Builds and maintains pipelines, ensuring effective CI/CD automation and workflow orchestration.
- R7 - ML / MLOps Engineer: A cross-functional role that automates ML infrastructure, workflows, and model deployment.
The subsequent visual summarizes these roles and their interactions.
Chapter 2: MLOps Architecture and Workflow
Kreuzberger et al. propose a general, technology-agnostic architecture for MLOps. Its workflow aligns with the Team Data Science Process (TDSP) and comprises several key steps.
A) Initiating an MLOps Project
In the initiation phase, also referred to as Business Understanding in TDSP, the Business Stakeholder defines the project goals. The Solution Architect identifies suitable technologies, while the Data Scientist collaborates with the Product Owner and others to clarify the business problem and data availability.
B) Feature Engineering Pipeline
Kreuzberger et al. describe a Feature Engineering Pipeline in which Data Engineers and Data Scientists work together to identify features and to ingest, preprocess, and transform data. This mirrors the Data Acquisition and Understanding phase of TDSP.
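The ingest, preprocess, and transform steps can be sketched as three small functions chained together. This is a simplified example under my own assumptions: the column names (`store`, `units`, `price`), the derived `revenue` feature, and the inline CSV are hypothetical stand-ins for a real data source.

```python
import csv
import io


def ingest(csv_text: str) -> list[dict]:
    """Read raw rows from a CSV source (here an in-memory string)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def preprocess(rows: list[dict]) -> list[dict]:
    """Drop rows with missing values and cast the remaining fields to numbers."""
    clean = []
    for row in rows:
        if not row["units"] or not row["price"]:
            continue  # skip incomplete records
        clean.append({
            "store": row["store"],
            "units": int(row["units"]),
            "price": float(row["price"]),
        })
    return clean


def transform(rows: list[dict]) -> list[dict]:
    """Derive a `revenue` feature from the cleaned columns."""
    return [{**row, "revenue": row["units"] * row["price"]} for row in rows]


# Hypothetical raw extract; store B has a missing `units` value.
RAW = "store,units,price\nA,10,2.5\nB,,3.0\nC,4,1.5\n"

features = transform(preprocess(ingest(RAW)))
for row in features:
    print(row)
```

Keeping each step as a separate function makes it easier to test the steps in isolation and to hand them to an orchestration component later.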
C) Experimentation
During this phase, the Data Scientist analyzes and prepares the data, develops a model, and eventually exports it to the model registry.
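A minimal sketch of this phase, assuming a toy one-feature linear model and an in-memory registry, could look as follows. The names `LinearModel`, `fit`, `ModelRegistry`, and `demand-forecast` are hypothetical, and serializing the model as JSON is my own simplification rather than how any particular registry stores models.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LinearModel:
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept


def fit(xs: list[float], ys: list[float]) -> LinearModel:
    """Ordinary least squares for a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)


class ModelRegistry:
    """Toy registry: stores serialized models plus metadata under version numbers."""

    def __init__(self) -> None:
        self._versions = {}  # model name -> [(artifact, metadata), ...]

    def register(self, name: str, model: LinearModel, metadata: dict) -> int:
        entries = self._versions.setdefault(name, [])
        entries.append((json.dumps(asdict(model)), dict(metadata)))
        return len(entries)  # version numbers start at 1

    def load(self, name: str, version: int):
        artifact, metadata = self._versions[name][version - 1]
        return LinearModel(**json.loads(artifact)), metadata


# Toy data following y = 2x + 1.
xs, ys = [1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0]
model = fit(xs, ys)
mae = sum(abs(model.predict(x) - y) for x, y in zip(xs, ys)) / len(xs)

registry = ModelRegistry()
version = registry.register("demand-forecast", model, {"mae": mae})
restored, metadata = registry.load("demand-forecast", version)
print(version, restored, metadata)
```

Storing the evaluation metric next to the model is what later allows the deployment and monitoring steps to compare versions.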
D) Deployment
Once a model is trained, it is deployed to the serving layer.
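Steps A to D depend on one another, which is what a workflow orchestration component has to encode. The sketch below expresses them as a small dependency graph and runs them in a valid order using `graphlib` from the Python standard library (Python 3.9+). The step names and the `run_workflow` helper are my own illustrative choices, and each step is a placeholder that only records that it ran.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
DEPENDENCIES = {
    "project_initiation": set(),
    "feature_engineering": {"project_initiation"},
    "experimentation": {"feature_engineering"},
    "deployment": {"experimentation"},
}


def run_workflow(steps: dict, dependencies: dict) -> list[str]:
    """Run every step once, respecting the declared dependencies."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        steps[name]()  # call the step's function
        executed.append(name)
    return executed


# Placeholder steps: each one only logs its own name.
log = []
STEPS = {name: (lambda n=name: log.append(n)) for name in DEPENDENCIES}

order = run_workflow(STEPS, DEPENDENCIES)
print(order)
```

Dedicated orchestration tools add scheduling, retries, and logging on top of this idea, but the core remains a graph of steps with explicit dependencies.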
Putting it all together, we can visualize how these components interact within the architecture.
As the project progresses, data sources are shared with the Data Scientist. Connecting to raw data can be challenging because storage systems vary, which makes it a crucial step in the Data Ingestion / Feature Engineering Pipeline.
Once processed, data is stored in a feature store system for offline training and online predictions. Model updates can be triggered by events or by a retraining schedule, guided by the monitoring component, which assesses model performance in production.
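The two triggers can be combined in a single decision function. The following is a minimal sketch under my own assumptions: `RetrainPolicy`, `should_retrain`, and the threshold values are hypothetical, and the monitored error is assumed to be a single number where higher means worse.

```python
from dataclasses import dataclass


@dataclass
class RetrainPolicy:
    max_error: float   # event-based: retrain when the monitored error exceeds this
    max_age_days: int  # scheduled: retrain when the model is at least this old


def should_retrain(policy: RetrainPolicy, current_error: float, model_age_days: int) -> tuple[bool, str]:
    """Return the retraining decision together with the reason for it."""
    if current_error > policy.max_error:
        return True, "performance_degraded"  # event reported by monitoring
    if model_age_days >= policy.max_age_days:
        return True, "schedule_elapsed"      # regular retraining is due
    return False, "healthy"


policy = RetrainPolicy(max_error=5.0, max_age_days=30)  # illustrative thresholds

print(should_retrain(policy, current_error=6.3, model_age_days=4))
print(should_retrain(policy, current_error=3.1, model_age_days=45))
print(should_retrain(policy, current_error=3.1, model_age_days=4))
```

Returning the reason alongside the decision keeps a trace of why a retraining run was started, which fits the metadata tracking principle.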
Conclusion
The study by Kreuzberger et al. highlights the importance of understanding how Machine Learning models operate in real-world settings. It gives a comprehensive overview of what a machine learning system should entail and what MLOps encompasses, which makes it a useful starting toolkit for improving the success of Machine Learning projects. MLOps alone does not guarantee success, however: many other factors can cause a project to fail, and the organizational changes required to adapt processes can be daunting.
Main Reference:
[1] Dominik Kreuzberger, Niklas Kühl, Sebastian Hirschl, "Machine Learning Operations (MLOps): Overview, Definition, and Architecture"
Further Readings on Leading Cloud Providers:
For those interested in kickstarting an MLOps project, Microsoft offers accelerators to assist in the process.
Video: A Primer on Machine Learning Operations (MLOps) - YouTube
This video delves deeper into the principles and practices surrounding MLOps, offering insights into its implementation.