Databricks has announced the latest generation of its machine learning offering with the launch of Databricks Machine Learning, a new data-native platform built on top of an open lakehouse architecture.
According to the company, the platform integrates new and existing ML capabilities on the lakehouse into a collaborative, purpose-built experience. It says this gives ML engineers everything they need to build, train, deploy, and manage ML models from experimentation to production, combining data and the full ML lifecycle.
The platform also includes two new capabilities:
- Databricks AutoML to augment the machine learning process by automating all the steps that data scientists would otherwise have to do manually, while still exposing enough control and transparency.
- Databricks Feature Store to improve discoverability, reuse, and governance of model features in a system integrated in the enterprise's data engineering platform.
"Many ML platforms fall short because they ignore key challenges in machine learning; they assume that data is available at high quality and ready for training," Databricks says.
"That requires data teams to stitch together solutions that are good at data but not AI, with others that are good at AI but not data.
"To complicate things further, the people responsible for data platforms and pipelines (data engineers) are different from those that train ML models (data scientists), which are different from those who deploy product applications (engineering teams who own business applications)," the company explains.
"As a result, solutions for ML need to bridge gaps between data and AI, the tooling required, and the people involved."
Built on an open lakehouse foundation, the platform ensures customers can easily work with any type of data, at any scale, for machine learning: from traditional structured tables, to unstructured data like videos and images, to streaming data from real-time applications and IoT sensors. It also lets them move quickly through the ML workflow to get more models to production faster.
AutoML has the potential to let data teams build ML models more quickly by automating much of the heavy lifting involved in the experimentation and training phases.
But enterprises that use AutoML tools today often struggle to get AutoML models into production. This happens because the tools provide no visibility into how they arrive at their final model, which makes it impossible to tune the model's performance or troubleshoot it when edge cases in the data lead to low-confidence predictions.
According to Databricks, the AutoML capabilities within the platform instead take a unique 'glass box' approach. Data teams can quickly produce trained models through either a UI or an API, and the platform also auto-generates the underlying experiments and notebooks with code, so data scientists can easily validate an unfamiliar data set or modify the generated ML project.
Data scientists have full transparency into how a model operates and can take control at any time. This transparency is critical in highly regulated environments and for collaboration with expert data scientists.
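To make the 'glass box' idea concrete, here is a minimal Python sketch. It is not Databricks AutoML and uses no Databricks API: the function `run_glass_box_search`, the threshold "models", and the toy data are all hypothetical. It only illustrates the principle described above, that every trial is recorded and the winning model is emitted as editable source code rather than as an opaque artifact.

```python
def run_glass_box_search(xs, ys, thresholds):
    """Toy model search (hypothetical): try each threshold classifier,
    keep a record of every trial, and emit source code for the best one."""
    trials = []
    for t in thresholds:
        preds = [1 if x >= t else 0 for x in xs]
        accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
        # Every trial is logged, not just the winner.
        trials.append({"threshold": t, "accuracy": accuracy})

    best = max(trials, key=lambda trial: trial["accuracy"])

    # The "generated notebook": plain source a data scientist can read and edit.
    source = (
        "def predict(x):\n"
        f"    return 1 if x >= {best['threshold']!r} else 0\n"
    )
    return trials, best, source


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
trials, best, source = run_glass_box_search(xs, ys, [1.5, 2.5, 3.5])

# The generated code can be inspected, modified, and then executed.
namespace = {}
exec(source, namespace)
print(source)
```

Because the search returns both the full trial log and the generated source, a reviewer can see why a model was chosen and change it by hand.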
Machine learning models are built using features, which are the attributes used by a model to make a prediction. To work most efficiently, data scientists need to be able to discover what features exist within their organisation, how they are built, and where they are used, rather than wasting significant time repeatedly reinventing features.
Additionally, feature code needs to be kept consistent across the several teams that participate in the ML workflow. Otherwise, model performance will drift apart between real-time and batch use cases, a problem called online/offline skew.
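Online/offline skew can be illustrated with a small Python sketch. The feature and function names below are hypothetical and the example is not drawn from Databricks. It assumes a batch team and a serving team each implement the "same" feature, and that one implementation omits a transformation.

```python
import math


def spend_feature_batch(amounts):
    # Batch (training) pipeline: log-scaled mean spend.
    return math.log1p(sum(amounts) / len(amounts))


def spend_feature_online(amounts):
    # Re-implemented separately for real-time serving: the log scaling
    # was left out, so the model sees a different value in production.
    return sum(amounts) / len(amounts)


def spend_feature_shared(amounts):
    # A single shared definition used by both the batch and online paths.
    return math.log1p(sum(amounts) / len(amounts))


amounts = [10.0, 20.0, 30.0]

# Two divergent implementations produce different feature values (skew).
skew = abs(spend_feature_batch(amounts) - spend_feature_online(amounts))

# One shared implementation gives identical values on both paths.
no_skew = abs(spend_feature_shared(amounts) - spend_feature_shared(amounts))

print(f"skew with duplicated code: {skew:.2f}")
print(f"skew with shared code:     {no_skew:.2f}")
```

A model trained on the batch values would receive inputs on a different scale at serving time, which is the drift the paragraph above describes.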
According to Databricks, its Feature Store is the first of its kind to be co-designed with a data and MLOps platform. It says tight integration with the popular open source frameworks Delta Lake and MLflow guarantees that data stored in the Feature Store is open, and that models trained with any ML framework can benefit from the integration of the Feature Store with the MLflow model format.
According to the company, the Feature Store eliminates online/offline skew by packaging feature store references with the model, so that the model itself can look up features from the Feature Store instead of requiring a client application to do so. As a result, features can be updated without any changes to the client application that sends requests to the model.
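The packaging idea can be sketched in a few lines of Python. This is an illustration only: `ToyFeatureStore` and `PackagedModel` are hypothetical classes, not the Databricks Feature Store or MLflow APIs. The sketch assumes the model carries a reference to the feature table and the feature names it needs, so the client sends only an entity key.

```python
class ToyFeatureStore:
    """Hypothetical in-memory store: table name -> entity key -> features."""

    def __init__(self):
        self._tables = {}

    def write(self, table, key, features):
        self._tables.setdefault(table, {})[key] = dict(features)

    def lookup(self, table, key, names):
        row = self._tables[table][key]
        return [row[name] for name in names]


class PackagedModel:
    """Hypothetical linear model packaged with its feature references."""

    def __init__(self, store, table, feature_names, weights):
        self.store = store
        self.table = table
        self.feature_names = feature_names
        self.weights = weights

    def predict(self, key):
        # The model, not the client, looks up its own features.
        values = self.store.lookup(self.table, key, self.feature_names)
        return sum(w * v for w, v in zip(self.weights, values))


store = ToyFeatureStore()
store.write("customer_features", "c1", {"avg_spend": 20.0, "visits": 3.0})
model = PackagedModel(
    store, "customer_features", ["avg_spend", "visits"], [0.5, 2.0]
)

before = model.predict("c1")

# The feature is refreshed in the store; the client call is unchanged.
store.write("customer_features", "c1", {"avg_spend": 40.0, "visits": 3.0})
after = model.predict("c1")

print(before, after)
```

The client calls `model.predict("c1")` both times; only the stored feature value changed, which is the decoupling the company describes.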
Databricks says the Feature Store knows exactly which models and endpoints consume any given feature, facilitating end-to-end lineage as well as safe decision-making on whether a feature can be updated or deleted.
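As a rough illustration of feature lineage, the Python sketch below tracks which consumers depend on each feature and uses that to decide whether a feature is safe to delete. `LineageRegistry` and all model, endpoint, and feature names are hypothetical, and the sketch makes no claim about how Databricks implements this internally.

```python
from collections import defaultdict


class LineageRegistry:
    """Hypothetical registry mapping each feature to its consumers."""

    def __init__(self):
        self._consumers = defaultdict(set)

    def register(self, consumer, features):
        # Record that a model or endpoint reads the given features.
        for feature in features:
            self._consumers[feature].add(consumer)

    def consumers_of(self, feature):
        return sorted(self._consumers.get(feature, set()))

    def can_delete(self, feature):
        # Safe to delete only if nothing consumes the feature.
        return not self._consumers.get(feature)


registry = LineageRegistry()
registry.register("churn_model_v1", ["avg_spend", "visits"])
registry.register("churn_endpoint", ["avg_spend"])

print(registry.consumers_of("avg_spend"))
print(registry.can_delete("avg_spend"))
print(registry.can_delete("unused_feature"))
```

Here `avg_spend` cannot be deleted safely because a model and an endpoint still read it, while a feature with no registered consumers can.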