Onehouse's new tool simplifies AI vector embeddings management

Fri, 23rd Aug 2024

Onehouse has introduced a new solution aimed at simplifying and enhancing the management of vector embeddings for generative AI applications.

The company has launched a vector embeddings generator as part of its managed ELT cloud service, designed to automate embeddings pipelines for efficient and scalable data processing on data lakehouses.

Vector embeddings, essential for applications like natural language processing (NLP), content generation, and intelligent search, can now be rapidly and cost-effectively generated on the data lakehouse using Onehouse's new tool. This new feature is expected to streamline the process, reducing the time and costs associated with building vector embeddings.

"As AI initiatives accelerate, there is a growing pain around managing data across numerous siloed vector databases to power RAG applications, leading to excessive costs and wasteful regeneration of vectors," noted a representative from Onehouse. The data lakehouse, with its open data formats and scalability on inexpensive cloud storage, is rapidly becoming the preferred platform for centralising and managing the extensive amounts of data crucial for AI models.

The new vector embeddings generator will automate the process of delivering data from various sources like streams, databases, and cloud storage to foundation models such as those from OpenAI and Voyage AI. These models will then return the embeddings to Onehouse, which will store them in highly optimised tables on the user's data lakehouse.
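The flow described above (read records from a source, send them to a model, store the returned embeddings in a table) can be sketched in a few lines. This is only an illustration under stated assumptions, not Onehouse's implementation: `toy_embed` is a hypothetical stand-in for a call to a foundation model, and a plain dictionary stands in for a lakehouse table.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

Vector = List[float]


def toy_embed(text: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for a foundation model call.

    Derives a deterministic vector from a hash of the text, so the
    sketch runs without any network access or API key.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def run_pipeline(
    records: Iterable[Dict[str, str]],
    embed: Callable[[str], Vector],
    table: Dict[str, Dict],
) -> int:
    """Embed each source record and upsert it into the table by id."""
    written = 0
    for record in records:
        table[record["id"]] = {
            "text": record["text"],
            "embedding": embed(record["text"]),
        }
        written += 1
    return written


# Assumed sample data; in practice rows would arrive from streams,
# databases, or cloud storage.
source_rows = [
    {"id": "doc-1", "text": "lakehouse tables"},
    {"id": "doc-2", "text": "vector search"},
]
embeddings_table: Dict[str, Dict] = {}
rows_written = run_pipeline(source_rows, toy_embed, embeddings_table)
```

Passing the embedding function in as a parameter keeps the pipeline independent of any one model provider, which mirrors the article's point that several providers can sit behind the same pipeline.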

Prashant Wason, Staff Software Engineer at Uber and a member of the Apache Hudi Project Management Committee, highlighted the significance of this development in the context of AI projects. "Data processing and storage are foundational for AI projects," he said. "Hudi, and lakehouses more broadly, should be a key part of this journey as companies build AI applications on their large datasets. The scale, openness, and extensible indexing that Hudi offers make this approach of bridging the lakehouse and operational vector databases a prime opportunity for value creation in the coming years."

Building the vector embeddings generator into Onehouse's platform is intended to let organisations store embeddings directly on the lakehouse, which provides update management, handling of late-arriving data, and concurrency control. This set-up is expected to scale to the data volumes required to power large-scale AI applications. As a result, organisations should find it easier to augment the data that AI models operate on, including audio, text, and images, in support of a wide array of AI use cases.

Additionally, the product integrates with vector databases for high-scale, low-latency serving of vectors in real-time scenarios. While the data lakehouse stores all of an organisation's vector embeddings and serves them in batches, hot vectors are dynamically moved to vector databases for real-time serving. This architecture promises advantages in scale, cost, and performance for building AI applications such as those based on large language models (LLMs) and intelligent search.
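A rough way to picture this split between a complete store and a small real-time tier is sketched below. It is a toy model under loud assumptions: the class name `TieredVectorStore`, the access-count promotion rule, and the in-memory dictionaries are all hypothetical, and the article does not say how Onehouse decides which vectors are hot.

```python
from collections import Counter
from typing import Dict, List

Vector = List[float]


class TieredVectorStore:
    """Toy model: a complete 'lakehouse' tier plus a small 'hot' tier."""

    def __init__(self, hot_capacity: int) -> None:
        self.lakehouse: Dict[str, Vector] = {}  # holds every vector
        self.hot: Dict[str, Vector] = {}        # stands in for a vector database
        self.hits: Counter = Counter()          # access counts per key
        self.hot_capacity = hot_capacity

    def put(self, key: str, vector: Vector) -> None:
        """Write to the complete tier and refresh the hot copy if present."""
        self.lakehouse[key] = vector
        if key in self.hot:
            self.hot[key] = vector

    def get(self, key: str) -> Vector:
        """Serve from the hot tier when possible, else from the lakehouse."""
        vector = self.hot.get(key)
        if vector is None:
            vector = self.lakehouse[key]  # raises KeyError for unknown keys
        self.hits[key] += 1
        return vector

    def rebalance(self) -> List[str]:
        """Assumed policy: promote the most frequently read keys."""
        hottest = [key for key, _ in self.hits.most_common(self.hot_capacity)]
        self.hot = {key: self.lakehouse[key] for key in hottest}
        return hottest


store = TieredVectorStore(hot_capacity=1)
store.put("a", [0.1, 0.2])
store.put("b", [0.3, 0.4])
store.put("c", [0.5, 0.6])
for _ in range(3):
    store.get("a")
store.get("b")
promoted = store.rebalance()
```

In this sketch the lakehouse tier always remains the full record, so a vector that falls out of the hot tier is still available for batch reads.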

Vinoth Chandar, CEO of Onehouse and creator of Apache Hudi, underscored the importance of managing data efficiently for the success of AI initiatives. "AI is going to be only as good as the data fed to it, so managing data for AI is going to be a key aspect of data platforms going forward," he stated.

Chandar added that Hudi's incremental processing capabilities extend to the creation and management of vector embeddings across massive data volumes. This gives both the open-source community and Onehouse customers competitive advantages, such as continuously updating vectors as data changes while reducing the costs of embedding generation and vector database loading.
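The cost saving from incremental processing comes from regenerating embeddings only for records that have changed. The sketch below illustrates that idea with a simple content-hash comparison. This is a swapped-in technique chosen for brevity, not a description of how Hudi tracks changes, and the names `incremental_embed` and `counting_embed` are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List

Vector = List[float]


def content_hash(text: str) -> str:
    """Fingerprint used to detect whether a record's text has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_embed(
    rows: Dict[str, str],
    table: Dict[str, Dict],
    embed: Callable[[str], Vector],
) -> List[str]:
    """Re-embed only new or changed rows; return the ids that were refreshed."""
    refreshed = []
    for row_id, text in rows.items():
        fingerprint = content_hash(text)
        existing = table.get(row_id)
        if existing is not None and existing["hash"] == fingerprint:
            continue  # unchanged: keep the stored embedding, skip the model call
        table[row_id] = {"hash": fingerprint, "embedding": embed(text)}
        refreshed.append(row_id)
    return refreshed


model_calls = 0


def counting_embed(text: str) -> Vector:
    """Hypothetical model stand-in that counts how often it is invoked."""
    global model_calls
    model_calls += 1
    return [float(len(text))]


table: Dict[str, Dict] = {}
first_run = incremental_embed({"1": "alpha", "2": "beta"}, table, counting_embed)
second_run = incremental_embed({"1": "alpha", "2": "beta v2"}, table, counting_embed)
```

On the second run only the changed record triggers a model call, which is the source of the savings the article attributes to incremental updates.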

A recent survey by MIT Technology Review and Databricks found that nearly three-quarters of organisations have adopted a lakehouse architecture, with 99 percent of respondents affirming that it helps them achieve their data and AI goals. This shift towards data lakehouses reflects a growing consensus on their effectiveness in managing and leveraging data for AI applications.
