Vinoth Chandar, CEO and Founder of Onehouse, discusses recent technological advances in the data lakehouse space and the challenges and solutions for data interoperability and management in modern data architectures.
_______
Hi Vinoth, tell us about yourself and the story behind Onehouse?
I’ve been working in distributed systems and databases since I graduated from college. After studying at Anna University, India – and then the University of Texas at Austin – I joined the Oracle database server team, where I focused on streaming and data replication.
I then moved on to LinkedIn, where I worked on the Voldemort NoSQL key-value store; there I began to fully understand the benefits of moving away from expensive, proprietary databases. For data analytics, we were moving towards an open Hadoop data lake that enabled us to serve multiple query engines for a number of use cases from a single data repository.
I then joined Uber, where they faced a very similar challenge. But it was compounded by the need for real-time data during Uber’s hyper-growth phase. So I set out to build what would become Apache Hudi, significantly reducing Uber’s data infrastructure costs while providing a variety of teams with near real-time data, with all the attributes around consistency, updates, and more that you expect from a traditional database.
After that, I moved to Confluent, where I explored how to capture real-time events the same way you would store and analyze more traditional data sets. During this time, I also continued to watch the Apache Hudi community grow, and eventually decided that a company was needed to unite all the open lakehouse formats and pair them with an easy-to-use cloud service for open, fast data lakehouses.
I founded Onehouse to make this dream a reality.
We’d love to hear about your recent funding and how it will lead to an evolution in the product roadmap?
We just completed a $35M Series B funding round. This allows us to continue building out a robust product roadmap while growing our customer list.
One of the most game-changing pieces of technology we’ve delivered in recent months – aside from our core platform – is Apache XTable (incubating), which we open-sourced alongside Google and Microsoft a few months back. We’ve since donated it to the Apache Software Foundation, where it is incubating as what is probably the industry’s first mainstream effort on data interoperability.
XTable provides interoperability between Apache Hudi, Apache Iceberg, and Delta Lake, so you can ingest your data once, and then XTable will present it in whatever format your query engine expects. We fundamentally oppose vendor lock-in, and XTable delivers on our vision in leaps and bounds.
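To make that “write once, read in any format” idea concrete, here is a minimal, hypothetical PySpark sketch. The table path, column names, and option values are illustrative, and it assumes an XTable sync (run separately as a CLI or scheduled job) has already translated the Hudi metadata into Delta Lake metadata over the same data files; consult the Hudi and XTable documentation for the exact options your versions expect.

```python
from pyspark.sql import SparkSession

# Assumes a Spark session with the Hudi and Delta Lake bundles on the classpath
# (e.g. via --packages); package names and versions are illustrative.
spark = SparkSession.builder.appName("xtable-demo").getOrCreate()

table_path = "s3://my-bucket/lakehouse/trips"   # hypothetical location

# 1. Ingest once, as an Apache Hudi table.
trips = spark.createDataFrame(
    [("t1", "2024-06-01 10:00:00", 12.5), ("t2", "2024-06-01 10:05:00", 7.2)],
    ["trip_id", "ts", "fare"],
)
(trips.write.format("hudi")
    .option("hoodie.table.name", "trips")
    .option("hoodie.datasource.write.recordkey.field", "trip_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .mode("append")
    .save(table_path))

# 2. An XTable sync (not shown) generates Delta/Iceberg metadata
#    alongside the same Parquet files in the table's base path.

# 3. A Delta-speaking engine can then read the very same data files.
delta_view = spark.read.format("delta").load(table_path)
delta_view.show()
```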
Additionally, our two newest products – which we just announced concurrently with funding – include:
Table Optimizer: Table Optimizer delivers hands-off table optimization and management, with no maintenance required. Managing lakehouse tables in-house normally involves manual tasks like cleaning, clustering, compaction, and file-sizing, along with tuning numerous configurations such as frequency, size budgets, triggers, partition spread, parallelism, retention, and concurrency (a rough sketch of these knobs follows this list). Improper tuning can cause performance swings as large as 2x to 100x. Onehouse provides everything required for users’ pipelines, from core infrastructure to automatic optimizations. Users can automate services like compaction and clustering to improve query performance, write efficiency, and storage usage. In fact, automated clustering and compaction with Onehouse can accelerate queries by 2.5x to 10x and writes by 40%, while optimized table layouts and automated data cleaning reduce storage and processing costs, cutting compute costs by 4x.
LakeView (free): LakeView, Onehouse’s free lakehouse observability tool, enhances users’ data operations by providing critical insights into Apache Hudi tables. Users can access these insights and fine-tune data lakehouse operations using pre-built dashboards that track performance. Customized weekly reviews via email help manage data partitions and address data skew effectively. Installation is straightforward, requiring no changes to pipelines or permissions, as LakeView has no access to Parquet files and leaves no compute footprint. Proactive alerts assist with debugging, offering a searchable timeline of commits and table service runs. Additionally, users can stay informed with compaction backlog stats and alerts for Apache Hudi Merge-on-Read and Copy-on-Write tables.
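For a sense of the tuning burden Table Optimizer is meant to remove, here is an illustrative and deliberately incomplete subset of the Apache Hudi table-service settings a team would otherwise manage by hand. The values are placeholders rather than recommendations, and exact keys and defaults vary across Hudi versions.

```python
# An illustrative subset of Apache Hudi table-service tuning knobs that teams
# normally have to manage themselves; values are placeholders, not recommendations.
hudi_table_service_opts = {
    # Compaction (Merge-on-Read): how often delta logs are folded into base files
    "hoodie.compact.inline": "true",
    "hoodie.compact.inline.max.delta.commits": "5",

    # Clustering: co-locating related records to speed up queries
    "hoodie.clustering.inline": "true",
    "hoodie.clustering.inline.max.commits": "4",

    # File sizing: avoiding the small-file problem
    "hoodie.parquet.small.file.limit": str(100 * 1024 * 1024),  # ~100 MB
    "hoodie.parquet.max.file.size": str(120 * 1024 * 1024),     # ~120 MB

    # Cleaning / retention: how much table history to keep around
    "hoodie.cleaner.commits.retained": "10",
}

# These would be passed as writer options, e.g.:
# df.write.format("hudi").options(**hudi_table_service_opts)...
```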
What are some of the top challenges you see businesses face when it comes to making data open and interoperable?
To understand this, we must examine what makes data closed and siloed. Typically, data is first replicated into a data warehouse that stores it in proprietary formats, locked behind the warehouse’s own SQL/compute engines. Vendors tout open data lakehouse table formats as a panacea for making data truly open. However, lock-in points still exist around data catalogs and the compute services that operate on the data. Interoperability is largely dictated by vendor roadmaps, which slows innovation dramatically.
We believe in the importance of truly open data and are building a platform that is easy, fast, and certainly the most open. We employ best-of-breed solutions – Apache Hudi (for minute-level freshness and powerful incremental processing), Apache Iceberg (to unlock open data from cloud data warehouses) and Delta Lake (for integration into Databricks’ ecosystem) – while Apache XTable, with backing from almost all major vendors, ensures data interoperability. You can consume your data in any format, including Iceberg and Delta Lake, and use pretty much any downstream query engine – Databricks, Snowflake, and many, many more. Ultimately you get a plug-and-play data lakehouse that works with almost every tool in your data toolbox.
What should data teams keep in mind when deploying data lakehouses: factors that are often overlooked?
We believe that lakehouses are the future of open data architectures. However, building your own lakehouse can be challenging, particularly for small engineering teams. You have to design the architecture, build ETL pipelines, optimize your tables, and keep your lakehouse running optimally. This process is typically more involved than purchasing a commercial product like a cloud data warehouse, which is why some teams shy away from building their own lakehouse.
And that’s why we’re building Onehouse. We want to enable every organization to benefit from an open data architecture, and to use the same data infrastructure that is currently deployed by the most advanced data teams. Onehouse is delivered as a fully-managed cloud service, and makes it easy to deploy your own lakehouse.
At Onehouse, with a fully managed cloud-based data lakehouse, we eliminate all of that, so you can get started in hours or even minutes. And with our Universal Data Lakehouse architecture, you’ll be able to connect to almost any imaginable source and query a single copy of your data from any engine – Snowflake, Databricks, Starburst, Pinot, StarRocks and more.
How do you feel AI will lead to a shift in these processes and on the whole segment in the future?
The most important success factor in AI is the quality of data available to train models and generate outputs. We’re seeing this play out right now in the world of LLMs, where the quality of the models is starting to be constrained by the data available for model training. There’s only so much public data available for these models to be trained on, and we’ve already consumed most of it.
Organizations that want to build their own AI models are facing the same challenges. Their models will always be constrained by the data available for training. As a natural consequence, having a centralized data platform that can interoperate with all your AI tools and serve fresh, high quality data at low latency and high volume, will become increasingly important.
The data lakehouse is the perfect architecture to centralize an organization’s data. It’s cost-effective and open, can ingest data at low latency, scales efficiently, and interoperates with all downstream AI tools. From that perspective, it offers clear advantages over traditional cloud data warehouses.
On the data serving side, much of the excitement around AI has centered on what I would consider real-time use cases: you enter a prompt of some sort and immediately get your results or answer. But the architecture to support this is very expensive; witness NVIDIA’s stock price ballooning more than 200% over the past 12 months.
But what many organizations will start to find is that there are a number of use cases where responses don’t necessarily need to occur in real-time. If results take a second – or maybe even return with sub-second responses – that will be perfectly acceptable. This is where we see the universal data lakehouse coming in to serve as something of a staging ground for the vector databases that are currently all the rage, making AI much more affordable for a number of use cases.
So the data lakehouse is the perfect platform not just to build a single source for all of an organization’s data, but also to serve vectors cost-effectively when real-time isn’t a hard requirement.
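As a rough illustration of that staging-ground idea, the sketch below reads pre-computed embeddings from a lakehouse table’s Parquet files and answers a similarity query in batch, with no always-on vector database. The file path, column names, and brute-force search are hypothetical simplifications, not a description of Onehouse’s product.

```python
import numpy as np
import pyarrow.parquet as pq

# Hypothetical lakehouse table of document embeddings, stored as Parquet.
table = pq.read_table("warehouse/embeddings/part-00000.parquet",
                      columns=["doc_id", "embedding"])
doc_ids = table.column("doc_id").to_pylist()
vectors = np.array(table.column("embedding").to_pylist(), dtype=np.float32)

def top_k(query_vec: np.ndarray, k: int = 5):
    """Brute-force cosine similarity -- fine for batch / non-real-time use cases."""
    q = query_vec / np.linalg.norm(query_vec)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    best = np.argsort(-scores)[:k]
    return [(doc_ids[i], float(scores[i])) for i in best]

# Example: find documents similar to a query embedding produced elsewhere.
print(top_k(np.random.rand(vectors.shape[1]).astype(np.float32)))
```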
A few data optimization thoughts you’d leave us with before we wrap up?
I would like to touch upon an aspect that can significantly impact your cloud data costs: moving toward incremental data processing.
Nearly 50% of your data warehouse costs today come from expensive ELT/ETL processing used to ingest, process, and transform data before it is consumed by everything from BI to AI tools. While the consumption layer gets a lot of mindshare, this ELT/ETL transformation layer remains expensive and outdated, relying on old-school batch/bulk data processing that wastes massive amounts of warehouse compute. Moreover, with data silos, these expensive processes are repeated for each warehouse/lake engine in different formats, and the costs add up dramatically.
We have witnessed many organizations adopt a more incremental approach to their data pipelines with great success. Frameworks like Apache Hudi enable this by replacing bulk processing that runs every few hours with incremental data pipelines that deliver data in minutes. This means your data is ingested in near real-time, and your downstream ETLs read, write, and process orders of magnitude less data. You can write and process data once across all your warehouses and data lake engines, and then, with advancements like XTable, access that data everywhere.
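As a rough sketch of what “incremental” means in practice with Apache Hudi, the PySpark snippet below pulls only the records committed after a given instant, rather than rescanning the whole table. The path, instant time, and query are placeholders; the option names follow the Hudi Spark datasource but may differ slightly across versions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

table_path = "s3://my-bucket/lakehouse/trips"   # hypothetical Hudi table
last_processed_instant = "20240601120000000"    # checkpoint from the previous run

# Incremental query: read only the commits after the last processed instant.
changes = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", last_processed_instant)
    .load(table_path))

# Downstream ETL now transforms just the changed rows instead of the full table.
changes.createOrReplaceTempView("trip_changes")
daily_fares = spark.sql(
    "SELECT date(ts) AS day, sum(fare) AS total_fare "
    "FROM trip_changes GROUP BY date(ts)")
daily_fares.show()
```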
Vinoth Chandar is CEO and Founder at Onehouse.
Onehouse, a leader in open data lakehouse technology, empowers enterprises to deploy and manage a world-class data lakehouse in minutes on Apache Hudi, Apache Iceberg, and Delta Lake. Delivered as a fully-managed cloud service in your VPC, Onehouse offers high-performance ingestion pipelines for minute-level freshness and optimizes tables for maximum query performance. Thanks to its truly open data architecture, Onehouse eliminates platform, table format, and catalog lock-in, guarantees interoperability with virtually any processing engine, and ensures exceptional ELT and query performance for all your workloads.