Leading enterprise data management and data lakehouse company Databricks has set a new ETL benchmark for price and performance. The company performed Extraction, Transformation and Loading (ETL) of 1 billion records into Enterprise Data Warehouse (EDW) dimensional models for less than $1! This was made possible by combining conventional ETL techniques with Delta Live Tables, an easy-to-scale platform for building production-ready streaming or batch ETL pipelines in SQL and Python.
The $1 price for 1 billion records will be a hard one to chase, considering the deployment challenges that modern ETL pipelines bring to a CIO’s table. Databricks has shown that it is possible to manage large EDW models in ETL pipelines using Delta Live Tables (DLT). DLT is available on Microsoft Azure, AWS and Google Cloud Platform. With DLT, teams can “declaratively express entire data flows in SQL and Python,” and handle parameterization, testing and documentation alongside operational management and monitoring in the same environment. This significantly accelerates deployment and lets batches and streams be handled seamlessly in a single API.
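As a rough sketch of what that declarative style looks like, the Python snippet below defines a small DLT data flow with a parameterized source path. The table names, the source location and the `orders.source_path` configuration key are illustrative assumptions, not anything taken from Databricks' benchmark.

```python
import dlt
from pyspark.sql.functions import col

# Pipeline parameterization: "orders.source_path" would be set in the DLT
# pipeline configuration, so the same code can point at dev or prod data.
# (The key name and the default path are hypothetical.)
source_path = spark.conf.get("orders.source_path", "/mnt/raw/orders")

@dlt.table(comment="Raw orders loaded from cloud storage")
def raw_orders():
    return spark.read.format("json").load(source_path)

@dlt.table(comment="Orders filtered and typed for downstream EDW tables")
def orders_cleaned():
    # Referencing raw_orders via dlt.read() declares the dependency;
    # DLT infers the execution graph from these references.
    return (
        dlt.read("raw_orders")
        .where(col("order_amount") > 0)
        .select("order_id", "customer_id", "order_amount", "order_date")
    )
```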
Databricks used the TPC-DI benchmark for enterprise ETL, bringing DLT’s automatic orchestration into the picture.
Automatic Orchestration to Speed Up ETL
While ingesting 1 billion data records, Databricks did not lose sight of three key factors:
- Data size
- Data quality
- ETL performance
Speed-wise, TPC-DI on DLT ran almost twice as fast as Databricks’ non-DLT benchmarks, suggesting that ETL pipelines could be sped up even further with automated orchestration. But will they deliver on accuracy? According to Gartner, data quality has emerged as a priority in data management and data analytics, giving organizations with mature data resources a clear competitive advantage over others. DLT simplifies data orchestration through automation, helping data teams write the SQL statements that build ETL pipelines with zero errors. Moreover, using DLT for ETL pipelines let teams orchestrate parallel workflows across 35+ worker nodes, delivering consistent computing performance during the same streaming process. By running parallel tests on the same DLT pipeline, teams could optimize their work, reduce errors, and remove bugs and corrupt records more efficiently.
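DLT expresses this kind of data quality check declaratively through expectations. The sketch below, building on the earlier `orders_cleaned` example, shows how corrupt or incomplete records could be dropped or flagged before they reach a warehouse dimension table; the constraint names and columns are hypothetical.

```python
import dlt

@dlt.table(comment="Orders validated before loading into the EDW dimensional model")
# Records violating this expectation are dropped and counted in pipeline metrics.
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
# Records violating this expectation are kept but reported, so quality can be tracked.
@dlt.expect("recent_order", "order_date >= '2020-01-01'")
def orders_validated():
    return dlt.read("orders_cleaned")
```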
The data lakehouse company has been making rapid strides in the way enterprises handle data management pipelines across different cloud environments.
Here are the recent developments at Databricks that led to the new industry ETL benchmark.
Databricks Model Serving for AIOps and Machine Learning
Databricks Model Serving simplifies production machine learning (ML) natively within the Databricks Lakehouse Platform. Its deep integration with the Lakehouse Platform offers data and model lineage, governance and monitoring throughout the ML lifecycle, from experimentation to training to production. Databricks Model Serving is now generally available on AWS and Azure.
Here’s how Delta Live Tables work for batch and streaming data with SQL and Python
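As a minimal illustration of that idea, the Python pipeline below mixes a streaming table with a batch table in the same DLT pipeline; the table names, file paths and join columns are assumptions for the sketch, not the benchmark’s actual schema.

```python
import dlt
from pyspark.sql.functions import col

# Streaming table: new files are picked up incrementally on every update.
@dlt.table(comment="Clickstream events ingested as a stream via Auto Loader")
def events_stream():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events")  # hypothetical landing path
    )

# Batch table: the reference data is small and re-read in full on each update.
@dlt.table(comment="Customer reference data loaded as a batch")
def customers_batch():
    return spark.read.format("parquet").load("/mnt/raw/customers")

# Downstream table joining the stream with the batch reference data.
@dlt.table(comment="Events enriched with customer attributes")
def events_enriched():
    return (
        dlt.read_stream("events_stream")
        .join(dlt.read("customers_batch"), on="customer_id", how="left")
        .select("event_id", "customer_id", "country", col("ts").alias("event_time"))
    )
```

The same flow can be written entirely in SQL with `CREATE OR REFRESH STREAMING LIVE TABLE` and `CREATE OR REFRESH LIVE TABLE` statements, which is the style the article refers to when it mentions teams writing SQL statements to build pipelines.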