Databricks Sets a New ETL Benchmark for 1 Billion Data Records

Leading enterprise data management and data lakehouse company Databricks has set a new ETL benchmark for pricing and performance. The company performed Extraction, Transformation and Loading (ETL) of 1 billion records into Enterprise Data Warehouse (EDW) dimensional models for less than $1. This was made possible by pairing conventional ETL techniques with Delta Live Tables, an easy-to-scale platform for building production-ready streaming or batch ETL pipelines in SQL and Python.

The $1 price for 1 billion data records will be a hard mark to chase, considering the deployment challenges that modern ETL pipelines bring to a CIO’s table. Databricks has demonstrated that large EDW models can be managed in ETL pipelines using Delta Live Tables (DLT), which is available on Microsoft Azure, AWS and Google Cloud Platform. With DLT, it is possible to “declaratively express entire data flows in SQL and Python,” with parameterization, testing and documentation handled alongside operational management and monitoring in the same environment. This significantly accelerates deployment and allows batches and streams to be handled seamlessly through a single API.
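As a minimal sketch of what that declarative style looks like in Python (the table names, storage path and columns below are illustrative assumptions, not details from Databricks’ benchmark):

```python
# A minimal sketch of a Delta Live Tables pipeline in Python. Table
# names, the /raw/orders path and the columns are illustrative only.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw orders ingested incrementally from cloud storage.")
def raw_orders():
    # Auto Loader (cloudFiles) picks up new files as they arrive; DLT
    # manages checkpointing and orchestration automatically. `spark` is
    # predefined in a DLT notebook.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/raw/orders")
    )

@dlt.table(comment="Orders with basic typing and filtering applied.")
def cleaned_orders():
    # Reading raw_orders declares a dependency, so DLT infers the
    # pipeline DAG without any hand-written orchestration code.
    return (
        dlt.read_stream("raw_orders")
        .where(col("order_total") > 0)
        .select("order_id", "customer_id", "order_total")
    )
```

Because dependencies are declared rather than scheduled by hand, the same definitions run as a batch or streaming pipeline without restructuring.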

Databricks used TPC-DI, the industry-standard benchmark for enterprise data integration (ETL), bringing DLT’s automatic orchestration into the picture.

Automatic Orchestration to Speed Up ETL

While ingesting 1 billion data records, Databricks did not lose sight of three key factors:

  • Data size
  • Data quality
  • ETL performance

Speed-wise, TPC-DI on DLT ran almost twice as fast as Databricks’ non-DLT benchmarks, suggesting that ETL pipelines could be sped up even further with automated orchestration. But will they deliver on accuracy? According to Gartner, data quality has emerged as a priority in data management and data analytics, giving organizations with mature data resources a clear competitive advantage over others. DLT simplifies data orchestration through automation, helping teams write SQL statements from which the ETL pipeline’s DAG is built without error. Moreover, using DLT for ETL pipelines let teams orchestrate parallel workflows across 35+ worker nodes, delivering consistent compute performance during the same streaming process. By running parallel tests against the same DLT pipeline, teams could optimize their work, reduce errors, and remove bugs and corrupt records more efficiently, as illustrated in the sketch below.
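That corrupt-record handling maps onto DLT’s expectations API. A hedged sketch follows, assuming hypothetical bronze/silver table and rule names rather than the actual TPC-DI schema:

```python
# A sketch of DLT expectations, the mechanism for flagging or dropping
# corrupt records. Table and rule names (bronze_trades, silver_trades)
# are assumptions for illustration, not the actual TPC-DI pipeline.
import dlt

@dlt.table(comment="Validated trade records for the dimensional model.")
@dlt.expect_or_drop("valid_trade_id", "trade_id IS NOT NULL")
@dlt.expect_or_drop("positive_quantity", "quantity > 0")
@dlt.expect("known_symbol", "symbol IS NOT NULL")  # violation logged, row kept
def silver_trades():
    # Rows failing an expect_or_drop rule are removed and counted in the
    # pipeline event log, making per-run quality metrics observable.
    return dlt.read_stream("bronze_trades")
```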

The data lakehouse company has been making rapid strides in the way enterprises handle data management pipelines across different cloud environments.

Here are the recent developments at Databricks that led to this new industry ETL benchmark.

Databricks Model Serving for AIOps and Machine Learning

Databricks Model Serving simplifies production machine learning (ML) natively within the Databricks Lakehouse Platform. Deep integration with the Lakehouse Platform offers data and model lineage, governance and monitoring throughout the ML lifecycle, from experimentation to training to production. Databricks Model Serving is now generally available on AWS and Azure.
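For a sense of what serving looks like in practice, here is a sketch of scoring a request against a Model Serving endpoint; the workspace URL, endpoint name and feature columns are placeholders, not details from the announcement:

```python
# A rough sketch of querying a Databricks Model Serving endpoint over
# REST. The workspace URL, endpoint name and feature columns are
# placeholders; DATABRICKS_TOKEN is assumed to hold a valid access token.
import os
import requests

WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"  # placeholder
ENDPOINT_NAME = "example-model"                                  # placeholder

response = requests.post(
    f"{WORKSPACE_URL}/serving-endpoints/{ENDPOINT_NAME}/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"dataframe_records": [{"feature_a": 1.0, "feature_b": 0.5}]},
)
response.raise_for_status()
print(response.json())  # model predictions returned as JSON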

Here’s how Delta Live Tables handles batch and streaming data with SQL and Python.
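A short Python sketch of the contrast, with illustrative paths and table names; only the read method changes, while the declarative definition stays the same:

```python
# A sketch contrasting batch and streaming tables in a DLT pipeline.
# Paths and table names are illustrative assumptions.
import dlt

@dlt.table  # batch: fully recomputed on each pipeline update
def daily_snapshot():
    return spark.read.format("delta").load("/raw/daily_snapshot")

@dlt.table  # streaming: processes only newly arrived records
def live_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/raw/events")
    )
```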
