Databricks is enhancing its data management capabilities with Delta Live Tables and Unity Catalog. The new features give data teams streamlined, reliable data pipelines and make it easier to discover and govern enterprise data assets across multiple clouds and data platforms.
The announcements, made at the Data + AI Summit, are intended to improve the reliability, governance, and scale of the Databricks lakehouse platform.
Let’s understand what Databricks is offering.
Delta Live Tables
First, the company revealed Delta Live Tables to simplify the development and management of reliable data pipelines on Delta Lake.
Delta Live Tables helps build the foundation of the lakehouse with reliable data pipelines. It is a cloud service in the Databricks platform that makes ETL (extract, transform, and load) easy and reliable on Delta Lake, helping ensure data is clean and consistent when used for analytics and machine learning.
Unity Catalog
The company also announced Unity Catalog, a new, unified data catalog that makes it easy to discover and govern all of an organization’s data assets, with a complete view of data across clouds and existing catalogs.
Unity Catalog is underpinned by Delta Sharing, a new open source protocol for secure data sharing also announced by Databricks today. This allows organizations to use Unity Catalog to manage secure data sharing with business partners and data exchanges as well, further emphasizing the flexibility provided by an open lakehouse platform.
How Does Poor Reliability Affect ETL Pipelines?
Today, building reliable ETL pipelines at scale is a difficult challenge for enterprises. Poor reliability leads to missing or incorrect data in business-critical systems, often resulting in costly errors for the organization.
The process of building pipelines is highly manual today, requiring very granular work to define both how data should be manipulated and how the accuracy of those manipulations should be tested. Also, as the number of pipelines grows in response to more and more data being gathered and used, managing and updating pipelines becomes a heavy operational burden.
How Does Delta Live Tables Work?
Delta Live Tables solves this challenge by abstracting away the low-level instructions, removing many potential sources of error. Instead of requiring data engineers to explain how every step of a pipeline should work, Delta Live Tables lets them specify only the outcomes the pipeline needs to achieve, using high-level languages like SQL.
Delta Live Tables then automatically creates the instructions for both the data transformations and the data validations, and implements uniform error handling. Managing pipelines at scale is improved through chained dependencies that automatically execute downstream changes when a table is modified.
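The declarative idea described above can be illustrated with a small, self-contained Python sketch. This is not the Delta Live Tables API: the `table` decorator, the `TABLES` registry, `run_pipeline`, and the sample sensor rows are all hypothetical names invented for illustration. The sketch only shows the general pattern of declaring each table's inputs and row-level expectations, then letting a runner work out the execution order and apply the validations uniformly.

```python
from graphlib import TopologicalSorter

# Hypothetical mini-framework for illustration only; NOT the Delta Live Tables API.
TABLES = {}  # table name -> (dependencies, build function, expectations)


def table(name, depends_on=(), expect=None):
    """Register a table by declaring its inputs and its row-level expectations."""
    def register(fn):
        TABLES[name] = (tuple(depends_on), fn, dict(expect or {}))
        return fn
    return register


@table("raw_readings")
def raw_readings():
    # Made-up sample data; the second row is deliberately invalid.
    return [
        {"sensor": "a", "temp_c": 21.5},
        {"sensor": "b", "temp_c": None},
        {"sensor": "c", "temp_c": 19.0},
    ]


@table(
    "clean_readings",
    depends_on=["raw_readings"],
    expect={"temp_present": lambda row: row["temp_c"] is not None},
)
def clean_readings(raw_readings):
    return raw_readings


@table("avg_temp", depends_on=["clean_readings"])
def avg_temp(clean_readings):
    temps = [row["temp_c"] for row in clean_readings]
    return [{"avg_temp_c": sum(temps) / len(temps)}]


def run_pipeline():
    """Build every table in dependency order, dropping rows that fail an expectation."""
    graph = {name: set(deps) for name, (deps, _, _) in TABLES.items()}
    results, dropped = {}, {}
    for name in TopologicalSorter(graph).static_order():
        deps, build, expectations = TABLES[name]
        rows = build(*(results[dep] for dep in deps))
        kept = [r for r in rows if all(check(r) for check in expectations.values())]
        dropped[name] = len(rows) - len(kept)
        results[name] = kept
    return results, dropped


results, dropped = run_pipeline()
print(results["avg_temp"])
print(dropped)
```

Nothing in the table definitions says when or in what order to run them. The runner derives that from the declared dependencies, which is the sense in which downstream tables follow automatically from upstream ones.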
Additionally, Delta Live Tables can restart pipelines to resolve transient errors. If the failure requires manual intervention, or if new business logic requires changes to the data, Delta Live Tables makes it easy for data engineering teams to pinpoint the source of the error, remediate it quickly, and then reprocess data from that point.
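As a rough illustration of the restart behavior described above, the following Python sketch retries a step when it hits a transient failure and otherwise reports the step where a person needs to step in. It is a generic retry pattern, not how Delta Live Tables is implemented; `TransientError`, `run_with_retries`, and the three sample steps are hypothetical names chosen for this example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_with_retries(steps, max_attempts=3, delay_seconds=0.0):
    """Run steps in order, retrying a step only when it raises TransientError.

    Returns (completed, failed_step). failed_step is None when every step
    succeeded; otherwise it names the step that needs manual attention.
    """
    completed = []
    for name, step in steps:
        for attempt in range(1, max_attempts + 1):
            try:
                step()
                completed.append(name)
                break
            except TransientError:
                if attempt == max_attempts:
                    return completed, name  # retries exhausted
                time.sleep(delay_seconds)
            except Exception:
                return completed, name  # non-transient: do not retry
    return completed, None


calls = {"load": 0}


def extract():
    pass


def flaky_load():
    # Fails twice with a transient error, then succeeds on the third call.
    calls["load"] += 1
    if calls["load"] < 3:
        raise TransientError("temporary network issue")


def broken_transform():
    raise ValueError("bad business logic")


completed, failed_step = run_with_retries(
    [("extract", extract), ("load", flaky_load), ("transform", broken_transform)]
)
print(completed, failed_step)
```

Because the runner returns the list of completed steps together with the name of the failing one, a later run could skip the completed steps and resume from the point of failure once the underlying problem is fixed.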
“At Shell, we are aggregating all of our sensor data into an integrated data store – working at the multi-trillion record scale. Delta Live Tables has helped our teams save time and effort in managing data at this scale. We have been focusing on continuously improving our AI engineering capability and have an Integrated Development Environment (IDE) with a graphical interface supporting our Extract Transform Load (ETL) work. With this capability augmenting the existing lakehouse architecture, Databricks are disrupting the ETL and data warehouse market which is important for companies like ours. We are excited to continue to work with Databricks as an innovation partner.”
Unity Catalog: Simplified governance of data and AI across multiple cloud platforms
Today, the vast majority of data within enterprises is flowing into cloud-based data lakes. But data lakes present significant governance challenges. First, cloud providers don’t offer fine-grained access controls. Privileges stop at the file level rather than extending to the contents of the file, making access an all-or-nothing proposition. The only way around this is to copy subsets of a file’s data into new files, and this proliferation of files is one of the major reasons why data lakes become data swamps.
With multi-cloud adoption on the rise, the problem gets even harder, because each cloud provider has a different set of APIs for managing access. Second, the world has moved beyond simply trying to govern well-structured data. Modern data assets take many forms, including dashboards, machine learning models, and unstructured data like video and images that legacy data governance solutions simply weren’t built to manage.
Unity Catalog addresses these challenges by offering one interface for fine-grained governance of all data assets, both structured and unstructured, across all cloud data lakes, making it easier for enterprises to unify their data on the Databricks Lakehouse Platform. Unity Catalog is based on industry-standard ANSI SQL to streamline implementation and standardize governance across clouds. Unity Catalog also integrates with existing data catalogs, allowing organizations to build on what they already have and establish a future-proof, centralized governance model without expensive migration costs. Already, strategic Databricks partners like Alation, Collibra, Immuta and Privacera have committed to contribute to an ecosystem of powerful integrations for Unity Catalog.
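To make the contrast between file-level and fine-grained access concrete, here is a minimal Python sketch of column-level grants held in a single catalog. It does not represent the Unity Catalog interface, which the announcement describes only as being based on ANSI SQL. The `Catalog` class, its `grant_select` and `select` methods, and the sample `patients` table are hypothetical and exist only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Hypothetical in-memory catalog for illustration; not the Unity Catalog API."""

    tables: dict = field(default_factory=dict)  # table name -> list of row dicts
    grants: dict = field(default_factory=dict)  # (principal, table) -> allowed columns

    def grant_select(self, principal, table, columns):
        """Allow a principal to read only the named columns of a table."""
        self.grants.setdefault((principal, table), set()).update(columns)

    def select(self, principal, table):
        """Return the table's rows, restricted to the columns the principal may see."""
        allowed = self.grants.get((principal, table))
        if not allowed:
            raise PermissionError(f"{principal} has no access to {table}")
        return [
            {col: value for col, value in row.items() if col in allowed}
            for row in self.tables[table]
        ]


catalog = Catalog()
catalog.tables["patients"] = [
    {"id": 1, "region": "EU", "diagnosis": "A"},
    {"id": 2, "region": "US", "diagnosis": "B"},
]

# Analysts may see ids and regions, but not diagnoses.
catalog.grant_select("analyst", "patients", ["id", "region"])
analyst_view = catalog.select("analyst", "patients")
print(analyst_view)
```

With file-level privileges, the analyst would either see every column or need a separate copy of the file with the `diagnosis` column stripped out. Here the restriction is applied when the data is read, so no extra copy is created.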