Data Jobs Monitoring detects and helps resolve job failures and latency spikes across data pipelines
Datadog, the monitoring and security platform for cloud applications, announced the general availability of Data Jobs Monitoring, a new product that helps data platform teams and data engineers detect problematic Spark and Databricks jobs anywhere in their data pipelines, remediate failed and long-running jobs faster, and optimize overprovisioned compute resources to reduce costs.
Data Jobs Monitoring immediately surfaces specific jobs that need optimization and reliability improvements, while enabling teams to drill down into job execution traces and correlate job telemetry with their cloud infrastructure for fast debugging.
“Data Jobs Monitoring enables my organization to centralize our data workloads in a single place—with the rest of our applications and infrastructure—which has dramatically improved our confidence in the platform we are scaling,” said Matt Camilli, Head of Engineering at Rhythm Energy. “As a result, my team is able to resolve our Databricks job failures 20% faster because of how easy it is to set up real-time alerting and find the root cause of the failing job.”
“When data pipelines fail, data quality is impacted, which can hurt stakeholder trust and slow down decision making. Long-running jobs can lead to spikes in cost, making it critical for teams to understand how to provision the optimal resources,” said Michael Whetten, VP of Product at Datadog. “Data Jobs Monitoring helps teams do just that by giving data platform engineers full visibility into their largest, most expensive jobs to help them improve data quality, optimize their pipelines and prioritize cost savings.”
Data Jobs Monitoring helps teams to:
- Detect job failures and latency spikes: Out-of-the-box alerts immediately notify teams when jobs have failed or are running beyond automatically detected baselines, so issues can be addressed before they negatively impact the end-user experience. Recommended filters surface the most important issues affecting job and cluster health so they can be prioritized (see the alert-configuration sketch after this list).
- Pinpoint and resolve erroneous jobs faster: Detailed trace views show teams exactly where a job failed in its execution flow so they have the full context for faster troubleshooting. Multiple job runs can be compared to one another to expedite root cause analysis and identify trends and changes in run duration, Spark performance metrics, cluster utilization and configuration.
- Identify opportunities for cost savings: Resource utilization and Spark application metrics help teams identify ways to lower compute costs for overprovisioned clusters and optimize inefficient job runs.
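For teams that prefer to codify alerting alongside the out-of-the-box monitors described above, the following is a minimal sketch of creating a Datadog metric alert with the `datadog` Python client. The metric name `spark.job.count`, the `status:failed` tag, and the `spark_cluster` grouping are illustrative assumptions, not the product's documented telemetry schema; the announcement itself does not specify metric names.

```python
# Minimal sketch (illustrative, not from the announcement): define a Datadog
# metric alert for failed Spark jobs using the `datadog` Python client.
from datadog import initialize, api

# Authenticate with your Datadog API and application keys.
initialize(api_key="<DD_API_KEY>", app_key="<DD_APP_KEY>")

# Create a multi-alert monitor that fires when any failed Spark jobs are
# reported in the last 5 minutes. The metric and tags below are assumptions;
# substitute the metrics your Spark/Databricks integration actually emits.
api.Monitor.create(
    type="metric alert",
    query="sum(last_5m):sum:spark.job.count{status:failed} by {spark_cluster} > 0",
    name="Spark job failures detected",
    message=(
        "A Spark job failed on cluster {{spark_cluster.name}}. "
        "Investigate the job run in Data Jobs Monitoring. @slack-data-platform"
    ),
    tags=["team:data-platform"],
)
```

In practice, Data Jobs Monitoring's recommended alerts cover this case without custom configuration; a monitor-as-code approach like the sketch above is mainly useful for versioning alert definitions alongside pipeline code.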