
There’s Not Enough Data!

Tackling productivity to feed GenAI

There just isn’t enough data! The companies with pockets deep enough to train the foundational “frontier” models powering the AI revolution are running into a somewhat unexpected problem. For a planet that creates data at a rate of roughly 420 million terabytes per day, there simply isn’t enough suitably curated, high-quality data to feed the ever-evolving models.

The problem has become so acute that the latest thinking is to create synthetic data to train AI models, an approach that could well be fraught with danger and potential inaccuracy.


This lack of high-quality, curated data is not a problem unique to the tech giants’ Large Language Models (LLMs). Ask any business or organization whether they have enough data and, more often than not, they’ll say no. Again, the challenge isn’t a lack of data; it is a lack of the right quality of data.

What has become very clear is that LLMs in their current state can be deployed to help us manage data. In fact, they are ready right now to be applied to one of the greatest perennial challenges in data management: Overcoming the productivity constraints of data engineers.

For years, data engineering has been an underserved and underappreciated profession. It requires combining a large array of complex technologies, deeply understanding the needs of the business, and being able to work with fast-moving production data, often under the pressure of critical data governance requirements. Add to that the exponential growth in the data businesses collect, and it’s unsurprising that data engineers are difficult to recruit and retain.

This problem is about to get worse.

All businesses will need to compete with AI. The productivity gains AI can bring to nearly all business processes are coming into increasingly sharp focus. We’re already seeing 50% productivity increases from internal AI use cases within technical teams. Imagine how much more you and your teams could achieve with that level of productivity. Imagine how much more your competitors could achieve with it.

But AI eats data, and the more it eats, the better it works. The best data for feeding the AI use cases that can make your business successful is the private data you already hold. Unfortunately, that’s not the core metrics and KPIs on which management reporting is currently built. Much of the best data lies in the huge mass of messy, unstructured data that businesses generate daily.

Being able to tap into that unstructured data – the call recordings, the PDFs – is invaluable.
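As a rough illustration, here is a minimal sketch of the first step in tapping a pile of PDFs: extracting their raw text so it can be cleaned, chunked, and fed to a model downstream. It assumes the open-source pypdf library, and the folder name is hypothetical.

```python
# A minimal sketch: extract raw text from a folder of PDFs so it can be
# cleaned, chunked, and fed to a model downstream. Assumes the open-source
# pypdf library (pip install pypdf); the folder path is hypothetical.
from pathlib import Path

from pypdf import PdfReader


def extract_pdf_text(folder: str) -> dict[str, str]:
    """Return {filename: full_text} for every PDF in `folder`."""
    documents = {}
    for pdf_path in Path(folder).glob("*.pdf"):
        reader = PdfReader(pdf_path)
        # Concatenate per-page text; a real pipeline would also handle
        # scanned pages (OCR), odd encodings, and extraction failures.
        documents[pdf_path.name] = "\n".join(
            page.extract_text() or "" for page in reader.pages
        )
    return documents


if __name__ == "__main__":
    docs = extract_pdf_text("./contracts")  # hypothetical folder
    print(f"Extracted {len(docs)} documents")
```

Call recordings follow the same shape, with a transcription step in place of text extraction.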

At Matillion, we are seeing right now that forward-thinking businesses are figuring out how to augment what they do with AI. These experimental use cases, fed with ad-hoc data, have delivered the demo and the proof of concept. Now, in mid-2024, it’s time to move these concepts to production. That, of course, means the AI needs to be constantly fed with fresh data in production, which in turn requires properly managed, audited, production-ready data pipelines.
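“Production-ready” will mean different things in different stacks, but the core pattern is incremental, audited loads. Here is a minimal sketch of a watermark-based incremental step using only the Python standard library; the table and column names are hypothetical.

```python
# A minimal sketch of an audited, incremental pipeline step: pull only rows
# newer than the last high-water mark and record what was loaded. Uses only
# the standard library; the source_events and ai_feed tables are hypothetical.
import sqlite3
from datetime import datetime, timezone


def incremental_load(conn: sqlite3.Connection) -> int:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log "
        "(run_at TEXT, rows_loaded INTEGER, watermark TEXT)"
    )
    # Last successfully loaded timestamp; epoch if this is the first run.
    row = conn.execute("SELECT MAX(watermark) FROM audit_log").fetchone()
    watermark = row[0] or "1970-01-01T00:00:00"

    fresh = conn.execute(
        "SELECT id, payload, updated_at FROM source_events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()

    if fresh:
        conn.executemany(
            "INSERT INTO ai_feed (id, payload, updated_at) VALUES (?, ?, ?)",
            fresh,
        )
        watermark = fresh[-1][2]

    # Every run is audited, even empty ones, so gaps stay visible.
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), len(fresh), watermark),
    )
    conn.commit()
    return len(fresh)
```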


A new task for the already burdened data engineers.

So a perennial problem is about to become an acute one.

How do we tackle this? Now, more than ever, is the time for CIOs to focus on the productivity and scalability of the data teams in the business. What steps can be taken to prepare for the coming storm?

  • People
  • Tooling
  • AI-augmented processes

If we take it right back to the core challenge: there are not enough data engineers, and the ones we have are already overburdened in most businesses. Whilst media headlines and doom-mongers may say that this is where AI will replace jobs, I disagree. This is where AI can step in to support data engineers, taking away the busywork that overburdens them so they can deliver what they do best.

The real skill of a great data engineer is not SQL ability itself but how they apply it to the data in front of them to sniff out the anomalies, the quality issues, the missing pieces, and the historical mishaps that must be navigated to get to some semblance of accuracy. For this reason, data engineers need three increasingly hard-to-master skills.

  1. The simplest of the three: they need to be able to construct data queries, both for discovering and understanding their data and for transforming it into a form they can use (a small example follows this list).
  2. They need a deep understanding of the technology at their disposal, everything from the way change data capture is set up on Oracle, to the record-ordering guarantees in Kafka, to the most efficient way to partition data in an Iceberg table, and everything that glues all of that together.
  3. The skill that really needs experience, however, is an intuitive nose for data quality: past, present, and future. How was a dataset built? How has it evolved over time? What mistakes have been made in the past, and what mistakes are likely to be made in the future?
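On that first skill, discovery queries are often simple but revealing. Below is a minimal sketch of the kind of profiling query an engineer might run before trusting a dataset; SQLite is used only so the snippet is self-contained, and the `orders` table and its columns are hypothetical.

```python
# A sketch of the first skill: a profiling query to discover and understand
# a dataset before transforming it. SQLite keeps the snippet self-contained;
# the `orders` table, its columns, and the database file are hypothetical.
import sqlite3

PROFILE_SQL = """
SELECT
    COUNT(*)                                        AS total_rows,
    COUNT(DISTINCT customer_id)                     AS distinct_customers,
    SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) AS null_amounts,
    MIN(order_date)                                 AS earliest_order,
    MAX(order_date)                                 AS latest_order
FROM orders
"""

conn = sqlite3.connect("warehouse.db")  # hypothetical database file
cur = conn.execute(PROFILE_SQL)
# Pair each result column name with its value for a quick readout.
for name, value in zip((d[0] for d in cur.description), cur.fetchone()):
    print(f"{name}: {value}")
```

Numbers that look off here, such as a suspicious null count or an order date from 1899, are exactly the anomalies the third skill exists to chase down.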

What’s exciting for us beleaguered data engineers is that AI is showing great promise as a helpful tool across all three of these hard-to-master skills, ultimately making us better and more productive at our jobs. I’m sure we’ve all seen the advancements in AI’s ability to take plain-text queries and turn them into increasingly complex SQL, lightening the load of remembering all the advanced syntax for whichever data platform is on trend.
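As a rough illustration of that pattern, here is a minimal text-to-SQL sketch assuming the official OpenAI Python client; the schema, model name, and question are all assumptions for the example, and any LLM API would serve.

```python
# A minimal sketch of plain-text-to-SQL generation, assuming the official
# OpenAI Python client (pip install openai) and an OPENAI_API_KEY in the
# environment. Schema, model name, and question are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

SCHEMA = """
orders(order_id INT, customer_id INT, amount DECIMAL, order_date DATE)
customers(customer_id INT, region TEXT, signed_up DATE)
"""

question = "Total order value per region for 2023, highest first."

response = client.chat.completions.create(
    model="gpt-4o",  # any capable model; this name is an assumption
    messages=[
        {
            "role": "system",
            "content": "You translate questions into ANSI SQL. "
                       f"Schema:\n{SCHEMA}\nReturn only the SQL.",
        },
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
# Generated SQL should always be reviewed (and ideally run read-only)
# before it touches production data.
```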

GenAI’s seemingly infinite ability to absorb and digest technical knowledge about complex systems is an incredible aid for the second skill, and can quickly accelerate our knowledge of that funky new NoSQL database we just learned we need to integrate with.


Surely, however, the most exciting innovations come from LLMs’ rapidly increasing ability to understand the meaning and nuance of the data itself. This is being driven by two evolving technological advances:

  1. The LLM’s ability to think laterally, which has been the subject of an arms race since the release of ChatGPT a year and a half ago.
  2. The ever-increasing amount of context that the LLM can reasonably connect to get to an answer.

Together, these are increasingly giving the LLM the power, with guidance, to understand a real-world dataset, what can be done with it, and what the problems might be.
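In practice, that “guidance” mostly means assembling the right context: schema, sample rows, and simple statistics the model can reason over. A minimal sketch, using only the Python standard library and a hypothetical CSV export:

```python
# A sketch of the "guidance" step: assemble schema, sample rows, and simple
# stats from a dataset into one prompt-sized context an LLM can reason over.
# Uses only the standard library; the CSV file name is hypothetical.
import csv
from collections import Counter


def build_dataset_context(path: str, sample_rows: int = 5) -> str:
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        rows = list(reader)

    columns = reader.fieldnames or []
    # Count empty cells per column as a crude data-quality signal.
    nulls = Counter(
        col for row in rows for col in columns if not (row[col] or "").strip()
    )

    return (
        f"Dataset: {path} ({len(rows)} rows)\n"
        f"Columns: {', '.join(columns)}\n"
        f"Empty values per column: {dict(nulls)}\n"
        f"Sample rows: {rows[:sample_rows]}\n"
        "Question: what quality problems should an engineer check "
        "before using this data?"
    )


if __name__ == "__main__":
    print(build_dataset_context("call_log_export.csv"))  # hypothetical file
```

The larger the model’s context window, the more of this evidence it can weigh at once, which is exactly why the second advance above matters.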

I am not among the doom-mongers predicting the imminent end of the data engineer, usurped by a general intelligence capable of doing all of the above. It seems silly to predict the future when there is a mountain of pressing data problems in every organization that need fixing today, all of them exacerbated by the increasing demands of AI.

As it stands, and for the foreseeable future, GenAI cannot replace human roles. If the data feeding the model is not 100% accurate, you will see hallucinations. Precision isn’t within AI’s wheelhouse right now. What AI is fantastic at is speeding up processes and taking away the busywork that prevents people from doing what they do best: problem solving, being creative, and tackling high-value tasks.

For me, having a tool that tackles the mundane tasks that take far too much of my time is ideal. The AI tools I already use day-to-day give me the freedom to work on the things that light my fire, that add true value to the business, and the tasks that need that human in the loop and always will.

It is exciting to explore what is possible now, and what becomes possible as the tech evolves so quickly. Whilst I understand why many data engineering teams are hesitant, or simply don’t have the capacity to introduce these tools, I think we’re on the precipice of a truly exciting era for data engineering.

