Data is essential to AI workloads, yet it is often an afterthought for organizations in planning, purchasing, and even early-stage implementations. It is common for companies to acquire as many GPUs as their budgets allow, and then, when reality hits, they scramble to figure out what to do with their data, which is often mistakenly equated with storage. However, storage is only a piece of a larger data strategy and arguably not even the most important one. Data is not synonymous with storage; organizations need to change this mindset to truly optimize their massive investments in AI.
Data Orchestration
AI workloads are about the movement of data, so orchestration of that data needs to be the essential element of every organization’s data strategy. A data orchestration lifecycle should begin by assimilating existing data across heterogeneous storage systems, cloud services, and multiple geographic locations into a single global namespace. Data orchestration is critical to an AI data strategy: it lets you feed data to GPU resources immediately regardless of their physical location, automate workflows, access distributed data from geographically dispersed sites, replicate data without creating copy sprawl, and tier data based on activity, all transparently, programmatically, and at a file-granular level.
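To make the idea concrete, here is a minimal sketch of what a file-granular orchestration policy could look like, expressed in plain Python. The FileRecord and Policy structures, tier names, and thresholds are hypothetical illustrations, not the interface of any particular product; the point is that placement decisions are driven by declarative rules over file metadata while files stay addressable at the same namespace path.

```python
# Minimal sketch of a file-granular orchestration policy engine.
# All names (Policy, FileRecord, the example tiers) are hypothetical,
# not the API of any particular product.
from dataclasses import dataclass
import time

@dataclass
class FileRecord:
    path: str           # path within the global namespace
    site: str           # where the file currently lives
    last_access: float  # epoch seconds

@dataclass
class Policy:
    name: str
    predicate: callable  # decides whether a file matches
    target: str          # destination tier or site

def plan_moves(files, policies):
    """Yield (file, target) pairs; actual movement stays transparent to apps."""
    for f in files:
        for p in policies:
            if p.predicate(f):
                yield f, p.target
                break

policies = [
    Policy("feed-gpus", lambda f: f.path.startswith("/datasets/train/"), "gpu-cluster-nvme"),
    Policy("tier-cold", lambda f: time.time() - f.last_access > 90 * 86400, "object-archive"),
]

files = [FileRecord("/datasets/train/img_0001.jpg", "site-a", time.time()),
         FileRecord("/projects/old/report.dat", "site-b", time.time() - 200 * 86400)]

for f, target in plan_moves(files, policies):
    print(f"{f.path}: {f.site} -> {target}")
```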
Eliminating Data Gravity
Today, data is treated as subordinate to storage systems, which is antithetical to the entire concept of digital transformation. The reverse should be true: storage should be subordinate to data. Achieving this requires eliminating data gravity for tiering, hardware upgrades, load balancing, scalability, data protection, data accessibility, collaboration, and burst-to-compute.
Storage system vendors benefit from data gravity. It locks customers in and ensures that reliance on a single vendor leads to ongoing expansion. At best, vendors may offer some remote-site replication and snapshot capabilities. Often, these data services create many copies, increasing the vendor’s footprint, management complexity, and revenue. Many storage systems also provide some tiering functionality, but it is typically limited to tiers within the vendor’s own platforms.
Truly eliminating data gravity means making sure data can move freely between heterogeneous storage systems, whether they are in the same data center or thousands of miles away. A system that eliminates data gravity will efficiently leverage available networks, support any storage medium, eliminate or minimize copies, not require bolt-on features, provide absolute control of what data is orchestrated, and be completely transparent to both users and applications when data is in flight or has been placed in a new location.
Data Assimilation
The process of migrating data from heterogeneous data sources is slow, tedious, and extremely inefficient. The adage that “time is money” is compounded when you have invested so heavily in the infrastructure, software, people, energy, and physical real estate needed for AI workloads. Additionally, and perhaps more importantly, waiting weeks or even months to find and utilize your data for analytics, research, training, processing, or rendering can directly impact your top-line revenue.
Data assimilation is the process of using existing data from a wide range of sources and constantly feeding it to GPU servers, beasts that must be fed quickly and perpetually. Very few solutions provide data assimilation, but that doesn’t mean it shouldn’t be a requirement. Data assimilation significantly improves your time to value and optimizes the utilization of GPUs, a resource that should rarely, if ever, sit idle.
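As a rough illustration of metadata-first assimilation, the sketch below walks a set of existing mounts and records only their metadata in a catalog, so the data becomes addressable immediately while the bytes stay in place until a GPU job actually needs them. The mount points, catalog schema, and file names are assumptions made for the example.

```python
# Minimal sketch of metadata-first assimilation: existing shares become
# visible in a single catalog immediately, while the bytes stay in place
# and move only when a GPU job actually needs them. The mount points and
# schema here are illustrative assumptions, not a specific product's design.
import os, sqlite3

EXISTING_MOUNTS = {            # hypothetical heterogeneous sources
    "nas-legacy": "/mnt/legacy_nas",
    "scratch":    "/mnt/hpc_scratch",
}

db = sqlite3.connect("namespace_catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS files
              (global_path TEXT PRIMARY KEY, source TEXT,
               local_path TEXT, size INTEGER, mtime REAL)""")

for source, root in EXISTING_MOUNTS.items():
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            local = os.path.join(dirpath, name)
            st = os.stat(local)
            global_path = "/" + source + local[len(root):]
            db.execute("INSERT OR REPLACE INTO files VALUES (?,?,?,?,?)",
                       (global_path, source, local, st.st_size, st.st_mtime))
db.commit()

# GPU-side code can now query the catalog and pull only what it needs.
print(db.execute("SELECT COUNT(*) FROM files").fetchone()[0], "files assimilated")
```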
Distributed Data
For enterprises and other organizations, data is often distributed, whereas compute resources, especially GPUs, tend to be more centralized. Unfortunately, enterprise data is typically restricted to being a local asset because of the limitations of legacy storage systems.
Providing a single namespace that spans multiple sites and offers automated granular data orchestration becomes extremely valuable, making data a global asset feeding GPU resources regardless of geographic locality.
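One way to picture a multi-site namespace is a resolver that keeps the logical path constant and directs each client to the nearest instance of the data. The placement map, site names, and latencies below are invented for illustration only.

```python
# Toy resolver for a single namespace spanning sites: the logical path stays
# the same everywhere, and the client is directed to the nearest instance.
# Site names, latencies, and placement data are made up for illustration.
PLACEMENT = {
    "/datasets/train/shard-0042": ["us-east", "eu-west"],
    "/models/checkpoints/latest": ["eu-west"],
}
LATENCY_MS = {"us-east": 2, "eu-west": 85, "ap-south": 140}  # from this client

def resolve(global_path, placement=PLACEMENT, latency=LATENCY_MS):
    """Pick the lowest-latency site holding the data; orchestration may later
    replicate it closer to the GPUs without changing the path callers use."""
    sites = placement.get(global_path)
    if not sites:
        raise FileNotFoundError(global_path)
    return min(sites, key=lambda s: latency[s])

print(resolve("/datasets/train/shard-0042"))   # -> us-east
print(resolve("/models/checkpoints/latest"))   # -> eu-west
```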
Burst-to-GPUs
Some companies don’t have enough GPU servers, are waiting for systems to arrive but need access immediately, or cannot justify acquiring these servers outright yet periodically require access to them.
The ability to orchestrate data to GPU services in the public cloud and/or to GPU-as-a-Service providers can be highly beneficial for these use cases. This capability is greatly enhanced by the elimination of data gravity, a single namespace that spans multiple sites, data assimilation, and, of course, transparent data mobility.
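A burst-to-GPU workflow might look roughly like the following: select the files a job needs, stage them to the provider’s region ahead of time, and release the cloud copies when the job completes. The stage() and release() functions here are placeholders for whatever transfer mechanism an orchestration layer actually provides.

```python
# Sketch of bursting a dataset to rented cloud GPUs: select the files a job
# needs, stage them to the provider's region ahead of time, and release the
# cloud copies when the job finishes. stage() and release() are placeholders
# for whatever transfer mechanism an orchestration layer actually uses.
from dataclasses import dataclass, field

@dataclass
class BurstJob:
    dataset_prefix: str
    cloud_region: str
    staged: list = field(default_factory=list)

def stage(path, region):      # placeholder: real systems move data transparently
    print(f"staging {path} -> {region}")

def release(path, region):    # placeholder: free the cloud copy, keep the original
    print(f"releasing {path} from {region}")

def run_burst(job, catalog):
    for path in catalog:
        if path.startswith(job.dataset_prefix):
            stage(path, job.cloud_region)
            job.staged.append(path)
    # ... GPU-as-a-Service training runs here against the staged copies ...
    for path in job.staged:
        release(path, job.cloud_region)

catalog = ["/datasets/train/a.tfrecord", "/datasets/train/b.tfrecord", "/logs/old.txt"]
run_burst(BurstJob("/datasets/train/", "gpu-cloud-us-east"), catalog)
```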
Data Lifecycles
It is generally expected that, on average, at least 80% of data is dormant. Companies can realize economic advantages by tiering inactive data to lower-cost, denser systems. A number of storage systems do provide some level of tiering, but it is typically performed within the storage system itself or through an external solution from the same vendor’s product family. Far more value is derived when metadata about all data remains accessible no matter where that data is stored, making it fast and easy to find even in archive tiers. Tiering in a data-driven AI architecture needs to work across heterogeneous storage: existing systems from any vendor, commodity storage, cloud storage, and so on.
Power efficiency should also factor into the calculation. GPU servers consume a massive amount of power: it is estimated that supporting one million GPUs would take roughly half the power a nuclear power plant generates (at roughly 500 W per GPU, one million GPUs draw about 500 MW, around half of a typical reactor’s ~1 GW output). That may seem like a lot of GPUs, but hyperscalers are already reaching these levels and are still growing at an astronomical rate, and large enterprises and other organizations already operating tens of thousands of GPUs are scaling to hundreds of thousands. Storage solutions with superior power efficiency are therefore desirable to offset the massive power requirements of GPU servers. Every watt saved on storage is a watt available for GPU-driven intelligence. Implementing an automated data lifecycle process that tiers data to power-efficient storage should be part of your data strategy.
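An automated lifecycle process can be as simple in concept as the sketch below, which moves files that have not been accessed in 90 days from a performance tier to a denser, lower-power tier. The paths and threshold are illustrative, and a real orchestrator would keep the tiered files visible at their original namespace paths.

```python
# Minimal sketch of automated lifecycle tiering: files untouched for 90 days
# move to a denser, lower-power tier. Paths and the 90-day threshold are
# illustrative; a real orchestrator would do this transparently and keep the
# file visible at its original namespace path.
import os, shutil, time

HOT_TIER = "/mnt/flash"          # hypothetical performance tier
COLD_TIER = "/mnt/archive"       # hypothetical power-efficient tier
MAX_IDLE_DAYS = 90

cutoff = time.time() - MAX_IDLE_DAYS * 86400
for dirpath, _dirs, names in os.walk(HOT_TIER):
    for name in names:
        src = os.path.join(dirpath, name)
        if os.stat(src).st_atime < cutoff:          # dormant by last access
            dst = os.path.join(COLD_TIER, os.path.relpath(src, HOT_TIER))
            os.makedirs(os.path.dirname(dst), exist_ok=True)
            shutil.move(src, dst)
            print(f"tiered {src} -> {dst}")
```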
Parallel File Systems for AI Workloads
Due to the specific type of performance required, a parallel file system is often necessary for AI workloads and any type of GPU computing. However, a few details must be considered when choosing a parallel file system.
Complexity. Some parallel file systems practically require a PhD to implement, manage, and continuously optimize. The expertise to do so is scarce and often comes at a high price. What happens if you can’t hire or retain these specialists? We have seen this movie many times before and know how it ends. The smart thing to do is select a file system that is easy at the outset and remains so over time.
Proprietary Agents and Modified Clients. Avoid parallel file systems that rely on proprietary agents and modified clients. We learned decades ago that managing non-standard agents, drivers, and clients is cumbersome, difficult, and time-consuming. These non-standard components are also often a security risk and represent a form of vendor lock-in. Proprietary approaches have no place in the modern data center; standards-based clients should be a requirement.
Enterprise Standards. Many parallel file systems designed for the HPC and research markets do not meet enterprise requirements for compliance, security, and software build quality. Enterprises and hyperscalers pursuing GenAI initiatives need parallel file system capabilities coupled with the enterprise-standard design of NAS, and are turning to hyperscale NAS architectures.
Performance. Performance should be sufficient initially, over time, and at scale. Make sure any solution you evaluate has reference customers who can confirm that the vendor’s performance claims hold up in real-world use cases. Details matter: understand what it takes to achieve that performance in terms of cost and management. Performance should scale linearly as you add resources, with little to no tuning.
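When you run a proof of concept, a simple calculation like the one below can turn raw throughput numbers into a scaling-efficiency figure you can hold a vendor to. The throughput values are placeholders, not measurements of any real system.

```python
# Quick way to sanity-check a "linear scaling" claim from your own
# proof-of-concept numbers. The throughput figures below are placeholders,
# not measurements of any real system.
measured_gbps = {1: 40, 4: 155, 8: 300, 16: 560}   # client nodes -> aggregate GB/s

base_nodes = min(measured_gbps)
per_node_baseline = measured_gbps[base_nodes] / base_nodes

for nodes, gbps in sorted(measured_gbps.items()):
    ideal = per_node_baseline * nodes
    efficiency = gbps / ideal * 100
    print(f"{nodes:>3} nodes: {gbps:>5} GB/s  ({efficiency:.0f}% of linear)")
```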
Conclusion
When implementing AI strategies and workloads, the role of data is essential but often an afterthought for most organizations. Data is not storage, and it’s important to change your mindset and treat data on its own terms. Since AI workloads require data movement, orchestration of that data becomes paramount. Orchestration allows you to eliminate data gravity for tiering, hardware upgrades, load balancing, scalability, data protection, data accessibility, collaboration, and burst-to-compute. You’ll be able to assimilate data from a wide range of sources and feed it to GPU servers. Orchestration also lets you provide a single namespace that spans distributed sites and moves data to cloud GPU services on an as-needed basis. You can tier inactive data to lower-cost, denser systems and offset the power requirements of GPU servers with storage that provides superior power efficiency. The right type of parallel file system can provide the answer for AI workloads, but do your homework to ensure the system is easy to implement and manage, doesn’t use proprietary agents or modified clients, and delivers sufficient performance initially, over time, and at scale.