Unless you’ve been living off the grid, the hype around Generative AI has been impossible to ignore. A critical component fueling this AI revolution is the underlying computing power: GPUs. These lightning-fast processors enable rapid model training. But a hidden bottleneck can severely limit their potential – I/O. If data can’t reach the GPU fast enough to keep up with its computations, those precious GPU cycles are wasted waiting for something to do. This is why we need to bring more awareness to the challenges of I/O bottlenecks.

GPUs are data-hungry

GPUs are blazingly fast, capable of performing trillions of floating-point operations per second (TFLOPS), roughly 10 to 1,000 times more than CPUs. However, data must be available in GPU memory before GPUs can perform these operations. The faster you load data into GPUs, the quicker they can complete their computations.

Now, a new bottleneck is emerging – GPUs are increasingly starved by slow I/O. This situation is usually described as being I/O bound, or a data stall. I/O, or input/output, refers to reading and writing data from a source to its destination. Studies from Google and Microsoft have shown that up to 70% of model training time can be taken up by I/O. Put another way, your GPUs may spend 70% of their time sitting idle, wasting your time and money.

Let’s look at the typical machine learning pipeline. Training data lives in object storage; at the beginning of each training epoch it is copied to the local storage of the GPU instance and finally loaded into GPU memory. Data retrieval over the network, copying between storage tiers, and metadata operations all contribute to the duration of each training epoch.
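
To make those stages concrete, here is a minimal Python sketch of the same path, assuming the dataset sits in an S3 bucket and the GPU instance has local NVMe storage; the bucket, key, and file paths are hypothetical placeholders:

```python
import boto3
import torch

# Stage 1: copy a shard of training data from object storage to local disk.
# Bucket, key, and paths below are hypothetical.
s3 = boto3.client("s3")
s3.download_file("my-training-bucket", "dataset/shard-0000.pt", "/local/nvme/shard-0000.pt")

# Stage 2: load the shard from local storage into host (CPU) memory,
# assuming it was saved with torch.save().
batch = torch.load("/local/nvme/shard-0000.pt")

# Stage 3: copy the tensors from host memory into GPU memory,
# where the actual computation happens.
batch = batch.to("cuda")
```

Every hop in this path adds latency, and every hop is a chance for the GPU to sit idle.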

In the past, it was enough to feed data to GPUs locally from NVMe storage. Now, data is distributed across various locations, datasets have outgrown local GPU storage capacity, and GPU speed has increased while I/O has not kept pace. Sadly, many teams may not even realize they are not maximizing the full potential of their GPUs.

Why is this becoming even more important now?

Before we get into solutions, let’s talk about why this is all becoming a bigger deal today.

First, generative AI is having a moment right now. People around the world are getting excited, making businesses take notice. These days, organizations are in a hurry to get AI products to market. So they’re asking a lot from their AI infrastructure and want to see results fast. But it’s often when they finally roll out AI platforms from pilot to production that they realize the infrastructure needs to be optimized.

Second, as the AI hype goes on, GPUs are hard to get. Now, you ‘rent’ them in the cloud, for example through AWS’s EC2 P4 instances, which pack up to eight NVIDIA A100 GPUs. Cloud training has become the new norm. Once upon a time, GPUs were used only to process local datasets, but now you have to move your datasets around or copy data between regions and clouds to bring them closer to wherever the GPUs are. Developers often end up with instances in a region far from where their data is stored. This is problematic because geo-separated compute and storage mean slow I/O.

Last but not least, there’s a higher demand for better results at a lower cost. Foundation models and deep learning models require many experiments to determine optimal parameters, and machine learning engineers are running more of them because more iterations generally mean better final models. Meanwhile, organizations prioritize ROI and cloud cost optimization through practices like FinOps. This makes it urgent to resolve the I/O bottleneck so that expensive GPUs are better utilized.

Architectural considerations

We need to optimize I/O in such a way that the GPU never has to wait for data to perform its computations. Here are key considerations for machine learning and AI infrastructure engineers to get more out of GPUs:

Load data in parallel – Use parallel data-loading facilities, such as PyTorch’s DataLoader with multiple workers, to load, transform, and normalize datasets concurrently before feeding them to GPUs; with DistributedDataParallel (DDP), each training process loads its own portion of the data in parallel.
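
A minimal PyTorch sketch of parallel loading; the dataset here is synthetic stand-in data, and the batch size and worker counts are illustrative rather than tuned values:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in data; replace with your real Dataset.
dataset = TensorDataset(torch.randn(2_048, 3, 64, 64), torch.randint(0, 10, (2_048,)))

loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,       # worker processes load and transform batches in parallel
    pin_memory=True,     # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,   # each worker keeps batches queued ahead of the GPU
    shuffle=True,
)

for inputs, labels in loader:
    inputs = inputs.to("cuda", non_blocking=True)
    labels = labels.to("cuda", non_blocking=True)
    # ... forward/backward pass on the GPU ...
```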

Strategically cache data – Accelerate I/O by caching frequently used training data in a high-performance caching layer, like Alluxio, or directly in GPU memory.
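
Systems like Alluxio handle caching transparently underneath the training framework, but the idea can be illustrated at the application level with a simple, hypothetical Dataset wrapper that keeps samples in host memory after their first read:

```python
from torch.utils.data import Dataset

class CachingDataset(Dataset):
    """Illustrative wrapper: the first epoch reads from slow storage,
    later epochs are served straight from host memory."""

    def __init__(self, base_dataset):
        self.base = base_dataset
        self.cache = {}

    def __len__(self):
        return len(self.base)

    def __getitem__(self, index):
        # Note: with num_workers > 0, each DataLoader worker keeps its own cache.
        if index not in self.cache:
            self.cache[index] = self.base[index]
        return self.cache[index]
```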

Optimize storage format – Partition hot and cold data so you never load records you don’t need. Use columnar formats like Parquet to store and compress analytic data efficiently, saving I/O bandwidth before the data even reaches the CPU.
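
For example, with PyArrow you can read only the columns a model actually needs from a partitioned Parquet dataset; the path and column names below are hypothetical:

```python
import pyarrow.parquet as pq

# The date=... directory reflects hot/cold partitioning, and the columns
# argument means untouched columns are never read from storage at all.
table = pq.read_table(
    "s3://my-training-bucket/features/date=2024-01-01/part-0000.parquet",
    columns=["user_id", "feature_vector", "label"],
)
df = table.to_pandas()
```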

Ingest and collect data in real time using high-throughput frameworks – Collect data in parallel from myriad sources using Kafka, Kinesis, or similar data pipelines.
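
A rough sketch using the kafka-python client; the topic, brokers, and consumer group are hypothetical, and ingestion scales out simply by running more consumers in the same group:

```python
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "training-events",                                    # hypothetical topic
    bootstrap_servers=["broker-1:9092", "broker-2:9092"],
    group_id="feature-ingest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

# Consumers in the same group split the topic's partitions between them,
# so adding consumers increases ingestion throughput.
for message in consumer:
    record = message.value
    # ... write the record to a training shard or feature store ...
```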

Increase mini-batch size – Larger mini-batches allow more efficient parallelization and better utilization of GPU compute power and memory during training.

Shard data across GPUs – Distribute data across multiple GPU devices in a scale-out fashion to train models faster. 
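
In PyTorch, DistributedSampler is one common way to give each GPU process its own disjoint shard of the dataset; this sketch uses synthetic data and assumes the script is launched with a tool such as torchrun:

```python
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Rank and world size come from the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl")

dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))

# Each process sees a different, non-overlapping slice of the dataset,
# so no sample is loaded twice and every GPU stays busy.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=256, sampler=sampler, num_workers=4)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle shard assignment each epoch
    for inputs, labels in loader:
        ...  # move the batch to the local GPU and train
```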

Continuous monitoring – Monitor model training performance to identify and alleviate bottlenecks quickly. For example, you can use TensorBoard to see how much time is spent on data loading. Also pay attention to storage performance, including throughput, IOPS, and metadata operations.
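
As a simple illustration of the TensorBoard approach, you can time how long each iteration waits on the DataLoader and log it as a scalar; the tag name and synthetic data here are arbitrary:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="runs/io-profiling")
dataset = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 2, (10_000,)))
loader = DataLoader(dataset, batch_size=256, num_workers=4)

step = 0
data_start = time.perf_counter()
for inputs, labels in loader:
    # Time spent waiting on the DataLoader is (mostly) I/O plus preprocessing.
    data_time = time.perf_counter() - data_start

    # ... forward/backward pass would go here ...

    writer.add_scalar("time/data_loading_seconds", data_time, step)
    step += 1
    data_start = time.perf_counter()

writer.close()
```

If the logged data-loading time rivals the compute time per step, the job is I/O bound.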

Key takeaways

As deep learning models and datasets grow exponentially, it is critical to build scalable data pipelines that efficiently deliver data to on-premises or cloud GPUs. Be aware of I/O speed and ensure that I/O does not become the bottleneck that holds back valuable AI business outcomes. To maximize the value of your GPU investment, your infrastructure teams should proactively consider the I/O optimizations above.