When scaling ML workloads across multiple nodes, storage performance often becomes the hidden bottleneck. You've invested in expensive H100s or B200s, but your GPUs sit idle waiting for data.
The Problem
Traditional cloud storage struggles with distributed ML training. A standard AWS EBS or Google Persistent Disk volume typically delivers around 250 MB/s of read throughput with limited IOPS. When you're paying $3/hour per GPU, having your GPUs wait on storage is an expensive inefficiency.
TensorPool's NFS: Built for Speed
We've engineered our NFS infrastructure specifically for the demands of ML workloads. Here's what sets it apart:
Performance That Scales
For a 100TB volume, you get:
- 43 GB/s reads - Load data faster than your GPUs can consume it
- 11 GB/s writes - Checkpoint without slowing down training
- 150k read IOPS - Handle millions of small files efficiently
- 75k write IOPS - Support intensive logging and metrics
Need even more? Our 1000TB+ volumes deliver:
- 322 GB/s reads - Saturate even the largest multi-node clusters
- 161 GB/s writes - Checkpoint massive models in seconds
- 1.5M read IOPS - Support hundreds of concurrent workers
- 750k write IOPS - Handle extreme write-intensive workloads
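To get a rough feel for what those aggregate figures mean at cluster scale, here's a quick back-of-envelope sketch. The cluster sizes and per-node shard size below are illustrative assumptions, not benchmark results:

# Back-of-envelope: time for N nodes to each read a data shard concurrently,
# assuming the volume's aggregate read bandwidth is shared evenly across nodes.
def concurrent_read_seconds(shard_gb, num_nodes, aggregate_gbps):
    per_node_gbps = aggregate_gbps / num_nodes
    return shard_gb / per_node_gbps

SHARD_GB = 100  # assumed per-node shard size (illustrative)
for nodes, bandwidth in [(8, 43), (64, 322)]:  # 100TB and 1000TB+ tiers
    t = concurrent_read_seconds(SHARD_GB, nodes, bandwidth)
    print(f"{nodes} nodes x {SHARD_GB} GB each at {bandwidth} GB/s aggregate: ~{t:.0f} s per node")

Even with dozens of nodes streaming at once, each node still sees multiple GB/s from a single shared volume.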
Why This Matters
Let's put these numbers in context:
Loading a 100GB dataset:
- AWS EBS (~250 MB/s): ~7 minutes
- TensorPool NFS (43 GB/s): ~2.3 seconds
Checkpointing a 50GB model:
- Traditional cloud storage (~250 MB/s): ~3.3 minutes
- TensorPool NFS (11 GB/s): ~4.5 seconds
When you checkpoint every epoch, the minutes saved per checkpoint add up to hours over a full training run.
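If you want to sanity-check these figures yourself, the arithmetic is just size divided by throughput. A quick sketch, using the numbers above; the epoch count is an illustrative assumption:

# Transfer time = data size / sustained throughput (rough, ignores protocol overhead).
def transfer_seconds(size_gb, throughput_gbps):
    return size_gb / throughput_gbps

dataset_slow = transfer_seconds(100, 0.25)  # 100 GB at ~250 MB/s -> ~400 s (~7 min)
dataset_fast = transfer_seconds(100, 43)    # 100 GB at 43 GB/s   -> ~2.3 s
ckpt_slow = transfer_seconds(50, 0.25)      # 50 GB at ~250 MB/s  -> ~200 s (~3.3 min)
ckpt_fast = transfer_seconds(50, 11)        # 50 GB at 11 GB/s    -> ~4.5 s

EPOCHS = 100  # assumed checkpoints per training run (illustrative)
hours_saved = EPOCHS * (ckpt_slow - ckpt_fast) / 3600
print(f"Checkpoint time saved over {EPOCHS} epochs: ~{hours_saved:.1f} hours")
# -> Checkpoint time saved over 100 epochs: ~5.4 hours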
On-Demand Multi-Node: The TensorPool Advantage
Here's what sets TensorPool apart: true on-demand access to multi-node clusters.
Other providers force you to choose between long-term reservations or single-node instances. Want a 4-node H100 cluster? You'll need to commit to weeks or months upfront, or you simply can't get it.
TensorPool's architecture changes this. Because our high-performance NFS storage is independent from compute, we can offer genuinely on-demand multi-node clusters:
- Spin up 2 nodes for a quick experiment
- Scale to 4 nodes when you're ready to train
- Jump to 8 nodes for your production run
- Scale back down to 0 when you're done—storage persists
No long-term commitments. No data migration between cluster sizes. Your datasets and checkpoints stay on blazing-fast shared NFS while you elastically scale compute to match your workload.
This is what on-demand multi-node training should look like: instant access to clusters of any size, backed by persistent high-performance storage.
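In practice, the workflow is simply writing checkpoints to the shared mount and resuming from them on whatever cluster you spin up next. Here's a minimal sketch; the mount path, checkpoint filename, and toy model are assumptions for illustration, not TensorPool APIs:

import os
import torch
import torch.nn as nn

# Assumed mount point from `tp nfs attach`; replace <storage_id> with your volume's ID.
CKPT_PATH = "/mnt/nfs-<storage_id>/checkpoints/run1.pt"

model = nn.Linear(1024, 1024)  # stand-in for your real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Resume if a checkpoint already exists on the shared volume -- it survives
# cluster teardown, so a new (larger or smaller) cluster picks up where you left off.
start_epoch = 0
if os.path.exists(CKPT_PATH):
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    ...  # your training step here
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "epoch": epoch}, CKPT_PATH)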
Getting Started
Setting up high-performance NFS with TensorPool takes just a few commands:
# Create a 100TB NFS volume
tp nfs create -s 100000 --name ml-data
# Create a multi-node cluster
tp cluster create -i ~/.ssh/id_ed25519.pub -t 8xH100 -n 4 --name training
# Attach NFS to your cluster
tp nfs attach <storage_id> <cluster_ids>
Your NFS volume will be mounted at /mnt/nfs-<storage_id>
on all nodes, ready for immediate use.
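Once the volume is attached, a quick smoke test from any node confirms the mount and gives a rough sense of write speed. This sketch uses only the Python standard library; replace the placeholder path with your actual storage ID, and treat the result as a single-client number, not the volume's aggregate limit:

import os, time

MOUNT = "/mnt/nfs-<storage_id>"  # replace with your actual mount point
assert os.path.ismount(MOUNT), f"{MOUNT} is not mounted"

# Write 1 GiB of zeros and time it for a rough single-client throughput estimate.
test_file = os.path.join(MOUNT, "throughput_test.bin")
chunk = b"\0" * (64 * 1024 * 1024)  # 64 MiB chunks
start = time.time()
with open(test_file, "wb") as f:
    for _ in range(16):  # 16 x 64 MiB = 1 GiB
        f.write(chunk)
    f.flush()
    os.fsync(f.fileno())
elapsed = time.time() - start
print(f"Wrote 1 GiB in {elapsed:.2f}s (~{1.0 / elapsed:.1f} GiB/s from this client)")
os.remove(test_file)

For a more rigorous measurement, run a dedicated benchmark such as fio from several nodes at once, since a single client won't saturate the volume.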
Conclusion
Don't let slow storage bottleneck your GPU investment. TensorPool's high-performance NFS delivers speeds up to 322 GB/s and 1.5M IOPS, combined with the flexibility to scale your compute on-demand without data migration.
Ready to stop waiting on I/O? Get started with TensorPool.