Managing GPU training jobs shouldn't feel like juggling SSH sessions and manually moving files around. That's why we built TensorPool Jobs: a git-style interface that lets you push ML workloads to GPUs as easily as you push code to GitHub.
The Problem with Traditional GPU Workflows
The typical GPU training workflow is tedious:
- Manually SSH into a GPU instance
- Set up your environment
- Start training and keep the terminal open
- Babysit the process or risk losing output
- Manually download results when done
- Remember to shut down the instance (or pay for idle time)
If you're running multiple experiments, multiply this complexity by the number of jobs. It's time-consuming, error-prone, and expensive.
Jobs: A Better Way to Train
TensorPool Jobs flips the script. Instead of managing long-running SSH sessions, you define your job once, push it, and let TensorPool handle the rest.
Here's what makes it powerful:
Pay Only for Compute Time
Jobs spin up a cluster, run your training, save your outputs, and automatically shut down. You're billed only for the time your code is actually running—not for idle instances you forgot about.
Configuration as Code
Define your job in a tp.config.toml file:
[job]
name = "llama-finetune"
instance_type = "8xH100"
commands = [
    "pip install -r requirements.txt",
    "python train.py --epochs 100 --batch-size 32"
]

[output]
files = [
    "checkpoints/",
    "logs/",
    "results.json"
]
This configuration is version-controlled, shareable, and reproducible.
Submit Jobs in Seconds
Once your config is ready, push your job:
tp job push tp.config.toml
TensorPool handles:
- Spinning up the specified GPU cluster
- Uploading your code
- Running your commands sequentially
- Capturing all logs
- Saving specified output files
- Shutting down the cluster
Monitor and Retrieve Results
Once your job is running, you can:
- Stream logs in real time:
tp job listen <job_id>
- Download outputs when complete:
tp job pull <job_id>
- List all your jobs:
tp job list
- Cancel a running job:
tp job cancel <job_id>
Real-World Workflows
Single Experiment
Quick model training with automatic cleanup:
# Initialize config
tp job init
# Edit tp.config.toml with your settings
# Push and forget
tp job push tp.config.toml
Your job runs, saves checkpoints, and shuts down automatically.
Hyperparameter Sweeps
Create multiple configs for different hyperparameters:
tp.config.lr-0.001.toml
tp.config.lr-0.0001.toml
tp.config.lr-0.00001.toml
Push them all at once:
tp job push tp.config.lr-0.001.toml
tp job push tp.config.lr-0.0001.toml
tp job push tp.config.lr-0.00001.toml
Jobs queue and run automatically. Retrieve the best results when they're done.
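Writing those configs by hand gets tedious as the sweep grows. Below is a minimal Python sketch, not an official TensorPool utility, that generates one config per learning rate and pushes each with tp job push. It assumes your train.py accepts a --lr flag; the remaining fields mirror the example config earlier in this post.
# sweep.py: generate one config per learning rate and push each as its own job.
# Minimal sketch, not an official TensorPool tool; assumes train.py takes a --lr flag.
import subprocess

TEMPLATE = """\
[job]
name = "llama-finetune-lr-{lr}"
instance_type = "8xH100"
commands = [
    "pip install -r requirements.txt",
    "python train.py --epochs 100 --batch-size 32 --lr {lr}"
]

[output]
files = ["checkpoints/", "logs/", "results.json"]
"""

for lr in ["0.001", "0.0001", "0.00001"]:
    path = f"tp.config.lr-{lr}.toml"
    with open(path, "w") as f:
        f.write(TEMPLATE.format(lr=lr))
    # Each config becomes an independent job; TensorPool queues and runs them.
    subprocess.run(["tp", "job", "push", path], check=True)
Each run saves its own checkpoints and results.json, so comparing learning rates is just a matter of pulling each job's outputs.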
Getting Started
The workflow is straightforward: initialize a config, customize it for your training job, push it to TensorPool, and retrieve your results when done.
1. Initialize Your Config
In your project directory:
tp job init
This creates a tp.config.toml template with sensible defaults.
2. Customize Your Job
Edit the config with your training command, GPU requirements, and output files:
[job]
name = "my-training-job"
instance_type = "1xH100"
commands = [
    "python train.py"
]

[output]
files = [
    "model.pth",
    "metrics.json"
]
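For reference, here's a minimal, illustrative train.py that would satisfy this config. The only contract that matters is that the files listed under [output] exist when your commands finish; the training loop itself is a stand-in, and it assumes PyTorch is already available on the instance or installed by an earlier command.
# train.py: a stand-in training script. The important part is that it writes the
# files named in the [output] section (model.pth and metrics.json) before exiting.
import json

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Save the artifacts TensorPool is told to keep.
torch.save(model.state_dict(), "model.pth")
with open("metrics.json", "w") as f:
    json.dump({"final_loss": loss.item()}, f)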
3. Push and Monitor
Push your job and get a job ID for tracking:
tp job push tp.config.toml
That's it. Your cluster spins up, runs your training, saves outputs, and shuts down automatically. You can monitor progress with the commands described above, and pull your results when complete.
Under the Hood
Jobs leverage TensorPool's core infrastructure:
- Fast provisioning: Clusters spin up in minutes
- Automatic cleanup: Resources deallocate when jobs finish
- Persistent storage: Optional NFS volumes for large datasets
The result? A seamless experience that feels like running code locally, but with on-demand access to H100s, H200s, and B200s.
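As a sketch of how persistent storage fits in: a large dataset can live on an attached volume and be read in place instead of being re-uploaded with every job. The mount path below is a placeholder, not a documented TensorPool path; check the docs for how volumes are actually attached.
# Hypothetical example: /mnt/data is a placeholder for wherever your NFS volume is
# mounted, not a documented TensorPool path. The dataset stays on the volume across
# jobs instead of being re-uploaded each time.
from pathlib import Path

DATA_DIR = Path("/mnt/data/my-dataset")       # assumed mount point

def iter_shards(data_dir: Path):
    """Yield dataset shards from the persistent volume, one file at a time."""
    for shard in sorted(data_dir.glob("*.bin")):
        yield shard.read_bytes()

if __name__ == "__main__":
    total = sum(len(chunk) for chunk in iter_shards(DATA_DIR))
    print(f"read {total} bytes from persistent storage")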
What's Next
Top startups and labs already use TensorPool Jobs to ship faster and train better models. Jobs will keep getting better with your feedback in our Slack, and here's what's on the roadmap:
- Scheduled jobs: Run training at specific times
- Job templates: Share configs across teams
- Enhanced monitoring: Real-time GPU utilization graphs
- Error handling: Agent-driven fixing of job errors
- Multi-node jobs: Don't need to explain this one :)
Try It Today
If you've been putting off GPU-intensive experiments because managing infrastructure is tedious, Jobs removes that friction.
Install the TensorPool CLI:
pip install tensorpool
Initialize your first job:
tp job init
Push it and see how simple GPU training can be:
tp job push tp.config.toml
Ready to train smarter, not harder? Check out the full Jobs documentation.
Questions? Feedback? We'd love to hear from you. Join our community or reach out on our website.
