Modern AI training has a hardware problem.
Most distributed training stacks assume that serious model training requires expensive GPU clusters with high-bandwidth GPU-GPU communication. In practice, that often means NVLink, NVSwitch, InfiniBand, or other premium interconnects.
That assumption makes custom model training expensive and inaccessible for many companies.
At PocketBrains, we are building NexTrain to challenge that assumption.
Our goal is simple: train custom AI models on the GPUs customers can actually access.
The problem with modern training infrastructure
Most companies do not start with a perfect training environment. They may have:
- A few RTX 4090 or 5090 machines
- L40S or A6000 GPUs
- A100 or H100 instances without ideal interconnects
- Rented GPUs from multiple providers
- On-prem GPUs not designed as a frontier-model cluster
- Valuable data, but no clean training pipeline
Traditional distributed training often struggles in this environment because it assumes high-bandwidth communication between GPUs.
When GPU-GPU communication becomes the bottleneck, hardware choice becomes restrictive. The training stack starts dictating the infrastructure.
That is backwards.
The customer should be able to choose the GPU configuration based on cost, availability, and business needs. The infrastructure should adapt to the hardware, not the other way around.
The NexTrain approach
NexTrain is designed around a different principle. Instead of depending on direct GPU-GPU collective communication as the primary path, it uses a training architecture based on:
- CPU/RAM-mediated parameter movement
- Parameter streaming
- Memory-aware scheduling
- Prefetching and double buffering
- Compute and communication overlap
- Flexible GPU assignment
- Hardware-aware training plans
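To make prefetching and double buffering concrete, here is a minimal Python sketch. It is not NexTrain's implementation: the function names (`load_layer_to_host`, `compute_layer`, `run_forward`) are hypothetical, and `time.sleep` stands in for real parameter transfers and GPU kernels. A background thread fills a two-slot queue with upcoming layers' parameters while the main thread computes on the current layer, so transfer and compute overlap.

```python
import queue
import threading
import time


def load_layer_to_host(layer_id):
    """Hypothetical stand-in for streaming one layer's parameters from host RAM."""
    time.sleep(0.01)  # pretend transfer latency
    return [float(layer_id)] * 4


def compute_layer(params, activation):
    """Hypothetical stand-in for the GPU compute of one layer."""
    time.sleep(0.01)  # pretend kernel time
    return activation + sum(params)


def prefetcher(num_layers, buffers):
    """Load layers in order; put() blocks while both buffer slots are full."""
    for layer_id in range(num_layers):
        buffers.put(load_layer_to_host(layer_id))
    buffers.put(None)  # sentinel: no more layers


def run_forward(num_layers, activation=0.0):
    # A bounded queue with two slots plays the role of the two buffers.
    buffers = queue.Queue(maxsize=2)
    worker = threading.Thread(
        target=prefetcher, args=(num_layers, buffers), daemon=True
    )
    worker.start()
    while True:
        params = buffers.get()  # waits only if the prefetcher has fallen behind
        if params is None:
            break
        activation = compute_layer(params, activation)
    worker.join()
    return activation


print(run_forward(num_layers=4))  # prints 24.0
```

In a real system the two slots would be pinned host buffers or device buffers filled by asynchronous copies, but the control flow is the same: the consumer stalls only when the producer cannot keep up.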
This does not mean that every GPU can train every model. GPU memory, CUDA support, host RAM, PCIe bandwidth, storage throughput, and kernel compatibility still matter.
But it does mean that NexTrain is designed to reduce the need for premium GPU-GPU interconnects. That changes the economics of custom model training.
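A back-of-envelope check shows why PCIe bandwidth still matters and when it stops being the limit: parameter streaming stays hidden as long as moving a layer's weights takes no longer than computing on them. The Python sketch below is an illustration under stated assumptions, not a NexTrain planning tool. It uses the common approximation of about 6 FLOPs per parameter per token for a forward and backward pass, the single-GPU throughput of 226 TFLOPS from our early benchmarks, and a hypothetical effective host-to-GPU bandwidth of 25 GB/s. All function names are made up for this example.

```python
def transfer_seconds(param_bytes, link_gb_per_s):
    """Time to move param_bytes over a link with the given effective GB/s."""
    return param_bytes / (link_gb_per_s * 1e9)


def compute_seconds(num_params, tokens, sustained_tflops):
    """Approximate forward+backward time, assuming ~6 FLOPs per parameter per token."""
    flops = 6 * num_params * tokens
    return flops / (sustained_tflops * 1e12)


def streaming_is_hidden(num_params, bytes_per_param, tokens,
                        link_gb_per_s, sustained_tflops):
    """True if streaming the weights fits inside the compute time for them."""
    t_move = transfer_seconds(num_params * bytes_per_param, link_gb_per_s)
    t_math = compute_seconds(num_params, tokens, sustained_tflops)
    return t_move <= t_math


# Assumed example: a 0.5B-parameter block in 16-bit precision (2 bytes/param),
# 25 GB/s effective link bandwidth (hypothetical), 226 TFLOPS sustained compute.
for tokens in (1_024, 16_384):
    hidden = streaming_is_hidden(
        num_params=0.5e9, bytes_per_param=2, tokens=tokens,
        link_gb_per_s=25, sustained_tflops=226,
    )
    print(f"tokens per step = {tokens}: streaming hidden = {hidden}")
```

Under this simplified model the parameter count cancels out, so the answer depends on tokens processed per step rather than on layer size. It ignores gradient and optimizer-state traffic, which a real plan would also have to budget for.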
Why removing NVLink from the critical path matters
NVLink is powerful. For many workloads, it is extremely useful.
But requiring NVLink-class infrastructure for custom model training creates a major barrier. Many companies have access to GPUs, but not the right kind of GPU cluster. They may have compute, but not the premium interconnect.
NexTrain is designed to make that compute usable.
The key idea is not to replace NVLink as hardware. The key idea is to avoid making NVLink mandatory.
If training can be structured so that GPUs do not need to constantly communicate with each other, then customers can train models on more accessible hardware. This enables:
- Lower training cost
- Broader GPU compatibility
- Better use of existing hardware
- More flexible GPU sourcing
- Faster experimentation
- Reduced dependency on a single cloud or cluster type
Early benchmark results
In early experiments, we observed near-linear scaling from a single GPU to eight GPUs.
| Configuration | Throughput | Efficiency vs. linear scaling |
|---|---|---|
| 1 GPU (baseline) | 226 TFLOPS | n/a |
| 8 GPUs (measured) | 1.72 PFLOPS | ~95% |
| 8 GPUs (theoretical linear) | 1.808 PFLOPS | 100% (target) |
A single GPU reached approximately 226 TFLOPS, so perfect linear scaling to eight GPUs would give 8 × 226 = 1,808 TFLOPS (1.808 PFLOPS). The eight-GPU configuration reached over 1.72 PFLOPS, which is approximately 95% scaling efficiency.
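The efficiency figure follows from simple arithmetic, sketched here in Python using only the throughput numbers reported above. The helper name `scaling_efficiency` is ours for illustration.

```python
def scaling_efficiency(single_gpu_tflops, multi_gpu_tflops, num_gpus):
    """Measured multi-GPU throughput divided by the ideal linear target."""
    ideal_tflops = single_gpu_tflops * num_gpus
    return multi_gpu_tflops / ideal_tflops


ideal = 226 * 8  # 1808 TFLOPS, i.e. 1.808 PFLOPS
efficiency = scaling_efficiency(226.0, 1720.0, 8)
print(f"ideal = {ideal} TFLOPS, efficiency = {efficiency:.1%}")
# prints: ideal = 1808 TFLOPS, efficiency = 95.1%
```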
This result suggests that a training architecture with minimized GPU-GPU communication can preserve strong multi-GPU efficiency.
We are continuing to benchmark NexTrain across different GPU types, model sizes, sequence lengths, and training workloads.
What NexTrain is not
To be clear about scope:
- NexTrain is not a GPU cloud.
- NexTrain is not an inference platform.
- NexTrain is not a model hosting service.
- NexTrain is not trying to replace NVIDIA hardware.
NexTrain is training infrastructure that helps companies train custom AI models on flexible GPU configurations. The goal is not to sell GPUs. The goal is to make more GPUs usable for training.
How PocketBrains uses NexTrain
PocketBrains combines NexTrain with PocketAgentic, our data preparation layer.
PocketAgentic prepares messy enterprise data into training-ready datasets. NexTrain trains custom models from those datasets. The workflow looks like this:
Raw Enterprise Data
↓ PocketAgentic
Training-Ready Dataset
↓ NexTrain
Custom AI Model
This lets customers go from raw internal data to a trained model without needing to build the entire data and training stack themselves.
What customers can train
NexTrain supports supervised fine-tuning (SFT) and reinforcement learning (RL). Customers can bring their own model definition as a Python file, or use a standard base model.
Use cases include:
- Domain-specific language understanding
- Structured output generation
- Formatting and extraction models
- Reasoning and planning models
- Enterprise knowledge models
- Customer support workflows
- Internal automation
- Vertical AI applications
Why this matters now
The next wave of AI adoption will not be powered only by general-purpose frontier models. Enterprises want smaller, cheaper, specialized models trained on their own data.
But custom training is still too hard. The dataset is messy. The training stack is complicated. The GPU requirements are expensive.
NexTrain is built for this gap.
We believe the future of custom AI model training will be more flexible, more hardware-aware, and less dependent on premium interconnect clusters.
Our vision
PocketBrains is building the data-to-model training stack for custom AI.
PocketAgentic prepares the data. NexTrain trains the model. Customers choose the GPU path that works for them.
Bring your data. Choose your GPUs. Train your model.