UC San Diego's Hao AI Lab just got its hands on one of NVIDIA's most powerful systems, and it's already reshaping how researchers think about serving large language models in production. The Hao AI Lab, which has been quietly influencing how companies like NVIDIA architect their AI infrastructure, is now using the DGX B200 to push the boundaries of fast, low-latency AI responses. The research coming out of UC San Diego doesn't stay in the lab—it's already powering real-world systems.
The DGX B200 is now housed at the university's San Diego Supercomputer Center, giving the Hao AI Lab's researchers immediate access to enterprise-grade computing power most academic labs can only dream about.
Here's what makes this significant: the Hao AI Lab isn't just consuming AI infrastructure. It's helping design how AI infrastructure should work. The lab's research on DistServe, a novel approach to serving large language models, directly influenced the architecture of NVIDIA Dynamo, an open-source framework now deployed in production systems worldwide. The new DGX B200 gives the team the hardware to push that research even further.
"DGX B200 is one of the most powerful AI systems from NVIDIA to date, which means that its performance is among the best in the world," said Hao Zhang, an assistant professor at UC San Diego's Halıcıoğlu Data Science Institute. "It enables us to prototype and experiment much faster than using previous-generation hardware." Translation: the team can now test more ideas, faster, with more compute—which is exactly what drives breakthrough research.
Two projects are already moving forward at full speed. FastVideo focuses on training video generation models that can produce a five-second video from a text prompt in roughly five seconds. That's the kind of real-time capability that changes what's possible in production systems. The team's also tapping NVIDIA H200 GPUs alongside the DGX B200 for this research phase, effectively throwing serious horsepower at the problem.
The second project, Lmgame-Bench, sounds deceptively playful—it's a benchmarking suite that tests large language models using popular video games like Tetris and Super Mario Bros. But there's serious research underneath. By measuring how different LLMs handle game-playing tasks, the team can compare model performance in real-time gameplay scenarios. Users can test one model or pit two against each other, giving researchers a new dimension for understanding model capabilities beyond traditional benchmarks.
But the real breakthrough here traces back to something the Hao AI Lab figured out earlier: how to think about LLM serving differently. For years, the industry obsessed over throughput—the raw number of tokens a system can generate per second. More throughput meant lower cost per token. Simple math. Except it wasn't the complete picture.
Throughput says nothing about what users actually experience. If you're waiting for a response, you don't care how many tokens per second the system generates in aggregate. You care about latency—how fast your response comes back. This created a fundamental tension: push for high throughput and latency suffers. Optimize for latency and you sacrifice throughput.
The DistServe team proposed a solution: a new metric called "goodput." Here's how it works: goodput counts only the throughput a system delivers while still meeting user-specified latency targets, which the industry calls service-level objectives (SLOs). It's the sweet spot where cost and user experience actually align. A system with great goodput is handling many requests efficiently and keeping individual users happy.
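To make the difference concrete, here is a minimal sketch of how throughput and goodput diverge on the same traffic. It is an illustration, not DistServe's code: the `RequestStats` fields, the two latency targets, and the sample numbers are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestStats:
    """Hypothetical per-request measurements (field names are assumptions)."""
    first_token_s: float  # seconds until the first token arrived
    per_token_s: float    # average seconds between later tokens


def throughput(requests: list[RequestStats], window_s: float) -> float:
    """All completed requests per second, regardless of how slow they felt."""
    return len(requests) / window_s


def goodput(requests: list[RequestStats], window_s: float,
            first_token_slo_s: float, per_token_slo_s: float) -> float:
    """Only the requests per second that met both latency targets."""
    within_slo = sum(
        1 for r in requests
        if r.first_token_s <= first_token_slo_s
        and r.per_token_s <= per_token_slo_s
    )
    return within_slo / window_s


# Four made-up requests served in a two-second window.
served = [
    RequestStats(first_token_s=0.2, per_token_s=0.03),  # meets both targets
    RequestStats(first_token_s=0.4, per_token_s=0.04),  # meets both targets
    RequestStats(first_token_s=0.9, per_token_s=0.03),  # first token too slow
    RequestStats(first_token_s=0.3, per_token_s=0.08),  # later tokens too slow
]

raw = throughput(served, window_s=2.0)
good = goodput(served, window_s=2.0,
               first_token_slo_s=0.5, per_token_slo_s=0.05)
print(raw, good)
```

In this made-up window the system finishes two requests per second, but only half of them arrive within the targets, so its goodput is one request per second.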
Turning this insight into reality required rethinking how GPUs do the work. When an LLM processes a user request, the work happens in two phases. First comes prefill: the system processes the user's entire input and generates the first token. Then comes decode: the system generates the subsequent tokens one at a time, predicting each new token from everything that came before.
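The two phases can be sketched with a toy stand-in for a model. Nothing here is a real LLM: `toy_next_token`, the integer "state" that stands in for the cached context, and the vocabulary size are all hypothetical, chosen only to show that prefill touches the whole prompt once while decode advances one token at a time.

```python
VOCAB_SIZE = 50  # hypothetical toy vocabulary


def toy_next_token(state: int) -> int:
    """Stand-in for a model forward pass; not a real LLM."""
    return (state * 31 + 7) % VOCAB_SIZE


def prefill(prompt_tokens: list[int]) -> tuple[int, int]:
    """Process the whole prompt in one pass.

    Returns the cached state and the first output token.
    """
    state = sum(prompt_tokens)  # stands in for building the cached context
    return state, toy_next_token(state)


def decode(state: int, first_token: int, max_new_tokens: int) -> list[int]:
    """Generate the remaining tokens one at a time from the cached state."""
    output = [first_token]
    while len(output) < max_new_tokens:
        state += output[-1]  # extend the cache with only the newest token
        output.append(toy_next_token(state))
    return output


state, first = prefill([3, 1, 4])
tokens = decode(state, first, max_new_tokens=4)
print(tokens)
```

The split into two functions mirrors the article's point: the first call does one large pass over the input, and every later token comes from a small, repeated step.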
Here's the problem nobody had solved elegantly: these two operations have fundamentally different hardware demands. Prefill is compute-intensive, because it works through every token of the input at once. Decode is memory-intensive, because each new token requires rereading the model and the context built up so far. When they run on the same GPU, they compete for resources, creating bottlenecks. Split them onto separate GPUs, with one pool handling prefill and another handling decode, and you eliminate that interference. Both operations run faster. That's prefill/decode disaggregation in practice.
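A toy timing model shows the interference from the point of view of one user who is already receiving tokens. It is a simplified sketch, not a measurement: the 200 ms prefill and 20 ms decode-step costs are hypothetical, and the model ignores the cost of handing a request's intermediate state from the prefill pool to the decode pool.

```python
PREFILL_MS = 200.0     # hypothetical time to prefill one new prompt
DECODE_STEP_MS = 20.0  # hypothetical time for one decode step


def colocated_gaps(num_steps: int, prefill_at_steps: set[int]) -> list[float]:
    """Gaps between tokens for an ongoing request on a shared GPU.

    Whenever a new prompt arrives, its prefill runs first and the
    ongoing request's next token waits behind it.
    """
    gaps = []
    for step in range(num_steps):
        gap = DECODE_STEP_MS
        if step in prefill_at_steps:
            gap += PREFILL_MS  # decode step stalls behind the prefill
        gaps.append(gap)
    return gaps


def disaggregated_gaps(num_steps: int, prefill_at_steps: set[int]) -> list[float]:
    """Gaps between tokens when prefill runs on a separate GPU pool.

    New prompts still arrive, but they never occupy the decode GPU,
    so prefill_at_steps has no effect here.
    """
    return [DECODE_STEP_MS for _ in range(num_steps)]


arrivals = {2, 5}  # new prompts arrive during decode steps 2 and 5
shared = colocated_gaps(8, arrivals)
split = disaggregated_gaps(8, arrivals)
print(max(shared), max(split))
```

Under these assumed numbers, the worst pause between tokens on the shared GPU is eleven times longer than on the dedicated decode GPU, which is the kind of stall a latency target is meant to catch.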
"Previously, if you put these two jobs on a GPU, they would compete with each other for resources, which could make it slow from a user perspective," explained Junda Chen, a doctoral candidate working on the project. "Now, if I split the jobs onto two different sets of GPUs—one doing prefill, which is compute intensive, and the other doing decode, which is more memory intensive—we can fundamentally eliminate the interference between the two jobs, making both jobs run faster."
This isn't theoretical. NVIDIA's Dynamo framework baked in the ability to implement disaggregated inference at scale. The framework, designed to accelerate and scale generative AI models with maximum efficiency and minimum cost, now powers production deployments.
The DGX B200 gives the Hao AI Lab room to keep pushing this envelope. Beyond video generation and game-based benchmarking, the team's exploring the next frontier: even lower-latency LLM serving. Making models respond faster without sacrificing quality. It's the kind of research that sounds niche until you realize every production AI system in the world cares about it.
UC San Diego is also pursuing cross-departmental collaborations using the DGX B200—healthcare, biology, and other fields that increasingly depend on AI acceleration. The hardware's unlocking research across the university that simply wasn't feasible before.
The Hao AI Lab's access to the DGX B200 represents something bigger than a single university getting new equipment. This is academic research infrastructure feeding directly back into production AI systems. The research concepts born in UC San Diego laboratories are already shaping how companies deploy AI at scale, and now the team has the horsepower to push that boundary further. The lab's work on disaggregated inference and the "goodput" metric is likely to influence the next generation of AI serving frameworks, which means the research happening right now matters for every application relying on fast, efficient LLM responses.