{"id":219828,"date":"2025-08-11T00:02:23","date_gmt":"2025-08-11T05:02:23","guid":{"rendered":"https:\/\/lifeboat.com\/blog\/2025\/08\/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge"},"modified":"2025-08-11T00:02:23","modified_gmt":"2025-08-11T05:02:23","slug":"delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge","status":"publish","type":"post","link":"https:\/\/lifeboat.com\/blog\/2025\/08\/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge","title":{"rendered":"Delivering 1.5 M TPS Inference on NVIDIA GB200 NVL72, NVIDIA Accelerates OpenAI gpt-oss Models from Cloud to Edge"},"content":{"rendered":"<p style=\"padding-right: 20px\"><a class=\"aligncenter blog-photo\" href=\"https:\/\/lifeboat.com\/blog.images\/delivering-1-5-m-tps-inference-on-nvidia-gb200-nvl72-nvidia-accelerates-openai-gpt-oss-models-from-cloud-to-edge2.jpg\"><\/a><\/p>\n<p>NVIDIA and OpenAI began <a href=\"https:\/\/youtube.com\/shorts\/vRWj8WbI6AU?feature=shared\" target=\"_blank\" rel=\"noreferrer noopener follow external\">pushing the boundaries<\/a> of AI with the launch of NVIDIA DGX back in 2016. The collaborative AI innovation continues with the OpenAI gpt-oss-20b and gpt-oss-120b launch. NVIDIA has optimized both new open-weight models for accelerated inference performance on NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second (TPS) on an NVIDIA GB200 NVL72 system.<\/p>\n<p>The gpt-oss models are text-reasoning LLMs with chain-of-thought and tool-calling capabilities using the popular mixture of experts (MoE) architecture with SwigGLU activations. The attention layers use RoPE with 128k context, alternating between full context and a sliding 128-token window. The models are released in FP4 precision, which fits on a single 80 GB data center GPU and is natively supported by Blackwell.<\/p>\n<p>The models were trained on NVIDIA H100 Tensor Core GPUs, with gpt-oss-120b requiring over 2.1 million hours and gpt-oss-20b about 10x less. NVIDIA worked with several top open-source frameworks such as <a href=\"https:\/\/huggingface.co\/blog\/welcome-openai-gpt-oss\" target=\"_blank\" rel=\"noreferrer noopener follow external\">Hugging Face Transformers<\/a>, Ollama, and vLLM, in addition to <a href=\"https:\/\/github.com\/NVIDIA\/TensorRT-LLM\" target=\"_blank\" rel=\"noreferrer noopener follow external\">NVIDIA TensorRT-LLM<\/a> for optimized kernels and model enhancements. This blog post showcases how NVIDIA has integrated gpt-oss across the software platform to meet developers\u2019 needs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>NVIDIA and OpenAI began pushing the boundaries of AI with the launch of NVIDIA DGX back in 2016. The collaborative AI innovation continues with the OpenAI gpt-oss-20b and gpt-oss-120b launch. NVIDIA has optimized both new open-weight models for accelerated inference performance on NVIDIA Blackwell architecture, delivering up to 1.5 million tokens per second (TPS) on [\u2026]<\/p>\n","protected":false},"author":732,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1522,6],"tags":[],"class_list":["post-219828","post","type-post","status-publish","format-standard","hentry","category-innovation","category-robotics-ai"],"_links":{"self":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/219828","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/users\/732"}],"replies":[{"embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/comments?post=219828"}],"version-history":[{"count":0,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/posts\/219828\/revisions"}],"wp:attachment":[{"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/media?parent=219828"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/categories?post=219828"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lifeboat.com\/blog\/wp-json\/wp\/v2\/tags?post=219828"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}