OpenAI Teams Up With NVIDIA, AMD, Intel on MRC to Fix AI Supercomputer Failures

OpenAI has introduced the MRC networking protocol in partnership with NVIDIA, Microsoft, AMD, Intel, and Broadcom to improve AI supercomputer performance, reduce network failures, and speed up the training of advanced AI models across the massive GPU clusters powering ChatGPT and future AI systems.
Written By: Somatirtha
Reviewed By: Sankha Ghosh

OpenAI has introduced a new networking protocol called Multipath Reliable Connection (MRC), developed in partnership with major technology companies including NVIDIA, Microsoft, AMD, Intel, and Broadcom. The company says the protocol will help make AI supercomputer networks faster, more reliable, and more efficient during the training of advanced AI models.

The announcement comes as OpenAI scales its infrastructure to support growing demand for ChatGPT. The company recently said that more than 900 million people now use the chatbot every week, increasing the need for high-performance systems capable of processing large volumes of data without delays.

What is MRC?

MRC is a networking protocol designed to improve communication between GPUs in large AI training clusters. OpenAI described it as a “novel protocol” built into the latest 800Gb/s network interfaces to boost performance and resilience during AI model training.

Traditional network transports used in AI clusters typically send each data flow through a single network path. Congestion or a hardware failure on that path can slow training or interrupt workloads entirely. OpenAI says MRC addresses this issue by distributing data packets across hundreds of network paths simultaneously.

The company said this approach reduces congestion and allows systems to quickly bypass failed connections, helping training operations continue without major interruptions.
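The MRC specification itself is not quoted in this article, but the multipath idea it describes can be sketched in a few lines. The toy model below (all names and numbers are illustrative, not from the protocol) sprays packets across many parallel paths and resends only the packets that were on a path that failed:

```python
import random

def send_multipath(num_packets, paths, failed=frozenset()):
    """Assign each packet to a randomly chosen healthy path.

    Toy illustration of packet spraying: traffic is spread across
    all surviving paths instead of pinned to a single one.
    """
    healthy = [p for p in paths if p not in failed]
    if not healthy:
        raise RuntimeError("no healthy paths available")
    return {pkt: random.choice(healthy) for pkt in range(num_packets)}

paths = list(range(16))               # 16 parallel paths between two endpoints
first = send_multipath(1000, paths)   # initial transfer, spread over all paths

# Path 3 fails mid-transfer: only its packets need to be resent,
# and they are rerouted over the 15 surviving paths.
lost = [pkt for pkt, path in first.items() if path == 3]
retry = send_multipath(len(lost), paths, failed={3})
```

Because each path carries only a small fraction of the traffic, a single failure strands only that fraction of the packets rather than the whole transfer, which is the resilience property the article attributes to MRC.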

Also Read: OpenAI and Anthropic Step into Services Business as Indian IT Adapts

Why it Matters for AI Training

Training advanced AI models requires a massive computing infrastructure where GPUs constantly exchange information. OpenAI explained that a single training step can involve millions of data transfers across thousands of GPUs.

Even a single delayed transfer can leave expensive GPUs idle, wasting computing resources and increasing training time. OpenAI said MRC aims to reduce these bottlenecks by delivering more predictable network performance, even in the presence of failures.
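A back-of-envelope calculation shows why a single slow transfer is so costly. In a synchronous training step, every GPU waits for the slowest transfer to finish before proceeding; the numbers below are assumptions chosen for illustration, not figures from OpenAI:

```python
# Illustrative straggler arithmetic for a synchronous training step.
num_gpus = 10_000       # assumed cluster size
typical_ms = 5.0        # assumed typical transfer time (ms)
delayed_ms = 50.0       # one transfer hits congestion on its path (ms)

# The step completes only when the last transfer does, so one delayed
# transfer stretches the whole step from 5 ms to 50 ms.
delay_ms = delayed_ms - typical_ms

# Every other GPU sits idle for that extra time.
idle_gpu_ms = delay_ms * (num_gpus - 1)
idle_gpu_seconds = idle_gpu_ms / 1000.0
```

Under these assumptions, one 45 ms straggler wastes roughly 450 GPU-seconds of compute in a single step, which is why delivering predictable tail latency matters as much as raw bandwidth.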

“Our goal was not just to build a fast network, but also to build one that delivers very predictable performance, even in the presence of failures, to keep training jobs moving,” the company said in a blog post.

OpenAI said the protocol has already been deployed on its largest NVIDIA GB200 supercomputers. The company has also shared the MRC specification through the Open Compute Project, allowing other companies and researchers to adopt the technology.

Analytics Insight: Top Tech & Crypto Publication | Latest AI, Tech, Crypto News
www.analyticsinsight.net