This year’s Red Hat Summit is all about AI inferencing. The open-source company sees a major role for itself in the technology that unlocks the full potential of artificial intelligence, much as it once did for Linux. With two new initiatives, the AI Inference Server and the llm-d community, it wants to give companies the resources to further professionalize their infrastructure for the AI era.
As far as Red Hat is concerned, AI inferencing needs to be pushed in the right direction. Companies are investing heavily in training models: they spend a great deal of time preparing large datasets and feeding them to the model so it can make connections and identify anomalies. But ultimately, that data must produce usable output.
AI inferencing aims to do just that: it is the component that makes AI operational. The model can apply what it has learned during training to real-world situations. The ability to recognize patterns and draw conclusions distinguishes AI from other technologies. This inferencing capability can help with anything from everyday tasks to highly complex computer programming. Its strength lies in the speed and accuracy with which systems can make decisions based on large amounts of data.
However, according to Red Hat, due to the complexity of generative AI models and the increasing scale of production implementations, AI inference is becoming a bottleneck for companies. Inference consumes enormous amounts of hardware resources, which can reduce responsiveness and increase operational costs. The new AI Inference Server and the llm-d community are designed to nip these challenges in the bud.
Democratization of AI inference
The Red Hat AI Inference Server is designed for high performance and features tools for model compression and optimization. To achieve this, Red Hat relies on two core components. The first comes from Neural Magic, a startup acquired by Red Hat late last year. Neural Magic’s technology optimizes AI models so they run faster on standard processors and GPUs. Its software makes clever use of the available processor memory to achieve this, allowing AI workloads to reach speeds comparable to those of specialized AI chips.
Neural Magic also plays a role in the second component of the Red Hat AI Inference Server. Before the acquisition, the startup was involved in the open-source vLLM project for model serving, and it will continue to be involved as a commercial contributor. This community project provides an inference engine for LLMs, making it a natural second foundation for the server. Neural Magic is closely involved in model and system optimization to improve vLLM performance at scale. vLLM improves latency and resource efficiency and delivers high throughput for generative AI inferencing. By optimizing memory management during token generation, it lets models serve many users quickly and efficiently. Support is available for large input contexts, multi-GPU model acceleration, and continuous batching.
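To make this concrete, below is a minimal sketch of vLLM’s offline Python API; the model id, GPU count, and sampling settings are illustrative choices, not Red Hat defaults.

```python
# Minimal sketch of vLLM's offline Python API (illustrative settings, not
# Red Hat defaults). Requires `pip install vllm` and two GPUs for the
# tensor-parallel setting shown here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",  # any supported Hugging Face model id
    tensor_parallel_size=2,                      # split the model across two GPUs
    max_model_len=8192,                          # allow large input contexts
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# vLLM batches these prompts continuously and pages the KV cache in GPU memory,
# which is how it keeps many concurrent users on the same hardware.
outputs = llm.generate(
    [
        "Summarize the benefits of an open inference server.",
        "Explain continuous batching in one paragraph.",
    ],
    params,
)
for out in outputs:
    print(out.outputs[0].text)
```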
Broad support for AI models
When the new server becomes available, important generative AI models will be supported immediately. DeepSeek, Gemma, Llama, Llama Nemotron, Mistral, and Phi are among them. More are supported, although it is unclear from the information currently available which ones. Companies building generative AI models are increasingly embracing vLLM, which bodes well for the project’s success.
The Red Hat AI Inference Server is available as a standalone containerized solution or as part of both RHEL AI (Red Hat Enterprise Linux AI) and Red Hat OpenShift AI. In each deployment environment, Red Hat AI Inference Server provides users with a hardened, supported distribution of vLLM.
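Because the server ships a distribution of vLLM, a running instance can be queried over vLLM’s OpenAI-compatible HTTP API. A minimal sketch follows; the endpoint URL and model id are placeholders for whatever a specific deployment actually exposes.

```python
# Sketch: querying a running vLLM-based inference server over its
# OpenAI-compatible API. The URL and model id below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")  # placeholder endpoint

response = client.chat.completions.create(
    model="RedHatAI/example-model",  # hypothetical model id served by the instance
    messages=[{"role": "user", "content": "What does an inference server do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```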
In addition, the server features intelligent LLM compression tools for reducing the size of foundation and fine-tuned models. This minimizes the required computing power while maintaining, and potentially even improving, model accuracy. On top of that, there is an optimized model repository, hosted in the Red Hat AI organization on Hugging Face, which provides direct access to a validated and optimized collection of AI models. The models are ready for inference deployment, which Red Hat says helps increase efficiency by two to four times without compromising model accuracy. The server also offers third-party support, meaning it can be deployed on non-Red Hat Linux and Kubernetes platforms.
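As a rough illustration of why weight compression translates into gains of that order, here is a back-of-the-envelope calculation of weight memory at different precisions; the parameter count is arbitrary, and the figures ignore activations and the KV cache.

```python
# Back-of-the-envelope sketch: how weight precision affects GPU memory needs.
# The parameter count is arbitrary; activations and KV cache are ignored.
params = 8e9  # an 8-billion-parameter model, purely as an example

bytes_per_weight = {
    "fp16/bf16 (uncompressed)": 2.0,
    "int8 weights (w8a8)": 1.0,
    "int4 weights (w4a16)": 0.5,
}

for fmt, nbytes in bytes_per_weight.items():
    print(f"{fmt:>26}: ~{params * nbytes / 1e9:.0f} GB of weights")
# fp16 -> ~16 GB, int8 -> ~8 GB, int4 -> ~4 GB: roughly the 2-4x gain cited above.
```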
Every model, every accelerator, every cloud
During the keynote at the Red Hat Summit, CEO Matt Hicks also made it clear how the new server aligns with the open-source company’s vision. The company envisions a future full of AI, defined by unlimited possibilities and not limited by infrastructure silos. The company sees a horizon where organizations can implement any model, on any accelerator, and in any cloud environment. This should result in an exceptional, consistent user experience without excessive costs.
Companies need a universal inferencing platform to unlock the full potential of generative AI investments. This platform will serve as the standard for smoother, high-quality AI innovation, both now and in the years to come. Just as Red Hat pioneered the open enterprise by transforming Linux into the foundation of modern IT, the company now wants to shape the future of AI inferencing.
To achieve this, Red Hat will do everything in its power to build a thriving ecosystem around the vLLM community, so that it becomes the definitive open standard for inferencing in the hybrid cloud. This is also where the second new open-source initiative, llm-d, comes in. The project for distributed inferencing at scale is being launched at the Red Hat Summit.
Scalable inferencing as a critical factor
The llm-d initiative was created in collaboration with CoreWeave, Google Cloud, IBM Research, and Nvidia. It aims to make production-grade generative AI as ubiquitous as Linux. This technology allows organizations to run AI models more efficiently without skyrocketing costs and latency.
Red Hat describes llm-d primarily as a visionary project that can further address growing resource demands. It is intended to extend the power of vLLM and break through the limitations of single-server solutions. To do this, it leverages the orchestration capabilities of Kubernetes to integrate distributed inferencing into existing IT infrastructures. This gives IT teams the means to meet the diverse requirements of business-critical workloads. At the same time, it should drastically reduce the total cost of ownership of high-performance AI accelerators.
Technical possibilities
From a technical perspective, the llm-d project offers six interesting options. These are listed below.
High-performance communication APIs for faster and more efficient data transfer between servers, with support for Nvidia Inference Xfer Library (NIXL).
vLLM, which has become the de facto standard inference server in open source, with support for the latest frontier models and a wide range of accelerators, including Google Cloud Tensor Processing Units (TPUs).
Prefill and Decode Disaggregation, which splits the input context and token generation phases of AI into separate operations so they can be distributed across multiple servers (a toy sketch of this idea follows the list below).
KV (key-value) Cache Offloading, based on LMCache, shifts the memory load of the KV cache from GPU memory to more cost-efficient and widely available standard storage, such as CPU memory or network storage.
Kubernetes-driven clusters and controllers for more efficient planning of compute and storage resources across varying workloads, while maintaining performance and keeping latency low.
AI-Aware Network Routing for forwarding incoming requests to servers and accelerators that are most likely to have hot caches from previous inference computations.
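As referenced in the disaggregation item above, the following toy sketch illustrates the idea behind the prefill/decode split and KV-cache offloading. It is a conceptual stand-in rather than llm-d code: the shapes, dtypes, and “next token” logic are placeholders.

```python
# Toy illustration (not llm-d code) of prefill/decode disaggregation and
# KV-cache offloading. Shapes, dtypes and the "next token" logic are placeholders.
import numpy as np

D_MODEL = 64  # stand-in hidden size

def prefill(prompt_tokens: list[int]) -> np.ndarray:
    """Compute-heavy phase: build a KV cache covering the whole prompt."""
    rng = np.random.default_rng(len(prompt_tokens))
    # One key/value vector per prompt token, standing in for real attention state.
    return rng.standard_normal((len(prompt_tokens), D_MODEL)).astype(np.float16)

def offload(kv_cache: np.ndarray) -> bytes:
    """Move the cache off the accelerator (here simply serialized to host memory)."""
    return kv_cache.tobytes()

def decode(kv_bytes: bytes, max_new_tokens: int = 4) -> list[int]:
    """Memory-bound phase: reuse the cached context to emit tokens one at a time."""
    kv = np.frombuffer(kv_bytes, dtype=np.float16).reshape(-1, D_MODEL)
    tokens = []
    for step in range(max_new_tokens):
        # Placeholder for sampling the next token from the cached context.
        tokens.append((int(abs(float(kv.sum())) * 1000) + step) % 1000)
    return tokens

prompt = [101, 2023, 2003, 102]          # placeholder token ids
kv_cache = prefill(prompt)               # could run on a prefill-optimized server
new_tokens = decode(offload(kv_cache))   # could run on a separate, cheaper decode server
print(new_tokens)
```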
Industry support
The companies mentioned above—CoreWeave, Google Cloud, IBM Research, and Nvidia—are founding contributors. AMD, Cisco, Intel, Lambda, and Mistral AI have also joined as partners. The llm-d community is further supported by the University of California’s Sky Computing Lab, creators of vLLM, and the LMCache Lab at the University of Chicago, creators of LMCache.
“The launch of the llm-d community, backed by a vanguard of AI leaders, marks a pivotal moment in addressing the need for scalable gen AI inference, a crucial obstacle that must be overcome to enable broader enterprise AI adoption,” explained Red Hat AI CTO Brian Stevens at the launch. “By tapping the innovation of vLLM and the proven capabilities of Kubernetes, llm-d paves the way for distributed, scalable and high-performing AI inference across the expanded hybrid cloud, supporting any model, any accelerator, on any cloud environment and helping realize a vision of limitless AI potential.”