
Intel and Weizmann researchers unveil breakthrough to make AI models run faster
New speculative decoding technique promises up to 2.8x faster language model performance.
A team of researchers from Intel Labs and Israel’s Weizmann Institute of Science has presented a new approach they say could significantly boost the performance of large language models (LLMs), the generative AI systems behind ChatGPT and similar chatbots, while helping make AI development more open and cost-efficient.
The technique, presented at this year’s International Conference on Machine Learning (ICML) in Vancouver, focuses on a method known as speculative decoding, an inference optimization that pairs a smaller, faster “draft” model with a larger, more precise one. The draft model quickly generates a probable chunk of text in response to a user’s prompt, which the larger model then checks for accuracy. Because the larger model can verify a whole drafted chunk at once rather than generating each token itself, the combined system can deliver responses up to 2.8 times faster than conventional approaches, according to the research team.
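In outline, the loop behind speculative decoding is simple: the draft model proposes a short run of tokens, the target model checks them, and the system keeps the longest prefix both agree on before starting the next round. The toy Python sketch below illustrates that draft-then-verify pattern with stand-in functions rather than real models; the function names, the tiny lookup tables, and the greedy acceptance rule are illustrative assumptions, not the researchers’ algorithms.

```python
# Toy illustration of the draft-then-verify loop in speculative decoding.
# `draft_next` and `target_next` are stand-ins, not real language models,
# and greedy acceptance is a simplification of real verification rules.

def draft_next(context: list[str]) -> str:
    """Cheap 'draft' model: guesses the next word from a fixed table."""
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "the"}
    return table.get(context[-1], "mat")

def target_next(context: list[str]) -> str:
    """Expensive 'target' model: the answer we actually trust."""
    table = {"the": "cat", "cat": "sat", "sat": "on", "on": "a", "a": "mat"}
    return table.get(context[-1], ".")

def speculative_generate(prompt: list[str], steps: int = 8, k: int = 4) -> list[str]:
    out = list(prompt)
    while len(out) < len(prompt) + steps:
        # 1. The draft model cheaply proposes up to k tokens.
        draft, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2. The target model verifies the proposals: keep the agreed prefix,
        #    and at the first disagreement use the target's own token instead.
        for tok in draft:
            expected = target_next(out)
            out.append(expected)
            if expected != tok:
                break  # discard the rest of the draft and start a new round
    return out

print(" ".join(speculative_generate(["the"])))
```

When the draft model guesses well, several tokens are accepted per round, which is where the speedup comes from; when it guesses poorly, the output is still whatever the larger model would have produced on its own.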
The advance addresses one of the core technical challenges in generative AI: the heavy computational cost of running powerful models at scale. “We have solved a core inefficiency in generative AI,” said Oren Pereg, a senior researcher in Intel Labs’ Natural Language Processing Group. “Our research shows how to turn speculative acceleration into a universal tool. This isn’t just a theoretical improvement; these are practical tools already helping developers build faster and smarter applications today.”
While speculative decoding is not a new idea in itself, past approaches have typically required the small and large models to share the same vocabulary or to be trained together. This limited its practical use, as developers had to build custom small models compatible with each specific large model.
The researchers’ new method removes that constraint. By introducing three new algorithms that decouple speculative decoding from vocabulary alignment, the team says developers can now combine models from entirely different sources, or even different vendors, without needing to retrain them together. In an AI market increasingly fragmented by proprietary ecosystems, the ability to mix and match models could make it easier to deploy generative AI systems across a wider range of hardware, from cloud data centers to edge devices.
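The article does not detail the three algorithms, but one intuitive way to bridge mismatched vocabularies is to treat plain text as the common interface: decode the draft model’s tokens back into a string, then re-tokenize that string with the target model’s tokenizer before verification. The sketch below shows only that general idea, using two arbitrary Hugging Face tokenizers as stand-ins; it is not the paper’s method.

```python
# Illustrative only: text as the bridge between a draft model and a target
# model that use different tokenizers. Tokenizer checkpoints are arbitrary
# examples, and this is not the researchers' published algorithm.
from transformers import AutoTokenizer

draft_tok = AutoTokenizer.from_pretrained("gpt2")                 # draft vocabulary
target_tok = AutoTokenizer.from_pretrained("bert-base-uncased")   # target vocabulary

# Suppose the draft model has proposed these token IDs in its own vocabulary.
draft_ids = draft_tok("speculative decoding speeds up inference",
                      add_special_tokens=False)["input_ids"]

# Bridge: draft token IDs -> text -> target token IDs.
proposed_text = draft_tok.decode(draft_ids)
target_ids = target_tok(proposed_text, add_special_tokens=False)["input_ids"]

# The target model can now verify `target_ids` in its own vocabulary,
# even though the two models share no token IDs.
print(proposed_text)
print(target_ids)
```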
The technique is already integrated into the widely used open-source Hugging Face Transformers library, making it accessible to millions of developers without the need for custom implementation.
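In practice, using the feature amounts to passing a small assistant model to the library’s standard generation call. The sketch below assumes the Transformers assisted-generation interface that accepts an assistant_model along with both models’ tokenizers; the checkpoint names are placeholders, and any suitable target/draft pair with different vocabularies could be substituted.

```python
# Sketch of cross-vocabulary assisted generation with Hugging Face Transformers.
# Checkpoint names are placeholders, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_name = "Qwen/Qwen2.5-7B-Instruct"   # large, accurate target model (placeholder)
draft_name = "double7/vicuna-68m"          # small, fast draft model with a different vocabulary (placeholder)

tokenizer = AutoTokenizer.from_pretrained(target_name)
assistant_tokenizer = AutoTokenizer.from_pretrained(draft_name)

model = AutoModelForCausalLM.from_pretrained(target_name)
assistant_model = AutoModelForCausalLM.from_pretrained(draft_name)

inputs = tokenizer("Speculative decoding works by", return_tensors="pt")

# Passing both tokenizers lets the library translate between the two vocabularies.
outputs = model.generate(
    **inputs,
    assistant_model=assistant_model,
    tokenizer=tokenizer,
    assistant_tokenizer=assistant_tokenizer,
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```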
“This work removes a major technical barrier to making generative AI faster and cheaper,” said Nadav Timor, a Ph.D. student working under Prof. David Harel at the Weizmann Institute. “Our algorithms unlock state-of-the-art speedups that were previously available only to organizations that train their own small draft models.”