An inventory of transformers inference optimisation methods in the HuggingFace ecosystem
Latency is one of the main challenges in making machine learning impactful for an organisation. Depending on the latency requirements and the inference setup, reducing latency matters for cost efficiency, for scalability, or for both.
Machine Learning Engineering teams need to deliver value and iterate on their projects as efficiently as possible (I will go over ML/Data/AI roles in another post). Ecosystems like HuggingFace have made this much more accessible: beyond providing a unified set of libraries for developing models and sharing pretrained weights, the team has been exceptionally responsive in implementing state-of-the-art techniques across the model lifecycle (e.g. they released open-r1 mere weeks after DeepSeek R1 came out).
The latency of transformers inference can be improved by optimising:
- runtime
- model architecture
- request processing
In this post I take inventory of, and provide references for, methods to optimise transformers inference via runtime and model architecture changes. I will cover request processing in a separate post, as those best practices are not specific to HuggingFace.
Runtime
ONNX
Optimum library
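As a minimal sketch of what the Optimum library enables, the snippet below exports a transformers checkpoint to ONNX and runs it with the ONNX Runtime backend. The checkpoint name is a placeholder, and the `export=True` argument assumes a recent version of `optimum[onnxruntime]`.

```python
# pip install optimum[onnxruntime]  # assumed extra providing the ONNX Runtime backend
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint

# Export the PyTorch checkpoint to ONNX and load it with ONNX Runtime.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ORT model is a drop-in replacement inside a transformers pipeline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Latency optimisation makes this model much more useful."))
```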