Latency is one of the main challenges in making machine learning impactful for an organisation. Depending on the latency requirements and the inference method, reducing latency can be a matter of cost efficiency, of scalability, or both.

Machine Learning Engineering teams need to deliver value and iterate on their projects as efficiently as possible (I will go over ML/Data/AI roles in another post). Ecosystems like Hugging Face have made this much more accessible: from a unified set of libraries for developing models to a hub for sharing pretrained weights, the team has been exceptionally quick to implement state-of-the-art techniques across the model lifecycle (e.g. they released open-r1 mere weeks after DeepSeek R1 came out).

Transformer inference latency can be improved by optimising:

  • runtime
  • model architecture
  • request processing

In this post I take inventory of, and provide references for, the options for optimising transformer inference through the runtime and the model architecture. I will cover request processing in another post, as those best practices are not specific to Hugging Face.

Runtime

ONNX

GitHub - huggingface/optimum: 🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools

Optimum library
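
As a concrete starting point, here is a minimal sketch of exporting a Transformers checkpoint to ONNX and serving it through ONNX Runtime with Optimum; the checkpoint name is only an example.

```python
# Minimal sketch: export a Transformers checkpoint to ONNX on the fly and run it
# with ONNX Runtime through Optimum (pip install "optimum[onnxruntime]").
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint

# export=True converts the PyTorch weights to ONNX before loading them in ONNX Runtime
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Latency optimisation pays for itself."))
```

Optimum also exposes graph optimisation and quantisation passes on top of the exported model (ORTOptimizer, ORTQuantizer), which are usually the natural next step once the plain ONNX export works.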

Intel

GitHub - huggingface/optimum-intel: 🤗 Optimum Intel: Accelerate inference with Intel optimization tools
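
Usage mirrors the ONNX example above. The sketch below assumes optimum-intel is installed with the OpenVINO extra and simply swaps in the OpenVINO-backed model class.

```python
# Minimal sketch (assumes `pip install "optimum[openvino]"`); mirrors the ONNX
# example above with the OpenVINO-backed model class swapped in.
from optimum.intel import OVModelForSequenceClassification

model = OVModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english",  # example checkpoint
    export=True,  # convert the PyTorch weights to OpenVINO IR on the fly
)
```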

Nvidia

GitHub - huggingface/optimum-nvidia

Neuron

GitHub - huggingface/optimum-neuron: Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.

TPU

GitHub - huggingface/optimum-tpu: Google TPU optimizations for transformers models

Model architecture

Quantisation

GitHub - huggingface/optimum-quanto: A pytorch quantization backend for optimum
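
As an illustration, here is a minimal sketch of post-training weight quantisation with optimum-quanto; the checkpoint name is only an example.

```python
# Minimal sketch of post-training int8 weight quantisation with optimum-quanto
# (pip install optimum-quanto); the checkpoint name is only an example.
from transformers import AutoModelForSequenceClassification
from optimum.quanto import quantize, freeze, qint8

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)

# Swap the linear layers for int8-weight quantised versions, then freeze
# to materialise the quantised weights and drop the float originals.
quantize(model, weights=qint8)
freeze(model)
```

Quanto can also quantise activations, in which case a calibration pass over representative data is needed before freezing.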

Knowledge Distillation

GitHub - huggingface/setfit: Efficient few-shot learning with Sentence Transformers
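
In this context SetFit is useful for distilling a fine-tuned (teacher) model into a much smaller student using only unlabelled text from the target domain. The sketch below assumes the setfit >= 1.0 distillation API (DistillationTrainer); model ids and the tiny dataset are placeholders.

```python
# Minimal sketch of distilling a SetFit teacher into a smaller student
# (assumes the setfit >= 1.0 DistillationTrainer API; ids are placeholders).
from datasets import Dataset
from setfit import SetFitModel, DistillationTrainer

teacher = SetFitModel.from_pretrained("my-org/setfit-teacher")  # already fine-tuned on the task
student = SetFitModel.from_pretrained("sentence-transformers/paraphrase-MiniLM-L3-v2")

# Distillation only needs unlabelled text from the target domain.
unlabeled = Dataset.from_dict({"text": ["first unlabelled example", "second unlabelled example"]})

trainer = DistillationTrainer(teacher_model=teacher, student_model=student, train_dataset=unlabeled)
trainer.train()
student.save_pretrained("setfit-student-distilled")
```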