An inventory of transformers inference optimisation methods in the HuggingFace ecosystem
Latency is one of the main challenges in making machine learning impactful for an organisation. Depending on the latency requirements and the inference setup, reducing latency matters for cost efficiency, for scalability, or for both.
Machine Learning Engineering teams need to deliver value and iterate on their projects as efficiently as possible (I will go over ML/Data/AI roles in another post). Ecosystems like HuggingFace have made this much more accessible: beyond providing a unified set of libraries for developing models and sharing pretrained weights, the team has been exceptionally responsive in implementing state-of-the-art techniques across the model lifecycle (e.g. they released open-r1 mere weeks after DeepSeek R1 came out).
The latency of transformers inference can be improved by optimising:
- runtime
- model architecture
- request processing
In this post I take inventory of, and provide references for, methods to optimise transformers inference via runtime and model architecture changes. I will cover request processing in a separate post, as those best practices are not specific to HuggingFace.
Runtime
ONNX
Optimum library
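As a minimal sketch of what the Optimum library enables, the snippet below exports a transformers checkpoint to ONNX and runs it with the ONNX Runtime backend. The checkpoint name is a placeholder, and the `export=True` argument assumes a recent version of `optimum[onnxruntime]`.

```python
# pip install optimum[onnxruntime]  # assumed extra providing the ONNX Runtime backend
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder checkpoint

# Export the PyTorch checkpoint to ONNX and load it with ONNX Runtime.
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The ORT model is a drop-in replacement inside a transformers pipeline.
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Latency optimisation makes this model much more useful."))
```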