Does Quantization Improve Inference Speed? It Depends
Quantization is often cited as a technique for reducing model size and accelerating deep learning inference. However, prior literature suggests that the effect of quantization on latency varies significantly across settings, in some cases even increasing inference time rather than reducing it. To address this discrepancy, we conduct a series of systematic experiments on the Chameleon testbed to investigate how three key variables shape the effect of post-training quantization: the machine learning framework, the compute hardware, and the model itself. Our experiments demonstrate that each of these factors has a substantial impact on the inference time of a quantized model.
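To illustrate the kind of measurement involved, the following is a minimal sketch of post-training dynamic quantization and latency timing in PyTorch. The model choice (torchvision ResNet-18), the quantized layer types, and the timing loop are illustrative assumptions and are not the artifact's actual benchmark code.

```python
# Minimal sketch: compare inference latency of an FP32 model against its
# post-training dynamically quantized (INT8) counterpart with PyTorch.
# NOTE: model, input shape, and iteration counts are illustrative assumptions.
import time
import torch
import torchvision.models as models

model_fp32 = models.resnet18(weights=None).eval()

# Post-training dynamic quantization: weights of Linear layers become INT8.
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 3, 224, 224)

def measure_latency(model, inputs, n_iter=50):
    """Average wall-clock latency per forward pass, after a short warm-up."""
    with torch.no_grad():
        for _ in range(5):  # warm-up iterations
            model(inputs)
        start = time.perf_counter()
        for _ in range(n_iter):
            model(inputs)
        return (time.perf_counter() - start) / n_iter

print(f"FP32 latency: {measure_latency(model_fp32, x) * 1000:.2f} ms")
print(f"INT8 latency: {measure_latency(model_int8, x) * 1000:.2f} ms")
```

Depending on the framework, hardware, and model, the quantized variant may or may not run faster than the FP32 baseline, which is precisely the variation this artifact's experiments examine.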
Launching this artifact will open it within Chameleon’s shared Jupyter experiment environment, which is accessible to all Chameleon users with an active allocation.
Download Archive
Download an archive containing the files of this artifact.
Download with git
Clone the git repository for this artifact, and check out the version's commit:
git clone https://github.com/AhmedFarrukh/QuantizationExperiments.git
cd QuantizationExperiments  # cd into the created directory
git checkout 71a90f1d97ccf276f40e52164da2132899af7756
Submit feedback through GitHub issues