In this talk, we discuss how we quantize a trained Transformer machine translation model to use the INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors, improving inference performance while keeping the drop in accuracy below 0.5%. To the best of our knowledge, this is the first attempt in the industry to quantize the Transformer model. The work is significant because it demonstrates the various complexities involved in quantizing a language-translation model. We present novel quantization techniques, implemented directly in TensorFlow, that opportunistically replace 32-bit floating-point (FP32) computations with 8-bit integer (INT8) computations and transform the FP32 computational graph.
Overall, our INT8/VNNI optimizations deliver a 1.5X improvement over the best FP32 performance. Furthermore, this work reveals the opportunities and challenges in boosting the performance of quantized deep learning inference and establishes best practices for running inference efficiently on Intel CPUs.
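To make the idea of replacing an FP32 computation with an INT8 one concrete, the following is a minimal NumPy sketch of a quantized matrix multiplication. It is an illustration only, not the TensorFlow implementation described in the talk: the helper names `quantize_symmetric` and `int8_matmul` are hypothetical, and the symmetric per-tensor scaling scheme is an assumption chosen for simplicity. The sketch quantizes both operands to signed 8-bit integers, accumulates the products in 32-bit integers (the same accumulation width that VNNI instructions use), and rescales the result back to FP32.

```python
import numpy as np

def quantize_symmetric(x, num_bits=8):
    """Map an FP32 array to signed integers using one per-tensor scale.

    Hypothetical helper; symmetric per-tensor scaling is an assumption.
    """
    qmax = 2 ** (num_bits - 1) - 1           # 127 for INT8
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the INT8 range.
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def int8_matmul(a_fp32, b_fp32):
    """Approximate a_fp32 @ b_fp32 with INT8 inputs and INT32 accumulation."""
    qa, sa = quantize_symmetric(a_fp32)
    qb, sb = quantize_symmetric(b_fp32)
    # Widen to INT32 before multiplying so the accumulation cannot overflow.
    acc = qa.astype(np.int32) @ qb.astype(np.int32)
    # Dequantize: each integer product carries a factor of sa * sb.
    return acc.astype(np.float32) * np.float32(sa * sb)

# Compare the quantized result against the FP32 reference on random data.
rng = np.random.default_rng(0)
a = rng.standard_normal((4, 64)).astype(np.float32)
b = rng.standard_normal((64, 8)).astype(np.float32)
ref = a @ b
approx = int8_matmul(a, b)
rel_err = float(np.linalg.norm(approx - ref) / np.linalg.norm(ref))
print(f"relative error of INT8 matmul vs FP32: {rel_err:.4f}")
```

On well-behaved inputs like these the rounding error is small, but tensors with outliers stretch the scale and lose precision, which is one reason quantizing a full model is typically done selectively rather than for every operation.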