MobileBERT: BERT for Resource-Limited Devices


Overview

Introduction

The MobileBERT architectures


Architecture visualization of transformer blocks within (a) BERT, (b) MobileBERT teacher, and (c) MobileBERT student. The green trapezoids marked with “Linear” are referred to as bottlenecks. Source
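To make the bottleneck idea concrete, here is a minimal PyTorch-style sketch (not the authors' code; the class and layer names are illustrative). A linear layer projects the 512-dimensional inter-block features down to the 128-dimensional intra-block size used by the student, the block body operates in that narrow space, and a second linear layer projects back up for the residual connection.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of a MobileBERT-style bottleneck wrapped around a transformer block body."""

    def __init__(self, inter_size=512, intra_size=128):
        super().__init__()
        self.down = nn.Linear(inter_size, intra_size)  # input bottleneck (the "Linear" trapezoid)
        self.body = nn.Linear(intra_size, intra_size)  # stand-in for attention + stacked FFNs
        self.up = nn.Linear(intra_size, inter_size)    # output bottleneck back to the wide size

    def forward(self, h):
        z = self.body(torch.relu(self.down(h)))
        return h + self.up(z)  # residual connection at the 512-dim inter-block size

h = torch.randn(8, 16, 512)        # (batch, sequence length, inter-block hidden size)
print(BottleneckBlock()(h).shape)  # torch.Size([8, 16, 512]): the output shape matches the input
```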

Linear

 

Multi-Head Attention


Stacked FFN

Operational optimizations


NoNorm equation used to replace the layer normalization operation in the transformer blocks. The dot denotes the Hadamard product, i.e. element-wise multiplication between the two vectors.
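Written out (following the paper's notation), the NoNorm operation is

$$\mathrm{NoNorm}(h) = \gamma \circ h + \beta$$

where γ and β are learned per-dimension vectors. Unlike layer normalization, it computes no mean or variance statistics, which is what makes it cheaper at inference time.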

The motivation behind the teacher and student sizes

Proposed knowledge distillation objectives


Feature map transfer objective function. T is the sequence length, N the feature map size, and l the layer index.
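A reconstruction of the objective the figure showed, using the symbols defined in the caption (the exact notation may differ slightly from the original figure): with H^{tr} and H^{st} denoting the teacher's and student's feature maps, it is simply the mean squared difference between the two at layer ℓ,

$$\mathcal{L}^{\ell}_{FMT} = \frac{1}{TN} \sum_{t=1}^{T} \left\lVert H^{tr}_{t,\ell} - H^{st}_{t,\ell} \right\rVert_2^2$$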


Attention map transfer objective function. T is the sequence length, A the number of attention heads, and l the layer index.
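Likewise, the attention map transfer objective minimizes the KL divergence between the per-head attention distributions of the teacher (a^{tr}) and the student (a^{st}) at layer ℓ, averaged over the T positions and A heads (again reconstructed from the paper, so the notation may differ slightly from the figure):

$$\mathcal{L}^{\ell}_{AT} = \frac{1}{TA} \sum_{t=1}^{T} \sum_{a=1}^{A} D_{KL}\!\left(a^{tr}_{t,\ell,a} \,\Vert\, a^{st}_{t,\ell,a}\right)$$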


Knowledge transfer techniques. (a) Auxiliary knowledge transfer, (b) joint knowledge transfer, (c) progressive knowledge transfer. Source
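To illustrate the third strategy, here is a toy PyTorch sketch of progressive knowledge transfer (not the authors' code; ToyEncoder and all sizes are made up): the student is trained one layer at a time, and while layer ℓ is being distilled with the layer-wise transfer loss, the layers below it are frozen.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Toy stand-in for an encoder: each nn.Linear plays the role of one transformer block."""

    def __init__(self, num_layers=4, hidden=128):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_layers)])

    def forward(self, x):
        feats = []
        for layer in self.layers:
            x = torch.tanh(layer(x))
            feats.append(x)
        return feats  # per-layer feature maps

teacher, student = ToyEncoder(), ToyEncoder()
mse = nn.MSELoss()
x = torch.randn(8, 16, 128)  # (batch, sequence length T, feature size N)

# Progressive knowledge transfer: one training stage per layer.
for l in range(len(student.layers)):
    for i, layer in enumerate(student.layers):
        for p in layer.parameters():
            p.requires_grad = (i == l)  # freeze every layer except the one being distilled
    opt = torch.optim.Adam(student.layers[l].parameters(), lr=1e-3)
    for _ in range(50):  # a handful of steps per stage, purely illustrative
        with torch.no_grad():
            t_feats = teacher(x)            # teacher targets, no gradients needed
        s_feats = student(x)
        loss = mse(s_feats[l], t_feats[l])  # feature map transfer loss at layer l
        opt.zero_grad()
        loss.backward()
        opt.step()
```

Roughly, joint knowledge transfer would instead sum all of these layer-wise losses in a single stage, while auxiliary knowledge transfer uses them as an extra term alongside the main pre-training distillation objective.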

Experimental results


Experimental results on the GLUE benchmark. Source

It's therefore safe to conclude that it's possible to create a distilled model that is both performant and fast on resource-limited devices!

MobileBERT was fine-tuned on GLUE by itself, which shows that it's possible to create a task-agnostic model through the proposed distillation process!

Conclusion

If you found this summary helpful in understanding the broader picture of this particular research paper, please consider reading my other articles! I’ve already written a bunch and more will definitely be added. I think you might find this one interesting👋🏼🤖

About the Author

Author Viktor Karlsson – Software Engineer

I am a Software Engineer with an MSc in Machine Learning and a growing interest in NLP. I try to stay on top of recent developments within the ML field in general, and NLP in particular. Writing to learn!
