BERT: Bidirectional Encoder Representations from Transformers

State-of-the-art Language Model for NLP


Why Move On to BERT?

  • One of the biggest challenges in NLP is the lack of training data. Because NLP is a diverse field with many distinct tasks, most task-specific datasets contain only a few thousand to a few hundred thousand human-labeled training examples. Pre-training on large amounts of unannotated text and then fine-tuning on a small task-specific dataset works around this shortage.
  • Earlier language models could only read text sequentially, either left-to-right or right-to-left, but not both at the same time. BERT is different because it is designed to read in both directions at once. This capability, enabled by the Transformer architecture, is known as bidirectionality.

BERT Model Architecture

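BERT was released as pre-trained models in two sizes: BERT-Base (12 Transformer layers, hidden size 768, 12 attention heads, about 110M parameters) and BERT-Large (24 layers, hidden size 1024, 16 attention heads, about 340M parameters). Both were pre-trained on a large unlabeled corpus made up of BooksCorpus (about 800M words) and English Wikipedia (about 2,500M words).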
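As a rough illustration of how the two model sizes differ, the sketch below estimates the parameter count of each from its published hyperparameters. The `BertConfig` class and `count_parameters` function are hypothetical helpers written for this article, not part of any library, and the vocabulary size of 30,522 is an assumption that matches the uncased English model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BertConfig:
    """Hypothetical container for the published BERT hyperparameters."""
    num_layers: int
    hidden_size: int
    num_heads: int
    vocab_size: int = 30522      # assumption: uncased English WordPiece vocabulary
    max_positions: int = 512     # maximum sequence length
    num_segments: int = 2        # sentence A / sentence B

    @property
    def ffn_size(self) -> int:
        # The feed-forward inner dimension is 4x the hidden size.
        return 4 * self.hidden_size


def count_parameters(cfg: BertConfig) -> int:
    """Estimate the number of weights in the encoder (no task-specific head)."""
    h = cfg.hidden_size
    layer_norm = 2 * h                       # scale and bias
    embeddings = (cfg.vocab_size + cfg.max_positions + cfg.num_segments) * h + layer_norm
    attention = 4 * (h * h + h)              # query, key, value, output projections
    feed_forward = (h * cfg.ffn_size + cfg.ffn_size) + (cfg.ffn_size * h + h)
    encoder_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = h * h + h                       # dense layer applied to the [CLS] vector
    return embeddings + cfg.num_layers * encoder_layer + pooler


BERT_BASE = BertConfig(num_layers=12, hidden_size=768, num_heads=12)
BERT_LARGE = BertConfig(num_layers=24, hidden_size=1024, num_heads=16)

if __name__ == "__main__":
    for name, cfg in [("BERT-Base", BERT_BASE), ("BERT-Large", BERT_LARGE)]:
        print(f"{name}: {count_parameters(cfg) / 1e6:.1f}M parameters")
```

Under these assumptions the estimate lands close to the commonly quoted sizes; the exact figure depends on the vocabulary and on which output layers are counted.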

Model Input/Output Representation

The input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., a question and an answer) in one token sequence.

  1. The first token of every input sequence is always [CLS], the classification token.
  2. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
  3. When a sentence pair is packed into a single sequence, the two sentences are distinguished in two ways. First, a [SEP] token separates them. Second, a learned embedding is added to every token to indicate whether it belongs to sentence A or sentence B.

The input embedding is denoted E, the final hidden vector of the special [CLS] token is denoted C, and the final hidden vector of the i-th input token is denoted Ti.
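
The packing rules above can be sketched in a few lines of Python. This is a minimal illustration, not BERT's actual preprocessing: `pack_inputs` is a hypothetical helper, and a plain whitespace split stands in for the WordPiece tokenizer that BERT really uses.

```python
from typing import List, Optional, Tuple


def pack_inputs(sentence_a: List[str],
                sentence_b: Optional[List[str]] = None
                ) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) for one sentence or a sentence pair."""
    tokens = ["[CLS]"] + sentence_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # sentence A, including [CLS] and its [SEP]
    if sentence_b is not None:
        tokens += sentence_b + ["[SEP]"]
        segment_ids += [1] * (len(sentence_b) + 1)   # sentence B and its [SEP]
    return tokens, segment_ids


if __name__ == "__main__":
    # Whitespace splitting is an assumption made for illustration only.
    question = "who wrote hamlet".split()
    answer = "shakespeare wrote hamlet".split()
    tokens, segments = pack_inputs(question, answer)
    for tok, seg in zip(tokens, segments):
        print(f"{tok:12s} segment={seg}")
```

The segment ids are what the learned sentence A / sentence B embedding is looked up from; a single-sentence input simply uses segment 0 throughout.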

How does BERT work?

BERT's most important improvement is in context understanding. It is particularly well suited to “context-heavy” text, where the meaning of a word depends on the words around it and getting that meaning right matters for analytics.

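For example, the word “right” means something different in each of these sentences:

  1. You were right.
  2. Make a right turn at the light.

The same is true of “rose”:

  1. My favorite flower is a rose.
  2. He quickly rose from his seat.

Because BERT reads the whole sentence, it can give each occurrence a representation that reflects its context. To learn this, BERT is pre-trained on two tasks:

  1. MLM (Masked Language Modeling)
  2. Next Sentence Prediction (NSP)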

MLM (Masked Language Modeling)

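Usually, a language model can only be trained in one direction, because in a bidirectional model each word could indirectly see itself. BERT handles this issue with MLM: 15% of the input tokens are chosen at random, and the model is trained to predict the original tokens at those positions. Each chosen token is handled as follows:

  • 80% of the time: replace the word with the [MASK] token.
  • 10% of the time: replace the word with a random word.
  • 10% of the time: keep the word unchanged.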
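
The 80/10/10 rule can be sketched as follows. This is a simplified illustration under stated assumptions: each token is selected independently with probability 0.15 (the selection rate from the BERT paper), the tokens are plain words rather than WordPiece units, and `mask_tokens` is a hypothetical helper name.

```python
import random
from typing import List, Optional, Tuple

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                select_prob: float = 0.15
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted_tokens, labels); labels are None where no prediction is made."""
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Special tokens are never selected; others are selected with select_prob.
        if tok in SPECIAL or rng.random() >= select_prob:
            continue
        labels[i] = tok                       # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the word unchanged
    return corrupted, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "my", "favorite", "flower", "is", "a", "rose", "[SEP]"]
    vocab = ["seat", "light", "turn", "flower", "rose"]
    print(mask_tokens(sentence, vocab, random.Random(0)))
```

Keeping some selected words unchanged or swapping in random words means the model cannot rely on [MASK] always marking the positions it has to predict, which matters because [MASK] never appears during fine-tuning.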

Next Sentence Prediction (NSP)

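Many downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences, which is not directly captured by language modeling. To learn this relationship, BERT is pre-trained on pairs of sentences A and B: 50% of the time B is the sentence that actually follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).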
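
A minimal sketch of how NSP training pairs could be built is shown below. It assumes the 50/50 split between real and random next sentences described in the BERT paper; `make_nsp_examples` is a hypothetical helper, and drawing the random sentence from a different document is a simplification chosen so that a NotNext pair is never accidentally a true continuation.

```python
import random
from typing import List, Tuple


def make_nsp_examples(documents: List[List[str]], rng: random.Random
                      ) -> List[Tuple[str, str, str]]:
    """Return (sentence_a, sentence_b, label) triples labeled IsNext or NotNext.

    Assumes at least two documents, each containing at least one sentence.
    """
    examples = []
    for doc_idx, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive pair: B really follows A in the same document.
                examples.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative pair: B is drawn from a different document.
                other_idx = rng.choice(
                    [j for j in range(len(documents)) if j != doc_idx])
                examples.append((doc[i], rng.choice(documents[other_idx]), "NotNext"))
    return examples


if __name__ == "__main__":
    docs = [
        ["He sat down.", "He quickly rose from his seat.", "Then he left."],
        ["Drive north.", "Make a right turn at the light.", "Park there."],
    ]
    for a, b, label in make_nsp_examples(docs, random.Random(0)):
        print(f"{label:8s} A={a!r} B={b!r}")
```

During pre-training, the [CLS] representation of each packed pair is what the model uses to predict the label.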

Fine-Tuning with BERT

Using BERT for a specific task is relatively straightforward:

  1. For classification tasks (e.g., sentiment analysis), add a classification (feed-forward) layer on top of the Transformer output for the [CLS] token.
  2. For question answering tasks (e.g., SQuAD v1.1), BERT learns two extra vectors that mark the beginning and the end of the answer span.
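
To make the question answering case concrete, the sketch below scores every candidate span with a start vector and an end vector, following the scoring rule in the BERT paper (the score of a span from token i to token j is the start vector dotted with Ti plus the end vector dotted with Tj, with j at or after i). The function name `best_span`, the `max_answer_len` cap, and the tiny hand-written vectors are assumptions for illustration; in practice the token vectors come from the fine-tuned model.

```python
from typing import List, Tuple


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def best_span(token_vectors: List[List[float]],
              start_vector: List[float],
              end_vector: List[float],
              max_answer_len: int = 30) -> Tuple[int, int]:
    """Return the (start, end) token indices of the highest-scoring span."""
    start_scores = [dot(start_vector, t) for t in token_vectors]
    end_scores = [dot(end_vector, t) for t in token_vectors]
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        # The end index may not precede the start index.
        for j in range(i, min(i + max_answer_len, len(token_vectors))):
            score = s + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


if __name__ == "__main__":
    # Toy 2-dimensional "hidden states" for five tokens.
    T = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
    print(best_span(T, start_vector=[1.0, 0.0], end_vector=[0.0, 1.0]))
```

The two vectors are the only parameters added for this task, which is why fine-tuning for question answering is cheap compared with pre-training.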

Fine-Tuning BERT for Different Tasks

Task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch.
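
To show how small the added output layer is, here is a minimal sketch of a classification head applied to the final hidden vector of [CLS]. The helper names `classify` and `new_parameters` are hypothetical, and the bias term is an assumption: the BERT paper describes only a weight matrix with one row per label.

```python
import math
from typing import List


def classify(cls_vector: List[float],
             weights: List[List[float]],
             bias: List[float]) -> List[float]:
    """Softmax over W.C + b, where C is the final hidden vector of [CLS]."""
    logits = [sum(w * c for w, c in zip(row, cls_vector)) + b
              for row, b in zip(weights, bias)]
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def new_parameters(hidden_size: int, num_labels: int) -> int:
    """Parameters added by the output layer: a K x H weight matrix plus K biases."""
    return num_labels * hidden_size + num_labels


if __name__ == "__main__":
    # Toy 2-dimensional [CLS] vector and a 2-label head.
    print(classify([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]))
    # Two-label head on a model with hidden size 768.
    print(new_parameters(hidden_size=768, num_labels=2))
```

For a hidden size of 768 and two labels, this head adds on the order of 1,500 parameters, a tiny fraction of the pre-trained encoder, which is why so little has to be learned from scratch.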

E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol that separates non-consecutive token sequences. The task types covered are:

  • Sentence Pair Classification Tasks
  • Single Sentence Classification Tasks
  • Question Answering Tasks
  • Single Sentence Tagging Tasks

Evaluation for BERT: GLUE

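The General Language Understanding Evaluation (GLUE) benchmark is a collection of natural language understanding tasks. Each has a standard split into train, validation, and test sets, and the labels for the test set are held only on the evaluation server. The tasks are: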

  • MNLI: Multi-Genre Natural Language Inference
  • QQP: Quora Question Pairs
  • QNLI: Question Natural Language Inference
  • STS-B: The Semantic Textual Similarity Benchmark
  • MRPC: Microsoft Research Paraphrase Corpus
  • RTE: Recognizing Textual Entailment
  • WNLI: Winograd NLI, a small natural language inference dataset
  • SST-2: The Stanford Sentiment Treebank
  • CoLA: The Corpus of Linguistic Acceptability

Results of BERT on the GLUE tasks

Evaluation for BERT: SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs.

Results on SQuAD


BERT's major contribution was to make transfer learning in NLP more general by pre-training a deeply bidirectional architecture. Because it is approachable and fast to fine-tune, it is likely to find a wide range of practical applications. It achieved state-of-the-art results on a variety of tasks, so it can be used for many NLP problems. As of December 2019, it was also used in Google Search in 70 languages.


