BERT: Bidirectional Encoder Representations from Transformers

Nov 2, 2021

State-of-the-art Language Model for NLP

BERT is a Natural Language Processing model developed by researchers at Google AI. When it was proposed, it achieved state-of-the-art results on 11 NLP and NLU tasks, including the very competitive Stanford Question Answering Dataset (SQuAD v1.1), GLUE (General Language Understanding Evaluation), and SWAG (Situations With Adversarial Generations).

The BERT model was pre-trained on text from Wikipedia (that’s 2,500 million words!) and BooksCorpus (800 million words), and it can be fine-tuned on task-specific datasets such as question-answering datasets.

Why BERT?

So why move on to BERT?

One of the biggest challenges in NLP is the lack of training data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets contain only a few thousand or a few hundred thousand human-labeled training examples.

To overcome this issue, BERT first trains a language model on a large unlabelled text corpus (unsupervised or semi-supervised). This large model can then be fine-tuned on specific NLP tasks to make use of the large repository of knowledge it has gained (supervised).

Earlier language models could only read text input sequentially, either left-to-right or right-to-left, but couldn’t do both at the same time. BERT is different because it is designed to read in both directions at once. This capability, enabled by the Transformer architecture, is known as bidirectionality.

BERT Model Architecture

BERT was released as pre-trained models in two sizes, each trained on the huge dataset described above.

BERT also builds on many earlier NLP ideas and architectures, such as semi-supervised training, the OpenAI Transformer, ELMo embeddings, ULMFiT, and the Transformer.

BERT is basically the encoder stack of the Transformer architecture.

The Transformer is an encoder-decoder network that uses self-attention on the encoder side and attention on the decoder side: an encoder reads the text input and a decoder produces a prediction for the task. BERT keeps only the encoder.
The detailed workings of the Transformer are described here.

BERT BASE Model: 12 layers in the encoder stack, 12 attention heads, a hidden size of 768, and 110 million parameters.

BERT LARGE Model: 24 layers in the encoder stack, 16 attention heads, a hidden size of 1024, and 340 million parameters.
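
As a rough sanity check on these numbers, the parameter count can be estimated from the layer count and hidden size. This is only a sketch under stated assumptions: the function name is hypothetical, the feed-forward size is taken to be 4x the hidden size, the maximum sequence length is taken to be 512, and the vocabulary is rounded to 30,000 tokens.

```python
def bert_param_count(layers, hidden, vocab_size=30_000,
                     max_positions=512, segment_types=2):
    """Approximate parameter count of a BERT-style encoder stack."""
    ffn = 4 * hidden  # assumed feed-forward (intermediate) size

    # Embeddings: token, position and segment tables, plus one LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + 2 * hidden

    # One encoder layer.
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # two LayerNorms per layer
    per_layer = attention + feed_forward + layer_norms

    pooler = hidden * hidden + hidden            # dense layer on top of [CLS]
    return embeddings + layers * per_layer + pooler


base = bert_param_count(layers=12, hidden=768)
large = bert_param_count(layers=24, hidden=1024)
print(f"BERT BASE:  ~{base / 1e6:.0f}M parameters")
print(f"BERT LARGE: ~{large / 1e6:.0f}M parameters")
```

Under these assumptions the estimate comes out at roughly 109M and 335M parameters, in the same range as the 110M and 340M quoted above. The number of attention heads does not appear in the formula because the heads split the hidden size between them rather than adding to it.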

Model Input/Output Representation

The input representation can unambiguously represent both a single sentence and a pair of sentences (e.g., question and answer) in one token sequence.

BERT uses WordPiece embeddings with a 30,000 token vocabulary.
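
To give a feel for what WordPiece tokens look like, here is a minimal sketch of greedy longest-match-first splitting, which is how a WordPiece vocabulary is commonly applied to a word at tokenization time. The function name and the tiny vocabulary are made up for illustration; the real vocabulary has about 30,000 entries.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces get a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, for illustration only.
toy_vocab = {"flight", "##less", "bird", "##s", "penguin"}
print(wordpiece_tokenize("flightless", toy_vocab))  # ['flight', '##less']
print(wordpiece_tokenize("birds", toy_vocab))       # ['bird', '##s']
print(wordpiece_tokenize("zebra", toy_vocab))       # ['[UNK]']
```

Splitting rare words into known pieces is what lets a fixed-size vocabulary cover words it has never seen as a whole.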

The first token of every input sequence is always [CLS], the special classification token.
The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
When a sentence pair is packed into a single sequence, the two sentences are differentiated in two ways. First, they are separated by a [SEP] token. Second, a learned embedding is added to every token to indicate whether it belongs to sentence A or sentence B.
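
The packing described above can be sketched in a few lines. This is an illustrative sketch only: the function name is hypothetical, a whitespace split stands in for WordPiece tokenization, and the segment ids 0 and 1 stand for sentence A and sentence B.

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Pack one or two tokenized sentences into a single BERT-style sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)              # sentence A, with [CLS] and its [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    return tokens, segment_ids


question = "where is the store".split()   # whitespace split stands in for WordPiece
answer = "the store is downtown".split()
tokens, segment_ids = build_bert_input(question, answer)
print(tokens)
print(segment_ids)
```

The segment ids are what the learned sentence A / sentence B embedding is looked up from, so every token carries a marker of which sentence it came from.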

The input embedding is denoted E, the final hidden vector of the special [CLS] token is denoted C, and the final hidden vector of the i-th input token is denoted Ti.

How does BERT work?

BERT’s key improvement is in context understanding. It is especially well suited to “context-heavy” texts, where understanding the context is essential for analysis.

Say we have two sentences like:

You were right.
Make a right turn at the light.

In the first sentence, the word “right” means “correct”; in the second sentence, “right” refers to a direction. We know that because of the context.

Another example:

My favorite flower is a rose.
He quickly rose from his seat.

In the first sentence, the word “rose” refers to a flower; in the second one, it is the past tense of “rise”. Again, we know that because of the context.

Because BERT is trained to predict missing words in the text, and because it analyzes every sentence in no specific direction, it does a better job of understanding the meaning of homonyms than previous NLP approaches, such as earlier embedding methods.

To learn this kind of context, BERT is pre-trained on two different NLP tasks:

MLM (Masked Language Modeling)
Next Sentence Prediction (NSP)

MLM (Masked Language Modeling)

Usually, a language model can only be trained in one specific direction. BERT handles this issue with MLM.

In Masked Language Modeling (MLM) training, a word in a sentence is hidden (masked) and the model must predict the hidden word from the context around it.

BERT does this by masking 15% of all WordPiece tokens in each sequence at random.

Note: the [MASK] token appears only during pre-training, not during fine-tuning.

For every input sequence, 15% of the tokens are selected at random for prediction. To reduce the mismatch noted above, the selected tokens are not all handled in the same way, and they are not replaced with [MASK] 100% of the time:

Example Sentence: This is going to be so long

80% of the time: replace the word with the [MASK] token.

This is going to be so long → This is going [MASK] be so long

10% of the time: replace the word with a random word.

This is going to be so long → This is going the be so long

10% of the time: keep the word unchanged.

This is going to be so long → This is going to be so long
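
The 80/10/10 rule above can be sketched as follows. This is a simplified illustration rather than the original training code: the function name, the toy vocabulary, and the choice to always select at least one token are assumptions made for the example.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Apply BERT-style masking; returns (corrupted tokens, {position: original})."""
    if rng is None:
        rng = random.Random()
    special = {"[CLS]", "[SEP]"}
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    # Assumption: select at least one token so short sentences still get a target.
    num_to_select = min(len(candidates), max(1, round(len(candidates) * mask_prob)))
    selected = rng.sample(candidates, num_to_select)

    corrupted = list(tokens)
    labels = {}
    for i in selected:
        labels[i] = tokens[i]                  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the token unchanged
    return corrupted, labels


sentence = "this is going to be so long".split()
toy_vocab = ["the", "a", "cat", "long", "be"]  # toy vocabulary, illustration only
corrupted, labels = mask_tokens(sentence, toy_vocab, rng=random.Random(0))
print(corrupted)
print(labels)
```

Note that the prediction target is recorded for every selected position, including the ones left unchanged, so the model cannot tell from the input alone which tokens it will be asked about.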

Next Sentence Prediction (NSP)

Many downstream tasks, such as Question Answering (QA) and Natural Language Inference (NLI), are based on understanding the relationship between two sentences. This relationship is not directly captured by language modeling.

In the BERT training process, the model receives a pair of sentences as input and is trained to predict whether the second sentence follows the first in the original text.

This is a binary classification task with two labels: IsNext and NotNext.

When choosing sentences A and B for each pre-training example, 50% of the time sentence B is the actual sentence that follows sentence A, and the pair is labeled IsNext. The other 50% of the time, sentence B is a random sentence from the corpus, and the pair is labeled NotNext.

Example:

Input: [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]

Output: IsNext

Another Example:

Input: [CLS] the man [MASK] to the store [SEP] penguin [MASK] are flight ##less birds [SEP]

Output: NotNext
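
The 50/50 sampling of sentence pairs can be sketched as below. This is a minimal illustration, not the original data pipeline: the function name and the toy corpus are hypothetical, and drawing the random sentence from a different document (so that it cannot accidentally be the true next sentence) is an assumption of this sketch.

```python
import random


def make_nsp_example(documents, rng=None):
    """Build one (sentence_a, sentence_b, label) example for next sentence prediction."""
    if rng is None:
        rng = random.Random()
    # Pick a document with at least two sentences and a position inside it.
    doc_index = rng.choice([i for i, d in enumerate(documents) if len(d) >= 2])
    doc = documents[doc_index]
    i = rng.randrange(len(doc) - 1)
    sentence_a = doc[i]

    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], "IsNext"   # the true next sentence

    # Otherwise draw a random sentence from a different document (assumption).
    other_docs = [d for j, d in enumerate(documents) if j != doc_index]
    sentence_b = rng.choice(rng.choice(other_docs))
    return sentence_a, sentence_b, "NotNext"


# Toy corpus: each document is a list of sentences.
corpus = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they are strong swimmers"],
]
print(make_nsp_example(corpus, rng=random.Random(0)))
```

In pre-training, each pair produced this way would then be packed into a single sequence with [CLS] and [SEP] and have masking applied, which is why the examples above contain [MASK] tokens.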

Fine-Tuning with BERT

Using BERT for a specific task is relatively straightforward. It can be used for a wide variety of language tasks by adding only a small layer to the core model:

For classification tasks (e.g., sentiment analysis), add a classification (FFN) layer on top of the Transformer output for the [CLS] token.
For question answering tasks (e.g., SQuAD v1.1), BERT trains two extra vectors that mark the beginning and the end of the answer.
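
Both heads are small enough to sketch in plain Python. The code below is illustrative only: the function names, the toy 2-dimensional vectors, and the max_len cap on answer length are assumptions, and real hidden vectors would come out of the fine-tuned encoder.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, class_weights):
    """Classification head: one weight vector per class applied to the [CLS] output."""
    return softmax([dot(w, cls_vector) for w in class_weights])


def best_answer_span(token_vectors, start_vector, end_vector, max_len=30):
    """QA head: score each token as a start or an end, return the best (start, end)."""
    start_scores = [dot(start_vector, t) for t in token_vectors]
    end_scores = [dot(end_vector, t) for t in token_vectors]
    best, best_score = None, -math.inf
    for i, s in enumerate(start_scores):
        # The end position may not come before the start position.
        for j in range(i, min(i + max_len, len(end_scores))):
            if s + end_scores[j] > best_score:
                best, best_score = (i, j), s + end_scores[j]
    return best


# Toy 2-dimensional "hidden states" for four tokens.
token_vectors = [[0.1, 0.0], [2.0, 0.1], [0.2, 1.5], [0.0, 0.2]]
span = best_answer_span(token_vectors, start_vector=[1.0, 0.0], end_vector=[0.0, 1.0])
print(span)  # (1, 2)

probs = classify([1.0, -1.0], class_weights=[[2.0, 0.0], [0.0, 2.0]])
print(probs)
```

In both cases the only new parameters are the head's weight vectors; everything else is the pre-trained encoder, which is what keeps fine-tuning cheap.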

Fine Tune BERT for Different Tasks

Task-specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch.

E represents the input embedding, Ti represents the contextual representation of token i, [CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences

Sequence-Level Tasks

Sentence Pair Classification Tasks
Single Sentence Classification Tasks

Token-Level Tasks

Question Answering Tasks
Single Sentence Tagging Tasks

The BERT paper reports fine-tuning results on 11 NLP tasks. Here, we discuss some of those results on benchmark NLP tasks.

Evaluation for BERT: GLUE

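The General Language Understanding Evaluation (GLUE) benchmark is a collection of diverse natural language understanding tasks. Each dataset comes with a standard train, validation, and test split, and the labels for the test set are held only on the evaluation server.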

Datasets in GLUE are mentioned below in two categories:

Sentence pair tasks

MNLI, Multi-Genre Natural Language Inference
QQP, Quora Question Pairs
QNLI, Question Natural Language Inference
STS-B, Semantic Textual Similarity Benchmark
MRPC, Microsoft Research Paraphrase Corpus
RTE, Recognizing Textual Entailment
WNLI, Winograd NLI, a small natural language inference dataset

Single sentence classification

SST-2, Stanford Sentiment Treebank
CoLA, Corpus of Linguistic Acceptability

Evaluation for BERT: SQuAD

The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs.

Example:

Input Question: Where do water droplets collide with ice crystals to form precipitation?

Input Paragraph: …precipitation forms as smaller droplets coalesce via collision with other rain drops or ice crystals within a cloud…

Output: within a cloud

The best-performing BERT system (an ensemble with additional TriviaQA training data) outperforms the top leaderboard system by 1.5 F1 as an ensemble and by 1.3 F1 as a single system. In fact, a single BERT model outperforms the top ensemble system in terms of F1 score.
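
For readers unfamiliar with the metric, the F1 score here measures token overlap between the predicted answer and the reference answer. The sketch below is a simplified version under stated assumptions: the function name is hypothetical, and it only lowercases and splits on whitespace, whereas the official evaluation also normalizes punctuation and articles and takes the best score over several reference answers.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer string."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("within a cloud", "within a cloud"))       # 1.0
print(round(token_f1("a cloud", "within a cloud"), 2))    # 0.8
print(token_f1("in the sky", "within a cloud"))           # 0.0
```

A partially correct answer such as "a cloud" still earns credit, which is why F1 is reported alongside the stricter exact-match score.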

Conclusion

BERT’s major contribution was to make transfer learning in NLP more general by using a bidirectional architecture. Because it is approachable and fast to fine-tune, it is likely to enable a wide range of practical applications. It has achieved state-of-the-art results on a variety of tasks and can therefore be used for many NLP problems. As of December 2019, it was also used in Google Search in 70 languages.

Below are some examples of Google search queries before and after using BERT.