🇬🇧 FDBK — Hugging Face Course — Chapter 1/2

Maxime Pawlak
6 min read · Sep 5, 2021

Hugging Face recently released a course about NLP and Transformers. At the time of writing, only the first part (4 chapters) is available.

Hugging Face Course

I spent a couple of days on this course, so I’ll make it shorter for you. Obviously, I’ll go fast on some items. I can only recommend taking the course yourself if you are really interested in these topics.

The goal is not to give you a complete summary of the course. Actually, it’s pretty selfish: my goal is just to reformulate things to make sure I understood them. And the best way to do that is to share it with you. Let’s get started!

Photo by Tim Foster on Unsplash

Transformer Models

The first chapter is a gentle introduction to Transformer models.

First, let’s define NLP.

NLP stands for Natural Language Processing. It’s a field of linguistics and machine learning focused on understanding human language. This means not only understanding individual words, but also the context of sentences and paragraphs.

For instance:

  • Classifying whole sentences: sentiment, spam or not…
  • Classifying each word: named entities
  • Generating text content
  • Extracting an answer from a text
  • Generating a new sentence from an input text: translation, summarization…

Transformer models are used to solve all these kinds of tasks. Hugging Face enables companies to use and share their models easily through its Model Hub. A lot of organizations are using Hugging Face.

Pipeline

The most basic object in the library is “pipeline”. It connects a model with its necessary preprocessing and post-processing steps. Some available pipelines (don’t worry, we’ll explain them just below) are:

  • feature-extraction
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • translation
  • zero-shot-classification

Let’s dive into these examples with some code!
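
As a warm-up, here is roughly what a pipeline call looks like. I’m using the classic sentiment-analysis example; the default checkpoint and the exact score are just what the library picks, so treat the output as illustrative.

```python
from transformers import pipeline

# The first call downloads a default pre-trained model for the task
classifier = pipeline("sentiment-analysis")

print(classifier("I've been waiting for a Hugging Face course my whole life."))
# Something like: [{'label': 'POSITIVE', 'score': 0.99...}]
```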

Zero-shot classification

This task allows you to specify which labels to use for the classification. And your labels don’t have to match the labels of the pre-trained model. It’s called zero-shot because you don’t need any fine-tuning. It can directly return a probability score for any list of labels you want!
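
A sketch of what that looks like (the text and labels are just examples, and the scores will vary with the default checkpoint):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")

result = classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)
print(result["labels"], result["scores"])
# The labels come back sorted by probability, "education" typically first here
```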

Text generation

Well, this one is pretty straightforward. You give the beginning of a sentence, and it outputs the rest. Of course, you can specify any particular model you want.
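
For example (here I pick distilgpt2, but any causal language model from the Hub would do; the generated text will differ on every run):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")

outputs = generator(
    "In this course, we will teach you how to",
    max_length=30,           # total length of the generated text
    num_return_sequences=2,  # ask for two different continuations
)
for out in outputs:
    print(out["generated_text"])
```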

Mask filling

This task is about masking some words in your text; the model then suggests the most likely words to fill the blanks.
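
A quick sketch (the default checkpoint uses <mask> as its mask token; top_k controls how many suggestions you get back):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")

for pred in unmasker("This course will teach you all about <mask> models.", top_k=2):
    print(pred["token_str"], pred["score"])
```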

NER

NER stands for Named Entity Recognition. This task checks whether words from your text correspond to entities. An entity is a predefined category used to group words: person, organization, location… It’s particularly useful when you want to analyze a text and understand which entities interact with each other.
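
Something like this, where grouped_entities=True merges the sub-word tokens that belong to the same entity (the sentence is just an example):

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)

print(ner("My name is Sylvain and I work at Hugging Face in Brooklyn."))
# Roughly: Sylvain -> PER, Hugging Face -> ORG, Brooklyn -> LOC
```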

Question answering

Well, ask a question, and the model will answer according to the context.
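
For instance (note that the answer is extracted from the context, not generated):

```python
from transformers import pipeline

question_answerer = pipeline("question-answering")

print(question_answerer(
    question="Where do I work?",
    context="My name is Sylvain and I work at Hugging Face in Brooklyn.",
))
# Roughly: {'answer': 'Hugging Face', 'score': ..., 'start': ..., 'end': ...}
```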

Summarization

Easy one as well. It reduces a text into a shorter text while keeping the most important aspects.
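
A sketch (any long article works as input; max_length caps the length of the summary in tokens):

```python
from transformers import pipeline

summarizer = pipeline("summarization")

long_text = (
    "America has changed dramatically during recent years. Not only has the number "
    "of graduates in traditional engineering disciplines declined, but in most of "
    "the premier American universities engineering curricula now concentrate on and "
    "encourage largely the study of engineering science."
)
print(summarizer(long_text, max_length=60))
```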

Translation

Do I really need to explain what translation is? Est-ce vraiment nécessaire que je définisse ce qu’est la traduction ?
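
Apparently not, but here is a sketch anyway; Helsinki-NLP/opus-mt-fr-en is one possible French-to-English checkpoint:

```python
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

print(translator("Ce cours est produit par Hugging Face."))
# Roughly: [{'translation_text': 'This course is produced by Hugging Face.'}]
```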

A bit of history

The Transformer architecture was introduced in 2017. The first pre-trained model shared was GPT in June 2018. Since then, it’s been a crazy race between companies.

Transformers are language models. They have been trained on large amounts of raw text in a self-supervised fashion: there is no need to label the data. The model learns on its own: predicting the next word, filling in masked words…

Transformers are big models: more parameters generally means better performance.

But as you can imagine, training a heavier model costs a lot.

This is where Hugging Face comes in. But wait a minute, let me explain how.

It all starts with Transfer Learning. What is it? This is a very good question! Thanks for asking.

At the beginning, you have an empty model. You need to find a very large corpus, a lot of money, and wait a few days (maybe a few weeks) before it’s trained and ready to use. Once done, congratulations, you have a pre-trained language model. But you may not have enough money or time to do all of that.

Well, let’s do transfer learning instead. You get a pre-trained model. It’s been trained by awesome people at Google, Facebook, Hugging Face, Stanford, Microsoft… I’m sure they are adorable.

Now, you fine-tune this model. This means you take your small dataset and your set of GPUs and train this pre-trained model further. You’ll get a fine-tuned language model for your data.
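
To make that concrete, here is a minimal sketch of fine-tuning with the Trainer API (fine-tuning is covered later in the course; the checkpoint, the IMDb dataset and the hyperparameters below are arbitrary choices for illustration):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pre-trained checkpoint (arbitrary choice for this sketch)
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Your "small dataset": here, a tiny shuffled slice of IMDb reviews as a stand-in
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1000))
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="my-finetuned-model",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()  # minutes on a single GPU for this toy setup, not days or weeks
```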

There are several types of architecture.

Encoder

It receives an input and builds a representation of it. It is optimized to acquire knowledge from the input.

Decoder

It uses the encoder’s representation (see it as “features”) to generate a target sequence.

They can work independently:

  • encoder-only for tasks that require understanding of the inputs (NER, classification)
  • decoder-only for generative tasks

Encoder-decoder

Of course, they can work together in an encoder-decoder (or sequence-to-sequence) architecture. It’s good for generative tasks that need an input: translation, summarization.

Behind the magic of transformer models, there is one main thing: Attention. This mechanism was introduced in 2017 in the paper “Attention Is All You Need”. Basically, it tells the model to pay specific attention to certain words in a sentence. A word is highly affected by its context (the words surrounding it).

Photo by Minna Autio on Unsplash

Encoder models

In encoder models, the attention layers can access all the words in the initial sentence. These models are often characterized by bi-directional attention and are called auto-encoding models. The output is a set of feature vectors: not only do they contain a representation of each word, but also of its context.
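
You can peek at those vectors with the feature-extraction pipeline; the checkpoint below is an arbitrary encoder-only model:

```python
from transformers import pipeline

extractor = pipeline("feature-extraction", model="distilbert-base-uncased")

features = extractor("Transformers are great")
# Nested list shaped [batch, tokens, hidden_size]: one contextual vector per token
print(len(features[0]), len(features[0][0]))  # number of tokens (incl. special tokens), 768
```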

Decoder models

In these models, the attention layers can only access, for a given word, the words positioned before it in the sentence. They are often called auto-regressive models and rely on uni-directional (masked) self-attention.

Sequence-to-sequence

The sequence-to-sequence is the combination of an encoder followed by a decoder.

Bias

To train these transformer models, researchers scraped all the content they could find. And as you can imagine, they scraped the best and the worst. This is why you can (unfortunately) find the same biases in these models that you find in real life.
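
The course illustrates this with the fill-mask pipeline; here is a sketch of that kind of experiment (the exact suggestions depend on the checkpoint, here bert-base-uncased):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

print([r["token_str"] for r in unmasker("This man works as a [MASK].")])
print([r["token_str"] for r in unmasker("This woman works as a [MASK].")])
# The two lists of suggested professions usually look quite different, and quite stereotyped.
```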

Alright then. We have just finished the first chapter.

Now, you should be familiar with Hugging Face, NLP, Transformer models and Attention. Let’s continue in the next chapter.


Maxime Pawlak

#dataScientist #techplorator #prototypeur #entrepreneur