In the first part, we discovered Hugging Face and how Transformer models work. In this post, we’ll review chapters 2, 3 and 4 (the next chapters are not released yet).

Part 2 — Using Transformers

We will focus on using the Transformers library. Its features are:

  • ease of use
  • flexibility
  • simplicity

Let’s see what’s behind it. Take this full example:
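Here is a minimal sketch of that full example, based on the course’s sentiment-analysis pipeline (the two sentences are the ones the course uses):

```python
from transformers import pipeline

# Build the default sentiment-analysis pipeline (downloads a model on first run)
classifier = pipeline("sentiment-analysis")

# Classify a small batch of sentences
results = classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
])
print(results)
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]
```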

If we separate each step, we get something like the sketch below.
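A minimal version with the TensorFlow classes (the checkpoint is the one the default sentiment-analysis pipeline relies on):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# 1. Tokenizer: raw text -> input_ids + attention_mask tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")

# 2. Model: tensors -> logits (raw, unnormalized scores)
outputs = model(inputs)
print(outputs.logits)
```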

Preprocessing with a tokenizer

Transformer models can’t process raw text directly. The first step is to convert the text inputs into numbers. A tokenizer is responsible for:

  • splitting the input into words, subwords or symbols (called tokens)
  • mapping each token to an integer
  • adding additional inputs

Then, the model outputs logits. I won’t get into the details, but to keep it simple: logits are raw scores, and we need a softmax function to turn them into probabilities that sum to 1, which we can actually use.
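Continuing the sketch above (reusing tf, model and outputs from the previous block):

```python
# 3. Post-processing: turn the logits into probabilities that sum to 1
predictions = tf.nn.softmax(outputs.logits, axis=-1)
print(predictions)

# The model config tells us which class each index corresponds to
print(model.config.id2label)
```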

Tokenizers

Let’s focus now on tokenizers. They are core components of the pipeline. We will talk about 3 different kinds (but there are many more out there):

  • word-based
  • character-based
  • subword tokenization

Word-based

They split raw text into words (basic, aren’t they?)
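As a toy illustration (a plain whitespace split; the sentence is the one the course uses):

```python
text = "Jim Henson was a puppeteer"
print(text.split())
# ['Jim', 'Henson', 'was', 'a', 'puppeteer']
```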

The main cons are:

  • they can end up with a huge vocabulary
  • words like “dog” and “dogs” are similar, but they end up as two unrelated tokens
  • they need a custom token, “[UNK]”, to represent unknown words

Character-based

They split text into individual characters.
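As a toy illustration:

```python
print(list("puppeteer"))
# ['p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']
```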

The pros are:

  • the vocabulary is small
  • fewer out-of-vocabulary tokens

Subword tokenization

This one takes the best of both worlds: frequent words stay whole, while rare words are split into meaningful subwords. This is especially useful for languages like German and Turkish.
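For example, with a WordPiece tokenizer like BERT’s, a rare word is split into known subwords (a quick sketch; the exact split depends on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let's try tokenization!"))
# something like: ['let', "'", 's', 'try', 'token', '##ization', '!']
```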

Handling multiple sequences

Models expect a batch of inputs. But inputs may have different lengths, and our models need tensors with a rectangular shape.

The trick is to add padding to the shorter inputs to make sure our batch is rectangular.

In the example below, you can see that the input_ids of the second (shorter) sentence have been completed with “0”, the padding token id.

And to make sure the attention layers don’t take this padding into account, we need to add an “attention_mask” that tells the model which tokens must be attended to and which must be ignored.
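A minimal sketch (same checkpoint as before):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# padding=True pads the shorter sentence with the pad token id (0 here)
# and builds the matching attention_mask (1 = real token, 0 = padding)
batch = tokenizer(sentences, padding=True, return_tensors="tf")
print(batch["input_ids"])
print(batch["attention_mask"])
```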

This is the end of chapter 2. The course is full of well-explained examples; I deliberately kept this summary short.

Let’s see the third part.

Part 3 — Fine-tuning a pre-trained model

In this example, we will use MRPC from the GLUE benchmark. This is a famous benchmark researchers use to measure their models. The task is to tell whether pairs of sentences are equivalent (paraphrases of each other) or not.

We can easily take a dataset from the Hugging Face Hub.
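Something like this, with the datasets library:

```python
from datasets import load_dataset

# Download the MRPC configuration of the GLUE benchmark from the Hub
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)              # DatasetDict with train / validation / test splits
print(raw_datasets["train"][0])  # one example: sentence1, sentence2, label, idx
```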

The tokenizer handles pairs of sentences the way BERT expects. To differentiate the two sentences, the tokenizer introduces “token_type_ids”:
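For instance (a sketch with a BERT checkpoint; the two sentences are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs["token_type_ids"])
# 0 for the tokens of the first sentence (including [CLS] and the first [SEP]),
# 1 for the tokens of the second sentence (including the final [SEP])
```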

Transformers models are already Keras models, so we don’t have a lot of work to do before training. If you are familiar with Keras, it’s straightforward:
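A rough sketch of the training setup (I’m assuming tf_train_dataset and tf_validation_dataset are the tokenized MRPC splits already converted to tf.data datasets):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Load a pre-trained BERT with a fresh 2-class classification head
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# tf_train_dataset / tf_validation_dataset: assumed to be prepared beforehand
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
```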

After that, we can make predictions. We just need to apply a softmax to get probabilities from our logits:
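Continuing the sketch above:

```python
# Predict on the validation set and turn the logits into probabilities
preds = model.predict(tf_validation_dataset)["logits"]
probabilities = tf.nn.softmax(preds, axis=-1)
class_predictions = tf.argmax(probabilities, axis=-1)
print(class_predictions[:5])
```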

And that’s all. Our model is now fine-tuned.

Part 4 — Sharing models and tokenizers

This is the last released part of the course: sharing models and tokenizers.

Well, it’s all about how to use the Hugging Face Hub. This is “a central platform that enables anyone to discover, use, and contribute new state-of-the-art models and datasets.” I’m simply quoting the course here, as they explain it really well.

They host a lot of models (public and private). The models in the Hub are not limited to Transformers or even NLP. There are models from Flair and AllenNLP for NLP, Asteroid and pyannote for speech, and timm for vision, to name a few.

Each of these models is hosted as a Git repository, which allows versioning and reproducibility. Sharing a model on the Hub means opening it up to the community and making it accessible to anyone looking to easily use it, in turn eliminating their need to train a model on their own and simplifying sharing and usage.
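For reference, pushing a model and its tokenizer to the Hub boils down to something like this (a sketch; “my-awesome-model” is a made-up repository name, and you need to be logged in first, e.g. with huggingface-cli login):

```python
# Upload the fine-tuned model and its tokenizer to your namespace on the Hub
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")
```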

Additionally, sharing a model on the Hub automatically deploys a hosted Inference API for that model. Anyone in the community is free to test it out directly on the model’s page, with custom inputs and appropriate widgets.

The best part is that sharing and using any public model on the Hub is completely free! Paid plans also exist if you wish to share models privately.

Conclusion

I really enjoyed this course. I already knew some NLP concepts; this course refreshed them and helped me discover what Hugging Face is.

I see Hugging Face popping up more and more on social media, and what they do looks very professional and high quality.

I hope this brief post will make you take a look at their work.

Photo by Joshua Earle on Unsplash

If you have anything to tell me about this post, feel free. I’m always open to learning and sharing.

Have a really good day.

Bises

Maxime 🙃
