In the first part, we discovered Hugging Face and how Transformer models work. In this post, we’ll review chapters 2, 3 and 4 (the next chapters are not released yet).

Part 2 — Using Transformers

We will focus on using the Transformers library. Its features are:

  • ease of use
  • flexibility
  • simplicity

Let’s see what’s behind it. Take this full example:
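Here is a minimal sketch of that full example, based on the course’s sentiment-analysis pipeline (the two sentences are the ones the course uses):

```python
from transformers import pipeline

# Build the default sentiment-analysis pipeline (downloads a model on first run)
classifier = pipeline("sentiment-analysis")

# Classify a small batch of sentences
results = classifier([
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
])
print(results)
# e.g. [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]
```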

If we separate each step, we get something like the sketch below.
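A minimal version with the TensorFlow classes (the checkpoint is the one the default sentiment-analysis pipeline relies on):

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = TFAutoModelForSequenceClassification.from_pretrained(checkpoint)

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# 1. Tokenizer: raw text -> input_ids + attention_mask tensors
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="tf")

# 2. Model: tensors -> logits (raw, unnormalized scores)
outputs = model(inputs)
print(outputs.logits)
```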

Preprocessing with a tokenizer

Transformer models can’t process raw text directly. The first step is to convert the text inputs into numbers. A tokenizer is responsible for:

  • splitting the input into words, subwords or symbols (called tokens)
  • mapping each token to an integer
  • adding additional inputs

Then, the model outputs logits. I won’t get into the details, but to keep it simple: logits are raw scores, and we need a softmax function to turn them into probabilities that sum to 1, which we can actually use.
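Continuing the sketch above (reusing tf, model and outputs from the previous block):

```python
# 3. Post-processing: turn the logits into probabilities that sum to 1
predictions = tf.nn.softmax(outputs.logits, axis=-1)
print(predictions)

# The model config tells us which class each index corresponds to
print(model.config.id2label)
```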

Tokenizers

Let’s focus now on tokenizers. They are core components of the pipeline. We will talk about 3 different kinds (but there are many more out there):

  • word-based
  • character-based
  • subword tokenization

Word-based

They split raw text into words (basic, aren’t they?)
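As a toy illustration (a plain whitespace split; the sentence is the one the course uses):

```python
text = "Jim Henson was a puppeteer"
print(text.split())
# ['Jim', 'Henson', 'was', 'a', 'puppeteer']
```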

The main cons are:

  • they can end up with a huge vocabulary
  • words like “dog” and “dogs” are similar, but they end up as two unrelated tokens
  • they need a custom token, “[UNK]”, to represent unknown words

Character-based

They split text into individual characters.
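As a toy illustration:

```python
print(list("puppeteer"))
# ['p', 'u', 'p', 'p', 'e', 't', 'e', 'e', 'r']
```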

The pros are:

  • the vocabulary is small
  • fewer out-of-vocabulary tokens

Subword tokenization

This one takes the best of both worlds: frequent words stay whole, while rare words are split into meaningful subwords. This is especially useful for languages like German and Turkish.
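For example, with a WordPiece tokenizer like BERT’s, a rare word is split into known subwords (a quick sketch; the exact split depends on the vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Let's try tokenization!"))
# something like: ['let', "'", 's', 'try', 'token', '##ization', '!']
```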

Handling multiple sequences

Models expect a batch of inputs. But inputs may have different lengths, and our models need tensors with a rectangular shape.

The trick is to add padding to the shorter inputs to make sure our batch is rectangular.

In the example below, you can see that the input_ids of the second (shorter) sentence have been completed with “0”, the padding token id.

And to make sure the attention layers don’t take this padding into account, we need to add an “attention_mask” that tells the model which tokens must be attended to and which must be ignored.
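A minimal sketch (same checkpoint as before):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

sentences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

# padding=True pads the shorter sentence with the pad token id (0 here)
# and builds the matching attention_mask (1 = real token, 0 = padding)
batch = tokenizer(sentences, padding=True, return_tensors="tf")
print(batch["input_ids"])
print(batch["attention_mask"])
```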

This is the end of chapter 2. The course is full of well-explained examples; I deliberately kept this summary short.

Let’s see the third part.

Part 3 — Fine-tuning a pre-trained model

In this example, we will use MRPC from the GLUE benchmark. This is a famous benchmark researchers use to measure their models. The task is to tell whether pairs of sentences are equivalent (paraphrases of each other) or not.

We can easily take a dataset from the Hugging Face Hub.
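Something like this, with the datasets library:

```python
from datasets import load_dataset

# Download the MRPC configuration of the GLUE benchmark from the Hub
raw_datasets = load_dataset("glue", "mrpc")
print(raw_datasets)              # DatasetDict with train / validation / test splits
print(raw_datasets["train"][0])  # one example: sentence1, sentence2, label, idx
```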

The tokenizer handles pairs of sentences the way BERT expects. To differentiate the two sentences, the tokenizer introduces “token_type_ids”:
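For instance (a sketch with a BERT checkpoint; the two sentences are just placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

inputs = tokenizer("This is the first sentence.", "This is the second one.")
print(inputs["token_type_ids"])
# 0 for the tokens of the first sentence (including [CLS] and the first [SEP]),
# 1 for the tokens of the second sentence (including the final [SEP])
```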

Transformers models are already Keras models, so we don’t have a lot of work to do before training. If you are familiar with Keras, it’s straightforward:
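A rough sketch of the training setup (I’m assuming tf_train_dataset and tf_validation_dataset are the tokenized MRPC splits already converted to tf.data datasets):

```python
import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Load a pre-trained BERT with a fresh 2-class classification head
model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# tf_train_dataset / tf_validation_dataset: assumed to be prepared beforehand
model.fit(tf_train_dataset, validation_data=tf_validation_dataset, epochs=3)
```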

After that, we can make predictions. We just need to apply a softmax to get probabilities from our logits:
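Continuing the sketch above:

```python
# Predict on the validation set and turn the logits into probabilities
preds = model.predict(tf_validation_dataset)["logits"]
probabilities = tf.nn.softmax(preds, axis=-1)
class_predictions = tf.argmax(probabilities, axis=-1)
print(class_predictions[:5])
```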

And that’s all. Our model is now fine-tuned.

Part 4 — Sharing models and tokenizers

This is the last released part of the course: sharing models and tokenizers.

Well, it’s all about how to use the Hugging Face Hub. This is “a central platform that enables anyone to discover, use, and contribute new state-of-the-art models and datasets.” I’m simply quoting the course here, as they explain it really well.

They host a lot of models (public and private). The models in the Hub are not limited to Transformers or even NLP. There are models from Flair and AllenNLP for NLP, Asteroid and pyannote for speech, and timm for vision, to name a few.

Each of these models is hosted as a Git repository, which allows versioning and reproducibility. Sharing a model on the Hub means opening it up to the community and making it accessible to anyone looking to easily use it, in turn eliminating their need to train a model on their own and simplifying sharing and usage.
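For reference, pushing a model and its tokenizer to the Hub boils down to something like this (a sketch; “my-awesome-model” is a made-up repository name, and you need to be logged in first, e.g. with huggingface-cli login):

```python
# Upload the fine-tuned model and its tokenizer to your namespace on the Hub
model.push_to_hub("my-awesome-model")
tokenizer.push_to_hub("my-awesome-model")
```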

Additionally, sharing a model on the Hub automatically deploys a hosted Inference API for that model. Anyone in the community is free to test it out directly on the model’s page, with custom inputs and appropriate widgets.

The best part is that sharing and using any public model on the Hub is completely free! Paid plans also exist if you wish to share models privately.

Conclusion

I really enjoyed this course. I already knew some NLP concepts; this course refreshed them and helped me discover what Hugging Face is.

I see Hugging Face popping up more and more on social media, and what they do looks very professional and high quality.

I hope this brief post will make you take a look at their work.

Photo by Joshua Earle on Unsplash

If you have anything to tell me about this post, feel free. I’m always open to learning and sharing.

Have a really good day.

Bises

Maxime 🙃
