ROBBY NEALE: I'm Robby. I'm an engineer at Google, and I'm going to talk to you about how to leverage TF.Text for your preprocessing and language models inside of TensorFlow.

So, for those unfamiliar with language models-- all right, there we go, forward-- they're basically everywhere. You use them in query understanding, related keyword searches, article summaries, spell check, auto-complete, text-to-speech, spam filters, chat bots. You really can't get away from them. And it's a really good time to be in NLP right now, because we're going through somewhat of a renaissance. Last year, this paper on BERT was released that uses attention and transformers. I'm not going to go too much into it, other than to say that, when you're working with text, models don't play well with strings, so traditionally you convert those strings into numbers. We've used embeddings for this; GloVe, ELMo, and Word2Vec are different ways to create vector representations of your words, and these worked pretty well. The one problem is with some words: when you look up "bat" in your vocab, am I talking about an animal? Am I talking about baseball equipment? There are even words that don't sound the same, like "entrance" and "entrance," that are spelled the exact same. So when you try to represent these as a vector, you're trying to attach two different meanings to the same sequence of letters. BERT gets around this: it's a model that actually uses the context of the sentence to create the vector for each word.

And it's done really well. This is the Stanford Question Answering Dataset (SQuAD). BERT came out in late 2018; the scores before that were in the low 70s. When it came out, the scores jumped up to around 82, and ever since then people have been iterating on this model with RoBERTa, XLNet, and ALBERT. I pulled these scores last week, and you can see that the most recent model, ALBERT, is actually outperforming humans, so that's pretty crazy. It's really exciting right now to be in this field.
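To make that point about static embeddings concrete, here is a minimal sketch of a context-free lookup in TensorFlow. The toy vocabulary and embedding size are invented for illustration; the point is just that a static table gives "bat" a single vector no matter which sense is meant.

```python
import tensorflow as tf

# Toy vocabulary and embedding table (both invented for illustration).
vocab = ["the", "bat", "flew", "player", "swung"]
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)  # word -> integer id
embed = tf.keras.layers.Embedding(
    input_dim=lookup.vocabulary_size(),  # vocab plus one OOV bucket
    output_dim=8)                        # id -> 8-dimensional vector

animal_ids = lookup(tf.constant(["the", "bat", "flew"]))
sports_ids = lookup(tf.constant(["the", "player", "swung", "the", "bat"]))

# "bat" gets the same id, and therefore the exact same vector, in both
# sentences. A contextual model like BERT instead produces different vectors
# for "bat" depending on the words around it.
print(embed(animal_ids))
print(embed(sports_ids))
```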
So let's just jump right in. What is TF.Text? Our charter was basically to make programming these language models in TensorFlow easier. Traditionally, it's been very difficult to do this. Let's say you have some data, like this example of queries, and we want to train on it, right? Well, before we can do that, we need to do some preprocessing, which is basically tokenization, and that happens outside of TensorFlow, because we didn't have that capability inside of TensorFlow. Then once we've done this preprocessing, we have to somehow fit it into a tensor. So we take this preprocessed text, add it into the graph, and normally we pad out our tensors to make them a uniform shape so they're usable. Then we finally train our model, we go to publish it, and we put it on our model server. And we're like, OK, we're ready to go, right? But when we get the serving data, you can't just plug that serving data right in. We had this preprocessing that has to happen. So either you're relying on the client to transform the data as well, or you're doing it yourself, and a lot of times it's in a different language than what you wrote your original scripts in. I've seen it even when the preprocessing is exactly the same, using the exact same regex: because they're different libraries, one might consider a character class to be punctuation where the other doesn't. And so you get training skew when these preprocessing steps are different, and when you actually go to serve the model, you don't have the same performance. That's problematic, right?

So our charter was to make this as easy as possible, to support text inside of TensorFlow. To do this, we want to do all the text processing in graph. We do this through a series of text and sequential APIs that were not previously available, and we actually create a new tensor type, the RaggedTensor, that better represents this text. So if we go back to where it was painful, what we really want to do is just get rid of that separate preprocessing step, right? Put everything in the graph. Then all your preprocessing happens in graph, and when you go to serve the model, you're not relying on the client to perform those same steps when they call it.

And really, the main thing that was missing was tokenization. So last year we had an RFC with a tokenizer API, and we wanted to make it as easy and straightforward as possible. It's very simple: there's an abstract Tokenizer class with one method, tokenize, which takes a string tensor as input and gives you back your tokens. So if we look at this example, we have a couple of sentences here, and we tokenize them into words. The one thing I like to point out-- which is not completely obvious until you see examples-- is that our input is a rank-one tensor, and our output is rank two. The reason is that the tokens are grouped by the string they were split from, so it's really easy from the engineer's perspective to tell which tokens were pulled from which string in the original tensor. The one thing you can't tell from this output is where in the originating string each token came from. For that, we have one extra abstract class, TokenizerWithOffsets, which has a tokenize_with_offsets method. It's the same thing: you give it a tensor of strings as input and it gives you your tokens, but it also gives you where those tokens start and end. We can see that example here when we call tokenize_with_offsets: "I" starts at position zero and runs one character, and then "know" starts at the second position and runs up to character six. So through these offsets, if you want to know where the tokens are in your originating string, you can do that. And you'll notice the shapes of the offsets are exactly the same as the shape of the tokens, so mapping tokens to starts and limits is very simple from here.

So we provide five basic tokenizers. One of the questions when we first did the RFC was, why don't we just have one tokenizer to rule them all? The problem is, every model is different. You have different limitations and things you want to work around, and we don't want to push our opinion on you, because they're all different. We just want to build the tools and allow you to make the decision. A lot of these are very simple. The whitespace tokenizer obviously just splits a sentence on whitespace. For the Unicode script tokenizer: if you know Unicode, characters are grouped together into what are called Unicode scripts, so you have Latin characters, Greek, Arabic, and Japanese, just as some examples, and spaces, punctuation, and numbers are grouped as well, and it splits on those boundaries. I would say in the simplest case, if you're just working with English, the main difference from the whitespace tokenizer is that it splits out the punctuation.
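As a rough sketch of the API just described, here is what this looks like with the tensorflow_text package (installed separately from TensorFlow as tensorflow-text). The example sentences are made up, and the printed outputs in the comments are what I'd expect from the whitespace and Unicode script tokenizers; treat them as illustrative rather than exact.

```python
import tensorflow as tf
import tensorflow_text as tf_text

sentences = tf.constant(["I know words.", "It's ragged!"])

# tokenize(): a rank-1 tensor of strings goes in, a rank-2 RaggedTensor of
# tokens comes out, with tokens grouped by the string they were split from.
ws = tf_text.WhitespaceTokenizer()
tokens = ws.tokenize(sentences)
print(tokens)
# <tf.RaggedTensor [[b'I', b'know', b'words.'], [b"It's", b'ragged!']]>

# tokenize_with_offsets(): also returns where each token starts and ends in
# the originating string; the offsets have the same ragged shape as the tokens.
tokens, starts, limits = ws.tokenize_with_offsets(sentences)
print(starts)  # [[0, 2, 7], [0, 5]]
print(limits)  # [[1, 6, 13], [4, 12]]

# The Unicode script tokenizer splits on script boundaries, so for plain
# English the visible difference from whitespace is that punctuation comes
# out as separate tokens.
us = tf_text.UnicodeScriptTokenizer()
print(us.tokenize(sentences))
# <tf.RaggedTensor [[b'I', b'know', b'words', b'.'], [b'It', b"'", b's', b'ragged', b'!']]>
```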
So WordPiece: this was popularized by the BERT model, which I mentioned earlier. It basically takes text that you've already tokenized,