
  • ROBBY NEALE: I'm Robby.

  • Yeah.

  • I'm an engineer at Google, and I'm

  • going to talk to you about how to leverage

  • TF.Text for your preprocessing and language models

  • inside of TensorFlow.

  • So, for those unfamiliar with language models--

  • All right, there we go.

  • Forward.

  • They're basically everywhere.

  • You use them in query understanding,

  • you have related keyword searches, article summaries,

  • spell check, auto-complete, text-to-speech, spam filters,

  • chat bots.

  • You really can't get away from them.

  • And it's really a good time to be into NLP right now,

  • because we're going through somewhat of a Renaissance.

  • Last year, this paper on BERT was

  • released that uses attention and transformers.

  • I'm not going to go too much into it,

  • other than traditionally, obviously when

  • you're working with text, models don't play as well

  • with strings, so you convert those into numbers.

  • And so we've used embeddings; GloVe, ELMo,

  • and Word2Vec are different ways

  • to create vector representations of your words.

  • And these worked pretty well.

  • The one problem is, with some words,

  • when you look them up in your vocab, like bat,

  • am I talking about an animal?

  • Am I talking about baseball equipment?

  • Even words that don't sound the same, like entrance (a way in)

  • and entrance (to captivate), are spelled the exact same.

  • So when you're trying to represent these as a vector,

  • you're trying to map two different meanings

  • to the same sequence of letters.

  • And so, BERT has gotten around this by--

  • It's a model that actually uses the context of the sentence

  • to actually create this vector for the words.

  • And so, it's done really well, and this is the Stanford

  • Question Answering Dataset (SQuAD).

  • So BERT was late 2018.

  • The scores before that were in the low 70s.

  • It came out, it jumped up to around 82,

  • and ever since then, people have been iterating on this model

  • through RoBERTa, XLNet, and ALBERT. And I pulled the scores

  • up from last week.

  • And you can see that the most recent model, ALBERT, is

  • actually outperforming humans, so that's pretty crazy.

  • So it's really exciting right now to be into this.

  • And so let's just jump right in.

  • So what is TF.Text?

  • Our charter was basically to make programming these language

  • models in TensorFlow easier.

  • Traditionally, it's been very difficult to do this.

  • Let's say you have some data.

  • Like, here's an example of queries,

  • and we want to train on this data, right?

  • Well, before we can do that, we need

  • to do some preprocessing, which is basically

  • tokenization, outside of TensorFlow, because we

  • didn't have that capability inside of TensorFlow.

  • And then once we did this preprocessing,

  • we had to somehow fit it into a tensor.

  • So we would get this preprocessed text,

  • add it into the graph, and then normally we

  • pad out our tensors to make them a uniform shape

  • so that they're valid inputs.
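For illustration, here is a minimal sketch of that older out-of-graph workflow; the vocabulary and queries are made up, and real pipelines were of course more involved:

```python
# A minimal sketch of the old workflow: tokenize and look up IDs *outside*
# the graph, then pad everything to a uniform shape before feeding it in.
# The vocabulary and queries here are made up.
import tensorflow as tf

queries = ["tf text", "whats a language model"]
vocab = {"tf": 1, "text": 2, "whats": 3, "a": 4, "language": 5, "model": 6}

# Preprocessing happens in plain Python, not in the TensorFlow graph.
token_ids = [[vocab[w] for w in q.split()] for q in queries]

# Pad every example to the same length so it fits a rectangular tensor.
padded = tf.keras.preprocessing.sequence.pad_sequences(token_ids, padding="post")
inputs = tf.constant(padded)  # shape (2, 4); the padding zeros carry no meaning
print(inputs)
```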

  • And then we'd finally train our model,

  • and we'd go to publish it, and we'd

  • put it on our model server.

  • And we're like, OK, we're ready to go, right?

  • And then so, when we get the serving data,

  • well, you can't just plug that serving data right in, right?

  • We had this preprocessing that has to happen.

  • And so, either you're relying on the client

  • to transform the data as well, or you're doing it yourself,

  • and a lot of times it's a different language than the one

  • you wrote your training scripts in.

  • And I've seen it happen even when the preprocessing is

  • supposedly exactly the same.

  • It uses the exact same regex, but because the libraries are

  • different, one might consider a character class to be

  • punctuation where the other doesn't.

  • And so you get training/serving skew when these preprocessing steps

  • are different.

  • And so when you actually go to serve the model,

  • you don't have the same performance,

  • and that's problematic, right?

  • So our charter was to make this as easy as possible,

  • to support text inside of TensorFlow.

  • And to do this, we want to do all the text processing

  • in graph.

  • And we do this through a series of text and sequential APIs

  • that were not previously available, and through

  • a new tensor type, the RaggedTensor, that

  • better represents this text.
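As a quick sketch of what a RaggedTensor buys you (the sentences are made up), each row keeps its own length instead of being padded into a rectangle:

```python
# A short sketch of a RaggedTensor: each row keeps its own length, so
# tokenized sentences don't need to be padded into a rectangular tensor.
import tensorflow as tf

tokens = tf.ragged.constant([["I", "know"],
                             ["It", "is", "what", "you", "know", "for", "sure"]])
print(tokens.shape)          # (2, None) -- the second dimension is ragged
print(tokens.row_lengths())  # [2, 7]

# If a downstream op really needs a dense tensor, you can still pad explicitly.
print(tokens.to_tensor(default_value=""))
```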

  • So if we go back to when it was painful,

  • what we really want to do is just

  • get rid of this preprocessing step, right?

  • Put everything in the graph.

  • And so, all your preprocessing happens in graph,

  • and then when you go to serve the model,

  • you're not relying on the client to perform

  • those same steps when they call it.
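As a rough sketch of that idea, not the exact code from the talk, tokenization can be baked into the exported SavedModel so serving clients just send raw strings; this assumes the tensorflow_text WhitespaceTokenizer introduced shortly, and the model body and export path are made up:

```python
# A hedged sketch: preprocessing lives inside the SavedModel, so the serving
# client sends raw strings and no separate preprocessing step is needed.
import tensorflow as tf
import tensorflow_text as tf_text

class QueryModel(tf.Module):
    def __init__(self):
        super().__init__()
        self.tokenizer = tf_text.WhitespaceTokenizer()

    @tf.function(input_signature=[tf.TensorSpec(shape=[None], dtype=tf.string)])
    def __call__(self, queries):
        tokens = self.tokenizer.tokenize(queries)   # preprocessing happens in-graph
        # ... a real model would embed and score the tokens here ...
        return tokens.to_tensor(default_value="")   # dense output for the caller

model = QueryModel()
tf.saved_model.save(model, "/tmp/query_model",
                    signatures=model.__call__.get_concrete_function())
```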

  • And so really the main thing that was missing

  • was tokenization.

  • So last year, we had an RFC with a tokenizer API.

  • And we wanted to make this as easy as

  • possible and straightforward.

  • So it's very simple, it's an abstract tokenizer class,

  • it has one method, tokenize.

  • It takes a string tensor as input and gives you back your tokens.

  • And so if we see this, it's very simple.

  • We have a couple of sentences here.

  • We tokenize them into words.

  • The one thing I like to point out--

  • which is not completely obvious immediately until you

  • see examples-- is that our input is a rank one tensor,

  • and our output is a rank two.

  • The reason why this is, is our tokens

  • are grouped by the string that they're split from.

  • And so it's really easy from the engineer's perspective

  • to be able to tell which string tokens were pulled from which

  • string in the original tensor.
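Here is a small sketch of that shape behavior with the WhitespaceTokenizer on a couple of made-up sentences, showing the rank-1 input and rank-2 ragged output:

```python
# A small sketch of the Tokenizer API: rank-1 input strings in, a rank-2
# RaggedTensor of tokens out, grouped by the string they came from.
import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
sentences = tf.constant(["I know", "you know nothing"])  # made-up sentences

tokens = tokenizer.tokenize(sentences)
print(tokens)        # [[b'I', b'know'], [b'you', b'know', b'nothing']]
print(tokens.shape)  # (2, None) -- one ragged row of tokens per input string
```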

  • The one thing you can't do from this output

  • is tell where in that originating string

  • it came from.

  • And for that, we have one extra abstract class,

  • TokenizerWithOffsets, which adds a tokenize_with_offsets

  • method that does the same thing.

  • You give it an input tensor of strings,

  • it gives you your tokens, but also

  • gives you where those tokens start and end.

  • So we can see that example here, we

  • call tokenize_with_offsets.

  • And we can see the offsets.

  • "I" starts at zero and ends at one, and then "know"

  • starts at position two and ends at position six.

  • So through these offsets, if you want

  • to know where the tokens are in your originating string,

  • you can do that.

  • And you'll notice the shapes are exactly the same

  • as the shape of the tokens.

  • So mapping tokens to starts and limits

  • is very simple from here.
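A short sketch of the same idea with tokenize_with_offsets, again on a made-up sentence:

```python
# A sketch of tokenize_with_offsets: alongside the tokens you get where each
# token starts and ends in the original string.
import tensorflow as tf
import tensorflow_text as tf_text

tokenizer = tf_text.WhitespaceTokenizer()
sentence = tf.constant(["I know"])

tokens, starts, ends = tokenizer.tokenize_with_offsets(sentence)
print(tokens)  # [[b'I', b'know']]
print(starts)  # [[0, 2]] -- "I" begins at 0, "know" begins at 2
print(ends)    # [[1, 6]] -- "I" ends before 1, "know" ends before 6
```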

  • So we provide five basic tokenizers.

  • You know, one of the questions when we first

  • did the RFC was, why don't we just have one,

  • one tokenizer to rule them all?

  • The problem is, every model's different.

  • You have different limitations and things

  • you want to get around, and we don't

  • want to push our opinion on you, because they're all different.

  • We just want to build the tools and allow

  • you to make the decision.

  • And so a lot of these are very simple.

  • The WhitespaceTokenizer obviously just splits a sentence on whitespace.

  • Unicode script, so if you know Unicode,

  • characters are grouped together in what

  • they call Unicode scripts.

  • So you would have Latin characters, Greek, Arabic,

  • and Japanese, just as some examples.

  • And then they also group spaces, punctuation,

  • and numbers as well, and so it splits on those.

  • I would say in the simplest case, if you're just

  • working with English, the main difference from the WhitespaceTokenizer

  • is that it splits out the punctuation.
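A small sketch of that difference on a made-up English sentence, comparing the two tokenizers:

```python
# On English text, the UnicodeScriptTokenizer mostly behaves like whitespace
# splitting, but it also splits off punctuation into separate tokens.
import tensorflow as tf
import tensorflow_text as tf_text

sentence = tf.constant(["Don't panic, it's fine."])  # made-up example

print(tf_text.WhitespaceTokenizer().tokenize(sentence))
# [[b"Don't", b'panic,', b"it's", b'fine.']]

print(tf_text.UnicodeScriptTokenizer().tokenize(sentence))
# [[b'Don', b"'", b't', b'panic', b',', b'it', b"'", b's', b'fine', b'.']]
```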

  • So WordPiece, this was popularized by the BERT model

  • which I mentioned earlier.

  • It basically takes text that you've already tokenized,