  • ROBBY NEALE: I'm Robby.

  • Yeah.

  • I'm an engineer at Google, and I'm

  • going to talk to you about how to leverage

  • TF text for your preprocessing and language models

  • inside of TensorFlow.

  • So, for those unfamiliar with language models--

  • All right, there we go.

  • Forward.

  • They're basically everywhere.

  • You use them in query understanding,

  • you have related keyword searches, article summaries,

  • spell check, auto-complete, text-to-speech, spam filters,

  • chat bots.

  • You really can't get away from them.

  • And it's really a good time to be into NLP right now,

  • because we're going through somewhat of a Renaissance.

  • Last year, this paper on BERT was

  • released that uses attention and transformers.

  • I'm not going to go too much into it,

  • other than traditionally, obviously when

  • you're working with text, models don't play as well

  • with strings, so you convert those into numbers.

  • And so we've used embeddings, different ways to embed,

  • like GloVe, ELMo, Word2Vec are different ways

  • to create vector representations of your words.

  • And these worked really pretty well.

  • The one problem is, with some words,

  • when you look them up in your vocab, like bat,

  • am I talking about an animal?

  • Am I talking about baseball equipment?

  • Even words that don't sound the same, like entrance (a way in) and entrance (to captivate),

  • they're spelled the exact same.

  • So when you're trying to represent these as a vector,

  • you're getting, you're trying to have two different meanings

  • to the same sequence of letters.

  • And so, BERT has gotten around this by--

  • It's a model that actually uses the context of the sentence

  • to actually create this vector for the words.

  • And so, it's done really well, and this is the Stanford

  • question and answer data set.

  • So BERT was late 2018.

  • The scores before that were in the low 70s.

  • It came out, it jumped up to around 82,

  • and ever since then, people have been iterating on this model

  • through RoBERTa, XLNet, and ALBERT. And I pulled the scores

  • up from last week.

  • And you can see that the most recent model, ALBERT, is

  • actually outperforming humans, so that's pretty crazy.

  • So it's really exciting right now to be into this.

  • And so let's just jump right in.

  • So what is TF text?

  • Our charter was basically to make programming these language

  • models in TensorFlow easier.

  • Traditionally, it's been very difficult to do this.

  • You would get, let's say you have some data.

  • Like, here's an example of queries,

  • and we want to train on this data, right?

  • Well, before we can do that, we need

  • to do some preprocessing, which is basically

  • tokenization done outside of TensorFlow, because we

  • didn't have that availability inside of TensorFlow.

  • And then once we did this preprocessing,

  • we had to somehow fit it into a tensor.

  • So we would get this preprocessed text,

  • add it into the graph, and then normally we

  • pad out our tensors to make them a uniform shape

  • so that they're usable.

  • And then we'd finally train our model,

  • and we'd go to publish it, and we'd

  • put it on our model server.

  • And we're like, OK, we're ready to go, right?

  • And then so, when we get the serving data,

  • well, you can't just plug that serving data right in, right?

  • We had this preprocessing that has to happen.

  • And so, either you're relying on the client

  • to transform the data as well, or you're doing it yourself,

  • and a lot of times it's a different language than what

  • you did your normal scripts in.

  • And I've seen it even when the preprocessing is supposedly exactly

  • the same.

  • It used the exact same regex, but because they're different

  • libraries, one might consider a character class to be

  • punctuation where the other doesn't.

  • And so you get training skew when these preprocessing steps

  • are different.

  • And so when you actually go to serve the model,

  • you don't have the same performance,

  • and that's problematic, right?

  • So our charter was to make this as easy as possible,

  • to support text inside of TensorFlow.

  • And to do this, we want to do all the text processing

  • in graph.

  • And we do this through a series of text and sequential APIs

  • that were not previously available, and by creating

  • a new tensor type, called the Ragged Tensor, that

  • better represents this text.

  • So if we go back to when it was painful,

  • what we really want to do is just

  • get rid of this preprocessing step, right?

  • Put everything in the graph.

  • And so, all your preprocessing happens in graph,

  • and then when you go to serve the model,

  • you're not relying on the client to perform

  • those same steps when you serve the model and they call it.

  • And so really the main thing that was missing

  • was tokenization.

  • So last year, we had an RFC with a tokenizer API.

  • And we wanted to make this as easy as

  • possible and straightforward.

  • So it's very simple, it's an abstract tokenizer class,

  • it has one method, tokenize.

  • It takes as input a tensor of strings, and gives you back your tokens.

  • And so if we see this, it's very simple.

  • We have a couple of sentences here.

  • We tokenize them into words.
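
For illustration, a minimal sketch of that call using the public tensorflow_text WhitespaceTokenizer (the sentences here are just made up):

    import tensorflow_text as text

    tokenizer = text.WhitespaceTokenizer()
    tokens = tokenizer.tokenize(["I know words.", "I have the best words."])
    # One ragged row of tokens per input string, e.g.
    # [[b'I', b'know', b'words.'], [b'I', b'have', b'the', b'best', b'words.']]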

  • The one thing I like to point out--

  • which is not completely obvious immediately until you

  • see examples-- is that our input is a rank one tensor,

  • and our output is a rank two.

  • The reason why this is, is our tokens

  • are grouped by the string that they're split from.

  • And so it's really easy from the engineer's perspective

  • to be able to tell which string tokens were pulled from which

  • string in the original tensor.

  • The one thing you can't do from this output,

  • is tell where in that originating string

  • it came from.

  • And for that,

  • we have one extra abstract class, TokenizerWithOffsets,

  • that has tokenize_with_offsets, which is the same thing.

  • You give it an input, tensor of strings,

  • it gives you your tokens, but also

  • gives you where those tokens start and end.

  • So we can see that example here, we

  • call tokenize_with_offsets.

  • And we can see the letters.

  • "I" starts at zero and ends at position one, and then "know"

  • starts at the second position and ends at the sixth character.

  • So through these offsets, if you want

  • to know where the tokens are in your originating string,

  • you can do that.

  • And you'll notice the shapes are exactly the same

  • as the shapes of the tokens.

  • So mapping tokens to starts and limits

  • is very simple from here.
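
A minimal sketch of that variant, again with the WhitespaceTokenizer (in some versions the returned offsets are named start/limit rather than start/end):

    import tensorflow_text as text

    tokenizer = text.WhitespaceTokenizer()
    tokens, starts, ends = tokenizer.tokenize_with_offsets(["I know words."])
    # tokens: [[b'I', b'know', b'words.']]
    # starts: [[0, 2, 7]]
    # ends:   [[1, 6, 13]]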

  • So we provide five basic tokenizers.

  • You know, one of the questions when we first

  • did the RFC was, why don't we just have one,

  • one tokenizer to rule them all?

  • The problem is, every model's different.

  • You have different limitations and things

  • you want to get around, and we don't

  • want to push our opinion on you, because they're all different.

  • We just want to build the tools and allow

  • you to make the decision.

  • And so a lot of these are very simple.

  • Whitespace obviously just splits a sentence on whitespace.

  • Unicode script, so if you know Unicode,

  • characters are grouped together in what

  • they call Unicode scripts.

  • So Latin characters, Greek, Arabic, and

  • Japanese are just some examples.

  • And then they also group spaces, punctuation,

  • and numbers as well, and so it splits on those.

  • I would say in the most simple case, if you're just

  • working with English, the main difference from the whitespace tokenizer

  • is, it splits out the punctuation.

  • So Wordpiece, this was popularized by the BERT model

  • which I mentioned earlier.

  • It basically takes text that you've already tokenized,

  • and then splits those words into even smaller sub-word units.

  • So this is actually, it greatly reduces

  • the size of your vocabulary.

  • As you're trying to encapsulate more information,

  • your vocabulary will grow.

  • And by actually breaking the words down into sub-word units,

  • you can make that much smaller,

  • and encapsulate more meaning in less data.

  • And to generate the vocab, we have

  • a Beam pipeline in our GitHub so you can generate your own.

  • Or the original BERT model has a vocab you can use.
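
A rough sketch of how the WordpieceTokenizer gets wired up; the tiny in-memory vocab here is purely illustrative, and in practice you would load a real wordpiece vocab file:

    import tensorflow as tf
    import tensorflow_text as text

    # Hypothetical toy vocab just to show the mechanics.
    vocab = ["they", "##'", "##re", "the", "great", "##est", "[UNK]"]
    init = tf.lookup.KeyValueTensorInitializer(
        vocab, tf.range(len(vocab), dtype=tf.int64),
        key_dtype=tf.string, value_dtype=tf.int64)
    table = tf.lookup.StaticVocabularyTable(init, num_oov_buckets=1)

    wp = text.WordpieceTokenizer(table, token_out_type=tf.string)
    # Input is text that has already been tokenized into words.
    print(wp.tokenize([["they're", "the", "greatest"]]))
    # e.g. [[[b'they', b"##'", b'##re'], [b'the'], [b'great', b'##est']]]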

  • SentencePiece is a very popular tokenizer.

  • So this was actually released previously,

  • there's a GitHub repository people have downloaded,

  • and it's pretty popular.

  • It basically takes a configuration where you've already set up

  • a bunch of preprocessing steps,

  • and you feed that to it, and it does that.

  • And so it does sub-word tokenization, word

  • and character.

  • And finally, we're releasing a BERT one that

  • does all the preprocessing that the original BERT paper did,

  • and so you can use, like I said, their Wordpiece tokenization,

  • and it'll do the pre-tokenization steps,

  • some other normalization, and then

  • the Wordpiece tokenization.

  • So now that we have tokenizers, we really just needed

  • a way to represent these.

  • And that's where we created Ragged Tensors, for this better

  • representation of text.

  • So if we look at an example, we have two sentences.

  • And like I said, normally your sentences

  • are never of the same length.

  • So when you try and create a tensor out of these,

  • you get a value error.

  • It needs to be of a uniform shape.

  • And so traditionally, like I said previously,

  • we padded out the strings.

  • And in this example, you're like, OK,

  • three extra values is not so bad.

  • But when you're actually writing out these models,

  • you don't know how long your sentences are going to be,

  • so you have a fixed size.

  • And so a lot of times, I've seen fixed size of 128 characters

  • or 128 words, and you just pad.

  • It has all this extra information

  • that you don't really need inside your tensor.

  • And then if you try and make that smaller,

  • when you do have a long sentence, then

  • those sentences are truncated.

  • So you might think, well, we have the Sparse Tensor.

  • And this is also not quite as good,

  • because there's a lot of wasted data

  • that you're having to supply for a Sparse Tensor.

  • As you know, or if you don't, Sparse Tensors--

  • because really, in TensorFlow, everything is made of tensors.

  • So it's actually made of three tensors, which

  • are the values, a shape, and then the indices where those values exist

  • within your matrix shape.

  • And so you can see, there's actually a pattern.

  • Because Ragged Tensors aren't necessarily sparse; for our strings,

  • it's not that they're sparse, they're dense,

  • they just have rows of varying lengths.

  • So it would be good if we could say, hey,

  • the first row has indices 0 through 5.

  • The second row has indices 0 through 2,

  • and those make up our sentences.

  • Excuse me.

  • And so that's what we did with Ragged Tensors.

  • It's easy to create, you just call tf.ragged.constant

  • to create it.

  • It's similar, we built--

  • like a Sparse Tensor, it's made up of values and row splits.

  • And so it minimizes the waste of information.

  • So you can see that all the values are in one tensor,

  • and then we say where we want to split up that tensor

  • to build up our different rows.

  • It's easier to kind of see it in this form

  • where the gray block on the left side

  • is what the Ragged Tensor is in this representation,

  • and on the right is how it would look represented.

  • And down below is how you would actually do that call,

  • or build this if you're using values inside of TensorFlow.

  • And so, this was the original way we had row splits.

  • We had some people come to us, they represented

  • these in different ways, so we also

  • provide row IDs where the ID tells where that value is

  • inside your tensor, and row lengths that

  • says the lengths of each row.

  • So the first row takes the first four values.

  • You could have empty rows, so a length of 0, then 2, and so on.
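
A minimal sketch of those three equivalent constructions (the values are made up):

    import tensorflow as tf

    values = tf.constant([3, 1, 4, 1, 5, 9, 2, 6])
    rt1 = tf.RaggedTensor.from_row_splits(values, row_splits=[0, 4, 4, 6, 8])
    rt2 = tf.RaggedTensor.from_value_rowids(
        values, value_rowids=[0, 0, 0, 0, 2, 2, 3, 3], nrows=4)
    rt3 = tf.RaggedTensor.from_row_lengths(values, row_lengths=[4, 0, 2, 2])
    # All three represent [[3, 1, 4, 1], [], [5, 9], [2, 6]],
    # including the empty second row.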

  • And so we want to treat these like any normal tensor.

  • So Ragged Tensors, they have rank,

  • just like you would see with normal tensors.

  • So in this example, we have a rank of 2.

  • They also have shape; the question mark,

  • when we print our shape, denotes the ragged dimension.

  • It's not necessarily always on the end, but in this case,

  • it is on the end.

  • And we can use normal TensorFlow functions and ops

  • like we would with normal tensors.

  • And so here, we're just using gather,

  • which grabs the second and then the first row,

  • gather_nd, which grabs by index, and

  • concat, where you can concatenate on the different axes.

  • And of course, we made this for sequential and text processing,

  • so your string ops work well with Ragged Tensors.

  • So here we decode the strings into code points,

  • and encode them back into strings, and conditionals work as well.

  • So in this case, a where clause, where we use Ragged

  • Tensors inside.

  • The one caveat here is that the Ragged Tensors passed to where

  • must have the same row splits.

  • So the rows must be the same lengths.
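
A short sketch of those ops applied to a Ragged Tensor (the strings are illustrative):

    import tensorflow as tf

    rt = tf.ragged.constant([["I", "know"], ["words"]])
    tf.gather(rt, [1, 0])        # [['words'], ['I', 'know']]
    tf.gather_nd(rt, [[0, 1]])   # ['know']
    tf.concat([rt, rt], axis=1)  # [['I', 'know', 'I', 'know'], ['words', 'words']]

    codepoints = tf.strings.unicode_decode(rt, "UTF-8")   # ragged codepoints
    tf.strings.unicode_encode(codepoints, "UTF-8")        # back to the strings

    # tf.where on Ragged Tensors, as noted, needs matching row splits.
    cond = tf.ragged.constant([[True, False], [True]])
    tf.where(cond, rt, tf.ragged.constant([["x", "y"], ["z"]]))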

  • And it's easy to convert into and out of Ragged Tensors.

  • You can just do from_tensor, or from_sparse

  • to create a RaggedTensor, and then to move back,

  • you just have your RaggedTensor and just call

  • to_tensor or to_sparse.

  • And then to_list actually gives you a list

  • if you want to print it out.
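
A minimal sketch of those conversions:

    import tensorflow as tf

    rt = tf.ragged.constant([[1, 2, 3], [4]])

    dense = rt.to_tensor(default_value=0)   # [[1, 2, 3], [4, 0, 0]]
    sparse = rt.to_sparse()

    tf.RaggedTensor.from_tensor(dense, padding=0)  # back to [[1, 2, 3], [4]]
    tf.RaggedTensor.from_sparse(sparse)            # back to [[1, 2, 3], [4]]

    rt.to_list()                                   # plain Python: [[1, 2, 3], [4]]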

  • So we're also adding support to Keras.

  • These are the layers that are currently

  • compatible, and there's a lot

  • left we have to do on this front.

  • So we're pushing to get more layers compatible with Ragged

  • Tensors. If you are using them within your Keras model

  • and come across something that's not compatible,

  • in tensorflow_text, there at the bottom,

  • we do provide a ToDense layer that will just convert it

  • for you.

  • And the real cool thing that I want to point out

  • is the RNN support.

  • So we've seen on tests that we get a 10% average speedup,

  • and with large batches, like 30% or more.

  • This is very exciting, and I won't go into details,

  • but it's very intuitive, because if you

  • think about when you are looping through your tensor,

  • you know when you're at the end of that ragged dimension,

  • you can stop computation.

  • Where before, if you're using tensors,

  • you're using mask values, and masks

  • can be not necessarily at the end, but in the middle,

  • so you have to keep computing until you're at the

  • end of the full tensor length or width.

  • So yeah, you just have a lot less computation,

  • and so you save a lot there.

  • All right.

  • So I want to go over a couple examples,

  • show you how easy it is to work with.

  • First, you can just install tensorflow_text with pip.

  • Our versions now map to TensorFlow versions,

  • so if you're using TensorFlow 2.0, use tensorflow_text 2.0.

  • If you're using TensorFlow 1.15, use tensorflow_text 1.15.

  • Because of the custom ops, versioning must match.

  • And you can import it like this.

  • We generally import TensorFlow text as text,

  • so in these examples, you will see it written as text.
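
So the setup, in full, is just the pip install plus the import, with the version pinned to match your TensorFlow install:

    # pip install tensorflow_text==2.0.0   (pair a 1.x TensorFlow with a 1.x version)
    import tensorflow as tf
    import tensorflow_text as text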

  • So let's go over some basic preprocessing you might do.

  • So normally, you'll get your input text.

  • Here we have a couple of sentences.

  • We want to tokenize it, split those sentences into words,

  • and then we want to map those words

  • into IDs inside our vocabulary that we

  • will feed into our model.

  • And so the preprocess function might

  • look like this, where we just instantiate the tokenizer,

  • create a Ragged Tensor out of that input,

  • and then map the table.lookup into our vocabulary

  • along the values of that Ragged Tensor.

  • So if you remember, what the Ragged Tensor

  • looked like underneath, when we have our words and tokens,

  • we have the RaggedTensor above, where the values are set,

  • and then the row_splits are separate, in a separate tensor.

  • So really when we want to map those words to IDs,

  • we're keeping the same shape.

  • We only want to map over the values.

  • And so that's why the map over values

  • is there, because we're just converting,

  • we're doing the look-ups on each word individually.

  • And so the resulting Ragged Tensor is there at the end,

  • and we can see what it actually represents above.

  • And so this is our preprocessing.
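
A sketch of what that preprocess function can look like; the vocab file name and the lookup-table setup here are assumptions for illustration, not the exact code from the talk:

    import tensorflow as tf
    import tensorflow_text as text

    # Word -> id table built from a one-word-per-line vocab file (path is illustrative).
    vocab_table = tf.lookup.StaticVocabularyTable(
        tf.lookup.TextFileInitializer(
            "vocab.txt", tf.string, tf.lookup.TextFileIndex.WHOLE_LINE,
            tf.int64, tf.lookup.TextFileIndex.LINE_NUMBER),
        num_oov_buckets=1)

    tokenizer = text.WhitespaceTokenizer()

    def preprocess(strings):
        tokens = tokenizer.tokenize(strings)   # Ragged Tensor of words
        # Look up each word individually while keeping the ragged row structure.
        return tf.ragged.map_flat_values(vocab_table.lookup, tokens)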

  • Once we're done, using tf.data,

  • normally you might create a dataset from it.

  • You map that preprocessing function over your data set.

  • I won't go into model details, but you

  • can create a model with Keras pretty simply,

  • and then fit that data set on the model,

  • and that trains the model.
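
A minimal sketch of that pipeline, reusing the preprocess function sketched above; the example queries, labels, embedding size, and model layers are placeholder assumptions:

    queries = ["cheap flights to denver", "win a free cruise now"]
    labels = [0, 1]

    dataset = (tf.data.Dataset.from_tensor_slices((queries, labels))
               .batch(2)
               .map(lambda x, y: (preprocess(x).to_tensor(), y)))  # pad to dense for these layers

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=16),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(dataset, epochs=3)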

  • And so you can use that same preprocessing function

  • in your serving input function.

  • So you have the same preprocessing

  • that's done at training time, as it

  • is in serving time with your inference,

  • and this prevents training skew that we have seen

  • multiple times in the past.

  • I have, at least.

  • So let's go over another example, character bigram

  • model, here.

  • Before I jump in, I just want to quickly go over Ngrams.

  • So a bigram is like a form of an ngram with a width of two.

  • It's basically, I'd say, like a grouping

  • of a fixed size over a series.

  • We provide three different ways to join those together.

  • So there's the string join, and you can sum values, and also

  • take averages.

  • So let's pull in an example here.

  • So here, we're doing a bigram of words.

  • So we have a sentence, we just tokenize it to split it up

  • into words, and then we call the ngram function in TensorFlow

  • text that groups those words together, which

  • basically is joining a string.

  • So every two words are grouped together, as you can see.

  • And so, that's generally a bigram.

  • So, trigrams is three, so you can see here,

  • we split our sentence into characters

  • and then we grouped them together with every three

  • characters.

  • We set the width to three.

  • And then in this situation, the default separator is a space,

  • and so we just pass in the empty string instead.

  • As for the other two reduction types, it also works with numbers.

  • So if we have a series here, 2, 4, 6, 8, 10 as our tensor,

  • we want to sum up every two numbers.

  • So 2 plus 4 is 6, 4 plus 6 is 10, and so on.

  • And then also, average, which is the mean reduction type.
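
A short sketch of those three reduction types using the ngrams op (the inputs are made up):

    import tensorflow as tf
    import tensorflow_text as text

    words = text.WhitespaceTokenizer().tokenize(["the quick brown fox"])
    text.ngrams(words, width=2, reduction_type=text.Reduction.STRING_JOIN)
    # [[b'the quick', b'quick brown', b'brown fox']]

    values = tf.ragged.constant([[2.0, 4.0, 6.0, 8.0, 10.0]])
    text.ngrams(values, width=2, reduction_type=text.Reduction.SUM)   # [[6.0, 10.0, 14.0, 18.0]]
    text.ngrams(values, width=3, reduction_type=text.Reduction.MEAN)  # [[4.0, 6.0, 8.0]]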

  • So where this might--

  • generally, when you talk about ngrams,

  • you are talking about it in a language context.

  • But where this would be helpful, let's

  • say if you're taking temperature readings every 20 minutes.

  • And so, you had a series of temperature readings

  • every 20 minutes, but what you want to actually feed

  • into your model is an average of those temperatures over an hour

  • period every 20 minutes, you can do a trigram

  • with a reduction of mean.

  • So it takes the average of those 20 minute intervals,

  • and so you get average temperature

  • over the hour at every 20 minutes,

  • and you can feed that into your model.

  • But generally, like I said, with the bigrams and trigrams,

  • it's often used in NLP.

  • And how that works is you generally split it

  • up, either into words or characters,

  • and then have a vocabulary dictionary

  • you can look up those groupings in.

  • In our example, we cheat a little bit.

  • We get our codepoints from our input.

  • So we have this input.

  • We can get codepoints.

  • As you see, again, the rank is increased.

  • So we had a shape of three, and now have a shape of three

  • with a ragged dimension.

  • And we use merge_dims to actually combine

  • those two dimensions, because we don't

  • care about that distinction in this case.

  • And so it takes the second to last axis and the last axis

  • and combines them.

  • And then we're just summing those up

  • to create kind of our unique ID in this case

  • that we'll feed into the model.

  • I think generally, like I said, you would do string joins

  • and look those up in a vocabulary,

  • but for this example model, it works.

  • And we just cast those values, and this

  • is our preprocessing function, that again, we

  • create a data set using TFRecordDataset,

  • map our preprocessing function on those values.

  • And then the model that's created,

  • we can train using this preprocessing function.
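
A sketch of that codepoint trick; the input strings are illustrative, and summing bigrams of codepoints is the same shortcut described in the talk rather than a proper vocabulary lookup:

    import tensorflow as tf
    import tensorflow_text as text

    queries = ["hello there", "hi"]
    codepoints = tf.strings.unicode_decode(queries, "UTF-8")   # ragged ints, one row per string
    # If the codepoints came from already-tokenized text, the extra word dimension
    # could be collapsed first with merge_dims(-2, -1).
    bigram_ids = text.ngrams(codepoints, width=2, reduction_type=text.Reduction.SUM)
    features = tf.cast(bigram_ids, tf.float32)   # cast for the model, as in the talk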

  • Finally, I was going to go over the BERT preprocessing.

  • There's a little bit more code in this one.

  • So I just want to say that we provide the BERT

  • tokenizer for you, so feel comfortable in knowing

  • that you don't really have to write this if you don't want

  • to, you can just use the BERT tokenizer, tokenize,

  • and it does all this stuff for you.
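
A minimal sketch of using the provided BertTokenizer instead of hand-rolling the steps that follow; the vocab file path and table setup are assumptions:

    import tensorflow as tf
    import tensorflow_text as text

    bert_vocab_table = tf.lookup.StaticVocabularyTable(
        tf.lookup.TextFileInitializer(
            "bert_vocab.txt", tf.string, tf.lookup.TextFileIndex.WHOLE_LINE,
            tf.int64, tf.lookup.TextFileIndex.LINE_NUMBER),
        num_oov_buckets=1)

    bert_tokenizer = text.BertTokenizer(bert_vocab_table, lower_case=True)
    wordpiece_ids = bert_tokenizer.tokenize(["It's a trap!"])
    # Ragged Tensor of wordpiece ids, grouped by word, grouped by sentence.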

  • But I feel like there's a lot of good examples

  • in what this does, and if you're

  • doing text preprocessing, these are things you should probably

  • think about and know about, so I wanted to go over it with you.

  • So this is like a slim version of that.

  • So what it does, what BERT did in its preprocessing,

  • it lower cased and normalized the text,

  • and then it did some basic tokenization.

  • It split out Chinese characters and emoji

  • by character splitting, and then it did Wordpiece

  • on top of all that.

  • So with lowercasing and normalizing,

  • this is a very common thing that you would do.

  • When you're looking up words in your vocab,

  • you want the words to match and not have duplicate words.

  • So capitalization kind of gets in the way with that.

  • You know, words at the beginning of a sentence are capitalized,

  • so when you look them up, they would be in your dictionary

  • or vocabulary twice.

  • And so it's generally thought that you would lowercase these.

  • And normalization is, a lot of Unicode characters with accents

  • can be represented in different ways.

  • And so normalization basically normalizes that text

  • so it's represented in a single way.

  • And again, so you don't have the same word

  • multiple times in your vocabulary, which

  • would confuse your model, as well as making

  • your vocabulary larger.

  • So we provide case_fold, which is just an aggressive version

  • of to lower.

  • What it does, is it lower-cases characters.

  • It also works with non-Latin characters,

  • accented characters.

  • It doesn't mess up non-letters, so it keeps them as is.

  • And it does NFKC folding and normalization,

  • so I'll talk a little bit more about that.

  • So we do that in our first step.

  • I have examples of what this would look like.

  • So in this example, it really is just lowercasing our "I" in "It's".

  • And then BERT actually normalized to NFD,

  • and because case_fold does NFKC, we're

  • going to normalize to that next.

  • You know, I won't go over this.

  • Just know, again, that letters have many different forms,

  • so it's good to have a single normalization,

  • so when you're working with international characters

  • they're not represented in different ways.

  • So here we are, we just normalized to NFD.
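
A small sketch of those two normalization steps (the sample sentence is illustrative):

    import tensorflow_text as text

    sentences = ["It's a trap!"]
    folded = text.case_fold_utf8(sentences)          # aggressive lowercasing plus NFKC folding
    normalized = text.normalize_utf8(folded, "NFD")  # the normalization form BERT used
    # folded -> [b"it's a trap!"]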

  • Now we're going to do some basic tokenization.

  • We'll split on Unicode scripts, tokenize our text.

  • And then what you might notice here

  • is, while our sentence, it's a trap, has been tokenized,

  • the Chinese characters have not.

  • And that's because it's a single script

  • throughout that whole sentence, and there

  • are no spaces or any other method of defining separations

  • and words.

  • So what we want to do is, we want to split that up.

  • So this is kind of where a lot of code comes in.

  • You can follow along, and I think the main point is just

  • to know that these are things we've thought about,

  • and if you run across it, there's

  • ways to work around this.

  • I've prepared you, or tried to.

  • It's simple ops, when we step through it, you'll see.

  • So first, we just get your codepoints, or--

  • sorry.

  • Yes, we get codepoints of the characters,

  • and then we just get script IDs of those characters.

  • So you can see that the first sentence is all

  • script 17, which is Han script, so it's Chinese.

  • And then our Latin characters are 25,

  • and emoji and punctuation is 0.

  • And we can just apply math.equal like you

  • can on our Ragged Tensor.

  • It gives you-- and we're just checking if it's Han script,

  • so we have true, and then we use the slice notation to just grab

  • the first character, because we know

  • they're all the same already from our Unicode script.
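
A sketch of that script check; the Chinese string is just illustrative sample text:

    import tensorflow as tf
    import tensorflow_text as text

    tokens = text.UnicodeScriptTokenizer().tokenize(["It's a trap!", "你好吗"])
    codepoints = tf.strings.unicode_decode(tokens, "UTF-8")
    scripts = tf.strings.unicode_script(codepoints)
    # In ICU's numbering, Han script is 17, Latin is 25, common/punctuation is 0.
    first_char_script = scripts[:, :, :1]       # slice out each token's first character
    is_han = tf.math.equal(first_char_script, 17)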

  • We also want to check for emoji.

  • In TensorFlow Text, we provide a function, wordshape,

  • with which you can ask basically different questions about words.

  • It's basically like different regular expressions

  • that you want to ask, so here we're asking,

  • does this text have any emoji?

  • Other ones, is there any punctuation?

  • Are there any numbers?

  • Is my string all numbers?

  • And so these are things you might want to find out about,

  • and we provide you a method to do that.
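
A small sketch of that call; the WordShape attribute name used here (HAS_EMOJI) is assumed from the library's WordShape enum, so check the docs for the full list of shapes:

    import tensorflow_text as text

    words = ["It's", "a", "trap!", "🙂"]
    has_emoji = text.wordshape(words, text.WordShape.HAS_EMOJI)
    # e.g. [False, False, False, True]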

  • And so here we just OR the two conditions together

  • to say whether we should split or not.

  • It works with Ragged, and then we

  • go ahead and split everything into characters

  • so that when we do our where clause in our conditional,

  • if we should split or not--

  • if we should split, we grab it from the characters

  • that we've already split.

  • If not, we just grab it from our tokens

  • that we used when we tokenized.

  • And here we just do a little reformatting

  • of how the shape looks.

  • So once we've done that, we can finally Wordpiece tokenize,

  • we provide it with our vocab table.

  • We split it up into sub-words, and we have an extra dimension,

  • so we just get rid of that with merge_dims.

  • All right.

  • We made it through that, it wasn't too bad.

  • And so, we apply this just as we did before.

  • We have a data set we've created with tf.data,

  • we map our preprocessing across from that.

  • Here, we can grab a classifier BERT model

  • from the official BERT models, and just

  • train on that classifier.

  • So I know that was a lot to go through.

  • Hopefully you followed along, but the main thing to know

  • is that as TF.Text, we're looking to basically bring

  • in all that preprocessing inside the graph

  • so you don't have a problem.

  • You don't have to worry about training skew,

  • you can just write your TensorFlow and train.

  • So we do that by giving you what I consider a superior data

  • structure for sequence data, as well

  • as text data through Ragged Tensors and the APIs

  • that are required for this preprocessing.

  • Again, you can install it with pip install tensorflow_text.

  • And, thank you.

  • Here's some links, that's our GitHub.

  • If there is anything that you think

  • is missing that we should add, feel free to add an issue.

  • And we also have a Colab tutorial

  • on tensorflow.org that you should check out,

  • and it'll walk through some of this

  • more slowly through the Colab for you.

  • Thanks.

  • [APPLAUSE]

  • ASHLEY: You still have seven minutes, do you want to do Q&A?

  • ROBBY NEALE: Oh, yeah.

  • OK.

  • ASHLEY: OK, so we still have about seven minutes left,

  • so we can open this up for Q&A. There

  • are some microphones on either side,

  • but I can also help provide those if needed.

  • ROBBY NEALE: Do you want to go to the microphone,

  • or grab these mics?

  • OK.

  • Go ahead.

  • AUDIENCE: Hi.

  • Very nice to see that you have all this support.

  • Just a quick question, can TF.Text handle Japanese text?

  • It's a mixture of Hiragana, Katakana, Kanji,

  • Romaji all thrown in.

  • Really?

  • ROBBY NEALE: Yeah.

  • So, like in this previous example

  • where we go through the characters,

  • a lot of the core Unicode support, we've added to core TensorFlow.

  • But I don't know where we are in here.

  • So when we're searching for the scripts,

  • that uses ICU, which is like the open source Unicode library's scripts.

  • And so you can just as well grab that Kanji,

  • and they have different script tokens there.

  • AUDIENCE: Thank you.

  • ROBBY NEALE: Sure.

  • We'll just switch sides, back and forth.

  • Here you go.

  • AUDIENCE: Thanks for the information.

  • For inferencing, do you send in text,

  • or do you have to put it in a tensor and send it?

  • ROBBY NEALE: Yeah, no.

  • You can send text.

  • So at inference time, in here, the training,

  • we use this preprocessing function.

  • And so you can use that same preprocessing function.

  • When you save your model,

  • you give it a serving input function

  • that basically does preprocessing on your input.

  • And so if you send in those full string sentences,

  • you can use the same function or a variation

  • of that in that input function, and so it should process.

  • AUDIENCE: Thank you.

  • ROBBY NEALE: Sure.

  • Back over here.

  • AUDIENCE: Thank you very much.

  • My question kind of relates to his question.

  • So what's the advantages of applying it

  • with the map versus having a layer that does it?

  • Because you could, even with a Lambda layer,

  • or with the new RFC for preprocessing layer,

  • have a layer that does it.

  • ROBBY NEALE: Oh, a layer that does the preprocessing?

  • AUDIENCE: Yeah, just supplies that function,

  • and then it's saved as a checkpointable part

  • of the model.

  • ROBBY NEALE: Yeah, no.

  • You could certainly do some of this with layers.

  • Actually, you know, we're looking

  • at what layers we should provide,

  • and someone on our team is helping Keras

  • and building out their preprocessing layer, which

  • is basic.

  • And if there's added functionality

  • that we find that people need, we'll

  • supply it in our TensorFlow text library.

  • So it's really up to you as someone

  • who wants to build a model how you want to apply those.

  • AUDIENCE: Thank you for the talk.

  • ROBBY NEALE: Sure.

  • AUDIENCE: Two quick questions.

  • The first one is, do the tokenizers

  • you provide also have a decoding function,

  • so that you go from the tokens, from the integers

  • to the sequence of text?

  • ROBBY NEALE: From the integers--

  • Yeah.

  • There's TF strings, Unicode, decode.

  • Is that what you're talking about?

  • From codepoints, or are you--

  • AUDIENCE: I'm talking about, so for instance,

  • if you decode them in, for instance,

  • we have a BERT vocabulary, then you

  • will have all these additional characters there,

  • And then you want to concatenate them

  • into again, a sequence of text that's proper text, right?

  • ROBBY NEALE: Yeah.

  • So I think what you're asking is,

  • so if you send your word through a BERT model

  • and you get a vector representation,

  • if you can translate that vector representation back to text?

  • AUDIENCE: Yeah, because that's the case when you,

  • for instance, generate text.

  • And so, then you may want to map that text back,

  • the sequence of generated tokens back into a string of text.

  • ROBBY NEALE: Yeah.

  • I mean, this is more along the lines

  • of, I think you want an encoder, decoder model.

  • There's models that do this.

  • It's not something that we provide inside the library.

  • AUDIENCE: Well, you can take it-- fine.

  • It's a slightly different question.

  • The second question is why this is not in TensorFlow.

  • ROBBY NEALE: Yeah.

  • So with 2.0, I mean, you might know that TensorFlow has gotten

  • rid of contrib, and it kind of,

  • for the core team, has gotten unwieldy.

  • It's too much to handle, the tests are running too long,

  • and it's really, it's too much for one team to maintain.

  • And so I think we'll see more kind of modules

  • like TensorFlow.text that are focused on one

  • particular area like this.

  • As a team, we want to make it easier across the board,

  • so a lot of the stuff we've done with Ragged Tensors

  • and some of the string ops are actually in core TensorFlow.

  • But for some of these things that are outside the scope,

  • like ngrams and tokenization, it's just a separate module.

  • AUDIENCE: Thank you.

  • ROBBY NEALE: Sure.

  • AUDIENCE: Hi.

  • ROBBY NEALE: Go ahead.

  • AUDIENCE: So since TF.text can be incorporated

  • into the TensorFlow graph, is it intended

  • that you actually build a model with this preprocessing step?

  • And if true, are there performance implications

  • in TensorFlow Serving?

  • If there are, has that been measured?

  • ROBBY NEALE: Yeah.

  • There's definitely some performance.

  • I think, you know, it's done at the input level.

  • I'm just going to skip ahead to right here.

  • So this is actually a problem that tf.data is looking at,

  • as far as consuming this data and then parallelizing--

  • I said this wrong earlier today too.

  • Parallelizing these, like, input functions.

  • And so, if actually your model is on GPU, or TPUs,

  • the input's parallelized, and then you're

  • feeding as much data as possible.

  • So this is like something you might worry about and look at,

  • but it's also what a lot of other people are looking at.

  • AUDIENCE: OK.

  • And then, yeah.

  • I guess, if it's part of the TensorFlow graph,

  • and TensorFlow serving, how are the nodes

  • allocated and computed, right?

  • Like it's preprocessing on the CPU, or--

  • ROBBY NEALE: Yeah.

  • Most of this is done on the CPU.

  • AUDIENCE: OK.

  • ROBBY NEALE: I'd say all of it.

  • Yeah.

  • AUDIENCE: Can I just quickly ask, are you compatible

  • with TF 2?

  • Because I just pip installed TensorFlow.text,

  • and it uninstalled TensorFlow 2, and installed 1.14.

  • ROBBY NEALE: So like I said, the versions need to match.

  • If you just do pip install tensorflow_text equals

  • equals 2.0.0, which, I think, maybe why

  • it did that is because that version is actually a release

  • candidate.

  • So just add rc0 to the version.

  • It'll reinstall TensorFlow 2 for you.

  • AUDIENCE: OK.

  • ROBBY NEALE: So, yeah.

  • Sure.

  • Last question.

  • AUDIENCE: Yeah.

  • First of all, I'd like to say that this is really cool.

  • Second is, does TF.text integrate with other NLP

  • libraries, such as spaCy or anything in that area?

  • Just out of curiosity.

  • ROBBY NEALE: No, our focus is really just on TensorFlow

  • right now.
