  • ALEX PASSOS: Hi, my name is Alex, and I'm here again,

  • this time to talk about the TensorFlow eager execution

  • runtime.

  • This is a very broad topic and there

  • are lots and lots of things we could cover,

  • so I'm going to lightly graze many, many different parts

  • of our code base.

  • I'll give you a lot of function names

  • and file names and things like that

  • that you can use to familiarize yourself with something,

  • but if you're in the room right now, by all means,

  • ask questions.

  • Like, this is-- there's some buffer time at the end

  • to account for variability here.

  • And I think the more we can maximize shared understanding

  • of this stuff, the better.

  • So the way I thought we could go about this

  • is to do a very, very deep dive on what

  • actually happens in TensorFlow, starting from TF2--

  • when you type a very simple line of code, in this case

  • a tf.nn.relu of some Python numbers.

  • And I think if you were to start doing this, probably

  • the first thing you'd do is grep the TensorFlow code

  • base to find where we would define ReLU.

  • And if you do that, you will find

  • that we have some definitions of ReLU in Keras,

  • but you won't find a single definition of ReLU itself

  • in core TensorFlow, and that might be

  • a little surprising at first.

  • It might put a damper on this whole,

  • let's find out what actually happens when we run ReLU

  • business, but the reason is that ReLU is implemented

  • entirely in C++ and we didn't need to put

  • a complicated Python API around it.

  • We can just generate the Python code to call ReLU.

  • So the way it's actually defined,

  • it's defined using the same mechanism

  • we use to register all the ops in TensorFlow.

  • So for every core operation in TensorFlow

  • that's visible to the runtime, we

  • have a registration that looks like this--

  • REGISTER_OP.

  • It takes a name and you say how many inputs, how many outputs,

  • what attributes it has.

  • If you want to know more about attributes-- what things

  • are allowed in there-- they are how

  • we can make our ops polymorphic and have the same operation

  • have different types of outputs, so different numbers of outputs

  • and things like that.

  • There is a lot of documentation about this in TensorFlow.org,

  • if you search for how to define a new op.

  • And another interesting thing in there

  • is that we also register a shape_inference function.

  • ReLU, thankfully, is one of the simplest

  • ops we have-- it just has one input, one output,

  • they have the same shape.

  • So we can use a pre-built shape_inference function

  • that just says the shape does not change.

  • Other ops will have vastly more complicated

  • shape_inference functions.
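
For reference, a registration of this kind looks roughly like the sketch below. It is modeled on the real one in tensorflow/core/ops/nn_ops.cc, but treat the exact dtype list in the attr as illustrative.

    #include "tensorflow/core/framework/common_shape_fns.h"
    #include "tensorflow/core/framework/op.h"

    REGISTER_OP("Relu")
        .Input("features: T")                     // one input
        .Output("activations: T")                 // one output
        .Attr("T: {half, float, double, int32}")  // illustrative dtype list
        // Output shape equals input shape, so reuse the stock shape function.
        .SetShapeFn(tensorflow::shape_inference::UnchangedShape);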

  • And the nice thing is that we can

  • run these functions offline while we're building graphs

  • without actually having the values of any tensors

  • and still be able to prove things

  • about the shapes of intermediate tensors and outputs

  • of your computation.

  • This is maybe the best tool we have for catching bugs now.

  • So if you want to look at the shape_inference code,

  • that's where you'd hook into.

  • Now that we have that registration,

  • we run some complicated code that generates Python code

  • to actually call ReLU.

  • And if you look in bazel-genfiles,

  • you will find a file named gen_nn_ops.py, and this file has

  • the actual def relu that we call.

  • And as you can see, it's not pretty

  • and there's a lot of stuff going on in there.

  • The first line deals with dispatching

  • so that we can define ReLU not just for normal tensors,

  • but also optionally for sparse tensors

  • and ragged tensors and other composite types.

  • The second line has tf_export and what this does

  • is define the TensorFlow public API.

  • Every symbol that you get when you are using TensorFlow as

  • tf.something is defined somewhere

  • by a tf_export decorator like this one.

  • There will be a future video on how exactly this works

  • and why we do things this way instead

  • of relying on Python's normal, you know,

  • namespacing mechanism.

  • But you can probably guess that it's because TensorFlow

  • is very complicated.

  • But essentially, you'll see this,

  • and this generated code for ReLU has a bunch of cases in it,

  • roughly four.

  • You have an eager fast path, an eager slow path,

  • you have a graph mode path, and kind of a side hook

  • for the symbolic execution.

  • But here, let's focus on the eager paths.

  • In the first one, the first thing

  • that we're actually doing here is

  • we're checking to see if we're in eager mode or not.

  • And to do that, we look at this context thing.

  • This context thing is part of the core of the TensorFlow v2

  • runtime.

  • It's the moral equivalent to the session,

  • but it's longer lived than the session

  • and represents more things.

  • So what is it?

  • From Python, the context is this class

  • that's defined in a file called context.py.

  • And it's a collection of a lot of things

  • that your Python program needs to be aware of to connect

  • to the TensorFlow runtime.

  • It stores things like, am I in eager mode or in graph mode?

  • Or if someone used a with tf.device decorator, what

  • device am I supposed to be executing code in?

  • And it stores things like, what's

  • your name scope and many other things.

  • And all of this information that the context

  • stores, in general--

  • the things that can change during the execution

  • of a program--

  • they're all stored in ThreadLocal stacks.

  • Usually stacks, because we have these nested things

  • like with tf.device, with tf.device, with tf.device,

  • so you'd like to be able to pop the stack

  • to go back to where you were.

  • And ThreadLocal because it's very important to us

  • that a TensorFlow runtime itself be thread agnostic,

  • so that if you write two threads and one is doing

  • a reinforcement learning learner and the other's doing

  • an agent that's talking to some game, when the agent wants

  • to use its GPU, it shouldn't necessarily make the learner

  • use the GPU, and vice versa.

  • Providing some kind of isolation between the threads

  • is what we felt was the right way,

  • so that at least each single thread

  • can feel like it's its own single-threaded Python program.

  • We use this a lot in distribution strategies,

  • like MirroredStrategy uses a lot of threads under the hood,

  • so it's really important that things are thread-safe.
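
To make the thread-local stack idea concrete, here is a small self-contained sketch (standard C++ only, not TensorFlow code) of how a per-thread device stack behaves: each thread pushes and pops its own scopes, so one thread's with tf.device never leaks into another.

    #include <string>
    #include <vector>

    // Each thread gets its own stack of device scopes.
    thread_local std::vector<std::string> device_stack;

    // RAII helper that mimics entering and leaving a `with tf.device(...)` block.
    struct DeviceScope {
      explicit DeviceScope(std::string device) {
        device_stack.push_back(std::move(device));
      }
      ~DeviceScope() { device_stack.pop_back(); }
    };

    // The device the current thread should place ops on, if any.
    inline std::string CurrentDevice() {
      return device_stack.empty() ? "" : device_stack.back();
    }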

  • The Python context, essentially, is mostly

  • a wrapper around the C++ context,

  • which is available through the TensorFlow C API.

  • And this is the core thing.

  • It has a bunch more methods than just these--

  • like, you can do a lot more things

  • than just listing devices.

  • One thing I'd like to call out is that right now,

  • there are some things that are done in Python, like storing

  • whether you're in eager mode or graph mode,

  • and some things that are done in C++.

  • And it's not set in stone-- the set of things done in Python and the set

  • of things done in C++ are likely to change.

  • I think as TensorFlow evolves, more and more things should

  • migrate from the Python context to the C++ context,

  • which will make things more language agnostic,

  • but also faster.

  • And you know, if everything in the context was in C++,

  • then all the generated Python code could have just been C++

  • code, and we'd be able to get out

  • of the overhead of executing Python

  • much sooner and remove performance problems

  • in our APIs.

  • So once you know you're in eager mode,

  • we try to do this fast path execution.

  • In this, the fast path is some complicated C code

  • that mostly does the same things that the fallback case is

  • trying to do.

  • So I don't think it's necessarily worth reading that.

  • I would rather look at the simpler

  • code in the fallback path.

  • And it's here.

  • So this is where we actually implement the ReLU function.

  • And what are the interesting things here?

  • First, we have this function called args_to_matching_eager,

  • and what it does is it takes a bunch of things

  • that can be converted to tensors

  • and converts them all to tensors of the same dtype.

  • And in the case of ReLU, there is only one.

  • But in the case of any other ops, like Add or MatMul,

  • they take multiple inputs, some of which might be tensors,

  • others might be numpy.arrays or Python lists or variables

  • or other objects that you can convert to a tensor,

  • but they're not necessarily tensors.

  • And this is the thing that's responsible for canonicalizing

  • everything.

  • Then once we have--

  • but here you might want to stop me a little and ask,

  • what is a tensor at the level of a TensorFlow runtime?

  • And the eager tensor class is a thing

  • that's half implemented in Python

  • and half implemented in the Python C API,

  • but it's a relatively thin wrapper around this thing

  • that we call TensorHandle that you can see in our TensorFlow C

  • API.

  • And this TensorHandle, it represents a potentially yet

  • to be computed tensor which is going to live on some device.

  • And we know what device it's going

  • to live on, because we know what device we executed

  • the operation that generates the tensor on.

  • And there are roughly three things

  • you can do to a TensorHandle, really.

  • You can ask about its shape and dtype and stuff-- that's one.

  • You can give it to execute--

  • that's another one.

  • And you can copy it to a device.

  • Also, I guess there are four things,

  • and the fourth thing is you can ask, hey, what is its value?

  • And when you want to force the evaluation of its value,

  • the TensorFlow runtime might have

  • to pause until it can give you the result.

  • You might need to copy the tensor from some remote device

  • and do all sorts of other things, but once you're done,

  • you get a TF_Tensor, which is essentially

  • just a pointer to some data.

  • But every other operation that is not

  • looking at the value of the tensor

  • is something that the runtime is free to delay, reorder,

  • as long as it respects the intended

  • semantics of your program.

  • And this is very important, because when

  • you're working with things like GPUs and TPUs,

  • you'd like to be able to dispatch operations that

  • are running the hardware as quickly as you can

  • in a way that's asynchronous.

  • If the Python thread can race ahead of the GPU

  • as it's executing operations, we can fully

  • utilize the GPU hardware and get the maximum performance.

  • And even in the eager runtime, we

  • can do this if the GPU kernels are sufficiently heavy.

  • There's a few cases in which we need to copy tensors,

  • even if they're local on the CPU.

  • And some of those cases are going away, like currently,

  • we copy string tensors every time we try to look

  • at their values because TensorFlow's internal string

  • representation is a complicated C++ class and we'd like to have

  • an API stable C type.

  • We're actually changing our internal string

  • representation-- there's an RFC you can read about it--

  • that will make the internal and external representations

  • both be an API stable C type.

  • But internally, what is a TensorHandle?

  • And this is, in fact, very, very similar

  • to the first implementation of TensorHandle.

  • Now it's like hundreds of lines of code spread

  • across many files, but all that you really

  • need in a TensorHandle is to know what device

  • it's in and what data you have.

  • And the data can either be some concrete tensor, a value that

  • has already been computed, or a future that

  • might be computed later because it's remote or asynchronous

  • or on some other device.

  • But the core of it is this--

  • just, this is the future that we hand around

  • in the representation.

  • The current code is much more complicated,

  • and you might want to look at it to see why it's complicated,

  • but the core idea is there.
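
As a rough mental model only (this is not the real TensorHandle code), the idea boils down to a device name plus data that is either already there or will arrive later:

    #include <future>
    #include <string>
    #include <vector>

    struct FakeTensor {            // stand-in for a real, concrete tensor
      std::vector<float> values;
    };

    // A handle is "where it will live" plus "a future for its value".
    struct TensorHandleSketch {
      std::string device;
      std::shared_future<FakeTensor> data;

      // Forcing the value may block until the runtime has produced it.
      const FakeTensor& Resolve() const { return data.get(); }
    };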

  • So popping the stack.

  • You now know what a tensor is, and you've probably figured out

  • that converting that list of Python integers to a tensor is

  • not terribly hard-- there's C code that does that--

  • and now it's time for us to execute ReLU.

  • Great.

  • Here is one thing to note about how the TensorFlow

  • eager runtime works, which is that this line that

  • is selected there is not a noncontroversial choice.

  • Overall, there are two ways we could have gone about this.

  • We could have had a closed domain API, in which TensorFlow

  • would export a symbol called ReLU,

  • another called MatMul, another called Conv, et cetera.

  • Or we could have this open domain API

  • that we have here, where we have a symbol

  • called execute that takes the name of an operation ReLU

  • and a bunch of metadata about it.

  • There are advantages and disadvantages to both cases.

  • In general, the closed domain case,

  • where you just have a single endpoint

  • in your API for every operation you want to run--

  • that is easier to make fast.

  • However, it's a little tricky in TensorFlow

  • because the current--

  • the pre-existing graph runtime has a few layers

  • of indirection between a node in the graph and the actual kernel

  • that it executes.

  • And indeed, between TensorFlow versions,

  • we can, without breaking graph def compatibility,

  • replace some kernels.

  • Some things that were handled by a single kernel

  • now have to be handled by multiple kernels.

  • So to preserve this layer of indirection,

  • we felt like it was more natural to have this API that

  • is an open domain API.

  • However, as you can see, just by the fact

  • that there's a string in there, executing

  • this can only be so fast, because we

  • need to somehow take this string and these attributes

  • and some properties about these inputs

  • and turn that into a kernel.

  • And that means that we can definitely not

  • execute kernels any faster than it takes

  • to at least hash that string.

  • So you know, there are trade-offs here,

  • but we felt that preserving the flexibility

  • that you have in graph mode was the most important thing here.

  • So how do you actually use execute?

  • When you call this line in Python, what actually happens?

  • And execute's something that is defined in the TensorFlow C

  • API, and to use it, you do something like this.

  • You first create a status so that you

  • can find out if things failed, and then you make a new op.

  • You add inputs, you set attributes,

  • you allocate some memory to put pointers to the return values,

  • and then you call execute and finally, you delete that all.

  • So this is fairly straightforward,

  • and if you're familiar with the TensorFlow C API for building

  • graphs, you'll see that this is very similar to that C API.
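
In code, the sequence described above looks something like this. These are real functions from tensorflow/c/eager/c_api.h; the error checking between calls is omitted here, and ctx and input are assumed to already exist.

    #include "tensorflow/c/eager/c_api.h"

    // Run Relu on an existing float tensor handle through the eager C API.
    TFE_TensorHandle* RunRelu(TFE_Context* ctx, TFE_TensorHandle* input) {
      TF_Status* status = TF_NewStatus();           // somewhere to report errors
      TFE_Op* op = TFE_NewOp(ctx, "Relu", status);  // op is named by a string
      TFE_OpAddInput(op, input, status);            // attach the input
      TFE_OpSetAttrType(op, "T", TF_FLOAT);         // set the dtype attribute
      TFE_TensorHandle* retvals[1];                 // room for the output handle
      int num_retvals = 1;
      TFE_Execute(op, retvals, &num_retvals, status);  // possibly asynchronous
      TFE_DeleteOp(op);
      TF_DeleteStatus(status);
      return retvals[0];
    }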

  • There's-- this is about as good as you could possibly get

  • for Python code in an open domain setup,

  • but it's a little sad here that when you build a TFE_Op,

  • you need to add the inputs to that op.

  • This means that if you're executing

  • an op in a tight loop, you either

  • have to have the same inputs on every iteration of the loop

  • or you have to allocate a whole other TFE_Op.

  • For Python, we really don't have a way of making this better,

  • but for other languages like Swift or languages

  • that have access to our compiler,

  • we should be able to cache the static bits

  • of making a TFE_Op that don't really change,

  • like all your MatMuls are the same,

  • and separate that from the dynamic bits, which

  • are the inputs that should actually change.

  • And if you do this, you can actually--

  • we could make in principle this open domain

  • approach as fast as a closed domain approach.

  • And this is maybe a minor refactoring

  • that we should do at some point.

  • So what does execute do?

  • If you go through a few layers of APIs,

  • you end up on this function called EagerExecute,

  • and it has a lot of things here.

  • The first interesting one is this MaybeUpdateOpDevice function,

  • which is, you might call it, the placer,

  • where we get to decide where we're going

  • to execute each operation.

  • This will have some complicated heuristics.

  • In general, you can think of it as, if you're

  • operating on a resource tensor, we'll

  • run your operation on the device that

  • has that resource because any other thing will fail.

  • Otherwise, if you have a tf.device annotation somewhere,

  • we'll run it there.

  • Otherwise, if you don't have any of those things,

  • we'll see what devices are available to execute

  • this operation and run on whatever device

  • we think is going to be the fastest, which

  • is how TensorFlow gets away with using a GPU for you

  • even if you don't specify with tf.device GPU in there.
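
A self-contained sketch of that decision order (not the real placement code, which has many more heuristics) might look like this:

    #include <string>
    #include <vector>

    struct OpPlacementInfo {
      std::string resource_input_device;    // device of a resource input, if any
      std::string requested_device;         // from `with tf.device(...)`, if any
      std::vector<std::string> supported;   // devices with a kernel, best first
    };

    // Resource inputs pin the op; an explicit device request comes next;
    // otherwise pick the preferred device that has a registered kernel.
    inline std::string PickDevice(const OpPlacementInfo& op) {
      if (!op.resource_input_device.empty()) return op.resource_input_device;
      if (!op.requested_device.empty()) return op.requested_device;
      return op.supported.empty() ? "CPU:0" : op.supported.front();
    }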

  • And then you have some forks in there about are we local,

  • are we remote?

  • And once you do know that you're in the local case, what we want

  • to do is very quickly do that string processing

  • that we needed to do to find what

  • kernel we should be executing.

  • And there is some fast code that takes the attributes

  • and the name of an op and gives you

  • a cache key that is looked up in a kernel cache

  • in the context, where we store the kernel.

  • And here you might think something's a little funny,

  • because usually you think of operations as functions,

  • not as classes, but clearly there

  • is this KernelAndDevice class in there,

  • so we probably have an object.

  • And the reason is that for many types

  • of kernels that we want to execute,

  • especially things involving GPUs but also some stateful kernels,

  • you want to keep some state in that object.

  • Ideally, that state will be managed by the runtime

  • if you have the resource managers,

  • but that doesn't always happen now.

  • And once you have a kernel, the kernel

  • tells us on what devices it wants its inputs on.

  • You would think that the kernel would want its inputs

  • all in a device that is executing,

  • but that turns out to be too narrow a view.

  • For some kernels, especially GPU kernels,

  • you might want some inputs on the host

  • CPU that's attached to the GPU.

  • And the reason is that imagine you're

  • a GPU kernel for generating a big matrix of random numbers.

  • You'd like to know how many numbers you're

  • going to generate so that you can run the memory

  • allocation before you enqueue your CUDA kernel.

  • So if you were to do that-- if the shape of the random number

  • vector you're going to generate, if that's in the GPU,

  • you'd need to fetch it back to the CPU to do that allocation.

  • That would be terribly slow.

  • So instead TensorFlow says, I expect that input

  • to be on the local CPU, and this is the function

  • that validates this.

  • But in this case, if you're also trying

  • to run a large convolution and one of the inputs

  • is in the CPU, this maybe copy will move that input

  • to the GPU for you, which is faster than just running

  • your convolution on the CPU.

  • And then finally, we get to decide whether we're

  • in sync or async mode, where we first create this node that

  • represents all the computation that has to happen

  • to execute this kernel.

  • If we're async mode, we throw this in a queue and return

  • control immediately.

  • If we're in sync mode, we run it now.

  • This async/sync here is complicated

  • because it's another layer of asynchrony that's

  • separate from the fact that our GPU runtime is itself

  • asynchronous.

  • This is kind of a patch to make the TensorFlow CPU

  • runtime, which is currently synchronous,

  • act asynchronously to try to get a little more performance

  • in the eager mode.

  • It's a little sad, because you lose your error messages

  • once you get very asynchronous there,

  • and we currently do not run shape_inference

  • in this asynchronous mode.

  • I think as we rework the TensorFlow runtime, which

  • the team has a large effort to do now,

  • we have a chance to fix this and have a single code

  • path for synchronous and asynchronous.

  • But for now, we have this patch.

  • And then finally, now we have the kernel.

  • We've got to call it.

  • That's easy, right?

  • So to call a kernel, we have to make an OpKernel context,

  • and to do that, you need to fill this big struct, which

  • I put here.

  • Which you can clearly read on this slide,

  • because it definitely fits in a very large and readable

  • font.

  • So we don't do that.

  • This is sadly something that-- the original TensorFlow

  • API for kernels had only one caller, which was a TensorFlow

  • executor, so it was very easy to just add parameters and make

  • the calling convention harder and harder,

  • because there was only one place to fix.

  • We're now trying to trim this back and simplify it,

  • so it will likely get better.

  • But for the eager runtime, we have this class KernelAndDevice

  • that knows how to call a kernel, requiring a lot fewer things

  • about it.

  • Mostly all it needs is the inputs--

  • a place for you to populate with outputs--

  • and some information in case you want to profile--

  • things about how long it takes to execute each node

  • or collect step stats, or what graphs you're

  • using if you're executing a function, and things like that.

  • So now that we have this, we can run the kernel.

  • So what the kernels look like--

  • ReLU happens to have one of the simpler kernels

  • we have in TensorFlow.

  • It's a UnaryElementWiseOp.

  • We have a base class for this that

  • handles a lot of the logic around memory

  • allocation, buffer reuse, so that tf.relu by default

  • will reuse its input buffer if the TensorFlow runtime knows

  • that no other op yet to be executed is going to need this.

  • But once all the boilerplate is dealt with,

  • all that this kernel has to do is execute the functor.
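
The kernel itself is only a few lines once the base class handles the boilerplate. This is a paraphrase of the real ReluOp in tensorflow/core/kernels/relu_op.h (the class name here is a placeholder, and the real file has a bit more plumbing):

    #include "tensorflow/core/framework/numeric_op.h"
    #include "tensorflow/core/kernels/relu_op_functor.h"

    namespace tensorflow {

    // Sketch: the element-wise base class handles output allocation and buffer
    // reuse; Operate just applies the device/dtype-specialized functor.
    template <typename Device, typename T>
    class ReluOpSketch : public UnaryElementWiseOp<T, ReluOpSketch<Device, T>> {
     public:
      using UnaryElementWiseOp<T, ReluOpSketch<Device, T>>::UnaryElementWiseOp;

      void Operate(OpKernelContext* context, const Tensor& input, Tensor* output) {
        functor::Relu<Device, T>()(context->eigen_device<Device>(),
                                   input.flat<T>(), output->flat<T>());
      }
    };

    }  // namespace tensorflow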

  • And this is another place where you'd be surprised

  • that we use an object where you think we should use a function,

  • because in principle, relu is just a function.

  • It doesn't keep any state.

  • There should be no reason for us to make an object for it.

  • Except C++ does not let you declare a templated function

  • in the header file, but define it in the C++ file.

  • And this is something very useful for us

  • because as you can see, device is a parameter in there,

  • and one of those devices is GPU devices.

  • And for GPU devices, we'd like to put the function in the file

  • that we're going to compile with the CUDA compiler.

  • And we would like to not compile our entire code

  • base with a CUDA compiler.

  • So being able to define this functor in a single place,

  • where we can generate a whole specialization of this class--

  • without any access to the CUDA compiler,

  • and have a file on the side that's

  • just going to fill in this implementation

  • after running the CUDA compiler.

  • It's very useful.

  • And as you can also see here, TensorFlow

  • kernels-- they tend to be highly templated.

  • Most are templated like this one, based on the device

  • that it's going to execute on and on the dtypes.

  • So they will generate fast, specialized code

  • for every core numerical type supported by TensorFlow, which

  • is an incentive to keeping the set of core numerical types

  • supported by TensorFlow relatively small,

  • as otherwise their binary size would grow.

  • But this has the nice side effect

  • that the code generated is very fast.

  • And one of the things that makes us generate

  • very fast code for this, which you will see if you look

  • into the implementation of the functors,

  • is that we can use Eigen to generate this code for us.

  • So the ReLU functor in particular

  • is very easy to write, because it's

  • just a coefficient-wise max between the tensor you

  • have as your input and 0.

  • And Eigen turns out to be a very useful tool to write this.

  • It lets us write this code once.

  • It will generate specializations for every dtype

  • we are interested in and also for CPUs and GPUs.
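
The functor body is essentially one Eigen expression. This is a close paraphrase of the real Relu functor in tensorflow/core/kernels/relu_op_functor.h, with the struct renamed to mark it as a sketch:

    #include "tensorflow/core/framework/tensor_types.h"

    namespace tensorflow {
    namespace functor {

    // One templated definition; Eigen evaluates it on whatever device `d` is,
    // and the template gets instantiated per dtype and per device.
    template <typename Device, typename T>
    struct ReluSketch {
      void operator()(const Device& d, typename TTypes<T>::ConstTensor features,
                      typename TTypes<T>::Tensor activations) {
        // Coefficient-wise max against zero, evaluated on the given device.
        activations.device(d) = features.cwiseMax(static_cast<T>(0));
      }
    };

    }  // namespace functor
    }  // namespace tensorflow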

  • For this particular operation, you could probably

  • write it in fast assembly language yourself,

  • or SSE intrinsics or something like that,

  • but for more complicated operations,

  • like softmax and others that might

  • have interesting intermediate values that need computing,

  • being able to just have this code be generated

  • for you instead of requiring that you write all the device

  • specific and type specific things, can save a lot of time.

  • Also, Eigen at its core has a very, very fast GEMM,

  • which is the core, like, basic MatMul that is inside most

  • of our very expensive kernels.

  • It ends up being a very large asset

  • in making TensorFlow go fast.

  • So that was it, really.

  • It only took us what, 20 minutes to get through executing ReLU?

  • I think TensorFlow can do it a little bit faster than that.

  • [LAUGHTER]

  • But in general, this is kind of like what the stack of things

  • looks like as you're executing operations eagerly.

  • Of course, if you've been following

  • this for a little bit, you should

  • know that we can do better than executing operations eagerly

  • in TensorFlow.

  • We have tf.function and we have graphs and other things that

  • can get you a lot more performance by instead

  • of going down and up that stack for every single operation,

  • going down once and executing a lot of operations.

  • Also, this lends itself to optimizations and stuff.

  • So how do we run functions?

  • And so-- yes?

  • AUDIENCE: Previously, you mentioned briefly

  • about the async mode.

  • So is that something that is user configurable?

  • Because there's like context async, [INAUDIBLE]..

  • ALEX PASSOS: I don't remember right now if there's

  • a public API to make it user configurable or not,

  • but there is an internal API to make it user configurable.

  • AUDIENCE: I believe that there was-- in enable Eager

  • Execution, you could set it.

  • So I think you could set it in v1,

  • but it might not be exposed directly in v2.

  • ALEX PASSOS: Yes.

  • You-- Taylor is correct.

  • I think it-- I know how to expose it in v1

  • and do not know how to expose it in v2,

  • but there's probably a way.

  • Or maybe there's not a way yet because we're still

  • treating it as experimental.

  • I'm not sure.

  • Regardless, the way it is now, I don't think it's

  • something we should rely on in the long run.

  • So I'd rather be able to iterate on it

  • a little longer until we start recommending it

  • as a way for people to get performance.

  • AUDIENCE: And what is the difference between eager

  • slow and fast path?

  • ALEX PASSOS: It's mostly that the fast path has

  • special-cased some types of tensor conversion

  • into fast C code.

  • So if you pass a list of Python floating point numbers,

  • we can convert that to an eager tensor

  • without hitting any Python code, and that

  • will save you quite some time.

  • OK, so most of what we saw just now

  • for the case of executing a single op

  • also applies to executing a function.

  • So a function itself is an op named PartitionedCall,

  • and you will execute that op--

  • like the tf.function internals will execute that op--

  • just like how we just saw how to execute ReLU.

  • And so the first half, until you get to the KernelAndDevice

  • run bit, is all the same.

  • It's just that that kernel implementation

  • is particularly interesting.

  • Function calls in TensorFlow look relatively simple.

  • We have inputs, we have outputs, we

  • have attributes that tell TensorFlow what types of inputs

  • we have, what types of outputs we have,

  • and we have a function.

  • There are some things in there that

  • seem a little tricky, like there's

  • all sorts of configuration we can pass.

  • I actually forgot what's the difference between config

  • and config_proto in there.

  • But essentially, this is the big entry point

  • to executing functions in TensorFlow.
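
The op registration behind this is roughly the following, modeled on the real one in tensorflow/core/ops/functional_ops.cc; the config, config_proto, and executor_type attrs are string attributes with empty defaults, and the precise list may have grown since.

    #include "tensorflow/core/framework/common_shape_fns.h"
    #include "tensorflow/core/framework/op.h"

    REGISTER_OP("PartitionedCall")
        .Input("args: Tin")              // the function's inputs
        .Output("output: Tout")          // the function's outputs
        .Attr("Tin: list(type) >= 0")    // input dtypes
        .Attr("Tout: list(type) >= 0")   // output dtypes
        .Attr("f: func")                 // the function to call
        .Attr("config: string = ''")
        .Attr("config_proto: string = ''")
        .Attr("executor_type: string = ''")
        .SetShapeFn(tensorflow::shape_inference::UnknownShape);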

  • But what this-- if you go look at the kernel of this op, what

  • you'll find is that it mostly just forwards things

  • to the FunctionLibraryRuntime.

  • And the FunctionLibraryRuntime is this core bit of TensorFlow

  • that knows about functions.

  • It can do things like instantiate and run,

  • pretty much.

  • And also create kernel, which you usually

  • do between instantiate and run, since

  • the FunctionLibraryRuntime also knows how to execute

  • operations that are not functions.

  • So what does instantiate mean?

  • Instantiate mostly runs all the graph

  • optimizations that we might want to run on that function--

  • to take code that you enjoyed writing

  • and turn it into code that the executor will

  • execute very quickly.

  • Most of this processing happens in the

  • ProcessFunctionLibraryRuntime's multi-device Instantiate call,

  • where we run all sorts of graph transformations.

  • If you have the tf-xla bridge enabled,

  • it will run the transformations related to the tf-xla bridge.

  • It will run the TensorFlow placer.

  • What the TensorFlow placer does is it takes a graph--

  • in this case, a function graph--

  • that has devices assigned to some of the nodes,

  • and it spits out another graph that has devices

  • assigned to all of the nodes.

  • It does this by following a very similar algorithm to the one

  • that I described earlier for individual ops.

  • So if you have a resource, it will place that op

  • next to the resource.

  • Otherwise, if you have specified a device,

  • it will respect that device, even if you partially

  • specified a device, it will respect that partial device

  • specification.

  • And finally, we'll group things by colocation group.

  • The TensorFlow graph language allows

  • you to specify colocations, even though these

  • have very non-intuitive consequences, because

  • by colocating a node A of a node B,

  • you can actually move where node B

  • is placed because the placer is not

  • aware of the direction of the colocation arrows.

  • It just groups all the colocated nodes into a bag

  • and finds a device that can execute all ops in there.

  • So this can have very fun consequences,

  • like a bug I helped fix a while back

  • where if you try to speed up your distributed system by always

  • colocating the variable initialization

  • code with the remote device that the variable is on,

  • you can accidentally say, please always colocate

  • the initializers of GPU variables

  • on the GPU, which can be trouble.

  • If some of your initializers have operations that cannot run

  • in the GPU, you now have silently moved your variables

  • to the CPU, which probably is quite a performance

  • degradation.

  • So it's very subtle and in TF2, we're

  • trying to move away from using colocation constraints

  • inside TensorFlow and we're definitely moving away

  • from encouraging people to use colocation constraints

  • outside TensorFlow.

  • I would rather you be more specific about what devices

  • you're using, or even better, use something

  • like a distribution strategy that

  • is aware of all these bugs and colocations

  • and can work around them for you instead of trying to replicate

  • this functionality yourself.

  • And once you have placed the graph,

  • we can run the partitioner, which takes a graph that

  • has nodes for many devices and returns many graphs,

  • all of which have nodes on a single device only.

  • And to do that, if there was any edge that went from one device

  • to another device, that gets replaced with a pair of sends

  • and receives.

  • This is also where we run Grappler and all the function

  • inlining and optimization passes that the last training session

  • with Eugene was covering.

  • But yeah.

  • So this does a lot of heavy lifting

  • and indeed, the PartitionedCall op

  • puts a cache in front of Instantiate

  • to make sure that we don't call it twice

  • for a single function, because otherwise we'd be very slow.

  • And once we've instantiated the function,

  • we can go to the other main method

  • in the FunctionLibraryRuntime and run it.

  • So in general, as you can tell by the name

  • Partition Call for the op, our functions

  • can be on multiple devices.

  • And at this point, at least the code

  • has simplified enough in there--

  • this is actually snippeted from the core runtime,

  • even though the core runtime has a lot more error handling

  • going on--

  • that all we have to do to run a function on multiple devices

  • is to just run N functions,

  • each on a single device,

  • and trust that they all know how to talk to each other

  • to make the sends and receives happen.

  • So there's some thing called a rendezvous,

  • and if you read the TensorFlow code base,

  • you will see lots of references to it, that's

  • responsible for making the sends and receives

  • find each other.

  • We have rendezvous implementations that know how to deal with single-host

  • and with multi-host setups.
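
As a self-contained sketch of the idea (the real tensorflow::Rendezvous is asynchronous, typed, and handles cancellation), a rendezvous is basically a keyed meeting point where Send publishes a value and Recv waits for it:

    #include <condition_variable>
    #include <map>
    #include <mutex>
    #include <string>

    class RendezvousSketch {
     public:
      // Producer side: publish a value under a key that identifies the edge
      // (source device, destination device, tensor name).
      void Send(const std::string& key, float value) {
        std::lock_guard<std::mutex> lock(mu_);
        items_[key] = value;
        cv_.notify_all();
      }

      // Consumer side: block until the matching Send has happened.
      float Recv(const std::string& key) {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [&] { return items_.count(key) > 0; });
        return items_[key];
      }

     private:
      std::mutex mu_;
      std::condition_variable cv_;
      std::map<std::string, float> items_;
    };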

  • And there are lots of tricky and interesting bits

  • in there, like what is

  • the correct lifetime of a rendezvous,

  • how do you shut down a rendezvous once you want

  • to shut down TensorFlow computation because maybe

  • some kernels failed.

  • And you know, some kernels failed

  • and something is blocking on receiving a tensor.

  • They're never going to get that tensor.

  • So we probably need to shut that operation down gracefully.

  • And there's a lot of cancellation related

  • logic in that.

  • But it mostly-- at the level of FunctionLibraryRuntime,

  • you can just run your N functions, one per device,

  • and forget about it.

  • And running a function on a single device

  • mostly consists of bringing up this TensorFlow executor

  • and calling run in it.

  • And you'll see that you have things named RunAsync and done

  • callbacks.

  • In general, we treat all of these things as asynchronous

  • so that we can release the calling thread as quickly as we

  • can so that you can keep on running more computation on it.

  • Especially if you have nested function calls--

  • treating these things asynchronously

  • is quite the performance improvement.

  • And here, I could dig into the TensorFlow executor,

  • but that code is fairly complex and you have a simple core

  • algorithm, but it's really hard to pull it out of there.

  • And I think the main reason why it's

  • hard to pull it out of there is that the executor grew together

  • with the implementation of control flow in TensorFlow,

  • and specifically, the details that we had to implement

  • while_loop kind of, like, obscured

  • a lot of the core functionality of the executor.

  • It now is aware of frames and a lot of other complicated

  • things, and you have multidimensional pending

  • counts.

  • So I'm not going to snippet that code,

  • but I'll say that if you want to read it, go for it.

  • It's very interesting-- like, highly asynchronous,

  • highly parallel interpreter.

  • But I'll just give you some of the highlights

  • of what's happening in there.

  • And its input is a large bag of nodes and there is no output.

  • Anything that you want to get out of it,

  • you get out of it through a send or a receive.

  • I mean, there's-- technically there are outputs,

  • but by far most outputs in the common case of TensorFlow are

  • handled through sends and receives.

  • And the end state for the executor

  • is that all nodes have executed or an error has happened.

  • And inside TensorFlow, the core runtime,

  • we have no error recovery.

  • So any error will shut everything down.

  • This is sometimes unfortunate, because some parts

  • of the TensorFlow's higher level APIs

  • rely on errors for the common path.

  • For example, tf.data raises an error

  • once it reaches the end of a stream, which

  • means that you can't really easily have a single graph that

  • exhausts a single iterator, does some things, and then runs

  • another iterator.

  • Because by the time you've exhausted the first iterator,

  • an error is raised and TensorFlow will shut everything

  • down.

  • There are ways of interacting with tf.data that do not

  • involve using the iterator GetNext op, which can fail,

  • and we use those inside AutoGraph to make it easier

  • for you to write code that can recover from these failures

  • and--

  • well, not recover from these failures.

  • It will see no failures when iterating

  • over multiple iterators.

  • It's quite nice.

  • tf.data has all these cool little combinators

  • like [INAUDIBLE] and reduce and you can thread together

  • like three of those to simulate a while_loop with breaks.

  • But anyway, popping the stack here,

  • the core algorithm of the executor

  • is while there are some nodes that haven't been executed

  • and no errors have happened, execute a viable node,

  • and once that node finishes executing,

  • you mark all of its output tensors as ready.

  • And there's some bookkeeping in there

  • that once you mark a tensor as ready,

  • you look at what ops are going to be made executable

  • by marking the tensor as ready, which

  • marks other nodes as viable.

  • And this just, like, recursively applies.
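
Stripped of control flow, error handling, and threading, that bookkeeping loop looks something like this self-contained sketch (the real executor.cc dispatches ready nodes to a threadpool instead of running them inline):

    #include <cstddef>
    #include <deque>
    #include <functional>
    #include <vector>

    struct Node {
      std::function<void()> run;       // the kernel to execute
      std::vector<size_t> consumers;   // nodes that consume our outputs
    };

    // `pending[i]` starts as the number of inputs node i is still waiting for.
    inline void RunGraph(std::vector<Node>& nodes, std::vector<int> pending) {
      std::deque<size_t> ready;
      for (size_t i = 0; i < nodes.size(); ++i)
        if (pending[i] == 0) ready.push_back(i);

      while (!ready.empty()) {         // greedy: run whatever is viable
        size_t n = ready.front();
        ready.pop_front();
        nodes[n].run();                // in TF this happens on a worker thread
        for (size_t c : nodes[n].consumers)
          if (--pending[c] == 0) ready.push_back(c);  // consumers become viable
      }
    }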

  • The executor itself is not a single thing.

  • It runs on every thread that is executing an op as soon as

  • that op finishes executing, and it dispatches all the execution

  • to another threadpool.

  • It's kind of surprising that this is happening,

  • because this means that TensorFlow cannot really be run

  • on a single thread.

  • But some interesting noteworthy things

  • about the executor that you might have

  • guessed from my comments so far, but require some thinking,

  • I think.

  • One is that the executor is greedy, not lazy.

  • And if you're familiar with TensorFlow, it looks very lazy,

  • but it mostly looks lazy because we do very aggressive graph

  • pruning.

  • And once you start putting control flow and function

  • execution and a few other things in the executor,

  • it actually pays off to have a mental model that

  • says that first pruning happens and then

  • greedy execution happens.

  • Otherwise, you can trick yourself

  • into thinking that some things are not

  • going to be executed when they are, in fact, going

  • to be executed.

  • Like, my favorite is if you have a conditional

  • and one of the branches of the conditional

  • depends on a value that is not in the conditional,

  • that value is unconditionally executed even if that branch is

  • never executed.

  • Which, if the executor were lazy,

  • that would not be the case.

  • But the executor being greedy also

  • makes it easier for you to be able to reason

  • about stateful operations, which is very nice, given

  • that those exist.

  • Another thing is that this executor,

  • it only looks at very local information.

  • Like, the only bit it has for each node is whether it's ready

  • or not.

  • So often, there is nothing preventing it

  • from choosing very suboptimal ordering of things to do.

  • Like if you need to fetch a lot of tensors

  • from a parameter server, the executor

  • is just as likely to fetch the first layer's tensor

  • as it is likely to fetch the last layer's tensor,

  • because none of these things have any dependencies on them.

  • And it can be quite tricky to teach TensorFlow

  • to actually choose the optimal ordering of things to execute.

  • And as I was saying earlier, this executor

  • is this thing that, there is no single executor thread.

  • It just runs on every thread as soon

  • as it finishes executing an op.

  • So it's kind of this highly parallel little monster.

  • So this is it for most of the core TF runtime.

  • I just had a few topics that I couldn't really

  • fit very well that I wanted to cover in this presentation,

  • just to generate some documentation.

  • One is hosts versus device memory.

  • As I hinted at earlier, when you partition TensorFlow,

  • it takes a graph and it spits out N graphs,

  • one graph per device.

  • Each graph per device gets its own executor.

  • But how do you deal with the fact

  • that some GPU ops take CPU tensors?

  • So we make a distinction between--

  • when you specify an input to a kernel,

  • you can say that that kernel expects that input

  • to be in host memory or expects that input

  • to be in device memory.

  • And so in fact, the executors for the GPU device can be,

  • and most of the time are, running a lot of CPU operations

  • on CPU tensors, only they call those tensors GPU tensors

  • in host memory.
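
The way a kernel says "this input stays on the host" is the HostMemory annotation in its registration. The sketch below uses the Fill op's "dims" input as the example; REGISTER_KERNEL_BUILDER, Device, TypeConstraint, and HostMemory are the real macros and builder methods, but the kernel class here is a hypothetical placeholder, and you would not actually link this next to the real registration.

    #include "tensorflow/core/framework/op_kernel.h"

    namespace tensorflow {

    // Hypothetical stand-in for the real GPU Fill kernel; only the
    // registration below matters for this example.
    class MyGpuFillKernel : public OpKernel {
     public:
      using OpKernel::OpKernel;
      void Compute(OpKernelContext* context) override {
        // A real kernel would read the host-memory "dims" input and allocate
        // the output before launching a device kernel.
      }
    };

    // "dims" (the shape input) is declared to live in host memory even though
    // the kernel itself runs on the GPU.
    REGISTER_KERNEL_BUILDER(Name("Fill")
                                .Device(DEVICE_GPU)
                                .TypeConstraint<float>("T")
                                .HostMemory("dims"),
                            MyGpuFillKernel);

    }  // namespace tensorflow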

  • And so if you look at the TensorFlow code,

  • you might sometimes see things like a distinction

  • between the device a tensor is in

  • and a device its memory is in and a device its operation is

  • in, and this bookkeeping is necessary to avoid

  • mixing these things up.

  • And incidentally, all resource handles are in host memory.

  • But this has a very sad, unintuitive consequence

  • that we need to fix, which I call,

  • and I think other people call,

  • just the "TF int32 problem," which is that most of the things

  • that GPU ops take as host memory inputs are shape related--

  • things like fill and zeros and RandomNormal,

  • they all take a shape and they fill that shape with whatever

  • values you want.

  • But the shapes are not static.

  • They're often computed based on other shapes.

  • The simplest case is when you just use zeros or something

  • like that, where you take a tensor, take its shape,

  • and use that shape to fill in another tensor.

  • But sometimes you're going to reduce

  • some dimensions in the shape, broadcast,

  • do some other things, and TF has this rule where by default, it

  • will place every GPU-capable op on a GPU device.

  • And if you want finer grained control,

  • you just take a large block of code in TF

  • and you wrap it with a tf.device, which

  • also means that every op that happens inside a block of code

  • gets placed on that device.

  • So if you allow TensorFlow to have

  • GPU kernels for int32 tensors, we

  • would keep bouncing these shapes between the GPU and the CPU.

  • So you take a shape and you want to slice it to, like,

  • remove the batch dimension.

  • We would copy it to the GPU, remove the batch dimension,

  • then copy it back to the CPU and use

  • it to fill the RandomNormal, and that's just sad.

  • So what we did instead in TensorFlow

  • to kind of paper over this, and this

  • has been [INAUDIBLE] because this would create

  • a lot of host-device transfers, and every time you have one,

  • you have to sync the GPU stream and you slow everything down.

  • So to avoid this, we say that for almost every op that

  • has a kernel registered on the GPU and has int32 inputs

  • and outputs, those are force-placed in host memory.

  • Including things like plus, gather, reductions,

  • and other things that you'd like to use as part of a model.

  • Currently, the work-around is use int64 for types

  • that you actually want to get executed on the GPU

  • and use int32 only for your shapes.

  • We have a fix forthcoming.

  • We have code already that exists in both Grappler for graph mode

  • and the eager placer for eager mode

  • that uses some heuristics and estimates of the cost

  • of transfer versus cost of computation

  • to try to keep small integer computations on the CPU

  • where they belong instead of bouncing them

  • back and forth to the GPU.

  • But there's still some performance regressions

  • that prevent these things from being turned on by default.

  • We expect it to happen very soon.

  • So this is it.

  • If you have any questions?

  • AUDIENCE: Could you maybe comment

  • on the difference between the kernel cache and the caching

  • that we do for functions in the Python layer?

  • ALEX PASSOS: So the Python layer does a caching for functions

  • from the--

  • like, here is a Python function, all these metadata

  • that's completely invisible to the runtime,

  • to a concrete function definition.

  • That-- when you go execute that concrete function definition,

  • we emit our PartitionedCall op, our StatefulPartitionedCall op,

  • that then hits TFE execute.

  • And the first time the op has to execute,

  • there's a lot of initialization that has

  • to happen inside the runtime.

  • That initialization is mostly covered by--

  • a little bit of it's covered by the kernel cache that mostly

  • tries to figure out what device it's going to be in

  • and the details of the attributes

  • and things like that.

  • And then the first time you actually execute it,

  • we'd have the cache inside the FunctionLibraryRuntime, which

  • is the thing that guards Grappler

  • and all the other graph processing transformations

  • we have to do.

  • So it's a few layers of caches to make things fast.

  • I don't really know how we could possibly

  • merge these caches across these layers of the stack,

  • but maybe if we unify more things,

  • this is going to be possible.

  • AUDIENCE: And maybe it might be worth

  • also looking if you can comment a little bit about--

  • you said the kernel cache is something

  • that is very unique to the eager execution versus something

  • that we'd have in graph mode?

  • ALEX PASSOS: No, we already had the kernel cache in graph mode.

  • I mean, it has a slightly different kernel

  • cache in eager execution, but we already needed a kernel cache

  • in graph mode, too, because creating a kernel

  • might require allocating memory, which might be expensive.

  • It definitely requires allocating a vtable pointer.

  • But in some kernels, you have to allocate a lot more than that.

  • AUDIENCE: We just had some cases where

  • there was memory bloat because of the kernel cache.

  • Is there anything that users need to be aware of for that,

  • or is something that the runtime needs to--

  • ALEX PASSOS: Hopefully our users should not be aware of it.

  • There are some cases now where you have to be aware of it,

  • like if you're using the V1 random number generator ops,

  • the only way to reset the random seed

  • requires resetting the kernel cache because state

  • is kept inside the kernels.

  • As we move to the V2 random number ops,

  • we don't make this type of mistake,

  • and so the space overall taken by the kernel cache

  • should become much smaller.

  • I also think for the particular case of functions,

  • we should be able to garbage collect

  • that cache a little better.

  • Yeah.

  • OK.

  • Oh, yes.

  • AUDIENCE: So when XLA is enabled,

  • we don't have this shape int32 problems anymore, right?

  • ALEX PASSOS: XLA has a different notion of kernel

  • and computation, so by giving up the possibility

  • of having any dynamic shapes at all, it can effectively--

  • XLA only works when you can ConstantFold all the shape

  • tensors away, and then it doesn't

  • matter if you ConstantFolded them on the GPU

  • or ConstantFolded them on the CPU, because at runtime they're not there.

  • This is only a problem when you have runtime dynamic shapes.

  • AUDIENCE: Can you comment on the garbage collection

  • in the runtime?

  • ALEX PASSOS: What part--

  • AUDIENCE: In terms of what is garbage collected

  • and what is that garbage--

  • ALEX PASSOS: Most of our runtime is ref counted, not

  • garbage collected.

  • AUDIENCE: Oh, ref counted.

  • ALEX PASSOS: Yeah.

  • If you-- there's a class in TensorFlow,

  • in core/refcount.h.

  • It's a base class for a ref counted pointer,

  • and we use that instead of SharedPtr

  • because it's a smaller memory footprint

  • and it has better cache locality behavior.

  • So you should be able to just read that and find

  • the places of the runtime that inherit from it,

  • and you can see the things which are ref counted.

  • AUDIENCE: But we currently have no garbage collection

  • for the kernel cache.

  • ALEX PASSOS: The kernel cache is not garbage collected, correct.

  • But almost everything that can be ref counted already

  • is ref counted.

  • A few things, like the kernel cache, are not, because it--

  • ref counting caches feels weird.

  • But in some cases, like when you're

  • caching the kernel for a function,

  • it actually makes sense to ref count it.

  • AUDIENCE: Can we put a limit on that kernel cache?

  • ALEX PASSOS: In principle we can do, yes.

  • It's, you know, memory versus performance tradeoff.

  • Assuming we are not dealing with the v1 random number ops,

  • because those, if you are evicted from the cache,

  • you now change the sequence of random numbers you would get

  • and that's pretty bad.

  • OK, thank you.

  • [APPLAUSE]

  • [MUSIC PLAYING]
