Inside TensorFlow: Eager execution runtime

  • ALEX PASSOS: Hi, my name is Alex, and I'm here again,

  • this time to talk about the TensorFlow eager execution

  • runtime.

  • This is a very broad topic and there

  • are lots and lots of things we could cover,

  • so I'm going to lightly graze many, many different parts

  • of our code base.

  • I'll give you a lot of function names

  • and file names and things like that

  • that you can use to familiarize yourself with something,

  • but if you're in the room right now, by all means,

  • ask questions.

  • Like, this is-- there's some buffer time at the end

  • to account for variability here.

  • And I think the more we can maximize shared understanding

  • of this stuff, the better.

  • So the way I thought we could go about this

  • is to do a very, very deep dive on what

  • actually happens in TensorFlow, starting from TF2--

  • when you type a very simple line of code, in this case

  • a tf.nn.relu of some Python numbers.

  • And I think if you were to start doing this, probably

  • the first thing you'd do is grep the TensorFlow code

  • base to find where we define ReLU.

  • And if you do that, you will find

  • that we have some definitions of ReLU in Keras,

  • but you won't find a single definition of ReLU itself

  • in core TensorFlow, and that might be

  • a little surprising at first.

  • It might put a damper on this whole,

  • let's find out what actually happens when we run ReLU

  • business, but the reason there's no Python definition of ReLU is that it's enough

  • that we've implemented it in C++ and we didn't need to put

  • a complicated Python API around it.

  • We can just generate the Python code to call ReLU.

  • So the way it's actually defined,

  • it's defined using the same mechanism

  • we use to register all the ops in TensorFlow.

  • So for every core operation in TensorFlow

  • that's visible to the runtime, we

  • have a registration that looks like this--

  • REGISTER_OP.

  • It takes a name and you say how many inputs, how many outputs,

  • what attributes it has.

  • If you want to know more about attributes-- what things

  • are allowed in there-- they are how

  • we can make our ops polymorphic and have the same operation

  • have different types of outputs, so different numbers of outputs

  • and things like that.

  • There is a lot of documentation about this on TensorFlow.org,

  • if you search for how to define a new op.

  • And another interesting thing in there

  • is that we also register a shape_inference function.

  • ReLU, thankfully, is one of the simplest

  • ops we have-- it just has one input, one output,

  • they have the same shape.

  • So we can use a pre-built shape_inference function

  • that just says the shape does not change.

  • Other ops will have vastly more complicated

  • shape_inference functions.

  • And the nice thing is that we can

  • run these functions offline when we're building graphs

  • without actually having the values of any tensors

  • and still be able to prove things

  • about the shapes of intermediate tensors and outputs

  • of your computation.

  • This is maybe the best tool we have for catching bugs now.

  • So if you want to look at the shape_inference code,

  • that's where you'd hook into.
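
For concreteness, here is roughly what such a registration looks like, modeled on the real one in tensorflow/core/ops/nn_ops.cc; the exact attribute constraint below is an approximation.

```cpp
#include "tensorflow/core/framework/common_shape_fns.h"
#include "tensorflow/core/framework/op.h"

namespace tensorflow {

// One input, one output, polymorphic over the numeric types allowed by the
// "T" attribute. The shape function simply says output shape == input shape,
// which is all ReLU needs.
REGISTER_OP("Relu")
    .Input("features: T")
    .Output("activations: T")
    .Attr("T: realnumbertype")  // approximate; the real constraint lists more
    .SetShapeFn(shape_inference::UnchangedShape);

}  // namespace tensorflow
```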

  • Now that we have that registration,

  • we run some complicated code that generates Python code

  • to actually call ReLU.

  • And if you look in bazel-genfiles,

  • you will find a file named gen_nn_ops.py and this file has

  • the actual def ReLU that we call.

  • And as you can see, it's not pretty

  • and there's a lot of stuff going in there.

  • The first line deals with dispatching

  • so that we can define ReLU not just for normal tensors,

  • but also optionally for sparse tensors

  • and ragged tensors and other composite types.

  • The second line has tf_export and what this does

  • is define the TensorFlow public API.

  • Every symbol that you get when you are using TensorFlow by tf

  • dot something is defined somewhere

  • by a tf_export decorator like this one.

  • There will be a future video on how exactly this works

  • and why we do things this way instead

  • of relying on Python's normal, you know,

  • name spacing mechanism.

  • But you can probably guess that it's because TensorFlow

  • is very complicated.

  • But essentially, you'll see this,

  • and this generated code for ReLU has a bunch of cases in it--

  • roughly four.

  • You have an eager fast path, an eager slow path,

  • you have a graph mode path, and kind of a side hook

  • for the symbolic execution.

  • But here, let's focus on the eager paths.

  • In the first one, the first thing

  • that we're actually doing here is

  • we're checking to see if we're in eager mode or not.

  • And to do that, we look at this context thing.

  • This context thing is part of the core of the TensorFlow v2

  • runtime.

  • It's the moral equivalent to the session,

  • but it's longer lived than the session

  • and represents more things.

  • So what is it?

  • From Python, the context is this class

  • that's defined in a file called context.py.

  • And it's a collection of a lot of things

  • that your Python program needs to be aware of to connect

  • to the TensorFlow runtime.

  • It stores things like, am I in eager mode or in graph mode?

  • Or if someone used a with tf.device decorator, what

  • device am I supposed to be executing code in?

  • And it stores things like, what's

  • your name scope and many other things.

  • And all of this information that the context

  • stores, in general--

  • the things that can change during the execution

  • of a program--

  • they're all stored in ThreadLocal stacks.

  • Usually stacks, because we have these nested things

  • like with tf.device, with tf.device, with tf.device,

  • so you'd like to be able to pop the stack

  • to go back to where you were.

  • And ThreadLocal because it's very important to us

  • that a TensorFlow runtime itself be thread agnostic,

  • so that if you write two threads and one is doing

  • a reinforcement learning learner and the other's doing

  • an agent that's talking to some game, when the agent wants

  • to use its GPU, it shouldn't necessarily make the learner

  • use the GPU, and vice versa.

  • Providing some kind of isolation between the threads

  • is what we felt was the right way,

  • so that at least each single thread

  • can feel like it's its own single-threaded Python program.

  • We use this a lot in distribution strategies,

  • like MirroredStrategy uses a lot of threads under the hood,

  • so it's really important that things are thread-safe.

  • The Python context, essentially, is mostly

  • a wrapper around the C++ context,

  • which is available through the TensorFlow C API.

  • And this is the core thing.

  • It has a bunch more methods than just these--

  • like, you can do a lot more things

  • than just listing devices.
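
For illustration, here is a minimal sketch of talking to that C++ context through the C API, creating it and listing devices; exact signatures have shifted a bit between TensorFlow versions, so treat this as approximate.

```cpp
#include <stdio.h>

#include "tensorflow/c/c_api.h"
#include "tensorflow/c/eager/c_api.h"

int main() {
  TF_Status* status = TF_NewStatus();
  TFE_ContextOptions* opts = TFE_NewContextOptions();
  // This is the C++ context that the Python Context class wraps.
  TFE_Context* ctx = TFE_NewContext(opts, status);
  TFE_DeleteContextOptions(opts);

  // One of the things the context can do: enumerate the available devices.
  TF_DeviceList* devices = TFE_ContextListDevices(ctx, status);
  for (int i = 0; i < TF_DeviceListCount(devices); ++i) {
    printf("%s\n", TF_DeviceListName(devices, i, status));
  }
  TF_DeleteDeviceList(devices);

  TFE_DeleteContext(ctx);
  TF_DeleteStatus(status);
  return 0;
}
```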

  • One thing I'd like to call out is that right now,

  • there are some things that are done in Python, like storing

  • whether we're in eager mode or graph mode,

  • and some things that are done in C++.

  • And the set of things that are done in Python and the set

  • of things that are done in C++ are likely to change.

  • I think as TensorFlow evolves, more and more things should

  • migrate from the Python context to the C++ context,

  • which will make things more language agnostic,

  • but also faster.

  • And you know, if everything in the context was in C++,

  • then all the generated Python code could have just been C++

  • code and it would have to--

  • we'd be able to get out of the overhead of executing Python

  • much sooner and remove performance problems

  • in our APIs.

  • So once you know you're in eager mode,

  • we try to do this fast path execution.

  • Here, the fast path is some complicated C code

  • that mostly does the same things that the fallback case is

  • trying to do.

  • So I don't think it's necessarily worth reading that.

  • I would rather look at the simpler

  • code in the fallback path.

  • And it's here.

  • So this is where we actually implement the ReLU function.

  • And what are the interesting things here?

  • First, we have this function called args_to_matching_eager,

  • and what it does is it takes a bunch of things

  • that can be converted to tensors

  • and converts them all to tensors of the same dtype.

  • And in the case of ReLU, there is only one.

  • But in the case of any other ops, like Add or MatMul,

  • they take multiple inputs, some of which might be tensors,

  • others might be numpy.arrays or Python lists or variables

  • or other objects that you can convert to a tensor,

  • but they're not necessarily tensors.

  • And this is the thing that's responsible for canonicalizing

  • everything.

  • Then once we have--

  • but here you might want to stop me a little and ask,

  • what is a tensor at the level of a TensorFlow runtime?

  • And the eager tensor class is a thing

  • that's half implemented in Python

  • and half implemented in the Python C API,

  • but it's a relatively thin wrapper around this thing

  • that we call TensorHandle that you can see in our TensorFlow C

  • API.

  • And this TensorHandle, it represents a potentially yet

  • to be computed tensor which is going to live on some device.

  • And we know what device it's going

  • to live on, because we know what device we executed

  • the operation that generates the tensor on.

  • And there are roughly three things

  • you can do to a TensorHandle, really.

  • You can ask about its shape and dtype and stuff-- that's one.

  • You can give it to execute--

  • that's another one.

  • And you can copy it to a device.

  • Also, I guess there are four things,

  • and the fourth thing is you can ask, hey, what is its value?

  • And when you want to force the evaluation of its value,

  • the TensorFlow runtime might have

  • to pause until it can give you the result.

  • You might need to copy the tensor from some remote device

  • and do all sorts of other things, but once you're done,

  • you get a TF_Tensor, which is essentially

  • just a pointer to some data.

  • But every other operation that is not

  • looking at the value of the tensor

  • is something that the runtime is free to delay, reorder,

  • as long as it respects the intended

  • semantics of your program.

  • And this is very important, because when

  • you're working with things like GPUs and TPUs,

  • you'd like to be able to dispatch operations that

  • are running the hardware as quickly as you can

  • in a way that's asynchronous.

  • If the Python thread can race ahead of the GPU

  • as it's executing operations, we can fully

  • utilize the GPU hardware and get the maximum performance.

  • And even in the eager runtime, we

  • can do this if the GPU kernels are sufficiently heavy.

  • There's a few cases in which we need to copy tensors,

  • even if they're local on the CPU.

  • And some of those cases are going away, like currently,

  • we copy string tensors every time we try to look

  • at their values because TensorFlow's internal string

  • representation is a complicated C++ class and we'd like to have

  • an API stable C type.

  • We're actually changing our internal string

  • representation-- there's an RFC you can look at about it--

  • that will make the internal and external representations

  • both be an API stable C type.

  • But internally, what is a TensorHandle?

  • And this is, in fact, very, very similar

  • to the first implementation of TensorHandle.

  • Now it's like hundreds of lines of code spread

  • across many files, but all that you really

  • need in a TensorHandle is to know what device

  • it's in and what data you have.

  • And the data can either be some concrete tensor, a value that

  • has already been computed, or a future that

  • might be computed later because it's remote or asynchronous

  • or on some other device.

  • But the core of it is this--

  • just, this is the future that we hand around

  • in the representation.

  • The current code is much more complicated,

  • and you might want to look at it to see why it's complicated,

  • but the core idea is there.
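
A hypothetical, stripped-down sketch of that "device plus data-or-future" idea (MiniTensorHandle is not the real class, just an illustration):

```cpp
#include <future>
#include <string>
#include <variant>

#include "tensorflow/core/framework/tensor.h"

// Hypothetical illustration only: the handle knows which device the value
// will live on, and holds either a concrete tensor or a future for a value
// that has not been produced yet (remote, or still in flight asynchronously).
struct MiniTensorHandle {
  std::string device_name;
  std::variant<tensorflow::Tensor,                      // already computed
               std::shared_future<tensorflow::Tensor>>  // computed later
      data;
};
```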

  • So popping the stack.

  • You now know what a tensor is, and you've probably figured out

  • that converting that list of Python integers to a tensor is

  • not terribly hard-- there's C code that does that--

  • and now it's time for us to execute ReLU.

  • Great.

  • Here is one thing to note about how the TensorFlow

  • eager runtime works, which is that this line that

  • is selected there is not a noncontroversial choice.

  • Overall, there are two ways we could have gone about this.

  • We could have had a closed domain API, in which TensorFlow

  • would export a symbol called ReLU,

  • another called MatMul, another called Conv, et cetera.

  • Or we could have this open domain API

  • that we have here, where we have a symbol

  • called execute that takes the name of an operation ReLU

  • and a bunch of metadata about it.

  • There are advantages and disadvantages to both cases.

  • In general, the closed domain case,

  • where you just have a single endpoint

  • in your API for every operation you want to run--

  • that is easier to make fast.

  • However, it's a little tricky in TensorFlow

  • because the current--

  • the pre-existing graph runtime has a few layers

  • of indirection between a node in the graph and the actual kernel

  • that it executes.

  • And indeed, between TensorFlow versions,

  • we can, without breaking graph def compatibility,

  • replace some kernels.

  • Some things that were handled by a single kernel

  • now have to be handled by multiple kernels.

  • So to preserve this layer of indirection,

  • we felt like it was more natural to have this API that

  • is an open domain API.

  • However, as you can see, just by the fact

  • that there's a string in there, executing

  • this can only be so fast, because we

  • need to somehow take this string and these attributes

  • and some properties about these inputs

  • and turn that into a kernel.

  • And that means that we can definitely not

  • execute kernels any faster than it takes

  • to at least hash that string.

  • So you know, there are trade-offs here,

  • but we felt that preserving the flexibility

  • that you have in graph mode was the most important thing here.

  • So how do you actually use execute?

  • When you call this line in Python, what actually happens?

  • And execute's something that is defined in the TensorFlow C

  • API, and to use it, you do something like this.

  • You first create a status so that you

  • can find out if things failed, and then you make a new op.

  • You add inputs, you set attributes,

  • you allocate some memory to put pointers to the return values,

  • and then you call execute and finally, you delete that all.

  • So this is fairly straightforward,

  • and if you're familiar with the TensorFlow C API for building

  • graphs, you'll see that this is very similar to that C API.
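
Sketched out, the sequence just described looks roughly like this; it assumes an existing TFE_Context* and input TFE_TensorHandle*, and omits checking the status after each call.

```cpp
#include "tensorflow/c/eager/c_api.h"

// Run Relu on an existing input handle. This mirrors the steps above:
// make a status, make a new op, add inputs, set attributes, allocate space
// for the return values, execute, and clean up.
TFE_TensorHandle* RunRelu(TFE_Context* ctx, TFE_TensorHandle* input) {
  TF_Status* status = TF_NewStatus();

  TFE_Op* op = TFE_NewOp(ctx, "Relu", status);  // open domain: op named by a string
  TFE_OpAddInput(op, input, status);
  TFE_OpSetAttrType(op, "T", TF_FLOAT);         // the dtype attribute

  TFE_TensorHandle* retvals[1] = {nullptr};     // space for the outputs
  int num_retvals = 1;
  TFE_Execute(op, retvals, &num_retvals, status);

  TFE_DeleteOp(op);
  TF_DeleteStatus(status);
  return retvals[0];
}
```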

  • There's-- this is about as good as you could possibly get

  • for Python code in an open domain setup,

  • but it's a little sad here that when you build a TFE_Op,

  • you need to add the inputs to that op.

  • This means that if you're executing

  • an op in a tight loop, you either

  • have to have the same inputs on every iteration of the loop

  • or you have to allocate a whole other TFE_Op.

  • For Python, we really don't have a way of making this better,

  • but for other languages like Swift or languages

  • that have access to our compiler,

  • we should be able to cache the dynamic bits that

  • involve making a TFE_Op and separate them from the--

  • sorry, cache the static bits that don't really change,

  • like all your MatMuls are the same,

  • and separate that from the dynamic bits, which

  • are the inputs that should actually change.

  • And if you do this, you can actually--

  • we could make in principle this open domain

  • approach as fast as a closed domain approach.

  • And this is maybe a minor refactoring

  • that we should do at some point.

  • So what does execute do?

  • If you go through a few layers of APIs,

  • you end up on this function called EagerExecute,

  • and it has a lot of things here.

  • The first interesting one is this MaybeUpdateOpDevice function,

  • which is, you might call it, the placer,

  • where we get to decide where we're going

  • to execute each operation.

  • This will have some complicated heuristics.

  • In general, you can think of it as, if you're

  • operating on a resource tensor, we'll

  • run your operation on the device that

  • has that resource because any other thing will fail.

  • Otherwise, if you have a tf.device annotation somewhere,

  • we'll run it there.

  • Otherwise, if you don't have any of those things,

  • we'll see what devices are available to execute

  • this operation and run on whatever device

  • we think is going to be the fastest, which

  • is how TensorFlow gets away with using a GPU for you

  • even if you don't specify with tf.device GPU in there.
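
In pseudocode, that heuristic is roughly the following; the Fake* types are hypothetical stand-ins, not the real placer's data structures.

```cpp
#include <string>
#include <vector>

// Hypothetical, self-contained sketch of the placement heuristic described
// above. FakeInput/FakeOp/FakeContext stand in for the real runtime types.
struct FakeInput {
  bool is_resource;        // DT_RESOURCE handles pin the op to their device
  std::string device;
};

struct FakeOp {
  std::vector<FakeInput> inputs;
};

struct FakeContext {
  std::string device_annotation;         // from `with tf.device(...)`
  std::string fastest_supported_device;  // e.g. "GPU:0" if one is available
};

inline std::string PickDevice(const FakeOp& op, const FakeContext& ctx) {
  // 1. Resource inputs win: running anywhere else would fail.
  for (const FakeInput& input : op.inputs) {
    if (input.is_resource) return input.device;
  }
  // 2. An explicit device annotation is respected next.
  if (!ctx.device_annotation.empty()) return ctx.device_annotation;
  // 3. Otherwise pick the device we expect to be fastest, which is how the
  //    GPU gets used even without an explicit tf.device.
  return ctx.fastest_supported_device;
}
```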

  • And then you have some forks in there about are we local,

  • are we remote?

  • And once you do know that you're in the local case, what we want

  • to do is very quickly do that string processing

  • that we needed to do to find what

  • kernel we should be executing.

  • And there is fast code that takes the attributes

  • and the name of an op and gives you

  • a cache key that is looked up in the context,

  • which is where we store the kernel.

  • And here you might think something's a little funny,

  • because usually you think of operations as functions,

  • not as classes, but clearly there

  • is like a KernelAndDevice class in there,

  • so we probably have an object.

  • And the reason is that for many types

  • of kernels that we want to execute,

  • especially things involving GPUs but also some stateful kernels,

  • you want to keep some state in that object.

  • Ideally, that state will be managed by the runtime

  • if you have the resource managers,

  • but that doesn't always happen now.
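
Conceptually, the cache is something like the sketch below; FakeKernel stands in for KernelAndDevice, and the real cache key also folds in things like input dtypes and the device.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical, self-contained sketch: hash the op name plus a canonicalized
// attribute string into a cache key, and keep the (possibly stateful) kernel
// object alive so it isn't rebuilt on every call.
struct FakeKernel {  // stands in for TensorFlow's KernelAndDevice
  std::string op_name;
};

class KernelCache {
 public:
  FakeKernel* GetOrCreate(const std::string& op_name,
                          const std::string& attr_fingerprint) {
    // Executing can never be faster than at least hashing this string.
    uint64_t key = std::hash<std::string>{}(op_name + "|" + attr_fingerprint);
    auto it = kernels_.find(key);
    if (it != kernels_.end()) return it->second.get();
    auto kernel = std::make_unique<FakeKernel>(FakeKernel{op_name});  // expensive in real life
    FakeKernel* raw = kernel.get();
    kernels_[key] = std::move(kernel);
    return raw;
  }

 private:
  std::unordered_map<uint64_t, std::unique_ptr<FakeKernel>> kernels_;
};
```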

  • And once you have a kernel, the kernel

  • tells us on what devices it wants its inputs on.

  • You would think that the kernel would want its inputs

  • all in a device that is executing,

  • but that turns out to be too narrow a view.

  • For some kernels, especially GPU kernels,

  • you might want some inputs on the host

  • CPU that's attached to the GPU.

  • And the reason is that imagine you're

  • a GPU kernel for generating a big matrix of random numbers.

  • You'd like to know how many numbers you're

  • going to generate so that you can run the memory

  • allocation before you enqueue your CUDA kernel.

  • So if you were to do that-- if the shape of the random number

  • vector you're going to generate, if that's in the GPU,

  • you'd need to fetch it back to the CPU to do that allocation.

  • That would be terribly slow.

  • So instead TensorFlow says, I expect that input

  • to be on the local CPU, and this is the function

  • that validates this.
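
That expectation is expressed in the kernel's registration; below is a hedged sketch modeled on how GPU random ops register their shape input (the kernel class here is hypothetical, and the real registration carries more type constraints).

```cpp
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

// Hypothetical kernel body; the interesting part is the registration below.
class MyGpuRandomNormalOp : public OpKernel {
 public:
  explicit MyGpuRandomNormalOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    // Read the "shape" input (guaranteed to be in host memory), allocate the
    // output, then enqueue the CUDA kernel that fills it with randoms.
  }
};

// The GPU kernel declares that its "shape" input lives in host memory, so the
// runtime keeps that small tensor on the CPU instead of making us fetch it
// back from the GPU before we can allocate the output.
REGISTER_KERNEL_BUILDER(Name("RandomStandardNormal")
                            .Device(DEVICE_GPU)
                            .HostMemory("shape")
                            .TypeConstraint<float>("dtype"),
                        MyGpuRandomNormalOp);

}  // namespace tensorflow
```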

  • But in this case, if you're also trying

  • to run a large convolution and one of the inputs

  • is on the CPU, this maybe-copy step will move that input

  • to the GPU for you, which is faster than just running

  • your convolution on the CPU.

  • And then finally, we get to decide whether we're

  • in sync or async mode, where we first create this node that

  • represents all the computation that has to happen

  • to execute this kernel.

  • If we're async mode, we throw this in a queue and return

  • control immediately.

  • If we're in sync mode, we run it now.

  • This async/sync here is complicated

  • because it's another layer of asynchrony that's

  • separate from the fact that our GPU runtime is itself

  • asynchronous.

  • This is kind of a patch to make the TensorFlow CPU

  • runtime, which is currently synchronous,

  • act asynchronously to try to get a little more performance

  • in the eager mode.

  • It's a little sad, because you lose your error messages

  • once you get very asynchronous there,

  • and we currently do not run shape_inference

  • in this asynchronous mode.

  • I think as we rework the TensorFlow runtime, which

  • the team has a large effort to do now,

  • we have a chance to fix this and have a single code

  • path for synchronous and asynchronous.

  • But for now, we have this patch.

  • And then finally, now we have the kernel.

  • We've got to call it.

  • That's easy, right?

  • So to call a kernel, we have to make an OpKernelContext,

  • and to do that, you need to fill in this Params struct, which

  • I put here.

  • Which you can clearly read on this slide,

  • because it can definitely fit with a very large and readable

  • font.

  • So we don't do that.

  • This is sadly something that-- the original TensorFlow

  • API for kernels had only one caller, which was a TensorFlow

  • executor, so it was very easy to just add parameters and make

  • the calling convention harder and harder,

  • because there was only one place to fix.

  • We're now trying to trim this back and simplify it,

  • so it will likely get better.

  • But for the eager runtime, we have this class KernelAndDevice

  • that knows how to call a kernel, requiring a lot fewer things

  • about it.

  • Mostly all it needs is the inputs--

  • a place for you to populate with outputs--

  • and some information in case you want to profile--

  • things about how long it takes to execute each node

  • or collect step stats or what graphs you're

  • using, if you're executing a function, and things like that.

  • So now that we have this, we can run the kernel.

  • So what the kernels look like--

  • ReLU happens to have one of the simpler kernels

  • we have in TensorFlow.

  • It's a UnaryElementWiseOp.

  • We have a base class for this that

  • handles a lot of the logic around memory

  • allocation, buffer reuse, so that tf.relu by default

  • will reuse its input buffer if the TensorFlow runtime knows

  • that no other op yet to be executed is going to need this.

  • But once all the boilerplate is dealt with,

  • all that this kernel has to do is execute the functor.
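
Slightly simplified, the kernel looks roughly like this, modeled on the real ReluOp in tensorflow/core/kernels; the functor it calls is shown further down where Eigen comes up.

```cpp
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

// UnaryElementWiseOp handles the boilerplate: input checking, output
// allocation, and forwarding the input buffer for reuse when it is safe.
// All that is left for the kernel is to run the functor.
template <typename Device, typename T>
class ReluOp : public UnaryElementWiseOp<T, ReluOp<Device, T>> {
 public:
  using UnaryElementWiseOp<T, ReluOp<Device, T>>::UnaryElementWiseOp;

  void Operate(OpKernelContext* context, const Tensor& input, Tensor* output) {
    functor::Relu<Device, T> functor;  // declared once, specialized per device
    functor(context->eigen_device<Device>(), input.flat<T>(),
            output->flat<T>());
  }
};

}  // namespace tensorflow
```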

  • And this is another place where you'd be surprised

  • that we use an object where you think we should use a function,

  • because in principle, relu is just a function.

  • It doesn't keep any state.

  • There should be no reason for us to make an object for it.

  • Except C++ does not let you declare a templated function

  • in the header file, but define it in the C++ file.

  • And this is something very useful for us

  • because as you can see, device is a parameter in there,

  • and one of those devices is GPU devices.

  • And for GPU devices, we'd like to put the function in the file

  • that we're going to compile with the CUDA compiler.

  • And we would like to not compile our entire code

  • base with a CUDA compiler.

  • So being able to declare this functor in a single place,

  • where we can reference a specialization of this class

  • without any access to the CUDA compiler,

  • and have a file on the side that's

  • just going to fill in this implementation

  • when it's compiled with the CUDA compiler.

  • It's very useful.

  • And as you can also see here, TensorFlow

  • kernels-- they tend to be highly templated.

  • Most are templated like this one, based on the device

  • that it's going to execute on and on the dtypes.

  • So they will generate fast, specialized code

  • for every core numerical type supported by TensorFlow, which

  • is an incentive to keeping the set of core numerical types

  • supported by TensorFlow relatively small,

  • as otherwise their binary size would grow.

  • But this has the nice side effect

  • that the code generated is very fast.

  • And one of the things that makes us generate

  • very fast code for this, which you will see if you look

  • into the implementation of the functors,

  • is that we can use Eigen to generate this code for us.

  • So the ReLU functor in particular

  • is very easy to write, because it's

  • just an element-wise max between the tensor you

  • have as your input and 0.

  • And Eigen turns out to be a very useful tool to write this.

  • It lets us write this code once.

  • It will generate specializations for every dtype

  • we are interested in and also for CPUs and GPUs.
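
Roughly, the functor is just the Eigen expression below, modeled on the real functor::Relu; instantiating it per device and per dtype is what generates the specialized code.

```cpp
#include "tensorflow/core/framework/tensor_types.h"

namespace tensorflow {
namespace functor {

// Element-wise max(x, 0), written once as an Eigen expression. Instantiating
// it with Device = CPU or GPU and with each dtype T is what produces the
// specialized code.
template <typename Device, typename T>
struct Relu {
  void operator()(const Device& d, typename TTypes<T>::ConstTensor features,
                  typename TTypes<T>::Tensor activations) {
    activations.device(d) = features.cwiseMax(static_cast<T>(0));
  }
};

}  // namespace functor
}  // namespace tensorflow
```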

  • For this particular operation, you could probably

  • write it in fast assembly language yourself,

  • or SSE intrinsics or something like that,

  • but for more complicated operations,

  • like softmax and others that might

  • have interesting intermediate values that need computing,

  • being able to just have this code be generated

  • for you instead of requiring that you write all the device

  • specific and type specific things, can save a lot of time.

  • Also, Eigen in its core has a very, very fast GEMM,

  • which is the core, like, basic MatMul that is inside most

  • of our very expensive kernels.

  • It ends up being a very large asset

  • in making TensorFlow go fast.

  • So that was it, really.

  • It only took us what, 20 minutes to get through executing ReLU?

  • I think TensorFlow can do it a little bit faster than that.

  • [LAUGHTER]

  • But in general, this is kind of like what the stack of things

  • looks like as you're executing operations eagerly.

  • Of course, if you've been following

  • this for a little bit, you should

  • know that we can do better than executing operations eagerly

  • in TensorFlow.

  • We have tf.function and we have graphs and other things that

  • can get you a lot more performance by instead

  • of going down and up that stack for every single operation,

  • going down once and executing a lot of operations.

  • Also, this ties into graph optimizations and stuff.

  • So how do we run functions?

  • And so-- yes?

  • AUDIENCE: Previously, you mentioned briefly

  • about the async mode.

  • So is that something that is user configurable?

  • Because there's like context async, [INAUDIBLE]..

  • ALEX PASSOS: I don't remember right now if there's

  • a public API to make it user configurable or not,

  • but there is an internal API to make it user configurable.

  • AUDIENCE: I believe that there was-- in enable Eager

  • Execution, you could set it.

  • So I think you could set it in v1,

  • but it may not be exposed directly in v2.

  • ALEX PASSOS: Yes.

  • You-- Taylor is correct.

  • I think it-- I know how to expose it in v1

  • and do not know how to expose it in v2,

  • but there's probably a way.

  • Or maybe there's not a way yet because we're still

  • treating it as experimental.

  • I'm not sure.

  • Regardless, the way it is now, I don't think it's

  • something we should rely on in the long run.

  • So I'd rather be able to iterate on it

  • a little longer until we start recommending it

  • as a way for people to get performance.

  • AUDIENCE: And what is the difference between eager

  • slow and fast path?

  • ALEX PASSOS: It's mostly that the fast path has

  • special-cased some types of tensor conversion

  • into fast C code.

  • So if you pass a list of Python floating point numbers,

  • we can convert that to an eager tensor

  • without hitting any Python code, and that

  • will save you quite some time.

  • OK, so most of what we saw just now

  • for the case of executing a single op

  • also applies to executing a function.

  • So a function itself is an op named PartitionedCall,

  • and you will execute that op--

  • like the tf.function internals will execute that op--

  • just like how we just saw how to execute ReLU.

  • And so the first half, until you get to the kernel and device

  • run bit, is all the same.

  • It's just that that kernel implementation

  • is particularly interesting.

  • Function calls in TensorFlow look relatively simple.

  • We have inputs, we have outputs, we

  • have attributes that tell TensorFlow what types of inputs

  • we have, what types of outputs we have,

  • and we have a function.

  • There are some things in there that

  • seem a little tricky, like there's

  • all sorts of configuration we can pass.

  • I actually forgot what's the difference between config

  • and config_proto in there.

  • But essentially, this is the big entry point

  • to executing functions in TensorFlow.
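
For reference, here is a sketch of that op's interface, modeled on the real registration in tensorflow/core/ops/functional_ops.cc; the exact attribute list and defaults are approximate.

```cpp
#include "tensorflow/core/framework/common_shape_fns.h"
#include "tensorflow/core/framework/op.h"

namespace tensorflow {

// Inputs, outputs, the attrs that describe their types, the function to
// call, and some configuration strings.
REGISTER_OP("PartitionedCall")
    .Input("args: Tin")
    .Output("output: Tout")
    .Attr("Tin: list(type) >= 0")
    .Attr("Tout: list(type) >= 0")
    .Attr("f: func")
    .Attr("config: string = ''")
    .Attr("config_proto: string = ''")
    .Attr("executor_type: string = ''")
    .SetShapeFn(shape_inference::UnknownShape);

}  // namespace tensorflow
```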

  • But what this-- if you go look at the kernel of this op, what

  • you'll find is that it mostly just forwards things

  • to the FunctionLibraryRuntime.

  • And the FunctionLibraryRuntime is this core bit of TensorFlow

  • that knows about functions.

  • It can do things like instantiate and run,

  • pretty much.

  • And also CreateKernel, which you usually

  • do between Instantiate and Run, since that will let you--

  • for the FunctionLibraryRuntime also knows how to execute

  • operations that are not functions.
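
Sketched against that interface, the Instantiate-then-Run flow looks roughly like this; it is a hedged sketch, and a real caller also fills in fields on the Options struct (rendezvous, step container, and so on) and handles errors more carefully.

```cpp
#include <vector>

#include "tensorflow/core/framework/function.h"
#include "tensorflow/core/lib/core/errors.h"
#include "tensorflow/core/lib/core/notification.h"

namespace tensorflow {

// Rough shape of the Instantiate -> Run flow. `flr` is an already-constructed
// FunctionLibraryRuntime*, `attrs` carries the function's attributes.
Status CallFunction(FunctionLibraryRuntime* flr, const string& name,
                    AttrSlice attrs, const std::vector<Tensor>& args,
                    std::vector<Tensor>* rets) {
  FunctionLibraryRuntime::Handle handle;
  // Instantiate: placement, partitioning, Grappler, etc., cached per function.
  TF_RETURN_IF_ERROR(flr->Instantiate(
      name, attrs, FunctionLibraryRuntime::InstantiateOptions(), &handle));

  FunctionLibraryRuntime::Options opts;
  // A real caller also sets things like opts.rendezvous, opts.step_container,
  // and a cancellation manager here.
  Status status;
  Notification done;
  // Run is asynchronous; the done callback is how the calling thread is freed.
  flr->Run(opts, handle, args, rets, [&status, &done](const Status& s) {
    status = s;
    done.Notify();
  });
  done.WaitForNotification();
  return status;
}

}  // namespace tensorflow
```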

  • So what does instantiate mean?

  • Instantiate mostly runs all the graph

  • optimizations that we might want to run on that function--

  • to take code that you enjoyed writing

  • and turn it into code that the executor will

  • execute very quickly.

  • Most of this processing happens in this

  • ProcessFunctionLibraryRuntime InstantiateMultiDevice call,

  • where we run all sorts of graph transformations.

  • If you have the tf-xla bridge enabled,

  • it will run the transformations related to the tf-xla bridge.

  • It will run the TensorFlow placer.

  • What the TensorFlow placer does is it takes a graph--

  • in this case, a function graph--

  • that has devices assigned to some of the nodes,

  • and it spits out another graph that has devices

  • assigned to all of the nodes.

  • It does this by following a very similar algorithm to the one

  • that I described earlier for individual ops.

  • So if you have a resource, it will place that op

  • next to the resource.

  • Otherwise, if you have specified a device,

  • it will respect that device, even if you partially

  • specified a device, it will respect that partial device

  • specification.

  • And finally, we'll group things by colocation group.

  • The TensorFlow graph language allows

  • you to specify colocations, even though these

  • have very non-intuitive consequences, because

  • by colocating a node A of a node B,

  • you can actually move where node B

  • is placed because the placer is not

  • aware of the direction of the colocation arrows.

  • It just groups all the colocated nodes into a bag

  • and finds a device that can execute all ops in there.

  • So this can have very fun consequences,

  • like a bug I helped fix a while back

  • where if you try to speed up your distributed system by always

  • colocating the variable initialization

  • code with the remote device that the variable is on,

  • you can accidentally say, please always colocate

  • the initializers of GPU variables

  • on the GPU, which can be trouble.

  • If some of your initializers have operations that cannot run

  • in the GPU, you now have silently moved your variables

  • to the CPU, which probably is quite a performance

  • degradation.

  • So it's very subtle and in TF2, we're

  • trying to move away from using colocation constraints

  • inside TensorFlow and we're definitely moving away

  • from encouraging people to use colocation constraints

  • outside TensorFlow.

  • I would rather you be more specific about what devices

  • you're using, or even better, use something

  • like a distribution strategy that

  • is aware of all these bugs and colocations

  • and can work around them for you instead of trying to replicate

  • this functionality yourself.

  • And once you have placed the graph,

  • we can run the partition that takes a graph that

  • has nodes for many devices and returns many graphs,

  • all of which have nodes on a single device only.

  • And to do that, if there was any edge that went from one device

  • to another device, that gets replaced with a pair of sends

  • and receives.

  • This is also where we run Grappler and all the function

  • inlining and optimization passes that the last training section

  • with Eugene was covering.
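
Below is a rough sketch of invoking that partitioning step through graph_partition.h; the get_incarnation stub here is a placeholder, since real callers wire in actual device incarnations.

```cpp
#include <unordered_map>

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/graph/graph.h"
#include "tensorflow/core/graph/graph_partition.h"

namespace tensorflow {

// Split a placed graph into one GraphDef per device; the partitioner inserts
// matching _Send/_Recv pairs on every edge that crosses devices.
Status PartitionByDevice(Graph* graph,
                         std::unordered_map<string, GraphDef>* partitions) {
  PartitionOptions opts;
  opts.node_to_loc = [](const Node* n) { return n->assigned_device_name(); };
  opts.new_name = [graph](const string& prefix) {
    return graph->NewName(prefix);
  };
  // Placeholder: real callers return the device's actual incarnation number.
  opts.get_incarnation = [](const string& device) -> uint64 { return 1; };
  return Partition(opts, graph, partitions);
}

}  // namespace tensorflow
```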

  • But yeah.

  • So this does a lot of heavy lifting

  • and indeed, the PartitionedCall op

  • puts a cache in front of Instantiate

  • to make sure that we don't call it twice

  • for a single function, because otherwise we'd be very slow.

  • And once they've instantiated the function,

  • we can go to the other main method

  • in the FunctionLibraryRuntime and run it.

  • So in general, as you can tell by the name

  • Partition Call for the op, our functions

  • can be on multiple devices.

  • And at this point, at least the code

  • has simplified enough in there--

  • this is actually snippeted from the core runtime,

  • even though the core runtime has a lot more error handling

  • going on--

  • that all we have to do to run a function on multiple devices

  • is to just run a function on a single--

  • run N functions, each on a single device,

  • and trust that they all know how to talk to each other

  • to make the sends and receives happen.

  • So there's some thing called a rendezvous,

  • and if you read the TensorFlow code base,

  • you will see lots of references to it, that's

  • responsible for making the sends and receives find

  • each other.

  • We'll have rendezvous that know how to deal with single host

  • and with multi host.

  • And there are lots of tricky and interesting bits

  • here, like what is

  • the correct lifetime of a rendezvous,

  • how do you shut down a rendezvous once you want

  • to shut down a TensorFlow computation because maybe

  • some kernels failed.

  • And you know, some kernels failed

  • and something is blocking on receiving a tensor.

  • They're never going to get that tensor.

  • So we probably need to shut that operation down gracefully.

  • And there's a lot of cancellation related

  • logic in that.

  • But it mostly-- at the level of FunctionLibraryRuntime,

  • you can just run your N functions, one per device,

  • and forget about it.

  • And running a function on a single device

  • mostly consists of bringing up this TensorFlow executor

  • and calling run in it.

  • And you'll see that you have things named RunAsync and done

  • callbacks.

  • In general, we treat all of these things as asynchronous

  • so that we can release the calling thread as quickly as we

  • can so that you can keep on running more computation on it.

  • Especially if you have nested function calls--

  • treating these things asynchronously

  • is quite the performance improvement.

  • And here, I could dig into the TensorFlow executor,

  • but that code is fairly complex and you have a simple core

  • algorithm, but it's really hard to pull it out of there.

  • And I think the main reason why it's

  • hard to pull it out of there is that the executor grew together

  • with the implementation of control flow in TensorFlow,

  • and specifically, the details that we had to implement

  • while_loop kind of, like, obscured

  • a lot of the core functionality of the executor.

  • It now is aware of frames and a lot of other complicated

  • things, and you have multidimensional pending

  • counts.

  • So I'm not going to snippet that code,

  • but I'll say that if you want to read it, go for it.

  • It's very interesting-- like, highly asynchronous,

  • highly parallel interpreter.

  • But I'll just give you some of the highlights

  • of what's happening in there.

  • And its input is a large bag of nodes and there is no output.

  • Anything that you want to get out of it,

  • you get out of it through a send or a receive.

  • I mean, there's-- technically there are outputs,

  • but by far most outputs in the common case of TensorFlow are

  • handled through sends and receives.

  • And the end state for the executor

  • is that all nodes must have executed or an error has happened.

  • And inside TensorFlow, the core runtime,

  • we have no error recovery.

  • So any error will shut everything down.

  • This is sometimes unfortunate, because some parts

  • of the TensorFlow's higher level APIs

  • rely on errors for the common path.

  • For example, tf.data raises an error

  • once it reaches the end of a stream, which

  • means that you can't really easily have a single graph that

  • exhausts a single iterator, does some things, and then runs

  • another iterator.

  • Because by the time you've exhausted the first iterator,

  • an error is raised and TensorFlow will shut everything

  • down.

  • There are ways of interacting with tf.data that do not

  • involve using the iterator GetNext op, which can fail,

  • and we use those inside AutoGraph to make it easier

  • for you to write code that can recover from these failures

  • and--

  • well, not recover from these failures.

  • It will see no failures when iterating

  • over multiple iterators.

  • It's quite nice.

  • tf.data has all these cool little combinators

  • like [INAUDIBLE] and reduce and you can thread together

  • like three of those to simulate a while_loop with breaks.

  • But anyway, popping the stack here,

  • the core algorithm of the executor

  • is while there are some nodes that haven't been executed

  • and no errors have happened, execute a viable node,

  • and once that node finishes executing,

  • you mark all of its output tensors as ready.

  • And there's some bookkeeping in there

  • that once you mark a tensor as ready,

  • you look at what ops are going to be made executable

  • by marking the tensor as ready, which

  • marks other nodes as viable.

  • And this just, like, recursively applies.
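
In pseudocode, that core loop is roughly the following single-threaded sketch; the real executor is asynchronous, multi-threaded, and control-flow aware, and FakeNode is a hypothetical stand-in.

```cpp
#include <deque>
#include <vector>

// Hypothetical, single-threaded sketch of the executor's core algorithm:
// greedily run whatever is ready, then decrement pending counts of consumers.
struct FakeNode {
  int pending;                       // how many inputs are not yet ready
  std::vector<FakeNode*> consumers;  // nodes fed by this node's outputs
  void Run() { /* execute the kernel */ }
};

inline void RunGraph(const std::vector<FakeNode*>& initial_ready) {
  std::deque<FakeNode*> ready(initial_ready.begin(), initial_ready.end());
  while (!ready.empty()) {
    FakeNode* node = ready.front();
    ready.pop_front();
    node->Run();  // any error here shuts everything down in the real runtime
    for (FakeNode* consumer : node->consumers) {
      if (--consumer->pending == 0) ready.push_back(consumer);  // now viable
    }
  }
}
```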

  • The executor itself is not a single thread.

  • It runs on every thread that is executing an op as soon as

  • that op finishes executing, and it dispatches all the execution

  • to another threadpool.

  • It's kind of surprising that this is happening,

  • because this means that TensorFlow cannot really be run

  • on a single thread.

  • But some interesting noteworthy things

  • about the executor that you might have

  • guessed from my comments so far, but require some thinking,

  • I think.

  • One is that the executor is greedy, not lazy.

  • And if you're familiar with TensorFlow, it looks very lazy,

  • but it mostly looks lazy because we do very aggressive graph

  • pruning.

  • And once you start putting control flow and function

  • execution and a few other things in the executor,

  • it actually pays off to have a mental model that

  • says first pruning happens and then

  • greedy execution happens.

  • Otherwise, you can trick yourself

  • into thinking that some things are not

  • going to be executed when they are, in fact, going

  • to be executed.

  • Like, my favorite is if you have a conditional

  • and one of the branches of the conditional

  • depends on a value that is not in the conditional,

  • that value is unconditionally executed even if that branch is

  • never executed.

  • Which, if the executor were lazy,

  • that would not be the case.

  • But the executor being greedy also

  • makes it easier for you to be able to reason

  • about stateful operations, which is very nice, given

  • that those exist.

  • Another thing is that this executor,

  • it only looks at very local information.

  • Like, the only bit it has for each node is whether it's ready

  • or not.

  • So often, there is nothing preventing it

  • from choosing very suboptimal ordering of things to do.

  • Like if you need to fetch a lot of tensors

  • from a parameter server, the executor

  • is just as likely to fetch the first layer's tensor

  • as it is likely to fetch the last layer's tensor,

  • because none of these things have any dependencies on them.

  • And it can be quite tricky to teach TensorFlow

  • to actually choose the optimal ordering of things to execute.

  • And as I was saying earlier, this executor

  • is this thing that, there is no single executor thread.

  • It just runs on every thread as soon

  • as it finishes executing an op.

  • So it's kind of this highly parallel little monster.

  • So this is it for most of the core TF runtime.

  • I just had a few topics that I couldn't really

  • fit very well that I wanted to cover in this presentation,

  • just to generate some documentation.

  • One is hosts versus device memory.

  • As I hinted at earlier, when you partition a TensorFlow graph,

  • it takes a graph and it spits out N graphs,

  • one graph per device.

  • Each graph per device gets its own executor.

  • But how do you deal with the fact

  • that some GPU ops take CPU tensors?

  • So we make a distinction between--

  • when you specify an input to a kernel,

  • you can say that that kernel expects that input

  • to be in host memory or expects that input

  • to be in device memory.

  • And so in fact, the executors for the GPU device can be,

  • and most of the time are, running a lot of CPU operations

  • and CPU tensors, only they call those tensors GPU tensors

  • in host memory.

  • And so if you look at the TensorFlow code,

  • you might sometimes see things like a distinction

  • between the device a tensor is in

  • and a device its memory is in and a device its operation is

  • in, and this bookkeeping is necessary to avoid

  • mixing these things up.

  • And incidentally, all resource handles are in host memory.

  • But this has a very sad, unintuitive consequence

  • that we need to fix, which I call,

  • and I think other people call,

  • just the "TF int32 problem," which is that most of the things

  • that GPU ops take as host memories are shape related--

  • things like fill and zeros and RandomNormal,

  • they all take a shape and they fill that shape with whatever

  • values you want.

  • But the shapes are not static.

  • They're often computed based on other shapes.

  • The simplest case is when you just use zeros or something

  • like that, where you take a tensor, take its shape,

  • and use that shape to fill in another tensor.

  • But sometimes you're going to reduce

  • some dimensions in the shape, broadcast,

  • do some other things, and TF has this rule where by default, it

  • will place every GPU-capable op on a GPU device.

  • And if you want finer grained control,

  • you just take a large block of code in TF

  • and you wrap it with a tf.device, which

  • also means that every op that happens inside a block of code

  • gets placed on that device.

  • So if you allow TensorFlow to have

  • GPU kernels for int32 tensors, we

  • would keep bouncing these shapes between the GPU and the CPU.

  • So you take a shape and you want to slice it to, like,

  • remove the last dimension.

  • We would copy it to the GPU, remove the last dimension,

  • then copy it back to the CPU and use

  • it to fill the RandomNormal, and that's just sad.

  • So what we did instead in TensorFlow

  • to kind of paper over this, and this

  • has been [INAUDIBLE] because this would create

  • a lot of host-device transfers, and every time you have one,

  • you have to sync the GPU stream and you slow everything down.

  • So to avoid this, we say that for almost every op that

  • has a kernel registered on the GPU, that has int32 inputs

  • and outputs, those are force-placed in host memory.

  • Including things like plus, gather, reductions,

  • and other things that you'd like to use as part of a model.

  • Currently, the work-around is use int64 for types

  • that you actually want to get executed on the GPU

  • and use int32 only for your shapes.

  • We have a fix forthcoming.

  • We have code already that exists in both Grappler for graph mode

  • and the eager placer for eager mode

  • that uses some heuristics and estimates of the cost

  • of transfer versus cost of computation

  • to try to keep small integer computations on the CPU

  • where they belong instead of bouncing them

  • back and forth to the GPU.

  • But there's still some performance regressions

  • that prevent these things from being turned on by default.

  • We expect it to happen very soon.

  • So this is it.

  • If you have any questions?

  • AUDIENCE: Could you maybe comment

  • on the difference between the kernel cache and the caching

  • that we do for functions in the Python layer?

  • ALEX PASSOS: So the Python layer does a caching for functions

  • from the--

  • like, here is a Python function, all these metadata

  • that's completely invisible to the runtime,

  • to a concrete function definition.

  • That-- when you go execute that concrete function definition,

  • we emit our PartitionedCall op, or StatefulPartitionedCall op,

  • that then hits TFE execute.

  • And the first time the op has to execute,

  • there's a lot of initialization that has

  • to happen inside the runtime.

  • That initialization is mostly covered by--

  • a little bit of it's covered by the kernel cache that mostly

  • tries to figure out what device it's going to be in

  • and the details of the attributes

  • and things like that.

  • And then the first time you actually execute it,

  • we'd have the cache inside the FunctionLibraryRuntime, which

  • is the thing that guards Grappler

  • and all the other graph processing transformations

  • we have to do.

  • So it's a few layers of caches to make things fast.

  • I don't really know how it could possibly

  • merge these caches across these layers of the stack,

  • but maybe if we unify more things,

  • this is going to be possible.

  • AUDIENCE: And maybe it might be worth

  • also looking if you can comment a little bit about--

  • you said the kernel cache is something

  • that is very unique to the eager execution versus something

  • that we'd have in graph mode?

  • ALEX PASSOS: No, we already had the kernel cache in graph mode.

  • I mean, it has a slightly different kernel

  • cache in eager execution, but we already needed a kernel cache

  • in graph mode, too, because creating a kernel

  • might require allocating memory, which might be expensive.

  • It definitely requires allocating a vtable pointer.

  • But in some kernels, you have to allocate a lot more than that.

  • AUDIENCE: We just had some cases where

  • there was memory bloat because of the kernel cache.

  • Is there anything that users need to be aware of for that,

  • or is something that the runtime needs to--

  • ALEX PASSOS: Hopefully our users should not be aware of it.

  • There are some cases now where you have to be aware of it,

  • like if you're using the V1 random number generator ops,

  • the only way to reset the random seed

  • requires resetting the kernel cache because state

  • is kept in the kernels.

  • As we move to the V2 random number ops,

  • we don't make this type of mistake,

  • and so the space overall taken by the kernel cache

  • should become much smaller.

  • I also think for the particular case of functions,

  • we should be able to garbage collect

  • that cache a little better.

  • Yeah.

  • OK.

  • Oh, yes.

  • AUDIENCE: So when XLA is enabled,

  • we don't have this shape int32 problems anymore, right?

  • ALEX PASSOS: XLA has a different notion of kernel

  • and computation, so by giving up the possibility

  • of having any dynamic shapes at all, it can effectively--

  • XLA only works when you can ConstantFold all the shape

  • tensors away, and then it doesn't

  • matter if you ConstantFolded them on the GPU

  • or ConstantFolded them on the CPU, because they're not there at runtime.

  • This is only a problem when you have runtime dynamic shapes.

  • AUDIENCE: Can you comment on the garbage collection

  • in the runtime?

  • ALEX PASSOS: What part--

  • AUDIENCE: In terms of what is garbage collected

  • and what is that garbage--

  • ALEX PASSOS: Most of our runtime is ref counted, not

  • garbage collected.

  • AUDIENCE: Oh, ref counted.

  • ALEX PASSOS: Yeah.

  • If you-- there's a class in TensorFlow

  • called RefCounted, in core/lib/core/refcount.h.

  • It's a base class for a ref counted pointer,

  • and we use that instead of SharedPtr

  • because it's a smaller memory footprint

  • and it has better cache locality behavior.

  • So you should be able to just read that and find

  • the places of the runtime that inherit from it,

  • and you can see the things which are ref counted.
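
A small usage example of that base class; Buffer here is a hypothetical class, while core::RefCounted and its Ref/Unref methods are the real API from refcount.h.

```cpp
#include <cstddef>

#include "tensorflow/core/lib/core/refcount.h"

namespace example {

// Objects inherit from core::RefCounted instead of living in a shared_ptr:
// one intrusive counter, a smaller footprint, and better cache locality.
class Buffer : public tensorflow::core::RefCounted {
 public:
  explicit Buffer(size_t bytes) : bytes_(bytes) {}
  size_t size() const { return bytes_; }

 private:
  ~Buffer() override = default;  // destroyed by the final Unref()
  size_t bytes_;
};

inline void Demo() {
  Buffer* b = new Buffer(1024);  // starts with a refcount of 1
  b->Ref();                      // another owner takes a reference
  b->Unref();
  b->Unref();                    // count hits zero; the buffer deletes itself
}

}  // namespace example
```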

  • AUDIENCE: But we currently have no garbage collection

  • for the kernel cache.

  • ALEX PASSOS: The kernel cache is not garbage collected, correct.

  • But almost everything that can be ref counted already

  • is ref counted.

  • A few things, like the kernel cache, are not, because it--

  • ref counting caches feels weird.

  • But in some cases, like when you're

  • caching the kernel for a function,

  • it actually makes sense to ref count it.

  • AUDIENCE: Can we put a limit on that kernel cache?

  • ALEX PASSOS: In principle we can do, yes.

  • It's, you know, memory versus performance tradeoff.

  • Assuming we are not dealing with the v1 random number ops,

  • because those, if they are evicted from the cache,

  • you now change the sequence of random numbers you would get

  • and that's pretty bad.

  • OK, thank you.

  • [APPLAUSE]

  • [MUSIC PLAYING]
