
  • SHANQING CAI: We're going to do this presentation about how

  • to debug TensorFlow programs.

  • We're going to focus specifically on TF 2,

  • because TF 2 is the stable release,

  • and it will have long-term support going forward.

  • But there are also places where we're going to mention TF 1.

  • And when we do, we'll make that clear

  • so you know which version of TensorFlow we're talking about.

  • So first of all, I want to define a scope of debugging.

  • And the reason why you should do this

  • is because the word "debugging" is an overloaded term

  • in machine learning.

  • Different people use it to refer to different things,

  • sometimes in confusing ways.

  • So in the scope of this talk, debugging

  • refers to specific things, really,

  • that mainly have to do with the correctness of your TensorFlow

  • program, like mathematical implementation bugs.

  • For example, when you are implementing a new [INAUDIBLE]

  • type or a new loss function, you may

  • run into DType issues or shape issues, or just

  • straight bugs in the math.

  • And the techniques we'll cover will also

  • be useful for debugging the pre-processing and

  • post-processing parts of your TensorFlow program.

  • And one big focus will be the debugging

  • of the issues like NaN and infinity

  • in our models, which happen very frequently during TF model

  • training.

  • [INAUDIBLE] will talk about a specific tool

  • called TensorTracer, which is very useful for catching

  • the root cause of NaNs and infinities

  • on TPUs and other devices.

  • And we're not going to talk about how

  • to debug bugs in op kernels themselves or bugs in hardware,

  • because it's more specific to the hardware

  • and for the op kernel that you're using.

  • However, the methods we'll cover will be useful for you

  • to debug models that are affected by those kernel

  • or hardware bugs.

  • At least, it will be useful for you to narrow down

  • the cause of the model bug to the level

  • of op kernels or hardware.

  • And the tools and techniques we'll cover

  • will also be useful in case you want to just peek

  • into a model to understand what's going on.

  • And so one example would be answering a question like,

  • why is my model making a wrong prediction on a [INAUDIBLE]

  • for example?

  • So you will be able to peek into the model

  • and look at the layer activations

  • and the intermediate tensors and answer that question.

  • So one use case that's kind of relevant

  • to that is when you are porting model

  • from one version of the library to another,

  • or from one library to another, like from TensorFlow to TFLite,

  • or from TensorFlow to TF.js, or from TensorFlow to PyTorch,

  • or from PyTorch to TensorFlow.

  • You will often see divergence between the two implementations

  • of the same model, and you want to quickly identify the root

  • cause of the divergence.

  • And the tools and techniques we'll cover

  • will also be useful for that purpose.

  • What we're not going to cover, however,

  • are the debugging cases like debugging the model performance

  • and looking at the accuracy of models

  • after training, like model evaluation and analysis.

  • We're not going to cover how to debug fairness

  • issues in models, either. Those are also

  • important kinds of TensorFlow debugging,

  • but those are outside the scope of this talk.

  • There are great tools for those, like some dashboards

  • in TensorBoard, the What-If Tool, and Fairness Indicators,

  • and so forth.

  • But I'm not going to talk about those here.

  • Any questions so far?

  • OK.

  • So here's a brief outline of this presentation.

  • We're going to talk about how to debug tensor values.

  • We're going to talk about how to look at the placement of ops

  • on different devices, like CPUs and GPUs.

  • It's a very commonly asked question.

  • We're going to look at how to debug the structures of graphs

  • in TensorFlow 2, including the graphs from tf.function

  • and graphs that are optimized for the runtime.

  • And then in section 4, we're going

  • to cover the special topic of step debugging, which

  • is to use an IDE to step over your code line by line.

  • And then in the fifth section, we're

  • going to move from low-level API to high-level API.

  • And the specific high-level API I will

  • focus on is tf.keras, because tf.keras

  • is the official high-level API in TF 2,

  • and also because I'm not as familiar

  • with other high-level APIs like [INAUDIBLE] and so forth.

  • And in section 6, we're going to talk

  • about the debugging of numerical issues like NaNs and infinity.

  • And finally, I'm going to present

  • the work on TensorFlow Debugger, including the existing V1

  • features and the V2 features that we're currently

  • working on.

  • So first, let's take a look at how to debug tensor values.

  • So here's a very simple example.

  • So it's very straightforward.

  • You are not decorating your functions

  • with tf.function decorator.

  • So everything is executed eagerly in TF 2.

  • And there you can simply use the print statements

  • to look at the values of tensors,

  • like in this simple example here.

  • So y is an eager tensor.

  • If you do print, you will see the value

  • in the stdout printout.

  • So it's very similar to print the value of the numpy arrays,

  • with the caveat that if the tensor lives on the GPU,

  • then printing it will involve copying from the GPU

  • to your host, which may be a little bit costly.
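
To make this concrete, here is a minimal sketch (not the talk's exact Colab code) of printing an eager tensor:

    import tensorflow as tf  # TF 2.x, eager execution by default

    x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
    y = tf.square(x)  # y is an EagerTensor
    print(y)          # prints the values, much like printing a numpy array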

  • And oftentimes, when the size of the tensor is too big,

  • you probably don't want to look at the entire tensor,

  • because there are going to be millions of elements.

  • What you sometimes want is to do a reduce operation

  • on the tensor and then look at some numerical summaries

  • of the tensor, like what's the minimum value, what's

  • the maximum value, and what's the mean.

  • It's also a useful technique that's

  • fully supported in eager mode.
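
For instance, a minimal sketch of such numerical summaries, reusing the tensor y from the sketch above:

    print("min :", tf.reduce_min(y).numpy())
    print("max :", tf.reduce_max(y).numpy())
    print("mean:", tf.reduce_mean(y).numpy())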

  • So EagerTensor str and repr methods

  • use numpy under the hood, which means

  • that you can use the set_printoptions function

  • from numpy to control the details of how

  • the values of tensors are printed.

  • For instance, if you use the precision [INAUDIBLE]

  • arguments, you can adjust basically

  • how many decimal places are printed for float-type

  • tensors.

  • You can also adjust the threshold element count

  • beyond which ellipses will be used in the printing, which

  • is useful for cases where you want

  • to look at the values of huge tensors,

  • like thousands of elements.
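
A minimal sketch of both options (the precision and threshold values here are arbitrary, not the talk's):

    import numpy as np

    np.set_printoptions(precision=3,   # decimal places for float tensors
                        threshold=10)  # element count beyond which ellipses appear
    print(tf.random.uniform([1000]))   # prints an abbreviated view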

  • Of course, the story is not always as straightforward

  • as this.

  • The program is often not executing purely eagerly.

  • And sometimes you have tf.function,

  • sometimes your function is decorated by TensorFlow itself

  • and converted into a graph.

  • So there, if you use the print statements,

  • then the results of the printing may not

  • be what you would expect.

  • So here the user intends to look at the value of n

  • at each iteration of the while loop.

  • So the user puts a print statement here naively.

  • And the result has only one printed line,

  • even though the while loop is executed multiple times.

  • And the contents of the printed text

  • is also kind of confusing to a naive user.

  • And the reason is that when tf.function is used,

  • the code here is transformed into a tf.Graph.

  • And the print statement gets executed

  • during that function-to-graph transformation.

  • And the contents you see here is actually

  • a node in the graph instead of the value

  • of the tensor at runtime.

  • So the correct approach is to use tf.print.

  • So tf.print will modify the graph.

  • It will actually add a couple of nodes to the graph

  • so you can look at the value of the n inside the tf.function.

  • So here at the bottom, you can see that the value of n

  • at each iteration of the while loop is printed.

  • So it's quite useful.
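
The slide's code isn't reproduced here, but a minimal sketch in the same spirit, with tf.print inside a tf.function while loop, might look like this:

    @tf.function
    def countdown(n):
      while n > 0:
        tf.print(n)  # executes at graph runtime, once per loop iteration
        n -= 1
      return n

    countdown(tf.constant(3))  # prints 3, 2, 1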

  • So here is a homework problem for you.

  • So the examples I've shown so far are all for simple tensors,

  • like float32 tensor or an integer tensor.

  • What if the tf.print statement is

  • used on a ragged tensor or a sparse tensor?

  • So those are the major composite tensor types in TensorFlow.

  • So you can try that.

  • By the way, I inserted a link to the Colab Notebook

  • for all the code examples in this presentation.

  • So you can look at the slides.

  • And if you want to play with the code examples,

  • you can use that Notebook.

  • So here is a second homework problem.

  • You can use the code to see what happens if you use tf.print

  • on a sparse tensor.

  • OK.

  • So sometimes, the user doesn't want to just print

  • the value of the tensors.

  • The user wants to programmatically extract

  • the value of the tensors so they can

  • be used for like programmatic debugging

  • or downstream computation.

  • This code snippet here shows how you can pull out

  • intermediate tensors from a toy implementation of a TF Dense

  • layer.

  • So the function originally returns only the final outputs

  • of the Dense layer.

  • But maybe for some reason you want

  • to look at the intermediate steps,

  • like the results of the matmul or the results

  • of adding with a bias.

  • So what you can do here is you can actually

  • append these two tensors to the return

  • values of the tf.function.

  • And then you'll be able to access

  • these intermediate values when you call the layer.
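
A minimal sketch of the idea; the toy Dense-layer body below is an assumption, not the slide's exact code:

    @tf.function
    def toy_dense(x, w, b):
      matmul_out = tf.matmul(x, w)    # intermediate tensor of interest
      bias_add_out = matmul_out + b   # another intermediate tensor
      out = tf.nn.relu(bias_add_out)
      # Append the intermediates to the return values to expose them:
      return out, matmul_out, bias_add_out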

  • What's slightly more complicated is

  • if those tensors are inside the control flow structures.

  • So for instance, if you want to programmatically

  • access the value of n at every iteration of the while loop,

  • you can't simply just add n to the return value here.

  • What you need to do here is to use tf.TensorArray and then

  • append to that tf.TensorArray at each iteration of the while

  • loop.

  • And then you'll be able to see the full history of how n changes.

  • It's slightly complicated.

  • So the TensorFlow Debugger tool I

  • will present at the end of this talk hopefully

  • will make this simpler.
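
In the meantime, here is a minimal sketch of the tf.TensorArray technique, using a simple countdown loop as an assumed stand-in for the talk's example:

    @tf.function
    def countdown_with_history(n):
      history = tf.TensorArray(tf.int32, size=0, dynamic_size=True)
      i = tf.constant(0)
      while n > 0:
        history = history.write(i, n)  # record n at each iteration
        i += 1
        n -= 1
      return n, history.stack()        # the full history of how n changed

    _, values = countdown_with_history(tf.constant(3))
    print(values)  # [3 2 1]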

  • So having covered tensor values, I'm

  • going to talk about how to debug the placement of ops

  • on different devices, mainly CPUs and the GPUs.

  • So it's a very frequently asked question,

  • because the users want to make sure

  • that their heavy computation is computed on the GPU,

  • not on the CPU.

  • So again, if the program is running purely eagerly,

  • then it's pretty straightforward.

  • You can just call an API called

  • tf.debugging.set_log_device_placement(True).

  • And then when those operations are executed eagerly,

  • you will see lines being printed to the stdout.

  • For instance, when the multiplication operation is

  • run, you will see a line that tells you that the Mul

  • op is running on a GPU.

  • And when the print statement here is running,

  • it's actually running a Print-V2 op on the CPU.

  • So here you can see clearly where the ops are running,

  • whether it's on CPU or GPU, and if you have multiple GPUs,

  • which GPU is running an op.

  • So one thing you need to know here

  • is that it's only going to log information when the op is

  • placed for the first time.

  • If you have the same op executing multiple times

  • eagerly, it's not going to print it multiple times.

  • So that mechanism prevents a flood of information

  • to your stdout.
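
A minimal sketch of eager device-placement logging (the specific ops are illustrative):

    tf.debugging.set_log_device_placement(True)

    a = tf.random.uniform([2, 2])
    b = tf.matmul(a, a)  # stdout logs which device the MatMul op is placed on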

  • So a more realistic scenario is where you

  • have tf.functions and graphs.

  • So there, set_log_device_placement(True)

  • will still work.

  • You will see not only the placement

  • for the eager ops, like in the green box here,

  • but you will also see the placement

  • of the graph ops in the purple box on the bottom here.

  • But one caveat here is that you need

  • to be aware that, even though the eager lines are printed

  • to stdout, currently the graph logs are printed to the info log.

  • The implication of that is that if you're

  • doing this in a Jupyter Notebook or Colab,

  • then you will not be able to see the bottom parts of the log.

  • But there is actually a way in Colab

  • to capture the log so you can see both.

  • It's just something you need to be aware of.

  • So in the graph placement logs, the text inside the parentheses

  • are for the op type, and the text

  • outside the parentheses to the left of the parentheses

  • are for the name of the op.

  • So here are some other important things

  • to know about set_log_device_placement.

  • So it works for both eager operations

  • and for graph operations.

  • But they're logged to different places,

  • as I mentioned.

  • And also, be aware that the fact that an op is logged at graph

  • construction time does not guarantee that the op will

  • be executed at runtime.

  • And that's because TensorFlow has its built-in graph

  • optimization step called Grappler,

  • and Grappler may change the placement,

  • or it may prune the op away from the graph,

  • or it may fuse the op into a larger op, and so forth.

  • I'm going to talk about that in a coming slide.

  • And also, be aware that set_log_device_placement

  • currently does not work fully for TPUs.

  • So it's mainly useful for debugging CPUs and GPUs

  • currently.

  • OK.

  • So I'm going to move on to the section about debugging graph

  • structures.

  • So here you have a tf.function.

  • And then how do you look at the graph of the compilation

  • of that tf.function?

  • So the answer to that is to use the method

  • called get_concrete_function on that function object.

  • So get_concrete_function should be

  • called only after that tf.function is

  • called for the first time.

  • If you call that before the function is compiled,

  • then that method will not even exist.

  • And when you call get_concrete_function,

  • you need to pass an argument.

  • So the argument can be the same argument

  • as you pass when you are calling the function.

  • And the reason why you need to pass that argument

  • is because the same Python function

  • can be compiled into different tf.Graphs,

  • depending on the DTypes and shapes of the [INAUDIBLE]

  • arguments.

  • You can also pass tf.TensorSpecs as the arguments.

  • And the return value of get_concrete_function

  • here is an object that has a graph property.

  • The graph property is a tf.Graph on a Python level.

  • To see its structure, you can call the as_graph_def

  • method on the graph object.

  • And the returned value is a text proto

  • for the graph, as shown on the right here.

  • So the text proto here is basically

  • a repeated field of nodes.

  • It tells you which nodes there are in the graph

  • and how they're connected to each other.

  • So there are properties like name, and op, and attributes,

  • and so forth.

  • So if you're not familiar with the format,

  • you should spend some time looking at some examples,

  • because it's very critical, very important for TensorFlow code.
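
A minimal sketch of the workflow just described (the function itself is illustrative):

    @tf.function
    def f(x):
      return tf.square(x) + 1.0

    f(tf.constant(2.0))  # call once so the function gets traced
    cf = f.get_concrete_function(tf.constant(2.0))
    print(cf.graph.as_graph_def())  # GraphDef printed as a pbtxt text proto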

  • But the important thing here is that, for any realistic models

  • and realistic tf.functions, the size of the graph

  • is going to be too big.

  • It's not going to be friendly to read as text.

  • And that's why graphical tools like TensorBoard

  • will be important.

  • So you can start TensorBoard, the binary.

  • So even if you have an empty logdir,

  • you can still switch to the Graph Dashboard

  • by using the dropdown menu.

  • And then inside the Graph Dashboard,

  • you should be able to see a button called Choose File.

  • And you can use Choose File to upload the contents

  • of the pbtxt of the graph.

  • And then TensorBoard will be able to show you

  • the structure of the graph, as shown in this example here.

  • So some important properties to know about TensorBoard's Graph

  • Dashboard are that the information flow is generally

  • from bottom to top.

  • So the inputs are usually on the bottom.

  • And at the top, you're seeing the outputs.

  • And also, TensorBoard's graph visualizer will group nodes

  • by name scope by default. So that's the reason why you often

  • see those big, colorful boxes.

  • And you can double-click each box to expand into it.

  • So it's quite handy for debugging large models.

  • And it also handles FunctionDefLibrary correctly.

  • So FunctionDefLibraries are basically nested graphs.

  • So it's used frequently in TF 2, like in control flow

  • structures.

  • So a TF 2 while loop will contain two functions, like one

  • for the condition of the while loop

  • and the other for the body of the while loop.

  • Those are also color boxes that you can

  • double-click to expand into.

  • If you're Google internal, then you

  • should be able to use a special import from Colab.

  • And that will enable you to look at the graph inside the Colab

  • Notebook.

  • So I find that to be slightly handier than looking

  • at graph structures in TensorBoard itself,

  • because that means I don't have to switch

  • back and forth between two different tabs of your browser.

  • So here is an example.

  • So, as we mentioned before, you can append tensors

  • to the return list of the tf.function

  • to access intermediate tensors.

  • And in this graph being visualized by our TensorBoard,

  • you can see two extra identity nodes that correspond

  • to the added return values.

  • And that's because TF 2 currently

  • uses identity nodes to mark the return values of tf.functions.

  • So here's a graph for a function that's

  • slightly more complicated.

  • So it involves control flow structures,

  • including while loops and if-else conditions.

  • So these are the boxes that are the FunctionDefLibraries

  • that I mentioned before.

  • So you can see a box for the true branch

  • of the if-else condition.

  • You can see another box for the false branch

  • of the if-else condition.

  • And the box here in red is the condition of the while loop,

  • and so forth.

  • So if you are very careful and if you spend some time,

  • you can see the correspondence between these ops in the graph

  • and also the operations in the Python code.

  • But that's in general how to do it.

  • And it requires an expertise to see

  • the correspondence between the graph nodes

  • and the Python operators or Python functions.

  • So that's one gap where the TF Debugger tool that I

  • will talk about tries to fill.

  • So in TF Debugger tool, you will be

  • able to look at the graph structures and the source code

  • side by side.

  • So it will be easier for you to establish

  • the correspondence between the Python functions or Python

  • operators and the nodes of the graph.

  • OK.

  • So what if the function is not executing on a single device

  • but it's executing on multiple devices or multiple hosts

  • using distribution strategy?

  • So before I talk about that, I'm going

  • to tell you about a useful API for mocking out

  • virtual devices.

  • So for instance, if you have only one physical GPU

  • on your machine, and you want to do

  • some testing or some debugging on a distribution strategy that

  • involves four different GPUs, then you

  • can use the API called set_virtual_device_configuration

  • to create four logical GPUs.

  • And you can use the API called list_logical_device to confirm

  • that.

  • It's a very useful technique for testing and debugging

  • TensorFlow functions that involve multiple devices.
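
A hedged sketch of that setup, using the TF 2.1-era experimental APIs named in the talk (later TF releases moved these out of tf.config.experimental):

    # Split one physical GPU into four logical GPUs.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
      tf.config.experimental.set_virtual_device_configuration(
          gpus[0],
          [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)
           for _ in range(4)])
    print(tf.config.experimental.list_logical_devices('GPU'))  # expect 4 entries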

  • So once we have set the four logical GPUs,

  • we can use MirroredStrategy to basically create

  • a variable on the four GPUs.

  • And we can construct a function that

  • will basically operate on

  • that variable on the four GPUs.

  • So the function here, dist_f,

  • is the function that involves the replication.

  • And you can use the get_concrete_function method

  • as before to look at graph structure.

  • So you can upload the graph pbtxt

  • to TensorBoard to see its structure.

  • And in the structure, you can see four boxes.

  • And those four boxes correspond to the four GPUs.

  • So the technique here is useful for debugging graphs

  • in mirrored strategies and other distribution strategies.
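
A hedged sketch of the MirroredStrategy setup described; experimental_run_v2 is the TF 2.1-era name for what later became strategy.run:

    strategy = tf.distribute.MirroredStrategy()  # picks up the four logical GPUs
    with strategy.scope():
      v = tf.Variable(1.0)

    @tf.function
    def dist_f():
      return strategy.experimental_run_v2(lambda: v * 2.0)

    cf = dist_f.get_concrete_function()
    print(cf.graph.as_graph_def())  # upload this pbtxt to TensorBoard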

  • So this slide here shows you how tf.print

  • works in terms of the graph.

  • So each time you call tf.print inside a tf.function,

  • it will append a pair of nodes to your graph.

  • So the first node here will convert your tensor, the input

  • tensor, into a string.

  • And the second one will actually print that string

  • to stdout or the info log or whatever output stream

  • the printout is configured to.

  • So here's an example for you.

  • It's also available in the Colab Notebook,

  • so you can play with it a little.

  • It is very interesting.

  • So the question here is, what happens if there is no return

  • value from the function?

  • So I forgot to mention that the reason why

  • these PrintV2 ops get executed at runtime is because they

  • are attached as control dependencies

  • to the final output identity node of the graph.

  • So these correspond to the dashed lines in the graph.

  • So the homework problem is about finding out

  • how the print op gets executed when the tf.function does not

  • involve a return value.

  • So when you use tf.print, you need

  • to be aware that it may inadvertently

  • change how graph optimization works at runtime.

  • So in the code snippet on the left here,

  • we're computing the harmonic mean of a tensor.

  • However, there is a line in the code which constructs an op.

  • But the output tensor of that op, which is a Min op,

  • does not feed into any downstream computation.

  • Now, when the tf.function is actually run at runtime,

  • Grappler is going to do its job, and it's

  • going to prune out that Min op.

  • So the Min op will not actually get executed at runtime.

  • However, if you use tf.print, you

  • will change the optimization.

  • And you're basically going to attach the output

  • tensor of the Min op to the StringFormat op that feeds the PrintV2,

  • as I mentioned before.

  • But the important thing to note here

  • is that if you use the get_concrete_function

  • method to debug the graph structure,

  • you will always see a Min op.

  • And that's because at the Python level,

  • TensorFlow AutoGraph only faithfully converts the Python

  • function into a graph.

  • It's not trying to do any optimization.

  • Instead, it will hand the graph to Grappler

  • for downstream optimization.

  • So the question here is, how can we

  • debug the optimized graph that are generated by Grappler?

  • So that leads us to the next section.

  • So in order to see the Grappler output graph,

  • you need to use a Bazel build.

  • So when you call this, you need to specify an environment

  • variable called TF_DUMP_GRAPH_PREFIX

  • and point it to any directory you have write access to.

  • And then you have to specify the flag called vmodule.

  • So that tells the meta_optimizer,

  • which is a part of Grappler, to be verbose and dump information

  • to the folder.

  • And after the program runs, you will see

  • a bunch of files in the folder.

  • So those are the outputs from each pass of Grappler.
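
Illustratively, the setup looks roughly like this. TF_DUMP_GRAPH_PREFIX is the variable named in the talk; using TF_CPP_VMODULE to get vmodule-style verbosity outside a Bazel-built binary is an assumption on my part:

    import os

    # Both must be set before TensorFlow is imported/initialized.
    os.environ['TF_DUMP_GRAPH_PREFIX'] = '/tmp/graph_dumps'  # any writable directory
    os.environ['TF_CPP_VMODULE'] = 'meta_optimizer=4'        # assumed --vmodule equivalent
    # After the program runs, /tmp/graph_dumps holds one file per Grappler pass;
    # the 'after_MetaOptimizer...' file is the final runtime graph.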

  • So Grappler performs graph optimization in steps,

  • kind of like a compiler.

  • So the final output, which is usually

  • called after_MetaOptimizer something,

  • is usually the graph of interest.

  • So it will tell you the structure of the graph that

  • gets executed at runtime.

  • So using the technique here, you will

  • be able to compare the runtime graph of the two code snippets

  • that we have seen before.

  • So in the code snippet on the left,

  • and also the graph on the left, you

  • see that the Min op is not present,

  • because it's pruned away by Grappler.

  • However, in the code snippet on the right here,

  • and also the graph on the right here,

  • you can see that the Min op is present.

  • And it feeds input into the two ops

  • that correspond to tf.print.

  • So as you can see, the process here

  • is convoluted and complicated.

  • So TF Debugger will try to present both the Python graph

  • and the runtime graph to you, so you

  • don't have to do any Bazel builds or any special flags

  • or environment variables.

  • OK.

  • So now let's talk about the interesting topic

  • of step debugging.

  • So by step debugging, I mean using

  • a Python IDE or [INAUDIBLE] to step over lines

  • of the source code one by one.

  • Some people prefer that over print debugging.

  • So the useful API here if you want to do step debugging

  • is tf.config.experimental_run_functions_eagerly.

  • So if you call that function with the argument

  • True, then you're basically telling tf.function

  • to not compile functions into graphs.

  • And all the code here will run eagerly.

  • And you will be able to use either print,

  • or you can use step debugging, or can use breakpoints

  • in your favorite IDE.

  • But one important caveat you want to keep in mind

  • is that it works for all cases except tf.data.

  • Because tf.data works in a special way,

  • it always converts Python functions into graphs

  • before it runs them.

  • So I'm going to show an example for that in an upcoming slide.
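
A minimal sketch of forcing eager execution for step debugging (the TF 2.1-era API name; later releases renamed it tf.config.run_functions_eagerly):

    tf.config.experimental_run_functions_eagerly(True)

    @tf.function
    def f(x):
      y = x * 2.0  # with eager mode forced, an IDE breakpoint here is hit normally
      return y + 1.0

    f(tf.constant(3.0))
    tf.config.experimental_run_functions_eagerly(False)  # restore graph compilation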

  • This slide here shows an example of using

  • VSCode to do step debugging on a tf.function

  • after you call experimental_run_functions_eagerly(True).

  • However, if you don't call that function--

  • I mean, if you don't call experimental_run_functions_eagerly,

  • or if you call it with False,

  • then it's not a good idea to step debug your tf.function.

  • And for some IDEs, if you add a breakpoint,

  • it's not even going to hit that breakpoint.

  • And in other IDEs, it will hit that breakpoint,

  • but the stepping pattern after that breakpoint

  • will be very confusing.

  • And the reason for that has to do with the internal details

  • of how AutoGraph works.

  • And I refer you to the presentation

  • made by Dan Moldovan on AutoGraph last year.

  • I think it's publicly available.

  • So understanding that, it will probably not

  • be too hard for you to understand

  • why this strange behavior is happening here.

  • So the strange behavior is that you're inserting a print

  • statement in both branches of the if-else condition,

  • and you see that when the function is called,

  • both branches get executed.

  • Yeah.

  • So the slide here shows you an example

  • in which experimental_run_functions_eagerly

  • does not work on a map function that you pass to a tf.data.Dataset.

  • So even if you comment out the tf.function decorator

  • for the to_multi_hot function, it's still

  • going to be converted into a graph,

  • and it will run in a graph fashion

  • instead of running eagerly.

  • So in order to debug intermediate tensors

  • inside a function, you must use tf.print.

  • If you do print, you're only going

  • to print the symbolic tensors in the graph.

  • But TensorFlow Debugger will also

  • make it easy for you to debug the values inside

  • a tf.function passed to a tf.data.Dataset.

  • OK.

  • So so far, we have been talking about how

  • to debug low-level constructs of TensorFlow, including ops,

  • and tensors, and graphs.

  • But many users also use high-level APIs like tf.keras.

  • And then they also want to peek into their models.

  • So in the following slides, I'm going

  • to talk about some tools and techniques

  • available for debugging Keras models.

  • So one very frequently asked question

  • is, how do I get the intermediate layer

  • outputs, meaning the intermediate layer activations,

  • from a tf.keras model?

  • So one way to do it is to construct

  • a second model, which is the debug_model in the example

  • here.

  • The second model has the same inputs as the original model.

  • But the outputs will be the original model's output

  • plus the outputs from the layers you're interested in.

  • And then when you call debug_model.predict or simply

  • call debug_model as a tf.function,

  • you'll be able to see not only the final output of the Keras

  • model, but also the intermediate outputs.
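
A minimal sketch of the debug-model trick, assuming a small functional-API model (the layer names are illustrative):

    inputs = tf.keras.Input(shape=(4,))
    x = tf.keras.layers.Dense(8, activation='relu', name='hidden')(inputs)
    outputs = tf.keras.layers.Dense(1, name='out')(x)
    model = tf.keras.Model(inputs, outputs)

    # Same inputs; outputs = original outputs plus the layer of interest.
    debug_model = tf.keras.Model(
        inputs=model.inputs,
        outputs=model.outputs + [model.get_layer('hidden').output])

    final_out, hidden_out = debug_model(tf.random.uniform([2, 4]))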

  • So this approach is useful to look at the final outputs

  • of each layer.

  • If you want to have the intermediate tensors

  • inside each layer, it's not that useful.

  • You have to use the tf.print method or the other techniques

  • mentioned in earlier parts of the presentation.

  • And TF Debugger will also make it easier

  • for you to look at layer internal tensors.

  • So one other useful thing to know

  • when you are debugging a tf.keras model

  • is to use the TensorBoard callback.

  • So the TensorBoard callback, which

  • is under the tf.keras.callbacks namespace,

  • is a callback you can pass to your model.fit.

  • What it will do is it's going to log loss functions

  • and metrics to the logdir when the model is training.

  • But for debugging purposes, it will also

  • log the graph of the model to the logdir.

  • So you can just open the Graph Dashboard of TensorBoard

  • and look at the graph structure.
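
A minimal sketch of wiring up the callback, reusing the small model from the sketch above (the logdir and training data are illustrative):

    import numpy as np

    x_train = np.random.rand(128, 4).astype('float32')
    y_train = np.random.rand(128, 1).astype('float32')

    tb = tf.keras.callbacks.TensorBoard(log_dir='/tmp/logs')  # logs metrics + graph
    model.compile(optimizer='adam', loss='mse')
    model.fit(x_train, y_train, epochs=2, callbacks=[tb])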

  • So there you see that the layers of the model

  • are organized in those boxes that you

  • can double-click to expand.

  • And that's thanks to the work done

  • by the authors of tf.keras, who have

  • been very careful in specifying the correct name

  • scopes for each layer.

  • But the other important and useful thing to note

  • is that the tensors are marked as those arrows that

  • connect the layers.

  • And if you look carefully, you can see those very small fonts.

  • Those small fonts are the shapes of the tensors to the extent

  • known at graph construction time.

  • So for instance, the dropout layer

  • here outputs a two-dimensional tensor

  • of size question mark times 5.

  • So the question mark is the undetermined batch dimension.

  • OK.

  • So having covered high-level API debugging,

  • let's move on to the next section,

  • which is about how to debug NaNs and infinities at runtime.

  • So that's a very frequently occurring debugging task

  • in TensorFlow, and probably accounts for about half

  • of the questions that we get asked.

  • So before we talk about the tools,

  • I want to show you some common causes for NaNs and infinities

  • in TensorFlow models.

  • So they can be caused by a lack of value clipping.

  • Like when you have a division operation in your TensorFlow

  • program, like some sort of normalization,

  • if you forget to add an epsilon or very small

  • positive value to your denominator,

  • it's likely to run into infinities at runtime,

  • especially in the face of variability of input stream

  • data.

  • And that also applies to the tf.math.log operation.
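
For example, a minimal sketch of the epsilon guard (the epsilon value is arbitrary):

    epsilon = 1e-8
    x = tf.constant([0.0, 1.0, 2.0])

    normalized = x / (tf.reduce_sum(x) + epsilon)  # guards against division by zero
    safe_log = tf.math.log(x + epsilon)            # guards against log(0) -> -inf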

  • And sometimes, NaNs and infinities

  • in your TensorFlow model can be caused

  • by bugs in op kernels themselves or even in hardware.

  • So for instance, we have seen a bug recently

  • that involves [INAUDIBLE] kernel on TPUs outputting infinities

  • and NaNs, even when the inputs are totally valid.

  • And sometimes, the NaNs and infinities

  • can be also caused by exploding gradients, especially when

  • your learning rate is too high.

  • So in that case, to quote the famous meme--

  • just keep calm and decrease the learning rate.

  • And then as [INAUDIBLE] will tell us about,

  • sometimes the NaNs and infinities

  • can also be caused by problematic training examples.

  • So debugging the root cause of NaNs and infinities

  • is different from the print debugging and the graph

  • structure debugging we have talked about in earlier

  • parts of the presentation.

  • And that's because, to find the root cause of NaNs

  • and infinities, you don't know where to look,

  • because that's exactly what you're trying to find out.

  • You could insert tf.print statements

  • to every single tensor in your model.

  • But it's not going to work for a realistic model, which

  • can include up to tens of thousands of tensors.

  • So that's why we need specialized tools

  • to help you debug the root cause of NaNs and infinities.

  • So I'm going to present two tools here.

  • The first one is a new API.

  • It's called tf.debugging.enable_check_numerics.

  • So it's a relatively new API.

  • It just came into existence in TF 2.1,

  • which was released about a month ago.

  • So what the API does here is that you can simply

  • add one line of code to your TF program.

  • And when the TF program runs, like when the model trains,

  • it's going to check every floating-type tensor in your TF

  • program, including the eagerly computed tensors and tensors

  • inside graphs and tf.functions.

  • And as soon as any floating-type tensor contains

  • NaNs or infinities in its output,

  • then the program will error out with a helpful error message,

  • as the one shown on the right here.

  • So the error message here contains

  • a bunch of useful information for debugging, including

  • the name of the op, the runtime DType

  • and the shape of the tensor, as well as a stack trace.

  • So we know that the stack traces from TensorFlow error messages

  • are usually very verbose and hard to understand.

  • And the API here tries to infer, or tries

  • to guess, which frames of the stack trace

  • correspond to the user's own program.

  • And it highlights those frames with an arrow.

  • So hopefully it will be easier for you

  • to find the important frames in the stack trace.

  • And the API here is also general in the sense

  • that it works for both forward pass and backward pass.

  • It works for low-level API and high-level APIs,

  • including Keras.

  • It also works if you are stuck with an old TF 1 API.

  • And it should work on CPU and GPU and TPU.
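
A minimal sketch of the API in use (the failing computation is illustrative):

    tf.debugging.enable_check_numerics()

    @tf.function
    def f(x):
      return tf.math.log(x)  # log(0.0) produces -inf

    f(tf.constant([1.0, 0.0]))  # errors out with op name, dtype/shape, stack trace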

  • So one question you might want to ask

  • is, what's the performance overhead of this?

  • And it's an important question, because to find the root

  • cause of NaNs and infinities, the overhead

  • needs to be as low as possible.

  • Sometimes, the NaNs and infinities

  • don't happen until like a few hours or even a few days

  • into training.

  • So thanks to the work of an intern, Anthony,

  • the overhead of this API is low.

  • So we have benchmarked the API on a bunch of models.

  • So here's an example from the transformer v2 model.

  • When it's training on the CPU, if you

  • enable [INAUDIBLE] check, then you're going to get about 30%

  • overhead.

  • If the model is trained on GPU, then overhead

  • is slightly higher.

  • It's about 75%.

  • But it's not that high.

  • So it may be even a good idea for you

  • to turn this API on in your tests for [INAUDIBLE] checks.

  • So this API here is useful, but it's also limited in the sense

  • that it only tells you what happens when

  • the NaN or infinity happens.

  • It tells you about the op.

  • But it has no information about the history

  • of the execution leading up to that moment.

  • OK.

  • So TensorFlow Debugger can

  • be thought of as basically a combined tool that

  • will help you achieve almost all the debugging tasks

  • that we have mentioned in earlier

  • parts of the presentation, including looking at tensor

  • values, the placement of ops and devices, graph structures,

  • also step debugging and numerical issues

  • like NaNs and infinities.

  • So far, we haven't put a lot of thought

  • into high-level API support like Keras.

  • But it's on our radar.

  • So there are two different versions

  • of TF Debugger, V1 and V2.

  • So V1 was a part of TF 1.

  • So it's centered around the old tf.Session API.

  • So it's basically a set of wrappers for your sessions.

  • So it's still available in TensorFlow.

  • If you are still using TF 1 APIs,

  • it might be useful to you.

  • So there are two different wrappers--

  • the command line interface wrapper

  • and the TensorBoard wrapper.

  • When you wrap the session objects,

  • you don't have to make any other changes to your TensorFlow

  • code.

  • When Session.run runs, it's going to present you

  • with debugging information.

  • If you use the command line interface wrapper,

  • then Session.run will basically drop you

  • into an interactive terminal-based program in your terminal.

  • And these screenshots are showing you

  • that the command line interface will show you

  • the list of tensors that are executed.

  • And you can click those tensor names

  • to look at the details of the tensors,

  • like the op placement, the values of the tensors,

  • and so on.

  • It will also show you the source code

  • and annotate each line in the source code

  • with the ops that are created at that line.

  • So currently, we're working on V2 of TF Debugger.

  • The reasons why we want to invest in it are, obviously,

  • first, we want to bring the tool up

  • to speed with the current API, which has no tf.Session

  • but involves eager execution and tf.function.

  • And also, in earlier parts of the presentation,

  • you have seen that print debugging and tf.print

  • are useful for a lot of debugging cases,

  • but it's not useful in all cases, especially when you

  • want to debug some code deep inside the TensorFlow codebase

  • itself.

  • So we also want to incorporate some lessons

  • we learned from V1 of the tool.

  • First, we want the tool to be general enough

  • to work on all hardware types.

  • TF Debugger V1, because it predates TPU support in TensorFlow,

  • does not work for TPUs.

  • It only works for CPU and GPU.

  • But TF Debugger V2 will work for all the major hardware types--

  • CPUs, GPUs, and TPUs.

  • And secondly, we want the overhead

  • to be as low as possible.

  • And also, we learned that there are some improvements

  • that we can make to the UX of the frontend.

  • So TF Debugger V2 in a nutshell will involve this process.

  • So the user has a TF program that he or she wants to debug.

  • Then they can just insert a one-line call

  • into their tf.function and specify a logdir.

  • So the logdir can be the same logdir as your TensorBoard

  • logdir.

  • And then if your TensorBoard has started,

  • then you can switch to the Debugger V2

  • Dashboard in TensorBoard to look at the debug information.
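
The slide doesn't name the one-line call; the API that eventually shipped for this, shown here as an assumption relative to the talk's timeframe, is tf.debugging.experimental.enable_dump_debug_info:

    tf.debugging.experimental.enable_dump_debug_info(
        '/tmp/tfdbg2_logdir',             # can be the same logdir TensorBoard reads
        tensor_debug_mode='FULL_HEALTH',  # what to record about each tensor
        circular_buffer_size=-1)          # -1 keeps the full execution history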

  • So the frontend work is currently underway.

  • So I've been reaching out to various people,

  • like people at the TensorFlow team and outside TensorFlow

  • team at Google to get their feedback.

  • If you're interested in [INAUDIBLE] this

  • or telling us about your specific debugging use case,

  • please reach out to me.

  • And I will be more than happy to work with you to make sure

  • that the new tool will be useful for your problem.

  • So here are some UI mocks that UX researchers helped us make.

  • So it's going to be the look of the new Debugger V2

  • plugin in TensorBoard.

  • It's going to show you the execution history on the top.

  • It's going to show you both eager execution of ops

  • and of tf.functions.

  • And you can zoom into tf.functions

  • to look at the graph structure and the list of tensors

  • that are computed inside the tf.function.

  • And the top left section will highlight important events,

  • like the generation of NaNs and infinities,

  • and the repeated function compilations

  • which might hurt your performance, and so forth.

  • And more importantly, on the bottom section

  • you will be able to associate your graph ops with your source

  • code or associate eager execution events

  • with your source code.

  • This will make it easier for you to find

  • the way back from your bug to your source code.

  • And it should speed up your bug fixing process.

  • And finally, some advice.

  • So the authors of TensorFlow have done a lot of work

  • recently to improve the error messages.

  • So next time you get an error message in TensorFlow,

  • be patient and read through the error message,

  • especially the sections labeled as "in user code."

  • It may contain some surprisingly useful information for you

  • to debug your problems.

  • And lastly, some machine learning bugs

  • are not really machine learning bugs, but they're just

  • general programming bugs.

  • So here is a puzzle for you to chew on.

  • It's a small problem.

  • So here, the user is trying to load two files,

  • one for the features and one for the labels.

  • And the user feeds them into a function

  • to construct a data set.

  • And the data set is fed into the fit call to train the model.

  • But for some reason, the model training is not very good.

  • The accuracy is much worse than expected.

  • And what's the reason for that?

  • So it's a puzzle for you.

  • If you're interested in the answer, reach out to me,

  • and I'll be happy to tell you the answers.

  • But the point is that some bugs are just

  • general programming bugs, not machine learning bugs per se.

  • Thank you very much for your attention.

  • [MUSIC PLAYING]
