  • SHANQING CAI: We're going to do this presentation about how

  • to debug TensorFlow programs.

  • We're going to focus specifically on TF 2,

  • because TF 2 is the stable release,

  • and it will have long-term support going forward.

  • But there are also places where we're going to mention TF 1.

  • And when we do, we'll make that clear

  • so you know which version of TensorFlow we're talking about.

  • So first of all, I want to define the scope of debugging.

  • And the reason why we should do this

  • is because the word "debugging" is an overloaded term

  • in machine learning.

  • Different people use it to refer to different things,

  • sometimes in confusing ways.

  • So in the scope of this talk, debugging

  • refers to specific things, really,

  • that mainly have to do with the correctness of your TensorFlow

  • program, like mathematical implementation bugs.

  • For example, when you are implementing a new [INAUDIBLE]

  • type or a new loss function, you may

  • run into DType issues or shape issues, or just

  • straight bugs in the math.

  • And the techniques we'll cover will also

  • be useful for debugging pre-processing and

  • post-processing parts of your TensorFlow program.

  • And one big focus will be the debugging

  • of issues like NaN and infinity

  • in our models, which happen very frequently during TF model

  • training.

  • [INAUDIBLE] will talk about a specific tool

  • called TensorTracer, which is very useful for catching

  • the root cause of NaNs and infinities

  • on TPUs and other devices.

  • And we're not going to talk about how

  • to debug bugs in op kernels themselves or bugs in hardware,

  • because those are specific to the hardware

  • and the op kernels that you're using.

  • However, the methods we'll cover will be useful for you

  • to debug models that are affected by those kernel

  • or hardware bugs.

  • At least, they will be useful for you to narrow down

  • the cause of the model bug to the level

  • of op kernels or hardware.

  • And the tools and techniques we'll cover

  • will also be useful in case you want to just peek

  • into a model to understand what's going on.

  • And so one example would be answering a question like,

  • why is my model making a wrong prediction on a [INAUDIBLE]?

  • So you will be able to peek into the model

  • and look at the layer activations

  • and the intermediate tensors and answer that question.

  • So one use case that's kind of relevant

  • to that is when you are porting a model

  • from one version of the library to another,

  • or from one library to another, like from TensorFlow to TFLite,

  • or from TensorFlow to TF.js, or from TensorFlow to PyTorch,

  • or from PyTorch to TensorFlow.

  • You will often see divergence between the two implementations

  • of the same model, and you want to quickly identify the root

  • cause of the divergence.

  • And the tools and techniques we'll cover

  • will also be useful for that purpose.

  • What we're not going to cover, however,

  • are the debugging cases like debugging the model performance

  • and looking at the accuracy of models

  • after training, like model evaluation and analysis.

  • We're not going to cover how to debug fairness

  • issues in models, either. Those are also

  • important kinds of TensorFlow debugging,

  • but those are outside the scope of this talk.

  • There are great tools for those, like some dashboards

  • in TensorBoard, the What-If Tool, and Fairness Indicators,

  • and so forth.

  • But I'm not going to talk about those here.

  • Any questions so far?

  • OK.

  • So here's a brief outline of this presentation.

  • We're going to talk about how to debug tensor values.

  • We're going to talk about how to look at the placement of ops

  • on different devices, like CPUs and GPUs.

  • It's a very commonly asked question.

  • We're going to look at how to debug the structures of graphs

  • in TensorFlow 2, including the graphs from tf.function

  • and graphs that are optimized for the runtime.

  • And then in section 4, we're going

  • to cover the special topic of step debugging, which

  • is to use an IDE to step through your code line by line.

  • And then in the fifth section, we're

  • going to move from low-level API to high-level API.

  • And the specific high-level API I will

  • focus on is tf.keras, because tf.keras

  • is the official high-level API in TF 2,

  • and also because I'm not as familiar

  • with other high-level APIs like [INAUDIBLE] and so forth.

  • And in section 6, we're going to talk

  • about the debugging of numerical issues like NaNs and infinity.

  • And finally, I'm going to present

  • our work on TensorFlow Debugger, including the existing V1

  • features and the V2 features that we're currently

  • working on.

  • So first, let's take a look at how to debug tensor values.

  • So here's a very simple example.

  • So it's very straightforward.

  • You are not decorating your functions

  • with the tf.function decorator.

  • So everything is executed eagerly in TF 2.

  • And there you can simply use the print statements

  • to look at the values of tensors,

  • like in this simple example here.

  • So y is an eager tensor.

  • If you do print, you will see the value

  • in the stdout printout.
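
Here is a minimal sketch of what that kind of eager-mode print looks like (the tensor values are illustrative, not taken from the slide):

```python
import tensorflow as tf

# In TF 2, ops run eagerly by default, so `y` below is an EagerTensor
# whose concrete value is available immediately.
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)

# A plain Python print shows the value, shape, and dtype:
print(y)
# tf.Tensor(
# [[ 7. 10.]
#  [15. 22.]], shape=(2, 2), dtype=float32)
```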

  • So it's very similar to printing the values of numpy arrays,

  • with the caveat that if the tensor lives on the GPU,

  • then printing it will involve copying from the GPU

  • to your host, which may be a little bit costly.

  • And oftentimes, when the size of the tensor is too big,

  • you probably don't want to look at the entire tensor,

  • because there are going to be millions of elements.

  • What you sometimes want is to do a reduce operation

  • on the tensor and then look at some numerical summaries

  • of the tensor, like what's the minimum value, what's

  • the maximum value, and what's the mean.

  • It's also a useful technique that's

  • fully supported in eager mode.
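
A hedged sketch of such a reduce-based summary (the tensor here is random and purely illustrative):

```python
import tensorflow as tf

# For a large tensor, print numerical summaries instead of all elements.
t = tf.random.normal([1000, 1000])

print("min: ", tf.reduce_min(t).numpy())
print("max: ", tf.reduce_max(t).numpy())
print("mean:", tf.reduce_mean(t).numpy())
```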

  • So EagerTensor's str and repr methods

  • use numpy under the hood, which means

  • that you can use the set_printoptions function

  • from numpy to control the details of how

  • the values of tensors are printed.

  • For instance, if you use the precision [INAUDIBLE]

  • arguments, you can adjust basically

  • how many decimal places are printed for float-type

  • tensors.

  • You can also adjust the threshold element count

  • beyond which ellipses will be used in the printing, which

  • is useful for cases where you want

  • to look at the values of huge tensors,

  • like thousands of elements.
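
For example, a small sketch of both knobs (`precision` and `threshold` are real numpy parameters; the values here are illustrative):

```python
import numpy as np
import tensorflow as tf

t = tf.constant([1.23456789, 9.87654321])

# Fewer decimal digits in float printouts:
np.set_printoptions(precision=3)
print(t)  # tf.Tensor([1.235 9.877], shape=(2,), dtype=float32)

# Element count beyond which numpy abbreviates output with ellipses:
np.set_printoptions(threshold=10)
print(tf.range(100))  # tf.Tensor([ 0  1  2 ... 97 98 99], shape=(100,), dtype=int32)
```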

  • Of course, the story is not always as straightforward

  • as this.

  • The program is often not executing purely eagerly.

  • And sometimes you have tf.function,

  • sometimes your function is decorated by TensorFlow itself

  • and converted into a graph.

  • So there, if you use the print statements,

  • then the results of the printing may not

  • be what you would expect.

  • So here the user intends to look at the value of n

  • at each iteration of the while loop.

  • So the user puts a print statement here naively.

  • And the result has only one printed line,

  • even though the while loop is executed multiple times.

  • And the contents of the printed text

  • is also kind of confusing to a naive user.

  • And the reason is that when tf.function is used,

  • the code here is transformed into a tf.Graph.

  • And the print statement gets executed

  • during that function-to-graph transformation.

  • And the contents you see here is actually

  • a node in the graph instead of the value

  • of the tensor at runtime.
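
A sketch of that pitfall, reconstructed from the description above (the exact loop body on the slide may differ):

```python
import tensorflow as tf

@tf.function
def countdown(n):
    while n > 1:
        print(n)  # runs only once, while the function is traced into a graph
        n -= 1
    return n

countdown(tf.constant(4))
# Prints a single line showing a symbolic graph node, something like:
# Tensor("while/...:0", shape=(), dtype=int32)
```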

  • So the correct approach is to use tf.print.

  • So tf.print will modify the graph.

  • It will actually add a couple of nodes to the graph

  • so you can look at the value of n inside the tf.function.

  • So here at the bottom, you can see the value of n

  • printed at each iteration of the while loop.

  • So it's quite useful.
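
The same sketch with tf.print substituted, which is roughly what the fix looks like:

```python
import tensorflow as tf

@tf.function
def countdown(n):
    while n > 1:
        tf.print(n)  # a node in the graph, so it runs at every loop iteration
        n -= 1
    return n

countdown(tf.constant(4))
# 4
# 3
# 2
```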

  • So here is a homework problem for you.

  • So the examples I've shown so far are all for simple tensors,

  • like a float32 tensor or an integer tensor.