
YURI: Hi, everyone. My name is Yuri. And today, I'm going to be talking to you about tf.data, which is TensorFlow's input pipeline. As a disclaimer, this presentation assumes familiarity with basic TensorFlow concepts such as ops and kernels. And it contains a lot of code examples that are not necessarily 100% accurate: there may be some details that have been removed because they're either unnecessary or distracting for the purposes of the presentation. So with that, let's get started.

In this talk, we're going to cover a couple of topics. We're going to peel the two main layers of tf.data's implementation one by one, first focusing on the Python view and then on the C++ view of tf.data. And then I'm going to cover three areas of tf.data that might be of interest to the broader audience: support for non-tensor types, and both static and dynamic optimizations in tf.data. So let's get started with the Python view.

Throughout the course of the presentation, I'm going to be using the following example, which is a pretty standard example of an input pipeline. What this input pipeline does is read files that are in the TFRecord format-- so these contain records-- then shuffle those records, apply a map transformation that allows you to transform the records, parse them, and pre-process them, and finally batch the pre-processed data so that it's amenable to machine learning computation. And the idiomatic way to iterate through the elements of an input pipeline in TF 2.0 is with a simple for loop. And that's because in TF 2.0, datasets are Python iterables. Besides this approach, you can also use the explicit iter and next built-ins.
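As a rough sketch of that example (the file names, parameters, and parse logic here are just placeholders, in the spirit of the accuracy disclaimer above), it looks something like this:

    import tensorflow as tf

    def parse(record):
        # Placeholder user-defined function: parse and pre-process one record.
        return tf.io.parse_single_example(
            record, {"feature": tf.io.FixedLenFeature([], tf.string)})

    dataset = tf.data.TFRecordDataset(["file_0.tfrecord", "file_1.tfrecord"])
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.map(parse)   # parse can be graph code or, via AutoGraph, non-graph code
    dataset = dataset.batch(batch_size=32)

    # Idiomatic TF 2.0 iteration: datasets are Python iterables.
    for batch in dataset:
        pass  # e.g. train_step(batch)

    # Equivalent explicit form.
    iterator = iter(dataset)
    batch = next(iterator)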

Now, as the comment at the bottom mentions, the user-defined function that you can pass into the map transformation can be either graph or non-graph computation, where the non-graph computation is enabled by AutoGraph. And I'll talk a little more about that later on.

Just to contrast with the simplifications that happened in the transition between 1.x and 2.0, let's take a look at what an input pipeline-- or the idiomatic iteration of an input pipeline-- would look like in 1.x. You can see that the definition of the input pipeline-- that is, the top part, the data set itself-- remains the same. But the iteration is much more verbose. So hopefully this illustrates that the simplest way to iterate through a data set has been made much simpler in the 2.0 release of TensorFlow.
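A rough sketch of that 1.x-style iteration (again just an approximation; the parse function is the same placeholder as in the earlier sketch):

    # TF 1.x (or tf.compat.v1 in 2.x): the dataset definition is unchanged.
    filenames = ["file_0.tfrecord", "file_1.tfrecord"]
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.shuffle(buffer_size=10000)
    dataset = dataset.map(parse)
    dataset = dataset.batch(batch_size=32)

    # Iteration is much more verbose: create an iterator, get a next-element
    # op, and run it in a session until the dataset is exhausted.
    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        while True:
            try:
                batch = sess.run(next_element)  # e.g. feed into a training step
            except tf.errors.OutOfRangeError:
                break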

So let's talk a bit more about what's actually going on when you run the Python program. What we're going to do is go through the different lines of the Python program and talk about what actually happens under the hood, in terms of what types of TensorFlow ops these invocations correspond to. And I'm using a diagram to visualize the different types of ops. The gray box is the actual op-- so in this case, TFRecordDataset-- while the yellow boxes are the different inputs for the op, and the blue box is the output of the op. So in the case of the TFRecordDataset, we have a couple of inputs: file names, compression type, buffer size. And an important thing that I want to highlight here is that this op produces a variant tensor, which is a representation of the data set object that can be passed between different ops. And we will see how that's used right away when we're looking at the map transformation.
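To make that concrete, here's a rough sketch of invoking that op directly through tf.raw_ops (something the TFRecordDataset Python class normally does for you; the values are placeholders):

    # Op-level view: three input tensors in, one variant tensor out.
    variant = tf.raw_ops.TFRecordDataset(
        filenames=tf.constant(["file_0.tfrecord"]),
        compression_type=tf.constant(""),
        buffer_size=tf.constant(256 * 1024, dtype=tf.int64))

    print(variant.dtype)  # tf.variant -- a handle to the data set object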

So for the MapDataset op, you can see that one of its inputs is actually a variant, which is the input data set that produces the elements that the map transformation transforms. The other inputs are called "other arguments", and these are actually the captured inputs for the function. In this particular case, that input would be empty, because the function doesn't have any captured inputs, at least not as outlined in the example. The round boxes are not inputs; they are attributes. The difference between inputs and attributes is that attribute values do not change across different executions of the op-- they are constant. The attributes here are the function, which identifies the function parse-- which is stored separately in the TensorFlow runtime, but the attribute allows the op to look it up when it executes-- as well as the types of the arguments that the function takes as input. And again, like the TFRecordDataset, it produces an output variant.
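As a rough, hypothetical illustration of captured inputs (not from the slides): if the mapped function closed over a tensor, that tensor would be fed through the "other arguments" input.

    scale = tf.constant(255.0)  # defined outside the function

    def normalize(x):
        # `scale` is not an argument; it's captured from the enclosing scope, so
        # when the function is traced it's fed to the MapDataset op through the
        # "other_arguments" input.
        return x / scale

    ds = tf.data.Dataset.from_tensor_slices([0.0, 128.0, 255.0]).map(normalize)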

So, a little more about the support for user-defined functions in tf.data. A number of tf.data transformations are operations that actually allow users to specify their own functions. Examples of those are filter, flat_map, interleave, map, and reduce.
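A quick sketch of those transformations with trivial placeholder functions (not from the talk):

    ds = tf.data.Dataset.range(10)

    ds = ds.filter(lambda x: x % 2 == 0)                  # keep elements where the predicate holds
    ds = ds.map(lambda x: x * 3)                          # transform each element
    ds = ds.flat_map(lambda x: tf.data.Dataset.range(x))  # map each element to a dataset, then flatten
    ds = ds.interleave(                                   # like flat_map, but interleaves the results
        lambda x: tf.data.Dataset.from_tensors(x).repeat(2),
        cycle_length=2)
    total = ds.reduce(tf.constant(0, dtype=tf.int64),     # fold all elements into a single value
                      lambda state, x: state + x)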

And irrespective of the mode of execution, tf.data will convert the user-defined function into a graph. And as illustrated on the previous slide, a handle to the function graph is passed to the respective op through an attr.

A little more detail on the tracing implementation: it was originally based on framework.function.Defun and recently switched to the same tracing implementation that's used for TF functions in 2.0. This provided a number of benefits, including control flow version 2, support for resource variables and TensorArrayV2, and also the ability for users to specify user-defined functions that are not necessarily graph-compatible, as long as they're supported by AutoGraph. And it's marked as work in progress here because this functionality is actually temporarily disabled, and we're working on enabling it again very soon.
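As a rough example of the kind of user-defined function this enables (assuming the AutoGraph path is switched on), data-dependent Python control flow can be written directly:

    def clip(x):
        # A plain Python `if` on a tensor value isn't graph-compatible on its
        # own; AutoGraph rewrites it into a tf.cond when the function is traced.
        if x > 10:
            return tf.constant(10, dtype=tf.int64)
        return x

    ds = tf.data.Dataset.range(20).map(clip)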

So to tie it together: if we look at the input pipeline definition, those four lines will roughly correspond to the following ops, inputs, and attributes.
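Roughly, line by line (the op names and their inputs here are approximate):

    dataset = tf.data.TFRecordDataset(filenames)  # TFRecordDataset op: filenames, compression_type, buffer_size -> variant
    dataset = dataset.shuffle(10000)              # ShuffleDataset op: input variant, buffer_size, seeds -> variant
    dataset = dataset.map(parse)                  # MapDataset op: input variant, other_arguments; function attr -> variant
    dataset = dataset.batch(32)                   # BatchDataset op: input variant, batch_size -> variant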

Now, up to this point, we've only talked about how to define the input pipeline. But naturally, the thing that you would want to do with the input pipeline is enumerate the elements inside of it. And that's where the iterator ops come into play, because an iterator can be thought of as an instance of a data set that has state and allows you to enumerate the elements in a sequential order.

So what are the iterator lifecycle ops? The op in the top left corner, called Iterator, takes no input and produces a single output called handle. It creates an empty iterator resource, which is a way to pass the iterator state between different operations. The MakeIterator op takes two different inputs: an iterator resource, which is something that was created by the Iterator op, and a data set variant. What this MakeIterator op does is initialize the iterator resource with that particular data set. So at that point, you have an iterator resource that has been initialized to start producing elements for that particular data set, as defined by the data set variant. Now, the actual iteration happens by means of the IteratorGetNext op, which takes an iterator resource handle and produces the actual elements, which can be a tensor, a nest of tensors, or possibly also non-tensor types. And later in the presentation, I'll talk about what exactly is supported in tf.data in terms of types.

And finally, there's also a DeleteIterator op that takes the iterator resource and makes sure that the iterator state is properly disposed of when the iterator is no longer needed.
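To connect this back to the Python view, here's a rough sketch of how 2.0-style iteration maps onto these ops (the mapping in the comments is approximate):

    dataset = tf.data.Dataset.range(3)

    iterator = iter(dataset)   # creates an iterator resource and initializes it with
                               # the dataset variant (Iterator + MakeIterator)

    print(next(iterator))      # each next() call runs IteratorGetNext on the resource
    print(next(iterator))

    del iterator               # the iterator state is disposed of once the iterator
                               # is no longer needed (DeleteIterator)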

This final op, as you can imagine,