
[MUSIC PLAYING]

BOYA FANG: Hi, my name is Boya, and I'm a software engineer at Google Lens. And I'll be talking today about how TensorFlow and, particularly, TF Lite help us bring the powerful capabilities of Google Lens' computer vision stack on device.

First, what is Google Lens? And what is it capable of? Lens is a mobile app that allows you to search what you see. It takes input from your phone's camera and uses advanced computer vision technology in order to extract semantic information from image pixels. As an example, you can point Lens at your train ticket, which is in Japanese. It will automatically translate it and show it to you in English in the live viewfinder. You can also use Lens to calculate the tip at the end of dinner by simply pointing your phone at the receipt.

Aside from just searching for answers and providing suggestions, though, Lens also integrates live AR experiences into your phone's viewfinder, using optical motion tracking and an on-device rendering stack. Just like you saw in the examples, Lens utilizes the full spectrum of computer vision capabilities, starting with image-quality enhancements such as denoising, motion tracking in order to enable AR experiences, and, in particular, deep-learning models for object detection and semantic understanding.

To give you a quick overview of Lens' computer vision pipeline today: on the mobile client, we select an image from the camera stream to send to the server for processing. On the server side, the query image then gets processed using a stack of computer vision models in order to extract text and object information from the pixels. These semantic signals are then used to retrieve search results from our server-side index, which then get sent back to the client and displayed to the user.

Lens' current computer vision architecture is very powerful, but it has some limitations. Because we send a lower-resolution image to the server in order to minimize the payload size of the query, the quality of the computer vision prediction is lowered due to compression artifacts and reduced image detail. Also, the queries are processed on a per-image basis, which can sometimes lead to inconsistencies, especially for visually similar objects. You can see, on the right there, that the moth was misidentified by Lens as a chocolate cake, just one example of how this may impact the user.

Finally, Lens aims to provide great answers to all of our users instantly after opening the app. We want Lens to work extremely fast and reliably for all users, regardless of device type and network connectivity. The main bottleneck to achieving this vision is the network round-trip time with the image payload. To give you a better idea about how network latency impacts us in particular: it goes up significantly with poorer connectivity as well as payload size. In this graph, you can see latency plotted against payload size, with the blue bars representing a 4G connection and the red a 3G connection. For example, sending a 100 KB image on a 3G network can take up to 2.5 seconds, which is very high from a user experience standpoint.

In order to achieve our goal of less than one second end-to-end latency for all Lens users, we're exploring moving server-side computer vision models entirely on device. In this new architecture, we can stop sending pixels to the server by extracting text and object features on the client side. Moving machine learning models on device eliminates the network latency. But this is a significant shift from the way Lens currently works, and implementing this change is complex and challenging.

Some of the main technical challenges are that mobile CPUs are much less powerful than specialized server-side hardware architectures like TPUs. We've had some success porting server models on device using deep-learning architectures optimized for mobile CPUs, such as MobileNets, in combination with quantization for mobile hardware acceleration, along the lines of the sketch below.
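[Note: the exact tooling isn't shown in the transcript. The following is a minimal Python sketch of post-training quantization with the TF Lite converter, assuming a hypothetical SavedModel path; the real Lens pipeline may differ.]

    import tensorflow as tf

    # Load a trained model exported as a SavedModel (hypothetical path).
    converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/mobilenet_saved_model")

    # Enable default optimizations, which include post-training quantization;
    # this shrinks the model and targets integer-friendly mobile hardware.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    tflite_model = converter.convert()
    with open("/tmp/mobilenet_quantized.tflite", "wb") as f:
        f.write(tflite_model)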

Retraining models from scratch is also very time-consuming, but training strategies like transfer learning and distillation significantly reduce model development time by leveraging existing server models to teach a mobile model, roughly as in the sketch that follows.
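[Note: the talk doesn't show training code. This is a rough sketch of a distillation loss in TensorFlow, where a large teacher's softened predictions supervise a small student; the stand-in models, temperature, and loss weights are illustrative assumptions.]

    import tensorflow as tf

    # Stand-ins: in practice the teacher would be a large server model and the
    # student a mobile-friendly network such as a small MobileNet.
    teacher = tf.keras.applications.MobileNetV2(
        weights=None, classes=10, classifier_activation=None)
    student = tf.keras.applications.MobileNetV2(
        weights=None, classes=10, alpha=0.35, classifier_activation=None)

    TEMPERATURE = 4.0  # softens the teacher's output distribution

    def distillation_loss(images, labels):
        # Teacher outputs are treated as fixed soft targets; no gradient flows to it.
        teacher_logits = tf.stop_gradient(teacher(images, training=False))
        student_logits = student(images, training=True)

        soft_targets = tf.nn.softmax(teacher_logits / TEMPERATURE)
        soft_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
            labels=soft_targets, logits=student_logits / TEMPERATURE))

        # Ordinary supervised loss on the ground-truth labels.
        hard_loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labels, logits=student_logits))

        # The blend weight is a tunable hyperparameter.
        return 0.5 * soft_loss + 0.5 * hard_loss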

Finally, the models themselves need deployment infrastructure that's inherently mobile-efficient and manages the trade-off between quality, compute, latency, and power.

We have used TF Lite in combination with MediaPipe as an executor framework in order to deploy and optimize our ML pipeline for mobile devices; a bare-bones TF Lite inference call is sketched below.
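[Note: on a phone this would go through the TF Lite mobile runtimes (Java, Swift, or C++), but the Python interpreter mirrors the same flow. A minimal sketch, with the model path and the 224x224 RGB input shape assumed for illustration:]

    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path="/tmp/mobilenet_quantized.tflite")
    interpreter.allocate_tensors()

    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()

    # A dummy frame standing in for a camera image.
    frame = np.zeros((1, 224, 224, 3), dtype=np.float32)
    interpreter.set_tensor(input_details[0]["index"], frame)
    interpreter.invoke()
    scores = interpreter.get_tensor(output_details[0]["index"])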

Our high-level developer workflow to port a server model on device is to, first, pick a mobile-friendly architecture, such as a MobileNet; then train the model using a TensorFlow training pipeline, distilling from a server model; then evaluate the performance using TensorFlow's evaluation tools; and, finally, save the trained model at a checkpoint that you like and convert it to the TF Lite format in order to deploy it on mobile.

Here's an example of how easy it is to convert the saved model to TF Lite using TensorFlow's command-line tools.
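[Note: the command shown on the slide isn't captured in the transcript. A typical invocation of TensorFlow's tflite_convert command-line tool, with hypothetical paths, looks like this:]

    tflite_convert \
      --saved_model_dir=/tmp/lens_saved_model \
      --output_file=/tmp/lens_model.tflite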

Switching gears a little bit, let's look at an example of how Lens uses on-device computer vision to bring helpful suggestions instantly to the user. We can use on-device ML in order to determine if the user's camera is pointed at something that Lens can help the user with. You can see here, in this video, that, when the user points at a block of text, a suggestion chip is shown. When pressed, it brings the user to Lens, which then allows them to select the text and use it to search the web.

In order to enable these kinds of experiences on device, multiple visual signals are required. In order to generate these signals, Lens uses a cascade of text, barcode, and visual detection models, implemented as a directed acyclic graph, parts of which can run in parallel; a rough sketch of the idea follows below. The raw text, barcode, and object-detection signals are further processed using various on-device annotators and higher-level semantic models, such as fine-grained classifiers and embedders. This graph-based framework of models allows Lens to understand the scene's content as well as the user's intent.
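[Note: the cascade itself isn't shown in the talk. The sketch below illustrates the idea in Python with hypothetical TF Lite model files: cheap detectors produce raw signals, and heavier semantic models run only when those signals warrant it.]

    import numpy as np
    import tensorflow as tf

    def run_tflite(path, inputs):
        # Helper: run one TF Lite model on one input tensor and return its first
        # output. In a real app, interpreters would be created once and reused.
        interpreter = tf.lite.Interpreter(model_path=path)
        interpreter.allocate_tensors()
        interpreter.set_tensor(interpreter.get_input_details()[0]["index"], inputs)
        interpreter.invoke()
        return interpreter.get_tensor(interpreter.get_output_details()[0]["index"])

    def analyze_frame(frame):
        # Stage 1: independent detectors (these could run in parallel).
        text_boxes = run_tflite("text_detector.tflite", frame)
        barcodes = run_tflite("barcode_detector.tflite", frame)
        objects = run_tflite("object_detector.tflite", frame)

        # Stage 2: heavier annotators run only on signals that were actually found.
        results = {"barcodes": barcodes}
        if np.size(text_boxes):
            results["text"] = run_tflite("text_recognizer.tflite", frame)
        if np.size(objects):
            results["labels"] = run_tflite("fine_grained_classifier.tflite", frame)
        return results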

To further help optimize for low latency, Lens on device uses a set of inexpensive ML models which can be run within a few milliseconds on every camera frame. These perform functions like frame selection and coarse classification in order to optimize for latency and compute by carefully selecting when to run the rest of the ML pipeline, as in the gating sketch below.
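[Note: a rough sketch of the gating idea, reusing analyze_frame from the previous sketch; the cheap checks and thresholds here are simple stand-ins for the real per-frame models.]

    import numpy as np

    BLUR_THRESHOLD = 100.0    # made-up frame-selection threshold
    INTEREST_THRESHOLD = 0.6  # made-up coarse-classification threshold

    def sharpness(frame):
        # Cheap proxy for a frame-selection model: variance of a simple gradient.
        gray = frame[0].mean(axis=-1)
        return float(np.var(np.diff(gray, axis=0)))

    def coarse_interest(frame):
        # Stand-in for a tiny classifier asking "is there anything here Lens can help with?"
        return float(frame.mean())

    def process_camera_stream(frames):
        for frame in frames:
            if sharpness(frame) < BLUR_THRESHOLD:
                continue  # blurry frame: skip it (frame selection)
            if coarse_interest(frame) < INTEREST_THRESHOLD:
                continue  # nothing promising: save compute
            yield analyze_frame(frame)  # run the full cascade only now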

In summary, Lens can help improve the user experience of all our users by moving computer vision on device. TF Lite and other TensorFlow tools are critical in enabling this vision. We can rely on cascading multiple models in order to scale this vision to many device types and tackle reliability and latency.

You, too, can add computer vision to your mobile product. First, you can try Lens to get some inspiration for what you could do, and then you could check out the pre-trained mobile models that TensorFlow publishes. You can also follow something like the MediaPipe tutorial to help you build your own custom cascade, or you could build, deploy, and integrate ML models into your mobile app using something like ML Kit for Firebase. Thank you.

[MUSIC PLAYING]
