Meet the Packets: How audio travels into your browser
Sara Fecadu

KATIE: Hello. Welcome back. So, I keep forgetting
to do this and I apologize. But the big announcement right now is that the swag is ready. But do
not go get swag now because we're about to have a really awesome talk by Sara Fecadu.
I asked Sara for a fun fact and her fun fact was that she bakes a mean cookie which
unfortunately we can't all indulge in. So, as a follow up question, I said what prompted
you to write this talk about an audio API. And she said, well, I had spent a year building
a checkout form and I just couldn't stand to look at it or think about it anymore and
I had to do something different. Which I think is something that literally all of us
can probably identify really strongly with. So, anyways, Sara is gonna come up and talk
to us about the audio API. So, give it up for Sara.
[ Applause ] SARA: Hello. See if I can get my computer
started here. Okay. Welcome to my talk, Meet the Packets. In case not everyone has realized,
it's a play on Meet the Parents. I spent a lot of time working on that.
[ Laughter ] Let's see here. One second. Gonna progress?
No. Okay. We're gonna do it without the clicker. So, this will be interesting. As Katie said,
my name... oh. My whole slide deck isn't progressing. Okay. One second. There we go. Okay. Thank
you for coming to my talk. As Katie said, my name is Sara Fecadu. I am from Seattle, Washington.
And I don't have a ton of hobbies besides making cookies and listening to a lot of podcasts.
And by day I'm a software developer at Nordstrom. And Nordstrom is a clothing retailer founded
in 1901. While people don't usually associate 100-year-old companies with tech, we have
a thriving tech org working on innovative ways to get you what you need and feel your
best. And a year ago I was hired on to do a rewrite of Nordstrom.com's checkout. And as
of last May, we have been taking 100% of customer orders. Now, why am I talking about audio
streaming? Katie may have taken my joke here, but the answer is: Form fields. Our checkout
UI has 22 form fields. And they come in different groupings for different reasons. But many
of my waking moments over the past year have been spent thinking about these form fields.
And I just wanted to do anything else. So, I was sitting on my couch one night reading
a book on packet analysis, like one does, and watching a YouTube video. And I thought
to myself, how does that work? Like, on the packet level, how does audio video streaming
work? So, to answer the larger question, I started small with: What is audio streaming?
And audio streaming is the act of sending audio files over the network. And this talk
will be about on demand audio streaming. Now, the major difference between on demand streaming
and live streaming, is with on demand streaming we need all of the packets to get across the
wire. Whereas with live streaming, you may be more interested in keeping up with
the event and a certain amount of packet loss is acceptable. Over the past few months, I
learned that audio streaming, even when limited to on demand, is as wide a subject as it is
deep. I have picked three topics that exemplify what audio streaming is, why it's hard, and
how to get started yourself. And we will talk about audio streaming protocols, TCP congestion
control and client players. Audio streaming protocols give us a standard for how to encode, segment,
and ship your audio to the client. TCP congestion control handles congestion on the TCP layer
of the stack. And it is relevant to on demand audio streaming because we're shipping larger
audio files and we need every single packet to make its way to the client to play audio.
A client player is any network connected device with a play and pause button. So, this could
be your phone, your TV, your laptop, et cetera. And client players not only allow us to play
our audio, but when paired with modern audio streaming protocols, they hold a lot of decision
making power. Well, audio streaming protocols are the heart of audio streaming. And today
we'll talk about adaptive bitrate streaming and its benefits, and how to convert your
own audio files to work with two popular audio streaming protocols. Before we get started,
I wanted to go over some terms that will come up. A codec encodes data and uses compression
techniques to get the highest quality for the smallest footprint. Encoding and trans
coding is converting it from one type to another. Trans coding can convert from digital to digital.
And then move from analog to other digital files. Bitrate is how many bits it takes to
encode a second of audio. And this number usually refers to the quality of the audio
file. When I think of playing music on the Internet, I think of an HTML5 audio tag with
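As a rough worked example, with illustrative numbers: at a bitrate of 64 kilobits per second, a three-minute song takes about 64,000 bits/second × 180 seconds = 11,520,000 bits, or roughly 1.4 megabytes.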
When I think of playing music on the Internet, I think of an HTML5 audio tag with
a source attribute set to the path of my audio file. And this is a perfectly reasonable way
to do it. You can request and receive a single file containing an entire song. This would
be referred to as progressive streaming, and the major benefit here is you only have one
file to deal with.
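As a minimal sketch in markup (the file path is an illustrative placeholder):

    <!-- Progressive streaming: the browser requests one file containing the whole song -->
    <audio controls src="/audio/my-song.mp3"></audio>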
But let's say, for instance, you have a user with a slow network connection who can't
download your one file. They're stuck. So, adaptive bitrate streaming aims to solve this
problem by encoding your audio in multiple bitrates and allowing
the client player to decide which quality is best for the user to listen to your audio
uninterrupted. This allows more users to access your audio. But it does add a layer of operational
complexity because now you've got a lot more moving parts. The audio streaming
protocols we'll talk about not only leverage adaptive bitrate streaming, but also use HTTP
web servers. They do this by encoding the file, segmenting it, placing the segments on
a web server and then once requested, partial audio files are sent to the client one at
a time. Here is the secret to our modern audio streaming protocols: it's more of a series
of downloads than it really is a stream. But we'll refer to it as streaming anyway. The
two most popular audio streaming protocols today are HTTP Live Streaming, or HLS, and
Dynamic Adaptive Streaming over HTTP, or MPEG DASH. HLS was created by Apple to support streaming
to mobile devices and it is the default on all macOS and iOS devices. And MPEG DASH was
a direct alternative to HLS. It was created by the MPEG group, who wanted to make MPEG DASH the
international streaming standard. Let's look at them side by side. HLS takes MP3, AAC, AC-3,
or EC-3 files and encodes them into segmented transport stream files. Those segmented files
are listed in a playlist. If you have multiple bitrate streams, each stream will be in a media
playlist and all of your media playlists will be in a master playlist. MPEG DASH is format
agnostic; in theory you can convert any file type into MPEG DASH. The audio will be encoded
into a fragmented MP4 file. That will be described in an XML manifest file called a media
presentation description.
Okay. We've talked about what files will be used and what they'll be segmented into, but
how do you get it there? You've got this audio file. What tools allow you to convert the
audio file? Well, you've got options. But most of these options are paid options. Except
for FFmpeg. Which is an open source demand line tool that among other things allows you
to convert audio files to be HLS or MPEG DASH. However, I founded learning curve for FFmpeg
to be pretty steep. And a lot of the documentation for HLS and MPEG DASH were for video streams.
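For reference, a minimal FFmpeg invocation for a single-bitrate HLS audio stream looks something like this (file names and settings are illustrative):

    ffmpeg -i input.mp3 -c:a aac -b:a 64k -f hls \
        -hls_time 10 -hls_playlist_type vod audio-64k.m3u8

This re-encodes the MP3 as 64 kbps AAC and writes ten-second transport stream segments alongside the media playlist that references them.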
Instead I used Amazon Elastic Transcoder. It's an AWS offering that converts files of
one type to another. In our case, we're taking an audio file and converting it to be used
with HLS and MPEG DASH. It's pretty much plug and play. You tell Amazon Elastic Transcoder
what type of files you have and what type of files you want and it outputs the stream
for you. And even though it's easy to use, it's not a free service. So, if you were going
to be converting a lot of files, it may be worth your time to learn more about an open
source alternative like FFmpeg. My workflow when working with Amazon Elastic Transcoder
was to upload my audio file to an S3 bucket, AWS's object store. I told Amazon Elastic Transcoder where the file
was and what settings I needed it to convert my audio files to. And Amazon Elastic Transcoder
output my streams into that same S3 bucket. And I downloaded them for us to explore.
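As a sketch, the job submission step looks something like this from the AWS CLI (the pipeline ID and preset ID are placeholders you would look up in your own account; Elastic Transcoder ships system presets for HLS audio):

    aws elastictranscoder create-job \
        --pipeline-id <your-pipeline-id> \
        --input '{"Key": "uploads/my-song.mp3"}' \
        --output-key-prefix "streams/" \
        --outputs '[{"Key": "hls/audio-64k",
                     "PresetId": "<hls-audio-64k-preset-id>",
                     "SegmentDuration": "10"}]'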
This is the basic set of files you would get with an HLS stream. And it kind of looks like a
lot. But we're going to break it down into four groups. In the top left, the master
playlist. In our case, we have two bitrate streams represented and they will be linked out from
the master playlist. And then in the top right you'll see those media playlists which
have each bitrate stream. And those will contain all of our links to our transport stream files
which are the fragmented audio files represented in both the bottom left and the bottom right.
On the bottom right we have our 64K bitrate stream segmented audio files. And in the bottom,
oh. Did I get that backwards? I'm not really good at right and left. But in the bottom
section you'll have your fragmented audio files. We'll take a closer look at those so
you can see what's really in them. This is the entirety of the HLS master playlist. It contains
information about the specific bitrate streams and links out to those media playlists that
represent the streams themselves.
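A trimmed-down master playlist for two bitrate streams looks something like this (illustrative file names and values, not the exact Elastic Transcoder output):

    #EXTM3U
    #EXT-X-STREAM-INF:BANDWIDTH=64000,CODECS="mp4a.40.2"
    audio-64k.m3u8
    #EXT-X-STREAM-INF:BANDWIDTH=128000,CODECS="mp4a.40.2"
    audio-128k.m3u8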
Let's look at the 64K bitrate stream media playlist. It has even more information about
the stream, including caching information, the target duration of each segmented audio file,
and most importantly, links out to our transport streams.
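Again as an illustrative sketch, a minimal 64K media playlist might look like:

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:10
    #EXT-X-MEDIA-SEQUENCE:0
    #EXTINF:10.0,
    audio-64k-00000.ts
    #EXTINF:10.0,
    audio-64k-00001.ts
    #EXT-X-ENDLIST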
This is what one of those fragmented audio files looks like. And there's something
a little interesting going on here. If you'll notice, it's color coded and I kept trying
to figure out why. But then I realized a transport stream has the file extension .ts. And something
else has the file extension .ts: TypeScript. Ignore the colors. It's just a binary encoded
file. Now our MPEG DASH audio stream has fewer files and looks more manageable. But it's
similar. We have our media presentation description, which is an XML manifest file that contains
all of our information about the stream. Then below we have our two segmented audio files.
All of the segments for each bitrate are encapsulated in a single file, with the individual
segments inside it. That's why there are fewer files in the MPEG DASH audio stream than in
the HLS audio stream. Look at the media presentation description. We see a lot of stuff here.
But there are three important elements. All bitrate streams are represented in a
representation tag. And then all bitrate
streams are enclosed in an adaptation set. And within the representation tag, we have
the URL to our audio files.
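Here is a heavily trimmed sketch of those three elements in a media presentation description (illustrative names and values; a real MPD carries many more attributes):

    <MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
      <Period>
        <AdaptationSet mimeType="audio/mp4" segmentAlignment="true">
          <Representation id="audio-64k" bandwidth="64000" codecs="mp4a.40.2">
            <BaseURL>audio-64k.mp4</BaseURL>
          </Representation>
          <Representation id="audio-128k" bandwidth="128000" codecs="mp4a.40.2">
            <BaseURL>audio-128k.mp4</BaseURL>
          </Representation>
        </AdaptationSet>
      </Period>
    </MPD>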
And taking a look at one of those audio files, we'll see it looks fairly similar to the
segmented audio file we saw with HLS, minus the color coding, because it's a .mp4 versus
a .ts. Visual Studio is not confused in this case.
Earlier we talked about progressive streaming, which is streaming an entire audio file in
one go. We used an audio element and a source attribute with the path of our audio file.
With MPEG DASH and HLS, it's very similar. But instead of having the path to our audio
file, we have the path to the master playlist for HLS or the media presentation description
for MPEG DASH.
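In markup, the only difference from progressive streaming is what the source points at (paths are illustrative; browsers without native HLS or MPEG DASH support need a player library such as hls.js or dash.js):

    <!-- HLS: the source is the master playlist -->
    <audio controls src="/streams/master.m3u8"></audio>

    <!-- MPEG DASH: the source is the media presentation description -->
    <audio controls src="/streams/my-song.mpd"></audio>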
We're going to take a hard left here and we're gonna talk about the second topic in my
talk, which is TCP congestion control. And TCP is a transport layer protocol
and it has mechanisms in both its sender and receiver which are defined by the operating
systems of each to react to and hopefully avoid congestion when sending packets over
the wire. And these mechanisms are called TCP congestion control. And today we'll talk about
packet loss based congestion control and why it isn't so great. And more specifically, the
congestion window and duplicate acknowledgments in packet loss based congestion control. Before we get started,
here are some more terms. Bandwidth is the rate at which data can be sent. And throughput is
the rate at which data can be received. The congestion window is a TCP variable that defines
the amount of data that can be sent before an acknowledgment is received by the sender.
Let's say you have a user who has requested your audio file from the server. Your audio
packets travel down the network stack, across the physical layer, up through the data link
layer and the network layer, and arrive at the transport layer. And unfortunately there's
congestion right before we reach our destination. Now, traffic congestion and network congestion
have very similar beginnings. Either too many cars or too many packets have entered the
roadway and there's nowhere for them to go. With traffic, you have to wait it out. Luckily
for us, TCP congestion control allows packets to keep flowing over the wire, even during congestion.
And before we get to the specifics of these TCP congestion control algorithms, let's talk
about the TCP happy path. We're going to start with a single packet sent from the sender
to the receiver, flowing through the receiver's buffer, being acknowledged by the receiver,
and having an acknowledgment packet sent back to the sender. We talked about the congestion
window, the amount of data that can be sent before the sender receives an acknowledgment.
Another way of thinking about the congestion window is as a sending rate. As the sender
receives acknowledgments, the congestion window grows. And as the receiver's buffers fill
and excess packets get dropped, the sender responds by shrinking the congestion window.
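To make the grow-and-shrink behavior concrete, here is a minimal sketch of the classic additive-increase/multiplicative-decrease rule that loss-based congestion control follows (TypeScript, with illustrative constants; real TCP stacks are considerably more sophisticated):

    // Congestion window sketch: additive increase, multiplicative decrease.
    const MSS = 1460; // maximum segment size, in bytes (typical Ethernet value)

    let cwnd = 10 * MSS; // bytes allowed "in flight" before an acknowledgment

    function onAckReceived(): void {
      // Additive increase: grow by about one segment per round trip.
      cwnd += (MSS * MSS) / cwnd;
    }

    function onPacketLoss(): void {
      // Multiplicative decrease: halve the window when loss signals congestion.
      cwnd = Math.max(MSS, cwnd / 2);
    }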
A second way of thinking about the congestion window is as a bucket. As packet loss occurs,
the bucket shrinks. And as acknowledgments are received by the sender, the bucket grows.
There's a slight oversight in the bucket explanation