
  • We're now going to

  • take up a particular problem that has a very non-trivial solution.

  • We assume the stream elements are bits, zeros or

  • ones, and we want to know how many bits in the last N are 1.

  • If we can store the most recent N bits,

  • we can use a solution like the one discussed for averages.

  • In fact the algorithm would be even simpler.

  • However, we're going to address the situation where N bits don't fit

  • in main memory.

  • Perhaps N is a trillion, or N is a reasonable number but

  • there are so many streams that we can't store complete windows for all of them.

  • If we want exact answers then we can show that it is impossible to do

  • anything better than to store the entire window.

  • However, what is interesting is that we can store on the order of

  • the square of log N bits, where N is the window size, and

  • still answer queries about the count of 1's with answers that are off by at

  • most a small factor, as small as we like if we do enough work.

  • The problem we're going to discuss is this.

  • We're given a stream of 0's and 1's.

  • At any time we have to be prepared to answer a query of the form:

  • how many of the last k bits were 1?

  • Here k can be any integer from one up to some large upper limit N.

  • We can surely do this if we use a window size N and store the last N bits.

  • When a new bit arrives, we throw away the oldest bit in the window, since it

  • can never again be useful to answer one of these queries.
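
For concreteness, here is a minimal Python sketch of this exact-but-expensive approach (illustrative only; the class and method names are ours, not from the lecture):

```python
from collections import deque
from itertools import islice

class ExactWindowCounter:
    """Store the last N bits exactly: O(N) space, O(k) time per query."""

    def __init__(self, n):
        self.window = deque(maxlen=n)   # dropping the oldest bit is automatic

    def add(self, bit):
        self.window.append(bit)

    def count_last(self, k):
        # Examine the k most recent bits -- exactly the O(k) cost
        # per query that the lecture says we may not be able to afford.
        return sum(islice(reversed(self.window), k))
```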

  • But one disadvantage of this approach is that answering one query requires that we

  • examine k bits.

  • Since k can be quite large, and both inputs and

  • queries may be arriving very rapidly, that may be time we cannot afford.

  • Another potential problem is that we may not be able to afford the space.

  • As we just mentioned we could be trying to handle a large number of streams or

  • N could be so large that even one window does not fit in main memory.

  • Both these concerns suggest that we should consider a method that uses less

  • than N bits of space.

  • And that also allows us to answer queries about the last k

  • bits much faster than on the order of k.

  • It turns out that we can't get an exact answer to queries without using N

  • bits in the window.

  • But we can get close using much less space than all of N,

  • and also much less time than all of k.

  • We're going to introduce the right algorithm with a discussion of

  • something that seems like it should work but doesn't quite.

  • Our goal will be to be off by no more than a factor of 2 in

  • estimating the number of 1's in the last k bits.

  • So we will summarize blocks of the stream.

  • Blocks will have exponentially increasing lengths, that is, 1, 2, 4, 8, 16, and so on.

  • And the summary of a block will be simply the count of

  • the number of ones in that block.

  • When we want to know the count of ones for last k bits, we can find

  • some blocks that lie wholly within the last k bits and we add up their counts.

  • It is only the last block, the one furthest back in time that gives us pause.

  • We don't know how many of its 1's are within the last k bits, so we have to guess.

  • But if we've created these exponentially growing blocks for

  • all time units, then there would be as many blocks of length one as there are bits in

  • the window, as well as blocks of length two, four, eight, and so on.

  • So that saves us nothing.

  • Instead, we have to drop a block if its left end, that is, the end that is

  • earliest in time, coincides with the left end of a larger block.

  • And we also drop a small block if there's a larger block completely to its right,

  • that is, later in the stream.

  • As a result, you never have more than two blocks of any one size.

  • So, here is an example of the blocks we might retain at some time.

  • The five rows of blocks are of lengths 1, 2, 4, 8 and 16.

  • Okay, there are two blocks of length 1.

  • The more recent has a count of 0 because it consists of a single 0.

  • That's this.

  • While the other has a count of 1 because it consists of a single 1.

  • Here's a block of length 2.

  • That has a count of 1 because it represents 0 1.

  • That is, these two bits.

  • Notice that we've previously deleted the block of length 1 that would go here,

  • because it begins at the same point as the block of length 2 above it.

  • Also all other blocks of length 1 are deleted because they have

  • a block of length 2 completely to their right.

  • We also show a second block of length 2.

  • Its count is 2 because it represents this 1 1.

  • There are two blocks of length 4 and they have counts of 2 and 3.

  • Well, this one represents the sequence 0 0 1 1, so it has two 1's.

  • This represents 1 0 1 1 and

  • therefore gets a count of 3.

  • We see one block of length 8.

  • Its count is 4.

  • Well let's see, because it represents these eight bits.

  • And notice that the count for the second block of length 8 is not

  • needed, because we can figure out that it has six 1's.

  • That 6 is the difference between the number of

  • 1's in this block of length 16 and that block of length 8.

  • That is, 10 minus 4 equals 6.

  • So if this block existed, it would surely have six 1's.

  • Now, suppose we get a query for how many ones there are in the most recent 28 bits.

  • We can add up the counts of certain blocks.

  • Some little blocks at the right end, and then some bigger blocks going to the left.

  • We want to pick blocks so that each of the most recent 28 bits is covered by

  • exactly one of the blocks we choose.

  • So, we pick this block of length 1.

  • This block of length 2.

  • This of length 4.

  • We don't want this block of length 8 because we have this block of length

  • 16 and that's still all within the last 28.

  • So far we have covered 23 bits, and

  • we know that among them the number of

  • 1's is 0 plus 1 plus 2 plus 10,

  • which is 13.

  • But what do we do about the oldest five bits?

  • That is, these bits here.

  • Now, if we could see the bits, we would know

  • that they're 0 0 1 0 1 and therefore have two 1's, but we don't see them.

  • All we see is that they are part of this block of 16, and

  • we know that block has a count of six,

  • but we can't tell how many of those six are in the most recent five positions.

  • And we don't ever get to see these bits again.

  • Now if we could see them of course we would know there were two and

  • that the right answer is 15.

  • But we need to estimate, without seeing how many ones there are in this region.

  • If we guess that half the count of the block, that is 3

  • (6 divided by 2 in this case), is in the region

  • we don't see, then we would guess 16, and

  • that's only off by about 7%, so that's not bad at all.

  • We could even try a proportional guess.

  • That is, 6 divided by 16 is the probability that any given bit

  • is 1 in the range represented by this block of 16,

  • and we have to account for five of those positions, so

  • that's 30 divided by 16, which is roughly 2.

  • And if we guess 2 and add that, we would get 15.
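
As a quick check of that arithmetic, here are the two estimation strategies computed in Python (the numbers come from the example above; the variable names are ours):

```python
covered_counts = [0, 1, 2, 10]       # blocks of length 1, 2, 4, 16 wholly inside k = 28
partial_count, partial_len = 6, 16   # the block of length 16 that is only partly inside
overlap = 28 - (1 + 2 + 4 + 16)      # 5 of its positions fall inside the query range

half_guess = sum(covered_counts) + partial_count // 2                 # 13 + 3 = 16
proportional = sum(covered_counts) + round(partial_count * overlap / partial_len)
print(half_guess, proportional)      # 16 and 15; the true answer is 15
```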

  • And that happens to be right on the mark, even though we didn't get to see

  • the five bits that we wanted to count.

  • This strategy has a lot to recommend it.

  • First, it stores only on the order of the square of log N bits.

  • I might comment that we use the expression (log²N)

  • to mean the square of log N.

  • This is a common expression.

  • You don't want to write it as log N², because that's really 2 log N,

  • which is not what we mean.

  • If you've never seen this notation before, putting

  • the square right after the log means that you're actually squaring the whole thing,

  • that is, squaring log N.

  • As I said, storing the square of log N bits is not that bad.

  • It's much less than N for large N.

  • So if N is a billion, then log squared N is about 900.

  • Now why do we need only on the order of log squared N bits?

  • Well first of all,

  • if the window size is N bits, we never need any blocks of length greater than N.

  • A count up to N can be stored in log base 2 of N bits.

  • Now how many counts do we need?

  • Well, there are only log base 2 of N block sizes: 1, 2, 4, 8, 16, and so on,

  • up to the largest power of 2 that is no larger than N.

  • So we never store more than two blocks of any size.

  • And as a result, we need to store at most

  • 2 log N counts, of at most log N bits each, and that's 2 log squared N bits.
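
To put numbers on that bound, a small computation (assuming N is a billion, as in the earlier example):

```python
import math

N = 10**9
log_n = math.ceil(math.log2(N))    # about 30
print(log_n ** 2)                  # about 900, the (log N)^2 figure quoted above
print(2 * log_n ** 2)              # 1800 bits for 2 log N counts of log N bits each
```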

  • Another good thing is that after each bit we do a limited amount of work.

  • We have to create a new block of length 1.

  • Then, for each of the lengths 1, 2, 4, 8, and so on,

  • we may have to drop a block of that length, or we may have to combine two blocks of

  • one length into one block of the next larger length.

  • But that means at most on the order of log N work in total, since there are log N sizes.

  • And the error is frequently small.

  • It can't be bigger than the count of the biggest block.

  • The one that is only partially in the region we're counting.

  • There's a problem with the scheme, however.

  • When the 1's are distributed evenly among all the regions of the stream,

  • the number of 1's in the ambiguous region can't be

  • more than half the total number of 1's in the region we want to count.

  • So our error is limited to 50%, but look what happens if all the ones in

  • the region we want to count are at the left end, and in particular,

  • are counted only by a block that is partially within the desired region.

  • Then the true count could be anything from 0 up to the full count of that block.

  • Anything we guess could be wildly wrong, and we'll never know.

  • We're therefore going to discuss a similar algorithm that preserves the good and

  • avoids the problem with uneven distribution of 1's.

  • We'll still divide the window into blocks, but instead of letting each block cover

  • a fixed segment of the string, we'll let each block cover a fixed number of 1's.

  • The sizes of the blocks will still be limited to the powers of 2.

  • That is 1, 2, 4, 8, and so on, but the notion of the size of a block changes.

  • Now the size of a block will be the number of 1's.

  • So we'll have blocks of size 1 to represent segments in

  • the stream that have a single 1.

  • Blocks of twice that size will represent two 1's and any number of 0's.

  • And then there will be blocks representing four 1's and any number of 0's and so on.

  • The advantage of this scheme is that if there are few 1's in the recent stream,

  • the blocks covering that region will stay small.

  • They will cover large parts of the stream, while their size, or

  • number of 1's, remains limited.

  • I decided to call the algorithm I'm going to describe the DGIM algorithm.

  • The initials refer to the four guys who invented this algorithm: Mayur Datar,

  • Aristides Gionis, Piotr Indyk, and Rajeev Motwani.

  • And in fact, this is a good time to stop and

  • remember Rajeev Motwani who died shortly after this algorithm was published.

  • He along with Gionis and Indyk is also responsible for

  • locality sensitive hashing which forms a major part of this course.

  • Like our earlier attempt at an algorithm,

  • DGIM stores on the order of (log²N) bits to represent an N-bit window.

  • There's an absolute guarantee of no more than 50% error in the answer to any query.

  • And if 50% is too much, you can reduce the error to anything greater than 0.

  • The algorithm becomes more complicated as the number of bits you need to

  • store grows, although the number of bits remains proportional to (log²N).

  • It's just the constant factor that grows in inverse proportion to

  • the desired error bound.

  • Okay, to begin the story we need to introduce the idea of a timestamp.

  • Every bit that arrives in the stream gets a timestamp.

  • You might think that we need an arbitrary number of bits to

  • represent the time stamp since there's no limit on how long the stream can be.

  • But it's really only necessary to represent timestamps modulo N,

  • the window size that is we can divide the timestamp by N and take the remainder.

  • The net effect is that timestamps start out at 0, 1, and so

  • on, up to N minus 1, and then go back to 0 again, 1, 2, and so on.

  • Regardless of where the window is in the stream,

  • its N bits will all have different timestamps.

  • We're going to partition the window of length N into buckets.

  • Each bucket is represented by a record, and

  • records can be stored in on the order of log N bits.

  • As we shall see, we only need on the order of log N buckets to represent the window,

  • so on the order of log squared N bits suffices.

  • The record contents are the following.

  • The timestamp of its end, that is, of

  • the most recently arrived bit in the bucket.

  • As I mentioned we'll record timestamps modulo N.

  • So we need log N bits to represent the timestamp.

  • The number of 1's between the beginning and the end of the segment.

  • We call this count of 1's the size of the bucket.

  • However, the number of 1's in the segment represented by a bucket must be

  • a power of 2.

  • That explains why we only need log log N bits to represent the count of 1's.

  • We can store the logarithm of the count instead of the count itself, since

  • we know that log base 2 of the count must be an integer.

  • The count itself can't be higher than N, so

  • its logarithm can't be higher than log base 2 of N.

  • Since the logarithm is an integer i, and

  • we only need log i bits to represent the i in binary, log log N bits suffice.

  • It really doesn't matter much, since we still need order log N

  • bits in the record for the bucket, just to store the timestamp of its end.
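
Putting the two fields together, a bucket record might look like this in Python (a sketch; the class and field names are ours, not from the lecture):

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    end_ts: int     # timestamp of the bucket's most recent 1, modulo N (log N bits)
    log_size: int   # log base 2 of the count of 1's (log log N bits suffice)

    def size(self):
        # Sizes are always powers of 2, so the count is recoverable exactly.
        return 1 << self.log_size
```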

  • The partition into buckets must obey the following rules.

  • There must be one or

  • two buckets of each allowed size, up to the maximum size we need.

  • Remember that the allowed sizes are the powers of 2.

  • No bit of the window is part of two buckets.

  • Some 0's in the stream may not belong to any bucket.

  • It doesn't matter.

  • But buckets can only increase in size as we go back in time.

  • The most recent part of the window is represented by the smallest buckets.

  • When the end timestamp of a bucket is more than N time units in the past,

  • it no longer represents part of the window, so

  • we delete it from the set of buckets whose records are stored.
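
These rules translate directly into an invariant we can check on a list of `Bucket` records (a sketch, assuming the list is ordered from oldest to newest):

```python
def buckets_are_valid(buckets):
    # Going from oldest to newest, bucket sizes may only shrink or stay equal.
    log_sizes = [b.log_size for b in buckets]
    if any(older < newer for older, newer in zip(log_sizes, log_sizes[1:])):
        return False
    # And there are never more than two buckets of any one size.
    return all(log_sizes.count(s) <= 2 for s in set(log_sizes))
```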

  • Here is a picture of what the partition of a stream into buckets might look like at

  • some point.

  • The most recent two 1's are in buckets of size 1 by themselves,

  • one here and one here.

  • Further back, the previous two 1's are grouped into a bucket of size 2.

  • Note that there might be two buckets of size 2, but

  • there could also be only one, as in this case.

  • Then going further back in time we see the previous four 1's in a bucket of size 4,

  • and the four 1's before that are also in a bucket of size 4.

  • Then we see two buckets of size 8.

  • And finally a bucket of size 16.

  • The end-time stamp of this bucket is still within the window of length N.

  • That's this.

  • Although its beginning is outside the window.

  • We still need this bucket, but any previous buckets have a time stamp that is

  • prior to the beginning of the current window, so we have deleted their records.

  • So let's see how we manage the buckets as bits arrive on the stream.

  • The first thing we're going to do

  • is worry about whether we need to drop the oldest bucket.

  • We need to keep, outside the bucket representation,

  • a count of the number of bits that have ever arrived in the stream.

  • However, we only need this count modulo N, so an extra log N bits is all we need.

  • When a new bit comes in, we increment that count.

  • Of course if the count reaches N then we set it back to 0.

  • That's how modular arithmetic works.

  • Now, look at the end-time of the oldest bucket.

  • If its time stamp agrees with the current time, then that time stamp is

  • really the current time minus N since we're computing all time stamps modulo N.

  • The entire oldest bucket is therefore out of the window and we delete its record.

  • But if the time stamp is anything else, then the oldest bucket still has

  • its end within the window, so it remains.
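
That deletion test is short in code. A sketch, using the `Bucket` record from earlier, a list ordered oldest first, and a `current_time` already reduced modulo n (the function name is ours):

```python
def drop_expired(buckets, current_time, n):
    # With timestamps kept modulo n, equality means the oldest bucket's
    # end is exactly n steps old, so it has just left the window.
    if buckets and buckets[0].end_ts == current_time % n:
        buckets.pop(0)
```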

  • What we do next depends on whether the bit that just entered is 0 or 1.

  • If it's 0, then we make no further changes to the set of buckets.

  • That was easy.

  • If the current input is 1, we have some work to do.

  • But the work is at most logarithmic in the window size N.

  • First we create a new bucket for the new bit.

  • The size of the bucket is 1, and its ending timestamp is the current time.

  • There might have been one or two buckets of size 1 previously.

  • If there's only one, now there are two, and that's fine.

  • We are allowed to have one or two of any size.

  • But if there were previously two, now there are three.

  • We can't have three buckets of size 1 so

  • we combine the oldest two into one bucket of size 2.

  • Combining consecutive buckets of the same size is easy.

  • We add 1 to the logarithm of the size, and we take the end timestamp

  • to be the end timestamp of the more recent of the two.

  • So, for example, here are two consecutive buckets of some size, let's say 2 to the x.

  • We combine them into one bucket of size 2 to the x plus 1:

  • this bucket simply gets this ending time; I just copy it from here.

  • And we add 1 to the logarithm of the size, which says

  • the size has doubled, and then we just make the older record go away.
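
In code, that combining step is tiny. A sketch using the `Bucket` record from earlier (the function name is ours):

```python
def combine(older, newer):
    # Two consecutive buckets of the same size merge into one of twice
    # the size; the end timestamp comes from the more recent bucket.
    assert older.log_size == newer.log_size
    return Bucket(end_ts=newer.end_ts, log_size=older.log_size + 1)
```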

  • But our work might not be over.

  • If we had to create a bucket of size 2 we might now have three of that size.

  • So we combine the earliest two into one bucket of size 4.

  • And the problem could ripple through the sizes.

  • If we just created a new bucket of size 4, then we could have three buckets of

  • size 4.

  • We need to combine the earliest two into a bucket of size 8 and so on.

  • But because we're doubling the bucket size each time we pass the problem to

  • the next level, after log N fix-ups we've reached a bucket size as large as

  • the entire window, and there's never a need for a larger bucket.

  • The rippling effect therefore stops after at most log N rounds.

  • And each round takes a constant amount of work.

  • So O(logN) is a guaranteed upper bound on the total time needed

  • to process an incoming 1.

  • Usually the time required is much less, and on average, it is constant.
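
Here is what the whole per-bit update might look like in Python, combining the pieces sketched so far (`Bucket`, `drop_expired`, and `combine` from above; this is our illustrative reconstruction, not code from the lecture):

```python
def dgim_add_bit(buckets, bit, current_time, n):
    current_time = (current_time + 1) % n      # timestamps live modulo n
    drop_expired(buckets, current_time, n)     # maybe retire the oldest bucket
    if bit == 1:
        buckets.append(Bucket(end_ts=current_time, log_size=0))
        # Ripple: while three buckets share a size, merge the oldest two
        # of them into one bucket of twice that size; at most log n rounds.
        log_size = 0
        while sum(b.log_size == log_size for b in buckets) == 3:
            i = next(j for j, b in enumerate(buckets) if b.log_size == log_size)
            buckets[i] = combine(buckets[i], buckets.pop(i + 1))
            log_size += 1
    return current_time
```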

  • On this slide, we'll see the changes that occur as bits enter the system.

  • So here's the initial state of the window.

  • A 1 enters.

  • We create a bucket of size 1 for it, this.

  • But now we have three buckets of size 1.

  • So we're going to have to combine the two earliest ones.

  • This one.

  • And that one.

  • Okay, so here we've done the combination.

  • What has happened in terms of the records is that the record for

  • this bucket is deleted.

  • The size for this record has changed from 1 to 2.

  • And its time stamp has not changed; it has therefore actually become this record.

  • And notice that

  • that 1 really belongs inside the bucket; the slide doesn't show it perfectly.

  • Now I'm showing what happens after another 1 0 1 arrives.

  • The first of these 1's created this

  • bucket, and then the 0 came in, and nothing changed.

  • And then this next 1 arrives.

  • And now, we get a third bucket of size 1.

  • Okay, so that causes these two buckets to get combined into this guy.

  • And now we have three buckets of size 2.

  • So we have to combine these two;

  • that 1 really belongs in the middle bucket of size 2.

  • So, we combine these two into a bucket of size 4.

  • And that made three buckets of size 4, so these got combined into

  • that bucket of size 8. But that made a third bucket of size 8,

  • so two of the buckets of size 8 got combined into that bucket of size 16,

  • and now there can't be more buckets of size 16.

  • There's this one, but it extends beyond the end of the window.

  • So we're done rippling changes to larger and larger buckets.

  • Now I want to explain how to query the system.

  • So suppose we want to know how many 1's there are in the last k

  • bits where k is any integer less than or equal to N, the window size.

  • First thing we want to do is to ignore all buckets whose ending timestamp is

  • earlier than k bits prior to the current time.

  • Those buckets are all outside the range we want to count so

  • they make no contribution.

  • Start by summing the sizes of all the buckets except the oldest bucket that is

  • still in the range we are interested in.

  • Then add half the size of that bucket.

  • The reason we only add half the oldest bucket's size is that we really don't know

  • how many 1's from that bucket are still within the range of interest.

  • By guessing half, we minimize the maximum error as we'll discuss on the next slide.
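
A sketch of that query procedure in Python, continuing the reconstruction above (the in-range test uses the modulo-N age of each bucket's end):

```python
def dgim_count(buckets, k, current_time, n):
    # A bucket contributes only if its end arrived within the last k steps.
    in_range = [b for b in buckets if (current_time - b.end_ts) % n < k]
    if not in_range:
        return 0
    oldest = in_range[0]                       # list is ordered oldest first
    return sum(b.size() for b in in_range[1:]) + oldest.size() / 2
```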

  • So here is why the estimate can't be off by a factor of

  • more than 50% from the true answer.

  • For suppose that the oldest bucket in the range we're interested in has size 2^i.

  • We assume half, or

  • 2^(i-1), of its 1's are among the most recent k bits to arrive.

  • The true number could be anything between 1 and 2^i, so

  • our error is upper bounded by 2^(i-1).

  • Now what's the smallest the true answer could be?

  • There is at least one bucket of each of the sizes less than

  • 2^i that lies completely within the last k bits.

  • These account for at least 1 plus 2

  • plus 4, and so on, up to 2^(i-1).

  • And that sum is 2^i minus 1.

  • Now we add 1 for the 1 that is at the end of the oldest bucket.

  • That bucket has an ending timestamp that's within range.

  • And buckets always end in a 1, so

  • there are at least 2^i 1's within the range.

  • And since our error is no more than 2^(i-1),

  • that error is at most 50%.
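
To tie the sketches together, here is a quick self-check that the 50% guarantee holds, comparing our reconstruction against the exact counter from the start of the lecture (again our code, with an assumed window of 256 bits):

```python
import random

random.seed(1)
n, t, buckets = 256, 0, []
exact = ExactWindowCounter(n)
for _ in range(10_000):
    bit = random.getrandbits(1)
    t = dgim_add_bit(buckets, bit, t, n)
    exact.add(bit)
    k = random.randint(1, n)
    estimate, truth = dgim_count(buckets, k, t, n), exact.count_last(k)
    assert abs(estimate - truth) <= truth / 2   # never off by more than 50%
```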

  • We're not going to discuss the extensions here but

  • it is possible to modify the algorithm described to limit the error to

  • any fraction we like greater than 0, while still using only on the order of

  • log squared N bits to represent all the buckets we need to represent.

  • The textbook describes how to do this.
