## Subtitles

• We're now going to

• take up a particular problem that has a very non-trivial solution.

• We assume the stream elements are bits, zeros or

• ones, and we want to know how many of the last N bits are 1.

• If we can store the most recent N bits,

• we can use a solution like the one discussed for averages.

• In fact the algorithm would be even simpler.

• However, we're going to address the situation where N bits don't fit

• in main memory.

• Perhaps N is a trillion or N is a reasonable number but

• there are so many streams that we can't store complete windows for all of them.

• If we want exact answers then we can show that it is impossible to do

• anything better than to store the entire window.

• However, what is interesting is that we can store on the order of

• the square of log N bits, where N is the window size, and

• still answer queries about the count of 1's with answers that are off by at

• most a small factor, as small as we like if we do enough work.

• The problem we're going to discuss is this.

• We're given a stream of 0's and 1's.

• At any time we have to be prepared to answer a query of the form.

• How many of the last k bits were one?

• Here k can be any integer from one up to some large upper limit N.

• We can surely do this if we use a window of size N and store the last N bits.

• When a new bit arrives, we throw away the oldest bit in the window since it

• can never again be useful to answer one of these queries.

• But one disadvantage of this approach is that answering one query requires that we

• examine k bits.

• Since k can be quite large, and both inputs and

• queries may be arriving very rapidly, that may be time we cannot afford.

• Another potential problem is that we may not be able to afford the space.

• As we just mentioned we could be trying to handle a large number of streams or

• N could be so large that even one window does not fit in main memory.

• Both these concerns suggest that we should consider a method that uses less

• than N space.

• And that also allows us to answer queries about the last k

• bits much faster than on the order of k.

• It turns out that we can't get an exact answer to queries without using N

• bits in the window.

• But we can get close using much less space than all of N,

• and also much less time than all of k.

• We're going to introduce the right algorithm with the discussion of

• something that seems like it should work but doesn't quite.

• Our goal will be to be off by no more than a factor of 2 in

• estimating the number of ones in the last k bits.

• So we will summarize blocks of the stream; blocks will have exponentially

• increasing lengths, that is, 1, 2, 4, 8, 16, and so on.

• And the summary of a block will be simply the count of

• the number of ones in that block.

• When we want to know the count of ones for last k bits, we can find

• some blocks that lie wholly within the last k bits and we add up their counts.

• It is only the last block, the one furthest back in time that gives us pause.

• We don't know how many of its 1's lie within the last k bits, so we have to guess.

• But if we've created these exponentially growing blocks for

• all time units then there would be as many blocks of length one as there are bits in

• the window, as well as blocks of size two, four, eight, and so on.

• So that saves us nothing.

• Instead, we have to drop a block if its left end, that is, the end that is

• earliest in time, coincides with the left end of a larger block.

• And we also drop a small block if there's a larger block completely to its right,

• that is, later in the stream.

• As a result, you never have more than two blocks of any one size.
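As a sketch of how these retention rules might be maintained (this is our own illustration, not code from the lecture), one can treat the blocks like digits of a binary counter: each new bit becomes a block of length 1, and whenever a third block of some length appears, the two oldest blocks of that length are replaced by one summary block of twice the length. The result is never more than two blocks of any one size.

```python
from collections import Counter

def add_bit(blocks, bit):
    """blocks: list of (length, ones_count) pairs, newest first.
    Sketch of keeping at most two summary blocks per power-of-2 length."""
    blocks.insert(0, (1, bit))  # every arriving bit starts a new length-1 block
    length = 1
    while sum(1 for l, _ in blocks if l == length) > 2:
        # merge the two oldest blocks of this length into one of twice the length
        idxs = [i for i, (l, _) in enumerate(blocks) if l == length]
        i, j = idxs[-2], idxs[-1]
        merged = (2 * length, blocks[i][1] + blocks[j][1])
        del blocks[j]
        del blocks[i]
        blocks.insert(i, merged)
        length *= 2

blocks = []
for b in [1, 0, 1, 1, 0, 1, 0, 0]:
    add_bit(blocks, b)
# The blocks now cover all 8 bits, record 4 ones in total,
# and no length occurs more than twice.
```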

• So, here is an example of the blocks we might retain at some time.

• The five rows of blocks are of lengths 1, 2, 4, 8 and 16.

• Okay, there are two blocks of length 1.

• The more recent has a count of 0 because it consists of a single 0.

• That's this.

• While the other has a count of 1 because it consists of a single 1.

• Here's a block of length 2.

• That has a count of 1 because it represents 0 1.

• That is, these two bits.

• Notice that we've previously deleted the block of length 1 that would go here,

• because it begins at the same point as the block of length 2 above it.

• Also all other blocks of length 1 are deleted because they have

• a block of length 2 completely to their right.

• We also show a second block of length 2.

• Its count is 2 because it represents this 1 1.

• There are two blocks of length 4 and they have counts of 2 and 3.

• They represent, well, this guy represents this sequence, 0 0 1 1, so it has two 1's.

• This represents 1 0 1 1 and

• therefore gets a count of 3.

• We see one block of length 8.

• Its count is 4.

• Well let's see, because it represents these eight bits.

• And notice that the count for the second block of length 8 is not

• needed, because we can figure out that it has six 1's.

• That 6 is the difference between the number of

• 1's in this block of length 16 and that block of length 8.

• That is, 10 minus 4 equals 6.

• So if this block existed, it would surely have six 1's.

• Okay.

• Now, suppose we get a query for how many ones there are in the most recent 28 bits.

• We can add up the counts of certain blocks.

• Some little blocks at the right end, and then some bigger blocks going to the left.

• We want to pick blocks so that each of the most recent 28 bits is covered by

• exactly one of the blocks we choose.

• So, we pick this block of length 1.

• This block of length 2.

• This of length 4.

• We don't want this block of length 8 because we have this block of length

• 16 and that's still all within the last 28.

• So far we have covered 23 bits, and

• we know that among them the number of

• 1's is 0 plus 1 plus 2 plus 10,

• which is 13. But what do we do about the oldest five bits?

• That is, there are these bits. Now, if we could see the bits we would know

• that they're 0 0 1 0 1, and therefore they have two 1's, but we don't see them.

• All we see is that they are part of this block of 16, and

• we know that block has a count of six,

• but we can't tell how many of those six are in the most recent five positions.

• Again, we don't ever get to see this anymore.

• Now if we could see them of course we would know there were two and

• that the right answer is 15.

• But we need to estimate, without seeing how many ones there are in this region.

• If we guess that half the count of the block, that is 3,

• 6 divided by 2 in this case, is in the region

• we don't see, then we would guess 16, and

• that's only off by 7%, so that's not bad at all.

• We could even try a proportional guess.

• That is, 6 divided by 16 is the probability that any given bit

• is 1 in the range represented by this block of 16,

• and we know that we have to count five of them, so

• that's 30 divided by 16, which is roughly 2,

• and so if we guess 2 and add that, we would get 15.

• And that happens to be right on the mark, even though we didn't get to see

• the five bits that we wanted to count.

• This strategy has a lot to recommend it.

• Okay, first, it stores only on the order of the square of log N bits.

• I might comment that we use the expression log²N

• to mean the square of log N.

• This is a common expression.

• You don't want to write it as log N², because that's really 2 log N,

• which is not what we mean.

• If you've never seen this notation before, putting

• the square above the log means that you're actually squaring the whole thing,

• that is, squaring log N.

• As I said, storing on the order of the square of log N bits is not that bad.

• It's much less than N for large N.

• So if N is a billion, then log squared N is about 900.

• Now why do we need only on the order of log squared N bits?

• Well first of all,

• if the window size is N bits, we never need any blocks of length greater than N.

• A count up to N can be stored in log base 2 of N bits.

• Now how many counts do we need?

• Well, there are only log base 2 of N block sizes: 1, 2, 4, 8, 16, and so on,

• up to the largest power of 2 that is no larger than N.

• So we never store more than two blocks of any size.

• And as a result, we need to store at most

• 2 log N counts, of at most log N bits each, and that's 2 log squared N bits.
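The bound is easy to check numerically, for instance with N equal to a billion (the constant factor of 2 here follows the counting above):

```python
from math import ceil, log2

N = 10**9
bits_per_count = ceil(log2(N))   # a count up to N fits in about 30 bits
num_sizes = ceil(log2(N))        # block lengths 1, 2, 4, ..., up to N
total = 2 * num_sizes * bits_per_count   # at most two blocks per size
# 2 * 30 * 30 = 1800 bits of counts -- tiny compared with N itself
```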

• Another good thing is that after each bit we do a limited amount of work.

• We may have to create a new block of length 1, and then, for

• each of the lengths 1, 2, 4, 8, and so on,

• we may have to drop a block of that length or combine two blocks of

• one length into one block of the next larger length.

• But that is at most order log N work in total, since there are only log N sizes.

• And the error is frequently small.

• It can't be bigger than the count of the biggest block.

• The one that is only partially in the region we're counting.

• There's a problem with the scheme, however.

• When the 1's are distributed evenly among all the regions of the stream,

• the number of 1's in the ambiguous region can't be

• more than half the total number of 1's in the region we want to count.

• So our error is limited to 50%, but look what happens if all the ones in

• the region we want to count are at the left end, and in particular,

• are counted only by a block that is partially within the desired region.

• Then the true count could be anything from 0 up to the full count of that block.

• Anything we guess could be wildly wrong, and we'll never know.

• We're therefore going to discuss a similar algorithm that preserves the good and

• avoids the problem with uneven distribution of 1's.

• We'll still divide the window into blocks, but instead of letting each block cover

• a fixed segment of the string, we'll let each block cover a fixed number of 1's.

• The sizes of the blocks will still be limited to the powers of 2.

• That is 1, 2, 4, 8, and so on, but the notion of the size of a block changes.

• Now the size of a block will be the number of 1's.

• So we'll have blocks of size 1 to represent segments in

• the stream that have a single 1.

• Blocks with twice that size will represent two 1's and any number of 0's.

• And then there will be blocks representing four 1's and any number of 0's and so on.

• The advantage of this scheme is that if there are few 1's in the recent stream,

• the block sizes covering that region will stay small.

• The blocks will cover large parts of the stream, while their size, or

• number of 1's, remains limited.

• I decided to call the algorithm I'm going to describe the DGIM algorithm.

• The initials refer to the four guys who invented this algorithm: Mayur Datar,

• Aristides Gionis, Piotr Indyk, and Rajeev Motwani.

• And in fact, this is a good time to stop and

• remember Rajeev Motwani who died shortly after this algorithm was published.

• He along with Gionis and Indyk is also responsible for

• locality sensitive hashing which forms a major part of this course.

• Like our earlier attempt at an algorithm,

• DGIM stores on the order of log²N bits to represent an N-bit window.

• There's an absolute guarantee of no more than 50% error in the answer to any query.

• And if 50% is too much, you can reduce the error to anything greater than 0.

• The algorithm becomes more complicated, and the number of bits you need to

• store grows, although the number of bits remains proportional to log²N.

• It's just the constant factor that grows in inverse proportion to

• the desired error bound.

• Okay, to begin the story we need to introduce the idea of a timestamp.

• Every bit that arrives in the stream gets a timestamp.

• You might think that we need an arbitrary number of bits to

• represent the time stamp since there's no limit on how long the stream can be.

• But it's really only necessary to represent timestamps modulo N,

• the window size; that is, we can divide the timestamp by N and take the remainder.

• The net effect is that the timestamps start out at 0, 1, and so

• on up to N minus 1, and then go to 0 again, 1, 2, and so on.

• Regardless of where the window is in the stream,

• its N bits will all have different timestamps.
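A quick illustration of the wraparound (with a toy window size; a real N could be a billion):

```python
N = 8  # toy window size for illustration

# Timestamps are stored modulo N, so they cycle 0, 1, ..., N-1, 0, 1, ...
stamps = [t % N for t in range(20)]

# Yet within any window of N consecutive arrivals, all stored
# timestamps are different, so they still identify positions uniquely.
window = stamps[5:5 + N]
```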

• We're going to partition the window of length N into buckets.

• Each bucket is represented by a record, and

• records can be stored in on the order of log N bits.

• As we shall see, we only need on the order of log N buckets to represent the window,

• so on the order of log squared N bits suffices.

• The record contents are the following.

• The timestamp of its end, that is, of

• the most recently arrived bit in the bucket.

• As I mentioned we'll record timestamps modulo N.

• So we need log N bits to represent the timestamp.

• The number of 1's between the beginning and the end of the segment.

• We call this count of 1's the size of the bucket.

• However, the number of 1's in the segment represented by a bucket must be

• a power of 2.

• That explains why we only need log log N bits to represent the count of 1's.

• We can store the logarithm of the count instead of the count itself, since

• we know that log base 2 of the count must be an integer.

• The count itself can't be higher than N, so

• its logarithm can't be higher than log base 2 of N.

• Since the logarithm is an integer i, and

• we only need log i bits to represent i in binary, log log N bits suffice.

• It really doesn't matter much, since we still need order log N

• bits in the record for the bucket, just to store the timestamp of its end.
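Putting the record together, a bucket might be represented like this (a sketch with names of our own choosing, not the lecture's code):

```python
from dataclasses import dataclass

@dataclass
class Bucket:
    end_timestamp: int  # timestamp of the bucket's most recent bit, modulo N
    log2_size: int      # sizes are powers of 2, so store just the exponent

    def size(self) -> int:
        """Number of 1's the bucket represents."""
        return 1 << self.log2_size

# A bucket ending at (relative) time 101 that represents four 1's:
b = Bucket(end_timestamp=101, log2_size=2)
# b.size() == 4
```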

• The partition into buckets must obey the following rules.

• There must be one or

• two buckets of each allowed size, up to the maximum size we need.

• Remember that the allowed sizes are the powers of 2.

• No bit of the window is part of two buckets.

• Some 0's in the stream may not belong to any bucket.

• It doesn't matter.

• But buckets