
  • We're now going to

  • take up a particular problem that has a very non-trivial solution.

  • We assume the stream elements are bits, 0's or 1's, and

  • we want to know how many of the last N bits are 1's.

  • If we can store the most recent N bits,

  • we can use a solution like the one discussed for averages.

  • In fact the algorithm would be even simpler.

  • However, we're going to address the situation where N bits don't fit

  • in main memory.

  • Perhaps N is a trillion, or N is a reasonable number but

  • there are so many streams that we can't store complete windows for all of them.

  • If we want exact answers then we can show that it is impossible to do

  • anything better than to store the entire window.

  • However, what is interesting is that we can store on the order of

  • the square of log N bits, where N is the window size, and

  • still answer queries about the count of 1's with answers that are off by at

  • most a small factor, as small as we like if we do enough work.

  • The problem we're going to discuss is this.

  • We're given a stream of 0's and 1's.

  • At any time we have to be prepared to answer a query of the form:

  • How many of the last k bits were 1?

  • Here k can be any integer from one up to some large upper limit N.

  • We can surely do this if we use a window of size N and store the last N bits.

  • When a new bit arrives, we throw away the oldest bit in the window, since it

  • can never again be useful to answer one of these queries.
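
  • As a concrete reference point, here is a minimal sketch of this exact approach in Python; the deque representation and the names are my choices, not from the lecture.

    from collections import deque

    N = 1000                      # window size (illustrative)
    window = deque(maxlen=N)      # the oldest bit falls off automatically

    def new_bit(b):
        window.append(b)

    def query(k):
        # exact count of 1's among the last k bits; costs O(k) time per query
        return sum(list(window)[-k:])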

  • But one disadvantage of this approach is that answering one query requires that we

  • examine k bits.

  • Since k can be quite large, and both inputs and

  • queries may be arriving very rapidly, that may be time we cannot afford.

  • Another potential problem is that we may not be able to afford the space.

  • As we just mentioned, we could be trying to handle a large number of streams, or

  • N could be so large that even one window does not fit in main memory.

  • Both these concerns suggest that we should consider a method that uses less

  • than N space, and that also allows us to answer queries about the last k

  • bits much faster than on the order of k.

  • It turns out that we can't get an exact answer to queries without using N

  • bits in the window.

  • But we can get close using much less space than all of N,

  • and also much less time than all of k.

  • We're going to introduce the right algorithm with a discussion of

  • something that seems like it should work but doesn't quite.

  • Our goal will be to be off by no more than a factor of 2 in

  • estimating the number of ones in the last k bits.

  • So we will summarize blocks of the stream, where blocks have exponentially

  • increasing lengths, that is, 1, 2, 4, 8, 16, and so on.

  • And the summary of a block will be simply the count of

  • the number of ones in that block.

  • When we want to know the count of 1's for the last k bits, we can find

  • some blocks that lie wholly within the last k bits and we add up their counts.

  • It is only the last block, the one furthest back in time that gives us pause.

  • We don't know how many of its 1's are within the last k bits, so we have to guess.

  • But if we've created these exponentially growing blocks for

  • all time units then there would be as many blocks of length one as there are bits in

  • the window, as well as blocks of size 2, 4, 8, and so on.

  • So that saves us nothing.

  • Instead, we have to drop a block if its left end, that is, the end that is

  • earliest in time, coincides with the left end of a larger block.

  • And we also drop a small block if there's a larger block completely to its right,

  • that is, later in the stream.

  • As a result, you never have more than two blocks of any one size.

  • So, here is an example of the blocks we might retain at some time.

  • The five rows of blocks are of lengths 1, 2, 4, 8 and 16.

  • Okay, there are two blocks of length 1.

  • The more recent has a count of 0 because it consists of a single 0.

  • That's this.

  • While the other has a count of 1 because it consists of a single 1.

  • Here's a block of length 2.

  • That has a count of 1 because it represents 0 1.

  • That is, these two bits.

  • Notice that we've previously deleted the block of length 1 that would go here,

  • because it begins at the same point as the block of length 2 above it.

  • Also all other blocks of length 1 are deleted because they have

  • a block of length 2 completely to their right.

  • We also show a second block of length 2.

  • Its count is 2 because it represents this 1 1.

  • There are two blocks of length 4, and they have counts of 2 and 3.

  • This one represents the sequence 0 0 1 1, so it has two 1's.

  • This represents 1 0 1 1 and

  • therefore gets a count of 3.

  • We see one block of length 8.

  • Its count is 4.

  • That's because it represents these eight bits.

  • And notice that the count for the second block of length 8 is not

  • needed, because we can figure out that it has six 1's.

  • That 6 is the difference between the number of

  • 1's in this block of length 16 and that block of length 8,

  • that is, 10 minus 4 equals 6.

  • So if this block existed, it would surely have six 1's.

  • Now, suppose we get a query for how many ones there are in the most recent 28 bits.

  • We can add up the counts of certain blocks.

  • Some little blocks at the right end, and then some bigger blocks going to the left.

  • We want to pick blocks so that each of the most recent 28 bits is covered by

  • exactly one of the blocks we choose.

  • So, we pick this block of length 1.

  • This block of length 2.

  • This of length 4.

  • We don't want this block of length 8 because we have this block of length

  • 16 and that's still all within the last 28.

  • So far we have covered 23 bits, and

  • we know that among them the number of

  • 1's is 0 plus 1 plus 2 plus 10,

  • which is 13. But what do we do about the oldest five bits?

  • That is, there are these bits. Now, if we could see the bits, we would know

  • that they're 0 0 1 0 1 and therefore have two 1's, but we don't see them.

  • All we see is that they are part of this block of 16, and

  • we know that block has a count of six,

  • but we can't tell how many of those six are in the most recent five positions.

  • Remember, we don't ever get to see these bits again.

  • Now if we could see them, of course, we would know there were two, and

  • that the right answer is 15.

  • But we need to estimate, without seeing them, how many 1's there are in this region.

  • If we guess that half the count of the block, that is 3,

  • or 6 divided by 2 in this case, is in the region

  • we don't see, then we would guess 16, and

  • that's only off by about 7%, so that's not bad at all.

  • We could even try a proportional guess.

  • We know that 6 divided by 16 is the probability that any given bit

  • is 1 in the range represented by this block of 16,

  • and we have to account for five of those bits, so

  • the estimate is 5 times 6 divided by 16, that is 30 divided by 16, which is roughly 2.

  • So if we guessed 2 and added that in, we would get 15.

  • And that happens to be right on the mark, even though we didn't get to see

  • the five bits that we wanted to count.

  • This strategy has a lot to recommend it.

  • First, it stores only on the order of the square of log N bits.

  • I might comment that we use the expression log²N

  • to mean the square of log N.

  • This is a common notation.

  • You don't want to write it as log N², because that really means log of N squared,

  • which is 2 log N, and that is not what we mean.

  • If you've never seen this notation before, putting the square

  • right after the log means that you're squaring the whole thing,

  • that is, (log N)².

  • As I said, storing on the order of log²N bits is not that bad.

  • It's much less than N for large N.

  • So if N is a billion, then log²N is about 900.

  • Now why do we need only on the order of log squared N bits?

  • Well first of all,

  • if the window size is N bits, we never need any blocks of length greater than N.

  • A count up to N can be stored in log base 2 of N bits.

  • Now how many counts do we need?

  • Well, there are only log base 2 of N block sizes: 1, 2, 4, 8, 16, and so on,

  • up to the largest power of 2 that is no larger than N.

  • And we never store more than two blocks of any size.

  • As a result, we need to store at most

  • 2 log N counts, of at most log N bits each, and that's 2 log²N bits.
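
  • As a sanity check on these numbers, here is the arithmetic in Python for N equal to a billion; this is just my calculation of the bound quoted above.

    import math

    N = 10**9
    lg = math.log2(N)      # about 30
    print(lg * lg)         # log²N: about 900
    print(2 * lg * lg)     # 2 log N counts of log N bits each: about 1800 bits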

  • Another good thing is that after each bit we do a limited amount of work.

  • We have to create a new block of length 1.

  • Then, for each of the lengths 1, 2, 4, 8, and so on,

  • we may have to drop a block of that length, or we may have to combine two blocks of

  • one length into one block of the next larger length.

  • But that means at most order log N work in total, since there are log N sizes.
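
  • Here is a minimal Python sketch of that per-bit maintenance, assuming each block is a (length, count) pair kept oldest first; realizing the dropping rules as a merge of the two oldest blocks of a size is my reading of the scheme, not code from the lecture.

    def add_bit(blocks, bit):
        """blocks: list of (length, ones_count) pairs, oldest first."""
        blocks.append((1, bit))      # every new bit starts as a length-1 block
        length = 1
        while sum(1 for L, _ in blocks if L == length) > 2:
            # combine the two oldest blocks of this length into one block
            # of the next larger length; their counts add
            i = next(j for j, (L, _) in enumerate(blocks) if L == length)
            (_, c1), (_, c2) = blocks[i], blocks[i + 1]
            blocks[i : i + 2] = [(2 * length, c1 + c2)]
            length *= 2
        return blocks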

  • And the error is frequently small.

  • It can't be bigger than the count of the biggest block,

  • the one that is only partially in the region we're counting.

  • There's a problem with the scheme, however.

  • When the 1's are distributed evenly among all the regions of the stream,

  • the number of 1's in the ambiguous region can't be

  • more than half the total number of 1's in the region we want to count.

  • So our error is limited to 50%, but look what happens if all the ones in

  • the region we want to count are at the left end, and in particular,

  • are counted only by a block that is partially within the desired region.

  • Then the true count could be anything from 0 up to the full count of that block.

  • Anything we guess could be wildly wrong, and we'll never know.

  • We're therefore going to discuss a similar algorithm that preserves the good and

  • avoids the problem with uneven distribution of 1's.

  • We'll still divide the window into blocks, but instead of letting each block cover

  • a fixed segment of the stream, we'll let each block cover a fixed number of 1's.

  • The sizes of the blocks will still be limited to the powers of 2.

  • That is 1, 2, 4, 8, and so on, but the notion of the size of a block changes.

  • Now the size of a block will be the number of 1's.

  • So we'll have blocks of size 1 to represent segments in

  • the stream that have a single 1.

  • Blocks of twice that size will represent two 1's and any number of 0's.

  • And then there will be blocks representing four 1's and any number of 0's and so on.

  • The advantage of this scheme is that if there are few 1's in the recent stream,

  • the blocks covering that region will stay small in size.

  • They will cover large parts of the stream, while their size, or

  • number of 1's, remains limited.
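
  • To illustrate, here is one possible bucketing of a short stretch of stream under the new scheme; the example is mine, not from the lecture.

    # Oldest bits on the left; bucket sizes are counts of 1's and are
    # powers of 2, while the segments may contain any number of 0's:
    #
    #   stream:   1 1 0 1 0 1 | 1 0 1 | 0 0 1 | 0 1
    #   buckets:  [  size 4  ] [size 2] [size 1] [size 1]
    #
    # A bucket covering a sparse region simply stretches over more 0's
    # while its size, the number of 1's, stays the same.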

  • I decided to call the algorithm I'm going to describe the DGIM algorithm.

  • The initials refer to the four guys who invented this algorithm: Mayur Datar,

  • Aristides Gionis, Piotr Indyk, and Rajeev Motwani.

  • And in fact, this is a good time to stop and

  • remember Rajeev Motwani who died shortly after this algorithm was published.

  • He, along with Gionis and Indyk, is also responsible for

  • locality sensitive hashing which forms a major part of this course.

  • Like our earlier attempt at an algorithm,

  • DGIM stores on the order of log²N bits to represent an N-bit window.

  • There's an absolute guarantee of no more than 50% error in the answer to any query.

  • And if 50% is too much, you can reduce the error to anything greater than 0.

  • The algorithm becomes more complicated and the number of bits you need to

  • store grows, although the number of bits remains proportional to log²N.

  • It's just the constant factor that grows in inverse proportion to

  • the desired error bound.

  • Okay, to begin the story we need to introduce the idea of a timestamp.

  • Every bit that arrives in the stream gets a timestamp.

  • You might think that we need an arbitrary number of bits to

  • represent the time stamp since there's no limit on how long the stream can be.

  • But it's really only necessary to represent timestamps modulo N,

  • the window size; that is, we can divide the timestamp by N and take the remainder.

  • The net effect is that the timestamps start out at 0, 1, and so

  • on up to N minus 1, and then go back to 0, 1, 2, and so on.

  • Regardless of where the window is in the stream,

  • its N bits will all have different timestamps.
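
  • In code the timestamp trick is a one-liner; a tiny sketch, with names of my choosing:

    N = 1_000_000                    # window size (illustrative)

    def timestamp(arrival_index):
        # 0, 1, ..., N-1, then back to 0; any N consecutive bits
        # receive N distinct timestamps
        return arrival_index % N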

  • We're going to partition the window of length N into buckets.

  • Each bucket is represented by a record, and

  • records can be stored in on the order of log N bits.

  • As we shall see, we only need on the order of log N buckets to represent the window,

  • so on the order of log squared N bits suffices.

  • The record contents are the following.

  • First, the timestamp of its end, that is,

  • of the most recently arrived bit in the bucket.

  • As I mentioned we'll record timestamps modulo N.

  • So we need log N bits to represent the timestamp.

  • Second, the number of 1's between the beginning and the end of the segment.

  • We call this count of 1's the size of the bucket.

  • However, the number of 1's in the segment represented by a bucket must be

  • a power of 2.

  • That explains why we only need log log N bits to represent the count of 1's.

  • We can store the logarithm of the count instead of the count itself, since

  • we know that log base 2 of the count must be an integer.

  • The count itself can't be higher than N, so

  • its logarithm can't be higher than log base 2 of N.

  • Since the logarithm is an integer i, and

  • we only need log i bits to represent i in binary, log log N bits suffice.
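
  • Putting the pieces together, here is a Python sketch of one bucket record under this representation; the field names are mine.

    from dataclasses import dataclass

    @dataclass
    class Bucket:
        end_timestamp: int   # timestamp of the bucket's most recent bit, modulo N
        log2_size: int       # exponent i such that the bucket holds 2**i 1's

        def size(self) -> int:
            return 1 << self.log2_size   # recover the actual count of 1's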

  • It really doesn't matter much, since we still need order log N

  • bits in the record for the bucket, just to store the