We're now going to take up a particular problem that has a very non-trivial solution. We assume the stream elements are bits, zeroes or ones, and we want to know how many bits in the last N are one. If we can store the most recent N bits, we can use a solution like the one discussed for averages. In fact, the algorithm would be even simpler. However, we're going to address the situation where N bits don't fit in main memory. Perhaps N is a trillion, or N is a reasonable number but there are so many streams that we can't store complete windows for all of them. If we want exact answers, then we can show that it is impossible to do anything better than to store the entire window. However, what is interesting is that we can store on the order of log squared N bits, where N is the window size, and still answer queries about the count of ones with answers that are off by at most a small factor, as small as we like if we do enough work. The problem we're going to discuss is this. We're given a stream of 0's and 1's. At any time we have to be prepared to answer a query of the form: how many of the last k bits were 1? Here k can be any integer from 1 up to some large upper limit N. We can surely do this if we use a window of size N and store the last N bits. When a new bit arrives, we throw away the oldest bit in the window, since it can never again be useful to answer one of these queries. But one disadvantage of this approach is that answering one query requires that we examine k bits. Since k can be quite large, and both inputs and queries may be arriving very rapidly, that may be time we cannot afford. Another potential problem is that we may not be able to afford the space. As we just mentioned, we could be trying to handle a large number of streams, or N could be so large that even one window does not fit in main memory. Both these concerns suggest that we should consider a method that uses less than N space.
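The exact approach just described can be sketched in a few lines. This is a minimal illustration, not code from the lecture; the class and method names are mine, and it assumes queries use k of at least 1.

```python
from collections import deque

class ExactWindowCounter:
    """Exact count of 1's among recent bits, storing all N window bits.
    Illustrative sketch; names are mine, not from the lecture."""

    def __init__(self, n):
        # deque with maxlen=n throws away the oldest bit automatically
        # when a new bit arrives and the window is full.
        self.window = deque(maxlen=n)

    def add(self, bit):
        self.window.append(bit)

    def query(self, k):
        # Count the 1's among the most recent k bits (assumes k >= 1).
        # Note this costs O(k) time per query -- the disadvantage
        # discussed above.
        k = min(k, len(self.window))
        return sum(list(self.window)[-k:])
```

For example, after feeding the bits 1, 0, 1, 1, 0, a query for the last 3 bits returns 2 and a query for the last 5 returns 3. The O(k) query time and O(N) space are exactly the costs the rest of this discussion tries to avoid.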
A better method would also allow us to answer queries about the last k bits much faster than on the order of k. It turns out that we can't get an exact answer to queries without using N bits for the window. But we can get close using much less space than N, and also much less time than k. We're going to introduce the right algorithm with a discussion of something that seems like it should work but doesn't quite. Our goal is to be off by no more than a factor of 2 in estimating the number of ones in the last k bits. So we will summarize blocks of the stream. Blocks will have exponentially increasing lengths, that is 1, 2, 4, 8, 16, and so on. And the summary of a block will be simply the count of the number of ones in that block. When we want to know the count of ones for the last k bits, we can find some blocks that lie wholly within the last k bits and add up their counts. It is only the last block, the one furthest back in time, that gives us pause. We don't know how many of its ones are within the last k bits, so we have to guess. But if we created these exponentially growing blocks for all time units, then there would be as many blocks of length 1 as there are bits in the window, as well as blocks of size 2, 4, 8, and so on. So that saves us nothing. Instead, we drop a block if its left end, that is, the end that is earliest in time, coincides with the left end of a larger block. And we also drop a small block if there's a larger block completely to its right, that is, later in the stream. As a result, we never have more than two blocks of any one size. So here is an example of the blocks we might retain at some time. The five rows of blocks are of lengths 1, 2, 4, 8 and 16. There are two blocks of length 1. The more recent has a count of 0, because it consists of a single 0, which is this. The other has a count of 1, because it consists of a single 1. Here's a block of length 2.
It has a count of 1 because it represents 0 1, that is, these two bits. Notice that we've previously deleted the block of length 1 that would go here, because it begins at the same point as the block of length 2 above it. Also, all other blocks of length 1 are deleted because they have a block of length 2 completely to their right. We also show a second block of length 2. Its count is 2 because it represents this 1 1. There are two blocks of length 4, and they have counts of 2 and 3. This one represents the sequence 0 0 1 1, so it has two 1's. This one represents 1 0 1 1 and therefore gets a count of 3. We see one block of length 8. Its count is 4, because it represents these eight bits. And notice that the count for a second block of length 8 is not needed, because we can figure out that it would have six ones: that 6 is the difference between the number of ones in this block of length 16 and that block of length 8, that is, 10 minus 4 equals 6. So if this block existed, it would surely have six ones. Now, suppose we get a query for how many ones there are in the most recent 28 bits. We can add up the counts of certain blocks: some little blocks at the right end, and then some bigger blocks going to the left. We want to pick blocks so that each of the most recent 28 bits is covered by exactly one of the blocks we choose. So we pick this block of length 1, this block of length 2, and this one of length 4. We don't want this block of length 8, because we have this block of length 16, and that's still all within the last 28. So far we have covered 23 bits, and we know that among them the number of 1's is 0 plus 1 plus 2 plus 10, which is 13. But what do we do about the oldest five bits? That is, these bits here. Now, if we could see those bits, we would know that they are 0 0 1 0 1 and therefore have two 1's, but we don't see them.
All we see is that they are part of this block of 16, and we know that block has a count of 6, but we can't tell how many of those six 1's are in the most recent five positions. Again, we don't ever get to see these bits anymore. If we could see them, of course, we would know there were two, and that the right answer is 15. But we need to estimate, without seeing them, how many ones there are in this region. If we guess that half the count of the block, that is 3, since 6 divided by 2 is 3, is in the region we don't see, then we would guess 16, and that's only off by about 7%, so that's not bad at all. We could even try a proportional guess. That is, we know that 6 divided by 16 is the probability that any given bit is 1 in the range represented by this block of 16, and we have to account for five bits, so that's 30 divided by 16, which is roughly 2. And so if we guessed 2 and added that, we would get 15. And that happens to be right on the mark, even though we didn't get to see the five bits that we wanted to count. This strategy has a lot to recommend it. First, it stores only on the order of the square of log N bits. I might comment that we use the expression log²N to mean the square of log N. This is a common expression. You don't want to write it as log N², because that's really 2 log N, which is not what we mean. So if you've never seen this notation before, putting the square right after the log means that you're actually squaring the whole thing, squaring log N. As I said, storing on the order of the square of log N bits is not that bad; it's much less than N for large N. So if N is a billion, then log squared N is about 900. Now, why do we need only on the order of log squared N bits? Well, first of all, if the window size is N bits, we never need any blocks of length greater than N.
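The arithmetic of the 28-bit query can be written out explicitly. This little sketch just reproduces the numbers from the example above; the variable names are mine.

```python
# Worked numbers from the 28-bit query example above.
full_block_counts = [0, 1, 2, 10]    # blocks of lengths 1, 2, 4, 16
exact_part = sum(full_block_counts)  # 13 ones in the 23 fully covered bits

partial_count = 6   # ones in the straddling block of length 16
overlap_bits = 5    # 28 - 23 of its bits fall inside the query range

# Guess 1: assume half the straddling block's ones are in the overlap.
half_guess = exact_part + partial_count // 2        # 13 + 3 = 16

# Guess 2: proportional -- each bit of that block is a 1 with
# probability 6/16, and 5 of its bits are in the overlap.
proportional_guess = exact_part + round(overlap_bits * partial_count / 16)
# 13 + round(30/16) = 13 + 2 = 15
```

The half-count guess gives 16 and the proportional guess gives 15, matching the discussion above; the true answer happens to be 15.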
A count up to N can be stored in log base 2 of N bits. Now, how many counts do we need? Well, there are only log base 2 of N block sizes: 1, 2, 4, 8, 16 and so on, up to the largest power of 2 that is no larger than N. And we never store more than two blocks of any size. As a result, we need to store at most 2 log N counts, of at most log N bits each, and that's 2 log squared N. Another good thing is that after each bit we do a limited amount of work. We have to create a new block of length 1, and for each of the lengths 1, 2, 4, 8, and so on, we may have to drop a block of that length, or we may have to combine two blocks of one length into one block of the next larger length. But that is at most order log N work total, since there are log N sizes. And the error is frequently small. It can't be bigger than the count of the biggest block, the one that is only partially in the region we're counting. There's a problem with the scheme, however. When the 1's are distributed evenly among all the regions of the stream, the number of 1's in the ambiguous region can't be more than half the total number of 1's in the region we want to count, so our error is limited to 50%. But look what happens if all the ones in the region we want to count are at the left end, and in particular are counted only by a block that is partially within the desired region. Then the true count could be anything from 0 up to the full count of that block. Anything we guess could be wildly wrong, and we'll never know. We're therefore going to discuss a similar algorithm that preserves the good and avoids the problem with uneven distribution of 1's. We'll still divide the window into blocks, but instead of letting each block cover a fixed segment of the stream, we'll let each block cover a fixed number of 1's. The sizes of the blocks will still be limited to the powers of 2, that is 1, 2, 4, 8, and so on, but the notion of the size of a block changes.
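The failure mode of the fixed-segment scheme can be seen with a tiny numeric sketch. The numbers here are purely illustrative, chosen by me, not taken from the lecture.

```python
# Failure mode of the fixed-segment scheme: all the ones in the query
# range sit in the one block that only partially overlaps it.
fully_covered_ones = 0     # blocks wholly inside contribute no ones
partial_block_count = 64   # ones in the straddling block

# The half-count guess from before:
guess = fully_covered_ones + partial_block_count // 2   # 32

# But the true answer could be anything from 0 to 64, so no matter
# what we guess, we can be off by at least half the block's count.
worst_case_error = max(guess - 0, partial_block_count - guess)  # 32
```

With every guess potentially off by half the largest block's count, and no bound relating that count to the true answer, the relative error is unbounded. That is the problem the DGIM buckets, defined by a fixed number of 1's rather than a fixed segment, are designed to fix.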
Now the size of a block will be the number of 1's. So we'll have blocks of size 1 to represent segments of the stream that have a single 1. Blocks of twice that size will represent two 1's and any number of 0's. And then there will be blocks representing four 1's and any number of 0's, and so on. The advantage of this scheme is that if there are few 1's in the recent stream, the sizes of the blocks covering that region will stay small. They can cover large parts of the stream while their size, the number of 1's, remains limited. I decided to call the algorithm I'm going to describe the DGIM algorithm. The initials refer to the four guys who invented this algorithm: Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. And in fact, this is a good time to stop and remember Rajeev Motwani, who died shortly after this algorithm was published. He, along with Gionis and Indyk, is also responsible for locality-sensitive hashing, which forms a major part of this course. Like our earlier attempt at an algorithm, DGIM stores on the order of log²N bits to represent an N-bit window. There's an absolute guarantee of no more than 50% error in the answer to any query. And if 50% is too much, you can reduce the error to anything greater than 0. The algorithm becomes more complicated and the number of bits you need to store grows, although the number of bits remains proportional to log²N. It's just the constant factor that grows, in inverse proportion to the desired error bound. To begin the story, we need to introduce the idea of a timestamp. Every bit that arrives in the stream gets a timestamp. You might think that we need an arbitrary number of bits to represent the timestamp, since there's no limit on how long the stream can be. But it's really only necessary to represent timestamps modulo N, the window size. That is, we can divide the timestamp by N and take the remainder.
The net effect is that timestamps start out at 0, 1, and so on up to N minus 1, and then go back to 0 again, 1, 2, and so on. Regardless of where the window is in the stream, its N bits will all have different timestamps. We're going to partition the window of length N into buckets. Each bucket is represented by a record, and records can be stored in on the order of log N bits. As we shall see, we only need on the order of log N buckets to represent the window, so on the order of log squared N bits suffices. The record contents are the following. First, the timestamp of its end, that is, of the most recently arrived bit in the bucket. As I mentioned, we'll record timestamps modulo N, so we need log N bits to represent the timestamp. Second, the number of 1's between the beginning and the end of the segment. We call this count of 1's the size of the bucket. However, the number of 1's in the segment represented by a bucket must be a power of 2. That explains why we only need log log N bits to represent the count of 1's. We can store the logarithm of the count instead of the count itself, since we know that log base 2 of the count must be an integer. The count itself can't be higher than N, so its logarithm can't be higher than log base 2 of N. Since the logarithm is an integer i, and we only need log i bits to represent i in binary, log log N bits suffice. It really doesn't matter much, since we still need order log N bits in the record for the bucket, just to store the
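The bucket record just described can be sketched as a small data structure. This is my own illustration of the two fields, with made-up names and an illustrative window size, not code from the lecture.

```python
from dataclasses import dataclass

N = 1_000_000   # illustrative window size, chosen by me

@dataclass
class Bucket:
    # Timestamp of the bucket's most recent bit, stored modulo N.
    # log2(N) bits are enough, since all N bits of the window
    # carry distinct timestamps modulo N.
    end_time: int
    # Log base 2 of the bucket's size (its count of 1's). Storing the
    # logarithm works because sizes are always powers of 2, and it
    # needs only about log2(log2(N)) bits.
    log_size: int

    def size(self):
        return 1 << self.log_size   # recover the actual count of 1's

# A bucket whose last 1 arrived at absolute time 123,456,789 and which
# represents sixteen 1's:
b = Bucket(end_time=123_456_789 % N, log_size=4)
```

Here `b.end_time` is 456789 and `b.size()` is 16; the record as a whole fits in on the order of log N bits, dominated by the timestamp field.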