Subtitles section Play video Print subtitles So, the topic today is dynamic programming. The term programming in the name of this term doesn't refer to computer programming. OK, programming is an old word that means any tabular method for accomplishing something. So, you'll hear about linear programming and dynamic programming. Either of those, even though we now incorporate those algorithms in computer programs, originally computer programming, you were given a datasheet and you put one line per line of code as a tabular method for giving the machine instructions as to what to do. OK, so the term programming is older. Of course, and now conventionally when you see programming, you mean software, computer programming. But that wasn't always the case. And these terms continue in the literature. So, dynamic programming is a design technique like other design techniques we've seen such as divided and conquer. OK, so it's a way of solving a class of problems rather than a particular algorithm or something. So, we're going to work through this for the example of so-called longest common subsequence problem, sometimes called LCS, OK, which is a problem that comes up in a variety of contexts. And it's particularly important in computational biology, where you have long DNA strains, and you're trying to find commonalities between two strings, OK, one which may be a genome, and one may be various, when people do, what is that thing called when they do the evolutionary comparisons? The evolutionary trees, yeah, right, yeah, exactly, phylogenetic trees, there you go, OK, phylogenetic trees. Good, so here's the problem. So, you're given two sequences, x going from one to m, and y running from one to n. You want to find a longest sequence common to both. OK, and here I say a, not the, although it's common to talk about the longest common subsequence. Usually the longest comment subsequence isn't unique. There could be several different subsequences that tie for that. However, people tend to, it's one of the sloppinesses that people will say. I will try to say a, unless it's unique. But I may slip as well because it's just such a common thing to just talk about the, even though there might be multiple. So, here's an example. Suppose x is this sequence, and y is this sequence. So, what is a longest common subsequence of those two sequences? See if you can just eyeball it. AB: length two? Anybody have one longer? Excuse me? BDB, BDB. BDAB, BDAB, BDAB, anything longer? So, BDAB: that's the longest one. Is there another one that's the same length? Is there another one that ties? BCAB, BCAB, another one? BCBA, yeah, there are a bunch of them all of length four. There isn't one of length five. OK, we are actually going to come up with an algorithm that, if it's correct, we're going to show it's correct, guarantees that there isn't one of length five. So all those are, we can say, any one of these is the longest comment subsequence of x and y. We tend to use it this way using functional notation, but it's not a function that's really a relation. So, we'll say something is an LCS when really we only mean it's an element, if you will, of the set of longest common subsequences. Once again, it's classic abusive notation. As long as we know what we mean, it's OK to abuse notation. What we can't do is misuse it. But abuse, yeah! Make it so it's easy to deal with. But you have to know what's going on underneath. OK, so let's see, so there's a fairly simple brute force algorithm for solving this problem. And that is, let's just check every, maybe some of you did this in your heads, subsequence of x from one to m to see if it's also a subsequence of y of one to n. So, just take every subsequence that you can get here, check it to see if it's in there. So let's analyze that. So, to check, so if I give you a subsequence of x, how long does it take you to check whether it is, in fact, a subsequence of y? So, I give you something like BCAB. How long does it take me to check to see if it's a subsequence of y? Length of y, which is order n. And how do you do it? Yeah, you just scan. So as you hit the first character that matches, great. Now, if you will, recursively see whether the suffix of your string matches the suffix of x. OK, and so, you are just simply walking down the tree to see if it matches. You're walking down the string to see if it matches. OK, then the second thing is, then how many subsequences of x are there? Two to the n? x just goes from one to m, two to the m subsequences of x, OK, two to the m. Two to the m subsequences of x, OK, one way to see that, you say, well, how many subsequences are there of something there? If I consider a bit vector of length m, OK, that's one or zero, just every position where there's a one, I take out, that identifies an element that I'm going to take out. OK, then that gives me a mapping from each subsequence of x, from each bit vector to a different subsequence of x. Now, of course, you could have matching characters there, that in the worst case, all of the characters are different. OK, and so every one of those will be a unique subsequence. So, each bit vector of length m corresponds to a subsequence. That's a generally good trick to know. So, the worst-case running time of this method is order n times two to the m, which is, since m is in the exponent, is exponential time. And there's a technical term that we use when something is exponential time. Slow: good. OK, very good. OK, slow, OK, so this is really bad. This is taking a long time to crank out how long the longest common subsequence is because there's so many subsequences. OK, so we're going to now go through a process of developing a far more efficient algorithm for this problem. OK, and we're actually going to go through several stages. The first one is to go through simplification stage. OK, and what we're going to do is look at simply the length of the longest common sequence of x and y. And then what we'll do is extend the algorithm to find the longest common subsequence itself. OK, so we're going to look at the length. So, simplify the problem, if you will, to just try to compute the length. What's nice is the length is unique. OK, there's only going to be one length that's going to be the longest. OK, and what we'll do is just focus on the problem of computing the length. And then we'll do is we can back up from that and figure out what actually is the subsequence that realizes that length. OK, and that will be a big simplification because we don't have to keep track of a lot of different possibilities at every stage. We just have to keep track of the one number, which is the length. So, it's sort of reduces it to a numerical problem. We'll adopt the following notation. It's pretty standard notation, but I just want, if I put absolute values around the string or a sequence, it denotes the length of the sequence, S. OK, so that's the first thing. The second thing we're going to do is, actually, we're going to, which takes a lot more insight when you come up with a problem like this, and in some sense, ends up being the hardest part of designing a good dynamic programming algorithm from any problem, which is we're going to actually look not at all subsequences of x and y, but just prefixes. OK, we're just going to look at prefixes and we're going to show how we can express the length of the longest common subsequence of prefixes in terms of each other. In particular, we're going to define c of ij to be the length, the longest common subsequence of the prefix of x going from one to i, and y of going to one to j. And what we are going to do is we're going to calculate c[i,j] for all ij. And if we do that, how then do we solve the problem of the longest common of sequence of x and y? How do we solve the longest common subsequence? Suppose we've solved this for all I and j. How then do we compute the length of the longest common subsequence of x and y? Yeah, c[m,n], that's all, OK? So then, c of m, n is just equal to the longest common subsequence of x and y, because if I go from one to n, I'm done, OK? And so, it's going to turn out