Placeholder Image

Subtitles section Play video

  • [MUSIC PLAYING]

  • DAVID J. MALAN: All right, this is CS50, and this is week 4.

  • So let's consider first what we did last time, which was focus on numbers

  • and focus on how we can search them and how we can sort them.

  • And in particular, we took a look at a number of algorithms

  • that we finally gave names to, a couple of which we actually saw back in week

  • 0, but didn't treat as formally.

  • So we looked more formally last time at linear search,

  • this process of searching elements from left to right or right to left

  • as though you're walking along a line looking for some value.

  • We looked at binary search, a divide and conquer algorithm

  • whereby we started, in this case, in the middle,

  • and then we looked left to right and then went to the left,

  • went to the right, and then did the same problem again and again.

  • We looked also at bubble sort.

  • How do you get to the point where you can

  • use an algorithm like binary search, which

  • assumes that your inputs are sorted?

  • So how do you get to that sorted state?

  • Well, bubble sort was one algorithm that essentially

  • swapped elements pairwise from left to right, from left to right,

  • until finally the biggest elements bubbled up to the top,

  • and the whole list was ultimately sorted.

  • But we also looked at selection sort.

  • And selection sort is the name implied, selected,

  • then smallest element again and again and again, plucking it out of the list,

  • and putting it where it belongs.

  • Insertion sort, meanwhile, walked through the list really just once,

  • and as it encountered each element, it then

  • inserted it into the sorted half of the list, the left half of the list

  • in its proper place.

  • And then lastly, and where we really concluded and really got more bang

  • for our buck was this final algorithm merge

  • sort, which actually solved this same problem no less correctly, but so much

  • more efficiently.

  • And indeed, lends itself perhaps to a fundamentally better design

  • in so far as it leverages a much better running time, so to speak.

  • In fact, we started to ascribe these kinds of formulas

  • to the running times of or the performance of our algorithms n

  • squared, otherwise known as quadratic, whereby

  • you take n times n number of steps or minutes or seconds

  • or some unit of measure, n log in, which is actually what we achieved

  • with the merge source, which was much faster

  • recall than the end squared algorithms like bubble sort and selection sort

  • and insertion sort in their upper bounds.

  • n would be something linear, like a straight line relationship.

  • Logarithmic of n, log of n, log base 2 of n,

  • technically, would be something like week 0's divide

  • and conquer approach looking for Mike Smith,

  • as well as last time's binary search.

  • And then one, which doesn't have to be literally one step.

  • Maybe it's two, maybe it's 10, but it is a fixed finite constant number

  • of steps.

  • And that too might be the running time of some algorithm.

  • Now how do we describe the running times of algorithms?

  • Well, we use some special notation, asymptotic notation,

  • so to speak, which while it might look cryptic at first glance,

  • really is just a handy way of succinctly expressing the fact that you know what

  • the upper bound on some algorithms running time is,

  • what the lower bounds on some algorithms running time is.

  • And if those are one in the same formulas in theta,

  • do you have a coincidence of the two of big O, so speak, and capital theta.

  • So that, while Greek, literally is just a way

  • of expressing a bit more succinctly what these running times are.

  • And we'll continue to revisit this issue as we

  • look at more algorithms and soon data structures still.

  • But this time, we apologize.

  • We pull back a layer here and admit that there is no such thing as a string.

  • Indeed, all this time we've been saying that there's ins, and there's floats,

  • and there's chars, and there's doubles and more.

  • And we've also been saying there are strings,

  • but there really aren't strings.

  • This is sort of a figment of the imagination of our so-called CS50

  • library.

  • But it's a pedagogical simplification that we've

  • been using for the past several weeks, so as to not get

  • lost in the weeds, the lower level implementation

  • details of what a string is, so that we can just get

  • real work done in these first weeks.

  • But now we'll begin to look underneath that hood

  • and see what a string actually is and what the implications are.

  • And it turns out, while more complicated in some sense,

  • it really just boils down to some first principles,

  • what it is the computer is doing underneath that hood.

  • So let's take a look at string first by way of a couple of examples.

  • Let's go in to CS50 IDE, create a new file, save it as Compare zero dot C,

  • and look at the little program that actually doesn't necessarily

  • do what we think it's going to do.

  • In particular, let me go ahead and include our typical,

  • include CS 50 dot H. And let me go ahead and include standard IO dot H as well.

  • Let me go ahead and use main void, so I'm not

  • going to worry about any command line arguments for now.

  • And then I'm going to go ahead and just prompt the user.

  • Hey user, give me a string called S for instance.

  • And then I'm going to have no newline.

  • I just want that all in one same line.

  • Now I'm going to go ahead and do string S,

  • gets gets string, open paren and close parens,

  • so as to get a string from the user.

  • And then, let me do this.

  • Let me also print T colon and ask the user, essentially, for a string

  • that I'll call T, since T comes after S quite simply.

  • And now let me just compare these strings as the filename suggests.

  • So I know how to compare, not with if s equals t,

  • because that's the assignment operator.

  • But we know that s equals equals t should compare the values on the left

  • and on the right.

  • So let's try this.

  • So if s equals equals t, then I'm going to go ahead and print out same.

  • Elts, they are presumably difference, so I'm

  • going to go ahead and print out different with a newline character

  • and then save it.

  • So a pretty simple program, so let me go down into my terminal window and run

  • make compare 0.

  • Now let me go ahead and run dot slash compare 0. s,

  • I'm going to go ahead and type in Zamyla.

  • And I'm going to go ahead now and type in Maria.

  • OK, and they're different.

  • I expected as much.

  • Now let's go ahead and run this again.

  • Again with Zamyla.

  • And let's just say Zamyla again, different.

  • What did I do wrong?

  • Let me try this again.

  • Maybe it's the capitalization.

  • So Zamyla in all lower case, different.

  • Well, maybe it's just Zamyla's name.

  • Let me try Rob or RLB?

  • How about RLB?

  • Those are different, as is Rob and Rob.

  • So what is going on?

  • Those strings pretty much look the same to me.

  • I'm typing the same incantation of strings,

  • so what is it that's going on here?

  • You know what, let me do a test with something else.

  • Let me go ahead and create a new file here.

  • I'm going to call this copyzero.c.

  • Maybe I'm just misunderstanding how comparison works.

  • But surely I should be able to copy a string and make an identical copy,

  • so let's do that now.

  • Let me go ahead and create a file called copyzero.c.

  • And let me do an include cs50.h.

  • Include standard io.h into main void, so just as before.

  • And now, let me go ahead and just prompt the user like before.

  • Give me a string s.

  • And I'll put that in a variable, s, calling get string as before.

  • And now, I remember from prior classes that I'm supposed to do,

  • if s equals equals null, maybe I need to do some error checking here.

  • And if that's the case, I'm going to arbitrarily return 1.

  • So recall that main can return values.

  • 0 means all is well.

  • 1 or any other non-zero value means something is wrong.

  • So maybe that's what I was doing wrong earlier, just not error checking.

  • So let's try that.

  • Now, let me declare a variable of type string called t and assign it to s.

  • So in other words, I want to copy s into t.

  • And I know that this happens from the right to the left.

  • So that's as I think it should be.

  • And then, if strleng of t is greater than 0--

  • you know what, I'm going to do one other thing.

  • Not just copy it.

  • I'm going to go ahead and do this.

  • Recall, that you can treat strings as arrays.

  • So I'm going to say the zeroth character, the first character of t--

  • you know what?

  • I want to uppercase it.

  • I want to really make sure that these two strings are indeed

  • different as I intend.

  • And let's go ahead now and do this.

  • Let me go ahead now and do printf of s.

  • And let me plug-in s's value in a new line.

  • Plug-in s.

  • And then printf t:%s for a placeholder again.

  • Comma t semi-colon.

  • So in other words, if the length of t is greater than zero,

  • let's capitalize it by changing that first character.

  • And then, just print out s and t.

  • And surely, by capitalizing only t, I should see only one capitalized word.

  • Now, I'm using both strleng and two upper.

  • So rather than let Clang have a chance to yell at me, I'm going to go in here

  • and preemptively add ctype.h.

  • Which recall is a library you might have seen

  • or certainly will soon see that has a number of functions in it.

  • Among them, two upper.

  • And then, I also need to include string.h, so that I can use strlen--

  • the function that gives me the length of the string.

  • So Clang would have yelled at me if I forgot that, but let me preemptively

  • solve that problem.

  • And now, do make copies zero.

  • All seems to be well.

  • Let's run dot slash copy 0.

  • All right.

  • Let's go ahead and type in-- how about just my own name in all lowercase.

  • Huh?

  • Now, why is this confusing?

  • So I wrote code that got a string from the user and stored it in s.

  • I then wrote code that declared a second variable, t.

  • And I set t equal to s, thereby making I would

  • think a copy-- as in past weeks of using the assignment operator.

  • Then down here, I made sure t was long enough.

  • That's just a quick sanity check.

  • And then on this line here, I'm just saying,

  • set the first character of t, t bracket 0, equal to the result of upper casing

  • the first character of t.

  • So the only code that's touching t is this one here.

  • And yet, somehow my name gets capitalized both in s and in t.

  • So what is it that's actually going on here?

  • It seems to be broken still.

  • In fact, let me go ahead and open another example, rather than type

  • this one out ourselves.

  • Let me go ahead and open up an example called no swap.

  • As the name suggests, it's a bit of a spoiler.

  • I wrote this program in advance to do the following.

  • First, I've included standard io.h, so I can use printf.

  • I have a prototype of my function called swap up here, because indeed,

  • the goal at hand I decided was I just want to write a simple function that

  • swaps two values.

  • Given a and b, make b a and a, b.

  • And then, return.

  • So I'm just going to arbitrarily test this out by declaring a variable x.

  • Setting it equal to 1.

  • A variable y setting it equal to 2.

  • And then, just as a sanity check, I'm going to print out x is such and such,

  • y is such and such.

  • And then, I'm going to claim swapping dot dot dot.

  • And then, the key line is apparently this one.

  • Call a function called swap, passing in x and y.

  • And if I implemented swap correctly, this should swap the two variables.

  • Thereafter, I'm going to claim swapped x as such and such, y as such and such.

  • So let's run this program and see what else is apparently a lie.

  • Make no swap in my source directory.

  • ./noswap enter.

  • And if we scroll back up in my history, you'll see x is 1, y is 2.

  • Swapping swapped.

  • x is 1, y is 2.

  • All right.

  • So maybe just the swap function is buggy.

  • This isn't necessarily indicative of a misunderstanding.

  • So let's look at the implementation of swap.

  • Swap returns no value and I think that's OK, so long as it takes inputs.

  • Swap takes two inputs, an ints and an ints, called a and b respectively.

  • And then, let's consider how this works.

  • So I've declared a temporary variable, called temp--

  • though I could call it anything I want.

  • And I'm storing in it the value a.

  • So I'm taking a-- and it's the number 1.

  • And I'm just temporarily storing the number 1

  • in this additional temporary variable, so that I now

  • have two copies of the number one-- in a and in temp.

  • Then, I change the value of a to equal b.

  • So at this point in the story, a should have a value of 2.

  • b should have a value of 2.

  • Which could be problematic, except I've saved the original value 1

  • in my temporary variable.

  • And so, indeed I change b to equal that temporary value.

  • So indeed, I can't do this magical switcheroo.

  • Because if I simultaneously took one value and one value

  • and put them in each other's places, I would essentially

  • risk clobbering one or the other.

  • Because the computer typically can only do one thing

  • at a time underneath the hood, even though it's super fast.

  • So I run the risk of somehow getting two copies of that same value,

  • unless I'm careful.

  • I really want to take one value, put it in a temporary space.

  • Then, copy one value over.

  • And then, do that switcheroo.

  • But it doesn't seem to work.

  • And yet, this is a good opportunity for one more tool.

  • Let me do a further check.

  • Let me go ahead and run debug 50 on no swap.

  • And you know what?

  • I'm going to set a break point right at the beginning of this program on line

  • 10 by simply clicking in the so-called gutter of the line number here.

  • A red dot appears, so that now when I run debug 50,

  • notice my right hand bar pops out.

  • Notice that the line 10 in question is highlighted, because execution is

  • paused at this line.

  • Let's take a look at what's going on the right hand side of my debugger.

  • Notice that x has a value of 1 and y has a value of 2.

  • And that's what I intend.

  • Now, let's begin to step over these lines.

  • So let's step over the printf.

  • And we see on my terminal window that x is 1.

  • Step over the next printf and we see that y is 2.

  • Now, let's step over that and we see indeed swapping dot dot dot.

  • But rather than step over my call to the swap function, let's step into it.

  • So I can actually see maybe my logic is buggy.

  • So let's go ahead and step into swap.

  • And notice that debug 50 immediately jumps down to the first line and swap.

  • And notice now on the right hand side, what's going on?

  • a is 1 and b is 2.

  • And that's as expected, because I passed an x and y.

  • Swap calls it's two inputs a and b respectively, so a is 1, b is 2.

  • Temp though, very curiously is 32,767.

  • And where is that value coming?

  • Well, it turns out when you declare a variable,

  • but haven't yet said it's of some value it's going to by definition have

  • an undefined value.

  • A so-called garbage value.

  • Remnants of other aspects of your program

  • having used that chunk of memory before.

  • And we'll see why this is in just a bit.

  • So this is anomalous.

  • It's not a bug in the program.

  • It's to be expected, because I haven't yet assigned

  • temp a value until-- let me zoom out.

  • Until I step over line 21 and actually let temp equal a.

  • And notice, if you watch the right hand side of my screen now,

  • temp is currently 32,767.

  • And when I step over it, OK.

  • It becomes 1.

  • So that's as expected.

  • Now, a is about to become b.

  • And indeed, we now see that a is 2 b is 2 and temp is still 1.

  • So now we have to put temp into b, so let's step over that line.

  • And indeed, we've done a switcheroo of a and b.

  • a is 2, b is 1.

  • OK.

  • Let's go ahead and play this out.

  • Let's continue here.

  • I claim that I've swapped.

  • I'm printing x-- and somehow x and y are still 1 and 2.

  • Even though the debunker told me it would

  • seem that my swap function was correct.

  • So what is happening after all of these weeks of CS50 and all of these problems

  • solved thus far?

  • It would seem that all of our understanding of things

  • is kind of unraveling.

  • And that's because we've been very careful over the past few weeks,

  • at least in class.

  • And in sections to try to avoid tripping over some of these lower-level

  • implementation details.

  • And there's not that many of them.

  • But today, is now a time to peel back this layer

  • and understand exactly what it is that's going on.

  • Indeed, all of this time when you run a program on your computer,

  • double-click an icon or run the program's name with dot slash

  • something or other at your terminal window-- what happens

  • is that your computer gives that program the illusion of a really big chunk

  • of memory all of its own.

  • Maybe two gigabytes of memory, even though it doesn't necessarily

  • use all of that memory.

  • And that memory-- if you just think of your computer's memory,

  • as we've done before, is like a rectangular region.

  • And we could number of the bytes in my memory.

  • But it doesn't really matter what the addresses are, what the numbers are,

  • just that it exists.

  • It turns out that when you run a program,

  • your operating system typically lays out your program's memory in a fairly

  • standard way.

  • There's a chunk of memory down here for something called environment variables.

  • There's a bigger growable chunk of memory down here called

  • the so-called stack.

  • On the opposite side of the picture is a so-called heap.

  • Another chunk of memory that actually grows in the other direction.

  • So long story short, bad things can happen if both of those

  • grow bigger than you intend.

  • Then, there's some kind of uninitialized and initialized data up top.

  • And then, text.

  • Now, it turns out text is the segment of memory where

  • your program's zeros and ones live.

  • So when you double-click an icon on your Mac or PC

  • or when you run the command dot slash something,

  • those zeros and ones are loaded from your hard disk or solid state disk

  • into RAM or memory.

  • And it's put conceptually at the top of the memory

  • that your computer program is using.

  • And below that is the actual data that your program is using.

  • The variables and the values inside of it.

  • Now, each of these types of memory have different purposes.

  • And we'll see in just a moment what it is that's going on.

  • And we'll ultimately peel back these layers.

  • So what is it that's actually going on underneath the hood here?

  • Well, let's consider this to be my computer's memory.

  • So focusing on just that bottom most portion,

  • which I called the stack a moment ago.

  • So if we draw just the bottom of my computer's memory kind of like this,

  • the bottom of it has technically environment variables.

  • But let's focus on the region known as the stack.

  • And the stack, as the name implies, is kind of like the stack of trays

  • that you might see in a cafeteria or a dining hall,

  • where you put trays on top of another until it can get potentially pretty

  • tall.

  • And it turns out when you run a program, not

  • only is your program given the illusion of this really big memory space laid

  • out as proposed, but it also by convention

  • uses this memory in a fairly standard way.

  • Specifically, when main is called, main is given a chunk of memory

  • at the bottom, so to speak, of this stack space.

  • And so, let me go ahead here and write main.

  • And any local variables that main has and any arguments to main,

  • namely argc an argv, end up inside here.

  • So if indeed you are using something like argv and argc,

  • you might have a value like this down here.

  • And you might have another chunk of memory carved out here for argv.

  • And if you have a couple of local variables, for instance x

  • and another one, y, those two would be allocated space in this slice

  • if you will.

  • This frame of memory.

  • Meanwhile, if main calls a function, like swap-- swap

  • is allocated a swath of memory, a frame of memory, above main by design.

  • So if I've called swap, its memory ends up here.

  • And if the swap function itself has arguments,

  • like a and b or any other local variables,

  • those values too are put inside of the so-called stack frame.

  • So this might be a and this might be b.

  • In other words, the concepts that we've been taking for

  • granted, both in Scratch and in C, at the end of the day, boil down to values

  • needing to go somewhere physically.

  • And so, if you assume that the big rectangular region

  • here is your computer's memory.

  • And then, you consider that the operating system really

  • just slices and dices this memory, such that mains memory is down here.

  • Any function that main calls is immediately above it.

  • And frankly, if swap called its own function, it would end up above it.

  • But now, given this basic definition of memory management

  • and the layout of computer program's memory space,

  • you can perhaps start to infer why all of these failures

  • have started to happen in my program.

  • A moment ago, I didn't have argv and argc.

  • I just had for instance x and y.

  • And I had the value 1 and I had the value 2.

  • Main then called swap and put a copy of 1 there and a copy of 2 there.

  • And indeed, that's the key insight.

  • When a function calls another function, passing in arguments as inputs,

  • that the function is being passed to copies of those inputs.

  • So at this point in time, if you opened up the lid of your computer

  • and looked inside digitally, you would see 1 and 2 down here.

  • And you would see another pattern of bits

  • representing 1 and 2 up here in duplicate.

  • So now when my swap function operates, it

  • declares a temporary variable recall, called temp.

  • So let me draw that here.

  • And as I recall, it stores in temp, which value?

  • The value of a.

  • The value of a is 1.

  • It then took the value of b, put it in a-- which puts that value here.

  • And now at this point in the story, a and b are incorrect.

  • We still need to put the value 1 in b.

  • And that's why we then took temps value, put it in b.

  • Thereby giving the me ultimately the number 1 in b's slot as well.

  • And so, at this point in the story, temp still exists.

  • And a and b have the correct answers, 2 and 1 respectively.

  • But the catch is the moment that swap returns, this happens.

  • Essentially, everything that was being used above main disappears.

  • It's not actually deleted.

  • All of those bits are still there, so technically the numbers

  • are still there.

  • But the computer just forgets about those values.

  • And indeed, the key takeaway here is that when

  • execution returns to main-- that is when swap says, return I'm done.

  • And main takes over operation again, 1 and 2 are completely unchanged.

  • So all of the word that we just did up there

  • was a complete waste of time in some sense.

  • Because even though it was correct in so far as it swapped a and b,

  • it did not actually have a permanent impact on the variables

  • that I actually cared about.

  • Let's consider another example now.

  • Recall my previous example where when I whipped up some code to compare values,

  • I called get string once and stored its return value in s.

  • I called get string twice and stored its return value in t.

  • And then, I just tried to compare s and t.

  • Well, what does that actually mean?

  • It turns out that when you call get string,

  • it's not technically returning a string to per say.

  • It's not technically returning to you a sequence of characters.

  • It's actually returning something much simpler.

  • When I call get string and do this-- string s gets get underscore string,

  • all that's happening is this.

  • The left hand side of this expression is telling the computer, hey, computer.

  • I need a variable called s that's big enough to fit a string so to speak.

  • On the right hand side, get string indeed gets a string from the user,

  • like D-A-V-I-D or Z-A-M-Y-L-A.

  • But where does that string come from?

  • It turns out it comes from somewhere else in memory.

  • And suppose that the user has indeed typed in Z-A-M-Y-L-A quote unquote

  • so to speak.

  • That's just kind of floating somewhere in my computer's memory.

  • And if this string happens to exist at address number 123,

  • byte number 123 in my computer's memory.

  • What's actually going to get returned is the value 123.

  • In other words, a string is technically just the address

  • of a character in memory.

  • In fact, if we zoom in further on Zamyla,

  • recall that Zamyla's name really looks like this underneath the hood.

  • It's an array, a block of contiguous memory with a backslash 0

  • at the very end.

  • And if this first byte just happens to be the number 123,

  • the second byte is going to be 124.

  • The third byte it's going to be 125.

  • And so forth.

  • In other words, if my computer has a billion bytes

  • of memory, 2 billion bytes of memory-- like a gigabyte or two gigabytes,

  • you can certainly number each of those bytes and the computer

  • does underneath the hood.

  • And so Z-A-M-Y-L-A backslash 0 is simply a sequence of 7 characters that live

  • somewhere physically in memory.

  • And what a string technically is, is just kind of a breadcrumb--

  • specifically the address of, the number of the first byte of Zamyla's name.

  • So when you return a string, you're not handing back Z-A-M-Y-L-A backslash 0,

  • you're handing back something much smaller.

  • A little scrap of paper, if you will, that's a map to Zamyla's name

  • somewhere in your computer's memory.

  • And so over here, this number 123 is generally called, not just a number,

  • but an address.

  • Much like post boxes and houses have addresses physically that uniquely

  • identify them, 123 uniquely identifies that byte in memory.

  • And so via this address, this map if you will,

  • can my program later on find Zamyla's name and do something with it.

  • Like printed out, or capitalize it, or compare it, or anything else.

  • And so, how though does the computer know

  • where Zamyla's name begins and ends?

  • Well, it begins at 123.

  • And again, recall from two weeks ago, it ends with backslash 0.

  • So with those two markers, where does it begin?

  • Where does it end?

  • Can you infer the entirety of someone's name

  • or the entirety of a string in between?

  • Now, consider my second line of code.

  • String t gets get string.

  • So what does this mean?

  • This is another function call.

  • And suppose that I just happen to type in Z-A-M-Y-L-A enter, just like before.

  • What's happening there is that the computer, blind to the fact that,

  • that might already exist somewhere in memory,

  • is going to give me another chunk of memory somewhere with Z-A-M-Y-L-A.

  • And then, backslash 0.

  • Whereby, this is effectively an array of its own.

  • And I don't know where it is.

  • Maybe it's at address number 234, followed by 235 dot dot dot.

  • It's somewhere else in memory.

  • I don't know or need to even care where it is.

  • But the key detail here is that t in so far is it is a variable itself.

  • It gets what value?

  • Well, if the string's Zamyla, the second time I type it in it

  • just happens to end up at location 234.

  • What ends up in t is simply the number 234.

  • So if we change back to my code where I was comparing s and t, as I am here.

  • I've called get string once, stored the value in s.

  • Called get string twice stored the value in t.

  • And then, I'm quite simply saying, if s equals equals t, then say same.

  • Else, say different.

  • Now, when you consider what is really going on underneath the hood all

  • these weeks, certainly 123 is not the same as 234.

  • And so, s does not equal t.

  • Capitalization meanwhile.

  • What happens with that other program?

  • Well, we'll do this one more quickly.

  • But if I for instance, call string s gets get

  • underscore string open paren close paren.

  • That again, is going to give me something like this.

  • And it's going to give me something like Z-A-M-Y-L-A backslash 0,

  • all of which I can think of, as before, like an array-- if sloppily drawn.

  • Maybe starting again at address 123.

  • And so that's what ends up in s.

  • But in my copy 0 program, recall that I did this.

  • String t gets s.

  • I didn't call get string again.

  • I just said, store s inside of t.

  • Effectively, I thought, making a copy of it.

  • What this left-hand side gives me is another box called t.

  • If this is s, this is t.

  • But what goes inside of this new box t?

  • Well, what goes inside is literally a copy of s.

  • What is s? s is 123.

  • OK. t then, is 123, which means later in my copy program,

  • when I simply decided to capitalize t bracket 0,

  • the first character in my string t-- that's

  • kind of misleading, because my string t is really just my string s.

  • They're sort of synonyms for one another at this point,

  • because one indeed equals the other.

  • And so what has happened is that I have gone to the first character in t,

  • which of course is z.

  • And recall from my example earlier when I typed it in all lowercase, David,

  • for instance, with a lowercase d, it capitalized it not only in t, but also

  • in s.

  • In other words-- and frankly, it doesn't really matter, typically,

  • what these addresses are.

  • I'm just using 123 and 234 because they're sort of easy to say.

  • But you can really think of s as again, being a map that

  • leads you to the string you care about.

  • A pointer, if you will.

  • Literally an arrow.

  • And t, similarly, can be thought of as a pointer.

  • And the key detail here is that because I've set s equal to t-- or t

  • equal to s-- they are effectively pointing at the same thing.

  • So strings are a lie.

  • There is no such thing as a string data type.

  • There are things called chars, characters

  • that can live somewhere in memory.

  • And we humans can arbitrarily decide that hey,

  • if we put a backslash 0 character at the end of a sequence of other characters,

  • we can all just agree universally to treat that as the end of a quote

  • unquote "string," that is a word or a phrase

  • or paragraph or anything even bigger.

  • But we need a convention for remembering where strings begin.

  • We've already solved the where do they end problem.

  • So where does the string begin?

  • It begins at an address.

  • It begins with a pointer.

  • And so this special data type that we declare in CS50's library

  • called string really is just in CS50 IDE an 8-byte value,

  • a 64-bit value that is just a really big number that represents

  • the address in memory of a string.

  • And I say really big just because the IDE gives you access

  • to lots of memory, certainly numbers bigger than 123.

  • But a string is just a number, is just an address, AKA a pointer.

  • And that explains, then, why all three of these examples

  • did not behave as I might have hoped, because rather

  • they were taking things a little too literally.

  • Or I was failing to appreciate what's actually going on.

  • Let's pause for a moment, take things down a notch.

  • Make things a little more real with a bit of claymation

  • that will motivate, eventually, peeling back this layer further

  • and seeing what's really going on.

  • [STRUMMED CHORD]

  • NICK PARLANTE: Hey, Binky.

  • Wake up.

  • It's time for pointer fun.

  • BINKY: What's that?

  • Learn about pointers?

  • Oh, goodie!

  • DAVID J. MALAN: Binky, who exist here only in claymation form,

  • is the product of a good friend of ours, Nick Parlante at Stanford,

  • who teaches computer science there.

  • You'll see more of Binky and hear more of Nick in just a moment.

  • But these here are sort of metaphorically

  • the training wheels that we've had on for the past few weeks.

  • And the goal now at hand is to take these off,

  • and to finally start looking at what's really going on underneath the hood.

  • And starting to remove, if you will, let's

  • see if-- [BANG] --probably not the best idea.

  • Remove, if you will, these training wheels,

  • and actually see what's going on, and understand

  • and take advantage of the same.

  • As follows.

  • Let's go ahead now into CS50 IDE, and go ahead and open up,

  • let's say, compare1.c, which I wrote in advance to look as follows.

  • And you'll notice that it works a little differently from version zero.

  • Here we have a prompt for string s.

  • And we store in it the return value of get string.

  • But notice what's on the left-hand side.

  • Char star s, all of a sudden.

  • Indeed, all of this time, I've been treating things

  • as though they are strings, literally.

  • But it turns out a string is just a synonym

  • for a data type known as a char star.

  • And the new syntax today, then, is this star operator.

  • The asterisk that actually has special meaning in certain contexts, not just

  • multiplication.

  • But in this case, it specifies that the type of a variable

  • is not a char literally, but a char star, the address

  • of a char, a pointer to a char.

  • Now, why char?

  • I thought we were talking about strings.

  • But again, recall that a string is just a sequence of characters

  • back to back to back, and therefore, you can define a string

  • by the address of its first character.

  • Ergo, what we really need underneath the hood

  • is a data type that lets us store the address of a character.

  • There is no string.

  • And so what does this allow us to do?

  • Let me go into CS50 IDE, and let me declare then, on this line

  • here, that s, this time, will not be a "string" quote unquote.

  • That was from the CS50 library.

  • But rather it's going to be a char star.

  • It turns out that all this time get string, again, does not

  • return a string, it returns the address of a string, AKA

  • the address of the first character in a string.

  • And so the type of value it's returning is not just a number.

  • It's not just an int.

  • It's a special type of an int.

  • It's used for a different purpose.

  • It's simply an integer that represents the address of a char.

  • And the way you type out address of a char is literally char star.

  • So this, then, is identical to my previously in weeks past having typed

  • string s.

  • Now I'm going to start typing it as char star.

  • Meanwhile, t is going to be the same.

  • So when I prompt the user for another string there,

  • I'm going to store that return value inside of t.

  • And now, notice, just for good measure, I'm making sure

  • that both s and t are not null.

  • I'm using a bit of conditional logic there, saying if s is not null

  • and t is not null, it's safe to proceed.

  • Because recall that get string can accidentally return null sometimes

  • if your computer is out of memory, or something else goes wrong.

  • Or not so much accidentally, but by design.

  • But notice this new chunk of code.

  • It turns out-- and we know now from a moment

  • ago-- you can't just compare s against t.

  • They're not going to equal the same thing

  • if you type in two independent strings.

  • We need a special function that actually compares strings in a conceptual way.

  • I mean that a string is equal to another string if every one of its characters

  • equals every one of the other string's characters.

  • Thankfully, there exists in C a function that does exactly that called strcomp,

  • string compare, and it takes two arguments.

  • The first is a string.

  • The second is a string.

  • Or more properly, the first is the address of a string.

  • Or even more properly, the first is the address of a character.

  • The second is the address of a character.

  • And str compare is just going to hope that both of those strings

  • eventually end in a backslash zero, so that they

  • don't loop forever through memory.

  • They eventually hit that special null terminating byte.

  • And if so, and those characters are all entirely equal, you print same.

  • Else, as before, we print out different.

  • So now, when I compile this program, make compare 1,

  • and then I do ./compare1.

  • Now I'll type in david in all lowercase, david in all lowercase.

  • They're indeed the same.

  • Let's do it again.

  • Zamyla with a capital Z. Zamyla with a capital Z. There are indeed the same.

  • Let's do Zamyla with a capital Z. zamyla with a lowercase z.

  • Different.

  • And then FUBAR, clearly different.

  • Now we're actually comparing these things properly.

  • Because now we're appreciating what it actually means to be a string,

  • and we are underneath the hood comparing what we should be doing.

  • Now, underneath the hood, what is str compare doing?

  • Honestly, it's probably just a while loop or a for loop

  • that is iterating over the string and their lengths

  • and looking at the i-th character in each string,

  • and making sure they're all in fact equal.

  • Let's go ahead and fix copy with version 1 here.

  • Copy 1.c that I've written in advance now looks like this.

  • I still prompt the user for a string s with these lines here.

  • I then, just to be safe, say hey, wait a minute, if s equals

  • equals null, return 1.

  • And again, 1 is just arbitrary.

  • I just want to get out, lest I break something later.

  • Now, down here, this is a new line of code.

  • And this is perhaps one of the most powerful ingredients

  • we'll see this week, is this new function called malloc,

  • memory allocate.

  • This is a special function via which you can ask the operating system, Linux

  • in the case here, or Mac OS, or Windows, if you're running the code locally,

  • hey operating system, please give me a bunch of bytes of memory.

  • Now, why do I want this?

  • This program is copy 1.c.

  • The goal at hand is to create a program that copies a string s

  • and stores the copy in t, so to speak.

  • Last time, it was not sufficient just to say t equals s,

  • because that copied the addresses.

  • That didn't give me a copy of Z-A-M-Y-- Z-A-M-Y-L-A backlash 0.

  • It instead just gave me a copy of the address.

  • So how do I get a complete copy of Zamyla's name?

  • I need to preemptively do a little bit of arithmetic

  • and say, all right, how long is Zamyla's name?

  • Well, it's the length of s, str len of s.

  • But plus one.

  • Why plus one?

  • Why plus one?

  • Yeah?

  • Exactly.

  • We now were hit the lowest level of the computer.

  • If we don't ask the operating system for memory for that extra backslash 0 byte,

  • we're not going to get it.

  • So we have to explicitly say, give me one more byte,

  • because I know how strings are implemented underneath the hood.

  • I need to put that backslash zero there, ultimately,

  • and then whatever that expression is, the length of Zamyla, so Z-A-M-Y-L-A,

  • six, plus one, seven bytes.

  • Times the size of a character.

  • Turns out it's always going to be one, by definition.

  • But just for good measure, I'm clearly saying, give me seven times

  • the size of a char, which is going to be one.

  • That gives me seven total bytes.

  • So just to simplify.

  • If you multiply all this out, because the line looks unnecessarily

  • cryptic at the moment, this really is equivalent, at the moment,

  • to just this.

  • Call the function malloc.

  • Give it the number seven, so that malloc, and in turn, the operating

  • system, looks inside of its memory bank, so to speak, and says,

  • hmm, where are there are seven available bytes that aren't currently in use?

  • Ah, here is a chunk of them.

  • And it's a contiguous chunk.

  • It's going to find a block of memory, a rectangular region, if you will,

  • and grab seven bytes, and return them to my main function.

  • But what do I mean to return a chunk of memory?

  • Well, just as get string returns a string

  • by returning the address of the first character in that string,

  • so does malloc equivalently simply return the address

  • of the first byte of memory.

  • But the danger now is that unlike a string,

  • malloc is not giving you characters.

  • It's just giving you seven bytes in a row that you are now free to use.

  • It does not give you a backslash zero at the end of them.

  • If you want to remember the length of the chunk of memory

  • you just allocated, the burden is entirely on you, the programmer.

  • And indeed, one of the most common sources of bugs in writing code in C

  • is to forget about how long was this chunk of memory,

  • and to accidentally, with a loop, go too far past the end of it.

  • And we'll see what can happen in those cases.

  • So now, assuming I do have in t the address of that chunk of memory,

  • let me just say, if t equals equals null, return 1.

  • Something happened that's bad, probably the operating system just

  • didn't have seven extra bytes of memory to give me.

  • So fine, I'll quit.

  • Then down here, what do I want to do?

  • Well, I now need to implement, at least in this example, my own copying

  • process.

  • Here, at this point in the story, I have two variables, s and t.

  • s contains the address of Zamyla's name. t

  • contains the address of a new chunk of memory of length seven.

  • So here's what I want to do.

  • Just like a couple of weeks ago, I'm going

  • to iterate from zero on up to the length of the string.

  • But not up to, but up through the length of the string.

  • Because in this case, I actually want to iterate with a for loop

  • up through that backslash 0 byte.

  • And then just this syntax from a couple of weeks ago,

  • when we simply manipulated strings as for our cryptography ciphers, character

  • by character.

  • Make the i-th character of t equal to the i-th character of s.

  • And this is perfectly valid, because so long as this loop doesn't

  • go past n, the number of characters that I allocated, seven, in this case,

  • I can go to t bracket 0, bracket 1, bracket 2, all the way up through n,

  • effectively copying the string.

  • And so now when I actually print out s and t, I should see truly a copy of t.

  • Because even when I force its first character

  • to lower case with this same line of code here as before,

  • I'm actually changing different memory.

  • So let's compile this.

  • Make copy 1, ./copy1.

  • And let me go ahead and type in zamyla in all lowercase,

  • and now notice the program does seem to work.

  • Zamyla is reprinted in lower case for s, but it's then

  • print in uppercase for its first letter for t.

  • And because the z's look pretty similar, let's do my name again,

  • whereby I type david in all lowercase.

  • Type Enter.

  • And now, you see s is still david all lowercase,

  • but t has only now been capitalized itself.

  • It hasn't had a side effect of some sort on s,

  • because they're different chunks of memory.

  • Why?

  • Well, what has just happened in this program is this.

  • We have, again, done string s gets get string.

  • And when we typed get string, this gives me

  • a chunk of memory for the address of s.

  • Get string gives me a name, like D-A-V-I-D in all lowercase,

  • plus that backslash 0.

  • Which again, is really just an array underneath the hood like this,

  • that starts at some byte, and maybe it's again, by coincidence, 123.

  • Just because it's easy to say, but that's not

  • where it's necessarily going to end up.

  • And so what ends up here is 123.

  • And then later, when I allocate t, I again

  • get this little chunk of memory that's supposed

  • to store the address of a character.

  • And actually, recall that we're now doing this

  • as char star, not even string.

  • So t is similarly a char star.

  • And what happens, malloc, when I ask it for seven bytes,

  • gives me 1, 2, 3, 4, 5, 6, 7 bytes of memory.

  • There's no null terminating character just yet.

  • It's just a block of memory.

  • And frankly, there could be some random values here, as

  • denoted with question marks here.

  • It's just a chunk of memory that might have been used previously

  • in my program for some other purpose.

  • But what gets stored here, if this happens to be at address 234,

  • is this value here, 234.

  • And if you're not liking the numbers, again, you

  • can think of these as just being pointers, arrows,

  • to these chunks of memory.

  • But now, in my C code, when I have a few lines above this loop whereby

  • I am copying from s bracket i into t bracket

  • i, each of the characters in my loop, what's actually happening?

  • Well, fairly intuitively, this lower case d ends up getting copied here.

  • This lower case a ends up getting copied here.

  • V-I-D, on through.

  • And David can't count, so-- backslash-- oh, right.

  • David's name is shorter than Zamyla's name, which means we didn't actually

  • ask for this many characters over here.

  • But we have taken the computer more literally now.

  • Give me six bytes, not seven bytes, in this case.

  • And then literally copy each of the characters from the original string

  • into this new string all the way up through that backslash 0 character.

  • And then when you capitalize the first character in t,

  • you are literally only changing this-- we can do better than this.

  • We are only changing this first character here,

  • which looks like that now.

  • And that's what's going on underneath the hood.

  • So this is why, then, in the beginning of the class,

  • we don't introduce strings as char stars,

  • because you very quickly get caught up in a lot of this minutia.

  • But at the end of the day, it's not all that complicated

  • once you realize that a string is just an address,

  • and malloc, this new function, also just returns an address.

  • This is very powerful, because now you have

  • these sort of breadcrumbs that can lead you to different places in memory.

  • A little map, so to speak, that can lead you to actual strings in memory,

  • and can actually now solve problems more effectively.

  • For instance, we can go back and solve one other problem

  • we saw a moment ago, which was swap.

  • So this version of swap was broken why?

  • What was the source of this fundamental problem?

  • Yeah?

  • STUDENT: [INAUDIBLE]

  • DAVID J. MALAN: Yeah.

  • When you went back to main, you erased the memory on top of it,

  • and the fundamental problem there was when I passed in x and y,

  • they became copies called a and b, in different chunks of memory.

  • And so the fundamental problem seems to be

  • that swap is incorrectly implemented.

  • It's logically correct.

  • It does swap two values.

  • And we saw that with debug 50, but it's kind of fundamentally flawed in so far

  • as it requires, it seems, by design of C,

  • that a and b be passed in by value as copies, so to speak.

  • We need some way to change this function to say main, hey, uh-uh, don't give me

  • copies of your variables.

  • Give me a treasure map that will lead me to your variables.

  • Give me the address of x.

  • Give me the address of y.

  • And I'll still call them a and b, or whatever I want,

  • but lead me to the original values.

  • Don't just pass me copies of those values.

  • And so we can change swap as follows from a program or a function that's

  • incorrect entirely, but into one that is correct.

  • And we need to change the syntax a little bit.

  • So before is what we had here.

  • After is what we now have.

  • Before, after.

  • Before, after.

  • So if you see it in rapid succession there,

  • all you see is that a whole bunch of stars are appearing in the code.

  • And unfortunately, C was not designed in the best of ways

  • to make clear what star means in different contexts,

  • but it's all related as follows.

  • The fact that I have now changed a and b to be not ints,

  • but ints stars, int pointers, if you will, means that when main calls swap,

  • it is by design of the C language, going to pass in the address of x

  • and the address of y.

  • So that's what the star means.

  • Give me the address of an int and the address of an int, not actual ints.

  • Now, down here, the star unfortunately means something slightly different,

  • but related in spirit.

  • Int temp just gives me an integer, an int variable called temp.

  • Star a, in this context, without the word int in front of it again,

  • means go to that location.

  • Follow the treasure map, so to speak.

  • Go to the address that is in a.

  • So for instance, if the address of a is say,

  • 33 Oxford Street, Cambridge, Massachusetts.

  • That happens to be the address of the CS building at Harvard.

  • Star a means go to 33 Oxford Street in Cambridge, Massachusetts.

  • The star just means go to that particular address.

  • So what does that mean, then, down here, when I say star a gets star b?

  • That means go to the address in b and get its value,

  • and store it at whenever a is pointing at to.

  • So go to a and wait for me for a value.

  • Go to b, get a value, and put that value at the location in a.

  • And then lastly, this just means go to the location in b,

  • go to whatever building that is, so to speak,

  • and put the value that is in temp inside of that building.

  • So a pointer is just an address.

  • These stars just mean pointers are involved.

  • Give me the address of an int, give me the address of an int.

  • And again, confusing, admittedly, the star

  • in this context where we don't have the word int in front of it again,

  • on the side of the equal sign, just means go to that address.

  • Go to that building.

  • Go to that other building and put something there.

  • So we can now fix our swap program correctly.

  • We can now open up, as I will here, swap dot c, which I wrote in advance.

  • That looks almost the same, except that I've changed the swap function

  • as follows. a is now int*, b is now int*,

  • and I also borrowed the stars inside of the function, as well.

  • But something's gotta change.

  • There's one more line of code I need to change for all of this to work.

  • What is that?

  • What line needs to change?

  • Well who cares about swap?

  • It's main that was calling this thing in the first place,

  • so let's go back to the original story.

  • Main, here, declares x and y as 1 and 2, does some printfs here, as before.

  • But notice this line has to change.

  • So one more piece of syntax today.

  • And we're running out of new symbols.

  • We've seen most of C already.

  • &x and &y means get me the address of x, and get me the address of y,

  • and pass those in instead.

  • So x,y would just mean pass in a copy of x and a copy of y,

  • or the values thereof.

  • &x &y means give me a little, you know, map with the address of x and a little

  • map with the address of y, so that swap-- who's receiving those maps--

  • can go there.

  • So what does this mean in pictorial form?

  • If we now go back to the beginning of this story,

  • where we were looking at my computer's memory as this big rectangular region

  • like this.

  • With main's chunk of memory at the bottom here.

  • And inside of main was two variables, like x, and another variable y.

  • And inside of those were the numbers 1 and 2.

  • And then I called swap.

  • And so swap gets its own frame on the stack, so to speak.

  • This is swap's frame.

  • It, too, had a variable called a and a variable called b.

  • But what goes in there now?

  • It's not 1 and 2.

  • We need to know a little something more about my computer's memory.

  • And I don't know where everything's laid out,

  • but let me just arbitrarily assume that, you know,

  • it's inside of my computer's memory.

  • Maybe this is byte number 90.

  • This is going to be 91.

  • This here is going to be 92, 93, 94, 95, and so forth.

  • I just need to know that there's some kind of numbering scheme there.

  • So what goes inside of a is 91.

  • What goes inside of b is 92.

  • And not the values 1 and 2, but rather the addresses of those values 1 and 2.

  • Because now my code for the swap function, consider what it does.

  • It says, upon receiving the address of an integer,

  • called a, upon receiving the address of another integer, called b,

  • go there and store that value in temp.

  • Go to the address in b and store that value at the address in a.

  • Store the value in temp at the address in b.

  • So let's see what happens then.

  • So first of all, I need another variable here, called temp.

  • Temp, meanwhile, is not a pointer.

  • It's just an integer, but what does it store?

  • Well, according to my code, temp gets the value

  • of going to a, going to the address in a.

  • So what is a?

  • a is 99.

  • That's like a treasure map leading to, OK, this chunk of memory

  • down here in my computer.

  • And what value is there once I've gone there?

  • Once I've gone to the CS building inside of it, I see the number 1,

  • and so I put the number 1 in temp.

  • Meanwhile, my second line of code says go to the address in b

  • and grab its value, and put it at the address that's in a.

  • So what does that mean?

  • Well, star b means start here and go to 92.

  • So it's like an arrow-- kind of like chutes and ladders,

  • if you know the game-- like go to address 92.

  • What value is there?

  • The number 2.

  • And the other half of the equation, on the left,

  • said, go to the address in a, which is here,

  • and copy the 2 into that location.

  • And then the last line-- only one more line--

  • is this, get the value in temp-- that's easy-- and put it at the address in b.

  • So we have to go to b and put temp in there.

  • So to do that, here's temp, it's the number 1.

  • I have to go to the address in b.

  • The address and b is 92.

  • So let's go there, and aha, let me go ahead then,

  • and overwrite the value that's there with the number 1.

  • So now this frame of memory on the stack--

  • the 91, the 92, and the temporary variable--

  • they are, by design of my new function, disposable.

  • I really don't care, after swap returns, if those things continue--

  • I did care a little bit about that.

  • I don't care if those things continue to exist.

  • All I care about is that x and y continue to exist.

  • So in this way is the new and improved version of the swap function

  • actually having a permanent impact on my data?

  • And with the frame, the memory still looks like that,

  • because it's gone to the address in a.

  • Gone to the address in b, which leads it to the original x and y.

  • And so by way of pointers, by way of these addresses,

  • do we have the ability to actually go much, much deeper into a program

  • and actually get at values that previously we

  • had no way of even expressing.

  • So it's at this point in the story where I usually admit that, at least for me,

  • this has been among the most challenging topics when I, myself, was a student.

  • And in fact, all these years later-- it's like, 20, 20 year-- yeah,

  • I think we're up to 20 years ago.

  • 20 years ago-- I didn't take this photo then--

  • but I sat in what was, at the time, the back right hand corner of Elliot

  • House's dining hall, here at Harvard.

  • And I sat down with my teaching fellow, who of all the TFs I had

  • as an undergrad, still remember to this day, [? Nishat ?] [? Meda ?].

  • And we just reconnected on Facebook, all these years later.

  • Very exciting.

  • And it was he who wonderfully sat down with me

  • at office hours one day in the dining hall,

  • trying to help me understand pointers, because it was just so much more

  • technical than all the other stuff.

  • Like, there is no puzzle piece in Scratch for the address of something

  • that leads you somewhere so powerfully as these stars seem

  • to be able to, here.

  • And this is only to say that this is among those topics that

  • might take a little bit of time to sink in, but it does.

  • And when it does, it really is that proverbial light bulb that goes off.

  • And for me, that light bulb went off right then and there.

  • Now, what more can we do with these things, after that motivational speech?

  • Pointer arithmetic.

  • So, sort of complicated sounding topic, but really, it just

  • goes back to first principles, as to what a pointer actually is.

  • And it allows us now to do things like this.

  • Let me go ahead and open up one other program that I wrote in advance here,

  • called pointers dot c.

  • And take a look at what this thing does here.

  • It works a little differently from the syntax we're used to,

  • and from any of our crypto problems thus far.

  • So notice on this first line here, I get string and I store in s.

  • No more string right now, just char star.

  • We can be real and talk about it as the address of a char.

  • A little sanity check, is s equal equal to null?

  • If so, just return.

  • Something went wrong, so let's not deal with it now.

  • Down here, a for loop.

  • For i gets 0 all the way up to n.

  • So this is just a standard syntax we've used a few times now, even back

  • in week 1 when we just wanted iterate over.

  • Or in week 2, when we wanted iterate over the characters in a string.

  • But we've never seen this kind of craziness before.

  • A star, and then some arithmetic in parentheses.

  • In the past, when we wanted to print out a character, as implied by %c here,

  • we quite simply, as I recall, did this.

  • Which was nice and intuitive, right?

  • The square brackets denote to treat the string

  • as though it's an array, which it really is, an array of characters.

  • And that means get the i-th character of s.

  • But now that we understand what s is, we don't need to use this syntactic sugar,

  • as it's called.

  • Any time a language has a feature that's convenient to use,

  • and easier to read sometimes, but isn't fundamentally

  • necessary to express yourself, it's often

  • called syntactic sugar, which means it's just kind of a nicety to have.

  • And indeed, that square bracket notation is just

  • sugar for this more arcane, but perhaps more well-defined syntax now.

  • The star operator in this context is the dereference operator, technically.

  • It's the go there operator, as I've been describing it.

  • Go to some address.

  • Well s, recall, is a string.

  • But there is no string.

  • Strings are just the addresses of characters now.

  • The first in a string.

  • So initially in this loop, what am I doing?

  • s is the address of a string, the address of its first character.

  • And I'm saying, add to s, the value i.

  • Well, i is just this variable in my for loop that's initialized to 0.

  • So s plus 0 is obviously just s.

  • s is the address of a char.

  • *s means go to s.

  • What do you find when you get there?

  • A character, because s is a char star, the address of a character.

  • And so printing out %c *s effectively means,

  • go print that character right there.

  • On the second iteration of the loop, i is, of course, 1.

  • So s plus 1 is 1 byte farther from the beginning of the string.

  • And the star means go to that character and print it with %c.

  • One more iteration, i is now 2.

  • s plus 2 is 2 bytes away from the start of the string.

  • Go there and print that character in the string.

  • And so forth.

  • And do this up until the length of the string.

  • Now this is perfectly correct.

  • And if you really kind of want to look cool with your code

  • you can use pointer arithmetic in this way, because all it is

  • is just expressing more precisely what is going on underneath the hood.

  • But it's a little more cryptic.

  • It's certainly a couple more characters.

  • But it is functionally equivalent to what

  • we've been doing for weeks, which has been, again, just s [i].

  • So whereas some of today's ideas are admittedly

  • new-- allocating memory and actually looking

  • underneath the hood at what a string is--

  • we're not really getting newfound capabilities

  • that we didn't already have when it comes to manipulating existing strings.

  • So this is pointer arithmetic, so to speak, insofar

  • as we are doing arithmetic with pointers.

  • Math with pointers.

  • All right, let's take a look now at where things can go wrong.

  • So this is a program written by our friend Nick Parlante at Stanford,

  • inspired by Binky, who will return in just a moment.

  • And it's fundamentally broken.

  • This code is incorrect.

  • It also doesn't really do anything useful.

  • But it's meant to be demonstrative of things that can go wrong.

  • So at the top of this program, notice we are declaring two variables, x and y.

  • But today those variables are not ints, as they might have been in weeks past,

  • they are int*s, the addresses ints.

  • They're not being initialized yet to any value, so that's fine.

  • So really, this is just giving us, if you will,

  • like, two boxes on the screen.

  • So x at this point in the story looks like this, y at this point in the story

  • looks like this.

  • I have no idea what's inside of them.

  • They have garbage value, so to speak.

  • Because if we didn't assign them a value,

  • the computer is not going to do it for us,

  • so they might just have remnants of some past usage of that memory.

  • So we don't know what's inside of them.

  • But that's OK, because on this next line I call malloc and actually allocate

  • enough memory for one int.

  • Now this is kind of a silly use.

  • It's not the best way to give yourself an int.

  • We've seen for weeks now how you get an int.

  • You say, like, int x, or int y, or int z, and the computer gives you an int.

  • But if we want to use malloc in the simplest way possible,

  • we can just say, hey, malloc, give me enough space for an int.

  • And recall from the past that an int here is generally 4 bytes.

  • So give me 4 bytes of memory, specifically

  • the address of the first of those bytes.

  • That's all malloc is doing, and it's storing that address in x.

  • For good measure I should check for null, but we're not, in this case,

  • per Binky.

  • So what's the next line do?

  • This next line means go to the address in x and put the number 42 there.

  • That seems OK.

  • Because assuming malloc returns the address

  • of a chunk of memory, the first address of 4 bytes,

  • we can go there and put the number 42 in binary, in 4 bytes worth of bits.

  • But what about this line?

  • I've flagged it in red because this program probably

  • is going to go no further.

  • In fact, something very, very bad is about to happen.

  • Why?

  • Well, what is the value in y?

  • Well, originally x was a garbage value until I called malloc

  • and asked malloc, hey, malloc, give me enough space for an int.

  • So I'll draw it as a box here.

  • Give me the address of that chunk of memory.

  • So this question mark is really now an arrow to that chunk of memory.

  • And then I said, hey, computer, go ahead and go there and put the value 42.

  • But then my next line of code said, hey, computer, go to the value in y

  • and put the unlucky number 13 there.

  • So that's like saying, go-- I don't know where to go, because this has no value.

  • And so something very bad is going to happen.

  • Because the question mark implies, this is a garbage value.

  • Maybe it's 0, maybe it's 1,000, maybe it's some number in between or bigger.

  • It's some garbage value, which means if I just go there,

  • who knows where I'm going to end up?

  • But odds are I'm going to end up somewhere

  • I shouldn't, because I should not be touching memory that is not my own.

  • And indeed, thanks to Binky, we're about to see that bad things indeed happen

  • when you touch memory that you shouldn't.

  • Let's take a look.

  • [VIDEO PLAYBACK]

  • -Hey, Binky, wake up.

  • It's time for pointer fun.

  • -What's that?

  • Learn about pointers?

  • Oh, goody!

  • -Well, to get started, I guess we're going to need a couple pointers.

  • -OK.

  • This code allocates two pointers which can point to integers.

  • -OK, well, I see the two pointers, but they

  • don't seem to be pointing to anything.

  • -That's right.

  • Initially, pointers don't point to anything.

  • The things they point to are called pointees, and setting them up's

  • a separate step.

  • -Oh, right, right, I knew that.

  • The pointees are separate.

  • Er, so how do you allocate a pointee?

  • -OK, well, this code allocates a new integer pointee,

  • and this part sets x to point to it.

  • -Hey, that looks better.

  • So make it do something.

  • -OK, I'll dereference the pointer x to store the number 42 into its pointee.

  • For this trick I'll need my magic wand of dereferencing.

  • -Your magic wand of dereferencing?

  • Uh, that's great.

  • -This is what the code looks like.

  • I'll just set up the number and--

  • [POP]

  • -Hey, look, there it goes.

  • So doing a dereference on x follows the arrow to access its pointee.

  • In this case, to store 42 in there.

  • Hey, try using it to store the number 13 through the other pointer, y.

  • -OK, I'll just go over here to y and get the number 13 set up,

  • and then take the wand of dereferencing and just--

  • [BUZZER]

  • Oh!

  • -Oh, hey, that didn't work.

  • Say, Binky, I don't think the dereferencing y is a good idea,

  • because, you know, setting up the pointee is a separate step,

  • and I don't think we ever did it.

  • -Oh, good point.

  • -Yeah, we allocated the pointer y, but we never set it to point to a pointee.

  • -Hm, very observant.

  • -Hey, you're looking good there, Binky.

  • Can you fix it so that y points to the same pointee as x?

  • -Sure.

  • I'll use my magic wand of pointer assignment.

  • -Is that going to be a problem, like before?

  • -No, this doesn't touch the pointees.

  • It just changes one pointer to point to the same thing as another.

  • -Oh, I see.

  • Now y points to the same place as x.

  • So wait, now y is fixed.

  • It has a pointee, so you can try the wand of dereferencing again

  • to send the 13 over.

  • -Oh, OK, here goes,

  • -Hey, look at that.

  • Now dereferencing works on y.

  • And because the pointers are sharing that one pointee, they both see the 13.

  • -Yeah, sharing, whatever.

  • So are we going to switch places now?

  • -Oh look, we're out of time.

  • -But--

  • -Just remember the three pointer rules.

  • Number one, the basic structure is that you have a pointer

  • and it points over to a pointee.

  • But the pointer and pointee are separate,

  • and the common error is to set up a pointer,

  • but to forget to give it a pointee.

  • Number two, pointer dereferencing starts at the pointer

  • and follows its arrow over to access its pointee.

  • As we all know, this only works if there is a pointee, which kind of gets back

  • to rule number one.

  • Number three, pointer assignment takes one pointer

  • and changes it to point to the same pointee as another pointer.

  • So after the assignment the two pointers will point to the same pointee.

  • Sometimes that's called sharing.

  • And that's all there is to it, really.

  • Bye bye, now.

  • [END PLAYBACK]

  • DAVID J. MALAN: So what else can go wrong now

  • that we have the ability to touch, correctly

  • or incorrectly, any memory that we actually

  • have access to inside of our program?

  • Well, memory leaks are one such problem.

  • Now that you have the capability to ask the operating system for memory

  • via the malloc function, you have the ability to accidentally

  • not give that memory back.

  • In fact, a very common mistake in some programming languages

  • is to ask the operating system for a chunk of memory,

  • use it, and then never actually free it.

  • To give it back so that other parts of your program,

  • or other programs on the computer, can actually make use of that memory.

  • But fortunately there exist tools via which we can detect this,

  • and one of them is called Valgrind.

  • So this is another debugging tool that you'll now

  • be able to use once you start dynamically

  • allocating memory in your program.

  • So up until now you probably have not used malloc,

  • and therefore you have not likely actually asked

  • the operating system for more memory in this very dynamic way.

  • Instead, you have just declared an integer or an array,

  • and asked the computer for memory in other ways, not using malloc.

  • But suppose that we have a program like this.

  • In memory dot c I've made some mistakes.

  • Deliberately, but mistakes nonetheless.

  • And indeed, this draws inspiration from the documentation for this very tool

  • to highlight a couple of its most useful features.

  • So let's look at this program here.

  • There's no prototype for this function f,

  • just because the example that Valgrind, this tool, gives just puts it up top,

  • and that's fine.

  • But let's focus on main first.

  • Main takes no command line arguments, per it's mention of void,

  • and it returns an int as usual.

  • The function f gets called here.

  • So f open bracket, close bracket just means call the function f,

  • but pass it no inputs, and then eventually return 0.

  • So let's look at f.

  • What happens?

  • Well, f takes no inputs, produces no outputs.

  • It only apparently has these two lines of code.

  • And let's consider what it does.

  • It calls malloc, memory allocate, 10 times the size of int.

  • So that just means, hey, malloc, give me enough space for 10 integers.

  • Now we know on CS50 IDE that the size of an int recall is 4 bytes, or 32 bits.

  • So this is like saying, hey, computer, give me 40 bytes total.

  • 4 times 10.

  • Malloc recall returns the address of that chunk of memory,

  • and it stores it in x, which, according to the star, is the address of an int.

  • So that's all.

  • So borrowing some of the same ideas from the Binky example.

  • What is bad about this is this next line.

  • Why is x[10]=0 a problem?

  • Well, the size of this chunk of memory is what, 10 ints worth.

  • So 10 times 4, 40 bytes.

  • But remember how chunks of memory are indexed

  • when you use square bracket notation.

  • When you use square bracket notation, you better

  • start counting at 0, which means if we have 10 ints, first one is 0,

  • the last one is 9, the 10th one does not exist.

  • And so if I'm saying xx[10] gets any value,

  • that's like saying go there and put the value 0 there,

  • but I don't have access to that chunk of memory.

  • I did not ask the computer for that memory.

  • I asked the computer for all of this other memory, only 10 of these bytes.

  • So this feels problematic.

  • Moreover, I have asked the operating system for memory via malloc,

  • hey, give me 40 bytes.

  • I never handed it back.

  • Thereby introducing what we would call a memory leak.

  • Now, as an aside, once your program quits, all of the memory it's allocated

  • is actually automatically given back, but the problem is

  • that, for long running programs, things like your browsers,

  • or word processing programs, or any number--

  • Skype-- or any number of other programs that you might run locally

  • on your Mac or PC, if those bigger programs have memory leaks, such

  • that they keep asking Mac OS or Windows for memory

  • and never actually remember to give it back,

  • you might experience what you might have in the past of your Mac or PC

  • really starting to slow down and kind of crawl.

  • And that can be for any number of reasons, but one of them

  • is if your program or programs inside of them have some form of memory leak.

  • Not your fault. It's the programmer's fault. But memory leaks, nonetheless.

  • So what is Valgrind good for here?

  • Well, it turns out, that in Valgrind you can run the command like here.

  • Valgrind.

  • Leak check equals full.

  • Dot slash.

  • And then the name of your program.

  • And so let me do that.

  • Let me go ahead back into CS50 IDE and let me go ahead and make memory.

  • Compiles OK, so it doesn't appear to be syntactically flawed.

  • Dot slash memory.

  • Nothing bad seems to happen.

  • So that's all fine and good.

  • And you might think that, OK my program works, it's time to submit,

  • all is well, but not necessarily.

  • Let me go ahead and run Valgrind.

  • And let me actually go ahead-- let me go up here

  • and not create a new file, but a new terminal.

  • Just so that we have a much bigger window.

  • And let me go ahead into this directory and run Valgrind, dot slash memory.

  • Without any command line arguments besides that.

  • And hit Enter.

  • And the downside of Valgrind, frankly, is

  • that it's just-- output is atrocious.

  • Like it's just not easy to read.

  • But you start to learn to notice patterns in it.

  • So I've learned to notice when I make mistakes in programs that invalid right

  • of size 4, not a good thing.

  • That's indeed a problem of some sort.

  • But what's nice about Valgrind is that it tells me

  • that the source of this invalid right, whatever that means,

  • starts in memory.c line 13 and really rears its head in line 8 of memory.c.

  • Well, what are those lines?

  • Let's go to line 13.

  • Ah, that's my call to F. And line 8.

  • Oh, that's my use of x bracket 10.

  • So why is Valgrind telling me invalid right of size 4?

  • Well, size 4 sounds familiar.

  • That's the size of an int.

  • Invalid just means bad.

  • Right means change something, like mutate a value.

  • So invalid right of size 4 maybe just means this.

  • I am in an invalid way changing 4 bytes of memory

  • by storing 0 from the right hand side into x bracket 10 on the left hand

  • side because, again, that x bracket 10 is an 11th byte, rather,

  • an 11th int, that I never ask the operating system for.

  • I only asked it for 10, not 11.

  • And so Valgrind is telling me, somewhat cryptically, that I've screwed up.

  • On line 8 ultimately I've screwed up by touching 4 bytes of memory

  • that I shouldn't have.

  • Meanwhile, a little more worrisome is this.

  • Address such and such as zero bytes after a block

  • of size 4 allots, da-da da-da.

  • Oh, not that.

  • Let's focus on-- here we go.

  • Uh oh.

  • Leak summary.

  • Definitely lost.

  • Oh, my god.

  • Like I have lost 40 bytes of the computer's memory.

  • Now what does that mean?

  • This simply means that Valgrind has noticed

  • that I must have called malloc or some similar function,

  • asked for memory, 40 bytes' worth, and never gave it back.

  • Well, we haven't introduced the way of giving memory back so let

  • me at least address that now.

  • But let me take this advice.

  • Rerun with leak check full to see details of leaked memory.

  • So let me go ahead and do that.

  • Let me go ahead and copy this so I can rerun Valgrind.

  • Whoops.

  • Valgrind.

  • With that command line argument.

  • Dot slash memory.

  • And now I'll see a little more detail sometimes.

  • Although, it appears in this case it's already pretty verbose.

  • And so indeed I've still lost 40 bytes in one block.

  • And up here again is a reiteration of that invalid right of size 4.

  • So how do I fix this?

  • Well, let me go back into my program.

  • And you know what?

  • The program's still pretty useless, but at least let me fix that mistake.

  • And go ahead now and go at x bracket 9.

  • Let me remake my program.

  • Make memory.

  • And let me in my bigger terminal window do Valgrind dot slash memory.

  • And now, oh, damn it.

  • Still have definitely lost 40 bytes in one blocks,

  • but notice the output is much shorter now.

  • I don't have a mistake that was higher up here before.

  • So I fixed one problem.

  • Let me go ahead and fix that other problem.

  • After using this memory, let me go ahead and free x.

  • So free is the opposite of malloc in this case.

  • And if I go ahead and recompile make memory and then run it

  • in my bigger terminal window, nice.

  • All heap blocks were freed.

  • No leaks are possible.

  • Now what does that mean?

  • Well, it turns out, when we look at our computer's memory,

  • there is indeed that stack at the bottom that we talked about earlier.

  • And the stack is where frames of memory go, slices of memory go,

  • that are used when functions are being called.

  • And they get layered on top of each other

  • and get un-layered only once the function's return.

  • The heap, meanwhile, is a chunk of memory above the stack, by design.

  • At least as we've drawn it here.

  • And in the heap you have just a pool of available memory

  • that you can draw from at any point.

  • And what's powerful about the heap is that,

  • whereas the stack keeps growing up and then as functions return

  • it grows back down thereby throwing away, effectively

  • losing track of, any memory that we had on the stack,

  • like x's and y's and a's and b's and temp variables still,

  • the heap does not experience that kind of disposal of memory.

  • If you ask for memory via malloc, malloc gives you a chunk of memory over here,

  • or over here, or over here, up above the stack inside of the so-called heap.

  • And even when your function returns, whether it's f, or swap,

  • or something else, that use of memory persists permanently.

  • Only once you call free do you give that chunk of memory

  • back to the operating system and allow it to reuse

  • that memory for something else.

  • But unfortunately this picture kind of suggests

  • a problem and bad design in some sense.

  • Although, given a finite amount of memory, eventually

  • something's got to bump up against something else

  • if you want to keep asking for and using memory.

  • Those arrows kind of suggest that the stack and the heap

  • are really destined to collide with each other if what happens?

  • Well, if the stack grows too large.

  • In other words, if you keep calling function after function after function

  • after function and never return.

  • And maybe this is accidental.

  • Maybe you're using recursion, per last time,

  • and you accidentally don't have a base case

  • so you just keep calling functions.

  • So much so that you run out of stack space.

  • Or maybe you call malloc so many times that the heap keeps growing down

  • and down and hits the stack, meanwhile.

  • Either of those problems can persist.

  • And you might recognize some familiar terms.

  • The first of those scenarios describe stack overflow.

  • So if any of you have ever discovered the website, Stack Overflow,

  • which is a wonderful place to go for advice

  • and tips and tricks with programming, more generally.

  • stackoverflow.com draws its name from exactly that programming

  • risk and problem.

  • Heap overflow is similar in spirit.

  • It's just the opposite.

  • When the heap grows down too far, as we've drawn it.

  • And they're both examples, more generally,

  • of what you might call buffer overflows.

  • A buffer is a chunk of space that's typically finite in some form

  • and you can eventually overflow it by trying to use more memory then

  • you should have.

  • And a buffer overflows typically actually relates

  • to arrays and chunks of memory.

  • And, in fact, a great example from Wikipedia is this one here.

  • The code is a little cryptic too because it doesn't really do anything useful,

  • but it's fundamentally flawed in the following way.

  • Let's consider exactly what happens when you have a stack overflow.

  • It turns out that this is a bad enough problem that you can actually

  • lose control over your computer itself.

  • It can be hacked into if an adversary, a bad guy,

  • has the ability somehow to inject his or her own adversarial code, code that

  • deletes your files, or encrypt your files, or does something malicious,

  • into your program because of a memory related mistake you've made.

  • Let's consider this example.

  • Here's the main function.

  • In this particular example, it takes command line arguments arg-c and arg-v

  • and just apparently calls this function foo, passing in the second word

  • that the user typed in.

  • So the first command line argument that the user ran at the prompt.

  • And it just blindly does this.

  • It doesn't check arg-c, it doesn't check for null, it doesn't check anything.

  • It just blindly passes in arg-v bracket 1.

  • And that's generally bad practice.

  • Right?

  • If you're not error checking, bad things are likely going to happen eventually.

  • Well, what does foo do?

  • Foo takes a char star as inputs.

  • Now, a week ago, we would have said string bar,

  • but we don't need to hide that detail anymore.

  • Bar is just the address of a character.

  • It's a string.

  • Foo returns nothing, but has two lines of code.

  • First, it declares an array called c-- just because.

  • This is kind of a contrived example online,

  • but the pictures will tell it all-- of size 12.

  • So this is saying, hey, computer.

  • Give me an array of 12 characters.

  • Or give me a chunk of memory of size 12 that I plan to put characters into.

  • And then, mem copy, you might not have seen before,

  • but it essentially does this it copies into this chunk of memory

  • whatever is at that chunk of memory up to this many bytes.

  • So mem copy, as the names suggests, copies memory from here

  • to here a total number of this many bytes.

  • So why is this worrisome?

  • Well, if for whatever reason you have only allocated 12 bytes

  • and the user has typed a word at the prompt

  • when running your program that is more than 12 bytes,

  • it would seem that the user is able to touch memory

  • that you the programmer never intended.

  • Why?

  • Well, your use of mem copy is checking the length of something,

  • but it's checking the length of bar.

  • So if the user types in not 12 characters, but 20 characters,

  • strleng of bar is going to be 20.

  • The user's input is 20 characters long.

  • So your code is saying copy 20 characters from bar into c.

  • Unfortunately, c is only of size 12 and so that's 8 bytes you end up copying

  • that you shouldn't.

  • So if you were given a chunk of memory here that's only of size 8

  • and you're just 12, you're touching 8 more bytes to the right of it,

  • so to speak, that don't actually belong to you.

  • Now, at best, nothing's going to happen.

  • Your computer, your program is not going to notice.

  • Things are just going to hum along and all is going to seem fine.

  • Much like my memory program a moment ago.

  • Seemed fine, but Valgrind thought otherwise.

  • In this case, really bad things can happen.

  • Because if what the human adversary, the bad guy, has typed at the prompt

  • is not just some word like foo, or bar, or David, or Zamyla,

  • or any number of sort of innocuous strings,

  • but is actually the ASCII equivalent, so the textual equivalent,

  • of some malicious code-- so if he or she actually somehow typed at the prompt

  • a pattern of characters that really represent a pattern of bits that

  • do something bad, like delete all your files,

  • or spam everyone in your address book, or any number of things.

  • If they provide a string that's longer than 12 characters,

  • the overflow is going to end up somewhere in memory.

  • And where is that?

  • Well, let's zoom in a little lower level to what your computer's memory looks

  • like on the stack.

  • So it's a little more detail.

  • So let's absorb this for just a moment.

  • Think of this now as the bottom of the stack, parent routines stack.

  • So you can think of this as main, or foo, or some function on the stack.

  • It turns out-- and I didn't mention this earlier--

  • that besides there being arguments or parameters on the stack

  • and local variables on the stack, it turns out that the computer also

  • uses the stack just to kind of store a reminder for itself of the address

  • in memory of the function that called it.

  • So if this picture represents foo, the function right now, who called foo?

  • Main called foo and so what foo does is, on its own stack, it

  • just kind of jots down a note to itself here in pink, where it says,

  • return address, when I'm done, return to this address of main.

  • The computer does that for itself.

  • So when foo is done executing, previously, I just wipe the screen

  • and remove foo.

  • But how does the program know where to go back to after foo is done?

  • It's because it put a little bread crumb for itself.

  • The address of main, itself.

  • Not the address of variables, the address of the function itself,

  • which, recall, is in the text segment of your computer's memory.

  • But never mind that for now.

  • So now, let's look at what the stack frame for foo is actually like.

  • Here is that array of size 12.

  • And it's drawn as a rectangle.

  • If we made it wider, we could just do 0, 1, 2, all the way through C 11,

  • but the author of this graphic has just drawn things a little more square like.

  • So this is the first byte in C and this is the last byte in C, 0 and 11,

  • respectively.

  • And then bar, recall, is the only argument to the foo function

  • and that, as I did mention before, belongs on the stack as well.

  • But here's the problem.

  • And this is by design.

  • It appears that whoever designed the stack,

  • has it growing upwards, as in this picture.

  • And that's how I described it earlier.

  • So the stack itself grows up.

  • Frames go up and up and up and up.

  • But within a frame, notice what happens.

  • Within a frame, if you've got a buffer, an array of memory, a chunk of memory,

  • and you write to it, you start writing to that memory, by design,

  • at the top left corner, so to speak, to the right, and then down,

  • and then down.

  • So top left to bottom right, so to speak.

  • Pictorially.

  • This was not necessarily a good thing, at least in this case.

  • Because if the stack is growing up and this is the top of foo's frame

  • and it's use of the array c is by design going down,

  • that's all fine if the user only provides a string of length 12 or less,

  • 12 or shorter, but in this case if the adversary types in a 20 character

  • string or 200 character string, the computer

  • is not just going to stop there.

  • Unless you wrote code with an if condition

  • to check the length that the user is trying to pass in and protect

  • against that, it's going to overwrite all

  • of these bytes, all of these bytes, whatever this is, all of these bytes,

  • and this is what's worrisome.

  • If you have a really smart adversary up against you,

  • trying to hack into your system, and he or she is smart enough or lucky enough

  • to figure out how to inject a pattern of bits into your program in this way,

  • such that he or she overwrites this return address,

  • a really clever adversary can trick your program

  • into returning to, not the function that called foo, main,

  • but returning to the address of the adversary's own function

  • that he or she has somehow injected by way of typing input at the prompt,

  • in this case.

  • In other words, if the user types in just, hello, h-e-l-l-o, backslash, 0,

  • all is fine and good and it fits perfectly in that frame.

  • But because we didn't have good error checking

  • on the length of the memory we're copying,

  • suppose the adversary includes a whole bunch of adversarial code, which

  • the author here abstracted away as a, a, a, adversary, or attack, attack,

  • attack, what might actually happen here?

  • Well, if that adversary gets lucky enough or is smart enough,

  • he or she might be able to overwrite these 4 bytes here

  • with the address, brilliantly, of his or her own attack code,

  • which is the bits or the characters that he or she typed at the prompt.

  • So again, a lot of this is luck.

  • Or a lot of this is trial and error sometimes,

  • to attack a program in this way, but if you

  • can trick the computer into jumping to an address that

  • happens to point at data you injected into the program,

  • you have effectively taken over that user's program,

  • tricking the program into running any and all code that in this case

  • you provided by a arg-v. So if you've ever heard of a server getting hacked,

  • or in the future you will hear of a server getting hacked,

  • or a program getting compromised, could mean any number of things,

  • but one of them, and an all too common approach,

  • is that the programmer who wrote the software used a buffer,

  • used an array, a chunk of memory, and he or she

  • did not check the boundaries of that array sufficiently.

  • Did not make sure that his or her own code

  • was not going too far as was this particular example here.

  • So now that we have this power of going anywhere

  • we want in memory, it's incredibly easy to accidentally or maliciously use,

  • but that's it.

  • That boils down to just this basic understanding

  • of what's going on underneath the hood and what you can now

  • do with the computer's memory.

  • And now, let's transition to the second of our real world demands.

  • Now that we have the ability to talk about memory addresses

  • and actually do things at this lower level, turns out,

  • we can start solving some really interesting problems, among them

  • related to images and forensics, the art and science of recovering information.

  • In fact, you might recall from various TV shows or movies, that it's all

  • too common for the good guys in the movies and the TV

  • shows to be looking over the shoulder of a tech

  • and seeing some footage of a burglary or some other crime that's committed

  • and say, hey, can you clean that up?

  • Or can you enhance that?

  • And, indeed, let's take a look at one such clip from the real world

  • and see how it relates to the actual world

  • and how we can leverage some of this new found savvy with memory addresses

  • and computing to actually solve some of these problems for real.

  • [VIDEO PLAYBACK]

  • -He's lying.

  • -About what?

  • -I don't know.

  • -So what do we know?

  • -That at 9:15 Ray Santoya was at the ATM.

  • -OK, so the question is, what was he doing at 9:16?

  • -Shooting the 9 mm at something.

  • Maybe he saw the sniper.

  • -Or was working with him.

  • -Wait.

  • Go back one.

  • -What do you see?

  • -Bring his face up.

  • Full screen.

  • -His glasses.

  • -There's a reflection.

  • -That's the Nuevita's baseball team.

  • That's their logo.

  • -And he's talking to whoever's wearing that jacket.

  • [END PLAYBACK]

  • DAVID J. MALAN: OK.

  • So this is just nonsense.

  • Like, you can't just clean up an image, or look

  • at something that's very pixelated, so to speak,

  • where pixels are dots on the screen, and just kind of zoom in

  • and clean that up and enhance it.

  • And, indeed, that's kind of this running joke.

  • Enhance doesn't really generally mean enhance when you

  • have a finite amount of information.

  • For instance, here is a finite amount of information.

  • A photograph of Zamyla that looks to be of very high quality.

  • You can see lots of detail.

  • And so you might think that we can maybe see the reflection of someone

  • in her eye if we really just zoom in and enhance, but no.

  • This is what you see, and it's actually kind of creepy at this distance.

  • This is what you see when you zoom in on that exact same photo of Zamyla

  • to find that glint in her eye of the bad guy, or the badge on his shoulder,

  • or whatever it was.

  • There is only a finite amount of information here.

  • There is literally only a finite number of rows and columns

  • and therefore really big pixels from which to glean information.

  • Now, there do exist algorithms and software that can smooth this out.

  • You can enhance the image in the sense that you can kind of maybe

  • tweak the color so that it looks a little more gradual, the color changes,

  • and a little less jarring, but you can't just put pixels there,

  • put bits there that are not existent in the first place.

  • And so, with this, invites the question well, then what is an image,

  • how do you represent it, and what can you actually glean from it

  • and do from it?

  • Well, let's consider in the simplest case what an image might be.

  • Here is an image of a wonderfully simple smiley face.

  • It's black and white.

  • And if it's black and white, frankly, we can get away with just ones and zeros.

  • I might arbitrarily say that the number 1 shall represent the color white

  • and the number 0 shall represent the color black and, as such,

  • if I just have a bunch of patterns of 8 bits, ones and zeros,

  • I can effectively think of them as a grid.

  • Rows and columns.

  • And if I see a 0 pixel, assume it's a black dot, if I see a 1 pixel,

  • it's a white dot.

  • And, as such, I can create a bitmapped image.

  • Bitmap, it's like a map, x's and y's, of bits, bitmapped image.

  • There is the simplest, perhaps, smiley face we can draw.

  • Now it's not all that interesting.

  • It would be a pretty useless photograph if you only

  • had dots of that level of detail.

  • And indeed we saw, when you really zoom in on Zumilah's eyes,

  • you have really big grids.

  • And those were colorful, but there isn't that much information.

  • We need more resolution.

  • We need more dots, more pixels, more bits.

  • And so the general case is to use not something as simple as this,

  • but a more standard format like JPEG.

  • JPEG is what you see on Facebook and on your cameras, most likely.

  • It's an image format that photographs are commonly stored in,

  • because it stores millions of colors-- potentially for photographs--

  • and it also allows you to compress them.

  • You can compress JPEGs in such a way that you throw away

  • some of the 0's and 1's, thereby degrading the image.

  • So it's a little blurrier, or a little splotchy,

  • but at least it's much, much smaller and can be emailed or texted or stored

  • with far less space involved.

  • And it turns out, if we put on our forensics hats

  • now-- it turns out that JPEGs all share something in common.

  • Any of the JPEGs you see on the internet or have on your hard drive

  • or on your phone start with the same three bytes, typically.

  • Specifically, the first three bytes are the numbers 255, 216, 255.

  • Why that?

  • It's just the standard that was adopted some time ago for JPEGs.

  • It's like a magic number, if you will, at the start of the image.

  • Now it turns out these are decimal numbers.

  • And frankly, in the world of computing, and really

  • the world of file inputs and output-- file management--

  • we don't really talk in decimal.

  • It's just not commonly done.

  • More commonly done is to talk, not in decimal, not in binary

  • as we did in week 0, but in hexadecimal.

  • So it looks a little spooky at first, but it really

  • is an alphabet of 16 characters instead of 10, instead of 2.

  • And you start counting at 0, as the real world does-- 0, 1, 2, 3, 4, 5, 6, 7, 8,

  • 9-- but there is no 10, because these are single digits.

  • So after 9 comes A, B, C, D, E. So with hexadecimal,

  • you can effectively count from 0 to 15 so long

  • as you borrow some letters of the alphabet

  • to do 10, 11, 12, 13, 14, and 15.

  • So let's map that onto this pattern of numbers

  • that demarcates the start of a JPEG.

  • These numbers, if we do them out in binary, equals this.

  • And we can leave this as an exercise at home

  • if you'd like, but trust me, per week 0, that these

  • are the patterns of bits that are equivalent to 255, 216, and 255.

  • Those are the light bulbs you would need to turn on and off.

  • Meanwhile, we can add some space there-- so before, after--

  • and I did this deliberately.

  • Because it turns out that hexadecimal, insofar as you can count as high as 15

  • for 16 total characters-- it turns out that hexadecimal is really useful.

  • Because with four bits, four 0's and 1's, you

  • can count from zero all the way up to what?

  • All the way up to 15, which is perfect, because if you have four 1 bits,

  • that's 15.

  • If you have four 0 bits, that's 0.

  • And it turns out that patterns of 4 bits-- so half of a byte--

  • map perfectly to hexadecimal characters.

  • You can express chunks of 4 bits with hexadecimal characters perfectly.

  • As such, it turns out that 15, again, is F. So we have FF.

  • And it turns out that 1101 and 1000 from 216.

  • If you take those 8 bits and just spread them out

  • slightly into two groups of four, is the letter D in hexadecimal and the number

  • 8 in hexadecimal.

  • And then we have another couple of 15's-- two sets of four 1 bits.

  • So that's FF.

  • Which is to say that the first three bytes at the start of any JPEG

  • happen to be FF D8 FF.

  • And it's human convention to prefix hexadecimal digits for clarity

  • in the real world with just 0x.

  • It means nothing fundamentally other than, hey world, here comes

  • a hexadecimal number, so that the world just knows what it's looking at

  • and doesn't confuse it for some other numeric or base system.

  • So 0x just means here comes a series of hexadecimal digits.

  • So that is what is at the beginning of a JPEG.

  • And in fact, one of the problems we put before you in CS50 each year

  • is to recover some JPEGs.

  • In fact, it's all too common, unfortunately, to lose photographs

  • from a memory card or to accidentally delete files from your computer.

  • And typically, you might freak out or be worried

  • that, damn, I didn't mean to delete those photos or delete those files.

  • And sometimes they are gone for good.

  • But it turns out that computers and even phones

  • don't necessarily delete photos right away.

  • They forget where they are, but they don't outright delete them.

  • They don't change the 0's to 1's and the 1's to 0's unless you start taking

  • more and more photos immediately.

  • But if you have accidentally formatted or corrupted your memory card

  • in a phone or a computer, such that you think you've now lost your photos,

  • what if you did this?

  • What if you wrote software that iterated from the start of your memory card

  • or your phone all the way to the end of your phone or your memory card's memory

  • looking for this pattern of bits, this pattern of numbers

  • or hexadecimal digits, this pattern of 24 bits-- 8 plus 8 plus 8?

  • With high probability, if, upon iterating

  • through your phone's memory or camera's memory, you see FF D8 and FF,

  • you can perhaps infer that the following megabytes-- 2 megabytes, however

  • big the photograph is-- is the entirety of a photo.

  • And if you just grab those bits and save a copy,

  • maybe you can recover all or some of the photo.

  • And indeed, one of the challenges you'll soon see in our problems

  • that focus on forensics is going to be exactly this.

  • We will accidentally format a memory card from a camera or a phone.

  • We will give you a forensic image of that memory card, so to speak,

  • a perfect copy of all of the 0's and 1's therein, give it to you as a file,

  • and challenge you with writing software that iterates over that file's bytes,

  • from the 0th byte all the way to the end of the file, looking for this pattern.

  • And any time you see this pattern, you will need to say,

  • this is probably a JPEG.

  • Let me start copying these bytes into a separate file name-- something.jpg.

  • And indeed, if you get the code right, suddenly out of this forensic image

  • pops any number of photographs that we thought were lost forever.

  • Bitmap is another type of image file-- older, perhaps,

  • than JPEG and simpler-- truly a bitmapped image

  • with slightly less fancy algorithms for compression.

  • But if you remember this world, this is perhaps

  • a perfect example of a bitmap image from Windows XP back in the day.

  • The wallpaper or desktop image that many of you might have had on your computers

  • looked like this.

  • As an aside, if you go poking around I think on Wikipedia,

  • someone wonderfully went to this same spot, found it

  • where the artist had taken this photograph for Microsoft,

  • and it now looks like this.

  • So it's not quite the beautiful place it once was,

  • but that, indeed, is the same hill.

  • And this is really just an opportunity to talk

  • about this picture, which is what's really

  • underneath the hood of that beautiful grassy meadow.

  • It's this kind of file header.

  • We very quickly can make nature seem very boring quickly with this.

  • But this is a very complicated, convoluted looking way

  • of just formalizing what we mean by a file.

  • Like, what is a file?

  • A file is just a bunch of bits, 0's and 1's, stored somewhere on your laptop,

  • or desktop, or phone, or wherever.

  • But what does it mean to be a Microsoft Word

  • file versus a JPEG versus a bitmap versus an Excel

  • file versus any number of other file formats

  • that you might have on your computer?

  • MPEG4, or MP3, or AAC, or any number of media formats as well-- well really,

  • what it means to be a file, is to be filled with 0's and 1's-- that's true--

  • but to follow a certain pattern of 0's and 1's such that the first several

  • bits or first many bits of a file typically follow a pattern.

  • And indeed, JPEGs have a pretty simple pattern.

  • Those first three bytes, those first 24 bits,

  • follow a certain pattern of 255, 216, 255.

  • And that demarcates the start of a JPEG.

  • And there's some more complexity therein, too.

  • In a bitmap, this looks overwhelming at first glance,

  • but this is how the world decided to standardize

  • what's inside of the first few bytes of a bitmap

  • file like that grassy meadow from Windows XP.

  • And we'll come back to this in more detail and the problem set before long.

  • But you'll notice some identifiers here that might jump out at you.

  • So width, height, size, compression-- so there are some key words in there

  • that might make some intuitive sense.

  • And indeed, somewhere inside of a bitmap image, like that grassy meadow,

  • is information stored, like what is the width of that image?

  • What is the height of that image?

  • Is it compressed, and how?

  • And then more interestingly, at the bottom of this picture

  • is the beginning of what bitmap calls RGB triples-- Red, Green, Blue triples.

  • It turns out with patterns of three bytes,

  • you can represent, top to bottom, left to right, essentially, the color

  • of every one of the dots in an image.

  • So when we enhanced Zamyla's image, and you literally saw those really big dots

  • or pixels.

  • Each of those dots or pixels has 24 bits representing

  • its color-- how much red, how much green, how much blue

  • per our discussion of colors in week 0.

  • And so that's what most of a bitmap file is here.

  • The dot dot dot just means these can go on for quite some time.

  • What color is this dot?

  • What color is this dot?

  • What color is this dot?

  • All of that is standardized inside of the file.

  • And it turns out in C, we can express and represent that kind of structure

  • using one last keyword for this week.

  • In C, there is also a keyword called struct-- structure--

  • that allows you to create an actual structure

  • inside of which you store information.

  • For instance, the programming language C does not

  • come with a data type that represents students.

  • It comes with ints, and chars, and floats-- heck, it doesn't even

  • come with string.

  • It certainly doesn't come with student.

  • But I would generally think of a student as being

  • the combination of a few things.

  • And let's keep it simple.

  • A student is a combination of the student's name and their dorm.

  • And bunches of other detail, but for now, let's just focus on name and dorm.

  • Now the syntax is a little new for us here,

  • but all this is saying is, hey, C, define a type for me that

  • is a structure containing two fields inside of it, name and dorm,

  • and call this new type students.

  • So it's a little more complicated than our typedef for a string,

  • because a string was just a synonym for char star.

  • This case, a student, is a synonym for this container of things.

  • It's like encapsulating two pieces of data-- name and dorm.

  • But this is really useful as follows.

  • With this kind of C code, can we now do things like this?

  • Let me go ahead and open up structs.h.

  • And you'll see a little header file I whipped up

  • here that contains very little information,

  • just that same typedef for struct, but I've also

  • included CS50.h in my own header file.

  • Why?

  • Well, just for today's purposes, I wanted

  • to relax our discussion of char star for a moment and talk about string.

  • But technically, I could just do this.

  • And technically I could do this.

  • And then I could get rid of that, because now this is true C code.

  • There is no need for the CS50 library.

  • But either is equivalent, and whatever you're more comfortable with for now

  • is certainly fine.

  • But let me now open up structs-0.c and do something with this.

  • So in this program, notice that the top of this file, I

  • first define a value called students to be 3.

  • This is an example of a constant.

  • Sharp define or pound define here says, define, not a variable,

  • but a constant-- and the convention in programming

  • is to use all capital letters-- define a constant equal to some value.

  • This is not a variable whose value can change.

  • This is just a keyword, STUDENTS, in all caps,

  • that is fixed now on that value 3.

  • And it's one way to define a constant in C, a value that does not change.

  • What am I now going to do with this value?

  • I first, here, declare a variable called students containing three students.

  • So it's a little confusing that we've got students, STUDENTS, student,

  • but they all mean different things.

  • students is just going by the name of my array.

  • STUDENTS in all caps is the value 3 from that constant up above.

  • And then student is the type of this array per the header file, structs.h.

  • So this is just really a succinct way of saying,

  • hey, computer, give me an array of size 3 inside of which are three students,

  • ultimately.

  • Now here's an array that iterates from i up to three.

  • I print out name colon, prompting the user with getstring for a name.

  • And then notice this new operator dot.

  • Because struct is being used, and students bracket i is a structure, not

  • an int, or a float, or a string itself, we have to go inside of that structure

  • to get its field called name.

  • So dot name says, here's the structure student-- go inside,

  • and get at the name field, and put at that name field whatever

  • the return value of getstring is.

  • And do the same thing for dorm.

  • Go into the i-th student's structure, and store at the dorm field

  • the string that has been typed in next.

  • Now let's scroll down a little lower and take a look

  • at what this program proceeds to do.

  • It's a very simple loop that sort of, just for demonstration's sake,

  • says so-and-so is in such-and-such, printing out the student's name

  • and dorm respectively.

  • In other words, if I go down here and compile structs-0, and then

  • run structs-0-- I'm going to go ahead and type

  • in David Mather and Rob and Kirkland and Zumilah and Courier,

  • and you see literally a regurgitation of what I wrote.

  • But what's key here is the syntax that I've introduced.

  • I'm using a struct, and I'm accessing its properties or fields

  • within by way of this dot operator.

  • Just means go inside the struct and get that value.

  • But more cool than that is that now that we understand files,

  • and now that we have the ability to store arbitrary structures that did not

  • come with C, so to speak, notice what else we can do.

  • In this version of the program, I'm adding one final feature,

  • combining today's discussion of pointers with this most

  • recent discussion of files.

  • It turns out, in C-- and it's a little inconsistent--

  • there is a data type called FILE in all capital letters, for whatever reason.

  • In lower case here, I'm just saying hey, computer, give me a variable called

  • file of type file star.

  • That is, give me a variable that will store

  • a pointer to a file, the address of a file, if you will.

  • fopen is a new function that we've not used before that says,

  • hey, browser, open up the file called students.csv

  • in write mode. w means write mode.

  • So I want to save a file, not open and read a file, but write and save a file.

  • Sanity check-- is file not equal to null?

  • If so, we're going to proceed.

  • I just want to make sure nothing went wrong.

  • And something might go wrong, if maybe I don't have permission on the computer

  • to create files, maybe the computer's out of space or something-- something

  • could go wrong and null is returned per the documentation for fopen or its man

  • page.

  • So here I have a loop.

  • For i from 0 to 3-- so do this 3 times-- fprintf.

  • So fprintf is identical to printf that you've

  • come to get to know in recent weeks, but fprintf is file printf.

  • And it allows you to print strings, not to the screen, but to a file.

  • So its first argument is the name of the file

  • that you've opened that you want to print to.

  • Second argument-- and all the others-- are just the same as printf as always--

  • a string, maybe with some format codes, and then some

  • values that you want to plug in here and here.

  • So if you've ever seen a CSV file in the real world-- a Comma Separated Values,

  • these are very simple Microsoft Excel files

  • or Apple Numbers files or Google Spreadsheets

  • can export these as well-- they are just text files whereby all of the fields

  • are separated by commas.

  • And so we now, today, have the ability not only to understand what's going on

  • underneath the hood memory-wise, but we now

  • have the ability to use the star operator-- and therefore pointers more

  • generally, which are required in order to manipulate files

  • with fopen and this file data type-- and with fprintf can I write information?

  • Can I print information to those files?

  • So for the first time ever, I have written a program

  • that, when I run in just a moment, is going to have the ability

  • to save information.

  • I have effectively implemented the equivalent of the File Save menu

  • for the first time.

  • Every other program we've written so far just

  • throws its information and its memory away, but not this time.

  • Let me go ahead and do make structs-1, and then

  • dot slash structs-1, and let me go ahead and type in David and Mather and Rob

  • and Kirkland and Zumilah and Courier.

  • Enter.

  • Nothing seems to happen.

  • But let me go to my file browser in the IDE, and notice this-- students.csv.

  • If I open that, I have created this file-- David comma Mather, Rob comma

  • Kirkland, Zamyla comma Courier.

  • And if I downloaded this file, double-clicked it

  • as you can now if you'd like, and open it on your Mac or PC,

  • if you have Numbers installed or Excel installed or some comparable program.

  • That should open that program, because it will recognize the .csv extension.

  • And it will display all of those names and all of those dorms

  • or houses in separate columns.

  • Because the world decided some time ago that the format known

  • as CSV, Comma Separated Values, is just simple text file

  • with commas separating values.

  • And so we now have the ability to express all of this and more.

  • So on the horizon is to solve some of these same kinds of problems

  • and to actually implement for yourself code that writes files and reads files.

  • But to solve actual problems motivated from the real world domain,

  • to recover information that's been deliberately

  • scrambled or obscured, to recover information

  • that's been accidentally or maliciously deleted,

  • to recover things, photographs, memories that you actually

  • care about-- all with a fundamental understanding of what's

  • going on inside of your own computer.

  • But until then, keep an eye out for such TV shows and movies as these.

  • [VIDEO PLAYBACK]

  • -OK.

  • Now let's get a good look at you.

  • -Hold it.

  • Run that back.

  • -Wait a minute.

  • Go right.

  • -There.

  • Freeze that.

  • -Full screen.

  • -OK, freeze that.

  • -Tighten up on that, will you?

  • -Vector in on that guy by the back of you.

  • -Zoom in right here on this spot.

  • -With the right equipment, the image can be enlarged and sharpened.

  • -What's that?

  • -It's an enhancement program.

  • -Can you clear that up any?

  • -I don't know.

  • Let's enhance it.

  • -Enhance section A6.

  • -I enhanced the detail and--

  • -I think there's enough to enhance.

  • Release it to my screen.

  • -Enhance the reflection in her eye.

  • -Let's run this through video enhancement.

  • -Edgar, can you enhance this?

  • -Hang on.

  • -I've been working on this reflection.

  • -Someone's reflection.

  • -Reflection.

  • -There's a reflection of the man's face.

  • -The reflection.

  • -There's a reflection.

  • -Zoom in on the mirror.

  • -You can see a reflection.

  • -Can you enhance the image from here?

  • -Can you enhance it right here?

  • -Can you enhance it?

  • -Can you enhance it?

  • -Can we enhance this?

  • -Can you enhance it?

  • -Hold on a second.

  • I'll enhance.

  • -Zoom in on the door.

  • -Times 10.

  • -Zoom.

  • -Move in.

  • More.

  • -Wait, stop.

  • -Stop.

  • -Pause it.

  • -Rotate it 75 degrees around the vertical, please.

  • -Stop.

  • And go back to the part above the doors again.

  • -Got an image enhancer that can bitmap.

  • -Maybe we can use the Pradeep Sen method see into the windows.

  • -This software is state of the art.

  • -The eigenvalue is off.

  • -With the right combination of algorithms--

  • -He's taken illumination algorithms to the next level,

  • and I can use them to enhance this photograph.

  • -Lock on and enlarge the z-axis.

  • -Enhance.

  • -Enhance.

  • -Enhance.

  • -Freeze and enhance.

  • [END PLAYBACK]

  • DAVID J. MALAN: All right.

  • That's it for CS50.

  • We will see you next time.

  • [VIDEO PLAYBACK]

  • -Everyone knows you went to Yale, but nobody knows what happened.

  • What can you tell me about that weekend?

  • What can you tell me about Rosebud?

  • [DRAMATIC CHORD]

  • [DRAMATIC MUSIC PLAYING]

  • [END PLAYBACK]

[MUSIC PLAYING]

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it