  • [MUSIC PLAYING]

  • DAVID MALAN: All right.

  • This is CS50 and this is lecture five.

  • And you'll recall perhaps that in this most current week,

  • you've probably been tackling a little something like this.

  • And if you don't actually have this experience from childhood,

  • what we're referring to in this problem set

  • is these kind of glasses that aren't 3-D glasses because it's both red eyes

  • or sometimes it's just a big piece of plastic.

  • But with this you can actually look up and see

  • what the answer is supposed to be.

  • And so this is the allusion to which we're referring

  • and the goal ultimately in problem set five's whodunit

  • is to actually figure out how to implement that kind of red filter.

  • But to do that, you first have to understand

  • this thing, which at first glance, admittedly, looks pretty complicated.

  • But if you've dived into the problem already,

  • you'll probably have wrapped your mind around at least

  • a few of these fields like the size of the image or the width of the image

  • and the height of the image, which should be a little more reasonably

  • straightforward.

  • But to implement this, you've had to deal with something

  • called a struct or a structure.

  • And so in C, we have this feature, recall.

  • And we didn't really play with this that much last time.

  • But you've seen it now in forensics, or you soon will.

  • And here we have the definition of a student.

  • So when C was invented decades ago, they didn't foresee

  • the need for a student data type.

  • They had int and char and float and double.

  • But we can invent our own data types much like in Scratch.

  • We can make our own puzzle pieces as follows.

  • Typedef to define a type, struct to say here comes a structure.

  • And what is the structure known as student?

  • Well in this case, I arbitrarily decided that a student would just

  • have a name and a dorm and both of those would be strings.

  • And you can imagine putting other things in there

  • like ID numbers, phone numbers, email addresses, or whatnot

  • so that you can actually combine all of this information together.

  • So let's just take a quick look at how you might use code like this.

  • Here is a file called struct.h.

  • It's common, but not necessary to declare your structures inside

  • of a file that also ends in .h so that we can share it across multiple

  • programs just like with other libraries' header files.

  • And here I've taken those training wheels

  • off as before, where string is actually just a white lie for char *.

  • But this is really the same data structure

  • and it's in a file called struct.h.

  • So let's take a quick look now at a program that actually uses

  • this in struct0.c. So let's take a look at what we've done here.

  • In struct0.c we have some header files up top.

  • But we also include this header file so that we have access

  • to this new custom data type.

  • And then in main we do a few things.

  • We first go ahead and ask the user for an integer called enrollment.

  • So hopefully they'll give us a positive number.

  • If we then do get back a number as expected in line 13,

  • what do we do in English here?

  • How would you just describe what line 13 is doing at this point in the term?

  • Anyone, yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, give us an array of students of size enrollment.

  • So even though on line 11 and prior we didn't

  • know how many students we needed, we did get on line 12

  • that answer, the enrollment.

  • And so on line 13 we declare an array using a variable saying,

  • give me this many elements, this many students in my array to store things.

  • And then we proceed, in the lines below, as follows.

  • We start iterating over the enrollment from zero on up to enrollment.

  • And we prompt the user on each iteration for a student's name and dorm.

  • And the right hand side of those two lines of code is pretty familiar.

  • You're just calling, get string.

  • But on the left hand side, we do have a slightly new piece of syntax.

  • We have students[i], which gives you the i-th student in the array.

  • But what piece of syntax perhaps jumps out at you?

  • Especially if you've never programmed before?

  • And we've not used this symbol just yet in this context.

  • What looks different?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, so the .name.

  • So you can probably infer that .name and .dorm is somehow accessing

  • the student's name and the student's dorm.

  • And that's literally all that's happening.

  • This dot operator tells the computer, go inside of the student's structure

  • at that i-th location and store the string that's coming

  • back from get string in that variable.

  • And similarly, do the same for dorm.

  • So it's like taking a struct and then looking inside of it

  • and going very specifically to one of the elements therein.

  • We've never needed this dot operator before because in the past, any array,

  • any variable we've had has just been a string or an int or float.

  • We haven't had anything to dive deeper into.

  • So that's all that's going on there.

  • We've encapsulated, so to speak, inside of a student structure, name and dorm.

  • And then this last part is actually just a printing out

  • of that same information.

  • We're just printing out, so-and-so is in such-and-such a dorm

  • by passing in those two strings using our familiar %s.

  • Now this program at the end of the day is kind of pointless

  • because you prompt the user for some number of students' names and dorms,

  • you print them out, and then you throw them away forever.

  • And that's not all that useful of a program long term.

  • And so we have in our second version of this program, struct1.c, a new trick,

  • too.

  • That's a teaser as well for the direction

  • we're going in this problem set, next problem set,

  • and beyond, where we're actually using files where files on a computer

  • are just a whole bunch of bits, zeros and ones.

  • Those zeroes and ones follow some pattern.

  • But we have yet to see a mechanism for actually saving files.

  • But here's how we can do it.

  • So above this line here, 21 and above, same program.

  • Just get a bunch of students from the user, per their name and dorm.

  • Then here line 24 we see something that's

  • a little new, though you have seen this in the forensics problem so far.

  • We call a function called fopen, meaning file open.

  • That takes two arguments, according to its documentation.

  • The name of the file you want open and then the second argument

  • is how you want to open the file.

  • And even if you've never seen this before,

  • what might the W there represent?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, right.

  • So read and write are two of the most common operations for a computer.

  • R would be read, W would be write.

  • And this kind of makes sense if the goal now is to save these students to a file

  • so that the program is actually useful if you run it again and again.

  • So here we have a new data type on the left.

  • It's all caps, which is a bit of an anomaly, even in C.

  • But FILE *file just says, hey, give me a variable called file

  • that can store the address of a FILE, so to speak.

  • And technically, that's not the address of the file on disk.

  • That's the address in RAM once you've opened the file.

  • But for now, just assume that this is an abstraction for the file.

  • And it's just called literally file.

  • So what is line 25's purpose in life?

  • Even though we've never written this code before now.

  • Yeah, what do you think?

  • AUDIENCE: If the file exists.

  • DAVID MALAN: If the file exists, or more generally, if the file was successfully

  • opened.

  • Because there could be bunches of things that go wrong.

  • One, as you've implied, the file might not exist.

  • Two, maybe you don't have permission so it exists but you can't open it.

  • Three, maybe there's not enough memory in the computer

  • to open yet one more file.

  • So any number of things could go wrong.

  • And recall, what is the special sentinel value that typically represents errors

  • now that we're in the world of pointers?

  • NULL, so N-U-L-L in all caps, is sort of the opposite of having a valid address.

  • It means no such address.

  • So we use it to check for errors.

  • So that's all line 25 is doing.

  • And then the rest of this is almost identical to the previous program,

  • except we're not using printf. What are we apparently using?

  • fprintf, and take a guess as to what the first f in fprintf stands for?

  • Yeah, so file printf. So it works almost the same

  • except it takes one more argument.

  • The very first argument now is not a format string

  • like it's been for ages now.

  • It's instead a variable that references an open file and then the rest of this

  • is all the same.

  • So what's really cool about fprintf is that you don't just

  • print to the screen.

  • You actually print to whatever file you've

  • opened so that then, on the very last line

  • here, when I call fclose, that's equivalent to saving the file.

  • And then the program just quits.

  • And so what's neat about this in the end is

  • if I go ahead and scroll up here, and let

  • me go into my source five directory.

  • Let me go ahead and make struct1.

  • Structs.h not found.

  • What did I do?

  • Well, I just screwed up before class and misnamed this file

  • so that didn't just happen.

  • So now I've compiled the program.

  • All right, so now let me go ahead and run this, ./struct1. Enrollment will be

  • three.

  • And we'll say that it'll be Maria who is in Cabot House and Brian who

  • is in Winthrop.

  • And say David was, say, in Mather.

  • Enter and nothing seems to happen.

  • But if I type LS now and look inside this directory,

  • notice that I have a file called students.csv deliberately named

  • because if you've ever used Excel or Numbers,

  • a very common file format is what's called the CSV, comma separated values

  • format.

  • And this is sort of like a very simplistic database.

  • If I open this you'll see that indeed, the contents of this file

  • are separated by commas.

  • And if I were to actually open this file up in Excel, each of these columns

  • would open up visually in exactly that.

  • So what I did with my fprintf, if I go back to structs1.c,

  • notice that I consciously included that comma there,

  • to create this sort of faux database format.

  • And just for good measure, let me see if I

  • go to download from the IDE's file manager

  • and I go ahead and open up students.csv.

  • And then if the program cooperates here, we have Microsoft Excel.

  • And now I've made myself a tiny little spreadsheet.

  • Now using C code.

  • Now we're going to find pretty quickly that this is not

  • all that useful to make CSV files.

  • Because the more and more rows we add to these files, the slower and slower

  • it's going to get to search them.

  • And so before long, as we transition next week and beyond to web programming,

  • we're actually going to replace spreadsheets or CSVs like this

  • and actually replace them with something more powerful, namely databases.

  • So that's a teaser then of what's to come.

  • But where did we begin this conversation?

  • It all kind of keeps coming back to what's

  • inside of our computer, which we can continue abstracting away.

  • You don't have to understand how this hardware works.

  • But we previously had said that you can at least think

  • about chopping up your computer's memory into a grid

  • so that you can just number the bytes.

  • So that you have specific locations otherwise

  • known as addresses or pointers.

  • Last time we clarified that not all memory is

  • treated equal inside of the computer.

  • Rather, different chunks of memory are used differently.

  • So the top portion, so to speak, but there is no notion of top in reality.

  • This is just an artist's rendition.

  • So the top of your computer's memory might

  • be the heap, whereby you store certain types of values

  • and then down here is the so-called stack where

  • you store other types of values.

  • And if we zoom out there's actually different layers still of memory.

  • So let's actually tease apart what's going on here.

  • If, when you run a program, you have access

  • to a gigabyte of RAM or two gigabytes, and indeed, that's

  • what your Mac or PC does.

  • No matter how much RAM you have, the computer

  • typically gives you the illusion of having access to all of it.

  • And so this might be two gigabytes, then, of memory.

  • Well, one of the first things that happens

  • is that the zeros and ones that compose your program, whether it's

  • called a.out or caesar or vigenere or struct1, those zeros and ones

  • are loaded way up top here in your computer's memory.

  • So the text segment in memory is a weird name

  • for the zeros and ones of your actual program.

  • It's not ASCII text.

  • It's like literally zeros and ones of your compiled program.

  • Below that are what are generally called initialized data or uninitialized data.

  • And this essentially just means any global variables

  • you have in your program are stored here or here.

  • If you gave them values at the top of your program,

  • they're initialized by definition.

  • And if you didn't, they're uninitialized.

  • So the compiler just kind of lays those out a little bit differently.

  • At the very bottom are something called environment variables,

  • which we don't use too much but you'll use them

  • in a few weeks for web programming.

  • You'll often store things like user names and passwords or other values

  • that you don't want to save in your code.

  • But you want the Mac or PC or server to have somehow access to.

  • But these are the ones we'll talk about the most, stack and heap.

  • And we saw a couple of examples of each of these, though, briefly.

  • What did we use the stack for or claim it's used for last time?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Say again?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: int main(void), and more generally,

  • functions when they are called.

  • Main is, of course, the go to one for all programs' beginnings.

  • But it might call another function like swap.

  • And swap itself might call something else, maybe printf gets called.

  • So every time you call a function, it gets a slice of or a frame of memory.

  • And they go up and up and up as those functions get called.

  • And this was ultimately illuminating, at least theoretically,

  • as to why this program was broken.

  • The program we looked at last time, the swap and no swap programs,

  • we claim that this implementation was wrong.

  • And yet I think when Kate came up and we did the example with switching

  • the Gatorade flavors, this is pretty much

  • an interpretation of that into code.

  • And it's correct in one sense and it's incorrect in another.

  • In what sense was this code actually correct?

  • In the no swap program.

  • Because we did walk through it briefly with the debugger.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: OK, we needed a temporary container or variable in which

  • to store one of the values or one of the Gatorade flavors.

  • And by the time we got to this third and final line

  • in this function, what could you say about A and B?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, they had in fact been swapped.

  • And I saw that, I think, by plugging in, I think I struggled with the debugger

  • so I used eprintf at the last minute just

  • to see what had happened after that very third line.

  • So it works.

  • This function does swap A and B. But it did not swap what?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: X and Y, which were the variables in main.

  • And so recall the story we told last time,

  • was that if we focus only on your computer's stack,

  • that sort of bottom portion of memory, when main is called,

  • it gets a chunk of memory down here at the bottom

  • because it's the very first function to be called.

  • And it had variables, recall, called X and Y whose values were one and two.

  • When main called swap, the other function we just saw,

  • it had values called A, B, and also temp that initially were one and two.

  • And eventually became two and one.

  • But that picture kind of answers the whole question.

  • The reason X and Y didn't change is because you literally

  • changed, in that red version, just A and B.

  • So we solved this problem, recall, last time with what new feature of C?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Pointers, so addresses.

  • Rather than just hand the function the values you want to swap,

  • give the function a road map, so to speak, to those values

  • so that the function can go to the values you actually care about

  • and move them wherever they are.

  • And it's a strange looking syntax at first.

  • It looks like multiplication all over the place.

  • But it had two different uses.

  • If you have the star or asterisk up here and a data type like int next to it,

  • this is saying, hey computer.

  • Give me a variable called A. But that's not going to store an int.

  • It's going to store what?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, the memory address of an int and just as B

  • is going to store the address of another integer.

  • So those are kind of placeholders for the addresses.

  • So down the road is the Science Center.

  • So if the address of the Science Center is 1 Oxford Street, Cambridge,

  • Mass 02138, that is its postal address and it uniquely identifies that

  • building.

  • Similarly inside of a computer do values have unique addresses.

  • They're just much simpler.

  • They're numbers, they don't have streets and zip codes and all of that.

  • But it's the same exact idea.

  • So here we still have a variable temp of type int.

  • So give me a temporary variable just like Kate had

  • the extra cup that was initially empty.

  • *A, though, without a data type to the left of it, was like saying what?

  • *A in this context is a sort of different English sentence than this

  • one.

  • This means give me a pointer to an int or declare for me a variable

  • that will store the address of an int.

  • But this says, go to that address.

  • So it means A is an address, *A means go to that address so you can get

  • at the value, which probably in the story, is one.

  • *A means the same thing.

  • Again, go to that address.

  • And then, by the way, go to that address B.

  • Whatever that value is, put it where this finger is already pointing.

  • And then *B means go to that address and put whatever was in the temporary cup

  • of Gatorade, or in this case, the value one.

  • So pointers, though kind of a very convoluted way

  • of fixing this, solve the problem fundamentally, because now

  • rather than passing one and two, we instead passed in here the address of X

  • and the address of Y that allowed the computer

  • to then go to those locations in memory and actually

  • do what it is we wanted it to do.

  • So long story short, this is how the computer's stack is used.

  • When you call a function, it gets a new slice of memory on top

  • of whatever function called it.

  • As soon as that function is done executing,

  • this memory effectively goes away.

  • It's technically still there, because it's hardware.

  • It's not going to physically disappear.

  • But now whatever next function main calls, maybe printf, maybe something

  • else, it will reuse this memory in any number of ways

  • that it wants for its own local values and parameters.

  • So given that definition, the fact that the stack, kind

  • of like trays in the cafeteria, grows up and down

  • and up and down as a program calls functions and those functions

  • return, where do garbage values actually come from?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Old things that were in that memory.

  • So I kind of made this sweeping claim in the past

  • that you shouldn't trust variables if you yourself

  • haven't put values in them, because they have so-called garbage values.

  • But that actually has a very precise meaning.

  • That means if you have called a function and it

  • needs a variable that just happens to be here, and this is like a minute

  • into your program's running, a whole bunch of different functions

  • might have used and unused, used and unused that portion of memory.

  • So they're just going to have zeros and ones lingering there in some pattern.

  • The computer could be really defensive and it could just change

  • these all to zero bits all the time.

  • And that could have been a reasonable design decision,

  • but long story short, C does not do that.

  • It would have just been time consuming, especially years ago

  • when computers were slower.

  • The language was younger.

  • And it just wasn't compelling to do that.

  • So you just get garbage values, which I have typically

  • just written as question marks.

  • But that's why.

  • There are garbage values there because they're your own previous values,

  • or those of some functions.

  • So the heap, meanwhile, was fundamentally different.

  • So the heap is this upper portion of memory

  • that is in some sense conceptually above the stack.

  • And it's up here.

  • And that's different in the sense that it's more for long term storage.

  • The stack is for short term storage, just

  • to use locally when a function is executing.

  • But suppose your program is to run for a while,

  • or suppose you want a function to allocate memory that does stick around

  • and does not just immediately become garbage values.

  • In fact, think about GetString.

  • GetString is a function we wrote and its purpose in life

  • is to get a string from the user, which is a whole bunch of characters.

  • And consider this.

  • If GetString is called and therefore gets a slice of memory on the stack,

  • and I type in Maria's name, M-A-R-I-A and then that gets a secret \0.

  • Where is the M-A-R-I-A and \0 stored?

  • Can it be, by this definition, stored on the stack?

  • Why?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Exactly.

  • So if we allocated space on the stack, which we could do with an array,

  • and then return the address of that string,

  • that would be valid in some sense, because you

  • would have allocated the memory.

  • And I'll go ahead and draw it like this.

  • If we had a function, GetString being called and GetString's

  • being called by main, that might mean that GetString has this chunk of memory

  • here.

  • And if Maria or I go ahead and type in her name,

  • that's like allocating space for M-A-R-I-A \0.

  • And you can think of this as just being a whole bunch of bytes in that frame.

  • So they do exist and they literally are stored in memory.

  • And I could return the address of this first byte, whatever this is.

  • Maybe this is byte 10 or 100 or whatnot.

  • And I could return that address to main as GetString's return value.

  • But as soon as I do that, yeah, exactly.

  • The memory doesn't technically go anywhere.

  • But it's no longer trustworthy.

  • All of that is now garbage values.

  • So you might get lucky.

  • And if you try to print GetString's return value you might see Maria

  • but maybe briefly, because the next time you call a line of code

  • or somehow that memory's reused, Maria's name

  • might get overwritten with some other values

  • because her name becomes, by definition, a garbage value.

  • And you don't know when it's actually going to get reused.

  • So that's not safe.

  • So this is why the heap exists.

  • If you need to keep your memory around for a while, like GetString is supposed

  • to do, turns out you can allocate it just elsewhere that won't disappear

  • until you yourself free it.

  • And so that's when we introduced last time a couple of new functions,

  • malloc for memory allocation and it turns out

  • there's an opposite of it, free, which you'll

  • need to use for future problem sets dealing with memory management

  • in order to undo the allocation here.

  • Otherwise you end up having what's called a memory leak.

  • And the computer might slow down, run out of memory,

  • because you're not giving it back.

  • And as an aside, it turns out there's a couple of cousins

  • of malloc, calloc and realloc.

  • Calloc is kind of cool in that the C means clear.

  • So calloc is identical to malloc, but it zeros the entire chunk of memory

  • for you.

  • So if you just want to initialize to have no garbage values whatsoever,

  • you can use calloc instead of doing it yourself with a for loop or something

  • like that.

  • Realloc, we're going to see, is a more powerful function

  • that allows you to take a chunk of memory and somehow grow it.

  • But we'll see what that actually means in a moment.

  • But with this power comes great responsibility.

  • And we saw that things can go horribly wrong for Binky

  • when you misuse memory addresses.

  • And recall that we looked briefly at this program by way of Nick's video

  • from Stanford.

  • And let's see what these lines of code actually represent here.

  • So here and here I'm declaring two variables, X and Y,

  • that are going to store what, generally speaking?

  • Addresses of integers.

  • So that's all that's happening there.

  • This now, was a new line of code last time.

  • Where it's saying, call malloc, so allocate some amount of memory.

  • How much memory do you want?

  • Whatever the size of an int is.

  • Odds are it's going to be four, maybe eight, some power of two

  • or some multiple of two here, or a multiple of four.

  • So here we get back what?

  • A chunk of memory or specifically, its address and we

  • store that in X. Meanwhile this line says, go to that address

  • and put the special number 42 there.

  • This next line blindly says, go to the address in Y

  • and put the unlucky number 13 there.

  • But that's where Binky had an accident because what

  • was inside of Y at that moment in time?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Memory it wasn't allowed to touch.

  • And why?

  • Be more precise.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: So close.

  • That's going to explain the symptom ultimately but?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Exactly.

  • I didn't ask for any memory whatsoever.

  • So by default, even though this looks funky,

  • int *y just means give me space for an address.

  • So that means, give me four or eight bytes of memory in the computer's RAM.

  • But what is inside of a variable by default, have we claimed?

  • Just garbage values.

  • So there's going to be a number there, and numbers are addresses.

  • So it's going to look like there's an address there.

  • So it is technically correct to say, go there.

  • But it's like following a map where you have no idea where you're going.

  • You might sort of walk off the edge of wherever you are.

  • And that's when bad things happen.

  • And so the visualization that Nick put together with claymation was this.

  • If you have this and it turns out it doesn't

  • matter if the star is on the left or on the right here.

  • But we have conventionally put it over onto the right side,

  • next to the variable.

  • When you do int *X and int *Y, that's like saying,

  • give me a chunk of memory or clay for each of these variables.

  • And he just kind of circled the little arrowhead there on the string

  • because there's memory for it.

  • It's just not pointing anywhere specific.

  • It's a garbage value at this moment in time.

  • The next chapter in this story was this.

  • Allocate space for an int, drawn in white clay here.

  • And Nick, because of the assignment, said X, which again is a pointer,

  • is now going to point to that chunk of memory.

  • So it's no longer a garbage value.

  • It points somewhere specific.

  • That is why Nick was then able to say, go there.

  • Follow the arrow, and put the number 42 there.

  • But the next line of code, this one went horribly wrong

  • because Y was not pointing anywhere.

  • Nick tried to say, go there and put 13 but there is nowhere so

  • the computer crashes.

  • A segmentation fault, meaning that you touched a segment of memory

  • that you should not have because you tried to go somewhere

  • where it was just some garbage value.

  • And the fix, recall, might be this, or a solution.

  • If we instead kind of rewind and fix Binky.

  • And say Y equals X, that's not allocating extra space.

  • That's just saying, have Y point at the same chunk of memory as X.

  • Because again, X and Y are just addresses.

  • So if the address is 100 in memory, now X is 100, Y is 100.

  • They're both pointing at the same chunk of white clay.

  • So if he then did *Y, gets 13, that says, go there.

  • Update the number.

  • And now 42 became 13.

  • Very similar in spirit, in fact, to our capitalization example,

  • when we pointed to strings last time, at the same chunk of memory.

  • So any questions on the stack or as depicted here, by Binky, the heap?

  • Malloc allocates memory from the heap.

  • But anytime you declare local variables or arrays

  • inside of a function, that ends up on the stack.

  • And thus far malloc is the only tool, or calloc or realloc,

  • that gives us access to this new portion of memory depicted in white clay

  • here and sort of depicted in our diagram up above.

  • Any questions?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Good question.

  • So keep in mind that X and Y are both the same data types.

  • They're both pointers, or addresses.

  • So as such, if you're going to set one equal to the other,

  • you have to just store the value that's in one inside of the other.

  • Otherwise it would be trying to put an integer

  • inside of a pointer, which isn't quite correct even though they're

  • technically both numbers.

  • So this is just saying, whatever address is in X,

  • put that same address in Y. It says nothing about going to that address.

  • It's like making a photocopy of a map but not actually following

  • that map yet.

  • Until this line, which then says, go follow the copy of the map.

  • And it turns out it leads you to the same location.

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: So if you return, in C you can only return one value.

  • So you're kind of in a bad spot because you ideally want to return two values,

  • right?

  • Both A and B you want to return B and A. There

  • are ways you can work around that and kind of sort of return two values

  • in C.

  • But in short, it's much easier said than done.

  • Python will actually make that much easier.

  • But the short of it in C is you can't return multiple values.

  • And that ties our hands in this case.

  • OK, so odds are, related to all of this, you've

  • heard about this website, which is enormously useful when you're

  • trying to learn some new language, when you're out in the real world

  • trying to solve some problem because this

  • is this wonderful community of people who post questions and answers

  • and ideally link to canonical references so you can kind of understand

  • why some answer is the way it is.

  • But this actually has a very technical meaning, Stack Overflow.

  • And stack, of course, is now a familiar concept.

  • And you can imagine something like this picture eventually

  • going very wrong if you call so many functions that you just kind of run out

  • of memory.

  • Or not even run out per se.

  • What are you going to hit eventually if you keep calling function

  • after function after function?

  • The heap.

  • And then bad things are going to happen.

  • If your stack frames and your local variables

  • are so numerous that you start overwriting the heap memory,

  • now values that you allocated with malloc might themselves get clobbered.

  • So this is not the best design decision but it is the reality.

  • And it's mitigated only by using a compiler that

  • might help you notice this or by actually using

  • more memory than you actually need.

  • Now, Stack Overflow actually has a very technical meaning here,

  • as does heap overflow.

  • Stack Overflow means you overflow the stack.

  • You just call so many darn functions that you just touch memory you

  • shouldn't.

  • Heap overflow would be the opposite.

  • You keep calling malloc and malloc and malloc

  • and the heap overflows the stack.

  • Because for better or for worse, the stack is growing this way

  • and the heap is growing this way.

  • And eventually they're going to strike each other.

  • So this is a more general way of saying buffer overflows.

  • A buffer is just a chunk of memory that stores data or values.

  • We people in the real world might have heard

  • of this in the context of video, YouTube, or various video players.

  • If you've ever seen the expression buffering, dot dot dot.

  • It's the most infuriating thing.

  • Something's about to happen in the movie or show you're watching.

  • And the damn thing starts buffering, buffering, buffering.

  • What does that mean?

  • Well, the video player, YouTube or something else,

  • has a chunk of memory, which you can think of as an array.

  • And loaded into that array are the zeros and ones

  • that compose the movie or TV show you're watching.

  • And those were downloaded over the internet.

  • And what happens is, hopefully your internet speed is faster

  • than the movie's own playback.

  • So that even though you might be at minute 10 in the video,

  • hopefully your computer has downloaded minute 10 through 11

  • so that you have this built up buffer of bytes

  • that you have a whole minute where you can watch,

  • even if your internet connection goes out.

  • But when your video is buffering, it means you have this array of memory

  • and you've kind of looked at or watched all of the bytes in it.

  • And the buffer is now empty.

  • But the opposite can happen, too.

  • If you try downloading more bytes than you have memory for,

  • you might try putting minutes of the video someplace they

  • don't belong in your computer.

  • Or if you call too many functions or if you call malloc too many times,

  • you might overflow the chunk of memory that's been allocated.

  • So buffers are all over the place.

  • And indeed, a string as we know it is just a buffer.

  • It's an array of memory and hopefully you

  • will only put as many characters in that string

  • as can fit in that chunk of memory.

  • So what kinds of things can go wrong?

  • This is a bit of a contrived example, but it

  • comes with a couple of visuals just to paint

  • a picture of how adversaries can hack into systems

  • that are written in languages like C. So here's a quick program.

  • We're going to include string.h.

  • And down here we have int main that takes command line arguments.

  • Notice this function does not do any error checking at all.

  • It's pretty stupid.

  • It just calls a function foo and passes in argv[1].

  • So the idea here is that this is a program

  • that if you take a command line argument, a word after the program's

  • name, just gets blindly passed into the foo function.

  • OK.

  • So next, what does foo do?

  • It accepts as input a string, a.k.a. char* and we're just calling it bar.

  • And then it allocates an array on the stack called C of size 12.

  • And then even if you've never seen this function before,

  • you can maybe kind of infer from its name, memcpy, like memory copy.

  • So it turns out this is going to copy into this memory whatever is

  • in this memory up to this many bytes.

  • So if I type in Maria as the command line argument, M-A-R-I-A is five.

  • So that means the length I typed in is going to be five.

  • And this is going to copy five bytes from bar into C. That's it.

  • Now it's meant to be just demonstrative.

  • This program is pretty useless at the end of the day.

  • But it's kind of distilled a threat into the fewest lines of code.

  • What does this actually look like or what's happening?

  • We've called the function.

  • We've allocated 12 bytes.

  • We've copied those five bytes into those 12 bytes.

  • So all is well in this story.

  • But what actually happens in memory?

  • So here's a picture of the stack kind of zoomed in and nicely colorized.

  • So stack is going this way.

  • Heap is growing this way.

  • And this is just showing you technically how things are laid out on the stack.

  • I keep kind of simplifying the world by just drawing things

  • as X and Y and A and B. But they actually follow a precise order.

  • So specifically, if we have a local variable

  • called bar, which we did for this function,

  • it goes right there at the bottom.

  • If you then declare an array called C, it goes right up there on top.

  • And these are sized proportionally.

  • This is four bytes.

  • This is going to be 12 bytes.

  • So it all is kind of proportional size.

  • And then it turns out, and we won't go into too much detail,

  • but if you like this stuff, CS 61 and other classes will explore it,

  • it turns out another thing that has always been tucked away on the stack

  • secretly is what's called a return address.

  • So when main calls swap, it's like handing the keys to the car

  • off to someone else.

  • Like swap, go do your thing.

  • But main kind of has to tell swap or any function it calls,

  • how to get back to its chunk of memory so that execution can resume with main

  • as soon as swap is done executing.

  • And it's not its stack memory, per se.

  • Recall that top portion of memory that I described as the text segment?

  • All the zeros and ones that compose your program?

  • It turns out that main, when it calls swap or some other function,

  • it tucks its own return address, the address of the appropriate zeros

  • and ones in that text segment, into four bytes here, or maybe eight bytes,

  • the address to which swap should hand the keys back to you, so to speak.

  • Otherwise it's like main handing the keys off to another function

  • and then it never hears from it again so main's other lines of code

  • never get executed.

  • So long story short, there needs to be an address or a map tucked away

  • on the stack so that swap can hand control back to main.

  • But what happens here when you actually use this memory?

  • Well, it turns out that if we just number the bytes on the stack, and that

  • was a size 12, the first one is zero, and the last one is 11.

  • So zero through 11 gives us 12 total.

  • So if we type in something like Maria or maybe more generally, hello, H-E-L-L-O,

  • which is the same length, that's using six bytes technically,

  • because of the \0, and all is well.

  • Fits comfortably in C and we've got room to breathe.

  • But what if we don't type in Maria or hello?

  • What if we type in a very long sentence that's more than 12 characters?

  • Where are they going to end up?

  • If you type in a longer string at the command line in argv[1],

  • notice the code is flawed.

  • You're going to check the length of the word that the user typed in,

  • copy all of its bytes from bar into C. But what

  • if the length of the string you typed in is 13?

  • What are you going to do?

  • You're going to copy 13 bytes from bar into C,

  • thereby filling all of these 12 bytes plus one

  • more that you shouldn't be touching.

  • And if the string is even longer than 13,

  • if the adversary really typed a long sentence or phrase or word or whatnot,

  • you're going to really exceed the boundaries

  • of that buffer or that array.
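
The function described above, and one way to bound the copy, might look roughly like this. It is a sketch under assumptions: the exact code on the slide is not reproduced in the transcript, `const char *` is used here where the lecture says `char *`, and `foo_checked` with its truncate-to-fit behavior is a hypothetical variant, not something the lecture presents.

```c
#include <string.h>

#define BUF_SIZE 12

// The unchecked version as described: it trusts its input completely.
void foo(const char *bar)
{
    char c[BUF_SIZE];
    memcpy(c, bar, strlen(bar)); // writes past c whenever strlen(bar) > 12
}

// Hypothetical checked variant: copies at most BUF_SIZE bytes
// and returns how many bytes it actually copied.
size_t foo_checked(const char *bar)
{
    char c[BUF_SIZE];
    size_t n = strlen(bar);
    if (n > sizeof(c))
    {
        n = sizeof(c); // truncate instead of overflowing the array
    }
    memcpy(c, bar, n);
    return n;
}
```

With `foo_checked`, a 5-character argument copies 5 bytes and a 20-character argument copies only 12, so nothing beyond the array is touched. Rejecting over-long input outright would be an equally reasonable design.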

  • So what does this look like? Well, if you type in a much longer word,

  • like A-A-A-A-A-A-A, you could end up overwriting these 12 bytes,

  • also these four bytes, also these green bytes, whatever they are,

  • and most importantly, even the red bytes that I described as the return address.

  • Now A-A-A-A-A really isn't going to cause anyone any trouble

  • because it's just a sequence of random ASCII characters.

  • But characters at the end of the day are just numbers,

  • and numbers are just bits, and programs are just bits.

  • So the threat here is that if you're a pretty sophisticated adversary, someone

  • who really knows programming, you could technically

  • write a program that does something really bad like delete

  • all the files on a hard drive.

  • Or send spam to everyone in your contacts.

  • Or anything like that because at the end of the day

  • the program that he or she has just written is just zeros and ones.

  • If you then convert that program's zeros and ones to the corresponding,

  • even if weird-looking ASCII values, you could technically type a program

  • at the command line in argv[1] just by typing out the funky characters

  • on the keys that are not going to make sense to a human reading it.

  • But those ASCII characters in the context of a program

  • are going to be interpreted as code.

  • And if you're really good, and frankly, it's

  • not so much that you're really good.

  • If by a lot of trial and error, you happen

  • to overwrite the return address in a program,

  • you can trick the computer into not returning

  • back to main, but to jumping to the very input you passed into the program.

  • So A here implies attack, like attack code.

  • So if you're really clever, you can pass in an appropriate pattern of zeros

  • and ones, convert it to ASCII so the human can type them in at the prompt,

  • overwrite this return address, and trick the computer program

  • to return from this function not to main, but to like, this byte up here.

  • And maybe this byte coupled with all of these others

  • means delete all the files on this user's hard drive,

  • send spam to everyone in their contacts.

  • This is called a buffer overflow exploit,

  • and it's incredibly shockingly common even these days.

  • C is not commonly used for a lot of programs but it still is everywhere.

  • And there's other languages, too, like C++ that lend themselves to this.

  • And even though this is still a little arcane

  • and you don't need to worry too much about the addresses on the screen,

  • the fundamental threat here is that if you do not

  • check the boundaries of your arrays and the amount of memory you've allocated

  • and you touch memory you should not, very bad things can happen.

  • You're effectively giving control to anyone

  • on the internet who can use your program because he or she can be clever enough

  • to inject their own zeros and ones into your program for execution.

  • OK.

  • So dear God, this is scary in a computer science sense.

  • So what can we do to defend against this beyond just not writing bugs,

  • which is never going to happen, right?

  • Even the most advanced, best programmers still

  • make bugs, especially as the software gets more and more complicated.

  • We have eprintf and we have help50 and we have debug50

  • but there's other tools, too, like Valgrind

  • which happens to be a tool for detecting memory leaks in a program

  • and other memory-related issues.

  • So let me actually go ahead and open this program, memory.c.

  • And it looks like this.

  • And let's see if we can't tease apart what is buggy about this program.

  • So here's the program here.

  • So, include stdlib.h, function f, function main, main calls f

  • and returns 0.

  • Nothing really interesting going on there.

  • So what's in f?

  • F on the right hand side allocates space for 10 integers, I think.

  • Malloc returns the address of that chunk of memory and stores it in X

  • and then line 8 is the bug, I think.

  • What's wrong with line 8?

  • Let me go here first.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Exactly, because we start counting at zero

  • and because we're asking the computer for space for 10 ints,

  • we get them back.

  • But that's going to be accessible via [0] through, as you say, [9].

  • So [10] is like writing one int past the buffer, a.k.a.

  • A buffer overflow.

  • Now it might be hard to see this, especially when the program isn't

  • as relatively short as this but is buried in a dozen lines, 100

  • lines, thousands of lines of code.

  • But tools can help.

  • So if I go ahead and make this program with make memory.

  • And then I go ahead and execute valgrind ./memory enter,

  • the one downside of this program is that its output is just completely

  • overwhelming.

  • But let's see if we can't tease apart some recognizable terms.

  • So all of this on the left is a bit of a distraction.

  • This is just copyright information.

  • So the interesting stuff seems to happen here.

  • I see invalid write of size 4.

  • Not quite sure what all of this means, but I

  • do see that it's somehow related to uh-oh, line 8 of memory.c line

  • 13, line eight, line 13, a lot of bugs in the same places, it seems.

  • And then here, address such and such is zero bytes after a block of size

  • 40 alloc'd, it's a little hard to wrap my mind around this.

  • So as always, let's at least initially just run this through help50

  • and see if it can help tease this apart.

  • So we see the same output.

  • It recognizes something familiar in yellow here.

  • Invalid write of size four and it highlights the lines.

  • And our TF-like feedback is, looks like you're

  • trying to modify four bytes of memory that isn't yours, question mark?

  • Did you try to store something beyond the bounds of an array?

  • Take a closer look at line 8 of memory.c.

  • So that's kind of a mouthful but it's just

  • because we have practiced reading stuff that's pretty arcane like this.

  • So we've extracted all of the salient details like line eight of memory.c.

  • So line eight of memory.c, as we noted, is already the dangerous line.

  • So what might it mean to have an invalid write of size four?

  • Well, it turns out an int in the IDE is how many bytes?

  • Four, or 32 bits.

  • So invalid write of size four just means that this int here, zero, is an int,

  • it's four bytes, it's just invalidly being written, as you say,

  • to the wrong location.

  • So this is Valgrind's pretty terse way of communicating that idea.

  • So here we have then an explanation.

  • So how do I fix this?

  • Well, if my intent was just to update the last area there,

  • let me go ahead and do make memory enter, valgrind ./memory enter.

  • And now this is a good thing except we've made some progress.

  • Let me scroll up to fit more on the screen.

  • So I got rid of that message, invalid write of size four,

  • but this does not sound good either.

  • 40 bytes in one block are definitely lost in loss record one of one,

  • all right?

  • So I need a little help with that.

  • So let me do help50 again until I get familiar with the syntax.

  • And it's highlighted that chunk of output.

  • Looks like your program leaked 40 bytes of memory.

  • Did you forget to free memory that you allocated via malloc?

  • Take a closer look at line seven of memory.c.

  • So let's do exactly that.

  • So line 7 of memory.c, OK, here's where I malloc the memory.

  • And per help50's own feedback, what have I apparently not done?

  • Freed it.

  • And it turns out freeing is actually pretty straightforward

  • so long as you remember to do it.

  • You just call free, passing in the same pointer.

  • You don't have to remember how long it is.

  • It's up to the operating system to remember how long it is.

  • But now if I do make memory.

  • And now I do again valgrind ./memory enter, heap summary,

  • all heap blocks were freed.

  • No leaks are possible.

  • I see nothing particularly worrisome.

  • And the program is bug free now.
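
Putting both fixes together, the corrected function might look like the sketch below. The transcript does not reproduce memory.c itself, so this is a reconstruction from the description; the `int` return value and the NULL check are additions for illustration.

```c
#include <stdlib.h>

// Reconstruction of f with both fixes applied.
// Returns 0 on success, 1 if malloc fails (the return value is an addition).
int f(void)
{
    int *x = malloc(10 * sizeof(int)); // room for x[0] through x[9]
    if (x == NULL)
    {
        return 1;
    }
    x[9] = 0; // last valid element; x[10] = 0 was the invalid write of size 4
    free(x);  // without this, the 40 bytes are reported as definitely lost
    return 0;
}
```

A `main` that simply calls `f()` and returns 0, as described earlier, completes the program.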

  • So Valgrind is another tool in your toolkit

  • that doesn't help you find logical bugs per se.

  • It helps you find memory-related errors, which might be logical bugs.

  • But it helps you hone in on them and see them in a way that you as a human

  • might not otherwise, especially if it's buried in many, many lines of code.

  • Now you'll notice, too, real briefly, in Valgrind's output

  • in these several examples, there are all of these funky numbers.

  • So if I go back to the original version here just a moment ago,

  • where it was in fact buggy in a couple of ways.

  • And I rerun make and I rerun Valgrind, you'll

  • see a whole bunch of things like this.

  • At 0x such and such, at 0x such and such, 0x, what did 0x denote last time?

  • So hexadecimal, so this is just a succinct

  • if weird-looking way of representing numbers, generally memory addresses.

  • And so this very specifically is saying that line 13 of memory.c

  • happens to be using memory at this location.

  • It's not particularly useful to us the programmers.

  • But that's why you see it.

  • And Valgrind is arguably a more advanced tool,

  • which is to say that memory addresses in tools like this and even in debuggers

  • tend to be written using hexadecimal notation like that.

  • Of course, you've seen hexadecimal converted.

  • Like these are the first three bytes in a JPEG,

  • which are typically thought of using hexadecimal like this.

  • But even though this looks new, it's the exact same idea.
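
As a small illustration of hexadecimal in code, the sketch below checks a buffer for the three bytes that begin a JPEG, 0xff 0xd8 0xff (255, 216, 255 in decimal). The function name `looks_like_jpeg` is a hypothetical helper, not something from the lecture or the problem set.

```c
#include <stdbool.h>
#include <stddef.h>

// Returns true if the buffer starts with the JPEG signature bytes.
// 0xff, 0xd8, 0xff are the same numbers as 255, 216, 255, written in base 16.
bool looks_like_jpeg(const unsigned char *bytes, size_t length)
{
    return length >= 3 &&
           bytes[0] == 0xff &&
           bytes[1] == 0xd8 &&
           bytes[2] == 0xff;
}
```

Writing the constants in hex or in decimal makes no difference to the compiler; hex is just the conventional notation for raw bytes and addresses.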

  • And I thought I'd tease perhaps with a joke

  • that only a computer scientist can understand.

  • OK, so that's a good one that goes around each year.

  • So that of course is alluding to just these addresses.

  • And now let me propose one other debugging technique and explain like,

  • what the hell is going on here on stage today, too.

  • So you have of course debug50, which is a tool for debugging and walking

  • through code.

  • And silly though this is, there is actually

  • this thing in the world of programming, rubber duck debugging.

  • This is, in the absence of having a TF or a CA to bounce ideas off of,

  • this is in the absence of having a roommate around, or a roommate around

  • who wants to talk to you about your code.

  • It's recommended that if you have some bug in your program,

  • that you keep something like this on your desk.

  • And in the absence of roommates and friends

  • and hopefully doors closed, you talk to the rubber duck.

  • And I feel silly even saying this but there's a Wikipedia article

  • on this. It's a real thing.

  • The idea here is that if you've ever been in office hours

  • or you've been chatting with a TF or friend

  • and just like talking about your code and talking about what

  • it is you think your code is doing, just very often that act

  • of saying something and hearing yourself say it can help reinforce one,

  • what your code is in fact doing.

  • Or if you realize verbally, wait a minute.

  • What I just said does not seem to line up with the code,

  • finally that light bulb goes off.

  • And it doesn't have to be a duck.

  • I mean, you can talk to the wall but that's a little stranger.

  • So at least this is a personification of having someone like a colleague

  • to talk to.

  • So at the end of today or during break by all means,

  • grab yourself a rubber duck the debugger and keep it on your shelf.

  • It doesn't have to quite be this large.

  • But this is a genuine debugging technique.

  • Like, in the absence of understanding something,

  • don't necessarily turn only to CS50 discourse or to office hours

  • or to sections or the like.

  • Literally try talking yourself through it,

  • even if it feels a little bit silly.

  • And if it does really feel silly, just look at him

  • and talk to yourself in your head perhaps.

  • But that kind of enunciation of what your code is doing or should be doing

  • will hopefully help all the more light bulbs go off.

  • And eventually you can just keep them on your shelf

  • and take off that training wheel as well.

  • Let's go ahead and take our five minute break here.

  • Grab a duck if you'd like and we'll come back with more.

  • All right.

  • So we're back, and we keep thinking about memory

  • as this, generally laid out as having addresses,

  • but of course we've clarified that a little bit in that we have

  • more of a canvas at our disposal now.

  • But even then we keep talking about having things back

  • to back to back in memory.

  • But that simply needn't be the case.

  • Like, what we have now with pointers and with malloc

  • and these kinds of functions is the ability to get memory from anywhere

  • we want and somehow stitch it together or connect these things.

  • But how do we actually do that with the ingredients we now have?

  • And why might we want to?

  • So here is how we keep representing something like an array.

  • An array, again, is just a contiguous chunk of memory

  • where you store things literally back to back to back.

  • But suppose that I've put six things into this array, six numbers, one, two,

  • three, four, five, six.

  • What happens if I try to put seven into this array?

  • What do I have to think about or worry about?

  • Touching memory that I'm not allowed to touch.

  • So I'd better not put it over here.

  • But what if seven must go in this array?

  • Well, I don't have too many options.

  • Like, if I fill the space I have to either overwrite some value or put it

  • somewhere it shouldn't be, and that should never

  • be an option because the program could or will crash.

  • And so I could alternatively just allocate more memory.

  • So how do I do that?

  • Well, if I've allocated this array initially to be a size six,

  • I could, in code, just allocate a new array of size seven and then do what?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, add all the numbers and seven.

  • So I can take this array, I can give myself another one elsewhere in memory,

  • copy the values from old to new, then maybe free the old.

  • And then move on with my life because now I have enough memory.
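
The allocate, copy, free sequence just described can be sketched as one function. The name `append_by_copy` is hypothetical, and the sketch assumes the old array came from `malloc`.

```c
#include <stdlib.h>
#include <string.h>

// Returns a new array of n + 1 ints holding the n old values plus value,
// and frees the old array. On failure, returns NULL and leaves old intact.
int *append_by_copy(int *old, size_t n, int value)
{
    int *bigger = malloc((n + 1) * sizeof(int));
    if (bigger == NULL)
    {
        return NULL;
    }
    memcpy(bigger, old, n * sizeof(int)); // copy every existing value
    bigger[n] = value;                    // the new value goes in the last slot
    free(old);                            // the old array is no longer needed
    return bigger;
}
```

Until `free(old)` runs, both arrays exist at once, and the `memcpy` touches all n existing values even though only one value is being added. The standard library's `realloc` wraps up the same idea in a single call.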

  • Now, that fixes the problem.

  • And if we implemented it in code correctly,

  • it would be by nature correct, assuming there's enough memory in the computer.

  • But why is that arguably bad design?

  • AUDIENCE: It's a waste of space.

  • DAVID MALAN: It's a waste of space how?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, even though I don't keep both around in the story

  • I'm telling, it's temporarily pretty inefficient in that I'm

  • using twice as much memory as I actually need only to then kind of downscale.

  • What else?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, if I want to change the size of the array again,

  • whether bigger or even smaller perhaps, if I remove items from the list, then

  • I just have to keep allocating new memory, which is wasteful

  • and more importantly, it's not just space inefficient.

  • In what other sense is it inefficient?

  • Time, why?

  • Where is the time coming from?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: So that's an asymptotic notation.

  • Copying something from one array to another would be in big O of what?

  • Big O of size, yeah.

  • So if we just genericize size as n it's like big O of N.

  • It's a linear time operation, which isn't horrible, right?

  • N squared is bad.

  • Linear has never really been bad but we already know that log n is better,

  • constant time is best.

  • And so just wasting any amount of time doesn't feel like optimal design.

  • And that all is a function of arrays being

  • a fixed size and contiguous in memory, back to back to back.

  • In fact, arguably there's one other issue that could occur.

  • It's not so much if you have a very small array.

  • But what if you have a huge amount of memory available

  • but it's only in size five or six increments?

  • Like, for whatever reasons your computer's

  • been using some of this memory, this memory, this memory, this memory,

  • and if you add up all the available memory there's a lot of free space

  • but they're always separated by memory that's in use.

  • So maybe this memory is free, then there's a byte that's in use.

  • This memory is free, there's a byte in use.

  • So your memory is quote unquote very fragmented.

  • So you have lots of available memory but it's not contiguous.

  • You cannot, in this model, allocate an array of size seven if you don't have

  • that memory available contiguously.

  • So not as big of a concern given enough memory,

  • but at least something that could arise.

  • So let's introduce the solution.

  • Something here called a linked list.

  • And the name kind of describes what it is.

  • It's still a list of numbers but it's linked by way of these arrows.

  • And we've used arrows before.

  • What have we used arrows to represent in pictures past?

  • Yeah, so pointers.

  • So now that we have the expressiveness of pointers, you can kind of digitally

  • stitch your data structures together if you spend a little bit more memory.

  • So we've not really solved the problem you identified,

  • which was the space use.

  • But if you're tolerant of that and if you've

  • got enough memory at your disposal and can

  • afford to spend it, why don't we store for every number

  • not just the number but also space for a pointer?

  • So each of the boxes I've drawn here now doesn't just

  • have a box for the number itself n.

  • It's got really two boxes together, one for n and one for something

  • we'll call next, which is going to be a pointer to

  • or equivalently the address of the next node, as we'll call it,

  • the next box in this list.
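
A node with those two boxes can be sketched as a struct. The field names `n` and `next` follow the description above; the helper `make_node` is an assumption added for illustration.

```c
#include <stdlib.h>

// One box in the list: a number plus the address of the next box.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Allocates a node on the heap holding n, with no successor yet.
// Returns NULL if malloc fails.
node *make_node(int n)
{
    node *p = malloc(sizeof(node));
    if (p != NULL)
    {
        p->n = n;
        p->next = NULL; // NULL marks the end of the list
    }
    return p;
}
```

Linking two nodes is then a single assignment such as `a->next = b;`, wherever in memory the two happen to live.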

  • Now even though we've drawn it here very prettily from left to right,

  • technically these boxes could be anywhere in memory,

  • specifically in the heap, we're going to see.

  • But they don't have to be back to back.

  • And so the fact that there are these gaps in between the nodes

  • deliberately paints a picture that these things don't have

  • to be back to back to back any more.

  • They can be anywhere.

  • And now suppose I've got these five numbers, nine through 34.

  • Suppose I want to add another number.

  • Where do I put it?

  • I don't seem to have room.

  • But based on this picture, how might you infer we're going to engineer this?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Yeah, so why don't we just allocate space for another

  • and it's not going to fit on the board here but who cares?

  • We can put it here.

  • Put the new number in it and then just add a line, an arrow from it to that.

  • And so this then is going to be what we call a linked list

  • and it gives us dynamism.

  • It gives us the ability to grow or shrink our data structure, addressing

  • your concern, not necessarily yours.

  • It's still a little space wasteful, but we gain benefits for both of

  • your concerns: time is a lot better now because we don't

  • have to waste time copying the values.

  • We just add to the values.

  • And if we want to grow or subtract it we only do as much work

  • as we're trying to add or subtract.

  • We don't have to worry about fixing everything.

  • But there is some complexity here.

  • And given that we have a whole bunch of these,

  • which would make problem set five a little easier,

  • if you haven't quite finished who done it.

  • Could we get for just one demo today six volunteers?

  • Six volunteers?

  • All right.

  • Come on down, right in front here.

  • All right.

  • Well actually, come on up.

  • Come on up and have a two and three over there, four.

  • No one over here today.

  • OK, OK, five.

  • OK, six.

  • OK, six, six.

  • Come on up.

  • All right.

  • We'll save this till the very end.

  • But let me give you guys in the meantime, these numbers.

  • And if you don't mind holding the numbers

  • out I want you to go over to the left there

  • and just order yourselves just as the picture on the screen. OK, you'll be 17.

  • Let's see, let's see, nine.

  • I might have given out too many numbers.

  • OK, let me free that and give you nine instead, if I may.

  • And give you, let's see.

  • You have 17?

  • OK, so 17.

  • And yeah, I'll have you go ahead and flip yourselves if you don't mind.

  • 22, and then 26.

  • And what do we got?

  • 20, 34.

  • 34.

  • OK, and you guys will be slightly special.

  • So who wants to be literally first?

  • OK, so here first.

  • And who wants to be temporary?

  • OK, you'll be temporary, all right.

  • Come on over here.

  • OK, come on over here and if you guys could step a little closer.

  • So we have 9, 17, 22, 26, 34, and give yourselves like a foot in between.

  • And if you guys could use your left arms to represent the pointers to visualize

  • who is linked to whom.

  • OK, and why don't you just point, yes, very deliberately down.

  • So what's your name?

  • NAZLI: Nazli.

  • DAVID MALAN: Nazli.

  • So Nazli's left hand will be a null pointer.

  • It's not pointing at anyone.

  • So literally just point it down to the ground, like ground electrically.

  • OK.

  • So now we just have to have some first node.

  • So what's your name?

  • OLIVIA: Olivia.

  • DAVID MALAN: Olivia is a little special here in that her paper has a word

  • and it's not just a number.

  • She represents a distinct variable called first.

  • Because that one catch with the linked list

  • is that you don't remember it by way of the address of a contiguous

  • chunk of memory.

  • You remember a linked list by way of the address of the first node in the linked

  • list.

  • And what's your name?

  • ACHMED: Achmed.

  • DAVID MALAN: Achmed.

  • So Achmed here happens to be the very first node in the list right now.

  • So Olivia's left arm is going to be pointing

  • to Achmed to represent that he is the first node in the list.

  • OK, and what's your name?

  • JESS: Jess.

  • DAVID MALAN: Jess.

  • We're going to use Jess in just a moment to complete some operations.

  • So suppose that we actually want to insert some value into this list,

  • like the number 55.

  • All right, so the number 55 is going to require a little bit of cleverness

  • here.

  • And so I need some place to store this.

  • I need to malloc.

  • So OK, you've been volunteered.

  • What's your name?

  • STELLA: Stella.

  • DAVID MALAN: Stella, come on up.

  • So malloc Stella.

  • And we will store the number 55 in Stella's node.

  • And right now if you could just kind of point your left hand anywhere.

  • It's kind of a garbage value.

  • OK, thank you.

  • And now what's your name again?

  • JESS: Jess.

  • DAVID MALAN: Jess.

  • OK, so Jess is going to help us find the right space here.

  • So we can obviously see where 55 belongs if we're keeping this sorted.

  • But again, computers don't have that luxury.

  • Moreover, we no longer have random access.

  • We can't just jump to [0], [1], [2] because there are these gaps between

  • them.

  • And just to make this more clear, can every other one of you

  • step forward or back so that it just looks a little weird?

  • So you can no longer index into this data structure

  • because again, it's not an array.

  • It's not back to back.

  • These things could be anywhere in memory and it's only the pointers

  • that are linking everything together.

  • So Jess now is going to initially point at the very same thing

  • that Olivia is pointing at, the same address or Achmed.

  • All right, so now we have a bit of redundancy.

  • But suppose we want to insert 55.

  • What kind of logic, what's the pseudocode here for Jess

  • to find the location for Stella?

  • What should Jess do?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: OK, keeps going until she finds the null pointer,

  • or more specifically, till she finds the null pointer or the right location

  • for this number if we want to keep it sorted.

  • So let's do that.

  • So you're pointing at Achmed.

  • The number nine is not greater than 55.

  • So Stella belongs after Achmed.

  • So what are you going to do?

  • Good, you're going to point at Maria.

  • So you know to point at Maria why?

  • JESS: Because nine is less than 55.

  • DAVID MALAN: Nine is less than 55 but also, Achmed isn't just storing nine.

  • Right, he has this next pointer that's telling

  • Jess where the next value is to look.

  • So his left hand is the substitute for what would have been just ++ in the world

  • of an array.

  • So you go ahead and physically walk and let's just walk through this.

  • So now we're pointing here at 17.

  • It's not greater than.

  • We point next at Arunev.

  • OK, that's a 22.

  • Still not the right value.

  • We keep going.

  • What's your name?

  • Jeung Wan?

  • OK, that's not the right number because he's holding 26.

  • And now we catch you again?

  • Nazli.

  • Still no good and now go ahead and follow her left hand.

  • OK, so now we know that this has got to be the right space because we haven't

  • found numerically the right space.

  • So if we could borrow you, Stella, all the way over here.

  • Well, you're not technically physically moving in memory

  • but this will just make the story better.

  • OK, so yes, we're re-alloc-ing sort of.

  • So what are you going to do now with Stella

  • now that you found the right location?

  • Leave her here, OK.

  • But she's just kind of orphaned now.

  • She's pointing at nowhere and no one's pointing at her.

  • It's kind of sad.

  • And this is actually perfect.

  • Memory leak.

  • OK, so let's fix.

  • Who has to point at whom?

  • OK, good.

  • And now what should Stella point at?

  • Since now she is the end of the list and she's just

  • pointing to some garbage value.

  • And she's pointing, to be clear, at some garbage value because when you call

  • malloc you just get garbage values.

  • We overrode one of those garbage values with 55 for n,

  • but the pointer has not yet been overwritten.

  • So what do you want to do, Jess?

  • To whom?

  • That's OK.

  • It's close.

  • What should her value be if there's no one to her left?

  • Should be null.

  • OK, and how did we represent null before?

  • Yeah, exactly.

  • Null, so now we have a list and now just to fix things, your pointer,

  • so Jess is kind of temporary.

  • We don't really care what her value is.

  • But who's important over here?

  • What's your name again?

  • Olivia was first and now do we have a list that is still linked?

  • We do.

  • And now, it of course took a little while to walk through this.

  • And frankly I kind of told a lie.

  • I haven't really made this faster because what

  • was the running time of this algorithm?

  • It was still big O of n.

  • But that's because what?

  • I was trying to maintain what property?

  • Sorted.

  • So suppose I relax that constraint.

  • Suppose that I didn't care about being sorted order.

  • Can I do better than big O of n in order to have inserted Stella?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: OK, constant time, where can I put her then?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Exactly.

  • So if we don't care about sorted order, I

  • could have saved myself a huge amount of time

  • and we could have inserted Stella here, updated Olivia's hands, and updated

  • Stella's hands to point at Achmed.

  • And then we're done.

  • Constant time.

  • And here's an example of why big O of one doesn't mean one step.

  • That's constant time because it's like moving Olivia's hand and Stella's hand

  • but not Achmed.

  • So that's at least two steps but it's still always two steps.
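
The constant-time insertion just described can be sketched in C as follows. This is a minimal sketch rather than code shown in the lecture: the function name `prepend` and the variable `new_node` are hypothetical, while the field names `n` and `next` follow the way the lecture refers to a node's number and its left hand.

```c
#include <stdlib.h>

// A node like each volunteer: a number plus a "left hand" pointing at the next node
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Insert number at the front of the list in constant time.
// first is the address of the list's head pointer (Olivia in the demo).
// Returns 0 on success, 1 if malloc fails.
int prepend(node **first, int number)
{
    // "malloc Stella"
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return 1;
    }
    new_node->n = number;

    // Step 1: the new node points at the old first node (Achmed)
    new_node->next = *first;

    // Step 2: the head pointer now points at the new node
    *first = new_node;
    return 0;
}
```

However long the list is, the function performs the same two pointer assignments, which is why it counts as constant time even though it takes more than one step. The order matters: if the head pointer were updated first, the address of the old first node would be lost.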

  • So if we could get a round of applause for our volunteers here?

  • [CHEERING] You can keep both your numbers and these if you'd like.

  • Thank you so much.

  • May this help you with P set five.

  • Oh, sorry.

  • All right, thank you.

  • So that's of course just one operation.

  • There could have been other numbers.

  • Like, if we were trying to insert five in sorted order,

  • we would have gotten our constant time or maybe our omega of one

  • because in the best case, the number might end up at the beginning.

  • 20, had we inserted it with our humans, might

  • have been a little more involved because we kind of have to walk the list

  • as we did with Jess.

  • But then she's kind of got to point at the person behind her

  • and the person in front of her because she has to update more hands.

  • Which is to say, that doing insertions or even deletions

  • requires a bit of re-stitching.

  • It's like kind of fixing clothes here if you're

  • trying to maintain a contiguous thread through all of these data structures.

  • But at the end of the day, even though it uses pointers,

  • it really just boils down to getting this logic right.
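
The re-stitching needed to keep a list sorted might look like the sketch below. The names `insert_sorted`, `prev`, and `new_node` are assumptions made for illustration; `ptr` plays the role of Jess walking the list, and `prev` remembers the node behind her so that its left hand can be updated.

```c
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Insert number so that the list stays in ascending order.
// first is the address of the list's head pointer.
// Returns 0 on success, 1 if malloc fails.
int insert_sorted(node **first, int number)
{
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return 1;
    }
    new_node->n = number;

    // Walk the list until reaching the end or a node holding a value
    // that is not smaller than number
    node *prev = NULL;
    node *ptr = *first;
    while (ptr != NULL && ptr->n < number)
    {
        prev = ptr;
        ptr = ptr->next;
    }

    // The new node points at whatever comes after it (possibly NULL)
    new_node->next = ptr;

    if (prev == NULL)
    {
        // Best case: the number belongs at the very beginning
        *first = new_node;
    }
    else
    {
        // Otherwise the node behind it must be re-pointed at the new node
        prev->next = new_node;
    }
    return 0;
}
```

Inserting 5 into a list that starts at 9 never enters the loop, inserting 20 stops in the middle and updates two hands, and inserting 55 walks all the way to the end, which is the linear-time worst case.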

  • And in fact, let me do an example here of some code of how we might do this.

  • I'm going to go ahead and open up list0.c

  • and take a look at how this here works.

  • So in list0.c we have the following code.

  • We first, in main, get a positive number from the user.

  • So I'm going to wave my hands at this because this is kind of old school now.

  • We've been using a do while loop to get a positive number from the user.

  • And I'm calling it capacity.

  • So this is an example of getting it in from the user, calling it capacity.

  • Capacity meaning the maximum possible size for a data structure.

  • That would be a term of art there.

  • Here now is where, per week two of the class,

  • I allocate an array of ints, this many ints: capacity.

  • So that, too, is hopefully familiar.

  • There's no pointers yet.

  • There's nothing too fancy.

  • I'm just allocating an array of size based on what the human just typed in.

  • But here's where it gets a little interesting.

  • The purpose of this program is to just prompt the user for that many numbers

  • again and again and again.

  • So I can type in 1, 2, 3 or 5, 17, 20, 22, and so forth.

  • And just build up an array of numbers in memory,

  • but I'm going to trip over the problem we identified a little bit ago.

  • So here we go.

  • I initialize size to zero because initially there's

  • nothing in the structure.

  • While size is less than my capacity, so while the current size

  • is less than the max, so to speak, do the following.

  • First I'm going to get a number from the user.

  • And the goal is to now insert this number into this list.

  • But now I'm going to do a quick sanity check.

  • Let me check and make sure this number is or isn't in the list already,

  • because I don't want to have duplicates.

  • Why?

  • Just because.

  • I want this to be a very clean list, no duplicates.

  • And so this loop of code here, maybe from week one, week two, week three,

  • is just an example of iterating through an array, looking for the number,

  • and if so, remembering that I found it with a Boolean

  • so that I have an answer found or not found in a Boolean variable.

  • OK.

  • So that's all that code is doing.

  • Still no magic.

  • So now is where the interesting part of the story happens.

  • So if the number was not found in the list already, here is how per week two,

  • we add a number to the end of an array.

  • Because if the size is initially zero, numbers[0]

  • is where the first number should go.

  • If size is one at this point, numbers[1]

  • is where the new number should go.

  • And then I should increment size.

  • But there's a problem here in that once I print out these numbers

  • and the program ends, I can only have inputted as many numbers

  • as are available, as I have capacity for.

  • So it's kind of constrained.
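
The core step of that program can be sketched as a single function. This is a reconstruction from the description above, not the actual list0.c: the function name `add_unique` is hypothetical, and to stay self-contained it receives the number as a parameter instead of prompting the user.

```c
#include <stdbool.h>

// Try to add number to the end of numbers, which currently holds *size values
// and has room for capacity values in total.
// Returns true if the number was added, false if it was a duplicate
// or the array is already full.
bool add_unique(int numbers[], int *size, int capacity, int number)
{
    // A fixed-size array cannot grow: once full, nothing more fits
    if (*size >= capacity)
    {
        return false;
    }

    // Linear search to see whether the number is already in the list
    bool found = false;
    for (int i = 0; i < *size; i++)
    {
        if (numbers[i] == number)
        {
            found = true;
            break;
        }
    }
    if (found)
    {
        return false;
    }

    // The next free slot is at index *size; store there, then grow size
    numbers[*size] = number;
    (*size)++;
    return true;
}
```

The first check is the constraint the lecture points out: once `size` reaches `capacity`, every further number is rejected, no matter how much memory the computer has left.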

  • So what if I want to do a little better than this?

  • Consider list1.c, which does now introduce new material, or at least an application

  • of the topics this week and last.

  • Here in line nine is technically how I can allocate

  • an array before I know the size of it.

  • So an array, recall, is just a chunk of memory

  • identified by some word, like numbers or students or whatnot.

  • But technically we've seen that there's kind of this equivalence, where

  • if an array is just the chunk of memory, you can technically refer to an array

  • by an address, the address of its first byte just like a string.

  • So this on line nine is a little new.

  • But it's kind of that idea.

  • Give me a pointer called numbers but initialize it to null.

  • There's no space for the number.

  • But this pointer therefore doesn't point to any chunk of memory.

  • It would be like Olivia standing up here awkwardly with no one

  • to point at because we've only allocated space for the first pointer,

  • not for everyone else on the stage.

  • So capacity is by default zero.

  • So here the rest of the program is pretty similar.

  • I go ahead and infinitely prompt the user for a number.

  • I check for errors.

  • It turns out if you read the documentation

  • for get_int, it will return a special constant called INT_MAX

  • if the user stops providing input.

  • Here I'm just checking with a loop and a Boolean,

  • whether or not this number is in the list, same as before.

  • But here's where I start to use some new functionality.

  • If the number was not found already in the list, and the size of the list

  • already equals its capacity, that is if it is filled,

  • what do I have to do conceptually now?

  • If I've got an array whose purpose in life, as we proposed earlier,

  • is just to grow it?

  • I need to add space for it.

  • So I need to add space, as we were proposing.

  • Even though this is kind of lame in that it's a little inefficient,

  • here's how we do it.

  • I can simply call realloc, passing in the array whose memory

  • I want to reallocate.

  • And I just tell realloc how much space I now want.

  • So here, this is the size of the int, four bytes.

  • And this is how many bytes I want.

  • So whatever the current size is, realloc, give me room for one more int.

  • And then realloc gets assigned to numbers

  • and I check if it's null or not null.

  • And I'm keeping it a little simple.

  • We could add some additional error checking here, but what does realloc

  • do?

  • Realloc is pretty cool because if you pass to realloc,

  • a pointer to a chunk of memory that's like of this size, realloc

  • will look in your computer's memory and if it

  • sees a bigger chunk of memory over here, it

  • will handle the copying of everything over to it for you.

  • And then return to you the address of the new chunk of memory

  • and free the old for you.

  • So does the switcheroo.

  • It's still linear time but this is how you would use it

  • without having to malloc and free and use a for loop like we described before.

  • And then you can go ahead and put the number in the list as before.

  • So the only new thing here, even though we're going through it quickly,

  • is that this is how you call realloc.

  • You pass in a pointer that's previously pointing

  • to a chunk of memory or even null.

  • That's OK.

  • If you pass in a pointer that's null,

  • it will give you back the address of space for just one int and then the next time

  • two ints, three ints, and four ints.
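
A growable array along these lines can be sketched as follows. This is not the actual list1.c: the function name `add_number` and the variable `tmp` are hypothetical. Unlike the version described, which assigns the result of `realloc` straight to `numbers`, this sketch keeps the result in a temporary pointer first, so the existing array is not lost if `realloc` returns NULL.

```c
#include <stdlib.h>

// Append number to a heap-allocated array, growing it by one int when full.
// *numbers may be NULL while *capacity is 0, because realloc(NULL, size)
// behaves like malloc(size).
// Returns 0 on success, 1 if realloc fails (the old array stays valid).
int add_number(int **numbers, int *size, int *capacity, int number)
{
    if (*size == *capacity)
    {
        // Ask for room for one more int than the array currently holds
        int *tmp = realloc(*numbers, sizeof(int) * (*size + 1));
        if (tmp == NULL)
        {
            return 1;
        }

        // realloc may have moved the data; remember the new address
        *numbers = tmp;
        (*capacity)++;
    }

    // Store the number in the next free slot
    (*numbers)[*size] = number;
    (*size)++;
    return 0;
}
```

Growing by one element at a time keeps the example close to the lecture, but each call may copy the whole array, so a long run of insertions stays expensive.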

  • But with linked lists things get a little more interesting.

  • And the syntax is going to be a little funky but let's see.

  • Here it turns out is how we can implement each of our human volunteers

  • in code.

  • Each of them I called a node and node is a term of art in CS.

  • It refers to some data structure that contains some information.

  • Each of them was holding a number, which we called an int.

  • And then each of them, this is kind of funky,

  • had a left hand called next that was meant to point to someone

  • who looked just like them structurally.

  • So the idea here is that we don't want to just have another structure

  • inside of a structure, otherwise you would

  • get this sort of infinite Russian doll kind of thing going on.

  • You instead want to say, each of these structures

  • has a pointer to someone else who looks like them structurally, too.

  • And that's how we get the left arm metaphor implemented in code.

  • So that just defines a node, one of our volunteers.

  • Meanwhile though, here's how we would implement Olivia in one line of code.

  • So Olivia was herself a pointer to a node.

  • She didn't have a number, right?

  • Her sign just said first.

  • She was not holding a number.

  • So we don't need a whole structure for Olivia.

  • We just need a pointer to one such node structure.

  • But initially she was just kind of standing here so we'll just

  • say she was null initially.
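
In C, the two declarations just described might be written as in this sketch. The field names `n` and `next` and the variable name `numbers` follow the way the lecture refers to them; the exact layout of the lecture's own file is an assumption.

```c
#include <stddef.h>

// One volunteer: a number plus a pointer to another structure of the same type.
// The field must be a pointer, since a struct cannot contain a whole copy of itself.
typedef struct node
{
    int n;
    struct node *next;
}
node;

// Olivia: not a node herself, just a pointer to the first node.
// NULL means the list is empty.
node *numbers = NULL;
```

The name `struct node` has to appear inside the definition because the shorter alias `node` does not exist until the `typedef` is complete.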

  • So the rest of this code is presumably about malloc-ing someone like Stella

  • from the audience, updating Olivia, using Jess to actually update pointers

  • temporarily.

  • So let's see what this looks like in code.

  • So while true, I'm just going to prompt the user for numbers like before.

  • As before, I'm going to check for errors in the same way.

  • Here's a little different.

  • Here's the block of code wherein I just check if my current linked list already

  • has the number I'm trying to insert.

  • But remember, we took away the expressiveness of square brackets.

  • Can't do that anymore.

  • I have to now do this with pointers.

  • So here we go.

  • I, with my for loop, initialize a pointer, Jess,

  • to point at the same thing Olivia was pointing at, numbers.

  • So again, Jess was also just a pointer.

  • She was not holding a number.

  • She was holding ptr, so she was just one pointer

  • pointing at the same thing as Olivia.

  • Here we're saying, so long as Jess is not equal to null,

  • so as long as Jess doesn't walk off the edge of the stage,

  • go ahead and do the following.

  • What do I want to do?

  • And this syntax is new.

  • We saw at the beginning of today the dot operator,

  • which says take a data structure like students

  • and go into it with the dot operator.

  • Get their name and their dorm.

  • That was because the first demo today did not use pointers.

  • It just used structures.

  • Now we're using structures and pointers.

  • And so the syntax changes just a tiny bit.

  • When you have a pointer that is a pointer to a structure

  • and you want to follow that pointer and go to that structure,

  • the one piece of syntax in C that maybe actually maps to reality or concept

  • is this arrow operator, which means follow the left hand,

  • look at the structure, and get at that number.

  • And so if the volunteer's number equals the number that Jess was looking for,

  • go ahead and say found is true.

  • Otherwise update Jess, or ptr, to equal

  • whatever her left hand is pointing at.

  • So if Jess was temporarily pointing here,

  • she would then update herself by pointing there.

  • And so that's all this code is doing.

  • Jess starts to point at whatever her left hand was pointing at.

  • She moves physically on the stage.
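
The search loop walked through above can be sketched as a small function. The name `contains` is hypothetical and the code is a sketch under that assumption; `ptr` is the temporary pointer played by Jess, and `numbers` is the pointer to the first node.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Return true if number is stored somewhere in the list starting at numbers
bool contains(node *numbers, int number)
{
    // Start where the head pointer points; stop after walking off the end
    for (node *ptr = numbers; ptr != NULL; ptr = ptr->next)
    {
        // The arrow follows the pointer to the node and reads its field n
        if (ptr->n == number)
        {
            return true;
        }
    }
    return false;
}
```

The update step `ptr = ptr->next` is the linked-list counterpart of `i++` in an array loop. `ptr->n` is shorthand for `(*ptr).n`, the dot operator applied after following the pointer.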

  • All right, so now is where things get a little ugly.

  • And we'll do this with a hand-wave because I think this one is better

  • done at a slower pace on one's own.

  • And we'll go over these kinds of things in section and beyond.

  • Here's how I allocate space for a new node.

  • When I said malloc Stella, it's this line of code here, 45.

  • Malloc space for the size of a node and store it in the person

  • that Stella embodied.

  • Otherwise, if there is not enough memory, if something goes wrong,

  • return one.

  • Meanwhile, here's how we add the number to the list.

  • So this is exactly what Jess ended up acting out.

  • First we handed Stella her number, which is line 52 here.

  • We technically told her to point at a garbage value,

  • so I've improved the code since.

  • So line 53 would be like telling Stella, point here, not here.

  • So that's just cleaning up that omission last time.

  • And then here we have the same kind of code

  • again, a for loop that looks kind of funky

  • but it's just like updating the hand as you walk through the list.

  • And here's where the interesting part happens.

  • At the very end of our story, Jess kind of manipulated our volunteer's arms.

  • So if not pointer next, which is a cryptic way of saying,

  • if pointer next equals equals null.

  • So if Jess has found the end of the list,

  • go ahead and update whoever she is pointing at's

  • left hand to point to Stella, the new node.

  • Then break out because we're done.
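
Putting the allocation and the walk to the end of the list together gives something like the sketch below. It is not the lecture's file, so the line numbers mentioned above do not apply to it, and the names `append` and `new_node` are assumptions made for illustration.

```c
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Append number to the end of the list whose head pointer lives at *numbers.
// Returns 0 on success, 1 if malloc fails.
int append(node **numbers, int number)
{
    // "malloc Stella": allocate space for one new node
    node *new_node = malloc(sizeof(node));
    if (new_node == NULL)
    {
        return 1;
    }

    // Hand the node its number and set its pointer to NULL,
    // so that it never holds a garbage value
    new_node->n = number;
    new_node->next = NULL;

    // Empty list: the head pointer itself must point at the new node
    if (*numbers == NULL)
    {
        *numbers = new_node;
        return 0;
    }

    // Otherwise walk until the node whose next is NULL, then attach there
    for (node *ptr = *numbers; ptr != NULL; ptr = ptr->next)
    {
        if (!ptr->next)
        {
            ptr->next = new_node;
            break;
        }
    }
    return 0;
}
```

The `break` is essential: without it the loop would step onto the node it just attached, find that its `next` is NULL, and point that node at itself.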

  • So syntactically, this is hard and problem

  • set five will afford us opportunities to walk through very similar code.

  • But for now, just realize that all we're doing

  • is instead of just using super simple arithmetic, plus one, plus one,

  • plus one, we're just kind of following these arrows, following these arrows.

  • And the kind of syntax we'll use for that

  • is just this, which is not very readable at first glance.

  • But that's why I grasp onto, if you are a more visual person,

  • the kinds of hand manipulation and arm changes

  • that we were doing here physically with our volunteers.

  • And then we, again, print up [INAUDIBLE].

  • The last thing here I'll note, and you'll

  • do this in problem set five, is here's how you might

  • free a whole linked list of numbers.

  • I just kind of congratulated our volunteers and everyone

  • left the stage, thereby being freed.

  • But if we wanted to do this more methodically, we could use a for loop

  • but here I chose to do a while loop, because it's

  • a little more succinct design wise.

  • Here was our pointer, temporary pointer pointing at numbers.

  • And here I can say while pointer is not null because if it's null

  • my work is done.

  • Here I go ahead and say, update this value next to equal

  • whoever's next in the list.

  • Free whoever's currently in the list.

  • And then update the next pointer.
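
The freeing loop described here might be sketched as follows. The function name `free_list` is hypothetical, and returning a count of freed nodes is an addition made only so the behavior is easy to check; the three steps inside the loop are the ones the lecture lists.

```c
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *next;
}
node;

// Free every node in the list starting at numbers.
// Returns how many nodes were freed.
int free_list(node *numbers)
{
    int count = 0;
    node *ptr = numbers;
    while (ptr != NULL)
    {
        // Remember who comes next before letting go of the current node
        node *next = ptr->next;

        // Free the current node; its fields must not be read after this
        free(ptr);
        count++;

        // Move on to the node remembered above
        ptr = next;
    }
    return count;
}
```

Saving `ptr->next` in a separate variable first is the whole point: once `free(ptr)` has been called, reading `ptr->next` would touch memory the program no longer owns.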

  • So again, don't worry too much about the lower level details here.

  • But just take away for today that we do now have a way of implementing,

  • in code, the higher level intuition that derived

  • from this kind of data structure.

  • But don't fret yet about the code itself.

  • But we now have the ability to stitch data structures together like this.

  • Upside of which is now we get dynamism, right?

  • We're no longer stuck painting ourselves

  • into the proverbial corner with arrays by not allocating enough memory.

  • Or conversely, wasting memory by allocating way too much just so we

  • don't have to deal with the problem.

  • But we pay a price with the linked list.

  • We get dynamism and can more efficiently add a node, subtract a node,

  • and we just have to in constant time, update those pointers.

  • But we spend more memory for all these darn pointers.

  • And frankly, the code is more complex.

  • So recall from our first or second week, human time, programmer time

  • is a valuable resource.

  • And making something harder and more time consuming to implement

  • might not be a price you want to pay.

  • And so even I was just chatting with a colleague

  • yesterday about how in graduate school I used to cut corners,

  • especially late at night when writing code.

  • And I would write sometimes deliberately really bad code

  • that might take like eight hours to analyze

  • some data set for some research project I was working on because you know what?

  • I realized it was faster for me to write bad code, poorly designed, that

  • takes eight hours because in those eight hours

  • I could just go to sleep, frankly.

  • Now I would say that was only because my advisor was not grading me

  • on correctness and design and style.

  • But it is a manifestation of a very actual resource

  • that I don't recommend you cut that particular corner for now,

  • since one of the goals of being in a class is to get better at design.

  • But at the end of the day and in the real world,

  • even CS50 staff and I are constantly making decisions.

  • Well, yeah, we could improve this feature of help50

  • but it's going to take a week to do it.

  • Or we can just throw in some extra line of code and get it done now.

  • And it's a trade off.

  • And this is what makes code good and bad.

  • And when you start to cut these corners in the real world,

  • you start to accumulate what the world would call technical debt.

  • And debt tends not to be such a good thing.

  • And that speaks to the price you're paying in the long term

  • because it might take me and the staff longer this summer

  • now to go back in and clean all that up.

  • And God forbid, overnight frankly, and this

  • happened more often than I should admit, my code was buggy and bailed out

  • at like 2:00 AM I wake up eight hours later thinking, my data's ready.

  • No.

  • I should have done it right the first time so

  • I could rerun the code again and again.

  • So what else do we get now from this ability

  • to have pointers in data structures?

  • So there's this picture here from Mather's Dining Hall.

  • The cap represent the notion of actual trays.

  • And we've been using the stack in a very low level

  • arcane way to talk about memory management, which

  • isn't all that useful to us for solving problems.

  • But the data structure is.

  • It turns out there is a data structure in computer science called a stack.

  • And your computer, Mac or PC, is constantly

  • using it to manage functions and memory, but we can use it, too,

  • for various applications.

  • We can implement a data structure that has two operations.

  • They're conventionally called push and pop.

  • Though it's like add and subtract.

  • You can call it anything you want but most programmers would

  • call it push and pop.

  • Push is like adding a tray to the stack and pop is like taking one off.

  • But just as the name implies with the stack,

  • what's characteristic of a stack is that it is an example of a LIFO data

  • structure, last in first out, L-I-F-O.

  • Now what does that mean?

  • Well, if one of the staff from the dining hall comes by with a new tray

  • that's just been cleaned and he or she puts it on the top of the stack,

  • which one is a normal human being going to grab first?

  • The last one in, right?

  • It'd be strange and kind of difficult to get down on your knees

  • and pull out the bottom one, even though that would be more fair, right?

  • Like that little tray down there has been waiting the longest to be used.

  • But it's under the weight of the whole stack, literally.

  • But that, nonetheless, is how a stack would work.

  • And you can implement the stack now in a couple of ways.

  • And here's where the world gets interesting in programming,

  • in that there is this distinction between design

  • of data structures and low level implementation details.

  • A stack is as I've described it, a LIFO data structure.

  • Push and pop, last in, first out.

  • That's it.

  • How you implement that could be any number of ways.

  • For instance, I could implement a stack as a C data type, custom one,

  • that has an array of numbers for this capacity where capacity

  • is some big constant like 100, 1,000, however many trays I want

  • to store so long as I keep track of the size

  • of how many trays are in it so that I can always make sure its size is less than

  • or equal to the capacity.

  • Just to make sure I don't try to cram too many trays in there.
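
An array-based stack of this kind could be sketched as below. The type name `stack`, the constant `CAPACITY`, and the exact signatures of `push` and `pop` are assumptions made for illustration; the lecture only fixes the idea of an array, a capacity, and a size.

```c
#include <stdbool.h>

// Maximum number of trays the stack can hold
#define CAPACITY 100

typedef struct
{
    int numbers[CAPACITY];
    int size;
}
stack;

// Add number to the top of the stack.
// Returns false if the stack is already full.
bool push(stack *s, int number)
{
    if (s->size == CAPACITY)
    {
        return false;
    }
    s->numbers[s->size] = number;
    s->size++;
    return true;
}

// Remove the most recently pushed number and store it in *number.
// Returns false if the stack is empty.
bool pop(stack *s, int *number)
{
    if (s->size == 0)
    {
        return false;
    }
    s->size--;
    *number = s->numbers[s->size];
    return true;
}
```

Both operations touch only the end of the array, so the last value pushed is always the first one popped. A caller would start from an empty stack, for instance `stack s = {.size = 0};`.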

  • But what's a downside of this implementation

  • of Mather House's stack of trays?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Limited space, exactly.

  • I have consciously hard coded capacity to be some fixed value.

  • So if we buy a new tray or a whole box of trays arrive,

  • might not fit there, right?

  • Once I exhaust this remaining space, I need to make a new pile

  • or I need to store them elsewhere.

  • I'm just out of space.

  • So maybe this is a good design decision in that it reflects reality.

  • Or maybe it's stupid because now I can't store even more trays

  • when they come in via shipment.

  • So I could solve that.

  • We know from our brief example a moment ago,

  • you could just make your array dynamic.

  • Don't preallocate it to be of size capacity.

  • Just declare it to be a pointer that will eventually

  • point to maybe space for one tray or 100 trays

  • or 1,000 trays or maybe 1,001 trays if we realloc the space again and again.

  • And when you start writing code that involves other people, whether it's

  • for some school project or a personal project or just in the real world,

  • this is where life gets more interesting,

  • too, because so long as you and I, if you're my colleague, kind of decide, OK.

  • I'm going to expose push and pop as the operations.

  • I will implement push and pop.

  • You don't have to worry about the low level implementation

  • details in my own design decisions.

  • You just have to read my documentation and not

  • care how I've implemented it because I have abstracted that away for you

  • and given you just an API.

  • Push and pop would be an API, application programming interface.

  • All you need to know is that you can trust

  • that I will implement push and pop.

  • And you might dislike it ultimately, if I limit your space,

  • but to understand that you need to read the documentation

  • to know what features my implementation is providing.

  • Now this of course is the ridiculousness that ensues every year or so, whereby

  • people line up to buy an iPhone.

  • Now, why would it be a bad thing if Apple used a stack when people

  • arrive at 3:00 AM for their iPhones?

  • Yeah.

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Exactly.

  • The person who came last would get their phone first,

  • which is fantastic for that person.

  • But it's really unfair to everyone else.

  • So Apple of course, like most stores, if they even have this problem,

  • have queues or lines whereby it's a FIFO data structure, first in, first out.

  • Where the first person in line hopefully gets his or her iPhone first.

  • So how can you implement those operations or that API?

  • Might call it enqueue and dequeue, but add, subtract, whatever.

  • But these are the terms of art.

  • And we might implement it as follows.

  • A queue might just need a little bit more information than capacity and size

  • alone.

  • You have to remember who's in front, potentially,

  • just so that when that person gets out of line,

  • you don't have to move all of your data in the array

  • just like the humans would walk forward.

  • That's a waste of time.

  • Every time someone's ready to buy their phone,

  • why do n-1 people have to take a step forward?

  • Why not just bring the phone to them and save that inefficient use of time?
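
One way to avoid shifting everyone forward is to track the index of the front of the line, as in the sketch below. The names `queue`, `CAPACITY`, `enqueue`, and `dequeue` are illustrative, and wrapping around the end of the array with the `%` operator is an assumption that goes slightly beyond what the lecture spells out.

```c
#include <stdbool.h>

// Maximum number of people the line can hold
#define CAPACITY 100

typedef struct
{
    int numbers[CAPACITY];
    int front;   // index of the person at the front of the line
    int size;    // how many people are currently in line
}
queue;

// Add number to the back of the line.
// Returns false if the queue is full.
bool enqueue(queue *q, int number)
{
    if (q->size == CAPACITY)
    {
        return false;
    }

    // The back of the line is size slots past the front, wrapping around
    q->numbers[(q->front + q->size) % CAPACITY] = number;
    q->size++;
    return true;
}

// Remove the number at the front of the line and store it in *number.
// Returns false if the queue is empty.
bool dequeue(queue *q, int *number)
{
    if (q->size == 0)
    {
        return false;
    }
    *number = q->numbers[q->front];

    // Instead of moving everyone forward, move the notion of "front"
    q->front = (q->front + 1) % CAPACITY;
    q->size--;
    return true;
}
```

Dequeuing only changes two integers, so nobody in the array has to move; the price is the extra `front` field and the modular arithmetic. A caller would start from `queue q = {.front = 0, .size = 0};`.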

  • Or we could do it like this, more of a dynamic data structure.

  • And we won't do the code here.

  • But we've seen, for instance, the example in our list zero, one,

  • and two code, how you could start with a fixed size array,

  • make it dynamic with malloc and realloc, and how you might further

  • make it dynamic with a linked list, albeit with trade offs of time

  • and space.

  • There's this great short video I thought I'd share here, wherein Jack

  • learns the facts about queues and stacks which distinguishes these two data

  • structures in a way that actually paints an even more clear picture of how

  • they're distinct.

  • If we can dim the lights for 60 seconds or so.

  • [VIDEO PLAYBACK]

  • - Once upon a time there was a guy named Jack.

  • When it came to making friends, Jack did not have the knack.

  • So Jack went to talk to the most popular guy he knew.

  • He went up to Lu and asked, what do I do?

  • Lu saw that his friend was really distressed.

  • Well, Lu began, just look how you're dressed.

  • Don't you have any clothes with a different look?

  • Yes, said Jack.

  • I sure do.

  • Come to my house and I'll show them to you.

  • So they went off to Jack's and Jack showed Lu the box

  • where he kept all his shirts and his pants and his socks.

  • Lu said, I see you have all your clothes in a pile.

  • Why don't you wear some others once in a while?

  • Jack said, well, when I remove clothes and socks,

  • I wash them and put them away in the box.

  • Then comes the next morning and up I hop.

  • I go to the box and get my clothes off the top.

  • Lu quickly realized the problem with Jack.

  • He kept clothes, CDs, and books in a stack.

  • When he reached for something to read or to wear,

  • he chose the top book or underwear.

  • Then when he was done he would put it right back.

  • Back it would go, on top of the stack.

  • I know the solution, said a triumphant Lu.

  • You need to learn to start using a queue.

  • Lu took Jack's clothes and hung them in a closet.

  • And when he had emptied the box, he just tossed it.

  • Then he said, now Jack, at the end of day, put your clothes on the left

  • when you put them away.

  • Then tomorrow morning when you see the sun

  • shine, get your clothes from the right, from the end of the line.

  • Don't you see, said Lu, it will be so nice.

  • You'll wear everything once before you wear something twice.

  • And with everything in queues in his closet and shelf,

  • Jack started to feel quite sure of himself all thanks

  • to Lu and his wonderful queue.

  • [END PLAYBACK]

  • DAVID MALAN: So that isn't to say that queues are all that

  • and stacks are a bad data structure.

  • They actually each have their own applications.

  • And in fact, one common use for stacks beyond memory management,

  • as we'll discuss in a couple of weeks when we start exploring HTML and web

  • programming, you'll see that HTML itself, this

  • is the language in which web pages are written.

  • That you'll soon be able to write if not already.

  • This is a language that actually has a nested hierarchy to it.

  • Whereby a browser might actually use a stack to analyze the HTML

  • that composes a web page to determine, for instance, if it is correct or not.

  • But there's so many other tools that we can now add to your toolkit.

  • And even though we'll look at each of these just briefly, each of them

  • derives from these very two simple principles, the ability

  • to come up with custom data structures inside of which are pointers,

  • or the ability to stitch one thing to another.

  • So here's an example of what a computer scientist would call a tree.

  • The node here we've drawn as circles just because.

  • But the nodes in a tree are much like a family tree, where

  • each node has zero or more children or descendants, maybe

  • a parent or other ancestors.

  • And so we'll call things like the first node

  • at the very top in a data structure called the tree, the root of the tree,

  • albeit growing downward like this like a family tree.

  • Anything at the very bottom of the tree that only has arrows going into it

  • will be called children or leaves of the tree.

  • And so this might be a way to lay out data in a useful way.

  • In fact, if you think back to when we had things like numbers

  • like this, thus far any time we dealt with numbers or words or Mike Smiths,

  • we would just order them from left to right in an array

  • and then search the array either in big O of n time linearly from left

  • to right.

  • But we did better using what?

  • Binary search, but for binary search it needed to be an array

  • and it needed to be sorted.
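As a reminder of what that looks like on an array, here is a minimal sketch of binary search in C. The function name `binary_search` and its parameters are assumptions for illustration; the only requirement carried over from the lecture is that the array is already sorted.

```c
#include <stdbool.h>

// Returns true if target appears in values[0..count-1].
// Assumes values is sorted in ascending order.
bool binary_search(const int values[], int count, int target)
{
    int low = 0;
    int high = count - 1;

    while (low <= high)
    {
        int mid = low + (high - low) / 2;   // middle of the remaining range

        if (values[mid] == target)
        {
            return true;
        }
        else if (values[mid] < target)
        {
            low = mid + 1;    // discard the left half
        }
        else
        {
            high = mid - 1;   // discard the right half
        }
    }
    return false;
}
```

Each pass through the loop halves the range still under consideration, which is where the logarithmic running time comes from.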

  • And the problem we never dealt with was we never

  • actually added another page to the phone book.

  • We never actually tried to add more numbers to our array.

  • And yet today, we've kind of identified these very glaring issues with arrays,

  • which is that you're kind of painted into a corner.

  • If you allocate only so much space, you use it all up and then darn it,

  • you want to add more to the array.

  • So how can we maybe still lay out data in sorted order, still

  • leverage something like logarithmic time and divide

  • and conquer, but get today's benefit of the dynamism whereby

  • we can grow the data structure and shrink it very incrementally,

  • without having to all of a sudden reallocate the whole structure?

  • Well, instead of laying out these numbers, which are conveniently

  • numbered as multiples of 11 here, 22, 33, 55,

  • what if we laid them out like this in memory?

  • We won't look at the code for this, but think of each of these circles

  • as a structure, a data structure, inside of which there's

  • an int n, how many pointers apparently?

  • Seven total on the screen.

  • But how about within each node, like this one here?

  • There's a number n, 55.

  • And what else?

  • How many pointers?

  • Just two.

  • Two maximally, in fact, because the leaves, it would seem,

  • have zero by definition.

  • And technically, if I hadn't added 22, maybe there could just be one child.

  • This is what we call a binary tree because every node has

  • at most two children, 0, 1, or 2.

  • And it's technically a binary search tree because of a special property.

  • It's very searchable because if you look at any node, its left child is smaller.

  • And if you look at any node, its right child is bigger.

  • And that's a recursive definition, so to speak.

  • You can look at any node in the tree and that definition is true.

  • Even the leaves, because it's sort of a vacuous statement

  • to say it's greater than its left child if there is no left child.

  • It's sort of trivially true.

  • So what's nice about this data structure?

  • Well, suppose I want to search for the number 22.

  • Like our linked list, and like Olivia being the special first pointer,

  • a binary tree in a computer's memory would just

  • have one special pointer, called root or first or whatever you want to call it.

  • And if you want to look for 22 just like Olivia and Jess were,

  • you might look here.

  • And say, hmm, 55 is greater than 22.

  • So which way do I go?

  • Left obviously.

  • And here, you know, if we were doing this visually

  • we could snip off that whole subtree.

  • And you would see half of the problem be torn away like the phone book.

  • 22 versus 33, of course, this is greater.

  • So we go here and we find it.

  • And long story short, that was not linear

  • because we weren't searching all of the nodes.

  • And if conceptually we were chopping the tree in half,

  • in half, in half every time we went left or right,

  • what should be the running time of search on a binary search tree?

  • Log base 2 of n, or just logarithmic as we've seen.

  • Now it's not necessarily always as prettily balanced.

  • This is very deliberately chosen.

  • You can get perverse situations where it just kind of devolves

  • into a long linked list.

  • But it still is a binary search tree.

  • It was just poorly built. But at least if we keep a balance like this,

  • we can gain some benefits.

  • And here's how we would implement your proposed integer and two nodes.

  • Instead of just calling it next, I'm going

  • to call it more semantically usefully left and right.

  • And notice that struct node *left is just a pointer called left.

  • Struct node *right is a pointer called right.

  • And that's how we implement these.

  • And what do you think the leaves have as their values for left and right?

  • The leaves of the tree had no children, by definition, so what's

  • the value of left and right?

  • Null.

  • So they're just sort of pointing down at the floor as zero, null, values.
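Written out, the structure being described might look like the sketch below: an `int` called `n` plus two pointers, `left` and `right`, both `NULL` in a leaf. The helper `make_node` is a hypothetical convenience added for this example, not something from the lecture.

```c
#include <stdlib.h>

// One node of a binary search tree
typedef struct node
{
    int n;                // the value stored at this node
    struct node *left;    // subtree of smaller values, or NULL
    struct node *right;   // subtree of larger values, or NULL
} node;

// Hypothetical helper: allocates a leaf holding n.
// Returns NULL if memory cannot be allocated.
node *make_node(int n)
{
    node *leaf = malloc(sizeof(node));
    if (leaf == NULL)
    {
        return NULL;
    }
    leaf->n = n;
    leaf->left = NULL;    // a leaf has no children
    leaf->right = NULL;
    return leaf;
}
```

The caller owns whatever `make_node` returns and would eventually `free` it.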

  • So we're not going to write the code for this now,

  • but we can leverage week zero's ideas.

  • Divide and conquer, binary search.

  • We can leverage last week and this week's ideas of structures

  • and dynamic memory and technically the heap in order

  • to start to build up data structures like this, that now give us dynamism

  • that can grow and shrink as needed.
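The lecture does not write the insertion code, but a rough sketch helps show both the dynamism and the earlier warning about a tree that was "poorly built." The functions `bst_insert` and `height` below are assumptions for illustration: a plain, non-balancing insert, and a helper that measures how tall the resulting tree is.

```c
#include <stdlib.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
} node;

// Inserts n and returns the (possibly new) root of the tree.
// Duplicates are ignored. If malloc fails, the tree is left unchanged.
// This sketch does no rebalancing.
node *bst_insert(node *tree, int n)
{
    if (tree == NULL)
    {
        node *leaf = malloc(sizeof(node));
        if (leaf != NULL)
        {
            leaf->n = n;
            leaf->left = NULL;
            leaf->right = NULL;
        }
        return leaf;
    }
    if (n < tree->n)
    {
        tree->left = bst_insert(tree->left, n);
    }
    else if (n > tree->n)
    {
        tree->right = bst_insert(tree->right, n);
    }
    return tree;
}

// Number of nodes on the longest path from the root down to a leaf
int height(const node *tree)
{
    if (tree == NULL)
    {
        return 0;
    }
    int l = height(tree->left);
    int r = height(tree->right);
    return 1 + (l > r ? l : r);
}
```

With this insert, adding 11, 22, 33, 44, 55, 66, 77 in sorted order produces a tree of height 7, which is effectively a linked list. Adding the same values in the order 44, 22, 66, 11, 33, 55, 77 produces a tree of height 3. Keeping a tree balanced regardless of insertion order takes extra work that this sketch leaves out.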

  • And just so you've seen the code, here might

  • be an implementation of a function for binary search tree

  • that, given the root of the tree, finds for you true or false,

  • whether or not something is in it.

  • So I want to search for a number n in this tree.

  • So this here is, again, a pointer to the root

  • just like Olivia was a pointer to the first node in the linked list.

  • So if the tree is null, return false because it's

  • kind of a stupid question to ask.

  • If there's no tree being passed in, it's clearly not there.

  • So return false.

  • That's our special case to ensure that we don't dive too deeply.

  • But here's a very cool application of a past idea.

  • If n is less than the n at the current node in the tree,

  • and remember the arrow just says, go there and look at n,

  • we know we want to look at the left hand side of the tree.

  • So do we have an algorithm to search a tree for a specific value?

  • Just so happens that tree is now smaller because it's

  • this half of the tree on the left.

  • We do.

  • We have a function called search that takes a number as input

  • and takes a tree as a pointer.

  • That doesn't have to be the whole tree.

  • It can be a sub-tree because again, a tree is kind of recursively defined,

  • because every left child and right child itself might have children.

  • So it's a smaller tree but it's still a tree.

  • So I can answer this question.

  • If n is less than the current node's own n value,

  • I can just return the answer to calling search on the same number,

  • but passing in just the left half of the tree.

  • So this is like the tree version of tearing the phone book in half

  • and searching only the left half.

  • And you can perhaps guess, if you're following along at this point,

  • if n is greater than the current node, we're

  • just going to search to the right.

  • And that's three cases.

  • What's the fourth possible case?

  • Yeah, if n equals the current node.

  • And so in that case I'm just going to trivially return true.

  • And this is kind of beautiful.

  • It's not from one perspective.

  • It's not obvious at first glance, how this works.

  • And it's not comfortable necessarily if you're not used to recursion that much.

  • But what's beautiful about this, especially

  • if we get rid of the stupid curly braces and a lot of stuff

  • that's not really intellectually interesting,

  • you are reducing this problem to really just these lines of code.

  • Check for null, return false.

  • Check if it's less than, just recurse on the left.

  • Check if it's greater than, recurse on the right.

  • Otherwise you found it.

  • It's literally the same idea or spirit as our divide and conquer

  • approach for the phone book, just implemented now

  • using trees or nodes linked together in a tree.
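The function just walked through can be reconstructed from the description roughly as follows. The slide itself is not reproduced in this transcript, so treat the exact signature and layout as assumptions. The `node` type is repeated here so the sketch stands on its own.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct node
{
    int n;
    struct node *left;
    struct node *right;
} node;

// Returns true if n is somewhere in the tree rooted at tree
bool search(int n, node *tree)
{
    if (tree == NULL)
    {
        return false;                    // empty (sub)tree: not found
    }
    else if (n < tree->n)
    {
        return search(n, tree->left);    // recurse on the left subtree
    }
    else if (n > tree->n)
    {
        return search(n, tree->right);   // recurse on the right subtree
    }
    else
    {
        return true;                     // n equals this node's value
    }
}
```

Each recursive call receives a strictly smaller tree, so the recursion is guaranteed to end, either at a match or at a `NULL` pointer.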

  • Yeah?

  • AUDIENCE: [INAUDIBLE]

  • DAVID MALAN: Tree arrow n, so tree, recall, is a pointer to a node.

  • So that, just like Olivia was a pointer to a node in a linked list,

  • this would be like Olivia standing here and instead

  • of pointing at a line of students, sort of pointing at a tree of students

  • that fans out this way.

  • So tree, we could call it anything we want.

  • I just called it tree, represents that.

  • And meanwhile, tree left would be like if Olivia was pointing at a node here.

  • Actually, if Olivia is pointing at the root of the tree here,

  • tree left would be go look at the left half of the tree

  • or the right half of the tree.

  • If again, our volunteers were laid out on stage like a fan,

  • like a tree instead of a list.

  • So we've seen a whole bunch of algorithms that might

  • have any number of these running times.

  • And up until now kind of the best running time

  • really has been this for the fanciest of algorithms.

  • But we have seen constant time here and there.

  • And even today if we want to insert into a linked list

  • and we don't really care about the order,

  • we can just plug the new value right there after Olivia and before Achmed

  • and get our constant time.

  • But wouldn't it be nice if more operations were constant timed?

  • One step, two step, three stepped or some finite number?

  • And it turns out we can achieve this with a bit of thought.

  • And we can leverage another sort of familiar idea as follows.

  • So like here, for instance, is some unusually large playing cards,

  • which actually do exist if you just Google jumbo playing cards

  • and look for them on Amazon.

  • Suppose I wanted to sort this deck of cards.

  • I could go through the deck one at a time

  • and order them both by their suit, like clubs and hearts

  • and so forth, and also by their numbers.

  • But odds are if you're like me, you're going to probably try

  • to make the problem a little simpler.

  • And if you see the king of spades here, I put him over here.

  • Nine of spades, I'm going to put that there.

  • 10 of spades coincidentally.

  • I'm going to put that over here.

  • Then I'm going to do what with the hearts, probably?

  • You know, probably I'm not going to go through one at a time.

  • I'm going to kind of bucketize each of the cards.

  • So here's the ace of clubs, so I'm going to make a third pile.

  • Here's a couple of diamonds.

  • So that's my fourth pile.

  • And then I'm just going to repeat this, because it's a nice simple algorithm.

  • It's going to make my life a little easier in just a moment

  • once everything is in the right pile.

  • But this is a general notion of what we'll call hashing.

  • And I'm not going to finish it because surprise, surprise,

  • we're going to get 13 cards in each pile.

  • But this is a more fundamental notion of hashing.

  • You take as input something from your list of inputs.

  • You look at it and you make a decision based on it.

  • And in this case, my hash value is going to be zero, one, two, three,

  • two because it's going to go into the hearts pile.

  • And what is a hash function?

  • It's just going to be a function in code or in my brain that

  • just makes a decision based on input and outputs a hash value, which

  • in this case is going to be that pile, that pile, that pile, that pile,

  • or if we want to be more precise, zero, one, two, or three.

  • If those four piles are implemented as, like, four arrays

  • or some kind of stacks, really.

  • I seem to be making a stack of cards.

  • Now it's not done.

  • If I want to sort these things later, I'm

  • still going to have to sort each of the piles of 13.

  • But I've kind of made the problem a little easier for myself

  • in that I've spread it out over four equivalent problems.

  • But the key ingredient here is that notion of hashing.

  • And honestly, if you've ever watched a TA or professor deal with these things

  • at the end of some class that has blue books, if a whole bunch of students

  • at the end of the hour come down and start handing in their blue books

  • it's a complete mess.

  • And if the TFs or professor wants to organize these,

  • you might make a whole bunch of piles.

  • All the L last names will go there.

  • E will go there.

  • F will go there.

  • And maybe in this case you'll alphabetize as you go, thereby

  • making this problem easier, too.

  • That is a hash function.

  • You take as input a student's name, you look

  • at the first letter of his or her last name,

  • and you decide whether it goes in bucket zero, one,

  • or maybe 25 if you're indeed hashing based on the English alphabet.

  • So hashing is something we've all done, even if we've never

  • slapped that name on it before.
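The blue-book example translates into a very small hash function. The sketch below assumes names are hashed on their first character, that letters are contiguous as in ASCII, and that anything not starting with a letter falls into bucket 0. That last rule and the name `hash` are choices made for this example only.

```c
#include <ctype.h>

#define BUCKETS 26   // one bucket per letter of the English alphabet

// Maps a name to a bucket: 0 for A or a, 1 for B or b, ..., 25 for Z or z.
// Names that do not start with a letter (including the empty string)
// fall back to bucket 0, an arbitrary choice for this sketch.
unsigned int hash(const char *name)
{
    unsigned char first = (unsigned char) name[0];
    if (!isalpha(first))
    {
        return 0;
    }
    return (unsigned int) (toupper(first) - 'A');
}
```

Note that two different names can produce the same value: both "Bob" and "Brendan" hash to 1.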

  • So how might we leverage this kind of ingredient

  • and get ourselves closer to the holy grail of data structures, which

  • would be constant time for everything?

  • Like none of this linear, none of this logarithmic time.

  • So suppose we have an array or a table, we'll call it, like this.

  • I'm going to call this a hash table because I want to leverage

  • the idea of this hash function.

  • And suppose that what I want to store in here are just things like names.

  • And I want to go ahead and store the name

  • Alice, because she turned in her exam first.

  • So here I might have [0] through [25] or in general, n-1.

  • So there's 26 buckets total.

  • Where might I be inclined to put Alice?

  • I might just hash her to zero because Alice,

  • we'll use her first name, not last, because she never

  • seems to have a last name.

  • So first name, Alice, brackets zero, she goes there.

  • Then Bob comes up, turns in his exam.

  • Where does he go?

  • [1] and then maybe Brendan comes over.

  • Damn it.

  • No room for Brendan's exam.

  • Why?

  • Because he hashes to the same value.

  • And this can happen.

  • Like, you might hash to the same value.

  • And here it was not a big deal.

  • I kept getting diamond, diamond, diamond.

  • That's fine because this data structure grows.

  • But this is an array, it would seem.

  • And I could write Alice here, I could write Bob here,

  • but Brendan should be written there too.

  • But I don't want to give Bob his exam back just to accept Brendan's.

  • So where could I put Brendan?

  • Maybe I'll kind of cheat and just put him here because there's room, right?

  • This is all free in this story so far.

  • But then Charlie comes forward.

  • What do we do with Charlie?

  • Now Brendan is where Charlie should be.

  • So now I've just kind of made a mess but I have so much free space

  • and odds are I'm not going to have a student, no offense, whose name starts

  • with a Z or an X or some of the statistically less likely ones.

  • So why don't we use those spaces?

  • And we could, but this is an example algorithmically

  • of linear probing, where you linearly top to bottom just kind of probe

  • the data structure looking for space and just drop

  • the values in the first available.

  • And initially it's nice and clean and nice and efficient

  • because if I want to look for Alice's exam later, boom,

  • she's on the top of the pile.

  • Bob, boom, second in the pile.

  • But then Charlie, not quite where he should be.

  • So eventually with this approach of linear probing

  • it's space efficient in that you pack everyone into your data structure.

  • But it eventually devolves into something linear.

  • If Alice came and handed in her exam last, by nature of space,

  • she might end up at the bottom of the pile

  • and that does not make her easy to find later.
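A minimal sketch of linear probing, under a few assumptions not stated in the lecture: the table holds 26 name pointers, probing wraps around from the last bucket to the first, and nothing is ever deleted. The names `probe_insert` and `probe_find` are made up for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define BUCKETS 26

// NULL marks an empty bucket (file-scope arrays start out zeroed)
static const char *table[BUCKETS];

static unsigned int hash(const char *name)
{
    unsigned char first = (unsigned char) name[0];
    return isalpha(first) ? (unsigned int) (toupper(first) - 'A') : 0;
}

// Stores name in the first free bucket at or after its hash value.
// Returns the index used, or -1 if the table is full.
int probe_insert(const char *name)
{
    unsigned int start = hash(name);
    for (unsigned int i = 0; i < BUCKETS; i++)
    {
        int index = (int) ((start + i) % BUCKETS);   // wrap around (assumed)
        if (table[index] == NULL)
        {
            table[index] = name;
            return index;
        }
    }
    return -1;
}

// Returns the index where name is stored, or -1 if it is absent.
// Stopping at the first empty bucket is valid only because this
// sketch never deletes entries.
int probe_find(const char *name)
{
    unsigned int start = hash(name);
    for (unsigned int i = 0; i < BUCKETS; i++)
    {
        int index = (int) ((start + i) % BUCKETS);
        if (table[index] == NULL)
        {
            return -1;
        }
        if (strcmp(table[index], name) == 0)
        {
            return index;
        }
    }
    return -1;
}
```

Inserting Alice, Bob, Brendan, and Charlie in that order places them in buckets 0, 1, 2, and 3: Brendan takes Charlie's natural spot, and Charlie gets pushed one further down, which is the mess described above.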

  • So what if we instead change the data structure

  • and use elements from today and past?

  • Let's use an array here of pointers drawn vertically just because.

  • And then why don't we string students' names off the right of this?

  • So this is an excerpt from a text that explores exactly this data structure.

  • It's called a hash table, not with linear probing,

  • but with separate chaining, whereby your data structure, your hash table,

  • is technically an array.

  • This time it's of size 31, because the book's example was

  • about day of the month for birthdays.

  • And so the data structure has not just an array, though, but what other data

  • structure combined with it?

  • It's a kind of linked list.

  • So what's nice here is that S Adams, so Adams, starting with A in our story.

  • Now they're using birthdays if you read this in the context.

  • But suppose that Adams is the only one with birthday

  • on the second of some month.

  • Well, he or she might end up here.

  • And that's no big deal if someone else has the same birthday in this example,

  • because we can either walk the list as we did with Jess

  • and just string him or her at the end of this data structure.

  • Or we can just kind of insert them at the very beginning

  • and just use some constant time changes to people's left hands to fit them in.

  • The point is though, the data structure no longer is an array only.

  • It's an array of 31 buckets, four piles, 26 piles, 31 piles.

  • But each of those piles can grow vertically,

  • so to speak, or in this case laterally because we're

  • implementing the idea of these data structures

  • now by using an actual linked list.
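A sketch of separate chaining, again under assumptions: it reuses the first-letter hash with 26 buckets from the exam story rather than the book's 31 birthday buckets, inserts at the front of each chain so insertion is constant time, and stores the caller's string pointers rather than copies. The names `chain_insert` and `chain_lookup` are made up for this example.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 26

typedef struct node
{
    const char *name;
    struct node *next;
} node;

// An array of pointers, each the head of a linked list (or NULL if empty)
static node *table[BUCKETS];

static unsigned int hash(const char *name)
{
    unsigned char first = (unsigned char) name[0];
    return isalpha(first) ? (unsigned int) (toupper(first) - 'A') : 0;
}

// Prepends name to its bucket's chain in constant time.
// Returns false only if memory cannot be allocated.
bool chain_insert(const char *name)
{
    node *n = malloc(sizeof(node));
    if (n == NULL)
    {
        return false;
    }
    unsigned int i = hash(name);
    n->name = name;
    n->next = table[i];   // new node points at the old head
    table[i] = n;         // bucket now points at the new node
    return true;
}

// Walks only the one chain that name could be in
bool chain_lookup(const char *name)
{
    for (node *p = table[hash(name)]; p != NULL; p = p->next)
    {
        if (strcmp(p->name, name) == 0)
        {
            return true;
        }
    }
    return false;
}
```

Here Bob and Brendan can both live in bucket 1 without displacing anyone else. A complete program would also walk each chain and `free` its nodes when done.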

  • So why is this actually better or worse?

  • Well one, is there any limit now on how many students can turn in their exams

  • or have birthdays?

  • No because we just keep growing it wider and wider and wider.

  • Why is this then a good thing?

  • Well now if I want to look someone up, if I know their name starts

  • with A or in the birthday example, I know

  • their birthday is on the second of the month,

  • I know deterministically, no matter what,

  • what bucket they will be in in the array.

  • Now, they might be in a long string of people with similar names or birthdays.

  • But they're going to be there, deterministically, predictably,

  • again and again.

  • And the beautiful thing is if my hash function is well-implemented, uniform

  • so to speak statistically, then it would be nice if almost all of these chains

  • are roughly the same length.

  • It would be pretty lame if this chain were really huge

  • and then every other chain were shorter because that's just

  • an opportunity for better design.

  • So in real terms, a hash table, when implemented like this,

  • should decrease in this concrete case, by a factor of like 31,

  • how long it takes to find someone.

  • So the time is one divided by 31 because if all the chains are roughly

  • the same length, you have chopped up your data set into four piles, 26

  • piles, 31 piles, each of which is one fourth or 1/26th

  • or 1/31st the size of the whole data set.

  • Now asymptotically, per couple of weeks ago,

  • that is algorithmically irrelevant.

  • That's big O of the same thing, so to speak.

  • But in real terms, having it take a quarter as much time,

  • 1/26th the amount of time, 1/31st the real time

  • is literally going to save us time on our watches.

  • Like that in real human times will save time.

  • And in fact, what you'll see in problem set five, in which you implement

  • your first spell checker, you'll see that that's

  • what we're trying to optimize for.

  • In fact, as a quick teaser before we look at our final data structure here,

  • you'll be challenged as part of this problem set optionally,

  • if you'd like to opt in, to compete on the big board.

  • Once your code is working per check50, you

  • can actually run a separate command with check50 to post it to the leader board

  • here.

  • And right now, damn it, Brian is beating both Doug and me

  • because his implementation of the spell checker

  • takes only 4.81 seconds and only 7.4 kilobytes versus my 82 megabytes

  • of memory implementing a spell check over the whole lot of words.

  • But how do you decide how to minimize space or minimize time

  • and how do you mitigate some of the trade-offs?

  • Well, let's look at one final data structure to consider.

  • This is perhaps the most sophisticated and it takes up more space

  • and so it's hard to paint on the screen.

  • But suppose we did this.

  • Suppose we were trying to store in our data structure people's names.

  • We could do this with an array of a lot of strings.

  • And we could do linear search and Brian or Doug

  • or I could just use linear search big O of n and find any one you want.

  • That's not so great.

  • We could somehow use binary search if we used a tree or an array

  • but kept the names sorted.

  • We know we can do better.

  • Just as we found Mike Smith pretty quickly in week zero.

  • But what if we could find names in constant time?

  • Whereby no matter how many words are in the tree, no matter

  • how many words are in the dictionary more generally,

  • still takes me the same amount of time to find anyone?

  • And it doesn't get longer and longer the more names we add?

  • So here is a type of tree goofily called a trie, T-R-I-E,

  • which is an excerpt from retrieval, which is weird because it's retrieval

  • and re-try-val, but this is a trie, T-R-I-E.

  • Each of the nodes in a trie is, essentially, an array itself.

  • Technically they're a structure with a little more inside of them.

  • And you'll see this in the walk through that Zamyla put together.

  • But each of the nodes in the trie is an array.

  • Each of those arrays' elements is a pointer to another such array.

  • And the way you store words in a trie is not with characters,

  • but implicitly with pointers.

  • So if we want to put someone's name like Maxwell in here,

  • we hash into this trie using the first letter of Maxwell's name, which

  • is of course m.

  • And that's going to be the 13th element in the array in my 26-element array

  • here.

  • I'm going to change that originally null pointer

  • to be a pointer to another node.

  • And then I'm going to hash on the second letter of Maxwell's name, which is A,

  • and I'm going to allocate a pointer to another array.

  • And then repeat that process for every letter in his name.

  • So if I hash on his first letter, second letter, third letter, every time

  • I do that it leads me to a new array.

  • What's not shown here is that each of these arrays is size 26.

  • It would just be atrocious to see on the screen.

  • So it does use a bunch of memory.

  • But at the end of this, there's a special symbol, drawn here as a delta symbol,

  • but it can be anything, that just means Maxwell stops here.

  • There's a word here.

  • So how many steps does it take to find any name in the tree?

  • Well, to find Maxwell it's M-A-X-W-E-L-L. So that's seven steps.

  • For Maria it'd be M-A-R-I-A. That would be five steps.

  • So it's still dependent on the number of letters in the name.

  • But if there is a billion names in this dictionary,

  • per this definition, how many more steps does it take to find Maxwell?

  • M-A-X-W-E-L-L.

  • How about if there's four billion names in the dictionary?

  • How long does it take to find Maxwell?

  • M-A-X-W-E-L-L. It's invariant.

  • And if we assume that no human name is going to be super long,

  • it's effectively constant whether it's 10 characters, maybe 30 characters

  • or whatnot.

  • That's effectively constant, which means a trie gives you constant time look up

  • or big O of one, which means it's in theory the fastest of data structures.

  • But of course you pay a price with more memory.
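A minimal sketch of a trie along these lines, with several assumptions: only the letters A through Z are supported, case is ignored, and a boolean `is_word` plays the role of the delta symbol in the picture. The names `trie_node`, `trie_new`, `trie_insert`, and `trie_contains` are made up for this example and are not taken from the problem set's distribution code.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 26

typedef struct trie_node
{
    bool is_word;                            // true if a word ends here
    struct trie_node *children[ALPHABET];    // one pointer per letter
} trie_node;

// Allocates an empty node, or returns NULL if out of memory
trie_node *trie_new(void)
{
    trie_node *t = malloc(sizeof(trie_node));
    if (t == NULL)
    {
        return NULL;
    }
    t->is_word = false;
    for (int i = 0; i < ALPHABET; i++)
    {
        t->children[i] = NULL;
    }
    return t;
}

// Adds word to the trie, one node per letter.
// Returns false on a non-letter character or allocation failure.
bool trie_insert(trie_node *root, const char *word)
{
    trie_node *cur = root;
    for (const char *p = word; *p != '\0'; p++)
    {
        unsigned char c = (unsigned char) *p;
        if (!isalpha(c))
        {
            return false;
        }
        int i = tolower(c) - 'a';
        if (cur->children[i] == NULL)
        {
            cur->children[i] = trie_new();
            if (cur->children[i] == NULL)
            {
                return false;
            }
        }
        cur = cur->children[i];
    }
    cur->is_word = true;   // mark the end of the word
    return true;
}

// Takes one step per letter of word, however many words are stored
bool trie_contains(const trie_node *root, const char *word)
{
    const trie_node *cur = root;
    for (const char *p = word; *p != '\0'; p++)
    {
        unsigned char c = (unsigned char) *p;
        if (!isalpha(c))
        {
            return false;
        }
        cur = cur->children[tolower(c) - 'a'];
        if (cur == NULL)
        {
            return false;   // the path for this word does not exist
        }
    }
    return cur->is_word;
}
```

Looking up "Maxwell" follows seven pointers no matter how many other names are stored, which is the sense in which lookup is effectively constant. The memory cost is visible too: every node carries 26 pointers, most of them `NULL`. After inserting "Maxwell", a lookup of "Max" returns false because no word is marked as ending at the X node.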

  • And I know we're one minute over but let me tease you with this final look.

  • And you'll see this data structure's implementation with Zamyla.

  • But we begin to do transitionally now, especially if you're

  • a little worried, especially as we're coming on the midpoint of the semester,

  • like oh my god.

  • Things are getting more and more sophisticated.

  • We're kind of at the peak of a hill here because after problem set

  • five do we transition to HTML and CSS and Python and JavaScript and web

  • programming more generally.

  • And next week, how the internet works.

  • [VIDEO PLAYBACK]

  • [MUSIC PLAYING]

  • - He came with a message, with a protocol all his own.

  • [MUSIC PLAYING]

  • He came to a world of cruel firewalls, uncaring routers, and dangers

  • far worse than death.

  • He's fast.

  • He's strong.

  • He's TCP/IP.

  • And he's got your address.

  • Warriors of the net.

  • [END PLAYBACK]

  • DAVID MALAN: All right.

  • All that and more next week.

  • We'll see you then.

[MUSIC PLAYING]
