Placeholder Image

Subtitles section Play video

  • [MUSIC PLAYING]

  • [MUSIC - FERGIE & Q TIP & GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"]

  • FERGIE & Q-TIP & GOONROCK: (SINGING) Just one night,

  • all we got, just one nigh, all we got, just one night, all we

  • got, just one night, all we got.

  • A little party never killed nobody.

  • So we going to dance until we drop, drop, drop.

  • Mm, a little party never killed nobody.

  • Right here, right now, is all we got. skeeten-bod-op-bop.

  • [MUSIC - FERGIE & Q TIP &

  • GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"]

  • (SINGING) A little party never killed nobody.

  • So we're going to dance until we drop.

  • Let's go.

  • A little party never killed nobody.

  • Right here, right now is all we got.

  • Oh, a little party never killed nobody.

  • So we're going to dance until we drop.

  • A little party never killed nobody.

  • [MUSIC - FERGIE & Q TIP &

  • GOONROCK, "A LITTLE PARTY NEVER KILLED NOBODY"]

  • MUPPET: [SNORING]

  • DAVID J. MALAN: All right.

  • This is CS50, and this is lecture 8.

  • And that was a look at the so-called CS50 hackathon, this annual tradition

  • with which we nearly end the semester that will start around 7:00 PM here

  • in Cambridge, will then end around 7:00 AM the next day in Cambridge.

  • And punctuating the evening will be quite a bit of work

  • on final projects, which is the overarching goal of the evening

  • to give students, both an academic and a very

  • social and collaborative environment in which to tackle

  • their final achievements in the course.

  • But also, as you gleaned there will be a bit of distraction, a bit of food

  • from Felipe's, typically, around 9:00 PM, a little bit of Domino's Pizza

  • around 1:00 AM.

  • And for those still standing as in the muppet there,

  • we'll treat you to breakfast at IHOP if you still have the energy.

  • Thereafter, we'll be the so-called CS50 Fair, the climax of everything

  • that you've done in CS50.

  • And more on that in the weeks to come.

  • So without further ado, where have we been these past several weeks?

  • Well, recall that, over the past couple of weeks,

  • we've been introducing web programming.

  • And most recently, did we try to tie together a lot of the topics

  • from the past few weeks.

  • HTML and CSS and JavaScript, and then Python, and then another framework,

  • another piece of software called Flask that

  • just made it easier and more possible to build web-based applications.

  • And the simplest example of that last week, recall,

  • looked a little something like this.

  • And this past week with the problems on similarities and survey,

  • have you been building your own web-based applications.

  • But they haven't quite had all of the pieces

  • that you might want to assemble into a web-based app.

  • And so today, we'll fill in those final blanks.

  • But recall that we've been trying to frame these applications, not

  • as one-offs or just tools that you built, but as part

  • of a common paradigm or a methodology.

  • And indeed, when you're learning computer science and software

  • engineering, you start to notice patterns in the software,

  • in the code that you're writing.

  • And humans tend to adopt these patterns because they save you time.

  • And then everyone can speak the same language, so to speak.

  • So MVC was the acronym we introduced last week, Model View Controller.

  • And that just speaks to this paradigm whereby

  • you organized certain type of code in one file, certain types of other code

  • and another file.

  • So your Python code goes in application.py.

  • Your HTML goes in your HTML files, your CSS in your CSS files, and so forth.

  • But what we didn't have last week was this thing here, really.

  • Model generally refers to your data.

  • And while survey, your most recent problem,

  • did have data, a CSV file that's arguably a model,

  • CSV is not terribly expressive.

  • You can pretty much just write rows to a text file, not unlike a spreadsheet.

  • But you can't really query it.

  • You can't easily insert or delete.

  • You pretty much would have to reconstruct the whole file.

  • And indeed, that's exactly what I did back in the day.

  • The very first web-based application I wrote back in sophomore or junior year

  • was the freshman intramural website.

  • And all I had available to me, both technologically and conceptually,

  • we're CSV files.

  • I had no idea what databases were, didn't realize how much easier they

  • could make my life.

  • So I stored all of the data behind these links here just

  • in very simple text files.

  • But today, will give you so much more power.

  • And especially, if you're coming into the class with an interest

  • in applying CS to other fields in medicine or the arts

  • or any world in which there's data, particularly,

  • in STEM and data science and the like, realize

  • that SQL has really become this incredibly powerful language

  • with which to solve problems in those and so many domains.

  • And ultimately, what your build, this coming week--

  • your very last CS50 problem set--

  • you have now peaked just about-- will be CD50

  • Finance, a web-based application by which you can buy,

  • or "buy" and sell stocks.

  • And so what you're going to do this coming week is write code

  • that implements a web application that resembles this whereby

  • your users and yourself will be able to register for an account,

  • log into that account.

  • You'll be given for free, thanks to us, 10,000 virtual dollars.

  • And then you can go ahead and buy and sell stocks via their symbol

  • using this application.

  • So for instance, if I go ahead and register myself here.

  • Let's say, username Malan, password 12345--

  • shouldn't have said that.

  • Go ahead and register here.

  • You'll see that, by default, I get this free $10,000 in virtual cash.

  • You know what, I'm going to go ahead and buy myself

  • a share of maybe Netflix, whose symbol is NFLX.

  • I'm going to go ahead and try to buy one share.

  • Voila.

  • And now you'll see that, not only do I have less cash left, about $9,600,

  • I now own a share of stock.

  • And if we reload this throughout the lecture, frankly,

  • the price might go a little up or a little down

  • because this web application you'll build

  • will also talk to an API, an Application Programming Interface--

  • so some third-party website that provides you with stock data.

  • So you ask it for the current price of a stock.

  • It gives you an answer in JSON format, if you

  • recall our short discussion of that.

  • And you'll integrate it, ultimately, into this interface.

  • So this will be the culmination of so many of the ideas

  • and the building blocks with which you've experimented.

  • But we need a few final pieces today.

  • So first and foremost, when I registered for this site, I created an account.

  • And indeed, I can go ahead and log out now.

  • And if I go ahead and log back in--

  • if no one's stolen my password yet, I can see my account and only my account.

  • But we haven't yet seen any mechanism in code

  • whereby you can implement this notion of logging in, right?

  • What is the underlying logic via which a website knows that you are logged in

  • or you're not logged in, right?

  • Clearly, when I went to this website just a moment ago,

  • it prompted me to log in because I was not because I had logged out.

  • But how did it know that?

  • And now that I am logged in, how does it know that I can click on "quote"

  • to get just the stock quote or "buy" just to buy a stock or "sell."

  • Why does it not prompt me every time I click a link to log back

  • in with my username or password?

  • What do you think?

  • AUDIENCE: Isn't cookies little files that

  • saves the data inside your browser's cache that let it know what's going on.

  • DAVID J. MALAN: Yeah, cookies--

  • little files that get saved by websites on your computer.

  • Now odds are, most everyone here has probably heard in some form,

  • cookies-- bad.

  • Maybe bad, privacy-invading, yes?

  • OK.

  • So true.

  • But most of the web would not work without them.

  • So someone else, what do you understand a cookie to be?

  • What is a cookie?

  • AUDIENCE: Basically saves your information

  • so you don't have to put it in every time.

  • DAVID J. MALAN: Yeah.

  • It saves your information so you don't have to put it in every time.

  • So in simplest form, it's exactly a combination of those answers.

  • Whereby, when I log into a website, you could

  • imagine that website just remembering my username and my password

  • by just saving them in a little file on my Mac or PC

  • so that the next time I visit a web page,

  • it just automatically sends them for me, so that the human doesn't

  • have to type them again and again.

  • Now let's find fault with that.

  • Feels like that would work, but what's a downside of that naive implementation?

  • What do you think instinctively?

  • AUDIENCE: The data is just out there?

  • DAVID J. MALAN: The data is just out there, right?

  • I have people in the office or at home or in a library

  • that I might not want to just be able to walk up to that computer

  • and just see this little text file planted on my computer.

  • Because then they could log in to my account by just digging around.

  • So it feels a little invasive.

  • Indeed, that's one of the threats of cookies.

  • So it turns out cookies or websites tend not to do that, at least,

  • if they're implemented well.

  • They instead just plant a really big random number on your hard drive

  • or in your computer's ram or memory.

  • So 1234567-- like, some really big number maybe with letters and numbers

  • maybe even some punctuation that uniquely identifies me.

  • And thereafter, any time I click a link on that same website,

  • after logging in, my browser, thanks to HTTP--

  • the language that browsers and servers speak that we discussed a few weeks

  • back--

  • just magically sends that same big random number

  • to the server again and again.

  • And so long as the server remembers that that big random number corresponds

  • to user name Malan, it can figure out whose account to show.

  • And why is it a big random number?

  • Why is it not just my password or my username?

  • Why am I proposing that it be a big random number?

  • AUDIENCE: Sometimes people can guess it by just using a random generator.

  • DAVID J. MALAN: Exactly.

  • In computer science, randomness is this incredibly powerful ingredient.

  • If you pick a big enough word or a big enough string

  • or sentence, the probability that some random adversary, or bad person

  • on the internet, is going to be able to guess or try

  • to guess that value is just so low, it's just not realistically

  • ever going to happen in your lifetime statistically.

  • And so random this gives us that capability.

  • And so you can think about this metaphorically in the real world

  • as being like a hand stamp.

  • If you've ever gone to an amusement park or a bar

  • or a club where you have to show your ID or you

  • have to pay for a ticket on the way in, sometimes they'll

  • stamp you with either visible or invisible ink.

  • And that's largely for efficiency so that, thereafter,

  • when you're in the amusement park, you can come and go.

  • And you don't have to repay or reprocess.

  • If you're in the bar or the club, you don't have to keep showing your ID.

  • They can check once, and then more efficiently let you come and go

  • as you please.

  • Because you're just presenting your hand or this virtual hand

  • stamp to the bouncer or to the gatekeeper at those places.

  • So cookies are exactly like that.

  • Unbeknownst to us all this time, anytime you visit a website into which you've

  • logged in, your browser is secretly but usefully

  • presenting a hand stamp to that server to remind it who you are.

  • Or rather, not really who you are, but of that big random number

  • so that, if the server remembers who that number belongs to,

  • it can figure out whose account to then show.

  • So put more concretely, if I actually pull up some of the HTTP examples

  • that we looked at in the past, let's consider this in context.

  • So almost everyone here has probably used Gmail at some point.

  • And you log in generally via page like this.

  • So it might be infrequent because you're not often

  • prompted to log into email because of-- surprise, surprise-- cookies

  • from Google being on your computer.

  • But let's see where those come from.

  • So when you request Google or gmail.com, you might send, in a virtual envelope,

  • so to speak, from your Mac or PC to the server, a message like this.

  • Families who've not seen this before, this

  • is what your browser is actually sending to a server in order

  • to request a home page of gmail.com.

  • Now, I'm going to simplify this a little bit

  • because there's a bunch of redirects--

  • HTTP 302s and 301's involved that aren't that interesting today.

  • So let's just assume that Gmail responds immediately.

  • Typically, Gmail would respond with this, saying 200 OK.

  • Here's the login page.

  • And here's a web page in text/html format.

  • But once you've logged in, what your Gmail actually does

  • is it also sends this.

  • Recall that we call these things HTTP headers--

  • key value pairs separated by colons that are semi-secretly sent from browser

  • to server and from server to browser.

  • Now we, more sophisticated developer types can see this stuff, right?

  • You can open up Chrome's network tab and start poking around.

  • And it's not secret per se.

  • It's just most people don't know it's there.

  • And what Google and other companies are doing is

  • they literally send a header called set-cookie, the value of which

  • is that big random value that the server has decided for you.

  • Your browser, assuming that it speaks HTTP properly,

  • should then save that value in RAM, your computer's memory,

  • or on your hard drive.

  • And then, every other click you make on gmail.com should send,

  • not just headers like this, but it should send the opposite header--

  • just cookie, not set-cookie, but cookie, which

  • is the presentation of that hand stamp.

  • And so every time you click a page on Gmail, or Facebook,

  • or almost any website into which you've logged in.

  • Those cookies have been planted on your computer.

  • Now recently, in both Europe and in the US,

  • have laws been passed that are increasingly putting pressure

  • on companies that operate internationally to present you

  • with cookie-based information, right?

  • You get these little pop-ups increasingly.

  • And you've been getting them for years in Europe saying,

  • can we plant cookies on your computer?

  • So given all of this, what might happen if you say

  • no, no cookies on my computer?

  • AUDIENCE: You have to keep logging in.

  • DAVID J. MALAN: You're going to have to keep logging in, right?

  • If you take away this fundamental HTTP feature, much of the web

  • breaks, or the user experience deteriorates significantly.

  • Now the flip side is cookies can be used, not only for good, so to speak,

  • but also for evil.

  • It turns out that there are a lot of ads on the internet, of course.

  • And that's what drives a lot of the revenue that makes it all possible.

  • Those ads typically come from image tags in your HTML or script tags

  • or some of the HTML tags we have seen in the class.

  • But they go to third-party servers, somewhereelse.com.

  • And the problem arises with cookies when those third parties are allowed

  • to plant cookies on your computer.

  • Because if you go to Google, and you go to Facebook,

  • and you go to Twitter-- bunches of websites.

  • Suppose they all have some middlemen advertising service advertising

  • on each of these websites.

  • That middlemen, so to speak, because they

  • have their ads and, therefore, their HTTP headers on this site,

  • and this site, and this site, there are big third parties

  • out there-- lots of them advertising networks-- that know everywhere

  • little old you is going because they see your cookie appearing

  • from multiple different other sites.

  • And so here's where cookies become, not a computer science, engineering feat,

  • but really a threat to one's privacy because they

  • can be used so easily for tracking.

  • And frankly, AT&T and Verizon, as an aside, got into trouble--

  • not enough people know this--

  • some years back when they started injecting, forcibly, additional HTTP

  • headers similar in spirit to this to all of people's cell phone traffic--

  • so not things you could even opt out of initially.

  • Because this was a way for advertisers and for themselves

  • to be able to track users.

  • So these HTTP headers on which cookies are based

  • are very powerful, but also very invasive.

  • And we're only now starting to see, societally and politically,

  • pushback on this very simple mechanism that, hopefully, we as CS types

  • just understand the mechanics of and, therefore, now the implications.

  • So let's see this in context, for instance.

  • Let me go ahead and open up a relatively small example in CS50 IDE.

  • For those unfamiliar, CS50 IDE is a web-based application

  • via which you can write programs in the cloud,

  • just using any browser on a Mac or PC.

  • I'm going to go ahead into my account here.

  • And I'm going to go into Store.

  • And I'm going to go into the Templates directory

  • and show that this example here has a few files.

  • This will look familiar to students in the room.

  • Application.py is the web-based application.

  • And then we have some template files, so to speak.

  • In my Terminal window here, I'm going to go ahead and, with just my keyboard,

  • go into this directory called Store in source 8, which is available online.

  • And I'm going to go ahead and do a flask run, which, for those unfamiliar,

  • is the command via which you can start a web server

  • and start running a web-based application.

  • So now that I have that, I'm going to go ahead and visit exactly this URL here.

  • And we'll see a relatively simple and super ugly web store.

  • Let me go ahead and zoom in a bit.

  • And this web store allows me to buy three things-- foos, bars,

  • and bazes, whatever those are.

  • And this is a very simple e-commerce-like site

  • where I just have to type the quantity of foos, bars, and bazes that I want.

  • And then I can go ahead and buy them.

  • So I'm going to go ahead and say, give me one foo, how about zero bars,

  • and these two bazes.

  • And I'm going to go ahead and click Purchase.

  • And now you'll see my shopping cart.

  • But if I continue shopping, you'll see that it

  • resets, just like if you keep browsing Amazon or whatever other website.

  • But if you want to check shopping cart again,

  • notice it's remembered what I'm looking for.

  • And in fact, you know what?

  • I'm going to go ahead and close the window.

  • Oops.

  • I actually lost the website.

  • But you know what?

  • If I go back to that URL and reopen, I see the storefront.

  • But if I view my shopping cart, it has remembered my state.

  • So notice the power now of cookies.

  • It's not just to remember with a hand stamp who you are.

  • But now you can remember anything about that user that you want, right?

  • A shopping cart on Amazon or any website is the best example of this

  • because it would be horrible, horrible, horrible for a User Experience or UX

  • if every time you click the darn link, you

  • lost the contents of your shopping cart because the website forgot who you are.

  • So this is a compelling feature to remember.

  • And it's cookies that implement this feature too.

  • Because more generally, what's going on with cookies is this.

  • When you set a cookie using the set-cookie header that

  • looked like this, there is a key value pair

  • to the right that we might call session, and that has the value.

  • Value is the hand stamp.

  • Session is just a term of art that refers to this abstraction

  • that you can think of really as a shopping cart.

  • But it doesn't have to have anything to do with actual shopping or e-commerce.

  • It's just a container in which you can store stuff.

  • So this is telling the website that my session value, my hand stamp

  • is 12345-- and big really random number.

  • The website can then say, you know what?

  • I'm going to store a container for you.

  • Let's call it a Python dictionary, or dict for you,

  • inside of which I can put anything I want.

  • And so in fact, when you go to a shopping cart like this,

  • what is the server actually doing?

  • Well, upon seeing your hand stamp, and realizing, oh, you are user 1234567.

  • Oh, that's username Malan.

  • Let me go ahead and grab the dictionary, the Python dict

  • from memory, or even from a CSV file if you want,

  • and show you the contents of your shopping cart.

  • Meanwhile, someone else, Brian visits the same exact website

  • and logs in with his account.

  • He's going to present a different hand stamp, presumably.

  • And so the website can look for a different dictionary

  • and show Brian the different contents of his shopping cart.

  • And the same for everyone in this room.

  • So this simple mechanism--

  • I mean, consider the power.

  • This is just a stupid text value key:value.

  • And from all of that does all of the web's capabilities now come.

  • And we won't look in too much detail into the code here,

  • but let me show one snippet of it.

  • If I go into application.py, the magic that makes this possible in flask

  • is just to import this additional feature.

  • You'll recall, from the past couple of problems

  • you've worked on, you import from flask a few things--

  • render template, redirect, request.

  • Well, if you Add session to that list, which

  • is the code abstraction of this hand-stamp idea, what you can now do

  • is this.

  • Let me scroll down to the juicy part here.

  • And notice that you can do lines of code now,

  • and for the next CS50 finance problem, lines of code like this.

  • Session is just a Python dictionary, or dict.

  • You can index into it using any word--

  • foo, or bar, or baz, just like you could with dictionaries more generally.

  • And you can store in it anything you want.

  • In my case, I want to store a number--

  • 0 foos, or 1 foo, or 2 foos or whatever.

  • So I can simply convert to an int, the user's request forms

  • item, whatever that is.

  • And let me wave my hand at some of the code above because it just

  • sets the whole thing up.

  • But the new line of code, the new feature

  • that now will empower us to build something like a stock trading website

  • is quite simply this line here.

  • Because what you can also do to remember that a user is logged in,

  • just go ahead and store in this so-called session,

  • a value of, like, true.

  • And if the value true is there, you can infer that the human is logged in.

  • And if there's no such value in the dictionary, they are not logged in.

  • And so we'll hold your hand a bit more in the next problem set

  • with this introduction of CS50 Finance.

  • And we'll write some of the code that handles the login,

  • so you can see by example how to do this.

  • But it'll be up to you thereafter to start

  • remembering what stocks a user actually has

  • using sessions to retain the fact that they've logged in already.

  • All right.

  • So that was a lot all at once.

  • But any questions on cookies and the feature they provide,

  • these things called sessions?

  • Anything at a all?

  • Yeah?

  • AUDIENCE: Is the session saved on the user side or the server?

  • DAVID J. MALAN: Is the session saved on the user side or the server?

  • Really good question.

  • And the answer can be it depends.

  • You could store on the user's client, on their browser.

  • And as the gentleman over here proposed, you could theoretically store literally

  • their user name and password-- maybe I proposed that earlier--

  • on their computer.

  • And that's bad for the reasons that we surmised earlier.

  • But you could also store the contents of their shopping cart, foos, bars,

  • and bazes.

  • That's not quite as invasive as storing their password.

  • But if you're buying things you don't really want people knowing,

  • that is then invasive.

  • So maybe we can do better.

  • And better is often, store it on the server.

  • So a well-designed website will typically

  • store only this big random value, the hand stamp, on the Mac or PC.

  • And then all of the interesting and maybe sensitive stuff

  • is stored in a database or CSV file or just the server's RAM or memory

  • like in a global variable.

  • Good question.

  • Other questions?

  • Other questions?

  • No.

  • All right.

  • So that's one problem solved.

  • We know now we can implement login forms.

  • And we know that we can remember that people are logged in.

  • So let's just stipulate that is now possible.

  • But over the past couple of weeks, it's not been incredibly powerful

  • to only have access to things like CSV files, comma separated values, which

  • create the illusion of Excel and Google Spreadsheets

  • and Apple Numbers like columns and rows.

  • Why?

  • Well, it's pretty much a linear search for everything, right?

  • A CSV file is just rows and rows and rows and rows of data.

  • And if you want to search for anything in that

  • file, like you might have wanted to for your survey implementation,

  • how do you find it?

  • Well, you open the file with Open.

  • You maybe use a for loop and iterate over every single line

  • looking for some value, and then you close the file.

  • That is big O of n.

  • And in the worst case, the thing you care about is at the very end.

  • And it's not terribly efficient.

  • Now, you can append to files pretty efficiently.

  • Recall that when we opened files with quote unquote, "A" for Append mode.

  • We did this with a brief example, instead of "W" for Write.

  • You get the operating system's help and add rose to the file

  • at the very bottom, which is more efficient.

  • But you can't insert things in the middle very efficiently.

  • You can't delete things very easily.

  • You would have to literally, for those kinds of scenarios, open the file,

  • read the whole darn thing, then write out parts of it

  • or add to it as you're writing out.

  • And so humans years ago realized, well, this is stupid.

  • All of us humans in the world are constantly

  • writing code to open files, change files, save files.

  • Why don't a few of us do an even better job of implementing that feature,

  • then share it with the world?

  • And thus were born, effectively, databases.

  • And these days, there are so many different types of databases--

  • you might be familiar with tools, commercial software like Oracle,

  • or SQL Server, or Microsoft Access.

  • And in the open source world, there's Postgres and MySQL,

  • and SQLite and others.

  • And many of them, as even those names imply,

  • use a special language called SQL, Structured Query Language, which

  • is the very last language we'll introduce

  • you to in CS50 unless you go off on your own with final projects

  • to pick up something more.

  • But with SQL, you have the ability to select data from a database,

  • to insert it, to delete it, to updated it.

  • All of the things that you could absolutely do with Python and CSVs,

  • it would just be so darn tedious to write those lines of code yourself.

  • So what's the right mental model?

  • Here is, I claim, essentially a database.

  • Like most everyone in this room has probably

  • used Google Spreadsheets, and if you've not, probably

  • Excel or maybe Apple Numbers.

  • So these are spreadsheets.

  • And they're essentially what you could call a relational database.

  • Relational-- implying that there's relationships

  • among the various data in the rows and the columns decided by you

  • or whoever made it.

  • And this is a spreadsheet of course.

  • And it has rows where your actual data goes.

  • And columns-- and what's noteworthy about the columns?

  • It's just conventional to do what with the columns,

  • especially in that first row?

  • AUDIENCE: Headers.

  • DAVID J. MALAN: Yeah, put some header values, right?

  • You could certainly put them in the left.

  • It's just humans prefer, it seems, to read top

  • to bottom instead of left to all the way to the right.

  • So we just adopted a conventioneers ago that your columns represent

  • different types of data, and the rows represent different values

  • for those fields or for those columns, so to speak.

  • So if you wanted to store a spreadsheet of values

  • for a bunch of students in a class or in a university,

  • one column might be ID number, like their Harvard ID or the Yale ID.

  • Another column might be their name, their email address, phone number, age,

  • and so forth.

  • And you could just lay that all out.

  • You can make it pretty and boldface the top and organize things and sort it.

  • But at some point, this isn't quite the right tool.

  • And in fact, I ran into this in grad school years

  • ago where I was analyzing large sets of data.

  • And it was just convenient to double-click on the CSV file,

  • open it in Excel, and just manipulate it and answer the data

  • questions that I had.

  • But back in the day, Excel actually had a limit of 65,536 rows.

  • Sounds like a lot--

  • and that's 2 to the 16th power, if you call the reference there.

  • And so humans at Microsoft decided years ago, that's enough.

  • And frankly, it kind of was.

  • Because even in the thousands of rows, like,

  • my Mac just became unusable because it was just too much memory being used.

  • So it was just an impractical limit anyway.

  • But at some point, you might want to store

  • 65,537 rows of data or hundreds of thousands of rows or millions of rows

  • or even more.

  • I mean, there are so many websites, Google and Facebook

  • and others that surely do this already.

  • You can't just use a program on your Mac or PC anymore.

  • You need to use fancier software.

  • But that fancier software often still works the same way.

  • You have one main file called a spreadsheet in our human world.

  • But in the server world, you might call it a database instead.

  • And whereas, in our human world, we might

  • have things called Sheets or individual spreadsheets within the bigger file,

  • in a database, you're going to have things called tables.

  • But they're the exact same thing with rows and columns.

  • And so when we want to actually store data,

  • we can actually store it in exactly those ways.

  • So let me go ahead and do this.

  • Let me go ahead and open up Google Spreadsheets just

  • as representative of a database.

  • And let me go ahead and create a new file here.

  • So New Spreadsheet-- and just so that we can represent things like students.

  • I'm going to call this spreadsheet Students.

  • And over here I'm going to put their ID number and their name, and what

  • are some other fields I rattled off that come to mind?

  • AUDIENCE: Email.

  • DAVID J. MALAN: Email.

  • AUDIENCE: Dorm.

  • DAVID J. MALAN: Say again.

  • AUDIENCE: Dorm.

  • DAVID J. MALAN: Dorm.

  • Good.

  • AUDIENCE: Sports and stuff they want to do.

  • DAVID J. MALAN: OK.

  • So sports they want to do, sure.

  • That's fine-- a little longer.

  • What else?

  • AUDIENCE: Graduation year.

  • DAVID J. MALAN: Graduation year, age, OK, good.

  • So graduation year, age, concentration--

  • OK, and we can probably keep going, and it just gets wider and wider.

  • And these are my columns, if you will, or the fields in the database.

  • And frankly, I could name this if I really want to be tidy here,

  • and I could call this Students.

  • And you know what?

  • Maybe I should call this not students but university.

  • Because I might want to have another tab here that we'll preemptively

  • name to, say, faculty.

  • And faculty probably have ID numbers, and they have names,

  • and they probably have departments, for instance.

  • And they also probably have emails and so forth.

  • So phone number is another one.

  • But I might want to kind of cluster these different types

  • of data in different spreadsheets because they're

  • kind of apples and oranges.

  • There might be some overlap, but they're distinct inside different sheets.

  • In the database world, you do the exact same thing,

  • except you call the whole thing a database,

  • and you call these things tables to be clear.

  • But more importantly, with a real database, so to speak,

  • you have to be a little more proactive about telling the database what

  • types of data you want to store.

  • Now, those of you who've used Google Spreadsheets and Apple

  • numbers and Excel know that you can go to the format menu or the data menu

  • or whatever, and you can show things to different decimal places.

  • You can turn things to percents.

  • And you can format the data.

  • But that doesn't tend to have any functional impact on your data.

  • It's just an aesthetic detail, like, how long or short do

  • you want to show the numbers to be in your columns and cells.

  • But with databases, you can actually provide juicy hints

  • to the database that will help that database store and find data for you

  • more efficiently.

  • For instance, there's this laundry list of data types

  • to which we have access in SQL.

  • So SQL is the language with which we're going to be able to query a database.

  • Frankly, in something like Google Spreadsheets and Microsoft Excel,

  • how do you find data?

  • We'll usually hit Command-F or Control-F, and you search for it.

  • That's not very programmatic.

  • You can do it with macros and such.

  • And we'll do this with code.

  • But with SQL, you're going to have to make a few decisions first.

  • SQL supports a few different data types, at least, in one of its forms.

  • So it turns out there's a lot of competition out there.

  • SQL is a general language.

  • But different companies and different nonprofit groups

  • have come up with their own dialects.

  • So much like we humans might speak--

  • bunches of us might speak Spanish.

  • Well, there are different variations of Spanish, not to mention

  • many, many, many other languages.

  • And so similarly, in the programming world,

  • as there's sort of a common subset of words in a language.

  • But then different companies and different people

  • might have added and subtracted their own features just because they think

  • that's better for their use cases.

  • So in SQLite, which is the database technology we will use in CS50 IDE,

  • and Android uses this, and it's very popular in other contexts

  • too, IOS as well--

  • they only have a few data types.

  • Blob, of all things, integer, numeric, real, and text.

  • Now what does that mean?

  • Well, it turns out, it means different things with different databases.

  • In another popular database, it's called Postgres or PostgreSQL.

  • This is a very popular one for high-performing websites.

  • So if you're trying to build a business, you're

  • trying to make an application used by everyone on campus, generally,

  • you would use a fancier technology than SQLite, which is, by definition,

  • lightweight.

  • And you'd do something like Postgres.

  • And so we'll introduce you to both of these dialects along the way

  • because each of them allows us to do different things

  • with a different degree of precision.

  • So in SQL, let's claim, even though it's a simplification,

  • there are at least these data types.

  • When deciding what type of data to store in your database table,

  • you have to decide, not just what it's called, like ID, and name, and email,

  • and dorm.

  • You have to decide what data type it is.

  • So we've kind of come full circle.

  • Because the last time we did this was in what language?

  • C.

  • So in SQL too you do this, but it's not to be nitpicky.

  • It's to actually improve performance.

  • So this is to help you get at that data more efficiently.

  • Because the more the server, the database, knows about your data,

  • the faster it can find it essentially.

  • So in the world of databases, you generally

  • have to decide, not just to store an integer, but either

  • a small int of integer, or a big int.

  • And generally speaking, at least in most databases, this might be 2 bytes.

  • This might be 4 bytes.

  • This might be 8 bytes.

  • So let's maybe impress the families here.

  • If you have a 32-bit integer, or 4 bytes,

  • what's the biggest integer you can store in your database?

  • (WHISPERS) 4 billion.

  • Someone please, impress.

  • AUDIENCE: 4 billion.

  • DAVID J. MALAN: 4 billion is exactly correct.

  • And it might be 4 billion positive values, or maybe it's only 2 billion

  • if you also want to have 2 billion or so negative numbers as well.

  • Big int uses 64-bit, which is just massively, massively bigger.

  • And tiny int or small int rather--

  • there's also tiny int in some dialects-- small int is just 2 bytes,

  • and it actually counts up as high as 65,535, as I alluded to earlier.

  • So you just need to decide.

  • Now, why would you choose one or the other, when you could just use big int

  • and store small numbers and big numbers alike?

  • Well, why might you just intuitively not want to make all of your integer values

  • in a database big ints.

  • It's just simple-- give me as much headroom as I want.

  • AUDIENCE: You would be using memory.

  • DAVID J. MALAN: Yeah.

  • You're just using memory unnecessarily, right?

  • Why use more memory than you need to?

  • Now, arguably-- god, memory is so cheap these days.

  • Computers are getting bigger and faster.

  • So in some contexts, it's not a big deal.

  • But if you have lots of data, and every one of your rows in a database

  • is 2 bytes bigger than needs to be, and you have 1 million rows,

  • that's already huge numbers of bytes that you're just wasting and throwing

  • away.

  • So it behooves you to actually give some thought to these data types.

  • Of course, you don't want to go too small

  • because, if you use a small int or maybe even an integer,

  • but business is so darn good that you have over 4 billion users or 4

  • billion rows or 4 billion foos and bars and bazes that have been sold,

  • which absolutely happens to the Googles and Facebooks of the world,

  • you better start using 64 bits to represent unique values instead of just

  • 32.

  • Because eventually, you'll run out.

  • Well, what else?

  • Besides integer values in SQL, you might want to use floating point values.

  • Those are real numbers that have decimal points in them.

  • This is quite like C. They're called a little something different.

  • And annoyingly, it uses two words in this case.

  • But a real number is a 32-bit value that is a floating point

  • value, like a float in C, and double precision

  • is like a double in C, which is 64-bits.

  • And that just lets you get even finer-grained control

  • over the decimal point as we've seen.

  • Because of course, you can't represent numbers precisely

  • because of imprecision, as we've seen in both Python and C.

  • Numeric is kind of a catch-all for a bunch of number-related things

  • that aren't quite integers or reals necessarily.

  • Boolean, a true or false value, dates if you

  • want to literally store, like, year, year, year, year, dash, month, month,

  • dash, day, day in a standard computer format,

  • even though human cultures vary how they write that.

  • Datetime if you want have a date and a time right after it--

  • numeric is actually a solution to a problem.

  • Turns out that, finally, after all this time,

  • we have a built-in mechanism for storing numbers precisely.

  • And this is perfectly timed because, in CS50 Finance, when

  • you're dealing with money, it would kind of be nice if you don't accidentally

  • round off how much money someone has either slightly lower or slightly

  • higher.

  • Those of you may be families especially because, from a few movies back

  • in the day, if you ever saw Superman III,

  • this is how some money was made by shaving fractions of pennies

  • off of computer systems and, more moderately, Office Space.

  • Office Space did this as well-- big scam trying to round off these pennies.

  • Well, if you use the right technology and language, with SQL,

  • you can actually specify a numeric data type, which is like a floating point

  • value, but you specify, essentially, the total number of digits you

  • want it to handle and how many of them should be after the decimal point.

  • So it's common in stock markets to actually use

  • four decimal points of precision.

  • So not just cents to the tenths and the hundredths place,

  • but to the thousandths and ten-thousandths place,

  • you can now specify that with a value of four for precision.

  • And you don't have to worry about those round

  • off errors we talked about before.

  • Time and time stamps-- just how much time

  • has passed since the date and date and time are also available to you in SQL.

  • And then lastly here, text values--

  • turns out you have a choice of values for when

  • it comes to deciding how many characters do you use to store your data.

  • You can use char, which just like in C with an array,

  • you have to decide in advance for a string how many bytes to use.

  • And if you say char 8, you are getting 8 bytes.

  • And if the name is Malan, M-A-L-A-N, you are wasting 3,

  • or if we null terminate it, 2 of those bytes.

  • But SQL introduces a solution that Python kind of gave us too.

  • Varchar as the name implies-- variable number of chars

  • allows you to specify just an upper bound.

  • So if you know that one of your columns is going to be called Name.

  • And a person's name might be M-A-L-A-N-- like five letters,

  • or maybe certainly more for people with longer last names are first names,

  • you can say varchar 100.

  • And that's the upper bound on how long it is.

  • But the database will be smart.

  • If you only need five characters or 5 bytes,

  • it won't waste 95 others for you.

  • It will optimize that for you.

  • And then lastly text, if you want to store tens of thousands,

  • like paragraphs or pages of text, that's even bigger than varchar

  • where you don't necessarily want a well-defined upper bound.

  • You just want a good chunk of memory.

  • But there are implications here.

  • And just quickly, char and varchar seem not even complimentary.

  • Just char seems unnecessary.

  • Why paint yourself into a corner and say give me

  • only 8 bytes when you could just say, eh, give me

  • up to 100 bites, and you, the database server,

  • figure out how many to actually use?

  • Why do these two data types exist?

  • Yeah?

  • AUDIENCE: Time.

  • DAVID J. MALAN: Time.

  • AUDIENCE: It would be longer to run.

  • DAVID J. MALAN: Yeah.

  • So time is spot on.

  • It might actually take longer to use one data type or the other.

  • How could that possibly be?

  • Well, if you've got a value like char where you specify

  • a precise number of characters like 8.

  • If you think about the columns in Google Spreadsheets and Excel,

  • every one of those columns, those cells is exactly the same width, right?

  • It's 8 bytes, 8 bytes, 8 bytes, 8 bytes.

  • And as soon as you have perfectly constant offsets--

  • this is byte 0, this is 8, this is 16, this is 24, this is 32,

  • and so forth, you reintroduce the capability

  • that in C we had with arrays.

  • You have random access doing simple, simple arithmetic-- multiplication,

  • addition, and subtraction, you can just jump to any element in a column

  • if it's using the same number of bytes for every cell.

  • But with varchar, if you need to have small narrow cells and wide cells,

  • you're going to have what's called a ragged array where

  • one side of it, the left-hand side, so to speak,

  • is going to be very straight and rigid.

  • But the other side my kind of vary based on how long the word is in that column.

  • And to your point about time, it's going to take the computer more time

  • to search a varchar field.

  • Because it can't just jump to the third element.

  • It has to search all of them, potentially.

  • But thankfully, at the end of the day, databases don't do that.

  • They don't just devolve into linear search.

  • CSVs devolve, for our purposes, into linear search.

  • The magic and the intellectual property and the secret sauce, so to speak,

  • that goes into databases--

  • Oracle SQL Server, Microsoft Access, Postgres, MySQL, and SQLite

  • is that really smart computer scientists have decided to implement algorithms

  • and fancy data structures, often tree-based--

  • if you recall our discussion of binary trees

  • or tries on a hash tables and those fancier data structures-- they

  • had baked those kinds of data structures and designs

  • into their database software.

  • They've made it freely available or commercially available.

  • And so we now just have an abstraction called a database

  • into which we can store data and just read it, update it, write it,

  • and change it quickly without having to implement all

  • of that low-level plumbing ourselves.

  • And so that's ultimately what a database is going to give us.

  • But we have to actually have access to a database to do that.

  • And we need to actually decide first to help

  • the database by telling it what to store where and how to store it as a type.

  • So why don't we do this?

  • Let me go ahead and do exactly that with a couple of examples.

  • Let me go ahead and open up CS50 IDE again,

  • our web-based programming environment.

  • And let me go into our Terminal window.

  • So for those unfamiliar, a terminal window is just this black and white

  • or this blue and white window in which you can type commands.

  • And rather than point and click and double click on things,

  • you can only do things textually in this window.

  • And it turns out that, in the world of SQLite,

  • we can actually use a command called SQLite3,

  • which allows us to create a database.

  • So in our human world, for CSVs, you've seen how, in code, you

  • can just create rows and columns.

  • You can literally type it in a text box.

  • SQLite is a database technology that uses binary format, 0s and 1s.

  • And so you're going to have to actually store your data differently

  • using a program.

  • So if I want to make a file called, for instance, froshims.db,

  • I can literally say SQLite three froshims.db,

  • and then I'm in this program.

  • This is a text-based program via which I can create my rows and columns.

  • And it, by designer, is going to be a little arcane for a moment.

  • It's going to look a little cryptic.

  • But if I want to go ahead and create a table, like a spreadsheet in here,

  • I'm going to literally type CREATE TABLE.

  • I'm going to call this registrants just like last week

  • when we were looking at froshims examples.

  • And the freshman intramural sports program

  • has students registering for sports.

  • So I'm going to have a registrants table here.

  • I'm going to go ahead now and specify that I

  • want a few columns for this database table, one of which I want to call ID,

  • one of which I want to call name, and one of which I want to call dorm.

  • So I'm going to keep-- whoops.

  • I didn't mean to hit Enter.

  • Let me just clean this up.

  • So now I have three columns, ID, name, and dorm.

  • And if I were to hit Enter, theoretically, it

  • should create a database.

  • It's not going to be graphical like Excel or Google Spreadsheets,

  • but it will exist in my IDE in a file called froshims.db.

  • But I can't just hit Enter yet.

  • Because I have to tell the database a bit more.

  • I need to tell it that, you know what?

  • My ID value should probably be an integer.

  • And you know what?

  • My name should probably be char, or varchar?

  • What should the name of a student be stored as?

  • AUDIENCE: varchar.

  • DAVID J. MALAN: Varchar, OK?

  • So varchar because they're going to vary, right?

  • If you can think of bunches of your friends,

  • they might have short names or long names.

  • OK, what's the upper bound?

  • What's the length of your friends' longest name?

  • AUDIENCE: 50.

  • DAVID J. MALAN: 16?

  • All right.

  • Is anyone in here having a name, first name and last, that's longer than 16?

  • AUDIENCE: That's 50, sorry.

  • DAVID J. MALAN: Oh, 50.

  • Sorry.

  • I heard 16, I thought.

  • Because you're not going be able to register for froshims

  • because the database is going to truncate.

  • And as an aside, if you've ever gone to a website and tried typing into a form

  • field and either your keyboard stops working

  • or you do type a long word or paragraph in, and then you hit Enter,

  • and it's some of it's gone, well, that's because the database likely

  • can only store so many bytes.

  • And frankly, this often happens on customer service forums, right?

  • They don't really want you being too verbose with the customer support

  • staff.

  • They'll cap the length of the field into which you're typing.

  • And they're also doing that on the database,

  • but also probably for practical human reasons,

  • they don't want to read a big complaint that's this long as well.

  • So 50-- anyone have a name longer than 50 characters?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: Yeah, not many names.

  • But you know what?

  • We have technology to answer this.

  • Longest name in world--

  • and it's this guy.

  • Guinness Book of World Records is probably right.

  • So his name is Barnaby Marmaduke Aloysius Benjy Cobweb Dartagnan Egbert

  • Felix Gaspar Humbert Ignatius Jayden Kasper Leroy Maximilian Neddy Obiajulu

  • Pepin Quilliam Rosenkranz Sexton Teddy Upwood Vivatma Wayland Xylon Yardley

  • Zachary Usansky.

  • But he also goes by Nick, apparently.

  • But OK.

  • Let me go ahead and highlight this.

  • And I could count this manually.

  • But I'm sure someone has made a website to count characters in a string.

  • There we go, lettercount.com.

  • OK.

  • Let's go ahead and paste that in, count the characters, 225--

  • so nick will not be registering unless we don't support just 16 or 50.

  • Looks like we need at least 225.

  • So at some point, you do have to make a decision.

  • And honestly, it's not always obvious.

  • You do have to specify that varchar is going to be-- eh, it could be 16,

  • could be 50, could be 225.

  • But you know what? if he has kids--

  • so let's just round up a bit and call it varchar 255, honestly,

  • only because it was a convention.

  • So back in the day, 255 tended to be the max length

  • for this field in older databases.

  • And that at least feels like it's probably enough wiggle room for names.

  • But you have to make a judgment call at some point.

  • We could ignore the problem say 1,000 characters.

  • But if that's never really going to happen,

  • and your potentially wasting space, you probably shouldn't.

  • Because even though it's an upper bound, you're sacrificing something like time,

  • potentially, to search that field if you're just

  • telling the database that it's going to be bigger than it ever actually

  • will be.

  • It needs to be more finely hinted.

  • So what about dorm?

  • Oh, god, now we have to do this again.

  • How do we do it with dorm?

  • What's the longest length of a dorm or a house on campus?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: I'm, like, OK [INAUDIBLE],, house maybe.

  • But honestly, at some point, it doesn't matter too much.

  • But being consistent is what matters.

  • It's a matter of style.

  • It's a matter of design.

  • At some point, you don't have to nitpick every little value.

  • But you should probably pick some reasonable value

  • that you could justify to a roommate or to a colleague or to a teaching fellow.

  • And then say this feels like enough without it being actually excessive.

  • And there's one last thing I'm going to do here is just

  • end my thought with a semicolon.

  • I apologize, they're back.

  • But at the end of my start here, if I haven't made any typographical errors,

  • and I'm going to go and hit Enter.

  • Nothing seems to happen.

  • And ironically, in CS50, in programming, nothing happening

  • is usually a good thing because it means no errors have happened.

  • If, in SQLite, I type schema, I can actually

  • just see a regurgitation of the table I just created.

  • But more importantly, I can start to insert data into this.

  • Now unfortunately, it's not yet graphical.

  • But if I want to insert data into this, I can do this.

  • INSERT INTO registrants-- well, what do you want to insert?

  • I want to go ahead and insert an ID, a name, and a dorm.

  • Well, with what values?

  • Well, with these values--

  • the ID will be-- the first registrant will be 1.

  • The first one to register will be Brian.

  • And his dorm, Brian where did you live?

  • BRIAN: Out on Pennypacker.

  • DAVID J. MALAN: Pennypacker.

  • So it's good.

  • That might be close to four timer too lengthwise.

  • So I'm going to go ahead and do this.

  • Here is an example of a SQL statement.

  • CREATE TABLE is one verb or expression you can use.

  • INSERT INTO is another.

  • You specify the name of the table, the fields of the table, and then values.

  • And now I'm capitalizing just to make clear what are SQL commands

  • and what are actually just words I, the human developer, chose.

  • But it's just a convention.

  • These uppercase words could probably be lower case in most contexts too.

  • But it helps things, I think visually, to distinguish.

  • Now I go ahead and enter.

  • Nothing seemed to happen.

  • That's probably a good thing.

  • Let's go ahead and register one other person.

  • I'll be the second registrant.

  • So maybe David from Matthews.

  • And so Matthews here, Enter.

  • And now, if I want to select all of the students in the database,

  • I can go ahead and say SELECT FROM--

  • or you know what?

  • Let's select everything as denoted in many languages by star,

  • from registrants semicolon enter, and there we have it.

  • It's kind of a tiny super simple, lightweight database.

  • But there are my rows and columns much like Excel

  • and Google Spreadsheets would lay them out for me.

  • But it gets better than this.

  • Suppose I want to search this database for all of the students who

  • registered for Matthews.

  • And suppose that time passes and more students

  • actually register for forshims.

  • I can actually filter this data.

  • I can do something like SELECT star FROM registrants WHERE

  • dorm equals quote unquote, "Matthews."

  • And so I can filter it, hit Enter, and now I get back just one row.

  • And if your should mind starts to wander--

  • wow, if I could introduce Python or JavaScript into this,

  • you know what I could do?

  • I could probably get back, not this and this ASCII-based table, this text

  • table.

  • Maybe I could get back an actual list of rows

  • so that I can actually do something with that data.

  • And that's, indeed, where we're going with this.

  • So if I want to select someone else, I could do hmm--

  • maybe SELECT just the NAME from registrants

  • where dorm equals "Mathews."

  • If I only care about knowing who registered,

  • I could do that and whittle it down to even less data.

  • So already, in just these few commands, I

  • can express so much more functionality than you could with a CSV.

  • To do this and CSV, you would have had to write all

  • of these lines of Python code yourself.

  • What if Brian moved?

  • So Brian really didn't like the Union dorm,

  • so we're going to go ahead and UPDATE registrants

  • and SET dorm equal to-- where do want to move to?

  • BRIAN: Canaday.

  • DAVID J. MALAN: "Canaday" WHERE--

  • I could do this a few ways.

  • What's your instinct?

  • How could I identify Brian and only Brian so--

  • I don't want to move to Canaday.

  • How do we move just Brian?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: Yeah.

  • Maybe we could WHERE name equals quote unquote "Brian" and Enter.

  • I like this, but someone find fault with this if you could.

  • AUDIENCE: Two Brians.

  • DAVID J. MALAN: If there are two Brians, we're moving both of them

  • are all three of them.

  • So maybe better, honestly, would be to say, mm,

  • let's just say where the ID equals 1.

  • Now, of course, I need to know Brian's Harvard ID number or whatnot.

  • But this is going to be more precise.

  • Because the supposition here is that the leftmost column or ID

  • field, by human convention, should be unique so that it uniquely identifies

  • the room so we can have as many Brians as we want,

  • but we're not going to confuse them.

  • If I go ahead and hit Enter now, nothing seems to happen.

  • But if I select star from registrants again, and hit Enter, notice now,

  • Brian has indeed been moved to Canaday.

  • So there are so many other things.

  • Brian, you know what?

  • You're not very good at sports the team concluded.

  • So delete from registrants WHERE ID equals 1.

  • And now if we select star from registrants, we have just me left.

  • But Brian is gone as well.

  • So we have the ability to INSERT, to SELECT, to UPDATE, and DELETE,

  • and CREATE, all the while filtering.

  • And we've only just scratched the surface here of what's possible.

  • Because it turns out, we can store so much data in these databases.

  • All of our students, all of our faculty--

  • if you're a company, you can start to store

  • all of the products in your database, all of the orders,

  • all of your customers.

  • But as soon as you start going down that road, well, gee, what's a customer?

  • Customers have names and ID numbers and maybe email addresses

  • and postal addresses.

  • Those are going to get messy-- phone numbers,

  • which are kind of like integers, but not quite.

  • So there are so many questions we still need to answer.

  • But feels like it's time for some fruit and some muffins.

  • So why don't we go ahead and take a five-minute break, turn on some music.

  • If parents need to depart, that's fine.

  • And we'll see you back here in five or so minutes.

  • All right.

  • So we're back.

  • And where we left off was exactly here using

  • SQLite3, which is the version 3 of the SQLite command, which

  • is just a text-based interface to the technology

  • that we should think of as SQLite.

  • SQLite is interesting and it's lightweight in the sense

  • that it's not a server, it's not fancy software that you have to run.

  • It literally stores all of your data in a file, but that, by convention,

  • ends in .db or maybe even .sqlite.

  • But it's just a binary file, 0s and 1s.

  • It's not text that you can open with Excel or something like that.

  • So it's not a CSV.

  • So it's stored on file.

  • This means that you don't need particular experience

  • with setting up a database server.

  • You don't need memory and software to actually run on the computer.

  • You can store everything locally, but you do pay a price.

  • Because recall from our past discussions, disks are slow.

  • And if you're storing your data on the disk,

  • it's not going to be nearly as fast as storing it only in RAM.

  • And that's why these other technologies like Postgres and MySQL,

  • and Microsoft Access, and SQL Server, and Oracle exist.

  • Those are fancier products where someone literally does double-click an icon

  • or run a command that runs a program that

  • stays running in the computer's memory and gives you even better performance.

  • But for our purposes, pedagogically SQLite is handy.

  • But we're still going to ask the questions about data types

  • because SQLite does support fancier data types like those in Postgres

  • and SQL Server, MySQL, and Oracle, and those are the lower-case ones

  • we introduced a bit ago.

  • But honestly, this is going to get tedious quickly.

  • Even I rarely remember exactly the right syntax when creating tables,

  • the order in which everything has to go.

  • So I tend to use a Graphical User Interface, or GUI myself.

  • And there are lots of tools out there.

  • But we've built one that's free and open source the CS50

  • IDE that just makes it a little easier to edit your SQL tables.

  • So I'm going to go ahead and Exit out of SQLite.

  • And I'm going to go ahead over here.

  • And you'll notice that we have all of the files

  • from today including, most recently, froshims.db.

  • That is the file I created with SQLite3.

  • But if you double-click it when using CS50 IDE,

  • it's actually going to open a program that's called phpLiteAdmin.

  • It happens to be written in another language called PHP,

  • has nothing to do with databases, just the name of the product here.

  • But it's a tool that, using pretty simple HTML tables,

  • just gives us a graphical user interface over the exact same functionality.

  • And personally, I just find this easier to use.

  • And pedagogically, it's going to be better

  • because it's going to show us the available data types for our table.

  • So for instance, notice here-- there's a lot going on the screen.

  • But a lot of this is just uninteresting details.

  • But notice here, I see a table called registrants.

  • And then I can browse it.

  • I can see its structure.

  • I can execute manual SQL by typing it in.

  • I can search it, insert.

  • I can do bunches of things to it.

  • And that's why this graphical user interface is just convenient.

  • Let me go ahead and click on registrants.

  • And by default, you'll see these tabs now, Browse, Structure, SQL, and then

  • a bunch of others.

  • And notice over here.

  • You'll see the one row and the one registrant who actually remains.

  • Because when we last left off, we removed Brian forcibly from the team.

  • So suppose I want to go ahead and add more rows just

  • for the sake of discussion.

  • I can just do it manually here.

  • I can go in here and I can say, let's say the third player

  • is going to be Veronica.

  • I think she was also in Matthews so I can just type that in here.

  • And I can just go ahead and click Insert.

  • But what's nice about phpLiteAdmin is that it will not only

  • insert the rows for you, it will show you

  • the SQLite code with which you could have done it yourself.

  • So it's a nice visual reinforcement of that exact same command.

  • And you'll notice they use double quotes instead of my single quotes before.

  • They're sometimes interchangeable, but not always.

  • So it's a wonderful way of just learning how you can actually

  • do this with the right SQL code so you're not oversimplifying it

  • with the GUI.

  • But you know what?

  • Let's do this.

  • I'm going to go ahead and start over because I

  • want to make some better decisions.

  • I'm going to go ahead and, literally, right-click or Control-click

  • this, delete froshims.db.

  • And let's actually start this from scratch.

  • Now previously when I did this, I could have,

  • in my Terminal window, gone ahead and done SQLite3 and then, what was it?

  • Froshims.db and created the file.

  • For now, I'm going to actually just go ahead and touch

  • a file called that name, which is a simple command that

  • just literally creates it, but puts nothing in it,

  • just so I have an empty file.

  • And you'll see that it just popped up again

  • on the left-hand side, which is handy.

  • And now I can double-click this version of it, which has nothing in it yet.

  • And I'm back to phpLiteAdmin.

  • But notice no table in database.

  • So let's start to ask some of the harder questions

  • as to actually how to create data in a database.

  • So let me go ahead and go to this field here-- create

  • new table on database froshims.

  • I'll go ahead and call it registrants again.

  • But how many fields?

  • So let's go with ID, name, and dorm, what else?

  • Email I heard earlier.

  • Age I heard earlier.

  • AUDIENCE: Sports you want to be in.

  • DAVID J. MALAN: Sorry.

  • So sports you want-- excuse me--

  • sports you want to be in.

  • OK. six fields, six fields, let's go with that.

  • So it's, actually, you know what?

  • Seven fields.

  • I want phone numbers too this time.

  • So let me go ahead and click Go.

  • And now you'll see just a GUI way of prompting you for all the answers

  • to the same questions as before.

  • And maybe this is clearer.

  • Maybe it's not.

  • But just no alternative to the memorizing

  • exactly what the commands need to be.

  • So top to bottom, here are all of the fields I need to decide on.

  • Ironically, the fields is just a synonym for columns.

  • And yet my columns are currently laid out in rows,

  • but that's just a UI issue.

  • So let me go ahead and decide.

  • By convention, my first field is almost always ID or probably should be.

  • And we'll see why this is powerful and just a little bit.

  • The data type for that, by convention, should be integer,

  • unless you've got a lot of data like a Facebook,

  • and then big int might make more sense.

  • But notice this Dropdown actually gives us

  • a nice menu of options just as before.

  • And you'll see in gray text, the category,

  • or in fancy terms, the affinity of these various types in SQLite

  • and in lower case black words here, you'll

  • see the actual data type supported by big popular databases like Postgres

  • that you might want to use for your final projects.

  • In fact, we're introducing these with eye

  • toward your using these for final projects in the cloud, not in CS50 IDE,

  • but actually getting your own domain name

  • and putting your website, if you do a web app, out there.

  • So here, we have all of my available types.

  • And under Integer, I'm going to go ahead and literally choose Integer for my ID.

  • All right.

  • Next, go ahead and Zoom Out.

  • Let me go ahead and choose a student's name.

  • Before did we do did we decide on char or varchar?

  • AUDIENCE: varchar.

  • DAVID J. MALAN: OK, varchar, and what size?

  • AUDIENCE: 255.

  • DAVID J. MALAN: 255.

  • So the user interface here just allows me to type it in.

  • So the syntax is a little different because it's a GUI, but 255.

  • But you know what?

  • The last field was dorm.

  • That too, I think we said varchar.

  • So let me choose that.

  • And 255, though, this one's a little more debatable.

  • I'm not sure what the right number is, so in the absence of clarity,

  • I'm just going to standardize on some same value without being too wasteful.

  • But notice there are a few questions here that we haven't come to.

  • But our perfect segue earlier hinted at this.

  • It turns out, you need to make a few other decisions

  • when designing a database.

  • If you know in advance that one of your fields

  • is the primary piece of data to uniquely identify users,

  • that's what's going to be called a primary key--

  • the column or fields that, guaranteed, is going to identify users uniquely.

  • So if you've got two Brians, each of them is going to have its own ID.

  • That therefore, is your primary key, not the name field Brian.

  • So I'm going to tell the database, this is the primary key.

  • And it's going to help me keep track of that uniqueness.

  • Moreover, this is a fancy feature.

  • Before, I was manually and very arbitrarily

  • saying Brian will be number 1, I'll be number 2, Veronica will be number 3.

  • That's tedious.

  • Like, computer should be able to solve that problem for me.

  • I don't want to think about who is idea number what.

  • You can auto-increment the field.

  • So if I actually check this box, SQL for me

  • will just plus-plus, plus-plus the idea field every time

  • I insert a new name and a new dorm.

  • I don't have to even bother specifying an ID anymore.

  • Now there's another column here, not null,

  • where you can specify this column should never be null.

  • And this is important because, if you're building a website that has important

  • data that you must have from the users-- like,

  • your app won't work without the user's username or their password

  • or their email address--

  • you can say not null.

  • And your database will ensure that you can't even

  • insert a row into this database unless you give it a value for that field.

  • So it helps you protect you against yourself.

  • Because you could certainly implement that

  • logically in Python or any language.

  • But the database is a final gauntlet as well.

  • A default value doesn't really make sense here.

  • But for certain types of fields, you can say, database,

  • insert the current time or the current date for me?

  • Now, why might you want that?

  • Why date and time by default?

  • Why might that be useful?

  • AUDIENCE: When the account was created?

  • DAVID J. MALAN: When the account was created, when they bought

  • a foo, when they shipped a bar-- any number of reasons.

  • You might just want to know, what is the time right now?

  • But you don't have to write code for that.

  • The database can answer those questions for you.

  • So just so much more functionality than we got, of course, with CSVs alone.

  • So name, should it be a primary key?

  • No.

  • Otherwise, we couldn't have two Brians.

  • And generally, your primary key will be one field, though,

  • theoretically you could make joint columns if you wanted.

  • But generally, it'll be a single one.

  • Should we auto-increment Brian-- so it's, like, well--

  • Brian28, or Brian2, Brian3, and so forth?

  • No.

  • Doesn't really make sense.

  • Not null?

  • Probably.

  • I want all of the freshmen's names so that we

  • know who is signing up for sports.

  • And dorm?

  • Yeah, not null.

  • But oh, corner case--

  • can anyone think of a corner case where dorm maybe

  • should kind of sort of be null?

  • AUDIENCE: They're off-campus.

  • DAVID J. MALAN: They commute, they're off-campus--

  • it's not many students.

  • But if you have 1%, 5% of students living off-campus,

  • this is a design question now.

  • And all of us have probably visited some website

  • where you just can't fill out the form in the right way

  • because you don't fit their mold or their expectations.

  • And that's just because of a poor design decision.

  • So let's allow it to be null just in case.

  • Now what else?

  • We said a phone number was when I proposed.

  • Gosh, there's no phone number type.

  • So what do you want to go with?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: I'm sorry?

  • AUDIENCE: [INAUDIBLE] varchar.

  • DAVID J. MALAN: Varchar.

  • OK.

  • So we can pick varchar and maybe use 10 or so for 10 digits in the US,

  • at least, though, maybe, like, 12 with the dashes, or 13 with the parentheses,

  • or--

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: What's that?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: I hear murmuring.

  • Sorry.

  • AUDIENCE: A small int.

  • DAVID J. MALAN: A small int and just treated

  • as a number, maybe that could work too.

  • Another alternative?

  • Anything else?

  • AUDIENCE: [INAUDIBLE] precision.

  • DAVID J. MALAN: Precision-- so a specific number.

  • AUDIENCE: [INAUDIBLE] make sure you have the correct number.

  • DAVID J. MALAN: Good.

  • So not a bad instinct.

  • But it turns out with the numeric data type, where you specify scale

  • as it's called-- the total number of digits in precision--

  • that's generally meant for floating point values-- so

  • real numbers with decimal points.

  • Because even though you're specifying a max limit,

  • you don't require that many digits.

  • It's just a max.

  • Yeah?

  • AUDIENCE: I would think that giving it--

  • doing some sort of chars would be a bad design because then you

  • can enter nonnumerical values.

  • DAVID J. MALAN: Yeah.

  • We can kind of find fault, I think, both of these ideas, though both of them

  • are reasonable.

  • If you allow for char or varchar, I could

  • type in like foo or bar or baz and not a number.

  • So maybe we should go with int.

  • But can someone think of a counter-example to why

  • you shouldn't use integer?

  • AUDIENCE: It's got a variable length to it, doesn't it?

  • So you'd get whatever number--

  • DAVID J. MALAN: Variable length.

  • But if we actually do the math, maybe 65,

  • maybe-- there are enough bytes there.

  • We could use a big int, and that gives us a really long phone number.

  • So there's probably a reasonable max.

  • Yeah?

  • AUDIENCE: Let's say [INAUDIBLE] in between [INAUDIBLE] integer.

  • DAVID J. MALAN: Hyphens and parentheses.

  • We have to decide, do we want to support those?

  • And honestly, all of us are perhaps a little US-centric right now, most of us

  • here.

  • But when you make a local call in some zones, like, you type 0 first.

  • And some human might type 0 as their number.

  • But what's going to happen if you type 0 into an integer field?

  • AUDIENCE: You're going to ignore it.

  • DAVID J. MALAN: You're going to ignore it.

  • So now there's that corner case.

  • So dammit.

  • Like, there's no way to solve this problem it would seem.

  • So what's best?

  • We have to make a compromise and just accept

  • that we have to solve this with code.

  • AUDIENCE: Varchar.

  • DAVID J. MALAN: Varchar?

  • All right.

  • So varchars or char.

  • All right.

  • So maybe let's simplify the problem.

  • No one from outside the US can take freshman intramural sports.

  • That simplifies the world.

  • Because if we only support US phone numbers, now we can say 10 digits.

  • And if we say, you know what?

  • I don't care about the hyphens or the parentheses.

  • I can use code, JavaScript or Python, to throw away the syntax.

  • And I can just store 10 digits.

  • Maybe char 10 is sufficient--

  • 3 for the area code, then the rest of the number, the all seven digits

  • thereof.

  • But you could find fault with this too.

  • And we're really alienating that international population on campus.

  • But again, these are just non-obvious design decisions.

  • And so here we are, at the end of the semester.

  • We don't always have good answers.

  • And reasonable people will disagree.

  • But let me simplify our assumptions and just do US numbers, 10 digits,

  • and trust that I will use code in Python or some other language

  • to throw away the syntax, the punctuation and whatnot

  • of parentheses and hyphens.

  • And I'll make sure the human hasn't typed in any letters of the alphabet.

  • I can do that in code.

  • And you know we can do that even in JavaScript when a human submits a form.

  • We'll leave for the end, sports, what was it sports--

  • AUDIENCE: Sports they want to do.

  • DAVID J. MALAN: Sports they might want to do.

  • So this is a good example of, you shouldn't really have spaces

  • in your field name, so the convention would be sports_they_ really or might

  • or let's just call it sports in this case.

  • We'll come back to that.

  • I think there were two other ideas we had.

  • Phone number.

  • AUDIENCE: Email.

  • DAVID J. MALAN: Email.

  • OK.

  • Email is a good one.

  • What should that one be?

  • There's no email type, unfortunately, even though there

  • is an HTML, an input type for email.

  • AUDIENCE: Varchar.

  • DAVID J. MALAN: What's that?

  • AUDIENCE: Varchar.

  • DAVID J. MALAN: Yeah, I feel like we probably need a varchar here.

  • But here, a little non-obvious, what is the longest email address in the world?

  • Maybe it's Nick's?

  • So I don't know.

  • But let's pick a reasonable upper bound that we can maybe

  • be comfortable with as a group.

  • And let's see-- anything else here?

  • No?

  • OK.

  • And was there one more field?

  • AUDIENCE: Graduation year.

  • DAVID J. MALAN: Oh, grad-- oh, age.

  • Let's go with age.

  • So age-- finally, something simple.

  • What you want this to be?

  • AUDIENCE: Small int.

  • DAVID J. MALAN: Small int, right.

  • We will not support people older than 65,535 years old.

  • Someone want to find fault with this idea though?

  • I would argue, there's no one right answer to any of these.

  • AUDIENCE: You need month.

  • DAVID J. MALAN: Month--

  • oh, it depends.

  • Do we want month?

  • AUDIENCE: Yeah.

  • We might need [INAUDIBLE].

  • AUDIENCE: --then you need the date, the year.

  • DAVID J. MALAN: So you're assuming we want birth date.

  • I think I've called it age.

  • So maybe that's the problem.

  • Like, if it's age, small int's fine.

  • Like, you can be 0 years old or 65,000 years old.

  • We have a good range.

  • AUDIENCE: All about the age, date of birth.

  • DAVID J. MALAN: If it's age, I think we're OK.

  • But I think you allude to a good point, which is, why would we maybe want

  • to store birth date and not age?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: Yeah.

  • The damn thing's always changing, otherwise, right?

  • Like, I'm going to have to update my database tomorrow and then

  • the next day, let alone every hour or every minute just

  • because my users' ages are changing.

  • Like, that seems silly.

  • Let me, instead, fix a value--

  • so do something like birth date.

  • Birth date-- maybe specify not an int.

  • But let's actually use the date field.

  • We could store time if we really care what time they

  • were born on a certain day.

  • But here, I can say date is going to be a little better because, now I

  • know in Python, JavaScript, even C, I can do a little bit of math.

  • And if I know they were born on such and such a day and month and year,

  • well, I'll just subtract that from the current day, month, and year

  • and figure out how many days or years old they are-- so a better design

  • decision there perhaps.

  • But we do have to standardize the format.

  • We can't just allow people from the US and Europe and Asia

  • to all kind of choose their own formats.

  • SQL standardizes this-- year, year, year, year dash

  • month, month dash day, day.

  • And that's the value of having these data types again.

  • All right.

  • So how about sports-- the last one?

  • AUDIENCE: Varchar.

  • DAVID J. MALAN: Varchar, all right.

  • What's the longest number of words in the sport?

  • AUDIENCE: I didn't say might want to do, so I could get up there,

  • but 255 probably makes the most sense.

  • DAVID J. MALAN: Maybe unless they're very athletic.

  • Yeah.

  • Here too I don't know.

  • But just for the sake of opening up possibilities,

  • when you think the human might be a little expository

  • and actually write a paragraph of all the sports they're

  • involved in or whatnot or even bigger than a paragraph, text is even bigger.

  • Sports, this probably isn't compelling.

  • And I'll change it back to varchar.

  • But if someone is typing in their college essay

  • into the Common Application, or if you're

  • asking people to paste their resumes, or the like, you might want to use text.

  • Because I have no idea how many words someone's going to have.

  • Text allows you to store even more data.

  • But it stores it a little differently.

  • It tends to store it not in the column, but using pointers.

  • If you recall from a few weeks back, it uses the equivalent

  • of that to store it over there, which takes a little more time to get to.

  • So again, there's just a trade-off here.

  • So we could do varchar 255--

  • makes me a little nervous.

  • So I'm going to go with another common value.

  • 1024, it's a power of 2, but there's no one right answer here.

  • But these are the non-obvious design decisions we have to make.

  • If I didn't make any mistakes here, I'm going

  • to go ahead and click Create, and whoo--

  • table has been created.

  • You can even see now all of the data that's been created there.

  • And voila, if I go back to froshims, and I go back to the registrants table,

  • not only can I browse it-- there's nothing in there yet.

  • I can look at the structure and actually see all of those same values

  • and edit some of them like renaming.

  • But you can't completely mutilate it.

  • You might have to start over if you make too many changes.

  • All right.

  • Any questions then about this?

  • Yeah?

  • AUDIENCE: Why do you use 1024 instead of 1023?

  • DAVID J. MALAN: Oh, why do I use 1024 instead of 1023.

  • Typically, when you choose a ma-- oh, convention.

  • I can't justify this.

  • 255 was the max because I think one of the bites was reserved for some value,

  • historically.

  • But that's not really the case.

  • That bound has been lifted, and no good reason.

  • AUDIENCE: You could have used 255.

  • DAVID J. MALAN: Nowadays, yes.

  • Years ago, 255 was an actual limit.

  • And so it got adopted.

  • Now, I don't know.

  • I just pick powers of 2 often.

  • And then at my next go-to would be 2048, 4096, and so forth, just because.

  • Yeah?

  • AUDIENCE: What's the difference between varchar and char?

  • DAVID J. MALAN: Varchar--

  • What's the difference between varchar and char?

  • Char uses a fixed number of bytes no matter how many of them you are using,

  • the advantage of which is your columns, conceptually,

  • are perfectly straight on both the left edge and the right edge, which

  • means you have random access because every cell is some fixed

  • number of bytes from the rest.

  • Varchar user a ragged array, as it's called,

  • where one side, the right-hand side is shorter or longer in different cells.

  • So there's only a maximum length on each of those cells.

  • But searching it can be slower as a result

  • because you can't just jump to cell to cell.

  • You have to follow the lengths of those things.

  • That's the trade-off.

  • Yeah?

  • AUDIENCE: If you [INAUDIBLE] education, how do you add it?

  • DAVID J. MALAN: Oh, if I wanted to add an education field now and modify

  • the table--

  • if you realize too late, oh, darn, like, I need to actually add something

  • to this, in the GUI tool, we can add 1 fields to the end of the table,

  • literally.

  • So let's do that.

  • Let me go ahead and click Go.

  • I'll be prompted with a similar form, but smaller.

  • I can go ahead and type in something like education.

  • Let me propose this is varchar, maybe 255,

  • though we could have that debate too.

  • I'm going to go ahead and say add fields.

  • And now notice that the table has been altered successfully.

  • It actually, for whatever reason, it's not showing me

  • the code for that particular command.

  • But there is literally an ALTER command in SQL

  • that would allow you to change it.

  • And if I go back to the structure now, you'll

  • see that I have another column called education shown ironically

  • here as a row.

  • Yeah?

  • AUDIENCE: Does the order of the columns matter?

  • DAVID J. MALAN: Good question.

  • Does the order of the columns matter?

  • Fundamentally, no.

  • By convention, you would typically put the ID first.

  • And then I, personally, by design, put the most important fields

  • next like name feels like the right choice,

  • maybe email feels like the right choice, though, I clearly thought

  • of it a little too late this time.

  • It's not easy in SQLite to reorder things, but in other databases you can.

  • So there, it's more of a human convention.

  • Yeah?

  • AUDIENCE: Is it convention to have one single primary key,

  • or can you have multiple?

  • DAVID J. MALAN: Good question.

  • Is it a convention to have one single primary key, or can you have multiple?

  • By definition, you can only have one.

  • But that primary key can span multiple columns.

  • So we haven't seen a use case for this yet.

  • But there are scenarios in which you would want to say,

  • I want to guarantee that these two columns together are unique,

  • but not each individual one unique.

  • But we won't encounter that just yet.

  • Other questions, yeah?

  • AUDIENCE: What if you had added education

  • after you had started the database, and that was a not null field?

  • DAVID J. MALAN: Really good question.

  • What if you had added education after you already had real data in there,

  • but you specified not null, which is problematic.

  • Because what is the educational backgrounds of the previous people?

  • Typically, what the database would do is either reject it,

  • or it would just put the "empty string," quote unquote.

  • So it's not technically null, but there's nothing actually there.

  • It's just a string of length 0.

  • Really good question.

  • All right.

  • So what can we now do that's a little more powerful about this?

  • Well, let me go ahead and quickly insert some data here.

  • I'm going to keep most of it blank.

  • But you know what?

  • I'm not even going to bother with ID.

  • Brian you're back on the team.

  • Let's go ahead and insert Brian.

  • Let's go ahead now into registrants again.

  • Let's go ahead and add Veronica again.

  • So I'm just inserting a few rows manually.

  • And again, notice it's executing all of this for me

  • without me having to bother typing it out.

  • But I absolutely could.

  • In fact, just for good measure, let's do one manually.

  • If I click the SQL tab, notice that I get a default suggestion here.

  • That is the syntax with which you can select everything.

  • Or I can just type INSERT into registrants.

  • But now, if I only want to insert a name,

  • I don't have to do all of the columns.

  • I can just say go ahead and insert here, Erin, for instance,

  • semicolon, zoom out and click Go.

  • That seemed to work.

  • If I go back to browse now, Erin is in there as well.

  • But you'll see the difference.

  • The query that's being generated automatically by the GUI

  • was lazily just inserting quote unquote, the so-called empty string of length 0.

  • I, by omitting even mention of those columns,

  • was deliberately inserting null.

  • So frankly, my database is getting a little messy here.

  • So you generally don't use phpLiteAdmin or a GUI to insert data.

  • You might use it to conveniently create your tables

  • and get your application ready.

  • But then you're going to write code ultimately.

  • And that's the direction we're going.

  • And I'm going to go ahead and insert one more person.

  • Oh, I forgot I'm not on the team at all because we started over.

  • So let me put myself back on the team, David.

  • And let me go ahead and click INSERT, go back to registrants,

  • and now you'll see there are four of us.

  • My ID changed because they've been inserted in different orders.

  • But notice all of the auto incrementing has been happening magically for me.

  • And that's useful because I don't have to even think about it.

  • And who cares what my ID is.

  • I just need to have, in many cases, in a database, a unique ID.

  • So now let's actually write a little bit of code, right?

  • Thus far we haven't done anything useful.

  • We've shown you this black and white window in which you

  • can select in certain update data.

  • But that doesn't really solve any problems we know about yet.

  • We have this graphical web-based interface

  • via which you can create tables and add data, but who cares?

  • We're trying to solve actual problems.

  • And the problems of late have been to build

  • software that would solve any number of human problems

  • like serving users and showing the results

  • or finding similarities in documents.

  • So suppose the problem at hand now is to actually build

  • something like the froshims website and let students register and then see

  • who is registered.

  • Well, back in my day, I fairly lazily, for lack of technical know-how,

  • just emailed the registrations to the proctor or the RA who was

  • managing the intramural sports program.

  • But I-- they later started putting it in CSV files.

  • Suppose now, version 3, 20 years later, I want to store in an actual database.

  • How can I actually do that and then see the results?

  • Well, let me go into the IDE again and open up,

  • for instance, a New File that I'll go ahead and call lecture.py.

  • And suppose I just want to write a simple Python program via which

  • to select data from a database.

  • So it turns out I can do a few things here.

  • First of all, let me go ahead and do, let's say, from CS50,

  • I previously have done things like import get_string and get_int

  • and so forth.

  • It turns out that the CS50 library for Python also supports SQL.

  • And it's going to give us a function called

  • EXECUTE that will let you execute any SQL command, but in Python code.

  • So instead of pulling up SQLite3 via my own hands

  • or going to a graphical user interface phpLiteAdmin,

  • I can write code that talks directly to froshims.db,

  • and eliminate all of those tools altogether, and just now write code.

  • So how do I do this?

  • I'm going to declare a variable called db for Database.

  • So I could call it anything I want.

  • And I'm going to go ahead and call this SQL function.

  • And I'm going to pass in a somewhat funky looking string as an argument,

  • but it's a standard convention, to say what

  • database technology do you want to use?

  • Then you do colon, slash, slash, slash-- so it's three slashes, not

  • the usual two in a URL.

  • And I'm going to specify froshims.db.

  • This now will give me a Python variable called

  • database that is kind of like a portal, if you will, into that database

  • file that I can send SELECTs, and INSERTs, and DELETEs, and UPDATEs to.

  • How do I do this?

  • Well, if at the end of the day, I want to execute

  • the equivalent of SELECT star from registrants, how do I do that?

  • Well, I'm just in a text editor, right?

  • This is CS50 IDE.

  • I'm just typing text.

  • Moreover, I'm typing text in a Python file, and this is not Python.

  • And indeed, the IDE has this little red x saying, mm-mm, can't do this.

  • But I could pass SQL code as an input to a Python function

  • and let that function talk to the database.

  • And indeed, that's what we're going to get here from CS50's library.

  • I'm going to go ahead and do this.

  • You know what?

  • Access the database and EXECUTE the following SQL code,

  • quote unquote, "that" close parentheses.

  • Now, what is select return by convention?

  • What should it return?

  • Well, in SQLite3, we just saw a pretty text-based table

  • with lines and slashes that looked like a table, but was just text.

  • phpLiteAdmin, we actually saw HTML tables when I browsed the database.

  • And I proposed verbally, a bit ago that, you know what?

  • If I were to get back all of this data in code, what data type

  • would I like it to me as?

  • Rows.

  • I want rows from a table.

  • Show me all the students who've registered.

  • What data structure in Python seems apt?

  • AUDIENCE: List.

  • DAVID J. MALAN: Yeah, just a list, right?

  • A list that's ordered from first row to last row.

  • So we'll call that a list or an array, back in the day of C. So you know what?

  • I'm going to assume that's correct.

  • And if I read the documentation, I would see that it's correct.

  • CS50's EXECUTE function, if you select, returns to you a list of rows.

  • It might have 0 rows if there are no matches.

  • But it might have 1,000 rows if there are lots of matches.

  • I'm going to store those results, wherever they are, in my rows array.

  • Now, suppose I want to print out who has registered in my database

  • from whatever froshims website exists.

  • I'm assuming students have registered on the web.

  • Now I'm just the proctor or the RA who's actually now

  • trying to manipulate the data and do something with it.

  • So what can I do?

  • Well, for row in rows, what do I want to do?

  • Let me go ahead and just print out that so-and-so registered.

  • So so-and-so registered.

  • Well, how do I plug in so-and-so?

  • Well, there are a few ways to do this.

  • And let's see.

  • First of all, I could use my placeholder syntax for print.

  • And then, I want to print out the row.

  • But what do I want from that row?

  • What columns are in any row in this database?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: ID and name and dorm and phone and sports.

  • Well, it turns out, those are going to be handed to you as Python dictionaries

  • or dict structures.

  • So I can just say row quote unquote, 'name' here.

  • And I'll use single quotes just to make more clear what's going on here.

  • And then I need to make one fix.

  • How do I--

  • AUDIENCE: F.

  • DAVID J. MALAN: --F for Format string.

  • So it looks a little cryptic.

  • But this is just Python stuff now.

  • The only thing that's new is SQL.

  • But if we stipulate that SQL, when using this execute function

  • is just going to hand you all a list of rows, each of which

  • is a dictionary so that you can get at this column or this column--

  • ID, or name, or dorm--

  • this would seem to be now a nice convergence of this week--

  • now, with the past couple of weeks.

  • So let me go ahead and save this.

  • Let me go ahead and View my Console, so I have a Terminal window.

  • And let me go ahead and run Python of lecture.py.

  • And in just a moment, if I cross my fingers,

  • I should hopefully see who has registered.

  • Amazing.

  • I've seen who's registered.

  • Now there's one line of output that I didn't expect, which is just this one.

  • This is the library just being pedagogically helpful.

  • It's showing me every command that I sent to the database.

  • But you'll see that so-and-so has registered.

  • So this is kind of interesting.

  • It's kind of a stupid program, right?

  • Because most proctors aren't going to be hacking

  • froshims by using a terminal window and running Python scripts.

  • They're probably going to want to do this by a web page

  • and actually see who is registered.

  • But if we have the ability in Python code to do this, like,

  • iteration, what could I do instead of just printing to the screen?

  • What could I print out per last week and per the past problems set?

  • I could print out HTML, right?

  • Like, each of the students who register, kind of

  • feels like an opportunity for an unordered list

  • or an ordered list or a table or whatever.

  • You can now generate HTML.

  • So let me do this.

  • Let me actually go into an example I've made in advance.

  • What if I went ahead and opened up layout.html.

  • Here is a simple layout for a web application using Flask.

  • For those unfamiliar, this is mostly HTML

  • plus a technology called Jinja, which is a web-based technology for generating

  • websites dynamically.

  • The body of this page is clearly what's of interest.

  • And you know what?

  • I bet I could do some logic right in there.

  • So let me go ahead and do this.

  • Let me go ahead and Create, let's say, a New File.

  • Let's call this application.py.

  • I'm going to go ahead and, just for time's sake,

  • do a little bit of copy-paste to save myself some keystrokes.

  • So here's a very simple web app.

  • And this is going to be my to-do.

  • So if I go in here to templates, let me open up index.html--

  • and let me go head into here.

  • So long story short--

  • here's where we're going with this.

  • What if I instead generate an unordered list using code from last week,

  • but I use my for loop here inside of my web app

  • instead of actually just with a simple lecture.py file.

  • Well, recall that I could do something like this.

  • I can have a Jinja loop.

  • So I could say something like for row in rows.

  • And then, down here, I can preemptively say something

  • like andfor, which is our weird syntax from last week.

  • And then, in here, I can just do a list item.

  • And then, if I want to show who registered, what do I type here?

  • Something in between these curly braces if each row represents a registrant?

  • AUDIENCE: Row.

  • DAVID J. MALAN: Row name registered.

  • I can just do something like this.

  • So same idea, I'm just wrapping it with a little bit of HTML.

  • Now let me go to my application.py file because I'm

  • going to have to fill in some blanks.

  • And let me see, how can I do this?

  • Well, ultimately, I want to return the result of rendering

  • a template called index.html.

  • For families unfamiliar, this is just a line of code

  • that says go show that file to the user.

  • But I don't want to just show them the file as-is.

  • But let's at least get this set up.

  • I'm worried that it might not work yet because I need to get the actual data.

  • So how can I get all of the rows for my registrants?

  • Well, I can do rows gets db.execute.

  • And I can go ahead and select star from registrants and store that in there.

  • This file is called froshims.db that I created earlier.

  • And everything else is just Flask stuff from last week.

  • Nothing else is new.

  • The only thing that's new is this line here,

  • this line here, and now this line here where

  • I'm using SQL inside of a Python call by passing

  • it is an argument to a function called EXECUTE.

  • How do I pass the rows to index.html?

  • AUDIENCE: [INAUDIBLE]

  • DAVID J. MALAN: Yeah, like rows equals rows is the convention we've adopted.

  • You call it anything you want, x equals y,

  • but this is a little more straightforward.

  • So this is saying, hey, database, get me all of the rows from my registrants,

  • and then render the template index.html, and pass in these rows.

  • And now, if I hold my breath and run flask run, no syntax errors.

  • If I go ahead and visit this here and open--

  • dammit-- the tab, I see an internal server error.

  • So teachable moment, families.

  • Let's go back into that browser window here and see what happened.

  • OK.

  • Template syntax error-- so pretty stupid mistake.

  • It looks like I expected a square bracket instead of a curly brace.

  • That's fixable.

  • Let me go into index.html.

  • And oh, I didn't finish my thought.

  • So that's some of the frustrations of programming for those

  • who are seeing this for the first time.

  • Let me save that.

  • Let me go back to the browser here, and we'll just reload.

  • And oh my god, voila, now I have a web page

  • via which you can see who has registered.

  • But you know what?

  • We can make this more powerful.

  • Recall that we've been playing with HTTP for some time.

  • And if this is the URL I'm accessing, recall

  • that we played around with reimplementing search functionality.

  • Well, what if I want to support search such that I can just visit q

  • equals and then search for people named, say, Brian

  • and see how many Brians are registered.

  • Could we add support for something like this?

  • Well, maybe.

  • Let me go back into the IDE, into application.py.

  • And let me go ahead and say something like this.

  • q equals request.args get q to see if anything is actually there.

  • And then let me go ahead and do this.

  • SELECT star FROM registrants WHERE--

  • let me see-- q where what?

  • NAME equals q.

  • But I need a placeholder.

  • So maybe I should do this.

  • And as soon as I do this, I need a what?

  • F for a format string.

  • So I could just create, on the fly, a SQL command

  • that plugs in the value of q between those curly

  • braces to express the logic of select all

  • of the registrants whose names equal Brian or Veronica

  • or whoever's name I typed in.

  • Let me go back to the browser here.

  • Let me go over and do something like this now.

  • Question mark q equals Brian-- cross my fingers as before.

  • Dammit.

  • OK.

  • And what did I do wrong here?

  • What did I do wrong here?

  • This is subtle.

  • And we're seeing it for the first time.

  • It thinks there's a column called Brian.

  • But why would it think that?

  • Well, what I've effectively done is sent in this.

  • Brian is not a keyword in SQL.

  • And because it's an actual string that I'm comparing against,

  • what I really need to be doing is this.

  • Otherwise, SQLite is going to think it's like the name of a column or something

  • I pre-created so we have to fix this.

  • But that's OK.

  • I can put the quotes there.

  • But I should probably put the q there in quotes.

  • Let's save this, go back to the browser, reload.

  • And there we go.

  • Now we have functionality for Brian.

  • And so with this basic building block, what have we done?

  • Well, in SQL, we have several commands at our disposal--

  • creating a table, which frankly gets tedious by typing it out.

  • I myself tend to use and recommend phpLiteAdmin just to create your table

  • and get it going.

  • But then you can certainly manually, with SQLite3 or phpLiteAdmin INSERT,

  • or UPDATE, or DELETE, or SELECT information

  • once it's actually in the database.

  • And that's pretty powerful.

  • But once you do that, you can now use that same new syntax, that new language

  • SQL, passing it in as an input with a "string" to our EXECUTE function,

  • and now start pulling any data you want from your database.

  • Last week, with CSV files, if you wanted to do this,

  • you'd have to open the CSV file, use a for loop to iterate over it,

  • look over every column and row for your data, then pass it in.

  • And that's fine.

  • That's correct.

  • That's not bad.

  • But it's tedious.

  • And you're reinventing the wheel again and again.

  • And there's no filtration built in as there is to SQL itself.

  • So you now have a more sophisticated tool

  • in your toolkit so to speak with which to solve that same kind of problem.

  • Any questions then on this technique?

  • All right.

  • Well, let's look at a bigger database and see where we can go with this?

  • So if you go on the course's website, you'll

  • see a larger database that's actually available in multiple formats, SQLite,

  • which we'll see in a moment, but also Google Spreadsheets.

  • Because frankly, it's a lot more pleasant to look

  • at your rows and columns in a GUI than it is, necessarily,

  • with the file itself.

  • So this happens to be a free and open source sample database.

  • Like, some guy, years ago, took his actual iTunes database--

  • all the music he'd ever bought, he wrote a program

  • to like analyze Apple's file format in iTunes

  • and extract all of the data that seemed to be stored about him,

  • I think was the story, and just made it publicly available as a sample

  • database for students and teachers to just use to manipulate data.

  • But what's interesting is that this database

  • demonstrates some other principles that we really haven't touched on.

  • For instance, if I were to store, again and again,

  • all of these students who are registering for froshims,

  • what do you start to see in certain fields?

  • Well, I was a little lazy, and I didn't bother typing in everyone's dorm.

  • But suppose that hundreds of students have registered for froshims.

  • A lot of them are going to be from Matthews, some from Pennypacker,

  • some from Candaday, some from Weld, and bunches of other buildings on campus.

  • It starts to get a little ridiculous when

  • you see Matthews, Matthews, Matthews, Matthews,

  • Matthews-- like, 100 or more times.

  • If there are 1,600 freshmen, there are a lot of kids in Matthews.

  • That's a lot of bytes to store M-A-T-T-H-E-W--

  • I don't know.

  • It's not important.

  • Doesn't matter how it's spelled.

  • That's a lot of bytes to actually store in your database again and again

  • and again.

  • It feels like there should be an opportunity

  • to factor out the commonalities.

  • And what humans do with databases is, once they recognize a recurring

  • pattern of data, same darn strings again and again and again, you know what?

  • Rather than use-- now it matters--

  • M-A-T-T-H-E-W-S-- which is 8 bytes.

  • Or P-E-N-N-Y-P-A-C-K-E-R, which is 11, then we have a lot of bytes being used

  • again and again and again to store all of these dorms.

  • You know what?

  • What's better than 11 bytes or 8 bytes?

  • Let's just use an int, or let's even use a small int--

  • 2 bytes or 4 bytes to represent dorms.

  • So instead of storing Matthews, let's just store the number 10.

  • And instead of Penny Packer, lets just store the number 11, thereby,

  • using some bytes, but fewer and, therefore,

  • saving bytes in the long run.

  • And so with this database demonstrates is exactly that principle.

  • Certainly, when it comes to music, where artists have multiple albums

  • and artists have multiple songs, it's probably

  • a little silly in a musical database to store the name of the album

  • again and again and again and again for all 10 or 12

  • or 20 tracks or songs on that particular album.

  • So what this person did was this.

  • Notice here, we have a whole bunch of sheets.

  • Or in database-speak, these would be tables.

  • And notice that these tables have columns.

  • And notice that these columns are album ID,

  • in the album table, album ID and title.

  • But notice what he did very cleverly with this field, artist ID.

  • Artists or singers have lots of songs to their name, eventually.

  • And so he's assigned each of them unique value,

  • or Apple did, in iTunes underneath the hood.

  • So how do I know what this artist's name is?

  • How would you figure this out?

  • It's not that interesting to us humans do know, ooh, artist ID number 2.

  • This is just-- what's that?

  • AUDIENCE: You need another table with artists.

  • DAVID J. MALAN: Yeah.

  • We need another table with artists, which is right over here.

  • So let me go ahead and look there.

  • So if I want to see-- let's see, "Let there be rock," artist number 1.

  • Let's go to the artists table.

  • And turns out, AC/DC, the band is who created that.

  • Now we've added a step here, which maybe is costing us a little bit of time.

  • But it's going to save a space in the long run if I'm not storing long

  • artist's names-- although, AC/DC isn't terribly long--

  • again and again and again.

  • Now, which is better?

  • Well, it's a trade-off.

  • Are you more comfortable wasting space and storing everything together?

  • Or do you prefer to save space and just spend a little more time

  • joining the data back together?

  • But it's going to be really annoying if, now,

  • if I want to make a website that shows me the names of the songs

  • that I have in a database and the artists for them,

  • let alone the albums, and more of that, the titles of the tracks, and so forth.

  • It feels like that's three queries, right?

  • Like SELECT the album, SELECT the artist, SELECT the titles--

  • but no.

  • With SQL, you can collapse that altogether.

  • Because notice, in this table here, artist,

  • there is a column called Artist ID that's numbers.

  • And notice, if you kind of picture this, it's

  • like finger tips here-- let's propose metaphorically--

  • represent the artist ID.

  • If I go into album now, notice that we have album ID and title,

  • but we also have artist ID.

  • And so if you imagine these two tables sharing this common column, what if we

  • kind of stitch them together like this, lining up one

  • on the left, the other on the right, thereby reconstructing

  • all of the information and duplicating it as

  • needed so that I get back just the album and the title and the artist.

  • Well, how can express that?

  • Well, let me go ahead into CS50 IDE where I have a copy of this file.

  • Let me close all of my tabs from earlier go into this file called lecture.db.

  • And in lecture.db, in phpLiteAdmin, we'll see all of those same tables.

  • And I literally just imported it into a SQLite.

  • You'll see all of these same tables.

  • We can browse album just as before.

  • And we just see a different format for the same data.

  • It's the same data from the Google Spreadsheet,

  • which is just more user-friendly.

  • And let me go ahead and do this.

  • I could, of course, SELECT star from album

  • Where our artist ID equals 1 to get back all of AC/DC's albums.

  • And indeed, here, I have two.

  • They have For those about to rock, We salute you, and Let there be rock.

  • They have two.

  • But notice, the rows I got back contain only what information?

  • Album ID title, and artist ID.

  • I just know, as a human, that oh, these are AC/DC's albums.

  • But what if I want to know, well, OK, I see that artist ID is 1.

  • So all right, well, let me open another tab here.

  • And now let me SELECT star from artist WHERE artist ID equals 1.

  • And so if I want to learn something about that artist-- let me go ahead

  • and Zoom Out, click Go, and OK-- now I get AC/DC.

  • Well, this is great.

  • Now I have to results, two sets of rows.

  • This is stupid.

  • Now I'm just creating work for myself by having two return values.

  • I could call db EXECUTE twice.

  • But there's a better way.

  • It turns out, SQL allows you to join tables just using SQL itself.

  • So I'm going to go ahead and do this.

  • I'm going to go ahead and SELECT star FROM album,

  • but also FROM artist WHERE Album.Artistid--

  • let me scroll to the right--

  • equals Artist.Artistid.

  • So notice I'm saying select everything from two tables,

  • but only do so where the album tables, artist ID

  • column has the same value as the artist tables artist ID column.

  • That's kind of the stitching, metaphorically, of my fingers

  • together, looking for that common column.

  • If I go ahead and click Go, wow, look at what I've just constructed.

  • It's a lot of information, but I have album ID and title,

  • I have artists ID still, but I have the name of that artist altogether.

  • So if you now let your mind wander back to the Python code,

  • oh, I could now get a whole bunch of rows containing

  • everything I care about all at once.

  • I don't need two select queries.

  • I can join these tables in this way.

  • And I use join very deliberately.

  • It turns out that there's another way to express this same thing.

  • Instead of using that comma syntax I did, you might see as well this,

  • Select star From Artist JOIN Album ON Artist.Artistid equals--

  • let me scroll over--

  • Album.Artistid.

  • This is going to have the exact same effect,

  • but you might just find that it reads a little more intuitively to you.

  • Select everything from the result of joining these two tables.

  • How do you want to join them?

  • Well, join them on this equaling that--

  • just another way of expressing the same idea.

  • And if I click Go, I get back the same information.

  • So ultimately, with JOINs do we have the ability to reassemble data.

  • So on the one hand, it's just good practice to normalize your database.

  • Identify columns that have lots and lots and lots of redundancy, and only

  • store that information once.

  • For instance, CS50 Finance, if you're supporting

  • many different users, every time Malan or Brian or Veronica buys a stock,

  • feels like it would be a little silly to store Malan

  • or Brian or Veronica along with Netflix, the symbol, and the number of shares

  • one of us bought.

  • Because Malan, Malan, Malan, Malan is going

  • to appear all throughout the database.

  • And what if I change my username or my name or someone gets married

  • and, therefore, it changes?

  • Like, why do you create that messiness for yourself?

  • Instead, give Brian and Veronica and me and everyone else a unique ID.

  • And when they buy something, just store their user ID or customer ID

  • or however you want to think about it, just like with album ID and artist ID.

  • And so normalizing a database is all about finding those commonalities

  • and moving the data into its own table.

  • And if you care about rejoining it, just use

  • SQL to reconstruct that view of the data, so to speak.

  • So what else can we do here as well?

  • It turns out that there is in SQL, not just primary keys,

  • but there are unique constraints in some databases where you can specify,

  • this isn't my primary key, but I want it to be unique.

  • You can specify that something should be indexed.

  • So it turns out that, if you just know there's a field in your database

  • that you want to be able to search on very efficiently,

  • you can index it in advance.

  • And you'll see or be able to do this if you'd like for final projects

  • or even for the next problem set if you'd like.

  • But what this enables are queries like this.

  • If I want to go ahead and search for, for instance--

  • what would be a good example?

  • Rock.

  • I'm interested in rock.

  • So if I want to go into My SQL tab here.

  • I could say something like this.

  • SELECT star FROM Album WHERE Name not equals,

  • but where name is LIKE and then I'm going to say 'Rock.'

  • But if I want any number of characters to come before that word,

  • I can use a percent sign.

  • And if any number of characters after, I can use a percent sign.

  • These are like wildcards.

  • In most languages, you would use star.

  • In SQL, you use percent signs.

  • But it means the same thing.

  • And if I go ahead and say go, now I get back--

  • oh, I get the got wrong lecture, album, oh, title I think is what I wanted.

  • Let me try that again, sorry--

  • WHERE Title LIKE 'Rock'--

  • let me go ahead and click Go.

  • And voila, here are all of the albums in the database that

  • have the word rock in them.

  • Now, as an aside, this table has a lot of more albums in it.

  • And frankly, it's small enough though.

  • It has hundreds of rows, maybe a few thousand rows.

  • None of us humans are really going to notice how slow linear search is.

  • But if you start having thousands of rows, tens of thousands

  • of rows, millions of rows, not having an index

  • means that searching for something like rock is going to start at the top

  • and search every darn field all the way to the bottom, big O of n.

  • If you instead tell the database, I know I'm

  • going to be searching on this column a lot, please index it for me,

  • here comes the secret sauce.

  • SQLite, Oracle, Microsoft Access, and so forth, they

  • will, using their own intellectual property,

  • build up some fancy data structures-- trees, or hash tables,

  • or whatever in memory, store the data for you invisibly in that format

  • so that, when you do ask for a question like, show me all the albums like rock,

  • they can answer you in much faster time than linear.

  • And that too is what you get with SQL that you don't get with CSVs.

  • CSVs are, by nature, only linear.

  • So we can do better.

  • But you, the programmer, have to help the database

  • and actually give it those hints, not just the types,

  • but also hints like this.

  • And as an aside, there's also the notion of foreign keys

  • where, if you really want to lock things down, you can specify that,

  • if you ever see in album ID in another table,

  • if it's a primary key in the album table,

  • by definition, in the other table, it's going to be called a foreign key.

  • Because it doesn't really belong there, but it's

  • referencing a column elsewhere.

  • So there's a lot more technology and vocabulary.

  • And you're welcome to dive in deeper.

  • And odds are, many of you will for final projects,

  • by nature of wanting certain features, among them, even

  • the ones we've seen like auto incrementing and not null.

  • As an aside too, SQL even has functions.

  • And for data scientists and statisticians

  • it's super useful to be able to just do math and summaries of data right

  • within SQL without ever writing Python code or R or anything else.

  • Built into a SQLite and other databases are

  • functions like this for average, counting things, getting

  • the min, max, sum, and so forth--

  • all of that you get for free with a lot of databases.

  • All it takes in the context of Python is a line like this.

  • But, but, but, but, but there are some problems.

  • And let's end by taking a look at two fundamental problems and threats that

  • are too often underappreciated.

  • And in fact, we have to fix a very serious vulnerability that I introduced

  • into my very own code earlier.

  • But first, the so-called race condition.

  • In survey-- or rather, let's see--

  • suppose that we think back at the very start of the semester, most of you

  • signed up for a GitHub account for the very first time.

  • And you went to github.com/signup.

  • For those unfamiliar, GitHub is a website

  • where you can save and store programming code that you've written

  • and want to collaborate with others on.

  • And you chose a username.

  • And let me go ahead and try choosing a username like, say, jharvard

  • for John Harvard.

  • Notice that the website immediately said the user name is taken.

  • All right, that's useful.

  • And you can probably guess how this is done-- maybe a little JavaScript,

  • using AJAX, talking to the server, getting the response,

  • changing the HTML or the CSS or whatever.

  • Might take some time to wire all that together.

  • But that's probably what's going on.

  • So let me try a really long random username that is not taken.

  • Hey, it's available.

  • But probably is not a good thing that I'm streaming this on the internet.

  • Because if I wait long enough, I bet someone could, for playful reasons,

  • just sign up for this, let alone anyone in this room.

  • But you've just told me it's available.

  • So good.

  • I'm really excited.

  • I've got my username.

  • Let me go ahead and type in my email address,

  • malan@harvard.edu, my password, 12345-- take a few moments there.

  • Verify my account and so forth.

  • And I click Submit.

  • Suppose that I'm told, momentarily, sorry, that username has been taken.

  • Could that happen?

  • Yeah, if any of you were trying to mess with me right now,

  • you would have signed up for that username and beaten me to the punch

  • so that when I hit Join, I get an error.

  • That's the definition of a race condition where two people or two users

  • or two computers or two threads--

  • if we really roll back to our discussion of threads in Scratch--

  • are trying to do the same thing at roughly the same time.

  • And if those two things, threads or humans,

  • check the state of a variable, which is a fancy way of saying

  • is the username available, they both get back answers.

  • But then some number of split seconds later,

  • then they make a decision based on that information, there is a window of time,

  • either split seconds or even seconds or minutes,

  • where the state of that variable could, of course, change.

  • So if you two, literally, right now on your laptop,

  • typed that very long username, all of us would probably be told, green light,

  • it's available.

  • But only one of us is actually going to get it.

  • And that's because of a race condition.

  • Literally, all of us might be racing to sign up for that value.

  • And it's when state can change in between things happening.

  • This is a bad thing because it makes your data

  • vulnerable to changes by someone you don't necessarily intend.

  • Or if the database isn't smart, you might

  • be able to do especially bad things.

  • ATMs are a canonical example of this.

  • If you had a malicious adversary trying to log into two bank accounts at once

  • or two physical machines at once, either with two cards or two accounts and two

  • laptops, you could imagine both of them trying to deduct, like,

  • $100 from the same account instantly.

  • Because imagine a poorly-implemented bank website.

  • It checks the account balance of the user logged in.

  • Do you have $100?

  • If the answer is yes, maybe both websites are going to say yes,

  • you may deduct $100.

  • You hit enter and voila, you deduct $100.

  • The user gets it somehow because it's transferred to some other account.

  • But the bank thinks it only did that once, deducts $100,

  • but you've just walked away with $200 because you made a decision based

  • on the same answer in two different threads

  • or two different programs or two different computers.

  • So long story short, this can happen even in the real world.

  • An example I was taught by my advisor years ago was this.

  • Suppose you and your roommates have a little dorm fridge.

  • And you're in the habit, of course, of drinking a lot of milk.

  • And so the fridge has run out of milk.

  • And you come home, the first roommate after classes.

  • And you realize, oh, I really need a drink of milk.

  • And so you check the fridge.

  • There's nothing there.

  • So you close the fridge, and you walk into the square, go to CVS,

  • and get in line to buy some milk.

  • Meanwhile, your roommate comes home.

  • Also, they really need a drink of milk.

  • And so they check the state of the variable.

  • Argh, no milk.

  • Close the fridge, and then walk to like Tommy's Convenience or some other place

  • nearby and get in line for some milk.

  • You of course, then both get home eventually.

  • And what happens now?

  • Dammit, now you have twice as much milk.

  • And milk goes bad quickly.

  • So now this is a problem, a very bad problem.

  • You have twice as much milk as you could possibly drink.

  • But what's the origin of that problem fundamentally?

  • AUDIENCE: You're out of something.

  • DAVID J. MALAN: You're out of something, but--

  • AUDIENCE: You need it.

  • DAVID J. MALAN: --you need it.

  • But why did I end up with two?

  • AUDIENCE: There's no flag.

  • DAVID J. MALAN: There's no flag, right?

  • There's no indication.

  • There's no sharing of state.

  • You both inspected the value of the variable,

  • made a decision independently on it.

  • But the state of that variable changed on one of you.

  • Because when one of you came home, the later person, damn, like,

  • the milk has already been refilled.

  • So how do you solve this?

  • In the real world, how could you avoid this problem?

  • You just, one, never do errands for your roommate.

  • AUDIENCE: The magnet you put on the refrigerator says get this.

  • DAVID J. MALAN: Mag-- yes.

  • A shopping list, right-- gone for milk--

  • Arrested Development.

  • Always leave a note, right?

  • You could convey that information.

  • You could more dramatically lock the refrigerator, right?

  • Padlock the thing, and so your roommate can't

  • inspect the state of the refrigerator while you are gone, therefore,

  • not making us vulnerable to this.

  • And I use the word lock deliberately because,

  • in databases, that's how they solve this.

  • There is a feature in databases called locks.

  • Or fancier versions of this are called transactions,

  • whereby, you can guarantee something called atomicity,

  • where atomicity means you can do multiple things back to back to back

  • without getting interrupted.

  • So in the case of a bank, it is possible, with SQL, using slightly

  • fancier syntax that we won't dive into today to solve this problem by saying,

  • you know what?

  • Begin the following transaction.

  • Check the state of the bank account, deduct this amount of money,

  • and now commit the results.

  • And while I'm doing that, lock everyone else out.

  • Don't let any other customer or any other user

  • do exactly that information that touches the same data until I am done.

  • Long story short-- you pay a price, perhaps.

  • You're literally preventing your roommate from accessing the fridge,

  • and that's annoying.

  • Or you're preventing other customers from doing transactions.

  • So hopefully the computer is fast at this, and your fast at shopping.

  • But you've at least ensured that you have atomicity.

  • No operation can get inserted into your sequence of operations

  • as by your roommate or some other computer or thread.

  • So that's a problem with databases that we're only going to skirt over.

  • And GitHub might solve this, how?

  • Well, by just not caring, potentially.

  • I don't know what's going to happen if multiple of us try.

  • I'm guessing they will just give n minus 1 of us an error message saying, sorry,

  • that username is no longer available.

  • Think about this.

  • If you've ever bought airline tickets, this

  • is a solved problem in that industry.

  • That would be really annoying.

  • If you just spent an hour of stressful price-hunting

  • for a good airplane ticket, you start checking out

  • after adding it to your shopping cart.

  • And five minutes later, after your name and email address and credit card

  • number, the ticket is gone.

  • So what do airlines do?

  • They often give you a five-minute window.

  • And some of the fancier websites show you the clock saying,

  • we guarantee this for the next five minutes.

  • Hotels might do this too where they locked the refrigerator for you

  • by somehow altering the database to say, mm-mm.

  • No one else can buy this ticket or this room for the next five minutes,

  • much like the note or the padlock.

  • So those kinds of things are all around us.

  • But let's look at one final example that's the worst threat of all is this.

  • Previously, I allowed myself to search by name.

  • So q equals Brian or q equals David or Veronica or the like.

  • And what did I do with that information?

  • Well, if we go back into the IDE and actually look

  • at that file in application.py, I simply formatted it using an F string

  • inside of this SQL string.

  • But what if my users were a little bit malicious?

  • And suppose that someone doesn't want to just search for Brian.

  • But you know what?

  • Suppose they do something like, my query is DELETE FROM registrants WHERE--

  • sorry Brian-- NAME equals Brian--

  • something like this.

  • Now, this is not valid at the moment, because this string,

  • while I'm certainly allowed to type it in,

  • is going to get plugged into my code, but in the wrong place logically.

  • Like, I'm going to look for someone's name called "DELETE FROM registrants

  • WHERE name equals Brian," which is just nonsensical.

  • It will return 0 results.

  • But what if I do something like this where I say,

  • Brian or DELETE from registrants where I finish

  • the thought that the programmer had and then start my own new thought.

  • Or another way of doing this is to use special syntax semicolon, something

  • like this.

  • Long story short, I could contrive a human malicious input that

  • finishes the programmer's thought and returns zero rows, but by the way,

  • also sneaks one additional rogue query into the database.

  • This is what's known as a SQL injection attack.

  • And if you naively and very, very, very badly and incorrectly write code like I

  • did-- don't ever do this--

  • you will be vulnerable to exactly this attack

  • because you are blindly plugging in the user's input to a string

  • that you are then passing to a database.

  • This is a fundamental flaw in lots of applications, lots of languages

  • where you have to distrust your users.

  • It doesn't matter if it's for just students on campus

  • or it's just for you and your friends.

  • Never ever, ever trust users' input because either someone's

  • going to mistype something and something is going to go awry,

  • or you're going to have a bad apple trying

  • to hack into your website or your application

  • by trying these kinds of commands.

  • And you have to always write code defensively.

  • So how to do this?

  • There are a bunch of ways.

  • But it turns out that, what's dangerous about something

  • like I just typed in is that it's the semicolon, for instance,

  • and that's the quote marks over "elsewhere."

  • So the safest thing to do is, no matter what the user types in, escape things.

  • You can use special syntax.

  • We saw this in C-- generally, putting a backslash in front of something means,

  • don't let it have its default behavior.

  • Instead, treat it specially.

  • So you could use special code in Python that

  • just says remove any bad characters, or replace things.

  • Frankly, you've probably been to a website where you've been told,

  • sorry, you can't use that character in your password.

  • Or sorry, you can't use that in your username.

  • That's just dumb.

  • Like, that is the lazy approach to this.

  • There is no reason to prevent users from typing

  • any characters into their keyboard for their password and maybe even

  • their username.

  • That's kind of a lazy way of defending against this by saying,

  • mm-mm, I don't trust any percent sign, any semicolons,

  • any dashes, any apostrophes.

  • Rather, just escape things.

  • But it's silly for all of us in this room

  • to write our own code for scaping users' input or scrubbing it, as it's called,

  • or sanitizing it, as it's called-- same things.

  • Why don't we just use a library?

  • Now, there are many libraries out there.

  • The one that we're using at the moment is CS50's.

  • And the EXECUTE function does this for us.

  • Instead of using F strings, which you should not use like this.

  • You should instead do this.

  • If you want to plug in a placeholder value to a SQL query,

  • you literally use a standard convention, that we have adopted too,

  • where you just put in a variable's name, but with a colon in front of it.

  • And it can be anything.

  • It can be q.

  • It can be x.

  • It doesn't matter.

  • But you want to just plug in some value there.

  • So I'm going to call it name, by convention.

  • Then you close your quote and finish your thought.

  • And then you go ahead and pass in the actual value, name equals q.

  • And now you have constructed, dynamically,

  • a SQL string with a place holder that is not

  • Python's own curly brace placeholder.

  • This is a special SQL convention where you say plug in value here.

  • What value?

  • We'll plug in this names value, q, whatever the human has typed in.

  • And what are execute function will do for you

  • is all of the fancy backslashing and all of the escaping

  • and will protect you from the user's data.

  • And this is how truly simple it is.

  • It doesn't have to be CS50's library.

  • This is ever so common in all languages, but too few people know about it

  • and use it.

  • And so half the time you read about some database getting hacked

  • or your data getting stolen, it is because

  • of a stupid oversight like that.

  • So just use libraries and escape users' input.

  • We can see this now more concretely.

  • All of the undergrads in the room have surely

  • logged into, either Yale's website or Harvard's website,

  • which looks a little something like this here.

  • You're prompted for your login name and your password,

  • or your Harvard key or the like.

  • Well, how does this take effect in real terms?

  • If I were to type in my email address--

  • but then weird syntax like this-- let's look at an example.

  • Quote "or" quote unquote "1" equals quote "1."

  • Notice it's not balanced.

  • It's missing a quote over here, missing a quote over here.

  • Because the presumption is that maybe Harvard is vulnerable to this.

  • I don't think they are.

  • But suppose that the code running Harvard key

  • and Harvard's login page looks a little something like this.

  • This is bad.

  • This is dangerous because they're just using f strings or format

  • strings, which are just going to blindly plug anything in there.

  • And so if you let the human type in something cryptic like that,

  • notice what has happened logically.

  • Where username equals me at example your email provider.com and password

  • equals quote unquote, so nothing, or 1 equals 1.

  • And why 1 equals 1?

  • If I go back, notice that there's a quote here and a quote here.

  • And the reason that I didn't finish my second quote here or my second quote

  • here is because I'm assuming, as a bad guy,

  • I think Harvard is just going to blindly plug my input

  • into a single quotes of their own.

  • Therefore, I can finish their thought nonsensically.

  • But notice, logically, what happens.

  • Select all users from the database where the user name is me at example email

  • provider and the password is nothing.

  • Or 1 equals 1.

  • Well, when does 1 equal 1?

  • Like, always.

  • So this will always return users from the database,

  • and presumably, therefore, let me log in as one of those users--

  • so incredibly simple to defend against this.

  • Just use placeholder syntax and distrust and sanitize users' input.

  • The syntax in SQL and the CS50 library is quite simply with that colon.

  • But in other libraries, it might be quite the same.

  • So now you are all, families and students alike,

  • inaugurated into the small class of folks

  • in the world who understand particularly geeky humor.

  • You might notice this meme that's gone around the internet for many years

  • now where someone either maliciously or humorously

  • decided to paint this over their license platelets.

  • Let's enhance.

  • Why would someone do this?

  • AUDIENCE: Scanners.

  • DAVID J. MALAN: Scanners, yeah.

  • Tollbooths are going away, at least in the US.

  • And they instead have cameras or readers that are scanning the front of your car

  • and trying to optically do OCR, Optical Character Recognition,

  • on your license plate.

  • And the presumption here is, maybe in some municipality,

  • there's some badly written code where they just blindly plug

  • your license plate into their code.

  • And hopefully you finish the thought where ZU 0666--

  • whatever that is, it's part of the license plate--

  • but "semicolon drop database table dot dot dot."

  • And we didn't even look at that because DROP is pretty extreme.

  • It literally deletes a database itself.

  • But this is a nice way of getting off the hook from a total price.

  • And most canonical perhaps, XKCD is a very popular cartoon strip.

  • It's particularly geek-oriented.

  • And you'll perhaps understand this joke now as well among CS circles.

  • [LAUGHING]

  • I can hear the laughter making its way through.

  • So from here on out, if you take nothing else away,

  • remember little Bobby Tables with pset.

  • Our final, will you actually implement CS50 Finance

  • and coalesce all these ideas.

  • Thank you so much to all of our families for joining.

  • And we will see you next time.

  • [APPLAUSE]

[MUSIC PLAYING]

Subtitles and vocabulary

Click the word to look it up Click the word to find further inforamtion about it