Subtitles section Play video Print subtitles What is the first thing that comes to your mind when you hear the word “database”? For many people, this question is more challenging than it might seem at first. An answer like “a big file where much information is stored” is not satisfactory and would not please potential employers. You should remember there are two main types of databases – relational and non-relational. The former will be the focus of this course, while the latter regards more complex systems. Although understanding non-relational databases requires a serious mathematical and programming background, some of the logic applied in its coding is the same as SQL. Likewise, relational databases have a few advantages on their own. A small bit of theory will explain why they are still the preferred choice in many companies and institutions. Databases’ main goal is to organize huge amounts of data that can be quickly retrieved upon users’ request. Therefore, they must be compact, well-structured, and efficient in terms of speed of data extraction. Today, people need such extra efficiency because data occupies memory space and… the bigger its size, the more sluggish the database is and the slower the retrieval process becomes. If we have a database containing one multi-million-row table, with many columns, then every time a request has been received, the server must load all the records, with all fields, and it would take too much time for a task to be completed. Don’t forget every symbol is a container of information and requires bytes of storage space. Hence, loading that much data will not be an easy job for the computer. So, what allows us to contain so much data on the server, but lets us efficiently use only the portions we need for our analysis? The secret lies behind the use of mathematical logic originating from relational algebra. Please, don’t worry - we will not bother you with math. Imagine each table with data is represented by a transparent circle that contains all the data values of the table, categorized by columns or, as we will often call them, fields. Now, if our database consisted of only one table, a giant circle would represent the entire database, something like this huge table from our fictional example with the “Sales” database. And when we need a piece of information from the database, for example if we wish to see who has bought something on a certain date, we will have to lift this whole big circle and then search for what we need. This challenge seems vague and the process of data extraction will not be efficient. See what can happen if we split the circle into 3 smaller circles, just as we did with the “Sales” database. One circle will stand for the “Sales” table, the other for “Customers”, and the last one for “Items”. There are various theoretical combinations between 3 or more circles, but in our database, we have the following model. “Sales” and “Customers” have the same “customer ID” column, and “Sales” and “Items” have the same “item code” column. This way, we can see the circles overlap as they have common fields. So, if we’d like to extract the same information, the names of the customers who have purchased something on a given date, we will need only the “date of purchase” column from the “Sales” table and the “first” and “last name” from the “Customers” table. So, to satisfy this request, we will not need to lift the third circle from our database, “Items”. This way, we can save energy or, more technically, increase efficiency. Less data, represented as only two of the three circles, will be involved in this operation. The mathematical trick lies in relating the tables to one another. Relationships were formed namely through these common fields. Anyway, I am sure that now you understand why we use the term relational databases. Some professionals may refer to the tables, or the circles in our plot, as relations because, theoretically, they are the smallest units in the entire system that can carry integral logical meaning. Likewise, the three circles are all part of our “Sales” database. When we combine the database and its existing relations, we obtain the famous term relational database management system, frequently abbreviated as RDBMS. ? We hope this theoretical illustration makes things clearer. SQL is designed for managing relational database management systems and can do that by creating relations between the different tables in a database. To complete our theoretical preparation, we will need to learn more database terminology. This is a necessary step that will help us when we start coding in SQL later. Let’s go through the entire process of creating a database. Assume our database containing customer sales data has not been set up yet, ok? So, imagine you are the shop owner and you realize you have been selling goods quite well recently, and you have more than a million rows of data. What do you need, then? A database! But you know nothing about databases. Who do you call, an SQL specialist? No. You need a database designer. She will be in charge of deciding how to organize the data into tables and how to set up the relations between these tables. This step is crucial. If the database design is not perfect from the beginning, your system will be difficult to work with and wouldn’t facilitate your business needs; you will have to start over again. Considering the time and data (…and money!) involved in the process, you want to avoid going back to point 0. What do database designers actually do? They plot the entire database system on a canvas using a visualization tool. There are two main ways to do that. One is drawing an Entity-Relationship diagram, an ER diagram for brief. It looks like this… and, as its name suggests, the different figures represent different data entities and the specific relationships we have between entities. The connections between tables are indicated with lines. This way of representing databases is powerful and professional, but it is complicated. We will not focus in-depth on ER diagrams in this course, but you should know they exist and refer to the process of database design. Another form of representation of a database is the relational schema. This is an existing idea of how the database must be organized. It is useful when you are certain of the structure and organization of the database you would like to create. More precisely, a relational schema would look like this. It represents a table in the shape of a rectangle. The name of the table is at the top of the rectangle. The column names are listed below. All relational schemas in a database form the database schema. You can also see lines indicating how tables are related in the database. To this moment, it has been ideas, planning, abstract thinking, and design. At this stage, it would be correct to say SQL can be used to set up the database physically, as opposed to contriving it abstractly. Then, you can enjoy the advantages of data manipulation. It will allow you to use your dataset to extract business insights that aim to improve the performance and efficiency of the business you are working for. This process is interesting, and, practically, the main part of the course will be related to that. Remember – well thought-out databases that are carefully designed and created are crucial prerequisites for data manipulation. If we have done good work with these steps, we could write effective queries and navigate in a database rather quickly. You will often hear the term database management. It comprises all these steps a business undertakes to design, create, and manipulate its databases successfully. Finally, database administration is the most frequently encountered job amongst all. A database administrator is the person providing daily care and maintenance of a database. Her scope of responsibilities is narrower regarding the ones carried out by a database manager, but she is still indispensable for the database department of a company. All right! So… In this lesson, we will use the “Sales” database to illustrate the concept of relational schemas a little better. The data will be stored in 4 tables – “Sales”, “Customers”, “Items”, and “Companies”. You see how these tables have a tabular shape. Let’s see how a relational schema can be applied to represent them. Let’s build the landscape piece by piece. We care about the sales per customer of our company, and that’s why we have a “Sales” table. It tells us the number with which each purchase was registered, the date of the purchase, the respective customer ID, and the item code. So, in this table, the dates of a few purchases may coincide, because it is normal to sell some of the same goods in the same day. Analogically, in the third column, the ID of a customer may appear a few times, given that some customers make more than one purchase. People may have bought many units of the same product; hence, there is a possibility to see the same item code a few times in the last column. This reasoning does not apply to the first field, though. Each purchase is unique! In databases, this means all the numbers in this column will be different, because each purchase will be assigned a specific number that is unique. There is a term for such type of field. A column (or a set of columns) whose value exists and is unique for every record in a table is called a primary key. Each table can have one and only one primary key. In one table, you cannot have 3 or 4 primary keys. For instance, in our “Sales” table, “number of purchase” can act as a single-column primary key, and there will be no other primary keys. If the database creators decide, the primary key of a table may be composed of a set of columns, not of just one column. In the “Sales” table, both “number of purchase” and “date of purchase” can be thought of as a unique-identifier pair for the data in this table. For instance, purchase number 1 and purchase date, which is the 3rd of September, will form a unique pair and so will purchase number 2 and the same date, 3rd of September. This means these 2 rows are different. Please, don’t overlook the fact you cannot have a combination of the same purchase number, 1, and the same date, 3rd of September, more than once in your table. Either the number or the date must be different. This is because, as we already said, a primary key must be unique.