This lecture is going to serve as an overview of what a probability distribution is and what main characteristics it has. Simply put, a distribution shows the possible values a variable can take and how frequently they occur.

Before we start, let us introduce some important notation we will use for the remainder of the course. Assume that “uppercase Y” represents the actual outcome of an event and “lowercase y” represents one of the possible outcomes. One way to denote the likelihood of reaching a particular outcome “y” is “P of Y equals y”. We can also express it as “p of y”. For example, uppercase “Y” could represent the number of red marbles we draw out of a bag and lowercase “y” would be a specific number, like 3 or 5. Then, we express the probability of getting exactly 5 red marbles as “P of Y equals 5”, or “p of 5”. Since “p of y” expresses the probability for each distinct outcome, we call this the probability function. Good job, folks!

So, probability distributions, or simply probabilities, measure the likelihood of an outcome depending on how often it features in the sample space. Recall that we constructed the probability frequency distribution of an event in the introductory section of the course. We recorded the frequency for each unique value and divided it by the total number of elements in the sample space. Usually, that is the way we construct these probabilities when we have a finite number of possible outcomes.

If we had an infinite number of possibilities, then recording the frequency for each one would be impossible, because… there are infinitely many of them! For instance, imagine you are a data scientist and want to analyse the time it takes for your code to run. Any single run could take anywhere from a few milliseconds to several days. Often the result will be between a few milliseconds and a few minutes. If we record time in seconds, we lose precision, which we want to avoid.
To avoid that, we would need to use the smallest possible measurement of time. Since every milli-, micro-, or even nanosecond could be split in half for greater accuracy, no such thing exists. Less than an hour from now we will talk in more detail about continuous distributions and how to deal with them.

Let’s introduce some key definitions. Now, regardless of whether we have a finite or infinite number of possibilities, we define distributions using only two characteristics: mean and variance. Simply put, the mean of the distribution is its average value. Variance, on the other hand, is essentially how spread out the data is. We measure this “spread” by how far away from the mean all the values are. We denote the mean of a distribution as the Greek letter “mu” and its variance as “sigma squared”. Okay.

When analysing distributions, it is important to understand what kind of data we have: population data or sample data. Population data is the formal way of referring to “all” the data, while sample data is just a part of it. For example, if an employer surveys an entire department about how they travel to work, the data would represent the population of the department. However, this same data would also be just a sample of the employees in the whole company. Something to remember when using sample data is that we adopt different notation for the mean and variance. We denote sample mean as “x bar” and sample variance as “s squared”.

One flaw of variance is that it is measured in squared units. For example, if you are measuring time in seconds, the variance would be measured in seconds squared. Usually, there is no direct interpretation of that value. To make more sense of variance, we introduce a third characteristic of the distribution, called standard deviation. Standard deviation is simply the positive square root of variance. As you might suspect, we denote it as “sigma” when dealing with a population, and as “s” when dealing with a sample.
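To make the population and sample notation concrete, here is a minimal Python sketch using the standard library's `statistics` module. The commute times are made-up numbers for illustration only, and treating the same list first as a population and then as a sample is just an assumption that lets us see both sets of formulas side by side.

```python
import statistics

# Hypothetical commute times in minutes (illustrative data, not from the lecture).
commute_minutes = [10, 20, 30, 40, 50]

# Treating the data as the whole population: mu, sigma squared, sigma.
mu = statistics.mean(commute_minutes)
sigma_squared = statistics.pvariance(commute_minutes)  # divides by N
sigma = statistics.pstdev(commute_minutes)             # square root of sigma squared

# Treating the same data as a sample: x bar, s squared, s.
x_bar = statistics.mean(commute_minutes)
s_squared = statistics.variance(commute_minutes)       # divides by n - 1
s = statistics.stdev(commute_minutes)                  # square root of s squared

print("population:", mu, sigma_squared, sigma)
print("sample:    ", x_bar, s_squared, s)
```

The mean is the same in both cases. Only the variance and standard deviation change, because the sample versions divide by one less than the number of observations. Note also that the variances come out in minutes squared, while the standard deviations are back in minutes.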
Unlike variance, standard deviation is measured in the same units as the mean. Thus, we can interpret it directly, which is why it is often preferable. One idea which we will use a lot is that any value between “mu minus sigma” and “mu plus sigma” falls within one standard deviation away from the mean. The more congested the middle of the distribution, the more data falls within that interval. Similarly, the less data falls within the interval, the more dispersed the data is. Fantastic!

It is important to know there exists a constant relationship between mean and variance for any distribution. By definition, the variance equals the expected value of the squared difference from the mean for any value. We denote this as “sigma squared equals the expected value of Y minus mu, squared”. After some simplification, this is equal to the expected value of “Y squared” minus “mu squared”. As we will see in the coming lectures, if we are dealing with a specific distribution, we can find a much more precise formula.

Okay, when we are getting acquainted with a certain dataset we want to analyse or make predictions with, we are most interested in the mean, variance and type of the distribution. In our next video we will introduce several distributions and the characteristics they possess. Thanks for watching!

4.2 Types of distributions

Hello, again! In this lecture we are going to talk about various types of probability distributions and what kind of events they can be used to describe. Certain distributions share features, so we group them into types. Some, like rolling a die or picking a card, have a finite number of outcomes. They follow discrete distributions and we use the formulas we already introduced to calculate their probabilities and expected values. Others, like recording time and distance in track & field, have infinitely many outcomes. They follow continuous distributions and we use different formulas from the ones we mentioned so far.
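For a discrete case, the formulas introduced so far can be checked by hand. The short Python sketch below assumes a single roll of a fair six-sided die (chosen here purely as an illustration) and computes the mean, the variance from its definition, and the variance from the simplified form “expected value of Y squared, minus mu squared”. The variable names are our own.

```python
from fractions import Fraction

# Probability function p(y) for one roll of a fair six-sided die (assumed example).
p = {y: Fraction(1, 6) for y in range(1, 7)}

# Mean: each outcome weighted by its probability.
mu = sum(y * prob for y, prob in p.items())

# Variance from the definition: E[(Y - mu)^2].
var_definition = sum((y - mu) ** 2 * prob for y, prob in p.items())

# Variance from the simplified form: E[Y^2] - mu^2.
var_shortcut = sum(y ** 2 * prob for y, prob in p.items()) - mu ** 2

# Standard deviation: positive square root of the variance.
sigma = float(var_definition) ** 0.5

# Share of the probability that lies between mu - sigma and mu + sigma.
within_one_sigma = sum(
    prob for y, prob in p.items() if mu - sigma <= y <= mu + sigma
)

print(mu, var_definition, var_shortcut, sigma, within_one_sigma)
```

Using exact fractions makes it easy to see that the two variance expressions agree exactly rather than only approximately.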
Throughout the course of this video we are going to examine the characteristics of some of the most common distributions. For each one we will focus on an important aspect of it or when it is used.

Before we get into the specifics, you need to know the proper notation we use when defining distributions. We start off by writing down the variable name for our set of values, followed by the “tilde” sign. This is followed by a capital letter depicting the type of the distribution and some characteristics of the dataset in parentheses. The characteristics are usually mean and variance, but they may vary depending on the type of the distribution. For example, we would write “X ~ N(μ, σ²)” for a variable X that follows a Normal distribution with mean “mu” and variance “sigma squared”. Alright!

Let us start by talking about the discrete ones. We will get an overview of them and then we will devote a separate lecture to each one. So, we looked at problems relating to drawing cards from a deck or flipping a coin. Both examples show events where all outcomes are equally likely. Such outcomes are called equiprobable and these sorts of events follow a Uniform Distribution.

Then there are events with only two possible outcomes: true or false. They follow a Bernoulli Distribution, regardless of whether one outcome is more likely to occur. Any event with two outcomes can be transformed into a Bernoulli event. We simply assign one of them to be “true” and the other one to be “false”. Imagine we are required to elect a captain for our college sports team. The team consists of 7 domestic students and 3 international students. We assign the captain being domestic to be “true” and the captain being international to be “false”. Since the outcome can now only be “true” or “false”, we have a Bernoulli distribution.

Now, if we want to carry out a similar experiment several times in a row, we are dealing with a Binomial Distribution. Just like the Bernoulli Distribution, each iteration has two possible outcomes, but we have many iterations.
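The step from a Bernoulli event to a Binomial one can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the 0.7 probability assumes the captain is picked uniformly at random from the 10 students, which the lecture does not state.

```python
from math import comb


def bernoulli_pmf(outcome: bool, p_true: float) -> float:
    """Probability of a single true/false outcome when P(true) = p_true."""
    return p_true if outcome else 1 - p_true


def binomial_pmf(k: int, n: int, p_true: float) -> float:
    """Probability of exactly k "true" outcomes in n independent Bernoulli trials."""
    return comb(n, k) * p_true ** k * (1 - p_true) ** (n - k)


# Assumption: the captain is chosen uniformly at random from
# 7 domestic ("true") and 3 international ("false") students.
p_domestic = 7 / 10

print(bernoulli_pmf(True, p_domestic))   # chance the captain is domestic
print(bernoulli_pmf(False, p_domestic))  # chance the captain is international

# Hypothetical repetition: the same selection carried out 3 independent times.
for k in range(4):
    print(k, binomial_pmf(k, 3, p_domestic))
```

With a single iteration, the Binomial probabilities reduce to the Bernoulli ones, which is why the two distributions are usually introduced together.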
For example, we could be flipping the coin we mentioned earlier 3 times and trying to calculate the likelihood of getting heads twice.

Lastly, we should mention the Poisson Distribution. We use it when we want to test out how unusual an event frequency is for a given interval. For example, imagine we know that so far LeBron James averages 35 points per game during the regular season. We want to know how likely it is that he will score 12 points in the first quarter of his next game. Since the length of the interval changes, so should our expectations for the outcome. Using the Poisson distribution, we are able to determine the chance of LeBron scoring exactly 12 points for the adjusted time interval.

Great, now on to the continuous distributions! One thing to remember is that since we are dealing with continuous outcomes, the probability distribution would be a curve as opposed to unconnected individual bars.

The first one we will talk about is the Normal Distribution. The outcomes of many events in nature closely resemble this distribution, hence the name “Normal”. For instance, according to numerous reports throughout the last few decades, the weight of an adult male polar bear is usually around 500 kilograms. However, there have been records of individual bears weighing anywhere between 350 kg and 700 kg. Extreme values, like 350 and 700, are called outliers and do not feature very frequently in Normal Distributions.

Sometimes, we have limited data for events that resemble a Normal distribution. In those cases, we use the Student’s-T distribution. It serves as a small sample approximation of a Normal distribution. Another difference is that the Student’s-T accommodates extreme values significantly better. Graphically, that is represented by the curve having fatter “tails”. Now imagine only looking at the recorded weights of the last 10 polar bear sightings across Alaska and Canada. The lower number of elements would make the occurrence of any extreme value represent a much bigger part of the data than it should. Overall, this results in more values extremely far away from the mean, so the curve would probably more closely resemble a Student’s-T distribution than a Normal distribution. Good job, everyone!

Another continuous distribution we would like to introduce is the Chi-Squared distribution. It is the first asymmetric continuous distribution we are dealing with, as it only consists of non-negative values. Graphically, that means that the Chi-Squared distribution always starts from 0 on the left. Depending on the average and maximum values within the set, the curve of the Chi-Squared graph is usually skewed to the right: most of the values pile up near 0 on the left, and a long tail stretches out to the right. Unlike the previous two distributions, the Chi-Squared does not often mirror real life events. However, it is often used in Hypothesis Testing to help determine goodness of fit.

The next distribution on our list is the Exponential distribution. The Exponential distribution is usually present when we are dealing with events that are rapidly changing early on. An easy to understand example is how online news articles generate hits. They get most of their clicks when the topic is still fresh. The more time passes, the more irrelevant the article becomes as interest dies off.

The last continuous distribution we will mention is the Logistic distribution. We often find it useful in forecast analysis when we try to determine a cut-off point for a successful outcome. For instance, take a competitive e-sport like Dota 2. We can use a Logistic distribution to determine how much of an in-game advantage at the 10-minute mark is necessary to confidently predict victory for either team. Just like with other types of forecasting, our predictions would never reach true certainty, but more on that later!

Woah! Good job, folks! In the next video we are going to focus on discrete distributions.
We will introduce formulas for computing Expected Values and Standard Deviations before looking into each distribution individually. Thanks for watching!

4.3 Discrete Distributions

Welcome back! In this video we will talk about discrete distributions and their characteristics. Let’s get started!

Earlier in the course we mentioned that events with discrete distributions have finitely many distinct outcomes. Therefore, we can express the entire probability distribution with either a table, a graph or a formula. To do so we need to ensure that every unique outcome has a probability assigned to it. Imagine you are playing darts. Each distinct outcome has some probability assigned to it based on how big its associated interval is. Since we have finitely many possible outcomes, we are dealing with a discrete distribution. Great!

In probability, we are often more interested in the likelihood of an interval than of an individual value. With discrete distributions, we can simply add up the probabilities for all the values that fall within that range. Recall the example where we drew a card 20 times. Suppose we want to know the probability of drawing 3 spades or fewer. We would first calculate the probability of getting 0, 1, 2 or 3 spades and then add them up to find the probability of drawing 3 spades or fewer.

One peculiarity of discrete events is that “the probability of Y being less than or equal to y equals the probability of Y being less than y plus 1”. In our last example, that would mean getting 3 spades or fewer is the same as getting fewer than 4 spades. Alright!

Now that you have an idea about discrete distributions, we can start exploring each type in more detail. In the next video we are going to examine the Uniform Distribution. Thanks for watching!

4.4 Uniform Distribution

Hey, there! In this lecture we are going to discuss the uniform distribution.
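As a brief recap of the previous lecture before we move on, the rule for adding up probabilities over a range can be sketched in Python. The lecture does not restate how the 20 cards were drawn, so this sketch assumes each card is drawn from a full deck and put back, which gives a 1 in 4 chance of a spade on every draw and makes the number of spades Binomial. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

N_DRAWS = 20                # number of cards drawn, as in the lecture example
P_SPADE = Fraction(13, 52)  # assumption: full deck, card replaced after each draw


def p(y: int) -> Fraction:
    """p(y): probability of drawing exactly y spades in N_DRAWS draws."""
    return comb(N_DRAWS, y) * P_SPADE ** y * (1 - P_SPADE) ** (N_DRAWS - y)


def prob_at_most(y: int) -> Fraction:
    """P(Y <= y): add up p(0), p(1), ..., p(y)."""
    return sum((p(k) for k in range(0, y + 1)), Fraction(0))


def prob_less_than(y: int) -> Fraction:
    """P(Y < y): add up p(0), p(1), ..., p(y - 1)."""
    return sum((p(k) for k in range(0, y)), Fraction(0))


# "3 spades or fewer" is the same event as "fewer than 4 spades".
print(float(prob_at_most(3)), float(prob_less_than(4)))
```

The two printed values coincide because, for whole-number outcomes, there is nothing between 3 and 4 that could carry any probability.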
For starters, we use the letter U to define a uniform distribution, followed by the range of the values in the dataset. Therefore, we read the statement “X ~ U(3, 7)” as “Variable X follows a discrete uniform distribution ranging from 3 to 7”. Events which follow the uniform distribution are ones where all outcomes have equal probability. One such event is rolling a single standard six-sided die. When we roll a standard six-sided die, we have an equal chance of getting any value from 1 to 6. The graph of the probability distribution would have 6 equally tall bars, all reaching