[MUSIC PLAYING] Last episode, we used a decision tree as our classifier. Today we'll add code to visualize it so we can see how it works under the hood. There are many types of classifiers you may have heard of before-- things like neural nets or support vector machines. So why did we use a decision tree to start? Well, they have a very unique property-- they're easy to read and understand. In fact, they're one of the few models that are interpretable, where you can understand exactly why the classifier makes a decision. That's amazingly useful in practice. To get started, I'll introduce you to a real data set we'll work with today. It's called Iris. Iris is a classic machine learning problem. In it, you want to identify what type of flower you have based on different measurements, like the length and width of the petal. The data set includes three different types of flowers. They're all species of iris-- setosa, versicolor, and virginica. Scrolling down, you can see we're given 50 examples of each type, so 150 examples total. Notice there are four features that are used to describe each example. These are the length and width of the sepal and petal. And just like in our apples and oranges problem, the first four columns give the features and the last column gives the labels, which is the type of flower in each row. Our goal is to use this data set to train a classifier. Then we can use that classifier to predict what species of flower we have if we're given a new flower that we've never seen before. Knowing how to work with an existing data set is a good skill, so let's import Iris into scikit-learn and see what it looks like in code. Conveniently, the friendly folks at scikit provided a bunch of sample data sets, including Iris, as well as utilities to make them easy to import. We can import Iris into our code like this. The data set includes both the table from Wikipedia as well as some metadata.
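The on-screen code is not part of this transcript, so here is a minimal sketch of the import step, assuming scikit-learn is installed. `load_iris` is the loader in `sklearn.datasets`; the variable name `iris` and the print format are assumptions, and the sketch also prints the metadata and first example described next.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data set (features, labels, and metadata).
iris = load_iris()

# Metadata: names of the four features and the three flower types.
print(iris.feature_names)
print(iris.target_names)

# First example: its four measurements and its integer label.
print(iris.data[0])
print(iris.target[0])

# Iterate over all 150 examples to print the whole data set.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```

Each label is an index into `iris.target_names`, so a label of 0 corresponds to the first name in that list.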
The metadata tells you the names of the features and the names of different types of flowers. The features and examples themselves are contained in the data variable. For example, if I print out the first entry, you can see the measurements for this flower. These index to the feature names, so the first value refers to the sepal length, and the second to sepal width, and so on. The target variable contains the labels. Likewise, these index to the target names. Let's print out the first one. A label of 0 means it's a setosa. If you look at the table from Wikipedia, you'll notice that we just printed out the first row. Now both the data and target variables have 150 entries. If you want, you can iterate over them to print out the entire data set like this. Now that we know how to work with the data set, we're ready to train a classifier. But before we do that, first we need to split up the data. I'm going to remove several of the examples and put them aside for later. We'll call the examples I'm putting aside our testing data. We'll keep these separate from our training data, and later on we'll use our testing examples to test how accurate the classifier is on data it's never seen before. Testing is actually a really important part of doing machine learning well in practice, and we'll cover it in more detail in a future episode. Just for this exercise, I'll remove one example of each type of flower. And as it happens, the data set is ordered so the first setosa is at index 0, and the first versicolor is at 50, and so on. The syntax looks a little bit complicated, but all I'm doing is removing three entries from the data and target variables. Then I'll create two new sets of variables-- one for training and one for testing. Training will have the majority of our data, and testing will have just the examples I removed. Now, just as before, we can create a decision tree classifier and train it on our training data. 
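The split and training steps described above could be sketched as follows. This is one plausible reconstruction rather than the exact code from the video: the names `test_idx`, `train_data`, `train_target`, `test_data`, and `test_target` are assumptions, and `random_state=0` is added here only to make runs repeatable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()

# Indices of the first setosa, versicolor, and virginica examples.
test_idx = [0, 50, 100]

# Training data: everything except the three held-out rows.
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

# Testing data: only the three held-out rows.
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

# Create a decision tree classifier and train it on the training data.
clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)
```

`np.delete` returns a copy with the listed entries removed, so `iris.data` and `iris.target` themselves are left unchanged; `axis=0` makes it drop whole rows of the feature table.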
Before we visualize it, let's use the tree to classify our testing data. We know we have one flower of each type, and we can print out the labels we expect. Now let's see what the tree predicts. We'll give it the features for our testing data, and we'll get back labels. You can see the predicted labels match our testing data. That means it got them all right. Now, keep in mind, this was a very simple test, and we'll go into more detail down the road. Now let's visualize the tree so we can see how the classifier works. To do that, I'm going to copy-paste some code in from scikit's tutorials, and because this code is for visualization and not machine-learning concepts, I won't cover the details here. Note that I'm combining the code from these two examples to create an easy-to-read PDF. I can run our script and open up the PDF, and we can see the tree. To use it to classify data, you start by reading from the top. Each node asks a yes or no question about one of the features. For example, this node asks if the petal width is less than 0.8 centimeters. If it's true for the example you're classifying, go left. Otherwise, go right. Now let's use this tree to classify an example from our testing data. Here are the features and label for our first testing flower. Remember, you can find the feature names by looking at the metadata. We know this flower is a setosa, so let's see what the tree predicts. I'll resize the windows to make this easier to see. And the first question the tree asks is whether the petal width is less than 0.8 centimeters. That's the fourth feature. The answer is true, so we proceed left. At this point, we're already at a leaf node. There are no other questions to ask, so the tree gives us a prediction, setosa, and it's right. Notice the label is 0, which indexes to that type of flower. Now let's try our second testing example. This one is a versicolor. Let's see what the tree predicts.
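The prediction, visualization, and top-down reading described above can be approximated in a short script. The video renders a PDF with Graphviz; this sketch swaps in `sklearn.tree.export_text`, which prints the same questions as indented text and needs no extra tools. The helper `trace` is hypothetical (not from the video), the variable names are assumptions, and the exact questions can differ from the video's tree, because scikit-learn may break ties between equally good splits differently.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

clf = tree.DecisionTreeClassifier(random_state=0)
clf.fit(train_data, train_target)

# Labels we expect, then labels the tree predicts.
print(test_target)
predicted = clf.predict(test_data)
print(predicted)

# Plain-text rendering of the tree (stand-in for the PDF in the video).
rules = tree.export_text(clf, feature_names=list(iris.feature_names))
print(rules)


def trace(clf, sample, feature_names):
    """Hypothetical helper: walk one sample from the root to a leaf."""
    t = clf.tree_
    node = 0
    while t.children_left[node] != -1:  # -1 marks a leaf node
        feature = t.feature[node]
        threshold = t.threshold[node]
        answer = sample[feature] <= threshold
        print("%s <= %.2f? %s" % (feature_names[feature], threshold, answer))
        # True means go left, False means go right.
        node = t.children_left[node] if answer else t.children_right[node]
    # The leaf's prediction is the class with the largest share of samples.
    return clf.classes_[np.argmax(t.value[node])]


# Trace the first testing flower, which we know is a setosa.
first_label = trace(clf, test_data[0], iris.feature_names)
print(iris.target_names[first_label])
```

Note that scikit-learn phrases each question as "less than or equal to" a threshold, and the loop in `trace` follows the same left-for-true, right-for-false rule used when reading the diagram by hand.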
Again we read from the top, and this time the petal width is greater than 0.8 centimeters. The answer to the tree's question is false, so we go right. The next question the tree asks is whether the petal width is less than 1.75. It's trying to narrow it down. That's true, so we go left. Now it asks if the petal length is less than 4.95. That's true, so we go left again. And finally, the tree asks if the petal width is less than 1.65. That's true, so left it is. And now we have our prediction-- it's a versicolor, and that's right again. You can try the last one on your own as an exercise. And remember, the way we're using the tree is the same way it works in code. So that's how you quickly visualize and read a decision tree. There's a lot more to learn here, especially how they're built automatically from examples. We'll get to that in a future episode. But for now, let's close with an essential point. Every question the tree asks must be about one of your features. That means the better your features are, the better a tree you can build. And the next episode will start looking at what makes a good feature. Thanks very much for watching, and I'll see you next time. [MUSIC PLAYING]
Visualizing a Decision Tree - Machine Learning Recipes #2 (posted by scu.louis on 2017/07/16)