
  • So now we're going to talk about something that is kind of a specific part of Big Data

  • So the velocity part: huge amounts of data being generated all the time, which is essentially a data stream

  • So that's a flow of instances, so you could have a flow of images coming in, a flow of

  • video coming in, or just a flow of essentially lines going into a database. The thing about this dynamic data

  • is that the patterns within it can change. So if we've got, for example, a static machine learning model

  • That's not going to deal very well with a changing pattern happening in the data

  • We build a single model at the start and use it to make predictions on later data. The model's

  • accuracy can kind of degrade over time as that data changes

  • The problem of kind of designing algorithms to deal with this real-time data

  • has been a research topic for several years now, and there are several real-world applications on top of that as well

  • so if you think about

  • Banks trying to detect fraud: as the patterns of different frauds occurring change,

  • they want their models to be able to update all the time. Similarly for intrusion detection systems in computer networks

  • They want to be able to update

  • And keep on top of what is happening

  • Ideally, you would want this to happen automatically, so minimal intervention from humans, because otherwise they've got to spot when changes are happening

  • We just want the machines to be able to do it by themselves

  • So if you think about a traditional classification problem on a static batch of data

  • You assume you have all of that data there already. You have your training and test sets, and you have

  • instances with

  • features, which is X, and then there's some unknown

  • function f(X) which gives you the class label, and you want to find a

  • hypothesis that gives you the best prediction possible

  • So what kind of approximates this function as well as possible?

  • So say you have a red class and a green class, and we have instances that look like this. Our function f(X) may create

  • a class boundary that looks like this. So anything on this side is red, anything on this side is green

  • Our model doesn't know that, but we use standard machine learning techniques: decision trees, neural networks,

  • whatever you want, and it learns a boundary

  • that looks like that, and so that will do okay on whatever data we have

  • It's not perfect, but it may get the results that we want. This is static classification: we already have all our data

  • So we've got our data we've done our machine learning

  • This is the decision boundary that we've learnt; the dotted line is the actual boundary. This gives okay results
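
As a rough illustration of this static setup, here's a minimal sketch. The true function f, the red/green classes, and the simple threshold "hypothesis" are all made up for illustration; the threshold search just stands in for whatever learning algorithm you pick:

```python
import random

random.seed(0)

# Hypothetical unknown labelling function f(x): class depends on a line in 2-D.
def f(x):
    return "red" if x[0] + x[1] > 1.0 else "green"

# A static batch of training data, as in classic batch learning.
train = [(random.random(), random.random()) for _ in range(500)]
labels = [f(x) for x in train]

# A deliberately simple hypothesis family: threshold on the sum of the two
# features. We pick the threshold that best matches the training labels,
# standing in for "decision trees, neural networks, whatever you want".
def fit_threshold(xs, ys):
    best_t, best_acc = 0.0, -1.0
    for t in [i / 100 for i in range(201)]:
        acc = sum(("red" if x[0] + x[1] > t else "green") == y
                  for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

t = fit_threshold(train, labels)

# The learnt boundary approximates the true one, so accuracy on fresh
# data drawn from the same distribution stays high.
test = [(random.random(), random.random()) for _ in range(500)]
acc = sum(("red" if x[0] + x[1] > t else "green") == f(x) for x in test) / len(test)
print(round(t, 2), round(acc, 2))
```

The learnt threshold lands near the true boundary, which is exactly the "okay results" situation described: good as long as the data keeps coming from the same distribution.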

  • Let's now say that this is happening in a data stream. So we get this data originally and we build this model

  • But then later on we have a similar distribution of instances arriving

  • However, what now happens is that some of these instances are now in reality in a different class

  • so the true boundary is now here, but we still have our

  • Model with this decision boundary and so we're now predicting instances here and here into the wrong class if we use that

  • exact same model. So what we would see in this case, in

  • percentage accuracy over time, is that at this change point

  • accuracy would plummet. So this problem here is called real concept drift. What has effectively happened here

  • is that this function the unknown function has changed but we've kept our hypothesis our machine learning model exactly the same and so

  • It starts to perform badly
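
To make that concrete, here's a tiny simulation: one feature, and a hypothetical true boundary that moves from 0.5 to 0.8 at the drift point, while our learnt model keeps its original threshold:

```python
import random

random.seed(1)

def f_old(x):            # the unknown labelling function before the drift
    return x > 0.5

def f_new(x):            # after real concept drift, the boundary has moved
    return x > 0.8

model_threshold = 0.5    # our static hypothesis, learnt before the drift

def accuracy(concept, n=2000):
    xs = [random.random() for _ in range(n)]
    return sum((x > model_threshold) == concept(x) for x in xs) / n

acc_before = accuracy(f_old)   # the model matches the concept exactly
acc_after = accuracy(f_new)    # the exact same model, drifted concept
print(acc_before, acc_after)
```

Every instance that now falls between the old and new boundaries is predicted into the wrong class, so accuracy drops sharply even though the model itself hasn't changed.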

  • we can also have a similar problem called virtual drift and what would happen in this case is

  • that the

  • Target decision boundary has stayed the same from this original

  • But the instances we now see in the stream are somewhere else in the feature space. Let's say we now see

  • data

  • like this. So although the

  • kind of optimal decision boundary is in exactly the same place, we now have different data. That means that our predicted boundary

  • is going to classify this instance wrongly, because we haven't got a way of incorporating

  • information from this instance into the original model that we built. Both of these will create this decrease in accuracy. We can also

  • look at the drift in data streams in terms of the speed at which it happens. Something that would give us an accuracy plot that

  • looks like this is called sudden drift: we go straight from one concept in the data stream

  • So one decision boundary straight to another one another possible thing that could happen

  • Is that our accuracy looks like this?

  • So rather than this sudden switch, this decision boundary gradually shifts. Say, for example, we're looking at a very, very oversimplified

  • intrusion detection system. We have only two features that we're looking at. In the original dataset,

  • anything with these features is a

  • security

  • problem, an intrusion; anything on this side is good. In this case

  • What happens is that suddenly there's a new way of attacking the network and so suddenly

  • What was here is now not good. So we see those patterns and we say ok

  • No, that counts as an intrusion in this case

  • what it means is that we see something that we've not seen before so the model hasn't been trained with any similar data and

  • So it could get it right: it could fall somewhere up here and we correctly say this is bad

  • but it could also fall in an area where we didn't learn the decision boundary so well, so

  • yeah, we get that prediction wrong. We've just looked at what

  • the problems are with using a single static model when we're dealing with incoming data

  • Over time the distribution changes and we start to see a decrease in accuracy on whatever model we built

  • So what happens in kind of a stream machine learning algorithm would be: first of all,

  • you've got X arriving. This is your instance; in our previous example, this would just have two values associated with it

  • What would first happen is we make a prediction. So in the classification example, we classify this: yes,

  • it's an intrusion, or no, it's not an intrusion, using the current model that we have. Then what happens is we update whatever model we have

  • using information from X, and we'll talk about some of the ways that this is done in a second

  • One of the kind of caveats with stream machine learning is that for this to happen you

  • need to have

  • the real class label, if you're doing classification

  • So in order to incorporate information from this instance into whatever model you've got you need to have that label there now in some cases

  • it's very easy to say: we've seen this data, this is what it's classified as

  • And we do that immediately if we're thinking about

  • making weather predictions, we can almost immediately say yes, this is what the weather is like. There may be a day's delay,

  • but that's a pretty immediate thing. For things like fraud detection, for example,

  • you may see a pattern of data,

  • you may

  • predict it as not being fraud, and then suddenly, two days later, the person figures out that actually there's something wrong with their bank account

  • They phone up and it does turn out to be fraud

  • And so we'd only have the label for that data after that has happened
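
That predict-first, update-second loop can be sketched in a few lines. The "model" here is deliberately trivial (predict the last label seen) and the stream is made up, just to show where prediction, label arrival, and the model update fit:

```python
def stream():
    # A hypothetical labelled stream: the class flips halfway through.
    for i in range(100):
        yield "no" if i < 50 else "yes"

model = "no"                      # trivially simple model: the last label seen
correct = 0
for true_label in stream():
    prediction = model            # 1) predict using the current model
    correct += prediction == true_label
    model = true_label            # 2) the real label arrives: update the model
print(correct)                    # only the instance at the flip is mispredicted
```

Because the model updates on every labelled instance, only the single instance at the change point is mispredicted; a static model would get the whole second half wrong.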

  • The final bit is to update the model

  • at this point. And so the goal of updating the model over time is that rather than having a performance plot

  • that looks like this, where we go from 95% accuracy down to 20% accuracy,

  • We instead end up with something that okay

  • We may drift a little bit here and have a tiny performance decrease

  • But the model should very quickly recover back to the original level and we still have a high performance

  • So that's the goal of this model update. There's various approaches we can take so the first one is explicit drift handling

  • which means that we first of all detect when a drift happens in the data stream

  • So to do that

  • We have drift detection methods and these are usually statistical tests that look at some aspects of the data arriving

  • So if the distribution of the data we see arriving, or the distribution of the classes we see, is changing,

  • we can flag that as a drift. Some of these will also look at the predictive accuracy of the classifier

  • So if the classifier performance suddenly drops, we can say, well, we've probably got a drift here

  • We need to do something to the model to mitigate this

  • Who spots that though? Is it, you know, is there an algorithm that actually spots that something's different to what it should be

  • Yes, so there are various statistical tests that will do this

  • That will kind of just measure things like the mean of the data arriving and be able to spot things that have changed basically
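
A very stripped-down detector along those lines might monitor the classifier's recent error rate against a baseline. This is only in the spirit of methods like DDM: the error rates are simulated and the window size and threshold are hand-picked for illustration:

```python
import random

random.seed(2)

# Simulated per-instance error flags from some classifier: the error rate
# jumps from ~2% to ~50% at instance 500 (the drift point). Illustrative only.
errors = [1 if random.random() < (0.02 if i < 500 else 0.50) else 0
          for i in range(1000)]

baseline = sum(errors[:200]) / 200     # error rate during a stable period
window, drift_at = [], None
for i, e in enumerate(errors[200:], start=200):
    window.append(e)
    if len(window) > 50:
        window.pop(0)                  # keep only the 50 most recent flags
    # Flag a drift when the recent error rate rises well above the baseline.
    if len(window) == 50 and sum(window) / 50 > baseline + 0.2:
        drift_at = i
        break

print(drift_at)
```

The detector stays quiet while performance matches the baseline, then fires shortly after the simulated drift, which is the signal to go and do something to the model.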

  • So yeah, once we detected that a drift has happened

  • We then want to take some action. The first thing that we could do is we could do a complete replacement of the model

  • so we get rid of whatever model we had before, and

  • we

  • take a chunk of recent data

  • And we retrain the model on that and continue using that for predictions until we've hit another drift

  • This is okay. But it means that we could be getting rid of some information in the previous model

  • That is maybe still going to be useful in the future

  • So then there are also methods that will look at specific parts of the model and say, okay, this specific part of it is

  • causing a performance decrease, so let's get rid of it. We can then

  • learn, from new instances, something to replace it that will do better, basically

  • so if you think of a decision tree

  • If you can detect that there are certain branches in that decision tree that are no longer

  • making good predictions, you can get rid of them and regrow the tree to perform better. Prune it? Yeah, exactly,

  • it is called pruning. Yeah, you prune the branches off the tree

  • that are no longer performing as you want them to. The alternative to explicit handling is to do implicit drift handling

  • So rather than looking at the data or looking at the performance and saying something has changed we need to take action

  • We're just continually taking action. There are various approaches to implicit drift handling

  • So the first and probably most simple one is to use a sliding window

  • So if we imagine we have the data stream with instances arriving like this

  • We could say we have a sliding window of three instances and we learn a model off them. We then

  • take the next three and learn a model off them. So as each instance arrives, we get rid of the oldest instance

  • And this makes the assumption that the oldest instances are the least relevant. This is usually the case

  • It's kind of a valid assumption to make so this performs

  • Okay
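
A minimal sliding-window sketch, with a window of three and a majority vote standing in for "learn a model off them" (the stream and labels are invented for illustration):

```python
from collections import Counter, deque

def stream():
    # Hypothetical stream whose class pattern flips halfway through.
    for i in range(40):
        yield "a" if i < 20 else "b"

window = deque(maxlen=3)     # newest instance in, oldest automatically dropped
predictions = []
for label in stream():
    # The "model" is just the majority label in the current window.
    predictions.append(Counter(window).most_common(1)[0][0] if window else "a")
    window.append(label)

truth = ["a"] * 20 + ["b"] * 20
correct = sum(p == t for p, t in zip(predictions, truth))
print(correct)
```

Because the oldest instances fall out of the window, the model flushes the old concept within a few arrivals of the drift, mispredicting only a couple of instances around the change point.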

  • The problem with this, though, is that it kind of provides a crisp cut-off point: every

  • instance within this window has exactly the same

  • kind of impact on the classifier; they're weighted the same. So we can introduce instance weighting

  • So that older instances will have a lower weight their impact on the classifier will be less

  • So again, the more recent instances will have the largest impact on the current model

  • And then again, these algorithms that use instance weighting will usually have

  • some threshold, so once the weight gets below a certain point, they say that instance is gone, and we delete it
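
Here's one way instance weighting with a deletion threshold might look. The decay factor, threshold, and weighted-vote "model" are all illustrative choices, not a specific algorithm from the literature:

```python
decay, threshold = 0.5, 0.1
weighted = []                        # list of (label, weight) pairs

for label in ["a", "a", "a", "b", "b"]:
    # Each new arrival decays every existing weight; instances whose weight
    # falls below the threshold are deleted outright.
    weighted = [(l, w * decay) for l, w in weighted if w * decay >= threshold]
    weighted.append((label, 1.0))    # the newest instance gets full weight

# The "model" is a weighted vote, so recent instances dominate older ones.
votes = {}
for l, w in weighted:
    votes[l] = votes.get(l, 0.0) + w
prediction = max(votes, key=votes.get)
print(weighted, prediction)
```

Unlike the crisp window cut-off, older instances fade out gradually: the three "a" instances still contribute, but the two recent "b" instances outweigh them.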

  • Presumably the windows can be larger or smaller?

  • Yes, so setting the window size is a pretty important parameter

  • if you have a window, that is too large then

  • Okay, you're getting a lot of data to construct your model from, which is good in a sense, because learning from more data is usually good

  • What it also means is that if there are very short-term drifts,

  • this drift happens and then we don't learn from that drift, if that makes sense, because we see it all as one

  • chunk of the data again

  • If you instead set the window to be small, we can react very well to very short-term drifts in the stream

  • But you then have a very limited amount of data to work on to construct the model

  • So there are methods that will automatically adjust the window size. So during times of drift the window size will get smaller

  • so we want to be very rapidly changing the model and then during times when everything is kind of very stable the

  • Window will grow to be as large as possible so that we can

  • Use as much data to construct this model as possible

  • So the problem with sliding windows and instance weighting is that you need all of those instances available to construct the model

  • continuously. So every time you add a new instance and delete another one, you need to reconstruct that model

  • So the way we can get around this is by using single pass algorithms

  • So we see each instance once use it to update the model and then get rid of that instance

  • It's probably still in long-term permanent storage, but in terms of what is being accessed to construct this model,

  • it's gone. So in that respect, you've got the information out of the instance, but you don't need the instance itself? Yeah, exactly

  • So we see the instance we incorporate what we can from it into the current model

  • We get rid of it, and that instance's impact is still in the model. An example would be a decision tree

  • So decision trees are kind of constructed by splitting nodes where we're going to get a lot of information gain

  • from making a split on a certain attribute

  • So as the data stream changes, the information gain that we might get at some of these nodes may change

  • So if we, say, get a new instance, it may say, okay,

  • now this actually makes this a split worth making

  • We can make that split continue growing the tree and then that instance can go we don't need it anymore

  • But we still have the information from it in our model
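
A single-pass sketch of that idea: keep only per-class running statistics (here, a running mean of one feature), update them incrementally, and discard each instance once it's been absorbed. Hoeffding trees do something similar with per-node split statistics; this nearest-mean "classifier" is just a simple stand-in:

```python
stats = {}   # class label -> (count, running mean); no raw instances are kept

def update(label, x):
    n, mean = stats.get(label, (0, 0.0))
    n += 1
    mean += (x - mean) / n           # incremental (single-pass) mean update
    stats[label] = (n, mean)         # the instance itself can now be discarded

def predict(x):
    # Predict the class whose running mean is closest to x.
    return min(stats, key=lambda label: abs(x - stats[label][1]))

for label, x in [("low", 1.0), ("low", 2.0), ("high", 9.0), ("high", 11.0)]:
    update(label, x)

print(stats, predict(3.0), predict(8.0))
```

Each instance is seen exactly once: its effect lives on in the counts and means, but the data itself never needs to be stored or revisited.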

  • So we've got our implicit and explicit drift handling approaches. You can also have hybrid approaches

  • So the explicit drift handling is very good at spotting sudden drift. So anytime there's a sudden change

  • There'll be a sudden drop in performance that's very easy to pick up on with a simple statistical test

  • But when we then add in the implicit drift handling on top of that

  • It means that we can also deal very well with gradual drift

  • So gradual drift is a bit more difficult to identify

  • Simply because if you look at the previous instance or like say that 10 previous instances

  • With a gradual drift, you're not going to see a significant change

  • So it's a lot harder to detect. By combining the implicit and explicit

  • drift handling methods, we end up with a performance plot that would look something like this

  • We maintain pretty good performance for the entire duration of the data that's arriving. The problems of a changing data distribution

  • are not the only problems with streams

  • and

  • So if you can imagine a very high-volume,

  • high-speed stream, you've got a lot of data arriving in a very short amount of time. If

  • you take a single instance of that data stream and it takes you, like, five seconds to process it,

  • but in that five seconds you've had ten more instances arrive, you're going to get a backlog of instances very, very quickly

  • So the model update stage needs to be very quick to avoid getting any backlog. The second problem is that with

  • These algorithms we're not going to have the entire history of the stream available

  • To create the current model

  • so the models need to be

  • For example the single path algorithms that can say we don't need the historical data that we have the information we need from it

  • But we don't need to access these

  • Because otherwise, you just end up with huge huge data sets

  • Having to be used to create these models all the time

  • And again, these streams are potentially infinite

  • We don't know when they're going to end and we don't know how much data they're going to end up containing

  • Most of the kind of well-known machine learning algorithms have been adapted in various ways to be suitable for streams

  • So they now include update mechanisms, so they're more dynamic methods. This includes decision trees, neural networks,

  • k-nearest neighbours. Clustering algorithms have also been adapted. So basically, for any classic algorithm you can think of, there are

  • multiple streaming versions of it now. So if you are interested in these streaming algorithms,

  • There's a few bits of software that you could look at

  • for example, there's the

  • MOA suite of algorithms, which interfaces with the WEKA data mining toolkit

  • This is free to download and use, and includes implementations of a lot of popular streaming algorithms. It also

  • includes ways to synthesize data streams, so to generate, essentially, a stream of data

  • That you can then run the algorithms on

  • and you can control the amount of drift that you get, how sudden it is, and things like that, and

  • that's quite good to play around with to see the effects that

  • different kinds of drift can have on accuracy. In terms of big data streams specifically,

  • there's software such as the Spark Streaming module for Apache Spark, as well

  • as the more recent Apache Flink, which are designed to process very high-volume data streams very quickly

  • You just mentioned some software that people can download and have a play with, but, I mean, in the real world, as an industry, in

  • websites and things, the services that we use every day,

  • who is using these streaming algorithms? So a lot of the big companies, or most companies to be honest, will be generating data

  • Constantly that they want to model. So for example

  • Amazon recommendations: what to watch next, what to buy next. They want to

  • understand changing patterns so that they can keep updating

  • Whatever model they have to get the best

  • recommendations again

  • optimizing which ads to suggest based on

  • whatever

  • search history you have; that's another thing that is being done via this. So yeah, there are a lot of real-world applications for this stuff
