Uber System Design | Ola System Design | System Design Interview Question - Grab, Lyft

Hello, everyone! I'm Sandeep from codeKarle.
And in this video, let's look at how we design a cab booking system, something like an Uber or OLA or Lyft or something of that sort.
Let's start with the functional and non functional requirements that we want this system to support.
So, very first thing is - as a customer, when you open your app, you should be able to see what cabs are around you.
So, that is the See Cab feature in your vicinity.
The next thing is - if you want to book a cab from one place, point A to point B let's say, you need to know how much time it will take to travel
from point A to point B. And, you should be able to get an approximate price of how much would it cost you to book a cab via this platform.
The next thing is - you should obviously be able to book a cab.
We'll not get into the varieties of.. types of cab like go, premium and all of that. We'll just assume that there is one type of cab.
You can add more features, simply! That's not a big trouble.
The next thing is - there should be a very good location tracking of what driver was at what place at what point in time, for various reasons.
From a non-functional standpoint, this platform should be global and it should be accessible to people of all countries.
At least, we'll design it in that way.
What that means is - you need to have servers in a lot of geographies so as to make sure that people in a certain geography
are accessing the servers near to them and thereby reducing the latency.
Next obvious thing - it needs to work at a fairly low latency.
This is not very, very mission critical, but it still needs to be, you know, reasonably fast.
Availability should be very high.
This system should not go down. It will cause a lot of problems to people who are stuck somewhere, if the system is down.
And at the same time, it should have high consistency.
High Availability - High Consistency, might seem to you that it is trying to violate the CAP theorem, which basically says that -
assuming all the systems in the world are distributed nowadays and so have to tolerate network partitions, out of availability and consistency, you can just get one.
The idea is - certain components of this need to be very highly available and certain other components need to be very highly consistent,
not both at the same time.
From a scale standpoint, this system should scale to a very good number.
If I look at just some of the statistics of Uber, there are roughly 100 million active users that use Uber on a monthly basis. These are unique users.
And Uber, in general, does roughly 14 million rides per day.
So with that thought in mind, let's try to look at a system that can scale up to these numbers, with these criteria.
Now, the main problem that companies like Uber are trying to solve is - when you have a customer's location, who wants to book a cab,
you try to find out a few drivers who are very near to this location and then using some logic,
try to come up with the best driver who is suited to do this trip for this customer.
So the problem then becomes, how do you find these 2 - 3 closest drivers to this customer?
There are multiple ways to do that. We'll go over one of the ways which uses a concept called segment and mapping segments, basically.
Now, this is a term that I just made up. It's not an industry standard term. So, just keep that in mind.
Now let's just say, you have a city, something like this.
The idea is you basically divide it into rectangular segments.
So, you kind of divide this city into multiple pieces.
And you say that - this is probably your segment id 1.
This is your segment id 2, something of that sort.
Now, the idea is - you are dividing a city.
It'll normally be divided into a lot more segments than what you can see here.
Now, the idea is - given certain coordinates of the segment boundary and given certain coordinates of a cab,
you should be able to figure out which segment does a cab belong in.
Now, the problem looks trivial, and it is not difficult either.
So, think of it like a standard coordinate system.
This is point (0,0). This is (0,1). This is (1,0). This is (1,1).
Somewhere here.
Now, if a point lies somewhere in between here, let's say (0.4, 0.5),
you should be able to mathematically say that this point lies within this boundary.
A very similar logic we'll try to use when we try to assign a particular segment to a cab.
Also, keep in mind that cabs are continuously moving and their locations are continuously changing.
So, we'll try to make sure that we get continuous pings from all the cabs and then keep a track of which segment do they belong in.
A cab could be here right now. Could be in this particular segment. And the cab driver is going via a road, over here. Now, this changes the segment.
And this information would be calculated at runtime, as and when we are getting pings from the cab.
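To make the boundary check concrete, here is a minimal sketch in Python. It assumes segments are axis-aligned rectangles in the simple coordinate system described above. The names `Segment` and `find_segment` are made up for illustration, and a real system would use an index rather than a linear scan.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    segment_id: int
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def contains(self, x: float, y: float) -> bool:
        # Lower bounds are inclusive and upper bounds exclusive, so a point
        # on a shared edge belongs to exactly one segment.
        return self.min_x <= x < self.max_x and self.min_y <= y < self.max_y


def find_segment(segments: List[Segment], x: float, y: float) -> Optional[Segment]:
    # Linear scan, which is enough to show the idea.
    for segment in segments:
        if segment.contains(x, y):
            return segment
    return None


segments = [
    Segment(1, 0.0, 0.0, 1.0, 1.0),
    Segment(2, 1.0, 0.0, 2.0, 1.0),
]
found = find_segment(segments, 0.4, 0.5)
print(found.segment_id)  # prints 1
```

Each ping from a cab would go through a lookup like this, so the cab's segment is recomputed as it moves.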
So, we'll have something called a Map Service.
This Map Service will do a couple of things.
The very first thing is that - it will be responsible for dividing the city into these segments and taking care of the segments.
The other thing it will also do is - given a lat long of a cab and given a lat long of a customer,
tell which segment do these users belong to at this point in time.
This service will also be used to calculate ETA from point A to point B and the route from point A to point B and thus even the distance.
But, we'll abstract out. We'll not go into much detail on how that ETA and distance piece is implemented.
Think of it for now as if we'll be using a Google Maps Service and we'll go over the details of implementation of that
when we do another system design video on implementing Google Maps.
With that being said, there is also one more thing that this service does.
So let's just say, there is a huge amount of traffic or huge amount of cab drivers in this particular segment. And it is getting unmanageable.
So the idea of segment is - it should be a small set of drivers that are in the segment.
So this service will take care of dividing this segment into multiple parts. It could do it into 4 parts. It could divide it into 6 parts.
That logic resides within Map Service.
Let's just say, there is very little traffic in some other locality.
So, it could also decide to merge a couple of segments into one segment and say, this whole thing is now one segment.
So all the segment management remains within this service.
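As a rough sketch of the 4-part split, assuming a segment is a rectangle given as (min_x, min_y, max_x, max_y): the threshold `MAX_DRIVERS_PER_SEGMENT` and both function names are hypothetical, and the real splitting logic could be very different.

```python
from typing import List, Tuple

Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

MAX_DRIVERS_PER_SEGMENT = 100  # hypothetical threshold, not a real figure


def needs_split(driver_count: int) -> bool:
    return driver_count > MAX_DRIVERS_PER_SEGMENT


def split_into_quadrants(bounds: Bounds) -> List[Bounds]:
    min_x, min_y, max_x, max_y = bounds
    mid_x = (min_x + max_x) / 2
    mid_y = (min_y + max_y) / 2
    return [
        (min_x, min_y, mid_x, mid_y),  # bottom-left
        (mid_x, min_y, max_x, mid_y),  # bottom-right
        (min_x, mid_y, mid_x, max_y),  # top-left
        (mid_x, mid_y, max_x, max_y),  # top-right
    ]


busy_segment = (0.0, 0.0, 2.0, 2.0)
if needs_split(driver_count=250):
    print(split_into_quadrants(busy_segment))
```

Merging would be the reverse: take the bounding rectangle of a few quiet neighbouring segments and treat it as one segment.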
Now, let's look at the overall architecture and how individual users and drivers get connected to the system.
All the users get connected through this User App, which talks to a Load Balancer, which talks to something called a User Service.
This User Service is your repository of all the user information.
Plus, it is also a proxy, that will connect to other services to get any information that a user wants.
So, for example, if a user wants to see their profile, update their profile, all the APIs to do that are powered by User Service.
If somebody wants to fetch user information, any other service for example, then all the APIs for that are powered by this service.
Let's say, if a user wants to see their trips, then this User Service will talk to Trip Service, fetch all the trips for that user and send it back to the user.
So that's the responsibility of User Service.
From a database standpoint, it sits on top of a MySQL cluster, which stores all the user information within that.
And it also uses a Redis for caching the same information.
So let's say, a GET API to get a user's information is called, it first queries the Redis. If it has information, it returns from there. If it doesn't have
information, then it queries a MySQL slave, fetches that information, stores it in Redis and then returns back to whoever was calling.
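That read path can be sketched as follows. This is only an illustration: plain dicts stand in for Redis and for the MySQL read copy, and `UserService.get_user` is a made-up name, not a real API.

```python
from typing import Optional


class UserService:
    """Cache-first reads; plain dicts stand in for Redis and MySQL."""

    def __init__(self, database: dict, cache: dict) -> None:
        self.database = database
        self.cache = cache
        self.database_reads = 0  # counted only to make the behaviour visible

    def get_user(self, user_id: str) -> Optional[dict]:
        cached = self.cache.get(user_id)
        if cached is not None:
            return cached  # cache hit, the database is not touched
        self.database_reads += 1
        user = self.database.get(user_id)
        if user is not None:
            self.cache[user_id] = user  # fill the cache for the next caller
        return user


users_db = {"u1": {"name": "Asha"}}
service = UserService(database=users_db, cache={})
service.get_user("u1")  # miss: read from the database and fill the cache
service.get_user("u1")  # hit: served from the cache
print(service.database_reads)  # prints 1
```

A real setup would also need to expire or invalidate the cached entry when the profile is updated, which this sketch leaves out.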
The next flow that the user calls is basically when they try to book a cab.
The whole screen that the customer, kind of, goes through when they are trying to book a cab is powered by this Cab Request Service.
Essentially what it does is -
it basically makes a WebSocket connection with the User's App, which displays them a few cabs onto their UI which are around them.
Also, it places a request with something called a Cab Finder. We'll go over this in the next section. Whenever Cab Finder responds back with a cab,
this Cab Request Service talks to the User App, basically sends them a response, through this WebSocket connection,
saying that the cab is booked and these are the details and whatever is required.
These are approximately the major user flows.
Now quickly, let's look at the driver flows.
A driver basically talks via the Driver App.
Again, there is a Driver Service, very similar to the User Service, but for drivers.
Again, all the APIs for getting and updating driver information are powered via this.
If a driver wants to see their payment information for example, their payment history, this Driver Service will expose an API
which the Driver App will call and Driver Service will internally call a Payment Service to get that information and respond back.
It could call Trip Service to get all the trips of a driver. And, all the UI data gets powered by this service.
This service sits on top of another MySQL which has all the driver information.
And it uses Redis for caching in exactly the same way the User Service does.
This Driver App also talks to something called a Location Service through a series of servers,
again maintaining a WebSocket connection with these servers.
And, as and when a driver is moving through the city, every 5 seconds or 10 seconds, their location is being sent out to this Location Service,
which then queries the Map Service that we talked about earlier, to find out which segment does the driver belong to.
And when the customer places this cab request, customer's segment is calculated, driver segments are calculated
and they are mixed and matched by the Cab Finder and a couple of other components to give the best suited driver for a particular trip.
Now, let's look at how a customer and a driver come together.
All the active drivers in the system, who are online right now, ready for trips and all, are mentioned as D1, D2, D3.
There'll be a lot more such drivers.
All those drivers.. each of them is connected to one of the servers through a WebSocket connection.
And those servers are called out as WebSocket Handler 1, WebSocket Handler 2, WebSocket Handler 3.
Now, why do we need WebSocket here?
So, we always need a connection between the driver and a service.. for a lot of reasons.
One of the very first things is that - a driver continuously sends location pings to the backend, telling about the location.
Now, if each time they start creating a new connection, that's a kind of a heavy operation. So, we'll have this connection live.
Also, at times, the servers might want to talk to driver. So, let's say, if a trip is assigned to a driver, we need to inform the driver.
So, we can reuse the same connection to talk to a driver and tell them that this is the trip information that you have to do right now.
For all of that, there are these WebSocket Handler servers.
In the real world, there'll be hundreds of such servers, interfacing with all the drivers throughout the world, geographically split.
Now, let's say, somebody in the system identifies that a trip is being given to a driver and to reach out to the driver,
they first need to know that which out of these hundreds of WebSocket Handler servers, do I need to talk to.
For that, there is something called a WebSocket Manager.
Now, this WebSocket Manager is another distributed service which keeps track of which server is connected to which drivers.
So, let's say, D3 is a new driver that has come online right now and through the load balancer, it got connected to WebSocket Handler 3.
So, this Handler will inform the Manager, saying I have now got connected to D3 also. So, if there's anything for D3, inform me.
And this Manager will store this in its database.
Now, let's say, this connection got broken and D3 is offline right now. Again this Handler will inform the Manager that D3 is now offline.
Do not reach out to me for any communication of D3.
This manager sits on top of a Redis cluster.
This Redis would not just be storing data in-memory, it will also be storing it in a persistent store on disk.
And it will basically store 2 kinds of mapping.
One is saying that.. the most frequently used one, is that - which driver is connected to which host..
which is saying something like D1 is connected to H1.
Similarly, there'll be an entry for each of the drivers in the system, saying which driver_id is connected to what host_id.
It'll also have a reverse mapping, saying which host_id is connected to what all driver_id.
So, it could have a mapping saying H1 is connected to D1, D2, D3, so on and so forth. Because that mapping might be used for something.
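The two mappings can be sketched like this. It is an in-memory illustration only: in the design above they live in Redis, and `ConnectionRegistry` and its method names are made up for this example.

```python
from collections import defaultdict
from typing import Optional, Set


class ConnectionRegistry:
    """Keeps driver -> host and host -> drivers in step with each other."""

    def __init__(self) -> None:
        self.host_by_driver = {}
        self.drivers_by_host = defaultdict(set)

    def connect(self, driver_id: str, host_id: str) -> None:
        # Drop any stale entry first, in case the driver reconnected
        # through a different handler.
        self.disconnect(driver_id)
        self.host_by_driver[driver_id] = host_id
        self.drivers_by_host[host_id].add(driver_id)

    def disconnect(self, driver_id: str) -> None:
        host_id = self.host_by_driver.pop(driver_id, None)
        if host_id is not None:
            self.drivers_by_host[host_id].discard(driver_id)

    def host_for(self, driver_id: str) -> Optional[str]:
        return self.host_by_driver.get(driver_id)

    def drivers_on(self, host_id: str) -> Set[str]:
        return set(self.drivers_by_host.get(host_id, set()))


registry = ConnectionRegistry()
registry.connect("D1", "H1")
registry.connect("D3", "H3")
registry.disconnect("D3")  # D3 went offline
print(registry.host_for("D1"))  # prints H1
print(registry.host_for("D3"))  # prints None
```

The reverse mapping is what makes it cheap to clean up every driver of a handler if that handler itself goes down.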
Coming to other things that this WebSocket is used for.
So these drivers/ devices send location pings to our backend, let's say, every 5 seconds.
So, every 5 seconds, we get a hit about the location information.
All the location related information is managed by something called Location Service. It does a lot of things.
One of the things that happens here is - it stores the information about the driver's location into its Cassandra.
Why Cassandra?
Because, again, there are like thousands probably or maybe even millions of drivers across the globe
who are sending their location updates every 5 seconds.
So, there are a lot of updates happening. So, a Cassandra should be able to easily scale up to that number. That's the main reason.
There are 2 kinds of information that get stored here.
One is - the live location of the driver, which is the last known location.
The other thing that is stored is - while a driver is doing a trip with a customer, we need to know exactly what was the route followed,
for any auditing purpose or billing purpose.
Very common use case is - Once we know the points that the driver followed, we would be able to trace that out and then come up
with the real distance that the driver actually travelled and then use that to come up with the pricing.
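Here is a small sketch of that distance calculation. It assumes the stored route is an ordered list of (lat, long) pings and adds up the great-circle (haversine) distance between consecutive pings. The function names are made up, and joining pings with straight lines slightly underestimates the distance on curved roads.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def route_distance_km(points) -> float:
    # Sum the distance of each leg between consecutive pings.
    return sum(
        haversine_km(*start, *end) for start, end in zip(points, points[1:])
    )


route = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(round(route_distance_km(route), 1))
```

With a ping every 5 seconds the legs are short, so the straight-line error per leg stays small.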
So all of those things are basically responsibilities of Location Service.
Location Service also talks to Map Service.
Remember, Map Service from the previous section. Map Service is a service that maintains the segments that we have created
throughout the city and throughout the globe. So, Map Service maintains not just the segments, it also gives us the ETA, which will be the
time taken from point A to point B and the distance that will be followed and also the route that should be followed from point A to point B.
Think of Map Service as an abstraction that we have. We'll not go into the details of implementation of Map Service right now.
I have made another video which is on the design of Google Maps, that goes into details of how Map Service is implemented.
But, that being said, Location Service, as soon as it gets a ping from a driver,
it basically queries Map Service and tries to figure out that this lat long belongs to which segment.
It then stores it into its Redis, saying, this segment has these drivers.
This update happens only when a driver's segment changes. If they're in the same segment, then no change happens.
This basically is used for a lot of purposes. So, let's say, we want to find out drivers in a vicinity,
we'll query this service saying who all are the drivers that are basically in S1.
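The "update only when the segment changes" rule can be sketched as below. In-memory dicts stand in for Redis here, and `SegmentIndex` with its methods is a made-up name for illustration.

```python
from typing import Set


class SegmentIndex:
    """Tracks which drivers are in which segment."""

    def __init__(self) -> None:
        self.segment_by_driver = {}
        self.drivers_by_segment = {}

    def record_ping(self, driver_id: str, segment_id: str) -> bool:
        """Returns True only if the ping caused a write."""
        previous = self.segment_by_driver.get(driver_id)
        if previous == segment_id:
            return False  # same segment as before, nothing to update
        if previous is not None:
            self.drivers_by_segment[previous].discard(driver_id)
        self.segment_by_driver[driver_id] = segment_id
        self.drivers_by_segment.setdefault(segment_id, set()).add(driver_id)
        return True

    def drivers_in(self, segment_id: str) -> Set[str]:
        return set(self.drivers_by_segment.get(segment_id, ()))


index = SegmentIndex()
index.record_ping("D1", "S1")  # first ping, written
index.record_ping("D1", "S1")  # still in S1, skipped
index.record_ping("D1", "S2")  # moved to S2, written
print(index.drivers_in("S1"), index.drivers_in("S2"))  # prints set() {'D1'}
```

Most pings fall in the skipped case, which is why the segment store sees far fewer writes than the raw location store.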
There is one more thing that Map Service does. It basically keeps a mapping of which segments surround a particular segment.
We'll come to how that is used in a while.
There's something called Trip Service. Trip Service is basically the source of truth for all the trip information.
It sits on top of a MySQL database and a Cassandra database.
It uses MySQL for all the live information, basically information of all the trips that are either about to happen in some near future or are in progress.
Once the trip is completed, then it basically can move to Cassandra.
Now, why don't we store all the information in MySQL? Because over time, this will become a very massive volume of data.
Plus, if it is just for read queries, Cassandra is also good enough. So, we don't really need to store it in MySQL.
The main reason for storing it in MySQL is that a trip would have a lot of information.
It would have information about the customers, about the drivers, about potential start times and end times, about the potential distances,
about the real values and maybe some events information that have come in between, maybe some payment information and a lot of other things.
Now, if you look at it in tables terms, these will be a lot of tables.
And, for each event that comes in against a trip, we might need to update a lot of such tables.
And there, it is very good to have transactional properties.
So, that's the reason we'll be using a MySQL for all the trips that will be updated.
And once the trip is completed, then we can move it to Cassandra.
Now, this movement from MySQL to Cassandra is taken up by this Trip Archiver Service, which basically is a cron,
which runs once every 12 hours or so, pulls the data from MySQL and puts it into Cassandra.
Coming back to Trip Service.. Trip Service will expose all the APIs around trips. So, if you want to get a trip by id or if you want to get all the trips
of a particular driver or all the trips of a user, all of those APIs would be powered by Trip Service.
And let's say, if it's a search by a driver_id, it will query MySQL, it will query Cassandra, it will get the results from both of them, merge them
and then return it back to whoever was calling. So, that's how its flow would be.
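A minimal sketch of that merge, with two lists of dicts standing in for the MySQL and Cassandra results. The de-duplication by trip_id is an assumption added here, to cover a trip that the archiver has copied to Cassandra but not yet removed from MySQL. `trips_for_driver` is a made-up name.

```python
def trips_for_driver(driver_id, live_trips, archived_trips):
    merged = {}
    # Archived rows go in first, so a live row with the same trip_id
    # overrides the archived copy.
    for trip in list(archived_trips) + list(live_trips):
        if trip["driver_id"] == driver_id:
            merged[trip["trip_id"]] = trip
    return sorted(merged.values(), key=lambda trip: trip["trip_id"])


live = [
    {"trip_id": 3, "driver_id": "D1", "status": "in_progress"},
    {"trip_id": 2, "driver_id": "D1", "status": "completed"},
]
archived = [
    {"trip_id": 1, "driver_id": "D1", "status": "completed"},
    {"trip_id": 2, "driver_id": "D1", "status": "completed"},
    {"trip_id": 9, "driver_id": "D2", "status": "completed"},
]
print([trip["trip_id"] for trip in trips_for_driver("D1", live, archived)])  # prints [1, 2, 3]
```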
Now, let's get to the main flow on what happens when a customer actually requests a cab.
The customer flow begins at this point where they basically make a request to Cab Request Service through again
an open connection between both the parties, the Customer and this service.
And basically, what the customer says is - this is my source lat long. I need to go to a destination which is identified by certain lat long. Get me a cab.
I'm assuming there is just one type of cab and no varieties and types of cab. If you want to implement, that is a straightforward thing to implement.
But given this request, Cab Request Service then queries something called Cab Finder,
which is responsible to come up with that one driver who will do the trip.
At the end of all of it, Cab Finder will respond back to Cab Request Service saying, this is your trip_id, this is your driver_id,
go send it back to the customer.. in a nice form, with all the details about the driver and all of that. And Cab Request Service would send that.
Cab Finder will also put a notification into a Kafka, whether or not it was able to find a driver.
Let's say, if it's not able to find a driver, may be because there is a scarcity of drivers, all of that would go into Kafka,
which can be used for further Analytics like for example, telling drivers that this is a location where there are more customers and less drivers
so why don't you go into that location to get more trips or something of that sort.
Coming to what Cab Finder does.
The very first thing Cab Finder would do is basically - it has a source lat long, which is basically identifier of a particular location of a customer.
It will first of all query Location Service saying, get me the segment in which this customer currently is.
Along with that, also give me a list of drivers that are near this customer.
What Location Service does is - it first of all queries Map Service with the lat long of the customer, to get the segment which the customer is in.
It then queries surrounding segments. So, I'll try to explain why do we need surrounding segments.
So, let's say, there is a rectangle like this and customer was sitting here.
There were a couple of drivers in various locations in this segment. There could possibly be a driver just here.
But, if we query all the drivers within this segment, then we'll get a driver who is far away.
But, a driver is just next to customer, maybe in another segment.
So, for doing that, we basically.. let's say if it is segment S1, we need to find all the segments that are
surrounding S1 and get the closest 10 drivers, let's say, in all of those segments.
So, let's say, there could be a couple of segments over here and maybe you can have something like this. This is S2, S3, S4, something of that sort.
So, we need drivers from all of these segments. Location Service does not work out the neighbours itself. It just offloads that job to Map Service saying, get me all the segments
that are surrounding this segment S1. Then, Location Service basically figures out all the drivers in those segments
and then tells Map Service to get the closest 10 drivers to this customer, out of that list of drivers.
Distance between the customer and the driver is also something Map Service is good at.
Remember, it is able to find and identify the distance between two points.
So, it can identify the distance between a customer and a driver, taking into account the road distance, not the aerial distance.
Let's say, it got some 5 - 10 drivers which are close by to the customer. Location Service will return it to Cab Finder.
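The surrounding-segment lookup can be sketched as follows. Two assumptions here are not from the design above: segments are identified by (column, row) cells of a regular grid, and straight-line distance stands in for the road distance that Map Service would really compute. All names are made up for illustration.

```python
import heapq
import math


def surrounding_segments(segment):
    col, row = segment
    # The 8 grid cells that touch this cell.
    return [
        (col + d_col, row + d_row)
        for d_col in (-1, 0, 1)
        for d_row in (-1, 0, 1)
        if (d_col, d_row) != (0, 0)
    ]


def closest_drivers(customer, customer_segment, drivers_by_segment, positions, limit=10):
    candidate_segments = [customer_segment] + surrounding_segments(customer_segment)
    candidates = [
        driver_id
        for segment in candidate_segments
        for driver_id in drivers_by_segment.get(segment, [])
    ]
    # Straight-line distance is a stand-in for road distance.
    return heapq.nsmallest(
        limit, candidates, key=lambda driver_id: math.dist(customer, positions[driver_id])
    )


positions = {"D1": (0.1, 0.5), "D2": (1.05, 0.5), "D3": (5.5, 5.5)}
drivers_by_segment = {(0, 0): ["D1"], (1, 0): ["D2"], (5, 5): ["D3"]}
# The customer sits near the right edge of segment (0, 0).
print(closest_drivers((0.95, 0.5), (0, 0), drivers_by_segment, positions, limit=2))
# prints ['D2', 'D1']
```

D2 is in a different segment but comes out first, which is exactly the edge case the surrounding segments are there to handle.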
Now basically, we need to identify 1 driver out of these 10 who will do the trip.
Now, there comes something called modes. There could be multiple modes in which this cab request could be served.
We could say that for certain kind of customers, just pick the best driver.
Let's say, if it's a premium customer, then we just pick the best driver out of the lot and assign that.
Or, if it's an average customer, we might want to do some different thing.
We might want to broadcast to all the drivers and whoever accepts first, we can assign that driver to the trip.
So, all of these modes are basically something that Cab Finder decides that which mode I want to run it in.
Given whatever mode it is, it might need some additional inputs.
So, if it's the best driver mode, then it might need to stack rank all the drivers.
So, all of these are basically something that is handled by Driver Priority Engine. We'll get to the logic, what it follows, later on.
Cab Finder then queries Driver Priority Engine saying, I have these drivers for this kind of a customer, you just arrange them and give me back.
Then it gets a list of those drivers. And given the mode and given the list, it then tries to identify one of the drivers who will actually do that trip.
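A rough sketch of the two modes, assuming the drivers are already ranked and that in broadcast mode the acceptances arrive as a list in the order they were received. The mode names "best" and "broadcast" and the function `assign_driver` are made up for illustration.

```python
def assign_driver(mode, ranked_drivers, acceptances=()):
    if not ranked_drivers:
        return None  # nobody nearby, so no driver can be assigned
    if mode == "best":
        # Ranked list comes from the priority engine, best driver first.
        return ranked_drivers[0]
    if mode == "broadcast":
        offered = set(ranked_drivers)
        for driver_id in acceptances:  # in the order they were received
            if driver_id in offered:
                return driver_id
        return None  # nobody accepted
    raise ValueError(f"unknown mode: {mode}")


ranked = ["D4", "D1", "D7"]
print(assign_driver("best", ranked))  # prints D4
print(assign_driver("broadcast", ranked, acceptances=["D7", "D1"]))  # prints D7
```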
It then queries WebSocket Manager and asks WebSocket Manager saying,
which was the host that was actually, you know, connected to this particular driver. It will then call that WebSocket Handler.
Let's say it chose D1, it will then call this Handler 1 saying, D1 you have a trip, go do it.
The same notification would be sent via Cab Request Service to Customer saying, Customer, you have got a new driver, which is driver D1.
And then the regular flow follows wherein the driver starts moving to the customer's location.
So, this is how we'll figure out, how to assign a driver to a customer.
Once a driver is assigned, basically, what it means is, a trip has to be created and updated.
So, this will then basically update the Trip Service saying, I've created a trip with this customer, with this particular start point, this end point,
this driver and whatever information it needs to. And, that gets persisted into this MySQL via this Trip Service.
So, this is basically the booking flow.
As part of the booking flow, we had inserted a lot of events.
Whether Cab Finder was able to find a ride for the customer or not. And even the Location Service was putting in a lot of events into Kafka.
Now, let's try to look at how do we use those events.
This Kafka is getting a lot of events like location update events, trip update events, no driver found events, a lot of things.
Let's look at some of the use cases where we can utilize those events to our benefit.
One of the very common thing is - whenever a trip is completed, we need to initiate a payment to the driver.
That could be aggregated over a few hours or a few days or something of that sort. But, we still need to store information about a potential payment.
There would be a Payment Service, which sits on top of this Kafka, which would have a Kafka consumer, which listens to all the
trip completion events. And as soon as a trip is completed, it would insert a record in its Payment MySQL database, which says that -
this particular driver, did this particular trip_id, for a user with user_id, and with lots of attributes like distance travelled, time taken and all of that.
And finally an amount of money that needs to be paid to this driver.
If that needs to be an instantaneous payment, this Payment Service could talk to a Payment Gateway to deliver a payment.. like transact the money.
Or, if it needs to be aggregated, then it can basically run a cron which does the payment every once in a few days or something of that sort.
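The consumer side of this can be sketched as below. A list of dicts stands in for the Kafka topic and a dict keyed by trip_id stands in for the Payment MySQL table. The event field names, the "pending" status and the skip-if-already-recorded check are all assumptions added for illustration.

```python
def record_payments(events, payments):
    for event in events:
        if event["type"] != "trip_completed":
            continue  # this consumer only cares about completed trips
        trip_id = event["trip_id"]
        if trip_id in payments:
            continue  # already recorded; the same event may arrive twice
        payments[trip_id] = {
            "driver_id": event["driver_id"],
            "user_id": event["user_id"],
            "distance_km": event["distance_km"],
            "amount": event["amount"],
            "status": "pending",  # paid out later, instantly or by the cron
        }
    return payments


events = [
    {"type": "location_update", "driver_id": "D1"},
    {"type": "trip_completed", "trip_id": "T1", "driver_id": "D1",
     "user_id": "U1", "distance_km": 12.5, "amount": 300},
    {"type": "trip_completed", "trip_id": "T1", "driver_id": "D1",
     "user_id": "U1", "distance_km": 12.5, "amount": 300},
]
payments = record_payments(events, {})
print(len(payments))  # prints 1
```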
If, let's say, a driver wants to see their payment history, there would be APIs that would be running out of this service,
which will give all the payment transaction information against a particular driver,
which we talked about in the very first section that, could be powered through Driver service, talking to Payment Service.
Now, let's look at some other use cases.
On top of this Kafka cluster, there would be a Spark Streaming Cluster in which some Spark streaming jobs would be running.
One of the very common things is to basically create a heat map.
If, let's say, from a particular geography, we are getting a lot of events saying, there are no drivers found,
that means that, there is a surge of customers in that area and there are very few drivers.
So, we can create a heat map within Driver App, powered by this Streaming, which kind of shows a particular segment or a few areas,
which are having this kind of a scarcity so drivers can move to those locations to get more trips.
This is a classic example of a Streaming kind of an application.
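The aggregation behind such a heat map is essentially a count per segment. Here is a plain Python sketch of that count over one batch of events. It is not Spark Streaming code, and the event fields, the threshold and the function name are assumptions.

```python
from collections import Counter


def scarcity_by_segment(events, threshold):
    counts = Counter(
        event["segment_id"]
        for event in events
        if event["type"] == "no_driver_found"
    )
    # Keep only the segments where enough requests went unserved.
    return {segment: count for segment, count in counts.items() if count >= threshold}


events = [
    {"type": "no_driver_found", "segment_id": "S1"},
    {"type": "no_driver_found", "segment_id": "S1"},
    {"type": "no_driver_found", "segment_id": "S2"},
    {"type": "trip_completed", "segment_id": "S1"},
]
print(scarcity_by_segment(events, threshold=2))  # prints {'S1': 2}
```

In a streaming job the same count would be kept over a sliding time window, so old shortages drop off the map.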
What it will also do is - it will basically put all the events into a Hadoop cluster, which can be used for further Analytics.
On top of this Hadoop cluster, we could run a lot of ML jobs or regular Spark jobs, which will do a lot of things.
So, the very first thing is basically to do a customer classification into various categories.
So, if a customer takes a trip with us every day, we'll classify them as a premium customer.
If it's a once in a while coming customer, then it would be just another customer for us.
Same classification could be done for drivers.
So, there'll be these User Profiling and Driver Profiling jobs, which would ideally be a ML classification model running,
which will classify those users as premium or regular, or drivers as premium drivers or regular drivers.
The same information would also be used to create an ML model, which can do the Driver Priority that we talked about in the previous section.
So, based on certain attributes of drivers, for example, their ratings or the customer feedback or their ETA..
basically the accuracy of how much ETA was supposed to happen and how much ETA did actually the driver take.
There could be a lot of attributes on which we stack rank the drivers.
All of those could be aggregated in these jobs and drivers could be given a score.
And using those scores, drivers could be ranked by Driver Priority Engine. So, something of that sort could also be built.
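As a very simplified stand-in for that scoring, here is a weighted sum over two of the attributes mentioned, rating and ETA accuracy. The weights, the 0 to 5 rating scale and the 0 to 1 accuracy scale are assumptions made for this sketch. A real system would learn the score from data rather than hard-code it.

```python
def driver_score(rating, eta_accuracy, rating_weight=0.7, eta_weight=0.3):
    # rating is assumed to be on a 0-5 scale, eta_accuracy on a 0-1 scale.
    return rating_weight * (rating / 5) + eta_weight * eta_accuracy


def rank_drivers(drivers):
    # Highest score first.
    return sorted(
        drivers,
        key=lambda d: driver_score(d["rating"], d["eta_accuracy"]),
        reverse=True,
    )


drivers = [
    {"driver_id": "D2", "rating": 4.9, "eta_accuracy": 0.5},
    {"driver_id": "D3", "rating": 3.0, "eta_accuracy": 1.0},
    {"driver_id": "D1", "rating": 4.8, "eta_accuracy": 0.9},
]
print([d["driver_id"] for d in rank_drivers(drivers)])  # prints ['D1', 'D2', 'D3']
```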
Now, we could also power the same... I forgot to make this link.
This same information could be used to generate a fraud score.
Let's say, there is a very high correlation between a driver and a customer. Let's say, all the location pings say that
a customer and driver move together, or all (or most) of the trips done by a driver are against a particular customer.
Then we can safely say that this customer and driver are either friends or that's the same person
using two mobile phones to book a trip as the customer and do the trip as the driver.
There's something called an incentive programme run by Uber and all the other companies.
So, it's basically a way to exploit that kind of a programme. So, all those kinds of frauds could be captured by these kinds of models again.
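One simple signal of this kind is the share of a driver's trips that were taken by a single customer. The sketch below computes just that, assuming trips are given as (driver_id, customer_id) pairs. The function name is made up, and a real fraud model would combine many such signals.

```python
from collections import Counter


def top_customer_share(trips, driver_id):
    customers = [customer for driver, customer in trips if driver == driver_id]
    if not customers:
        return None, 0.0
    customer, count = Counter(customers).most_common(1)[0]
    return customer, count / len(customers)


trips = [("D1", "C1"), ("D1", "C1"), ("D1", "C1"), ("D1", "C2"), ("D2", "C3")]
print(top_customer_share(trips, "D1"))  # prints ('C1', 0.75)
```

A share close to 1.0 over many trips would be the kind of pattern worth flagging for review.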
The same data could be used as an input into Map Service.
Let's say, we don't have traffic information in maps. All these lat longs could basically power the traffic information of Map Service.
We can safely assume that - we'll not know the exact number of people traveling on the road,
but, if some road has more of our cars, then we can safely assume that probably there is more traffic also there.
If we are at an Uber scale, we can safely assume that.
So, that could be an input into Map Service or at least the traffic data and some of the road condition data also.
So, if the average speed of a driver on certain roads is very high, we can safely assume that that's a freeway or a highway or a good road.
If the average speed is very low, we can assume that there is either high traffic or the road is not in good condition.
Or there is something wrong with the road.
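A small sketch of that average-speed idea, assuming each location ping has already been matched to a road and turned into a (road_id, speed) sample. The 60 and 15 km/h cut-offs and the labels are made-up values for illustration.

```python
from collections import defaultdict


def average_speed_by_road(samples):
    totals = defaultdict(lambda: [0.0, 0])  # road_id -> [speed sum, sample count]
    for road_id, speed_kmph in samples:
        totals[road_id][0] += speed_kmph
        totals[road_id][1] += 1
    return {road_id: total / count for road_id, (total, count) in totals.items()}


def classify_road(average_kmph, fast=60.0, slow=15.0):
    if average_kmph >= fast:
        return "fast"  # likely a freeway, a highway or a good road
    if average_kmph <= slow:
        return "slow"  # likely heavy traffic or a bad road
    return "normal"


samples = [("R1", 80.0), ("R1", 70.0), ("R2", 10.0), ("R3", 35.0)]
for road_id, average in average_speed_by_road(samples).items():
    print(road_id, average, classify_road(average))
```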
And also the same information could then be used to come up with a better, enhanced ETA Calculation Engine.
Exact details of it, we'll go over in the Google Maps video. So, I would recommend that you look at that.
But this could be one of the use cases where we use all of this information.
So, yeah, I think this is mainly about an Uber kind of an application.