import pandas as pd
In this report, I'll analyze data on the operations of Jumpman23, a fictitious business that operates in the New York City market. First, I'll get the overall picture of the market and then I'll deep dive into 3 topics: Customer Experience, Jumpmen efficiency and Logistics.
df = pd.read_csv("analyze_me.csv")
df.head()
from datetime import datetime as dt
from matplotlib import pyplot as plt
import folium
from folium.plugins import HeatMap
df["delivery_date"] = df.when_the_delivery_started.str.slice(0,10)
df.delivery_date = df.delivery_date.apply(lambda date: dt.strptime(date, '%Y-%m-%d'))
num_deliveries = df[["delivery_id", "delivery_date"]].groupby("delivery_date").nunique()
num_deliveries.columns = ["delivery_id", "temp"]
ax = plt.plot(num_deliveries.delivery_id)
ax = plt.xticks(rotation=45)
ax = plt.title("Number of deliveries in October, 2014")
print("Average number of deliveries per day: " + str(num_deliveries.delivery_id.mean()))
num_deliveries.delivery_id.describe()
The number of deliveries in NYC lied between 138 and 265, with an average of 173.8 deliveries per day. One month of data is not enough time to spot any long term growth trend, but it would be useful to compare this number to past data of other cities that Jumpman23 entered.
Furthermore, we can see a cycle in the data. Let's explore it.
num_deliveries.reset_index(inplace=True)
num_deliveries["delivery_weekday"] = num_deliveries.delivery_date.apply(lambda x: x.weekday())
weekday_deliveries = num_deliveries[["delivery_id", "delivery_weekday"]].groupby("delivery_weekday").delivery_id.sum()
ax = plt.plot(weekday_deliveries)
ax = plt.xticks(rotation=45)
ax = plt.title("Number of deliveries per weekday (0 = Monday)")
The result is reasonable. We have a higher volume of deliveries on Sundays and a lower volume on Mondays.
Let's now take a look at our user base:
print("Number of customers: " + str(len(df.customer_id.unique())))
print("Number of jumpmen: " + str(len(df.jumpman_id.unique())))
For the month of October, 2014, we had 3,192 active customers served by 578 jumpmen. Again, it would be useful to compare those numbers with historical data.
Finally, let's take a look at the density of deliveries by pickup location and dropoff location.
df["pickup_lat2"] = df.pickup_lat.apply(lambda x: round(x*100)/100)
df["pickup_lon2"] = df.pickup_lon.apply(lambda x: round(x*100)/100)
pickup_hm = df.groupby(["pickup_lat2", "pickup_lon2"]).delivery_id.nunique().reset_index()
hmap = folium.Map(titles="A", location=[df.pickup_lat.mean(), df.pickup_lon.mean()], zoom_start=12)
hm_wide = HeatMap( list(zip(pickup_hm.pickup_lat2.values, pickup_hm.pickup_lon2.values, pickup_hm.delivery_id.astype(float).values)),
min_opacity=0.2,
max_val=pickup_hm.delivery_id.max(),
radius=25, blur=15,
max_zoom=1,
)
loc = 'Density of delivery pickups'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
hmap.get_root().html.add_child(folium.Element(title_html))
hmap.add_child(hm_wide)
df["dropoff_lat2"] = df.dropoff_lat.apply(lambda x: round(x*100)/100)
df["dropoff_lon2"] = df.dropoff_lon.apply(lambda x: round(x*100)/100)
dropoff_hm = df.groupby(["dropoff_lat2", "dropoff_lon2"]).delivery_id.nunique().reset_index()
hmap = folium.Map(location=[df.dropoff_lat.mean(), df.dropoff_lon.mean()], zoom_start=12)
hm_wide = HeatMap( list(zip(dropoff_hm.dropoff_lat2.values, dropoff_hm.dropoff_lon2.values, dropoff_hm.delivery_id.astype(float).values)),
min_opacity=0.2,
max_val=dropoff_hm.delivery_id.max(),
radius=25, blur=15,
max_zoom=1,
)
loc = 'Density of delivery dropoffs'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
hmap.get_root().html.add_child(folium.Element(title_html))
hmap.add_child(hm_wide)
We can see that both pickups and dropoffs are concentrated in Lower and Midtown Manhattan. One hypothesis is that, as most of Jumpan23 partnerships with pickup places are in Lower and Midtown Manhattan, the delivery time to other regions is high, which makes the customer experience not good. Then, it would make sense to expand the partnerships to other regions of Manhattan and even the suburbs of NYC. In the next topic section, I'll explore the delivery time.
Here, I'm going to analyze customer behavior and experience in the platform. First, let's see how many orders per customer we had over time.
df["delivery_week"] = df.delivery_date.apply(lambda date: date.isocalendar()[1])
orders_per_cust = df[["delivery_id", "customer_id", "delivery_week"]].groupby("delivery_week").nunique()
orders_per_cust
orders_per_cust.columns = ["delivery_id", "customer_id", "temp"]
ax = plt.plot(orders_per_cust.delivery_id/orders_per_cust.customer_id)
ax = plt.title("Orders per customer per week")
print("Average orders per customer per week: " + str(orders_per_cust.delivery_id.sum()/orders_per_cust.customer_id.sum()))
print("Average orders per customer overall: " + str(df.delivery_id.nunique()/df.customer_id.nunique()))
df.groupby("customer_id").delivery_id.nunique().describe()
We had on average 1.63 orders per customer in NYC in October 2014, and on average 1.18 orders per week. Also, 75% of customers ordered twice or less, and the maximum number of orders for a customer was 23. We should try to increase the frequency of orders with promotions and targeted advertisement and we should retain yhe high frequency customers.
We can see a peak in week 42, but the standard deviation is pretty low. It would be interesting to compare those numbers with past data.
Let's now see how many items we have per order on average. Here, we have missing data for the item name, but we know that each row corresponds to a different item.
items_per_order = df.groupby("delivery_id").count().customer_id
print("# items per order: " + str(items_per_order.mean()))
items_per_order.describe()
Most of the time, customers tend to order only 1 item in the platform, with a maximum value of 5 items. Therefore, the company should think of ways to make customers add more items in the checkout, as for example recommending other items that go well together.
df.groupby("place_category").delivery_id.nunique().sort_values(ascending=False)
The top place category ordered in NYC is Italian, followed by Burger and American. The least ordered category is Clothing, tied up with other categories like Book Store. Therefore, the company should understand if people are not interested by clothing and books, or if the platform doesn't provide a good experience in those categories (few places or few items available).
df.groupby("pickup_place").delivery_id.nunique().sort_values(ascending=False)
The top place ordered in NYC is Shake Shack, so Jumpman should maintain a good relationship with this stakeholder.
Let's now see how long it takes for a customer to place an order.
df.how_long_it_took_to_order = df.how_long_it_took_to_order.apply(lambda date: dt.strptime(date, '%H:%M:%S.%f') if date==date else date)
df["seconds_to_order"] = df.how_long_it_took_to_order.apply(lambda time: time.minute*60 + time.second if time != None else time)
time_to_order = df.groupby("delivery_id").seconds_to_order.agg(["mean", "count"]).dropna()
time_to_order["avg_time"] = time_to_order["mean"]/time_to_order["count"]
print("Avg time to place an order (minutes): " + str(time_to_order["mean"].mean()/60))
print("Avg time per item to place an order (minutes): " + str(time_to_order.avg_time.mean()/60))
It takes around 7.6 minutes for customers to order in the platform, and around 7 minutes per item. It would be useful to compare those values with past data from other cities, as well.
from datetime import timedelta as td
import seaborn as sns
Let's now analyse the operations from the perspective of the Jumpmen. First, it is interesting to know the number of deliveries each Jumpmen performs on average.
dlvrs_per_jumpman = df.groupby("jumpman_id").delivery_id.nunique()
plt.hist(dlvrs_per_jumpman, bins=15)
plt.title("Distribution of # of deliveries per jumpman")
dlvrs_per_jumpman.describe()
dlvrs_per_jumpman.sort_values(ascending=False).head(5)
Most of Jumpmen delivered only once, which is a low number. Jumpman23 should think of ways to increase loyalty of jumpmen, as for example creating targets of number of deliveries and rewarding jumpmen for that.
On average, they deliver around 9 times and the maximum number of deliveries was 67. The top performers, as '99219', should get incentives to stay in the platform.
Let's see the number of active Jumpmen over time.
plt.plot(df.groupby("delivery_date").jumpman_id.nunique()/df.jumpman_id.nunique())
ax = plt.xticks(rotation=45)
plt.title("% utilization of Jumpmen")
From the active Jumpmen in October 2014, we had a % of utilization ranging between 14% and 22% per day. However, we can't reach any conclusions with this data, because we don't know if the remaining Jumpmen were connected to the platform waiting for deliveries or if they were offline.
Let's now see how long it takes to commute around the city. As for some deliveries the delivery started after the Jumpman arrived at the pickup, I'll consider that it took 0 minutes to commute in these cases.
df["pickup_commute_time"] = df.when_the_Jumpman_arrived_at_pickup.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)-df.when_the_delivery_started.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)
df.pickup_commute_time = df.pickup_commute_time.apply(lambda x: 0 if x.days < 0 else x.seconds/60)
df["dropoff_commute_time"] = df.when_the_Jumpman_arrived_at_dropoff.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)-df.when_the_Jumpman_left_pickup.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)
df.dropoff_commute_time = df.dropoff_commute_time.apply(lambda x: 0 if x.days < 0 else x.seconds/60)
df["total_commute_time"] = df.pickup_commute_time + df.dropoff_commute_time
total_commute_time = df.groupby(["jumpman_id", "delivery_id"]).total_commute_time.mean().reset_index()
total_commute_per_jumpman = total_commute_time.groupby("jumpman_id").total_commute_time.mean()
ax = plt.hist(total_commute_per_jumpman, bins=50)
plt.title("Distribution of commute time per jumpman")
total_commute_per_jumpman.describe()
On average, a Jumpman takes around 27 minutes commuting in NYC picking up and dropping off. The fastest Jumpman took on average 2.3 minutes and the slowest, 84.3 minutes on average. The fast ones are probably lucky to get pickup and dropoff locations close to each other. However, it would be interesting to understand why some jumpmen are taking almost 1 hour and a half to commute. It may be due to traffic, pickup and dropoff not close to each other, slow mean of transportation or other inefficiency.
Let's now break down the commute time into pickup and dropoff commute time.
pickup_commute_time = df.groupby(["jumpman_id", "delivery_id", "pickup_lat2", "pickup_lon2"]).pickup_commute_time.mean().reset_index()
pickup_commute_per_jumpman = pickup_commute_time.groupby("jumpman_id").pickup_commute_time.mean()
ax = plt.hist(pickup_commute_per_jumpman, bins=40)
plt.title("Distribution of pickup commute time per jumpman")
pickup_commute_per_jumpman.describe()
The distribution of pickup commute time is a little different. We have a considerable amount of Jumpmen that arrived at the pickup before the delivery started. On average, it took around 12 minutes for Jumpmen to pick up an order, with a maximum average time of 38.5 minutes for some Jumpman, probably because they were assigned more distant pickups or because they usually work during peak traffic hours.
dropoff_commute_time = df.groupby(["jumpman_id", "delivery_id", "dropoff_lat2", "dropoff_lon2"]).dropoff_commute_time.mean().reset_index()
dropoff_commute_per_jumpman = dropoff_commute_time.groupby("jumpman_id").dropoff_commute_time.mean()
ax = plt.hist(dropoff_commute_per_jumpman, bins=40)
plt.title("Distribution of dropoff commute time per jumpman")
dropoff_commute_per_jumpman.describe()
The dropoff distribution looks more gaussian. On average, Jumpmen took 15 minutes to drop orders off, with a maximum average time of 69.6 for some Jumpman, probably because they were assigned distant dropoffs or because they usually work during peak traffic hours.
We can see that dropoff is slightly slower than pickup. It would be useful for Jumpman23 to do a deep dive on those anomalous drivers to understand if they are slow because of external factors like traffic, distance and mean of transportation, or because they are not efficient.
It is usufel to analyze the commute time versus the pickup and dropoff location, to see if places distant from Lower and Midtown Manhattan have a higher commute time because Jumpmen and places are concentrated in that region.
pickup_hm = df.groupby(["pickup_lat2", "pickup_lon2", "delivery_id"]).mean().groupby(["pickup_lat2", "pickup_lon2"]).pickup_commute_time.mean().reset_index()
hmap = folium.Map(titles="A", location=[df.pickup_lat.mean(), df.pickup_lon.mean()], zoom_start=12)
hm_wide = HeatMap( list(zip(pickup_hm.pickup_lat2.values, pickup_hm.pickup_lon2.values, pickup_hm.pickup_commute_time.values)),
min_opacity=0.2,
max_val=pickup_hm.pickup_commute_time.max(),
radius=25, blur=15,
max_zoom=1,
)
loc = 'Density of delivery pickup time'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
hmap.get_root().html.add_child(folium.Element(title_html))
hmap.add_child(hm_wide)
dropoff_hm = df.groupby(["dropoff_lat2", "dropoff_lon2", "delivery_id"]).mean().groupby(["dropoff_lat2", "dropoff_lon2"]).dropoff_commute_time.mean().reset_index()
hmap = folium.Map(titles="A", location=[df.dropoff_lat.mean(), df.dropoff_lon.mean()], zoom_start=12)
hm_wide = HeatMap( list(zip(dropoff_hm.dropoff_lat2.values, dropoff_hm.dropoff_lon2.values, dropoff_hm.dropoff_commute_time.values)),
min_opacity=0.2,
max_val=dropoff_hm.dropoff_commute_time.max(),
radius=25, blur=15,
max_zoom=1,
)
loc = 'Density of delivery dropoff time'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
hmap.get_root().html.add_child(folium.Element(title_html))
hmap.add_child(hm_wide)
We can see that places far from the Lower and Midtown Manhattan have considerably higher commute times, especially dropoff commute time. This validates our initial hypothesis that people in the suburbs are having a worse experience. Jumpman23 should consider expanding the partnerships with businesses in those distant neighbourhoods.
Let's now analyze the time Jumpmen spend waiting to pick up.
df["wait_time"] = df.when_the_Jumpman_left_pickup.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)-df.when_the_Jumpman_arrived_at_pickup.apply(lambda time: dt.strptime(time, '%Y-%m-%d %H:%M:%S.%f') if time==time else time)
df.wait_time = df.wait_time.apply(lambda x: 0 if x.days < 0 else x.seconds/60)
wait_time = df.groupby(["jumpman_id", "delivery_id", "place_category", "pickup_place"]).wait_time.mean().reset_index()
wait_time_per_jumpman = wait_time.groupby("jumpman_id").wait_time.mean()
ax = plt.hist(wait_time_per_jumpman, bins=40)
plt.title("Distribution of wait time per jumpman")
wait_time_per_jumpman.describe()
We have again a near normal distribution, with some anomalous cases. On average, a Jumpman waits for around 20 minutes in the pickup location, and it varies from 6 to 100 minutes. It is very important to understand what happened with the Jumpman who waited 100 minutes, because it affects a lot the customer experience. It could be a problem in the pickup place, in the data collection or a problem with the driver.
Let's now segment the wating time by place category.
ax = plt.figure(figsize=(30,10))
sns.set(font_scale = 2)
sns.boxplot(x=wait_time.place_category, y=wait_time.wait_time)
ax = plt.xticks(rotation=90)
wait_time.groupby("place_category").wait_time.mean().sort_values(ascending=False)[[0,1,-2,-1]]
The fastest waiting time were on the categories Russian and Shop, while the slowest were on the cateogories Art Store (few data points, though) and Grocery Store. It makes sense, as usually doing groceries is very time consumming. It would be interesting to find ways of speeding up the grocery shopping, but this would require cooperation from both Jumpmen and the grocery stores.
wait_time.groupby("pickup_place").wait_time.mean().sort_values(ascending=False).head()
The slowest place was Denny's, a dinner restaurant chain, followed by Friend of a Farmer. Jumpman23 should investigate why waiting time is so high in those places.
import math
import numpy as np
Here, I'm going to check how the logistics operation is doing. I'll first start by segmenting the means of transportation.
y = df.groupby("vehicle_type").delivery_id.nunique().sort_values()
sns.set(font_scale = 1)
ax = plt.xticks(rotation=45)
plt.bar(y.index,y)
plt.title("# of deliveries per vehicle type")
Most of the deliveries are currently done by bycicle.
Let's see how the vehicle type affects the commute speed.
R = 6371e3
phi1 = df.pickup_lat * math.pi/180
phi2 = df.dropoff_lat * math.pi/180;
delta_phi = (df.dropoff_lat-df.pickup_lat) * math.pi/180;
delta_lambda = (df.dropoff_lon-df.pickup_lon) * math.pi/180;
a = np.sin(delta_phi/2) * np.sin(delta_phi/2) + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda/2) * np.sin(delta_lambda/2)
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
K = 0.000621371
df["pickup_dropoff_distance"] = R * c * K # in miles
commute_per_veh = df.groupby(["vehicle_type", "delivery_id"])[["pickup_dropoff_distance", "dropoff_commute_time"]].mean().reset_index()
commute_per_veh["veh_speed"] = commute_per_veh.pickup_dropoff_distance*60/commute_per_veh.dropoff_commute_time
sns.boxplot(x=commute_per_veh.vehicle_type, y=commute_per_veh.veh_speed)
plt.ylim(0,40)
speed_per_veh = commute_per_veh.groupby("vehicle_type").sum()
speed_per_veh.veh_speed = speed_per_veh.pickup_dropoff_distance*60/speed_per_veh.dropoff_commute_time
speed_per_veh.veh_speed.sort_values()
Here, I used the linear distance between 2 coordinates to calculate the speed. For distant points that is a fair approximation, but for closeby points it can underestimate a lot the real distance. Having said that, the truck is the fastest mean of transportation, with an average speed of 7 miles per hour, while the bicycle and walker are the slowest, with an average speed of 5 miles per hour.
Ideally, we should use walker, bicycle and scooter for lower quantities and cars, vans and trucks for larger quantities delivered. I'll check if this is happening now.
df.groupby(["vehicle_type", "delivery_id"]).item_quantity.sum().groupby("vehicle_type").mean().sort_values()
Indeed, trucks deliver a larger quantity than the other vehicles, but the van delivers the smallest quantity. Let's check the types of items delivered by vans.
df.query("vehicle_type == 'van'").head(20)
It seems that regular meals are being delivered by trucks in small quantities, which is not optimal. Maybe Jumpman23 should try to implement a shared delivery system, where trucks, vans and cars pick items up in more than 1 place and deliver to multiple customers. In order for that to work, we need to verify if we have multiple orders in nearby pickup and dropoff places.
df.when_the_delivery_started = df.when_the_delivery_started.apply(lambda date: dt.strptime(date, '%Y-%m-%d %H:%M:%S.%f') if date==date else date)
start = dt.fromisoformat('2014-10-05 12:49:30')
end = start + td(minutes=2)
df_map = df.query("when_the_delivery_started >= @start and when_the_delivery_started <= @end")
pickups = df_map[['pickup_lat', 'pickup_lon']]
pickup_list = pickups.values.tolist()
dropoffs = df_map[['dropoff_lat', 'dropoff_lon']]
dropoff_list = dropoffs.values.tolist()
hmap = folium.Map(titles="A", location=[df.pickup_lat.mean(), df.pickup_lon.mean()], zoom_start=12)
for i in range(len(pickups)):
folium.Marker(pickup_list[i], icon=folium.Icon(color='green')).add_to(hmap)
folium.Marker(dropoff_list[i], icon=folium.Icon(color='red')).add_to(hmap)
loc = 'Pickup and dropoff locations in a time window of 2 minutes'
title_html = '''
<h3 align="center" style="font-size:16px"><b>{}</b></h3>
'''.format(loc)
hmap.get_root().html.add_child(folium.Element(title_html))
hmap
In the example above, we can see that 2 of the 3 pickup points in green are very close to each other in a time interval of 2 minutes (in an interval of 2 minutes, we started those 3 deliveries in the map). This means that there was an opportunity for a Jumpman to do multiple pickups and then drop the orders off in each dropoff location. Jumpan23 coould then implement an online routing system to assign multiple nearby pickup and dropoff points to drivers. The customer could wait 2 minutes for the routing algorithm to find other orders, for example.