In the previous article I have analyzed the location and characteristics of the BiciMAD stations, now it is now time to continue the analysis to understand where users are going and what are the most popular routes, clustering on their category of user_type (subscribers, occasional users, …).
Please note that the figures, the dataset, the Tableau reports and the Python notebook discussed in this article can be found (for free) on GitHub.
For the analysis of the routes I used the data provided by the Municipal de Transportes de Madrid on their Open Data web site. Routes data are available in 18 archives with as many JSON files, from April 2017 to September 2018. Each line of the file is a beginning to end JSON that contains informations of every time one e-bike was removed from its base (including when it was done from EMT employers).
Overall, the files contain the details of over 5 million withdrawals and each archive size 500 to 1500 Mb (while zipped), a remarkable point is that unfortunately there are 17 ZIP archive and 1 RAR archive, this is not good as the whole dataset is too big to be uncompressed and imported into a single Pandas Dataframe.
For this reason I chose to go a different way: I identified some information to extract and then programmed Python to extract each month JSON file, open it an archive the desidered informations at a time. The script processed 25,000 lines at a time, which were then aggregated together in several CSV files.
The EMT company also provides a short guide to undertand the meaning of each fields, the guide, available in Spanish, could be downloaded from this link.
At first glance we can see that two interesting fields are missing: a bicycle ID and a user ID. The lack of user ID is probably due to privacy reasons but I can’t understand the reason for the lack of a unique identifier of the bicycle. It would have been very interesting to extract informations related to the use of each bike between the stations (by users and by EMT employers).
The target of this analysis are:
- to understand the habits of BiciMAD users, in particular about when they use the bikes and for which purpose;
- to understand if there is some evident segmentation of users;
- to understand the most frequent trips that BiciMAD users make with the bikes.
Exploratory data analysis
First we evaluate the number and the main statistical parameters of the data at our disposal.
First I evaluate the number and the main statistical parameters of the data at our disposal.
After that I verify that the IDs of the plug and unplug stations are all present in the registry generated in the previous article.
One of the aspects that is immediately visible is the fact that the maximum value of the stations is 2008. From the previous post we know that the stations are identified by an ID ranging from 1 to 175, so 2008 is an anomalous value.
Since the DataFrame was generated by a script I made, I check if there were any lines transferred to the 2008 stations in the original data provided by EMT, just to make sure that was not my mistake. I take the month of December 2017 because (as winter month) the file is definitely smaller that others.
The 2008 value as a station ID is also present in the data provided originally, so it is not the result of errors related to my data aggregation script.
Intuitively 2008 could be an ID of a logic station where are placed the bikes that have some anomalous, for example those in maintenance or those stolen. For me it is really difficult to verify this fact, but considering that these lines are only XXX, XXX on the total, I can remove them from the data frame without incurring the risk of having a clear effect of falsification on the final results.
An additional station that is inside the routes are out of the stations registry made in the last articole: station number 22.
With two simple queries I understand that the station may have been active only from ‘2017-05-01’ to ‘2017-12-01’, and now it does not exist anymore, but the open data on the stations, provided by the EMT, only cover months between July and September 2018.
To get over of this issue, I searched some 2017 maps of the BiciMAD stations, but without achieving useful informations as stations on the maps are identified with the “number” field instead of the “ID”.
As a last check, I looked at the stations above and below the one with ID 22.
# Station 21 and 23 df_stations.loc[df_stations['id'].isin(['21','23'])]
|20||Calle Alcalá nº 49||21||40.419234||-3.695461||Banco de España A||20a||30||Centro|
|21||Calle Montera nº 48||23||40.419787||-3.701481||Red de San Luis A||21a||24||Centro|
From the data I see that there is a station 20a. Usually when EMT identify a station with alphabet letter, is because there are at least two nearby stations, but in this case station 20b is missing: it is likely that the ID=22 station I am looking for was actually the number 22b.
These informations are sufficient to restrict my search perimeter:
– a station between ID=20
– it was there since ‘2017-05-01’ to ‘2017-12-01’
From Google Maps I can finally find the station ID=22, it is the 20b station that until a few months ago was located in Calle Alcalá nº 49.
As the station 22 really existed in the past, I decided to not remove its lines from the files.
a=df_trip_stations['idplug_station'].unique() b=df_stations['id'].unique() set(a) - set(b)
Analysis of trips
In the following graphs color red identify annual membership users, color blue identify casual users and color green , eventually, identify EMT employers.
First of all I plot the entire data provided from EMT on a timeline with month granularity, this to understand if there is some seasonality with the number of bike rented.
The results (shown in this graph) is that there is a great seasonality in this service. We can see that both of lines have a peak during summer months, a part from the month of August where red line (annual membership user) drop down, probably because during August lots of offices have a lighter work load due to holiday so it could demonstrate that user_type=1 use bikes to go from home to their workplace and back. From this graph we can make more conclusions:
- During winter the temperature in Madrid is very chill so both of user_type prefer to travel less by bike that with other transports;
- There is a huge gap between the number of trips from user_type=1 and user_type=2: the are plot together but check the scale on the y axis on the left and on the right…they have different order of magnitude!
This graph show the trips per day of week during the whole months provided by EMT, so from May 2017 to Sep 2018. Is is clear that annual subscriber have a peak of use during the first and the last working days of the week while casual user have nearly a complementary use.
In this graph we can se the evolution of the trips during the hours of the day. I think that the red line support the thesis that a big part of annual subscriber use the e-bikes of BiciMAD for travels between home and office, we can easily see the user going from home to the workplace (7-9 AM), going out for lunch (2-3 PM) and going home 5-7 PM).
After the series of plots I decided to go further and have a look of the whole one and half year provided by EMT to see if I can understand more informations about the habits of users. For this reason I decided to make 3 calmap graph to understand the distributions of trips along the week days and along the whole timeline.
These graphs show me some unexpected data that, if possible, it would be very interesting to investigate, maybe in future post on this blog:
- in May, Jun and August there are few days with almost zero travels from user_type=1 (annual subscriber) and a an explosion of user_type=2 (casual users) and user_type=3 (employers). This is very suspect! It could be an error during the data collection or maybe some technical problem that prevented annual subscriber to use their cards to rent the bikes so they had to rent as casual user (and lots of EMT employers were working to solve the problem).
- There is a surprising peak of EMT employers usage of bikes during the weekend, it would be interesting to understand if they are most of maintenance of bikes or moving the fleet for the week that will start.
- During the first days of the first week of May 2017 there is no one user_type=2 trips…is it correct?
- During October 2017 there is a deep blue days in the graph of user_type=2, this means a hight number of trips during this period of time, more that the weeks before and after. It would be interesting to understand the causes of this peak, it could be some marketing campaign about BiciMAD service.
For this analysis last point, I investigated about is the number of ring trips on the total. A ring trips is a trip that started and ended in the same station. Considering the price of renting a bike is is very unlikely that users unplug the bike, go for example at the supermarket without plug the bike to another base and the come back, unless it is a very quick errand. Ring trips would be probably more likely between user_type=2 as the can use the bikes to visit the city of Madrid and some point of interesting (for example the Retiro park) faster then just walking.
# Number of ring trips % (per user_type) round_trips=df_trip_stations['count'].loc[df_trip_stations['idplug_station']==df_trip_stations['idunplug_station']].groupby(df_trip_stations['user_type']).sum() total_trips=df_trip_stations['count'].groupby(df_trip_stations['user_type']).sum() round_trips.div(total_trips)*100
Name: count, dtype: float64
As we can see there are more ring trips from user of group 2 than from user of group 1. There is also a surprising 31% of ring trips from group number 3 (EMT employers), it could be an indicator that more ore less 31% of “trips” are for maintenance the e-bikes and not for just moving the fleet!
If we have a look about the most frequent trips of user_type=1
we could come to the same conclusions about ring trips.