Exploratory analysis of SF Bike Share Data

Background

I previously participated in a Kaggle competition whose goal was to predict daily bike share rentals from a small dataset provided by Kaggle. Details about that project can be found here: http://manishkurse.github.io/PythonProjects/BikeSharingDemand.html

This project builds on that experience by performing exploratory analysis on a larger dataset in the same domain, combined with a secondary dataset.

Tools used here: SQLite, pandas, NumPy, scikit-learn, matplotlib, seaborn, pygmaps

Objectives

  • Perform exploratory analysis on San Francisco bicycle share user trip data
  • Can we find the San Francisco restaurants nearest to each bike station? (Motivation: could we arrange discount deals for users of these stations?)

Other questions (for future):

  • Can we predict number of bikes and docks that will be available at each station?
  • Can we predict inventory needed at stations based on a year's data?

Datasets:

  • Bicycle share data: http://www.bayareabikeshare.com/datachallenge
  • San Francisco restaurant data: https://data.sfgov.org/Economy-and-Community/Registered-Restaurants-in-SF/qwyj-4qec

Future projects: Can we recommend restaurants/cafes that lie along the routes of subscribed bikers?

  • Motivation: This information could then be used to work with those businesses (restaurants/gyms/cafes) to offer the bikers discounts.
  • Data: In addition to the above datasets, this would require Yelp data on these businesses (potentially feeding a recommendation engine) and biking routes (or, failing that, driving routes) between the start and end points of each trip, obtained from the MapQuest or Google Maps API.

Summary

Insights:

  • Weekday and weekend behavior of casual customers and subscribers in SF mirrors what was seen in DC: weekday and weekend patterns are very different, and the hourly patterns of subscribers and casual customers over the course of the day also differ sharply.
  • 1% of subscribers pay an overtime fee, whereas 28% of casual customers do.
  • The median distance and duration traveled by casual customers are almost twice those of subscribers. However, 80% of total trips are made by subscribers.
  • A few stations account for a far larger share of traffic than the rest, and the distribution of traffic differs between casual customers and subscribers.
  • The five restaurants nearest to each bike station were identified. Most bike stations are near metro stations and in downtown San Francisco (as expected).

Code

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
import numpy as np
import pygmaps
import sqlite3 as lite
import re
%matplotlib inline


read_from_csv = False
create_new_tables = False
read_from_sql = True
In [2]:
# Read Data from CSV

def replace_camelcase_with_underscore(name):
    '''
    Taken from Stack Overflow: 
    http://stackoverflow.com/questions/1175208/elegant-python-function-to-convert-camelcase-to-camel-case
    '''
    s1 = re.sub('(.)([A-Z][a-z]+)', r'\1_\2', name)
    return re.sub('([a-z0-9])([A-Z])', r'\1_\2', s1)

def clean_col_names(df):
    '''
    Strip whitespace from column names, remove spaces and underscores,
    then convert any CamelCase names to snake_case.
    '''
    df.columns = [name.lstrip() for name in df.columns.tolist()]
    df.columns = [name.rstrip() for name in df.columns.tolist()]
    df.columns = [name.replace(' ','') for name in df.columns.tolist()]    
    df.columns = [name.replace('_','') for name in df.columns.tolist()]    
    df.columns = [replace_camelcase_with_underscore(name) for name in df.columns.tolist()]
    return df

if read_from_csv:
    # Read data from csv files
    trip_df = pd.read_csv("Datasets/201402_trip_data.csv", nrows=None)
    weath_df = pd.read_csv("Datasets/201402_weather_data.csv", nrows=None)
    stn_df = pd.read_csv("Datasets/201402_station_data.csv", nrows=None)
    res_df = pd.read_csv("Datasets/Registered_Restaurants_in_SF.csv", nrows=None)
    rebal_df = pd.read_csv("Datasets/201402_rebalancing_data.csv", nrows=None)
    weath_df = weath_df.rename(columns = {'zip' : 'Zip'})
                               
    # Data cleaning and feature engineering
    # Clean column names
    dfs = [trip_df, weath_df, stn_df, res_df, rebal_df]
    for idx,df in enumerate(dfs):
        dfs[idx] = clean_col_names(dfs[idx])

    trip_df2 = pd.read_csv("Datasets/201408_trip_data.csv", nrows=None)
    weath_df2 = pd.read_csv("Datasets/201408_weather_data.csv", nrows=None)
    stn_df2 = pd.read_csv("Datasets/201408_station_data.csv", nrows=None)
    rebal_df2 = pd.read_csv("Datasets/201408_rebalancing_data.csv", nrows=None)

    #Rename column in trip_df2
    trip_df2 = trip_df2.rename(columns = {"Subscriber Type": "Subscription Type"})
    weath_df2 = weath_df2.rename(columns = {'PDT' : 'Date'})

    dfs = [trip_df2, weath_df2, stn_df2, rebal_df2]
    for idx,df in enumerate(dfs):
        dfs[idx] = clean_col_names(dfs[idx])
        
    # Combine two sets of dataframes to one
    trip_df = trip_df.append(trip_df2, ignore_index=True)
    weath_df = weath_df.append(weath_df2, ignore_index=True)
    stn_df = stn_df.append(stn_df2, ignore_index=True)
    rebal_df = rebal_df.append(rebal_df2, ignore_index=True)
    

    # Remove outliers. Outliers defined as values greater than 99.5th percentile
    maxVal = np.percentile(trip_df['Duration'],99.5)
    trip_df = trip_df[trip_df['Duration'] <= maxVal]

    # Extract date, month, hour of start
    trip_df['Start_day'] = trip_df['Start_Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y %H:%M")).day)
    trip_df['Start_month'] = trip_df['Start_Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y %H:%M")).month)
    trip_df['Start_hour'] = trip_df['Start_Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y %H:%M")).hour)

    # Clean the restaurant table to extract the longitude and latitude information
    geo_loc = res_df['Business_Location'].apply(lambda x: x.split('(')[1])
    res_df['rest_lat'] = geo_loc.apply(lambda x: np.float(x.split(',')[0]))
    res_df['rest_long']= geo_loc.apply(lambda x: np.float(x.split(',')[1][:-1]))
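
As a quick illustration of what the cleaning helpers above do, here is a minimal sketch using made-up column names (not names taken from the actual CSVs):

# Sketch only: hypothetical column names to illustrate clean_col_names
demo_df = pd.DataFrame(columns=[' Start Date', 'Subscription Type', 'Zip_Code'])
demo_df = clean_col_names(demo_df)
print(demo_df.columns.tolist())
# expected: ['Start_Date', 'Subscription_Type', 'Zip_Code']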
In [3]:
# Create SQL table
def create_table(df,table_name,con):
    df.to_sql(table_name,con,if_exists='replace')
    return 1

# Write csv data to SQL
if create_new_tables:
    table_names = ["trip_info","weather","station","restaurant","rebal"]
    dfs = [trip_df,weath_df,stn_df,res_df,rebal_df]
    con = lite.connect("sfo_bike.db")    
    
    for idx,df in enumerate(dfs):
        create_table(df,table_names[idx],con)
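
To confirm the tables were written, SQLite's catalog can be queried; a minimal sketch (assuming the connection `con` created above when create_new_tables is True is still open):

# Sketch: list the tables now present in sfo_bike.db
# (assumes `con` is the sqlite3 connection opened in the cell above)
tables = pd.read_sql_query("SELECT name FROM sqlite_master WHERE type='table';", con)
print(tables)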
In [4]:
if read_from_sql:
    con = lite.connect("sfo_bike.db")
    query_res = '''
            SELECT Location_ID, Zip_Code, rest_lat,rest_long
            FROM restaurant;
            '''

    query_trip = '''
            SELECT Trip_ID , Duration , Start_Terminal, Start_Date, 
                   End_Terminal , Subscription_Type , 
                   Zip_Code , Start_day , Start_month , Start_hour
            FROM trip_info;
            '''

    query_stn = '''
            SELECT stationid , lat , long , dockcount
            FROM station;
            '''
    query_reb = '''
            SELECT stationid , bikesavailable , docksavailable, time
            FROM rebal
            LIMIT 10;
            '''
    # Weather query (defined but not executed below)
    query_weat = '''
            SELECT *
            FROM weather;
            '''

    res_df = pd.read_sql_query(query_res, con, index_col=None, coerce_float=True, \
                           params=None, parse_dates=None, chunksize=None)

    trip_df = pd.read_sql_query(query_trip, con, index_col=None, coerce_float=True, \
                           params=None, parse_dates=None, chunksize=None)

    stn_df = pd.read_sql_query(query_stn, con, index_col=None, coerce_float=True, \
                           params=None, parse_dates=None, chunksize=None)
    rebal_df = pd.read_sql_query(query_reb, con, index_col=None, coerce_float=True, \
                           params=None, parse_dates=None, chunksize=None)
else:

    # Only a subset of each data frame is needed for this analysis
    res_df = res_df[['Location_ID','Zip_Code','rest_lat','rest_long']]
    trip_df = trip_df[['Trip_ID','Duration','Start_Terminal','Start_Date','End_Terminal','Subscription_Type','Zip_Code','Start_day','Start_month','Start_hour']]
    stn_df = stn_df[['stationid','lat','long','dockcount']]
In [5]:
def replace_hyp_with_bck(ser):
    '''
    Replace hyphens with forward slashes in a series of date strings
    so they match the "%Y/%m/%d %H:%M:%S" format parsed below.
    '''
    vals = [strg.replace('-','/') for strg in ser.values]
    return vals

rebal_df['time'] = replace_hyp_with_bck(rebal_df['time'])

Feature engineering

In [6]:
trip_df['day_of_week'] = trip_df['Start_Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y %H:%M")).weekday())

rebal_df['Day'] = rebal_df['time'].map(lambda x:(datetime.strptime(x, "%Y/%m/%d %H:%M:%S")).day)
rebal_df['month'] = rebal_df['time'].map(lambda x:(datetime.strptime(x, "%Y/%m/%d %H:%M:%S")).month)
rebal_df['hour'] = rebal_df['time'].map(lambda x:(datetime.strptime(x, "%Y/%m/%d %H:%M:%S")).hour)

# weath_df['Day'] = weath_df['Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y")).day)
# weath_df['Month'] = weath_df['Date'].map(lambda x:(datetime.strptime(x, "%m/%d/%Y")).month)

bike_avail = rebal_df.groupby(['month','Day','hour'])['bikesavailable'].min()
dock_avail = rebal_df.groupby(['month','Day','hour'])['docksavailable'].min()
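
bike_avail and dock_avail above are Series with a (month, Day, hour) MultiIndex. As an illustrative sketch (not part of the original analysis), the hourly pattern of minimum bike availability could be eyeballed like this:

# Sketch: mean of the per-(month, day) minimum bikes available, by hour of day
hourly_min_bikes = bike_avail.groupby(level='hour').mean()
hourly_min_bikes.plot(kind='bar')
plt.xlabel('Hour of day')
plt.ylabel('Mean of minimum bikes available')
plt.title('System-wide minimum bike availability by hour')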
In [7]:
def get_crowfly_distance(lat1, lon1, lat2, lon2):
    '''
    Great-circle ("as the crow flies") distance in miles between two
    (latitude, longitude) points, computed with the Haversine formula:
    http://andrew.hedges.name/experiments/haversine/
    '''
    
    lon1 = lon1*np.pi/180
    lat1 = lat1*np.pi/180
    lon2 = lon2*np.pi/180
    lat2 = lat2*np.pi/180

    dlon = lon2 - lon1 
    dlat = lat2 - lat1 

    R = 3961 # miles
    a = (np.sin(dlat/2))**2 + np.cos(lat1) * np.cos(lat2) * (np.sin(dlon/2))**2 
    c = 2 * np.arctan2( np.sqrt(a), np.sqrt(1-a) ) 
    d = R * c
    return d


# Update the trip data table with the latitudes and longitudes of start and end
trip_df = trip_df.rename(columns = {'Start_Terminal': 'stationid'})
trip_df = pd.merge(trip_df,stn_df[['lat','long','stationid']], \
                   how = 'left', on=['stationid'])
trip_df = trip_df.drop_duplicates()

trip_df = trip_df.rename(columns = {'lat':'Start_lat','long' : 'Start_long','stationid': 'Start_Terminal'})
# trip_df.drop('station_id', axis=1, inplace=True)
trip_df = trip_df.rename(columns = {'End_Terminal': 'stationid'})
trip_df = pd.merge(trip_df,stn_df[['lat','long','stationid']], \
                    how = 'left', on=['stationid'])
trip_df = trip_df.drop_duplicates()
trip_df = trip_df.rename(columns = {'stationid': 'End_Terminal','lat':'End_lat','long' : 'End_long'})

trip_df['Distance'] = get_crowfly_distance(trip_df['Start_lat'], trip_df['Start_long'], trip_df['End_lat'], trip_df['End_long'])
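
As a rough sanity check of get_crowfly_distance (a sketch using approximate city-center coordinates), the crow-fly distance from San Francisco to Los Angeles should come out near 350 miles:

# Sanity check with approximate coordinates: SF (37.77, -122.42) to LA (34.05, -118.24)
print(get_crowfly_distance(37.77, -122.42, 34.05, -118.24))  # roughly 350 miles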

Exploratory Data Analysis

In [8]:
# Defining a color pattern based on Tableau's color code
colrcode = [(31, 119, 180), (255, 127, 14),\
             (44, 160, 44), (214, 39, 40), \
             (148, 103, 189),  (140, 86, 75), \
             (227, 119, 194), (127, 127, 127), \
             (188, 189, 34), (23, 190, 207)]

for i in range(len(colrcode)):  
    r, g, b = colrcode[i]  
    colrcode[i] = (r / 255., g / 255., b / 255.)
In [9]:
# Understanding the number of trips based on weekday.

fig,axes = plt.subplots(figsize=(12, 4), nrows=1, ncols=2)

a = trip_df[trip_df['Subscription_Type'] == 'Subscriber'].groupby(['day_of_week','Start_hour'])['Trip_ID'].count()
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

plt.sca(axes[0])

for i in range(7):
    x = a[i].plot(label = days[i])
    print('Peak on Day {0} occurs at {1}00 hours'.format(i,np.argmax(a[i])))
plt.title('Number of trips by hour : Subscriber')    
plt.legend()

a = trip_df[trip_df['Subscription_Type'] == 'Customer'].groupby(['day_of_week','Start_hour'])['Trip_ID'].count()
days = ['Mon','Tue','Wed','Thu','Fri','Sat','Sun']

plt.sca(axes[1])
for i in range(7):
    x = a[i].plot(label = days[i])
    print('Peak on Day {0} occurs at {1}00 hours'.format(i,np.argmax(a[i])))
plt.title('Number of trips by hour : Customer')    
plt.legend()
    
Peak on Day 0 occurs at 800 hours
Peak on Day 1 occurs at 800 hours
Peak on Day 2 occurs at 800 hours
Peak on Day 3 occurs at 800 hours
Peak on Day 4 occurs at 800 hours
Peak on Day 5 occurs at 1200 hours
Peak on Day 6 occurs at 1500 hours
Peak on Day 0 occurs at 1600 hours
Peak on Day 1 occurs at 1700 hours
Peak on Day 2 occurs at 1700 hours
Peak on Day 3 occurs at 1500 hours
Peak on Day 4 occurs at 1300 hours
Peak on Day 5 occurs at 1400 hours
Peak on Day 6 occurs at 1500 hours
Out[9]:
<matplotlib.legend.Legend at 0x10d6ed890>
In [10]:
plt.figure(figsize = [8,6])
h = plt.hist(trip_df[trip_df['Subscription_Type'] == 'Subscriber']['Duration'].values,\
             range = [0,2500],color = colrcode[0],alpha = .6, label = 'Subscriber')

plt.plot([1800,1800],[0,float(trip_df.shape[0]/2)],color = colrcode[1],linestyle = '--')
plt.text(1820,float(trip_df.shape[0]/4),'-> Overtime charge')

h = plt.hist(trip_df[trip_df['Subscription_Type'] == 'Customer']['Duration'].values,\
             range = [0,2500],color = colrcode[1],alpha = .6, label = 'Casual')
plt.title('Histogram: Duration in sec')
plt.xlabel('Duration of ride (s)')
plt.legend()
Out[10]:
<matplotlib.legend.Legend at 0x10df0ab50>
In [11]:
val = trip_df[(trip_df['Subscription_Type'] == 'Customer') & (trip_df['Duration']>1800)]['Duration'].count()/\
    float(trip_df[(trip_df['Subscription_Type'] == 'Customer')]['Duration'].count())
print("{0}% of causal customers pay overtime fee! ".format(np.round(val*100,0)))

val = trip_df[(trip_df['Subscription_Type'] == 'Subscriber') & (trip_df['Duration']>1800)]['Duration'].count()/\
    float(trip_df[(trip_df['Subscription_Type'] == 'Subscriber')]['Duration'].count())
print("{0}% of subscribers pay overtime fee! ".format(np.round(val*100,0)))
28.0% of casual customers pay overtime fee! 
1.0% of subscribers pay overtime fee! 
  • As seen in the histogram above, most rides last under 30 minutes, beyond which an overtime fee applies. Subscribers keep their commutes shorter.
  • A sizable fraction of casual customers (28.0%) end up paying the overtime fee!
In [12]:
# Understanding How Metrics are different based on Subscription Type
num_sub = trip_df.groupby('Subscription_Type')['Duration'].count()
dur_sub = trip_df.groupby('Subscription_Type')['Duration'].median()/60

dis_sub = trip_df.groupby('Subscription_Type')['Distance'].median()

fig,axes = plt.subplots(figsize=(12, 6), nrows=1, ncols=3)
plt.sca(axes[0])
num_sub = num_sub/np.sum(num_sub.values)*100
num_sub.plot(kind = 'bar')
plt.ylabel('% of total')
plt.title('Trips by subscription type')
plt.sca(axes[1])
dur_sub.plot(kind = 'bar')
plt.ylabel('Duration in minutes')
plt.title('Median trip duration by subscription type')
plt.sca(axes[2])
dis_sub.plot(kind = 'bar')
plt.ylabel('Distance in miles')
plt.title('Median Distance as the crow flies')
Out[12]:
<matplotlib.text.Text at 0x10d2b8410>
In [13]:
# Examining traffic between stations

counts_per_endStation = trip_df[trip_df['Subscription_Type'] == 'Subscriber'].groupby(['Start_Terminal','End_Terminal'])['Trip_ID'].count()
counts_per_endStation  = counts_per_endStation.unstack()
plt.figure(figsize = (15,15))
h = sns.heatmap(counts_per_endStation ,linewidths=0, square=True , vmax=450)
h = plt.xticks(rotation='vertical')
h= plt.yticks(rotation='horizontal')
plt.title('Traffic between pairs of bike stations: Subscribers')
plt.savefig('HeatMap_StationUsage.png')

counts_per_endStation = trip_df[trip_df['Subscription_Type'] == 'Customer'].groupby(['Start_Terminal','End_Terminal'])['Trip_ID'].count()
counts_per_endStation  = counts_per_endStation.unstack()
plt.figure(figsize = (15,15))
h = sns.heatmap(counts_per_endStation ,linewidths=0, square=True , vmax=450)
h = plt.xticks(rotation='vertical')
h= plt.yticks(rotation='horizontal')
plt.title('Traffic between pairs of bike stations: Customers')
plt.savefig('HeatMap_StationUsage_Customer.png')
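
The heatmaps suggest that traffic is concentrated at a handful of stations. A minimal sketch (illustrative, not part of the original notebook) of quantifying that concentration via the share of trips starting at the ten busiest terminals:

# Sketch: share of all trips that start at the ten busiest start terminals
# (the cutoff of 10 stations is arbitrary)
start_counts = trip_df['Start_Terminal'].value_counts()
top10_share = start_counts.head(10).sum() / float(start_counts.sum()) * 100
print('Top 10 start stations account for {0:.1f}% of all trips'.format(top10_share))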
In [14]:
# Determine the k nearest restaurants to each of the bike stations
from sklearn.neighbors import NearestNeighbors

k = 5  # number of nearest neighbors

# Restaurant (lat, long) pairs are the training samples
samples = np.matrix([res_df.rest_lat.values, res_df.rest_long.values]).transpose()

# Fit the nearest neighbor model on the restaurant locations
neigh = NearestNeighbors(n_neighbors=k, radius=1)
neigh.fit(samples)

# Station (lat, long) pairs are the query points
lat_lon = np.matrix([stn_df.lat.values, stn_df.long.values]).transpose()

# Plot everything on a Google map centered on downtown San Francisco
mymap = pygmaps.maps(37.7858345, -122.4059035, 15)

# All registered restaurants in gray (#DBDBD6)
for i in range(res_df.shape[0]):
    mymap.addpoint(res_df.rest_lat[i], res_df.rest_long[i], "#DBDBD6")

# Bike stations in blue (#0000FF), each with its k nearest restaurants in red (#FF0000)
for i in range(stn_df.shape[0]):
    mymap.addpoint(stn_df.lat[i], stn_df.long[i], "#0000FF")
    nb = neigh.kneighbors(lat_lon[i], k, return_distance=False)
    for count in range(k):
        mymap.addpoint(samples[nb[0][count], 0], samples[nb[0][count], 1], "#FF0000")

mymap.draw('./mymap.html')
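
One caveat on the search above: it uses the default Euclidean metric on raw latitude/longitude degrees, which is only an approximation (reasonable over an area as small as downtown San Francisco). A hedged sketch of a more geographically faithful alternative, using scikit-learn's BallTree with the haversine metric (inputs in radians, distances returned in radians):

# Sketch: nearest restaurants by great-circle distance using a BallTree
from sklearn.neighbors import BallTree

rest_rad = np.radians(np.column_stack([res_df.rest_lat.values, res_df.rest_long.values]))
stn_rad = np.radians(np.column_stack([stn_df.lat.values, stn_df.long.values]))

tree = BallTree(rest_rad, metric='haversine')
dist_rad, idx = tree.query(stn_rad, k=5)   # 5 nearest restaurants per station
dist_miles = dist_rad * 3961               # same Earth radius (in miles) as above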
In [20]:
import webbrowser
new = 2 # open in a new tab, if possible
url = "./mymap.html"
h = webbrowser.open(url,new=new)

Insights gained:

1. The bike stations are near the subway line (not unexpected, but good to see).
2. We now have a set of potential restaurant owners to work with.
In [16]:
# no_bikes = rebal_df[rebal_df['bikesavailable']==0]['stationid'].value_counts()
# plt.figure(figsize=(8,6))
# h = plt.bar(no_bikes.keys().values,no_bikes.values)
In [ ]: