1.Introduction

Bike share is a form of public transportation that provides people with access to durable bicycles for getting from one place to another, and allows them to return the bikes at any available station in the bike-share system. It is an ideal option for one-way trips. However, it is a problem when people arrive at a station and find no bicycles available, or are ready to return a bike and find no empty dock. So, to better serve its users, there is a great demand for the bike-share system to carry out a Re-balancing Plan, which ensures that bicycles are available for pick-up or return at any station at all times, especially during peak periods. The plan is carried out by giving riders extra credits to encourage them to take bikes from stations with high bike inventory or to return bikes to stations with low bike inventory. In peak times, trucks also move bikes from one station to another.

Here, I use Philadelphia's Indego bike-share system as an example to build a model that predicts the space/time peaks of bike-share usage, so that the re-balancing program can be carried out effectively. Ideally, the model would predict usage for the same hours on weekdays and the same hours on weekends two weeks ahead. In short, the core of this project is to accurately predict demand peaks so that bikes can be deployed between stations.

2.Set Up

2.1 Import Bike-share data

The data is downloaded from the Indego website (https://www.rideindego.com/about/data/). The study period covers May 20th through June 23rd, 2020 (weeks 21-25), five weeks in total.

knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(gifski)
library(caret)
Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
plotTheme2 <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

options(tigris_class = "sf")
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

palette5 <- c("#FCD1D1","#ECE2E1","#D3E0DC","#AEE1E1","#97CFCF")
p5 <- c("#D6A3DC","#F7DB70","#EABEBF","#75CCE8","#A5DEE5")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
p2 <- c("#FCD1D1","#97CFCF")

options(knitr.duplicate.label = "allow")
#Philadelphia Bikeshare data From 2020 April-June
ride <- read.csv("D:/0MUSA/MUSA508/Assignment/5/indego-trips-2020-q2.csv")

ride2 <-
  ride %>% 
  mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),    # hourly time bins
         interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"), # 15-minute time bins
         week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(week > 20, week < 26)   # keep weeks 21-25 (May 20 - June 23)

2.2 Import Census Data

I use the tidycensus package to download census geographies and variables, then join them to the origin and destination stations of the bike-share data. The census data lets us relate each bike-share station to the socio-economic characteristics of the tract it sits in.
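The code below assumes PhillyCensus already exists; since that chunk is not shown, here is a minimal sketch of how it could be built with tidycensus. The ACS variable codes, the survey year, and the intermediate column names (Total_Pop, White_Pop, Total_Public_Trans) are my assumptions; only the derived columns used later in the error analysis (Med_Inc, Percent_White, Percent_Taking_Public_Trans) are required, and a Census API key must first be registered with census_api_key().

# Hypothetical sketch: download Philadelphia tract geographies and ACS variables.
# Variable codes and year are assumptions; only the derived column names
# (Med_Inc, Percent_White, Percent_Taking_Public_Trans) are needed downstream.
PhillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001",   # total population
                        "B19013_001",   # median household income
                        "B02001_002",   # white population
                        "B08301_010"),  # commuters taking public transit
          year = 2019, state = "PA", county = "Philadelphia", 
          geometry = TRUE, output = "wide") %>%
  rename(Total_Pop = B01003_001E,
         Med_Inc = B19013_001E,
         White_Pop = B02001_002E,
         Total_Public_Trans = B08301_010E) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Percent_Taking_Public_Trans = Total_Public_Trans / Total_Pop) %>%
  select(GEOID, Total_Pop, Med_Inc, Percent_White, Percent_Taking_Public_Trans, geometry) %>%
  st_transform(crs = 4326)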

PhillyTracts <- 
  PhillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf
dat_census <- st_join(ride2 %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        PhillyTracts %>%
          st_transform(crs = 4326),
        join = st_intersects,
        left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%   # tract where each trip begins
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2))) %>%
  as.data.frame() %>%
  select(-geometry) %>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., PhillyTracts %>%
            st_transform(crs = 4326),
          join = st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID) %>%   # tract where each trip ends
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2))) %>%
  as.data.frame() %>%
  select(-geometry)

2.3 Import Weather Data

Since people hardly ever ride in bad weather, such as storms or very hot days, weather data is important for predicting bike-share usage. The Philadelphia weather data is imported with the riem_measures function, using weather station KPHL (Philadelphia International Airport).

weather.Panel <- 
  riem_measures(station = "KPHL", date_start = "2020-05-20", date_end = "2020-06-24") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%   # temperature (F), hourly precipitation (in), wind speed (knots)
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
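    # impute a plausible value for hours where the sensor reports 0 degrees F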
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

glimpse(weather.Panel)
## Rows: 840
## Columns: 4
## $ interval60    <dttm> 2020-05-20 00:00:00, 2020-05-20 01:00:00, 2020-05-20 02…
## $ Temperature   <dbl> 55.9, 55.0, 54.0, 54.0, 53.1, 52.0, 51.1, 51.1, 50.0, 50…
## $ Precipitation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Wind_Speed    <dbl> 19, 15, 17, 14, 14, 13, 11, 13, 11, 10, 12, 13, 14, 18, …
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme(),
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme(),
  top="Weather Data - Philadelphia KPHL - May-June, 2020")

3.Data Wrangling

3.1 Create the Space-Time Panel

The ride.panel is created so that every station-hour in the study period is represented by a row; station-hours with no trips receive a Trip_Count of 0.

All the variables that are needed in the following predictions are joined to this panel.

study.panel <- 
  expand.grid(interval60 = unique(dat_census$interval60),   # every station-hour combination
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station,  Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))
ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station,  Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%   # station-hours with no trips sum to 0
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel <- 
  left_join(ride.panel, PhillyCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3.2 Create time lags

Because our goal is to predict future demand, we need to measure the relationship between bike-share usage at a given hour and usage 1 hour later, 12 hours later, and one day later. We expect usage one hour later to be similar to the current hour, usage 12 hours later to be the most different, and usage one day later to be basically the same, since the daily cycle repeats.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 146,1,0)) %>%   # Memorial Day 2020 (May 25 = day-of-year 146)
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, "0" ))
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
## # A tibble: 6 × 2
##   Variable   correlation
##   <fct>            <dbl>
## 1 lagHour           0.92
## 2 lag2Hours         0.79
## 3 lag3Hours         0.61
## 4 lag4Hours         0.41
## 5 lag12Hours       -0.62
## 6 lag1day           0.8

3.3 Split Train and Test Set

The data is split into a 3-week training set (weeks 21-23) and a 2-week test set (weeks 24-25). If the predictions fit the observed values closely, the model is ready for the bike-share company to use.

ride.Train <- filter(ride.panel, week <= 23)
ride.Test <- filter(ride.panel, week > 23)

4.Explore Data

4.1 Trip_Count serial autocorrelation

Analyzing the time distribution of bike trips shows that the time-series pattern is highly consistent across weeks, with peaks and valleys occurring at roughly the same times each day.

mondays <- 
  mutate(ride.panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%   # flag 1 AM each Monday to mark week boundaries
  filter(monday != 0)

st_drop_geometry(rbind(
  mutate(ride.Train, Legend = "Training"), 
  mutate(ride.Test, Legend = "Testing"))) %>%
    group_by(Legend, interval60) %>% 
      summarize(Trip_Count = sum(Trip_Count)) %>%
      ungroup() %>% 
      ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
        scale_colour_manual(values = p2) +
        geom_vline(data = mondays, aes(xintercept = monday)) +
        labs(title="Bide-share trips by week: May-June",
                       x="Day", y="Trip Count") +
        plotTheme() + theme(panel.grid.major = element_blank(),legend.position = "bottom",aspect.ratio=3/9)    

Histograms of mean hourly trips per station are presented for four representative time periods. As expected, more people use bike-share during the AM and PM rush hours.

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 0.3,fill="#AEE1E1")+
  labs(title="Mean Number of Hourly Trips Per Station. Philadelphia, May-June, 2020",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme()

Most of the stations have no more than 10 trips per hour.

ggplot(dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1,fill="#AEE1E1")+
  labs(title="Bike share trips per hr by station. Philadelphia, May-June, 2020",
       x="Trip Counts", 
       y="Number of Stations")+
    scale_fill_manual(values = palette5) +
  plotTheme()

Comparing weekday with weekend usage, it is clear that more trips take place on weekdays. Oddly, though, the weekday curve has only one peak and shows no AM-rush peak, contrary to what I expected. Weekend trips peak at mid-day, while weekday trips peak during the PM rush.

ggplot(dat_census %>% mutate(hour=hour(interval60)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia, by day of the week, May-June, 2020",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

ggplot(dat_census %>% 
         mutate(hour = hour(interval15),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in Philadelphia - weekend vs weekday, May-June, 2020",
       x="Hour", 
       y="Trip Counts")+
     plotTheme()

The plots show that the correlation weakens as the lag grows: the lowest correlation occurs at the 12-hour lag, and at the one-day lag the predictive power returns.

plotData.lag <-
  filter(as.data.frame(ride.panel), week == 23) %>%
  dplyr::select(starts_with("lag"), Trip_Count) %>%
  gather(Variable, Value, -Trip_Count) %>%
  mutate(Variable = fct_relevel(Variable, "lagHour","lag2Hours","lag3Hours",
                                          "lag4Hours","lag12Hours","lag1day"))
correlation.lag <-
  group_by(plotData.lag, Variable) %>%
    summarize(correlation = round(cor(Value, Trip_Count, use = "complete.obs"), 2))

ggplot(plotData.lag, aes(Value, Trip_Count)) +
  geom_point(size = 0.1) +
  geom_text(data = correlation.lag, aes(label = paste("r =", round(correlation, 2))),
            x=-Inf, y=Inf, vjust = 1.5, hjust = -.1) +
  geom_smooth(method = "lm", se = FALSE, colour = "blue") +
  facet_wrap(~Variable, ncol = 3, scales = "free") +
  labs(title = "Bike-share trip count as a function of time lags") +
  plotTheme()

4.2 Trip Count spatial autocorrelation

The maps of bike-share trips are faceted by the four time periods and by weekday/weekend. More trips clearly take place during the weekday PM rush.

ggplot()+
  geom_sf(data = PhillyTracts %>%
          st_transform(crs=4326),fill="grey30")+
  geom_point(data = dat_census %>% 
            mutate(hour = hour(interval60),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.6, size = 1.3)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. Philadelphia, May-June, 2020")+
  mapTheme()

4.3 Animated Map of Space/Time correlation

The animated map better illustrates how trip counts vary over time and space. It covers one day (a Monday in week 23) at 15-minute intervals, so the change in trip counts from period to period can be viewed directly.

week23 <-
  filter(ride2 , week == 23 & dotw == "Mon")

week23.panel <-
  expand.grid(
    interval15 = unique(week23$interval15),
    Origin.Tract = unique(dat_census$Origin.Tract))
ride.animation.data <-
  mutate(week23, Trip_Counter = 1) %>%
    right_join(week23.panel) %>% 
    group_by(interval15, start_lat, start_lon,Origin.Tract) %>%
    summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
    ungroup() %>% 
    left_join(PhillyTracts, by=c("Origin.Tract" = "GEOID")) %>%
    st_sf() %>%
    mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                             Trip_Count > 0 & Trip_Count <= 1 ~ "0-1 trips",
                             Trip_Count > 1& Trip_Count <= 3~ "1-3 trips",
                             Trip_Count > 3 & Trip_Count <= 5 ~ "3-5 trips",
                             Trip_Count > 5 ~ "5+ trips")) %>%
    mutate(Trips  = fct_relevel(Trips, "0 trips","0-1 trips","1-3 trips",
                                       "3-5 trips","5+ trips"))

rideshare_animation <-
  ggplot() +geom_sf(data = PhillyTracts %>%
          st_transform(crs=4326),fill="#000000")+
    geom_point(data = ride.animation.data, aes(x=start_lon, y = start_lat, color = Trips),fill = "transparent", alpha = 0.8, size = 2) +
    scale_color_manual(values = p5) +
    labs(title = "Bike-share trips for one day in June 2020",
         subtitle = "15 minute intervals: {current_frame}") +
    transition_manual(interval15) +
    mapTheme()

animate(rideshare_animation, duration=20, renderer = gifski_renderer())
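To export the rendered animation as a GIF for sharing, gganimate's anim_save() can be used; the filename here is only an example.

anim_save("week23_rideshare.gif", animation = rideshare_animation)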

4.4 Correlation with Weather

Weather appears to influence people's decision to take a bike-share trip, since more trips take place on days without rain.

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Precipitation = first(Precipitation)) %>%
  mutate(isPrecip = ifelse(Precipitation > 0, "Rain/Snow", "None")) %>%
  group_by(isPrecip) %>%
  summarize(Mean_Trip_Count = mean(Trip_Count)) %>%
  ggplot(aes(isPrecip, Mean_Trip_Count)) + 
    geom_bar(stat = "identity", fill="#AEE1E1") +
    labs(title="Does ridership vary with precipitation?",
         x="Precipitation", y="Mean Trip Count") +
    plotTheme()

st_drop_geometry(ride.panel) %>%
  group_by(interval60) %>% 
  summarize(Trip_Count = mean(Trip_Count),
            Temperature = first(Temperature)) %>%
  mutate(week = week(interval60)) %>%
  ggplot(aes(Temperature, Trip_Count)) + 
    geom_point() + geom_smooth(method = "lm", se= FALSE) +
    facet_wrap(~week, ncol=3) + 
    labs(title="Trip Count as a fuction of Temperature by week",
         x="Temperature", y="Mean Trip Count") +
    plotTheme()

5.Modeling and validation

I build five models to predict bike trips:

reg1 - Time_FE: hour of the day, day of the week, temperature

reg2 - Space_FE: start station, day of the week, temperature

reg3 - Time_Space_FE: hour of the day, start station, day of the week, temperature, precipitation

reg4 - Time_Space_FE_timeLags: hour of the day, start station, day of the week, temperature, precipitation, time lags

reg5 - Time_Space_FE_timeLags_holidayLags: hour of the day, start station, day of the week, temperature, precipitation, time lags, holiday lags

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

5.1 Predict for test data

The results of the five models are shown below. For each test-set week, the mean absolute error (MAE) is the average of abs(observed - predicted) across all station-hours.

ride.Test.weekNest <- 
  ride.Test %>%
  nest(data = -week)   # one nested tibble of station-hours per test week

model_pred <- function(dat, fit){
   pred <- predict(fit, newdata = dat)}   # helper: predict a nested week of data with a fitted model
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
    gather(Regression, Prediction, -data, -week) %>%   # long form: one row per model per test week
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),   # mean absolute error per model-week
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
## # A tibble: 10 × 8
##     week data                   Regression   Predi…¹ Obser…² Absol…³   MAE sd_AE
##    <dbl> <list>                 <chr>        <list>  <list>  <list>  <dbl> <dbl>
##  1    24 <tibble [23,688 × 29]> ATime_FE     <dbl>   <dbl>   <dbl>   1.00  1.38 
##  2    25 <tibble [23,688 × 29]> ATime_FE     <dbl>   <dbl>   <dbl>   0.961 1.19 
##  3    24 <tibble [23,688 × 29]> BSpace_FE    <dbl>   <dbl>   <dbl>   1.08  1.38 
##  4    25 <tibble [23,688 × 29]> BSpace_FE    <dbl>   <dbl>   <dbl>   1.03  1.16 
##  5    24 <tibble [23,688 × 29]> CTime_Space… <dbl>   <dbl>   <dbl>   1.00  1.38 
##  6    25 <tibble [23,688 × 29]> CTime_Space… <dbl>   <dbl>   <dbl>   0.960 1.19 
##  7    24 <tibble [23,688 × 29]> DTime_Space… <dbl>   <dbl>   <dbl>   0.856 1.12 
##  8    25 <tibble [23,688 × 29]> DTime_Space… <dbl>   <dbl>   <dbl>   0.821 0.999
##  9    24 <tibble [23,688 × 29]> ETime_Space… <dbl>   <dbl>   <dbl>   0.855 1.12 
## 10    25 <tibble [23,688 × 29]> ETime_Space… <dbl>   <dbl>   <dbl>   0.820 1.00 
## # … with abbreviated variable names ¹​Prediction, ²​Observed, ³​Absolute_Error

5.2 k-Fold Cross Validation

100-fold cross-validation is run for each of the five models to ensure that the goodness-of-fit results from a single hold-out are not a fluke.

ctrl <- trainControl(method = "cv", number = 100)
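# 100 folds: each hold-out contains roughly 1% of the station-hours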
set.seed(999)

ride.panel <- ride.panel %>% mutate(hour = hour(interval60))   # add the hour of the day for the CV models
cvFit1 <- train(Trip_Count ~ ., data = ride.panel %>% 
                                   dplyr::select(Trip_Count,
                                  dotw , hour,Temperature), 
                method="lm", trControl = ctrl, na.action = na.pass)

cvFit1
## Linear Regression 
## 
## 117453 samples
##      3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 116279, 116278, 116279, 116278, 116279, 116278, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.603264  0.05731437  0.9647888
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ctrl <- trainControl(method = "cv", number = 100)
set.seed(999)
cvFit2 <- train(Trip_Count ~ ., data = ride.panel %>% 
                                   dplyr::select(Trip_Count,
                                  start_station,dotw ,Temperature), 
                method="lm", trControl = ctrl, na.action = na.pass)

cvFit2
## Linear Regression 
## 
## 117453 samples
##      3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 116279, 116278, 116279, 116278, 116279, 116278, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   1.628182  0.02742689  1.009085
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
ctrl <- trainControl(method = "cv", number = 100)
set.seed(999)
cvFit3 <- train(Trip_Count ~ ., data = ride.panel %>% 
                                   dplyr::select(Trip_Count,
                                  start_station,dotw , hour,Temperature,Precipitation), 
                method="lm", trControl = ctrl, na.action = na.pass)

cvFit3
## Linear Regression 
## 
## 117453 samples
##      5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 116279, 116278, 116279, 116278, 116279, 116278, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE    
##   1.601729  0.05911642  0.96438
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvFit4 <- train(Trip_Count ~ ., data = ride.panel %>% 
                                   dplyr::select(Trip_Count,
                                  start_station,dotw , hour,Temperature,Precipitation,lagHour,lag2Hours,lag3Hours,lag12Hours,lag1day), 
                method="lm", trControl = ctrl, na.action = na.pass)

cvFit4
## Linear Regression 
## 
## 117453 samples
##     10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 116278, 116279, 116279, 116279, 116279, 116279, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.334181  0.3441758  0.8101575
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvFit5 <- train(Trip_Count ~ ., data = ride.panel %>% 
                                   dplyr::select(Trip_Count,
                                  start_station,dotw , hour,Temperature,Precipitation,lagHour,lag2Hours,lag3Hours,lag12Hours,lag1day,holidayLag, holiday), 
                method="lm", trControl = ctrl, na.action = na.pass)

cvFit5
## Linear Regression 
## 
## 117453 samples
##     12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 116278, 116278, 116278, 116278, 116279, 116279, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.334516  0.3458756  0.8099843
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
cvFit4$resample[1:5,]
##       RMSE  Rsquared       MAE Resample
## 1 1.382012 0.2414419 0.8245601  Fold001
## 2 1.525382 0.3024641 0.8413009  Fold002
## 3 1.233107 0.3419654 0.7868448  Fold003
## 4 1.344571 0.3775340 0.8115390  Fold004
## 5 1.245069 0.2584670 0.7879909  Fold005

The histogram of MAE across the 100 folds shows the errors clustering tightly together, meaning the model generalizes well.

MAE4 <- cvFit4$resample
ggplot(MAE4, aes(x=MAE)) + 
  geom_histogram( fill="#97CFCF") +
  xlim(0, 1) +
  labs(title = "Histogram of MAE", subtitle = "Model: DTime_Space_FE_timeLags",  x = "MAE", y= "Count") +plotTheme2

5.3 Measure Model Accuracy

Comparing the five models, models 4 and 5 (the time/space models with time lags) have the best fit: their MAE values are the lowest of all the models, and their prediction curves match the observed curves well. The holiday lags seem to have little influence on the result, perhaps because the one holiday in the window has little impact on people's travel demand.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme2

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "Chicago; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme2

It seems that the more trips a station sees, the higher its MAE, which produces the spatial pattern below.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = PhillyCensus, color = "grey80", fill = "white")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.6,size=3)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme()

5.4 Space-Time Error Evaluation

In the scatter plots, the fitted prediction line sits below the 1:1 line: in general, we underpredict trip counts. There seems to be no significant difference in MAE across time periods.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme()

The MAE map matches the trip-count distribution map: the higher the trip counts, the higher the MAE. The largest errors occur in Center City, during the PM rush on weekdays and during mid-day and the PM rush on weekends.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = PhillyCensus, color = "grey80", fill = "white")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.6,size=2)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "B")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 5")+facet_grid(weekend~time_of_day)+
  mapTheme()

The scatter plots of errors as a function of socio-economic variables suggest that some stations resist the prediction model: errors are higher where residents have high incomes, low transit usage, and are majority white.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
     select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression,dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
   mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
   group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme()

6.Conclusion

In general, my Time_Space_FE_timeLags model is useful for predicting bike-share demand and is ready to be put into practice. First, comparing the prediction curve with the observed curve in the Predicted/Observed bike share time series, the predicted curve matches the observed one well and shows almost identical peaks and troughs, meaning the model accurately predicts the trend of bike-share usage. Second, the model's MAE is about 0.81, less than one ride per station-hour, which is acceptable overall accuracy, since one bike is only a small portion of the bikes at a station. In addition, the cross-validation results show comparable goodness-of-fit metrics across folds, which means the model generalizes to new data.

However, the model also has shortcomings. The observed-vs-predicted scatter plots show that we underpredict during periods of high demand, likely because of the social patterns behind those peaks. In addition, MAE varies across space: at some stations it is high, which means the model may not perform well in those areas.