Communicating reproducible, data-driven insight into questions relating to, but not limited to, increasing annual membership subscriptions.
1 Intro
Download Files and DB Creation
A view of the dl_toDirectory.R script used below.
# ----
# CC BY-SA, Eric Mossotti
# ----
# Description ----
#
# Downloads files from the cloud, calls the file relocating function and cleans
# up the environment.
#
# ----
source("Scripts/unz_relocate.R")

dl_toDirectory <- function(durls,
                           tmpzip_dir,
                           tmpfile_dir,
                           tmpzip_paths,
                           tmpfile_paths,
                           tmpfile_names) {
    # Create a directory to store the temporary zip files
    dir.create(tmpzip_dir)
    # A simple way to download and relocate those files from the working
    # directory to the file-folder paths created earlier.
    curl::multi_download(durls, destfiles = tmpzip_paths)
    # Create tempFile directory
    dir.create(tmpfile_dir)
    # Execute sourced custom function from the unz_relocate.R file
    unz_relocate(fPaths = tmpfile_paths,
                 zPaths = tmpzip_paths,
                 fNames = tmpfile_names)
    # Remove the zip directory and its contents after having finished using it
    unlink(tmpzip_dir, recursive = TRUE)
}
A view of the csv_toDB.R script used below.
# ----
# CC BY-SA, Eric Mossotti
# ----
# Description ----
#
# Reads the downloaded files into a dataframe. Then writes the df to local
# storage and cleans up the environment.
#
# ----
library(duckdb)

csv_toDB <- function(tmpfile_dir,
                     tmpfile_paths,
                     db_dir,
                     database_path,
                     original_path,
                     complete_path) {
    # Would like to read in all unzipped files to a single dataframe
    dataTibble <- purrr::map(tmpfile_paths[1:12], arrow::read_csv_arrow) |>
        purrr::list_rbind()
    # duckDB for data storage ----
    # Folder to hold database files
    dir.create(db_dir)
    # Initialize a duckDB database connection
    dbconn <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = database_path,
                             read_only = FALSE)
    # Original data table
    dataTibble |>
        as.data.frame() |>
        duckdb::dbWriteTable(conn = dbconn, name = original_path,
                             overwrite = TRUE)
    # Clean data and working environment ----
    rm(dataTibble)
    # Complete observations write
    dplyr::tbl(dbconn, original_path) |>
        dplyr::collect() |>
        tidyr::drop_na() |>
        duckdb::dbWriteTable(conn = dbconn, name = complete_path,
                             overwrite = TRUE)
    unlink(tmpfile_dir, recursive = TRUE)
}
Execute initial download and removal of incomplete observations, then store original data and complete data separately.
Otherwise, simply connect to the existing database, then source the frequently used scripts needed later (see the sketch below).
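The decision itself can be outlined as follows; this is a minimal sketch assuming the helper scripts shown above, and the argument values are illustrative rather than the document's exact chunk:

```r
# A minimal sketch of the connect-or-create decision, assuming the helper
# scripts shown above; paths and arguments are illustrative.
database_path <- "db/data.db"

if (!file.exists(database_path)) {
    # First run: download the zip archives and build the DuckDB database
    source("Scripts/dl_toDirectory.R")
    source("Scripts/csv_toDB.R")
    # ... dl_toDirectory(...) and csv_toDB(...) called with the project's paths
} else {
    # Subsequent runs: reconnect to the persistent database file
    dbconn <- DBI::dbConnect(duckdb::duckdb(),
                             dbdir = database_path,
                             read_only = FALSE)
}
```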
The primary stakeholders in this analysis are Divvy, Lyft (the parent company of Divvy), and the City of Chicago Department of Transportation. The analysis aims to provide these stakeholders with data-driven insights to enhance the Divvy bike-sharing service, better serving the residents of Chicago and its users. The initial rationale behind Divvy’s implementation included improving air quality, promoting economic recovery, and reducing traffic congestion within the city. (About Divvy)
Code processing steps are accessible via buttons like the one below. The drop-down code summaries and tables therein add context and transparency to the presented findings.
Data import scripts and initial analysis setup decision
1.3 Source
The raw 2023 dataset was imported from Divvy Data. (Divvy Data)
# List of column labels to feed tabler() and add_multiple_footnotes()
location_list <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    colnames() |>
    as.list()

# A simple list of footnotes to feed tabler() and add_multiple_footnotes().
note_list <- list(
    "Anonymized trip identifier.",
    "The bicycle type.",
    "Starting date-time (to the second).",
    "Ending date-time (to the second).",
    "Station name of where the trip started.",
    "Station ID of where the trip started.",
    "Station name of where the trip ended.",
    "Station ID of where the trip ended.",
    "Latitude associated with the starting location.",
    "Longitude associated with the starting location.",
    "Latitude associated with the ending location.",
    "Longitude associated with the ending location.",
    "If someone is an annual subscriber or not."
)

dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    dplyr::slice_head(n = 10) |>
    tabler(title = "What data did we start with?",
           source_note = gt::md("**Source**: Divvy Data"),
           note_list = note_list,
           location_list = location_list,
           noteColumns = TRUE,
           label_n = NULL) |>
    gt::tab_options(table.font.size = gt::pct(75),
                    footnotes.multiline = FALSE)
Table 1: Unaltered data (other than being combined and formatted for presentation).
What data did we start with?

| ride_id¹ | rideable_type² | started_at³ | ended_at⁴ | start_station_name⁵ | start_station_id⁶ | end_station_name⁷ | end_station_id⁸ | start_lat⁹ | start_lng¹⁰ | end_lat¹¹ | end_lng¹² | member_casual¹³ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| F96D5A74A3E41399 | electric_bike | 2023-01-21 20:05:42 | 2023-01-21 20:16:33 | Lincoln Ave & Fullerton Ave | TA1309000058 | Hampden Ct & Diversey Ave | 202480.0 | 41.92407 | -87.64628 | 41.93000 | -87.64000 | member |
| 13CB7EB698CEDB88 | classic_bike | 2023-01-10 15:37:36 | 2023-01-10 15:46:05 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79957 | -87.59475 | 41.80983 | -87.59938 | member |
| BD88A2E670661CE5 | electric_bike | 2023-01-02 07:51:57 | 2023-01-02 08:05:11 | Western Ave & Lunt Ave | RP-005 | Valli Produce - Evanston Plaza | 599 | 42.00857 | -87.69048 | 42.03974 | -87.69941 | casual |
| C90792D034FED968 | classic_bike | 2023-01-22 10:52:58 | 2023-01-22 11:01:44 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79957 | -87.59475 | 41.80983 | -87.59938 | member |
| 3397017529188E8A | classic_bike | 2023-01-12 13:58:01 | 2023-01-12 14:13:20 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79957 | -87.59475 | 41.80983 | -87.59938 | member |
| 58E68156DAE3E311 | electric_bike | 2023-01-31 07:18:03 | 2023-01-31 07:21:16 | Lakeview Ave & Fullerton Pkwy | TA1309000019 | Hampden Ct & Diversey Ave | 202480.0 | 41.92607 | -87.63886 | 41.93000 | -87.64000 | member |
| 2F7194B6012A98D4 | electric_bike | 2023-01-15 21:18:36 | 2023-01-15 21:32:36 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79955 | -87.59462 | 41.80983 | -87.59938 | member |
| DB1CF84154D6A049 | classic_bike | 2023-01-25 10:49:01 | 2023-01-25 10:58:22 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79957 | -87.59475 | 41.80983 | -87.59938 | member |
| 34EAB943F88C4C5D | electric_bike | 2023-01-25 20:49:47 | 2023-01-25 21:02:14 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79959 | -87.59467 | 41.80983 | -87.59938 | member |
| BC8AB1AA51DA9115 | classic_bike | 2023-01-06 16:37:19 | 2023-01-06 16:49:52 | Kimbark Ave & 53rd St | TA1309000037 | Greenwood Ave & 47th St | TA1308000002 | 41.79957 | -87.59475 | 41.80983 | -87.59938 | member |

**Source**: Divvy Data

¹ Anonymized trip identifier. ² The bicycle type. ³ Starting date-time (to the second). ⁴ Ending date-time (to the second). ⁵ Station name of where the trip started. ⁶ Station ID of where the trip started. ⁷ Station name of where the trip ended. ⁸ Station ID of where the trip ended. ⁹ Latitude associated with the starting location. ¹⁰ Longitude associated with the starting location. ¹¹ Latitude associated with the ending location. ¹² Longitude associated with the ending location. ¹³ If someone is an annual subscriber or not.
Glimpse
A ‘glimpse’ function output of the raw data with data type information.
Another worthy objective of this analysis is to achieve reproducibility and efficiency. To facilitate future research and enable subsequent analyst teams to build upon this work, the project aimed to provide adequate code documentation and adhere to best practices regarding clean and modular code.
For instance, certain design decisions were incorporated to eliminate the need for re-downloading and re-processing data. Analysts working with this dataset over an extended period, such as days or months, can simply reconnect to the single database file after the initial download and processing; it contains all the original data as well as the tables generated throughout the analysis.
The underlying code incorporates an if-else decision that sources a script responsible for handling the initial data processing and establishing the database filesystem. Opting for a persistent DuckDB filesystem (as opposed to a purely in-memory solution) appeared optimal in terms of simplicity, cost-effectiveness of SQL database queries, and retaining progress made over extended periods. (Why DuckDB)
To streamline the process, reduce code duplication, and maintain consistent formatting throughout the project, reusable functions were developed for generating most of the tables and figures. These functions are located in the "Scripts" folder within the working directory. Their modular design not only simplifies the implementation of formatting changes but also facilitates the integration of additional code snippets when necessary. For instance, certain plots might require limiting the range of the axes, which can be achieved by combining these functions with appropriate code addenda. By leveraging these functions, the project benefits from reduced redundancy, improved efficiency, and cohesive formatting across all visualizations and data representations.
1.5 Initial Database Table List
Code
# Since we already know the paths, we'll manually define these. Otherwise, we
# might have to delete the db folder and start over.
dbList <- list(Paths = c("db/original_data.db", "db/complete_data.db")) |>
    as.data.frame()

dbList |>
    tabler(title = gt::md("Which tables have<br>been created so far?"),
           note_list = list(gt::md("Tables in `db/data.db` at this stage")),
           location_list = list("Paths"),
           noteColumns = TRUE,
           source_note = gt::md("**Source**: `db/data.db`"),
           label_n = NULL,
           align_parameter = "left") |>
    gt::cols_align(align = "left")
Table 2: The list of tables created in data.db so far.
Which tables have been created so far?

| Paths¹ |
|---|
| db/original_data.db |
| db/complete_data.db |

**Source**: db/data.db

¹ Tables in db/data.db at this stage
Figure 1: dbList_1
A view of the filesystem directory. Notice there are no separate files for the tables. The file data.db in this view is not a table name but the name of the database.
2 Tidying
2.1 Duplicates
Code to Remove Duplicates
First, record the count of original observations from the raw data.
# Need to save this count for the summary table later
original_nobs <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    nrow()
Create a table containing the duplicated observations.
# This is a separate table used to analyze the observations returned as not
# distinct (n > 1). This adds an extra column, labeled 'n'.
dupeTable <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::select(started_at:end_station_name) |>
    # Counts of unique rows added for column 'n'
    dplyr::add_count(started_at, ended_at,
                     start_station_name, end_station_name) |>
    # Only observations that have been duplicated 1 or more times are shown.
    dplyr::filter(n > 1) |>
    # To see all rows, not just one row for each obs.
    dplyr::ungroup() |>
    dplyr::arrange(started_at) |>
    dplyr::collect()
Record a count of distinct duplicates and total observations.
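The counting step itself is not shown in the source; a sketch consistent with the objects defined above might look like this (the names count_dupes and complete_nobs are illustrative):

```r
# Count of distinct duplicated observations, and of total complete observations
count_dupes <- dupeTable |>
    dplyr::distinct(started_at, start_station_name,
                    ended_at, end_station_name) |>
    nrow()

complete_nobs <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::collect() |>
    nrow()
```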
Create a table of the now unduplicated observations seen earlier.
# The issue is, we need to get rid of not all of these rows, but just the extra
# duplicate observations.
# If there were 2 rows of duplicates, one would want to end up with 1 row after
# removing the extras.
undupedTable <- dupeTable |>
    dplyr::distinct(started_at, start_station_name,
                    ended_at, end_station_name)
Record a count of the incorrect observations.
# Run an incorrect count of how many rows or observations there are in the
# dataset (judging distinctness by ride_id alone).
count_incorrectDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("ride_id")) |>
    dplyr::count(name = "Incorrect Distinct Observations") |>
    dplyr::collect() |>
    as.integer()
Record a count of the correct observations.
# For the correct count of obs
count_correctDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("started_at", "start_station_name",
                                "ended_at", "end_station_name")) |>
    dplyr::count() |>
    dplyr::collect() |>
    as.integer()
Lastly, write the unduplicated data to the database.
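The write itself is not shown in the source; a sketch assuming dplyr::distinct() over the four identifying columns, with the table name matching the db/data_unduped.db path referenced later by transFilter():

```r
# Keep one row per (started_at, start_station_name, ended_at, end_station_name)
# combination, retaining all other columns, and persist the result.
dplyr::tbl(dbconn, complete_path) |>
    dplyr::collect() |>
    dplyr::distinct(started_at, start_station_name,
                    ended_at, end_station_name,
                    .keep_all = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/data_unduped.db",
                         overwrite = TRUE)
```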
A crucial question arises: How can one identify and handle duplicate data? This section covers the process of checking for duplicates and selectively removing them while exercising caution. It is essential to recognize that the presence of unique values in a single column does not necessarily guarantee the uniqueness of each observation or row.
While all values in the ride_id column were found to be unique, not all observations were truly distinct. To verify the uniqueness of each observation, additional columns (started_at, ended_at, start_station_name, and end_station_name) were utilized. These columns provide more granular information, including the precise starting and ending times down to the second, as well as the starting and ending locations. It was assumed that observations with identical starting and ending date-times and stations, despite having different ride IDs, were potentially erroneous duplicates.
Although the cause of such duplication errors is unknown, it could be assumed that one person checked out multiple bikes simultaneously. In that scenario, each bike would be assigned a unique ride_id. However, this occurrence was relatively rare, happening only 18 times over the course of a year. Since there is only one duplicate for each instance, it raises concerns and warrants further investigation. It is possible that trips could be grouped where one person pays for another rider’s fare. However, if that were the case, it raises the question of why there is always precisely one duplicate.
In Table 3, duplicate observations are listed and grouped by color for visual clarity. In contrast, Table 4 presents the data after removing the extra copy of each duplicate observation while preserving the unique observations. Of the duplicates identified, each had one extra copy. It was noted that the number of rows in the duplicates table is 36. Each duplicated observation has one duplicate, where n (the count) is always 2. Therefore, the expected number of observations to be removed was 18. A complication arose in determining how to remove not all observations but only the extra duplicate observation from each group.
To ensure the accurate removal of duplicates, the count of distinct n-values (representing the number of occurrences) for the un-duplicated table was computed, confirming the expected 18 unique instances. Subsequently, the total number of observations in the dataset was recorded, initially standing at 4,331,707. After removing the identified duplicate observations, the correct count of observations was 4,331,689. In summary, 18 additional observations were successfully removed, aligning with the expected number of duplicates identified earlier. These steps are documented in Table 5 for reference.
By carefully analyzing the count of distinct n-values and the total occurrences before and after deduplication, it was ensured that only the precise number of duplicate observations was removed, preserving the integrity of the unique data while eliminating the identified duplicates. This meticulous approach to data cleaning is crucial for maintaining data quality and reliability throughout the analysis process.
Code used for handling duplicates in this section
2.2 Outliers
Transform and Filter the Database
If you happen to be re-using this code, this check spares you from having to re-download or re-filter after making further adjustments.
filtered_path <- "db/filtered_data.db"

# Do we still need to filter the database?
if (duckdb::dbExistsTable(dbconn, filtered_path) == FALSE) {
    source("Scripts/transFilter.R")
    transFilter(conxn = dbconn,
                oldPath = "db/data_unduped.db",
                newPath = filtered_path)
}
This would execute only if the filtered table did not already exist in the database.
# ----
# Author: Eric Mossotti
# CC-BY SA
# ----
# Performs data transformations and filters to enforce consistency across
# all of the analyses. It also will simplify query syntax in extended analyses.
# Filtering can be based on flexible, but sensible criteria.
# ----
transFilter <- function(conxn, oldPath, newPath) {
    dplyr::tbl(conxn, oldPath) |>
        dplyr::collect() |>
        dplyr::mutate(
            "trip_time" = lubridate::time_length(
                lubridate::interval(started_at, ended_at), unit = "minute"),
            miles = geosphere::distGeo(
                p1 = cbind(start_lng, start_lat),
                p2 = cbind(end_lng, end_lat)) / 1000 * 0.62137119,
            mph = (miles / (trip_time / 60)),
            .keep = "all"
        ) |>
        # Floor rationale - less than 0.1 miles are distances easily walked
        # Speed ceiling rationale - pro cyclists average around 20 mph
        # Speed floor rationale - accounts for trips possibly spent idling
        # rideable_type rationale - docked_bike stopped being recorded as a
        # distinct category within the time being analyzed (2023);
        # docked_bike was phased out and not much info about what it means
        dplyr::filter(trip_time > 1,
                      trip_time < 480,
                      rideable_type != "docked_bike",
                      miles >= 0.1,
                      mph <= 20,
                      mph >= 1) |>
        duckdb::dbWriteTable(conn = conxn, name = newPath, overwrite = TRUE)
}
Observations deemed erroneous or irrelevant for identifying usage trends among members and casual users were filtered out. Keeping track of these errors is a good practice, as they might provide insights into the differences in how members and casuals utilize the service.
Trips with negative duration were flagged as errors and removed. Additionally, trips lasting less than a minute but greater than zero were noted and removed, as they could potentially skew the derived statistics. These extremely short trips might be attributed to users briefly trying out the service before committing or quickly realizing their dissatisfaction with it. While some observations seemed nonsensical, most of the data was retained.
Consistent with the previous approach, an if-else decision was employed to facilitate testing. An external database filtering script was utilized to streamline the code within the main Quarto document. The resulting filtered data served as the foundation for subsequent analysis and table generation.
To get a count of the new total observations after filtering.
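A sketch of that count, assuming the filtered table written by transFilter() above (the variable name filtered_nobs is illustrative):

```r
# Count of observations remaining after filtering
filtered_nobs <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::collect() |>
    nrow()
```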
Provides a useful script for consistent formatting of many of the plots.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# To help reduce duplicate code and implement a consistent theme throughout a
# markdown project. This doesn't account for all plot types possible yet.
# ----
plotter <- function(data, x_col, y_col, group_col = NULL, color_col = NULL,
                    title = NULL, subtitle = NULL, x_label = NULL,
                    y_label = NULL, geomType = NULL,
                    lineGroup_palette = "Paired", colGroup_palette = "Paired",
                    line_palette = "YlOrRd", col_palette = "YlOrRd",
                    facetCol_palette = "YlGnBu", fill_col = "YlOrRd",
                    axis_ticks_color = "lavenderblush", bg_color = "#222222",
                    text_color = "seashell", grid_color = "grey30",
                    x_lim = c(NA, NA), isFaceted = FALSE, is_lineGroup = FALSE,
                    is_colGroup = FALSE, isHistogram = FALSE, isDensity = FALSE,
                    breaks = NULL, limits = NULL, angle = ggplot2::waiver(),
                    n.dodge = 1, binwidth = NULL, areaFill = NULL,
                    density_fill = ggplot2::waiver(),
                    density_color = ggplot2::waiver(),
                    alpha = ggplot2::waiver(), isROC = FALSE, roc_color = NULL,
                    vline_color = NULL, vline_size = NULL,
                    density_alpha = ggplot2::waiver(),
                    dnorm_color = ggplot2::waiver(), bins = NULL,
                    histo_breaks = NULL, low = NULL, high = NULL,
                    isTime = FALSE, date_breaks = ggplot2::waiver(),
                    date_labels = ggplot2::waiver(),
                    date_minor_breaks = ggplot2::waiver(), lineColor = NULL,
                    colPosition = "stack", labels = ggplot2::waiver(),
                    isTimeHist = FALSE, quartiles = NULL, qformat = NULL) {
  # line ----
  if (geomType == "line") {
    if (is_lineGroup == TRUE) {
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }},
                                     color = {{ color_col }})) +
        ggplot2::geom_line(show.legend = TRUE) +
        ggplot2::scale_color_brewer(palette = lineGroup_palette, name = "")
    } else {
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }})) +
        ggplot2::geom_line(show.legend = TRUE, color = lineColor)
    }
  } else if (geomType == "column") {
    # col ----
    if (isFALSE(is_colGroup) && isFALSE(isFaceted) &&
        isFALSE(isDensity) && isFALSE(isHistogram)) {
      # non-grouped, non-faceted, non-density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }},
                                     fill = {{ y_col }})) +
        ggplot2::geom_col(show.legend = FALSE) +
        ggplot2::scale_fill_distiller(palette = col_palette)
    } else if (!isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               isFALSE(isDensity) && isFALSE(isHistogram)) {
      # grouped, non-faceted, non-density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }},
                                     fill = {{ group_col }})) +
        ggplot2::geom_col(show.legend = TRUE, position = colPosition,
                          color = color_col) +
        ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
    } else if (!isFALSE(is_colGroup) && !isFALSE(isFaceted) &&
               isFALSE(isDensity) && isFALSE(isHistogram)) {
      # grouped, faceted, non-density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }},
                                     fill = {{ y_col }})) +
        ggplot2::geom_col(show.legend = FALSE) +
        ggplot2::facet_grid(rows = "member_casual") +
        ggplot2::scale_fill_distiller(palette = facetCol_palette)
    } else if (!isFALSE(is_colGroup) && !isFALSE(isFaceted) &&
               !isFALSE(isDensity) && isFALSE(isHistogram)) {
      # grouped, faceted, density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                     y = ggplot2::after_stat(density),
                                     fill = {{ group_col }})) +
        ggplot2::geom_density(alpha = alpha) +
        ggplot2::facet_grid(rows = "member_casual") +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = 1, angle = angle))
    } else if (!isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               !isFALSE(isDensity) && !isFALSE(isHistogram)) {
      # grouped, non-faceted, density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                     y = ggplot2::after_stat(density),
                                     fill = {{ group_col }})) +
        ggplot2::geom_histogram(binwidth = binwidth, color = color_col,
                                alpha = alpha, breaks = histo_breaks) +
        ggplot2::geom_density(alpha = alpha, color = density_color,
                              fill = density_fill) +
        ggplot2::stat_function(fun = dnorm,
                               args = list(mean = mean({{ x_col }}),
                                           sd = sd({{ x_col }}))) +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = 1, angle = angle))
    } else if (!isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               !isFALSE(isDensity) && isFALSE(isHistogram)) {
      # grouped, non-faceted, density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                     y = ggplot2::after_stat(density),
                                     fill = {{ group_col }})) +
        ggplot2::geom_density(alpha = density_alpha, color = color_col) +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)) +
        ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
    } else if (isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               !isFALSE(isDensity) && isFALSE(isHistogram)) {
      # non-grouped, non-faceted, density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                     y = ggplot2::after_stat(density),
                                     fill = density_fill,
                                     color = density_color)) +
        ggplot2::geom_density(show.legend = FALSE) +
        ggplot2::geom_vline(ggplot2::aes(xintercept = mean({{ x_col }})),
                            color = vline_color, size = vline_size,
                            linetype = "solid") +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    } else if (!isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               isFALSE(isDensity) && !isFALSE(isHistogram)) {
      # grouped, non-faceted, non-density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
        ggplot2::geom_histogram(binwidth = binwidth, color = color_col,
                                alpha = alpha, breaks = histo_breaks) +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    } else if (!isFALSE(is_colGroup) && !isFALSE(isFaceted) &&
               isFALSE(isDensity) && isFALSE(isHistogram)) {
      # grouped, faceted, non-density, non-histogram (dodged columns) ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = {{ y_col }},
                                     fill = {{ group_col }})) +
        ggplot2::geom_col(show.legend = TRUE, position = "dodge") +
        ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "") +
        ggplot2::scale_x_discrete(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    } else if (!isFALSE(is_colGroup) && !isFALSE(isFaceted) &&
               isFALSE(isDensity) && !isFALSE(isHistogram)) {
      # grouped, faceted, non-density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
        ggplot2::geom_histogram(binwidth = binwidth, color = color_col,
                                bins = bins, breaks = breaks) +
        ggplot2::geom_vline(ggplot2::aes(xintercept = mean({{ x_col }})),
                            color = vline_color, size = vline_size,
                            linetype = "solid") +
        ggplot2::facet_grid(rows = "member_casual") +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    } else if (isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               isFALSE(isDensity) && !isFALSE(isHistogram)) {
      # non-grouped, non-faceted, non-density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }})) +
        ggplot2::geom_histogram(ggplot2::aes(fill = ggplot2::after_stat(count)),
                                show.legend = FALSE, color = color_col,
                                binwidth = binwidth)
      if (!isFALSE(isTimeHist)) {
        plot <- plot +
          ggplot2::scale_x_datetime(
            date_breaks = date_breaks, date_labels = date_labels,
            date_minor_breaks = date_minor_breaks,
            guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
            sec.axis = ggplot2::sec_axis(
              ~ ., breaks = quartiles,
              labels = scales::date_format(qformat),
              guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)))
      } else {
        plot <- plot +
          ggplot2::scale_x_continuous(
            breaks = breaks, limits = limits,
            guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
            sec.axis = ggplot2::sec_axis(
              ~ ., breaks = quartiles,
              labels = scales::label_number(),
              guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)))
      }
      plot <- plot +
        ggplot2::scale_y_continuous() +
        ggplot2::scale_fill_gradient(low = low, high = high) +
        ggplot2::geom_vline(ggplot2::aes(xintercept = mean({{ x_col }})),
                            color = vline_color, size = vline_size,
                            linetype = "solid") +
        ggplot2::geom_vline(data = data.frame(q = quartiles),
                            ggplot2::aes(xintercept = q, color = factor(q)),
                            linetype = "dashed", size = 1) +
        ggplot2::scale_color_manual(values = c("orange", "green", "purple"),
                                    labels = c("25th", "50th", "75th"))
    } else if (isFALSE(is_colGroup) && isFALSE(isFaceted) &&
               !isFALSE(isDensity) && !isFALSE(isHistogram)) {
      # non-grouped, non-faceted, density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                     y = ggplot2::after_stat(density))) +
        ggplot2::geom_histogram(color = color_col, bins = bins,
                                breaks = breaks) +
        ggplot2::geom_density(color = density_color, fill = density_fill,
                              alpha = density_alpha) +
        ggplot2::geom_vline(ggplot2::aes(xintercept = mean({{ x_col }})),
                            color = vline_color, size = vline_size,
                            linetype = "solid") +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    } else {
      # grouped, faceted, density, histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
        ggplot2::geom_histogram(binwidth = binwidth, color = color_col,
                                bins = bins, breaks = histo_breaks) +
        ggplot2::geom_density(color = density_color, fill = density_fill,
                              alpha = density_alpha) +
        ggplot2::geom_vline(ggplot2::aes(xintercept = mean({{ x_col }})),
                            color = vline_color, size = vline_size,
                            linetype = "solid") +
        ggplot2::facet_grid(rows = "member_casual") +
        ggplot2::scale_x_continuous(
          breaks = breaks, limits = limits,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle))
    }
  } else {
    # other misc plot types ----
    if (!isFALSE(isROC)) {
      # pROC objects ----
      plot <- data |>
        pROC::ggroc(aes = "linetype", color = roc_color)
    }
    if (!isFALSE(isTime)) {
      # time-series data: grouped, non-faceted, density, non-histogram ----
      plot <- data |>
        ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
        ggplot2::geom_density(alpha = density_alpha, color = color_col) +
        ggplot2::scale_x_datetime(
          date_breaks = date_breaks, date_labels = date_labels,
          date_minor_breaks = date_minor_breaks, labels = labels,
          guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)) +
        ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
    }
  }
  # For the rest of the otherwise likely duplicated plot settings ----
  plot <- plot +
    ggplot2::labs(title = title, subtitle = subtitle,
                  x = x_label, y = y_label) +
    ggplot2::theme(
      panel.background = ggplot2::element_rect(fill = bg_color, color = NA),
      plot.background = ggplot2::element_rect(fill = bg_color, color = NA),
      text = ggplot2::element_text(color = text_color),
      panel.grid = ggplot2::element_blank(),
      axis.title.x = ggplot2::element_text(
        margin = grid::unit(c(5, 5, 5, 5), "mm")),
      axis.title.y = ggplot2::element_text(
        margin = grid::unit(c(5, 5, 5, 5), "mm")),
      axis.text.x = ggplot2::element_text(
        color = "Snow", margin = grid::unit(c(1, 1, 1, 1), "mm")),
      axis.text.y = ggplot2::element_text(
        color = "Snow", margin = grid::unit(c(2, 2, 2, 2), "mm")),
      axis.ticks = ggplot2::element_line(color = axis_ticks_color),
      axis.ticks.y = ggplot2::element_blank(),
      panel.grid.major.x = ggplot2::element_line(color = grid_color,
                                                 linetype = "dotted"),
      panel.grid.major.y = ggplot2::element_line(color = grid_color),
      legend.background = ggplot2::element_rect(fill = bg_color),
      legend.title = ggplot2::element_blank())

  plot
}
Performs many of the backend query transformations needed for this document.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# Programmatic db extraction and df transformation, utilizing tidy evaluation.
#
# Allows one to specify which columns to select, which columns to group by for
# counting, and which column (if any) to transform. It can be used for a
# variety of similar data transformation tasks.
#
# Results can be weighted to help certain modeling functions run faster while
# not altering the results one would expect from using non-aggregated datasets
# as input.
# ----
transformData <- function(conn, path, df, select_cols, group_cols = NULL,
                          binary_col = NULL, ntile_col = NULL, pred_col = NULL,
                          zero_val = NULL, one_val = NULL, qtile_levels = NULL,
                          doWeights = FALSE, doQuantile = FALSE, isDF = FALSE) {
    # Weight the data? ----
    if (isTRUE(doWeights)) {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::collect()
        }
    } else {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(!!select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(!!select_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(select_cols))) |>
                dplyr::collect()
        }
    }
    # Do we want to transform a column to binary for modeling?
    if (!is.null(binary_col) && binary_col %in% names(freq_data)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                # Tidy-selects data with ':=' and bang-bang operators
                !!binary_col := factor(.data[[binary_col]],
                                       levels = c(zero_val, one_val)))
    }
    # Do we want to bin a predictor column into labeled quartiles?
    if (isTRUE(doQuantile)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                !!ntile_col := dplyr::ntile(.data[[pred_col]], n = 4),
                !!ntile_col := factor(.data[[ntile_col]],
                                      levels = c(1, 2, 3, 4),
                                      labels = as.vector(qtile_levels)))
    }
    return(freq_data)
}
Helpfully, this contributes to the Chi-Squared tables by allowing me to add the chi-squared statistic and degrees of freedom information in the footnotes.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# The code for returning chi-square test results as a tibble for use in tables.
# ----
chisqTest <- function(data, variable, by) {
    test_result <- chisq.test(x = data[[variable]], y = data[[by]]) |>
        broom::tidy() |>
        dplyr::select(statistic, parameter, p.value)
    return(test_result)
}
Scripts used frequently in this section
3.1 Methods
This section explains certain methodologies that might be less obvious to the reader.
3.1.1 Chi-Square
Set up
Null Hypothesis (H0) There is no association between membership status and travel behavior.
Alternative Hypothesis (H1) There is an association between membership status and travel behavior.
Significance level, \((\alpha) = 0.05\)
Collect data
Tables are transformed and stored for later use. Where needed, the appropriate table is extracted from the database for analysis.
The critical value can be determined from the chi-square distribution table or software using \(α\) and the df.
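For example, in R the critical value for the bicycle-type test below (\(\alpha = 0.05\), df = 1) could be obtained with qchisq():

```r
# Upper-tail critical value of the chi-square distribution
alpha <- 0.05
df <- 1
qchisq(1 - alpha, df = df)
#> [1] 3.841459
```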
The p-value
The p-value is calculated from the chi-square statistic and the df. When calculating the p-value, statistical software or functions (like the ones later in this report) internally use this integral (an R equivalent is sketched after the list):

\[ p = \int_{\chi^2}^{\infty} f(x,\ df)\, dx \]

where:

\(\chi^2\) is the calculated chi-square statistic (the lower bound of integration)

\(\infty\) is the upper bound of integration

\(f(x,\ df)\) is the chi-square PDF (probability density function)

\(x\) is the variable of integration
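In R, pchisq() evaluates this upper-tail integral directly; a minimal sketch using the statistic reported later in Table 12:

```r
# Upper-tail area beyond the observed statistic, i.e., the p-value
chi_sq <- 14762.37  # chi-square statistic from Table 12
df <- 1
pchisq(chi_sq, df = df, lower.tail = FALSE)  # effectively 0; reported as <0.001
```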
Interpreting results
The calculated chi-square statistic is compared to the critical value (greater or less than). We then check for consistency by comparing the p-value to the significance level: we reject the null hypothesis where \(p < \alpha\), or fail to reject it where \(p > \alpha\). This informs us whether there is a significant association between membership status and the given travel behavior parameter.
Calculation Process
The software used to generate the Chi-Square tables in this report makes it so a human doesn’t need to work with the integrals directly. The software calculates the chi-square statistic and df from the data and uses that information as input to the chi-square calculator, which handles the integration to return the p-value.
3.1.2 Binary Logistic Regression
Model Fitting
The logistic regression model is fitted using Maximum Likelihood Estimation (MLE) through an Iteratively Reweighted Least Squares (IRLS) algorithm. This process determines the optimal coefficients \((\beta)\) for the predictors.
Odds Ratio Calculation
For each predictor variable, the odds ratio is calculated as \(exp(β)\), where \(\beta\) is the coefficient estimated by the model. This represents the change in odds of the outcome for a one-unit increase in the predictor.
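As a minimal sketch, assuming a fitted binomial glm object named model like those created later in this report, the odds ratios are simply the exponentiated coefficients:

```r
# Convert log-odds coefficients to odds ratios
# (multiplicative change in odds per one-unit increase in the predictor)
odds_ratios <- exp(coef(model))
```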
Statistical Significance
p-values are computed for each predictor using the Wald test. This test calculates a z-statistic, \((\beta/SE(\beta))\), and compares it to a standard normal distribution to determine the probability of observing such a result under the null hypothesis.
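The same computation can be done by hand on a fitted glm object; a sketch equivalent to the p-values reported in the regression tables below:

```r
# Wald z-statistics and two-sided p-values from the coefficient table
coefs <- summary(model)$coefficients
z <- coefs[, "Estimate"] / coefs[, "Std. Error"]
p_values <- 2 * pnorm(abs(z), lower.tail = FALSE)
```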
Results Compiled As
Predictor variables (Characteristics)
Odds Ratios (OR)
p-values
Sometimes confidence intervals for the ORs (not shown in this example)
Reference Category
For categorical predictors, one category is set as the reference (marked with a dash in the OR column); its odds ratio is 1.00 by definition. This process transforms the raw model output into an interpretable format, allowing readers to quickly assess the direction, magnitude, and significance of each predictor's effect on the outcome variable.
3.2 Membership
Database Operations
Table Preview
Database Code
Write to .db
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/membership.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(member_casual) |>
        dplyr::arrange(member_casual) |>
        dplyr::collect() |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/membership.db",
                             overwrite = TRUE)
}
Figure 4: The membership distribution of bicycle choice
Table 12 presents a Chi-Square analysis of bicycle type usage among casual users and members. In summary, members show a higher preference for classic bikes (65%) compared to casual users (59%). Casual users have a higher proportion of electric bike usage (41%) compared to members (35%).
There is a statistically significant association between bicycle type and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between bicycle type and membership status. This suggests that the choice of bicycle type is not independent of membership status.
The large \(\chi^2\) value (14762.37) with just 1 degree of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value (< 0.001). This combination strongly suggests that the difference in bicycle preference between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/bType.db") |>
    dplyr::select(rideable_type, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble,
                       variable = "rideable_type",
                       by = "member_casual")
Code
chi_table <- tabler(
    title = gt::md("Chi-Square:<br>The significance of bicycle choice and membership"),
    source_note = gt::md("**Source**: `db/bType.db`"),
    label = list(rideable_type = "Bicycle Type"),
    by = member_casual,
    isSummary = TRUE,
    chiVar = "rideable_type",
    chiBy = "member_casual",
    tbl_name = data_tibble,
    chi_result = chiResult
)

chi_table
Table 12: Testing independence of bicycle choice and membership variables
Chi-Square: The significance of bicycle choice and membership

| Characteristic | casual, N = 1,260,621¹ | member, N = 2,636,777¹ | p-value² |
|---|---|---|---|
| Bicycle Type | | | <0.001 |
| classic_bike | 739,475 (59%) | 1,714,250 (65%) | |
| electric_bike | 521,146 (41%) | 922,527 (35%) | |

**Source**: db/bType.db

¹ n (%); χ² = 14762.37; df = 1
² Pearson's Chi-squared test
Table 13 presents the results of a binary logistic regression analyzing the relationship between bicycle type and membership status. The analysis compares classic bikes and electric bikes, with classic bikes serving as the reference category.
The odds of membership for users of electric bikes were 0.76 times the odds for users of classic bikes. The difference in membership likelihood between electric and classic bike users is highly statistically significant (p < 0.001).
Predicting the log-odds of being a member versus being a casual user.
# Collect the weighted table locally before fitting the model
model <- dplyr::tbl(dbconn, "db/bType_wb.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ rideable_type,
        weights = n,
        family = binomial)
Table 18: The mean with standard deviation, median with interquartile range, and range with min and max of member travel duration.
Summary Stats: Member travel duration

| Characteristic | casual, N = 1,260,621 | member, N = 2,636,777 |
|---|---|---|
| Duration (mins) | | |
| Median (Q1, Q3) | 12.0 (7.0, 20.0) | 9.0 (5.0, 14.0) |
| Mean (SD) | 16.5 (15.8) | 11.3 (9.2) |
| Min, Max | 1.0, 475.0 | 1.0, 424.0 |

**Source**: db/duration.db
To accompany the summary table, two plots, Figure 7 and Figure 8, present a highly granular view of the duration dataset (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.
Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.
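This step is not shown in the source; a sketch consistent with the qdf_member and quartiles_member objects used by the plotting code below:

```r
# Collect member trip durations and extract the 25th/50th/75th percentiles
qdf_member <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::select(trip_time) |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$trip_time,
                             probs = c(0.25, 0.50, 0.75))
```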
Figure 7: A more detailed look that illustrates the quartile distribution of casual travel duration.
Code
# Tried 2.5 for binwidth but not sure if want to keep
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n",
                           "distribution of members' travel duration"),
            x_label = "Minutes",
            y_label = "n",
            x_col = trip_time,
            geomType = "column",
            isHistogram = TRUE,
            angle = 45,
            color_col = "transparent",
            vline_color = "lightyellow",
            vline_size = 0.5,
            low = "blue",
            high = "red",
            limits = c(0, 100),
            breaks = seq(0, 100, by = 5),
            binwidth = \(x) 2 * IQR(x) / (length(x)^(1 / 3)),
            quartiles = quartiles_member)

gplot
Figure 8: A more detailed look that illustrates the quartile distribution of member travel duration.
Supplementing the histogram, Figure 9 shows a density plot comparing the duration of trips for casual users and members. The x-axis represents time in minutes, limited to a range of 0 to 100 for presentation purposes, while the y-axis shows the density (a measure of relative frequency).
The member group (darker blue) has a higher and narrower peak compared to the casual group (lighter blue). This indicates that members tend to have a more concentrated distribution of session durations around their most common length. The casual group appears to have a slightly fatter tail, extending further to the right than the member group. This suggests that casual users might occasionally have longer sessions than members, even if such sessions are less common.
Both groups exhibit right-skewed distributions, with a peak near the left side and a long tail extending to the right. This suggests that for both groups, a shorter duration is more common, while a longer duration occurs less frequently but can extend quite far. They seem to have their peak density around 5-10 minutes, with members peaking slightly earlier than casual users. There is significant overlap between the two distributions, indicating that while there are differences, there is also considerable similarity in duration patterns between casual users and members.
Figure 9: Visualizing the probability-density distribution between duration and membership variables.
Table 19 presents the results of a binary logistic regression model, analyzing the relationship between ride duration and membership status. The analysis divides ride duration into quartiles, with Q1 (1.02 - 5.73 minutes) serving as the reference category.
Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other duration quartiles ( p < 0.001 for all comparisons). With Q2 (5.73 - 9.55 minutes), the odds of membership were 0.38 times as high as with Q1. This indicates a substantial decrease (62%) in the likelihood of membership for slightly longer rides. With Q3 (9.55 - 16.13 minutes), the odds of membership were 0.08 times as high as in Q1. This shows a dramatic decrease (92%) in membership likelihood for medium-length rides. With Q4 (16.13 - 475.22 minutes), the odds of membership were 0.06 times as high as in Q1. This represents an even more pronounced decrease (94%) in membership likelihood for the longest rides.
Query, process, and create the model R object for duration based on quartile range.
model <- dplyr::tbl(dbconn, "db/duration_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile,
        family = binomial,
        weights = n)
Figure 11: The membership distribution of trips taken by month
Table 24 presents a Chi-Square analysis of monthly travel among members and casual users. In summary, members exhibit higher usage for the periods Jan - Mar and Oct - Dec. Both groups prefer the months of Apr - Sep, where casual users show proportionally higher usage relative to their own group.
There is a statistically significant association between monthly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between monthly travel and membership status. The frequency of monthly travel is not likely to be independent of membership status.
The large \(\chi^2\) value (71802.26) with 11 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(abbMonths, member_casual) |>
    dplyr::arrange(abbMonths, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble,
                       variable = "abbMonths",
                       by = "member_casual")
Code
chi_table <- tabler(
    title = gt::md("Chi-Square:<br>The significance of monthly travel on membership"),
    source_note = gt::md("**Source**: `db/moy.db`"),
    label = list(abbMonths = "Month"),
    by = member_casual,
    isSummary = TRUE,
    tbl_name = data_tibble,
    chi_result = chiResult
)

chi_table
Table 24: Testing the independence between month and membership variables
Chi-Square: The significance of monthly travel on membership

| Characteristic | casual, N = 1,260,621¹ | member, N = 2,636,777¹ | p-value² |
|---|---|---|---|
| Month | | | <0.001 |
| Jan | 25,262 (2.0%) | 111,624 (4.2%) | |
| Feb | 27,111 (2.2%) | 109,707 (4.2%) | |
| Mar | 39,137 (3.1%) | 143,992 (5.5%) | |
| Apr | 87,211 (6.9%) | 199,132 (7.6%) | |
| May | 139,998 (11%) | 268,375 (10%) | |
| Jun | 177,745 (14%) | 297,011 (11%) | |
| Jul | 194,030 (15%) | 308,489 (12%) | |
| Aug | 189,563 (15%) | 329,860 (13%) | |
| Sep | 168,881 (13%) | 292,283 (11%) | |
| Oct | 114,475 (9.1%) | 259,793 (9.9%) | |
| Nov | 64,508 (5.1%) | 193,005 (7.3%) | |
| Dec | 32,700 (2.6%) | 123,506 (4.7%) | |

**Source**: db/moy.db

¹ n (%); χ² = 71802.26; df = 11
² Pearson's Chi-squared test
Two plots, Figure 12 and Figure 13, present a highly granular view of the monthly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.
Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.
Figure 13: A more detailed look that illustrates the quartile distribution of monthly member travel.
To visualize monthly users through the lens of their respective concentrations, see Figure 14. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).
The casual group has higher peaks near the middle of the distribution. The member group has higher peaks near the left and right sides of the distribution. Both of the groups exhibit somewhat normal, multi-modal distributions. They overlap each other significantly, suggesting similarity in travel patterns.
Figure 14: Visualizing the probability-density distribution of months by membership.
Table 25 presents the results of a binary logistic regression analyzing the relationship between months of the year and membership status. The analysis divides the year into quartiles, with Q1 (January 01 - May 20) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).
With Q2 (May 20 - Jul 21), the odds of membership were 0.57 times as high as in Q1. This indicates a substantial decrease (43%) in the likelihood of membership during late spring and early summer. With Q3 (Jul 21 - Sep 18), the odds of membership were 0.58 times as high as in Q1. This shows a similar decrease (42%) in membership likelihood during late summer and early fall, nearly identical to Q2. And with Q4 (Sep 18 - Dec 31), the odds of membership were 0.87 times as high as in Q1. While still lower than Q1, this represents a less pronounced decrease (13%) in membership likelihood during fall and early winter.
Query, process, and create the model R object for month based on quartile range.
model <- dplyr::tbl(dbconn, "db/moy_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile,
        family = binomial,
        weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Months Ranges"),
                              conf.int = FALSE,
                              exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of monthly member travel"),
           source_note = gt::md("**Source**: `db/moy.db`"),
           isBinary = TRUE)
Table 25: Modeling the probability of monthly member travel.
Binary Logistic Regression: Modeling the likelihood of monthly member travel
Table 27 and Figure 15 show an aggregated distribution of the overall daily travel.
Code
# Values were too similar to visualize differences, see coord_cartesian()
gplot <- transformData(conn = dbconn,
                       path = "db/dow.db",
                       select_cols = "abbDays",
                       group_cols = "abbDays",
                       doWeights = TRUE) |>
    plotter(x_col = abbDays,
            y_col = n,
            geomType = "column",
            title = paste0("Which days\ndo people tend to ride?"),
            x_label = "Days of the Week",
            y_label = "n") +
    ggplot2::coord_cartesian(ylim = c(4.5 * 10^5, NA))

gplot
Figure 15: The overall distribution of trips taken by day of the week
Figure 16: The membership distribution of trips taken by day of the week
Table 29 presents a Chi-Square analysis of daily travel among members and casual users. Members show higher usage Mon - Thu compared to casual users, and their pattern is more regular across the week. Casual users show higher usage Fri - Sun compared to members, with their preferences falling on the weekend.
There is a statistically significant association between daily travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between daily travel and membership status. The frequency of daily travel is not likely to be independent of membership status.
The large \(\chi^2\) value (76305.71) with 6 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in daily preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(abbDays, member_casual) |>
    dplyr::arrange(abbDays, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble,
                       variable = "abbDays",
                       by = "member_casual")
Code
tabler(
    title = gt::md("Chi-Square:<br>The significance of daily travel on membership"),
    source_note = gt::md("**Source**: `db/dow.db`"),
    label = list(abbDays = "Day"),
    by = member_casual,
    isSummary = TRUE,
    tbl_name = data_tibble,
    chi_result = chiResult
)
Table 29: Testing the independence between day and membership variables
Chi-Square: The significance of daily travel on membership

| Characteristic | casual, N = 1,260,621¹ | member, N = 2,636,777¹ | p-value² |
|---|---|---|---|
| Day | | | <0.001 |
| Sun | 200,588 (16%) | 285,687 (11%) | |
| Mon | 142,860 (11%) | 364,495 (14%) | |
| Tue | 152,604 (12%) | 425,491 (16%) | |
| Wed | 155,978 (12%) | 429,288 (16%) | |
| Thu | 169,117 (13%) | 428,246 (16%) | |
| Fri | 189,429 (15%) | 376,579 (14%) | |
| Sat | 250,045 (20%) | 326,991 (12%) | |

**Source**: db/dow.db

¹ n (%); χ² = 76305.71; df = 6
² Pearson's Chi-squared test
Two plots, Figure 17 and Figure 18, present a highly granular view of the daily data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.
Figure 18: A more detailed look that illustrates the quartile distribution of daily member travel.
To visualize daily users through the lens of their respective concentrations, see Figure 19. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).
The casual group has higher peaks near the left and right sides of the distribution. Their highest peak appears to be on Saturday. The member group has higher peaks near the middle of the distribution. They also show less day-to-day variability during weekdays compared to casual users.
On weekdays, both groups show a bimodal distribution, with two peaks each day. On weekends, there is a unimodal distribution. They overlap each other significantly, suggesting similarity in travel patterns.
Figure 19: Visualizing the probability-density distribution of the day by membership variables.
Table 30 presents the results of a binary logistic regression analyzing the relationship between days of the week and membership status. The analysis divides the week into quartiles, with Q1 (Sunday 12:00 am - Monday 11:40 am) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).
With Q2 (Monday 11:40 am - Wednesday 05:14 am), the odds of membership were 1.52 times higher than in Q1. This suggests a substantial increase in the likelihood of members riding during the early part of the work week.
With Q3 (Wednesday 05:14 am - Friday 12:19 pm), the odds of membership were 1.36 times higher than in Q1. This indicates a continued higher likelihood of membership during the latter part of the work week, though slightly lower than Q2. With Q4 (Friday 12:19 pm - Saturday 11:59 pm), the odds of membership were 0.80 times as high as in Q1. This represents a significant decrease in the likelihood of membership during the weekend period.
Query, process, and create a model object for day based on quartile range.
model <- dplyr::tbl(dbconn, "db/dow_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile,
        family = binomial,
        weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Weekday Ranges"),
                              conf.int = FALSE,
                              exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of daily member travel"),
           source_note = gt::md("**Source**: `db/dow_wq.db`"),
           isBinary = TRUE)
Table 30: Modeling the probability of daily member travel.
Binary Logistic Regression: Modeling the likelihood of daily member travel
| Characteristic | OR¹ | p-value |
|---|---|---|
| Weekday Ranges | | |
| Q1 (Sun 12:00 am - Mon 11:40 am] | — | |
| Q2 (Mon 11:40 am - Wed 05:14 am] | 1.52 | <0.001 |
| Q3 (Wed 05:14 am - Fri 12:19 pm] | 1.36 | <0.001 |
| Q4 (Fri 12:19 pm - Sat 11:59 pm] | 0.80 | <0.001 |

**Source**: db/dow_wq.db

¹ OR = Odds Ratio
3.7 Hour
Database Operations
Table Preview
DB Operations
Write to .db
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(
            started_at_time = update(started_at, year = 2023,
                                     month = 1, day = 1),
            hr = stringr::str_to_lower(
                format(lubridate::round_date(started_at, unit = "hour"),
                       "%I %p")),
            hrMin = stringr::str_to_lower(
                format(lubridate::round_date(started_at, unit = "minute"),
                       "%I:%M %p")),
            hrminSec = stringr::str_to_lower(
                format(lubridate::round_date(started_at, unit = "second"),
                       "%r")),
            hr = forcats::as_factor(hr),
            hrMin = forcats::as_factor(hrMin)
        ) |>
        dplyr::select(member_casual:hrminSec) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod.db",
                             overwrite = TRUE)
}
Query, transform, and write weighted quartile data to hod_wq.db.
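The write itself is not shown in the source; a sketch using the transformData() helper from earlier, with quartile labels taken from the ranges reported in Table 36 (all argument values here are assumptions):

```r
# Weighted, quartile-labeled hour-of-day data for the regression model
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod_wq.db"))) {
    transformData(conn = dbconn,
                  path = "db/hod.db",
                  select_cols = c("started_at_time", "member_casual"),
                  group_cols = c("started_at_time", "member_casual"),
                  pred_col = "started_at_time",
                  ntile_col = "quartile",
                  qtile_levels = c("Q1 (12:00 am - 10:59 am]",
                                   "Q2 (10:59 am - 03:24 pm]",
                                   "Q3 (03:24 pm - 06:05 pm]",
                                   "Q4 (06:05 pm - 11:59 pm]"),
                  doWeights = TRUE,
                  doQuantile = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod_wq.db",
                             overwrite = TRUE)
}
```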
Similarly, Table 34 and Figure 21 summarize the distribution, grouping the sums by membership.
Code
gplot <- transformData(conn = dbconn,
                       path = "db/hod.db",
                       select_cols = c("hr", "member_casual"),
                       group_cols = c("hr", "member_casual"),
                       doWeights = TRUE) |>
    plotter(title = paste0("Which hours of the day do members ride?"),
            x_label = "Hour of Day",
            y_label = "n",
            x_col = hr,
            y_col = n,
            group_col = member_casual,
            geomType = "column",
            isFaceted = TRUE,
            is_colGroup = TRUE) +
    ggplot2::scale_x_discrete(
        guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
    ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

gplot
Figure 21: The membership distribution of trips taken by hour of the day
Table 35 presents a Chi-Square analysis of hourly travel among members and casual users. In summary:
Compared to casual users, members show relatively higher service usage at intervals [5:30am - 9:30am] and [4:30pm - 6:30pm]. Casual users show relatively higher rates at intervals [10:30am - 4:30pm] and [6:30pm - 5:30am].
There is a statistically significant association between hourly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between hourly travel and membership status. The frequency of hourly travel is not likely to be independent of membership status.
The large \(\chi^2\) value (72733.38) with 23 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in hourly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
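The extraction step mirrors the month and day sections; a sketch assuming the db/hod.db table created above:

```r
data_tibble <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::select(hr, member_casual) |>
    dplyr::arrange(hr, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble,
                       variable = "hr",
                       by = "member_casual")
```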
tabler(
    title = gt::md("Chi-Square:<br>The significance of hourly travel on membership"),
    source_note = gt::md("**Source**: `db/hod.db`"),
    label = list(hr = "Hour"),
    by = member_casual,
    isSummary = TRUE,
    tbl_name = data_tibble,
    chi_result = chiResult
)
Table 35: Testing the independence between hour and membership variables
Chi-Square: The significance of hourly travel on membership

| Characteristic | casual, N = 1,260,621¹ | member, N = 2,636,777¹ | p-value² |
|---|---|---|---|
| Hour | | | <0.001 |
| 12 am | 24,468 (1.9%) | 27,868 (1.1%) | |
| 01 am | 15,391 (1.2%) | 16,207 (0.6%) | |
| 02 am | 10,859 (0.9%) | 9,519 (0.4%) | |
| 03 am | 5,053 (0.4%) | 5,256 (0.2%) | |
| 04 am | 3,164 (0.3%) | 4,015 (0.2%) | |
| 05 am | 3,918 (0.3%) | 10,948 (0.4%) | |
| 06 am | 12,330 (1.0%) | 48,697 (1.8%) | |
| 07 am | 27,046 (2.1%) | 114,203 (4.3%) | |
| 08 am | 43,590 (3.5%) | 184,374 (7.0%) | |
| 09 am | 45,413 (3.6%) | 154,471 (5.9%) | |
| 10 am | 47,734 (3.8%) | 108,217 (4.1%) | |
| 11 am | 59,590 (4.7%) | 112,614 (4.3%) | |
| 12 pm | 75,634 (6.0%) | 137,910 (5.2%) | |
| 01 pm | 81,646 (6.5%) | 142,158 (5.4%) | |
| 02 pm | 83,794 (6.6%) | 140,673 (5.3%) | |
| 03 pm | 90,150 (7.2%) | 156,530 (5.9%) | |
| 04 pm | 105,809 (8.4%) | 207,565 (7.9%) | |
| 05 pm | 124,466 (9.9%) | 284,140 (11%) | |
| 06 pm | 118,031 (9.4%) | 253,878 (9.6%) | |
| 07 pm | 91,667 (7.3%) | 185,810 (7.0%) | |
| 08 pm | 63,696 (5.1%) | 125,410 (4.8%) | |
| 09 pm | 48,455 (3.8%) | 91,288 (3.5%) | |
| 10 pm | 44,013 (3.5%) | 68,679 (2.6%) | |
| 11 pm | 34,704 (2.8%) | 46,347 (1.8%) | |

**Source**: db/hod.db

¹ n (%); χ² = 72733.38; df = 23
² Pearson's Chi-squared test
Two plots, Figure 22 and Figure 23, present a highly granular view of the hourly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
Create a data frame, then extract the desired quartile info to supplement the histogram visualization.
Figure 23: A more detailed look that illustrates the quartile distribution of hourly member travel.
To visualize hourly users through the lens of their respective concentrations, see Figure 24. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).
The member group parallels the bimodal pattern seen in the daily density plot. The casual group does likewise, though to a lesser extent. The two groups largely overlap, with the highest densities for both falling around 5:00 pm.
Figure 24: Visualizing the probability-density distribution of the hour by membership.
Table 36 presents the results of a binary logistic regression analyzing the relationship between hour of the day and membership status. The analysis divides the day into quartiles, with Q1 (12:00 am - 10:59 am) serving as the reference category. In summary:
Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 1.44 times as high in Q2 (10:59 am - 03:24 pm), 1.04 times as high in Q3 (03:24 pm - 06:05 pm), and 0.97 times as high in Q4 (06:05 pm - 11:59 pm).
These results reveal a non-linear relationship between time of day and membership status. The highest likelihood of membership occurs during Q2, corresponding to midday hours. There is a slight increase in membership likelihood during Q3 (late afternoon) compared to the reference period, while evening hours (Q4) show a slight decrease in membership likelihood.
Query hod_wq.db, process, and create model R object for hour based on quartile range.
model<-dplyr::tbl(dbconn, "db/hod_wq.db")|>dplyr::collect()|>glm(formula =member_casual~quartile, family =binomial, weights =n)
model |>
  gtsummary::tbl_regression(
    label = list(quartile = "Hour Ranges"),
    conf.int = FALSE,
    exponentiate = TRUE
  ) |>
  tabler(
    title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of hourly member travel"),
    source_note = gt::md("**Source**: `db/hod_wq.db`"),
    isBinary = TRUE
  )
Table 36: Modeling the probability of hourly member travel.
Binary Logistic Regression: Modeling the likelihood of hourly member travel
Table 41: The mean with standard deviation, median with interquartile range, and range with min and max of member travel distance.
Summary Stats: Member travel distance
| Characteristic       | casual, N = 1,260,621 | member, N = 2,636,777 |
|----------------------|-----------------------|-----------------------|
| **Distance (miles)** |                       |                       |
| Median (Q1, Q3)      | 1.12 (0.72, 1.83)     | 0.97 (0.59, 1.71)     |
| Mean (SD)            | 1.48 (1.17)           | 1.34 (1.11)           |
| Min, Max             | 0.10, 20.54           | 0.10, 16.02           |

**Source**: db/distance.db
Two plots, Figure 27 and Figure 28, present a highly granular view of the distance data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
Create a data frame, then extract the desired quartile info to supplement the histogram visualization.
Figure 28: A more detailed look that illustrates the quartile distribution of member travel distance.
To visualize travel distances through the lens of their respective concentrations, see Figure 29. The y-axis shows the density (a measure of relative frequency).
The member group has a narrower, taller spike between the 0.1 and 1 mile marks, with less concentration than the casual group in the 1-3 mile range. The casual group displays a broader spike with much overlap with the member group, but peaks closer to the 1 mile mark; it shows greater concentration than members between 1 and 3 miles, but less between 0.1 and 1 mile. Both groups overlap nearly completely from around the 3 mile mark onwards.
Figure 29: Visualizing the probability-density distribution of distance by membership.
Table 42 presents the odds ratios for membership status across distance quartiles, with Q1 serving as the reference category. In summary:
Compared to Q1, the odds of being a member versus a casual rider were significantly lower in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 0.63 times as high in Q2, 0.59 times as high in Q3, and 0.65 times as high in Q4.
These results indicate an inverse relationship between ride distance and membership status, with members generally associated with shorter ride distances. Interestingly, the lowest odds of membership were observed in Q3, rather than Q4, suggesting a non-linear relationship between distance and membership likelihood.
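Expressed as a percent change in odds, \((OR - 1) \times 100\%\), Q3's odds ratio of 0.59 corresponds to \((0.59 - 1) \times 100\% = -41\%\), i.e., 41% lower odds of membership than in Q1.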
Query, process, and create model R object for distance based on quartile range.
model<-dplyr::tbl(dbconn, "db/distance_wq.db")|>dplyr::collect()|>glm(formula =member_casual~quartile, family =binomial, weights =n)
Table 47: The mean with standard deviation, median with interquartile range, and range with min and max of member travel speed.
Summary Stats: Member travel speed
| Characteristic  | casual, N = 1,260,621 | member, N = 2,636,777 |
|-----------------|-----------------------|-----------------------|
| **Speed (mph)** |                       |                       |
| Median (Q1, Q3) | 6.3 (4.5, 8.1)        | 7.2 (5.7, 8.8)        |
| Mean (SD)       | 6.4 (2.7)             | 7.3 (2.4)             |
| Min, Max        | 1.0, 20.0             | 1.0, 20.0             |

**Source**: db/speed.db
Two plots, Figure 32 and Figure 33, present a highly granular view of the travel speed data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
Create a data frame, then extract the desired quartile info to supplement the histogram visualization.
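That extraction chunk is folded out of view; a minimal sketch, assuming the speed column in db/speed.db is named speed:

# Hypothetical sketch: pull the speed data and compute the quartile boundaries
speed_df <- dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::collect()

# Boundaries like those shown in Table 48: ~5.4, ~7.0, ~8.6 mph
quartiles <- quantile(speed_df$speed, probs = c(0.25, 0.50, 0.75))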
Figure 33: A more detailed look that illustrates the quartile distribution of member travel speed.
To visualize travel speed patterns through the lens of their respective concentrations, see Figure 34. The y-axis shows the density (a measure of relative frequency).
The member group has a narrower, taller spike centered around 7 mph, with higher concentration than the casual group between roughly 5 and 15 mph. The casual group has a broader, shorter spike centered closer to 6 mph, with higher concentration than the member group between roughly 1 and 5 mph. Alongside significant overlap, both groups exhibit a unimodal distribution.
Figure 34: Visualizing the probability-density distribution of speed by membership.
Table 48 presents the odds ratios for membership status across speed quartiles, with Q1 serving as the reference category. In summary:
Compared to Q1, the odds of being a member versus a casual rider were significantly higher in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 2.09 times as high in Q2, 2.50 times as high in Q3, and 2.69 times as high in Q4.
These results suggest a strong positive association between riding speed and membership status, with the likelihood of membership increasing monotonically across speed quartiles.
Query, process, and create model R object for speed based on quartile range.
model<-dplyr::tbl(dbconn, "db/speed_wq.db")|>dplyr::collect()|>glm(formula =member_casual~quartile, family =binomial, weights =n)
Table 48: Modeling the probability of members’ travel by speed.
Binary Logistic Regression: Modeling the likelihood of members’ travel speed
| Characteristic   | OR¹  | p-value |
|------------------|------|---------|
| **Speed Ranges** |      |         |
| Q1 (1.0 - 5.4]   | —    |         |
| Q2 (5.4 - 7.0]   | 2.09 | <0.001  |
| Q3 (7.0 - 8.6]   | 2.50 | <0.001  |
| Q4 (8.6 - 20]    | 2.69 | <0.001  |

**Source**: db/speed_wq.db
¹ OR = Odds Ratio
3.10 Summarizing Tabsets
The EDA (exploratory data analysis) sections employ various statistical methods to uncover patterns in user behavior and preferences. A chi-square analysis reveals a significant association between bicycle type and membership status (p < 0.001). Binary logistic regression further quantifies this relationship, showing that electric bike users have lower odds of being members compared to classic bike users. Section 3.3
Trip duration analysis, also utilizing binary logistic regression, uncovers a notable trend: the odds of membership decrease substantially as trip duration increases. This model, using quartiles of trip duration, indicates that members generally take shorter, more concentrated trips, while casual users are more likely to engage in longer rides. Section 3.4
Seasonal trends emerge when examining monthly ridership patterns through another logistic regression model. The odds of membership fluctuate throughout the year, with the highest proportion of members riding during the colder months and early spring. As the weather warms, there’s a noticeable shift towards more casual ridership, as evidenced by lower odds ratios in the summer months. Section 3.5
Weekly and daily patterns, analyzed using similar regression techniques, provide further insights into user behavior. Weekdays (Section 3.6), particularly during typical commuting hours (Section 3.7), see higher odds of member rides. In contrast, weekends and evenings show decreased odds of membership, suggesting an increased likelihood of casual ridership during these times.
The analysis of trip distances, again using logistic regression with distance quartiles, reveals an inverse relationship with membership status. Members are more likely to take shorter trips, while casual users tend to embark on longer journeys. This aligns with the duration findings and reinforces the idea that members use the service for quick, routine travel. Section 3.4
Interestingly, trip speed shows a strong positive association with membership status in the logistic regression model. The odds of membership increase monotonically across speed quartiles, indicating that members generally ride at higher speeds compared to casual users. Section 3.9
These findings, derived from a combination of chi-square tests for independence and multiple binary logistic regression models, paint a picture of two distinct user groups: members who typically use the bike share for short, fast, routine trips during weekdays, and casual users who tend to take longer, slower rides, often during leisure hours or weekends.
Contingency tables and visualizations, including density plots and histograms, supplement these statistical analyses, providing a comprehensive view of the data distribution across various parameters such as bike type, trip duration, time of day, and day of the week.
The robust statistical approach, combining hypothesis testing (chi-square) with predictive modeling (logistic regression), provides strong evidence for the observed patterns in user behavior. These insights could prove valuable for tailoring marketing strategies, optimizing bike distribution, and enhancing service offerings to better serve both user segments.
3.11 Geographic Data
3.11.1 Traffic Flow
Figure 35 presents an intriguing bird's-eye view of trip behaviors through an interactive epiflows graph. [Moraga et al. (2019)] The R package used to create this graph was repurposed from its original intent of visualizing the spread of disease. This visualization employs a network of nodes (circles) connected by lines, where line thickness roughly corresponds to the volume of trips between nodes, with thicker lines indicating a higher number of trips. The top 34 most frequently traveled stations are depicted in this visual network diagram.
Moraga, P., Dorigatti, I., Kamvar, Z. N., Piatkowski, P., Toikkanen, S. E., Nagraj, V. P., Donnelly, C. A., & Jombart, T. (2019). epiflows: an R package for risk assessment of travel-related spread of disease. https://doi.org/10.12688/f1000research.16032.3
The interactive nature of the epiflows allows users to click on individual nodes and lines to access more detailed information. Additionally, a drop-down window provides further exploration capabilities, enabling users to delve deeper into the data.
These stations represent the most active locations within the system. Fortunately, Section 3.11.2 explores a potential approach to gain insights into the typical high-traffic station locations and the underlying reasons behind their elevated activity levels.
Creating an EpiFlow
First, create the frequency of trips taken to and from pairs of stations. We only look deeper into the top 50 most-traveled pairs.
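The chunk that builds flowData is not shown in this view; a minimal sketch, assuming the pair counts come straight off the filtered table:

# Count trips per (from, to) station pair and keep the 50 busiest pairs
flowData <- dplyr::tbl(dbconn, filtered_path) |>
  dplyr::count(start_station_name, end_station_name) |>
  dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
  dplyr::arrange(dplyr::desc(n)) |>
  head(50) |>
  dplyr::collect()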
Second, we need summary statistics, combined for every unique station name.
locationData <- dplyr::tbl(dbconn, filtered_path) |>
  dplyr::select(start_station_name, end_station_name, started_at, ended_at, trip_time) |>
  dplyr::group_by(start_station_name, end_station_name) |>
  dplyr::mutate(trip_time = round(trip_time, digits = 0)) |>
  dplyr::summarize(
    trip_count = dplyr::n(),
    first_date = min(started_at),
    last_date = max(ended_at)
  ) |>
  dplyr::ungroup() |>
  dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
  dplyr::arrange(dplyr::desc(trip_count)) |>
  dplyr::collect()

# Need to combine all names to a single column and recalculate or retain other stats.
locationData_pivoted <- locationData |>
  tidyr::pivot_longer(cols = 1:2, values_to = "allNames") |>
  dplyr::group_by(allNames) |>
  dplyr::summarize(
    trips_toAndfrom = sum(trip_count),
    first_date = min(first_date),
    last_date = max(last_date)
  ) |>
  dplyr::arrange(trips_toAndfrom)
Third, create the epiflows object, which takes in a pair of data frames and creates the flows between them.
# For all the pairs
ef_test <- epiflows::make_epiflows(
  flows = flowData,
  locations = locationData_pivoted,
  num_cases = "trips_toAndfrom"
)
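From here, the interactive network in Figure 35 can presumably be drawn with the package's plot method; the "network" type renders the interactive view:

# Render the interactive flow network from the epiflows object
plot(ef_test, type = "network")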
Tables
First, just a quick view of the flow data table we made earlier.
flowData
# A tibble: 50 × 3
from_station to_station n
<chr> <chr> <dbl>
1 Ellis Ave & 60th St Ellis Ave & 55th St 6927
2 Ellis Ave & 60th St University Ave & 57th St 6600
3 Ellis Ave & 55th St Ellis Ave & 60th St 6349
4 University Ave & 57th St Ellis Ave & 60th St 6168
5 Calumet Ave & 33rd St State St & 33rd St 5417
6 State St & 33rd St Calumet Ave & 33rd St 5343
7 DuSable Lake Shore Dr & Monroe St Streeter Dr & Grand Ave 4023
8 Loomis St & Lexington St Morgan St & Polk St 3719
9 Morgan St & Polk St Loomis St & Lexington St 3379
10 University Ave & 57th St Kimbark Ave & 53rd St 3112
# ℹ 40 more rows
Second, another quick view, but for the location data we pivoted earlier.
# A tibble: 1,567 × 4
allNames trips_toAndfrom first_date last_date
<chr> <dbl> <dttm> <dttm>
1 Streeter Dr & Grand … 86422 2023-01-01 00:05:43 2024-01-01 00:19:01
2 Kingsbury St & Kinzi… 61277 2023-01-01 01:21:59 2023-12-31 21:30:50
3 DuSable Lake Shore D… 60808 2023-01-01 02:12:09 2023-12-31 23:34:53
4 Clark St & Elm St 60552 2023-01-01 01:06:48 2023-12-31 23:29:33
5 Clinton St & Washing… 58278 2023-01-01 00:44:39 2023-12-31 18:03:02
6 Wells St & Concord Ln 57642 2023-01-01 01:15:27 2023-12-31 23:51:50
7 Michigan Ave & Oak St 54000 2023-01-01 00:59:17 2023-12-31 23:09:35
8 Wells St & Elm St 52315 2023-01-01 00:59:22 2023-12-31 23:51:48
9 DuSable Lake Shore D… 48833 2023-01-01 00:14:47 2023-12-31 16:49:56
10 Theater on the Lake 48349 2023-01-01 03:14:22 2023-12-31 22:53:53
# ℹ 1,557 more rows
Figure 35: EpiFlow Network
3.11.2 Checking the Map
This section was made possible by the latitude and longitude coordinates provided alongside the station names. Coming from the epiflow diagram, this should help make the data less abstract. The accordion below expands and collapses four OpenStreetMap views, which were split across maps for viewing logistics; they contain the stations from the epiflow in the section above. The maps are interactive, so the default views are zoomable and movable. The transparent burst buttons enable snappy zooming in on the station groups.
Code for Mapping
Processing ‘flowData’ created earlier to include geolocation data for mapview plots.
# All distinct stations in one column
names <- flowData |>
  dplyr::select(from_station, to_station) |>
  tidyr::pivot_longer(cols = 1:2, names_to = NULL, values_to = "station_names") |>
  dplyr::distinct()

# The important geo-coordinates corresponding to station names
mapData <- dplyr::tbl(dbconn, filtered_path, check_from = FALSE) |>
  dplyr::select(start_station_name, start_lat, start_lng,
                end_station_name, end_lat, end_lng)

# Filter to include all observations that match the station names listed in
# 'names'. We need the geo-coordinates alongside the names.
mapData1 <- mapData |>
  dplyr::collect() |>
  # Filter, but through a vector of conditions.
  dplyr::filter(start_station_name %in% names[[1]],
                end_station_name %in% names[[1]]) |>
  dplyr::select(start_station_name:start_lng)

# Had to split 'mapData' into two and pivot into a single table.
mapData2 <- mapData |>
  dplyr::collect() |>
  dplyr::filter(start_station_name %in% names[[1]],
                end_station_name %in% names[[1]]) |>
  dplyr::select(end_station_name:end_lng)

# Nice grouping
stations_groupMap <- dplyr::bind_rows(mapData1, mapData2) |>
  dplyr::select(start_station_name, start_lat, start_lng) |>
  dplyr::rename(station_names = start_station_name,
                lat = start_lat,
                lng = start_lng) |>
  dplyr::distinct() |>
  dplyr::group_by(station_names)

# Setting seed for sampling
set.seed(113)

# Taking 10 random samples from each station_name group
sampled_stations <- stations_groupMap |>
  dplyr::slice_sample(n = 10) |>
  tidyr::drop_na()
Creates a map coloring palette excluding grays.
# All of the R colors
allPalette <- colors()

# The grays are vast, so we don't want those watering down the samples.
colorfulPal <- purrr::discard(allPalette, stringr::str_detect(allPalette, "gr(a|e)y"))

# When we sample the colors, 10 should be slightly more than needed.
n_colors <- 10
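The mapViewer() function below references a randomColors vector that is not defined in the visible chunks; presumably it is sampled from the palette along these lines:

# Assumed step: sample n_colors colors from the gray-free palette for the map layers
randomColors <- sample(colorfulPal, n_colors)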
First, source the script needed to generate the maps and create the list of vectors used as input. These vectors are slices of the top most-traveled stations.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# I needed the station groups' burst buttons to fit the viewing window in my
# document, and the only way I could think of is to split the stations into
# multiple maps. This reduces duplicate code.
# ----
mapViewer <- function(x) {
  nameSlice <- sampled_stations |>
    dplyr::ungroup() |>
    dplyr::distinct(station_names) |>
    dplyr::slice(x)

  viewMap <- sampled_stations |>
    dplyr::filter(station_names %in% nameSlice$station_names) |>
    sf::st_as_sf(coords = c(3:2), crs = 4326) |>
    mapview::mapview(
      zcol = "station_names",
      col.regions = randomColors,
      map.types = "OpenStreetMap",
      burst = TRUE,
      legend = FALSE
    )

  return(viewMap)
}
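A hypothetical call, rendering one slice of the sampled station groups on a single map:

# e.g., the first nine station groups on one map
mapViewer(1:9)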
Figure 39: Streeter Dr & Grand Ave - Woodlawn Ave & 55th St
3.11.3 Summarizing Geographic EDA
For example, suppose the user selects University Ave & 57th St in the epiflow visualization. This intersection happens to be at the heart of the University of Chicago campus. The natural next question is: where does the traffic to and from this location typically flow? By selecting one of the other nodes with flows directed away from that node, the user can identify Kimbark Ave & 53rd St. As seen in the map view, this location is situated adjacent to the Vue 53 Apartments complex. By analyzing such connections between nodes, the user can gain insights into common routes and destinations originating from a particular point of interest, potentially revealing patterns related to student housing, campus facilities, or other points of interest in the vicinity.
The data suggests individual members utilize the service multiple times weekly. However, further analysis is needed to determine if a significantly larger volume of unique individuals are annual members. Verifying associations between specific locations and higher or lower traffic could be a next step. Preliminary observations indicate universities, shopping centers, major companies, and nearby apartment complexes tend to have the highest ridership volumes.
To improve membership, addressing factors deterring individuals from becoming annual members could be key. These may include a lack of stations within walking distance of residences or destinations, or concerns over electric bicycle battery life and charging station availability, potentially explaining their lower utilization compared to classic bikes. Offering trial periods could allow casual users to experience the service’s reliability and convenience, encouraging conversion to annual memberships.
3.12 Updated Database Tables List
dbList2 <- duckdb::dbListTables(dbconn) |>
  as.data.frame() |>
  tabler(
    title = gt::md("Which tables have<br>been created so far?"),
    note_list = list(gt::md("Tables in `db/data.db` at this stage")),
    location_list = list("duckdb::dbListTables(dbconn)"),
    noteColumns = TRUE,
    source_note = gt::md("**Source**: `db/data.db`"),
    label_n = NULL
  ) |>
  gt::cols_label(`duckdb::dbListTables(dbconn)` = "Table Paths") |>
  gt::cols_align(align = "left")

dbList2
Table 49: The list of tables created by the end of the analysis.
Which tables have been created so far?
Table Paths1
db/bType.db
db/bType_wb.db
db/complete_data.db
db/data_unduped.db
db/distance.db
db/distance_wq.db
db/dow.db
db/dow_wq.db
db/duration.db
db/duration_wq.db
db/filtered_data.db
db/hod.db
db/hod_wq.db
db/membership.db
db/moy.db
db/moy_wq.db
db/original_data.db
db/speed.db
db/speed_wq.db
Source: db/data.db
1Tables in db/data.db at this stage
Figure 40: dbList_2
Notice the size difference from the previous image. The database is still represented in the data folder as one file.
3.13 Export the Final Database
For future use, it may save a lot of time not to have to re-download the original data or recreate tables. The database is exported and compressed to Parquet files.
A view of the data exporting script.
# ----
# CC BY-SA, Eric Mossotti
# ----
# Description ----
#
# Simplifies duckDB database export to a separate directory. Prevents the need
# for re-downloading and recreating db tables in some circumstances.
dbExporter <- function(dbdir, query) {
  if (dir.exists("db_exported") == FALSE) {
    dir.create("db_exported")
  }
  conn <- DBI::dbConnect(duckdb::duckdb(), dbdir)
  DBI::dbExecute(conn, query)
}
Execute the above script with a custom SQL query string to export the database to the db_exported directory.
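The exact query string is not shown; a plausible invocation using DuckDB's EXPORT DATABASE syntax would be:

# Assumed example call: export db/data.db as compressed Parquet files
dbExporter(
  dbdir = "db/data.db",
  query = "EXPORT DATABASE 'db_exported' (FORMAT PARQUET)"
)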
These findings and recommendations are based on robust statistical analyses, including chi-square tests, binary logistic regression models, and visualization techniques. They provide a data-driven foundation for enhancing the Divvy bike-sharing service and better serving the residents of Chicago.
4.1 Key Findings
Membership status significantly influences bike usage patterns:
Members prefer classic bikes over electric bikes.
Casual users have a higher electric bike usage compared to members.
Members typically take shorter, faster trips.
Casual users tend to engage in longer, slower rides.
Temporal patterns reveal distinct user behaviors:
Weekdays and typical commuting hours see higher member activity.
Weekends and evenings show increased casual ridership.
Membership likelihood is highest during colder months and early spring.
Summer months see a shift towards more casual ridership.
Trip characteristics vary by user type:
Members are associated with shorter ride distances.
Trip speed shows a strong positive association with membership status.
The likelihood of membership decreases as trip duration increases.
High-traffic stations are often near universities, shopping centers, major companies, and apartment complexes.
The large sample size (nearly 4 million trips) allows for high statistical significance in observed differences.
4.2 Recommendations
Tailor marketing strategies to target potential members for short, frequent trips, especially for commuting purposes.
Optimize bike distribution to meet the demand for classic bikes among members and electric bikes among casual users.
Implement promotional campaigns during summer months to convert casual users to members.
Consider offering trial periods to allow casual users to experience the benefits of membership.
Investigate factors deterring individuals from becoming annual members, such as station proximity to residences or destinations.
Address potential concerns over electric bicycle battery life and charging station availability.
Focus on improving service near high-traffic areas like universities, shopping centers, and residential complexes.
Develop targeted strategies to encourage casual users who take longer, leisure-oriented rides to consider membership benefits.
Utilize the epiflows visualization tool to identify and optimize popular routes and destinations.
Continue to collect and analyze data to refine understanding of user behavior and preferences over time.
---title: "Bike-Sharing in the Streets of Chicago"author: "Eric Mossotti"date: "05-23-2024"date-modified: last-modifieddate-format: "MMM D, YYYY"bibliography: references.bibrepo: https://github.com/ericMossotti/Bike_Sharesource: index.qmdabstract-title: "Objective"abstract: "Communicating reproducible, data-driven insight into the questions relating but not limited to, increasing annual membership subscriptions."description-meta: "Communicate reproducible, data-driven insights."code-links: - text: "Project Repo" href: https://github.com/ericMossotti/Bike_Sharecode-fold: truecode-copy: hovercode-overflow: wrapcode-tools: truecode-link: truetoc: truetoc-location: lefttoc-depth: 5number-sections: truelink-external-newwindow: truesmooth-scroll: truefig-responsive: trueecho: truecitation-location: margincitations-hover: truelink-citations: truecsl: csl/apa.cslzotero: true#callout-appearance: simplelightbox: autolicense: CC BY-SAfunding: "The author(s) received no specific funding for this work."---```{r, include = FALSE}knitr::opts_chunk$set(message = FALSE, warning = FALSE)```------------------------------------------------------------------------# Intro:::::::: {#offcanvas1 .offcanvas .offcanvas-end tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {#offcanvasLabel .h5 .offcanvas-title}Download Files and DB Creation:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: flex-code```{r}#| label: dl_toDirectory#| file: "Scripts/dl_toDirectory.R"#| code-summary: "A view the `dl_toDirectory.R` script used below."#| eval: false``````{r}#| label: csv_toDB#| file: "Scripts/csv_toDB.R"#| code-summary: "A view of the `csv_toDB.R` script used below."#| eval: false``````{r}#| label: importOrConnect#| code-summary: "Execute initial download and removal of incomplete observations, then<br>store original data and complete data separately.<br><br>Or, just connect to existing db,<br>then source frequently used scripts used later." #| tidy: trueif (exists("dbconn") ==FALSE&&dir.exists("db") ==FALSE) {source("Scripts/dl_toDirectory.R")dl_toDirectory (durls =sprintf("https://divvy-tripdata.s3.amazonaws.com/%d-divvy-tripdata.zip",202301:202312),tmpzip_dir ="tempZips", tmpzip_paths =sprintf("tempZips/%d-divvy-tripdata.zip",202301:202312),tmpfile_dir ="tempFiles",tmpfile_paths =sprintf("tempFiles/%d-divvy-tripdata.csv",202301:202312),tmpfile_names =sprintf("%d-divvy-tripdata.csv",202301:202312) )source("Scripts/csv_toDB.R")csv_toDB (tmpfile_dir ="tempFiles",tmpfile_paths =sprintf("tempFiles/%d-divvy-tripdata.csv",202301:202312),db_dir ="db", database_path ="db/data.db", original_path ="db/original_data.db", complete_path ="db/complete_data.db" )} else { database_path <-"db/data.db" original_path <-"db/original_data.db" complete_path <-"db/complete_data.db" filtered_path <-"db/filtered_data.db" dbconn <- DBI::dbConnect(duckdb::duckdb(), dbdir = database_path, read_only =FALSE)}source("Scripts/tabler.R")source("Scripts/plotter.R")source("Scripts/transformData.R")source("Scripts/chisqTest.R")```:::::::::::::::## StakeholdersThe primary stakeholders in this analysis are Divvy, Lyft (the parent company of Divvy), and the City of Chicago Department of Transportation. The analysis aims to provide these stakeholders with data-driven insights to enhance the Divvy bike-sharing service, better serving the residents of Chicago and its users. 
The initial rationale behind Divvy's implementation included improving air quality, promoting economic recovery, and reducing traffic congestion within the city. [@aboutdi]## Data Import::::: {.d-grid .gap-0}::: {.callout-tip width="auto" title="Off-Canvas Buttons"}Code processing steps are accessible via buttons like the one below. Drop-down code summaries and tables therein add context and transparency regarding the presented findings to enhance understanding.:::::: {.btn .btn-outline-danger .mb-3 type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas1" aria-controls="offcanvas"}Data import scripts and initial analysis setup decision::::::::## Source::: p-1The raw 2023 dataset was imported from Divvy Data. [@divvyda]::::::: {.d-flex overflow-x="auto"}::: column-screen-inset```{r}#| label: tbl-raw#| tbl-cap: "Unaltered data (other than being combined and formatted for presentation)."#| cap-location: top#| tidy: true# List of column labels to feed tabler() and add_multiple_footnotes()location_list <- dplyr::tbl(dbconn, original_path) |>dplyr::collect() |>colnames() |>as.list()# A simple list of footnotes to feed tabler() and add_multiple_footnotes().note_list <-list("Anonymized trip identifier.", "The bicycle type.", "Starting date-time (to the second).","Ending date-time (to the second).","Station name of where the trip started.","Station ID of where the trip started.","Station name of where the trip ended.","Station ID of where the trip ended.","Latitude associated with the starting location.","Longitude associated with the starting location.","Latitude associated with the ending location.","Longitude associated with the ending location.","If someone is an annual subscriber or not.")dplyr::tbl(dbconn, original_path) |>dplyr::collect() |>dplyr::slice_head(n =10) |>tabler(title ="What data did we start with?",source_note = gt::md("**Source**: Divvy Data"),note_list = note_list,location_list = location_list,noteColumns =TRUE,label_n =NULL) |>gt::tab_options(table.font.size = gt::pct(75),footnotes.multiline =FALSE)```::::::::::::::: {#offcanvas100 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Glimpse:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: flex-code```{r}#| label: glimpseRaw#| code-summary: A 'glimpse' function output of the raw data with data type information.#| tidy: truedplyr::tbl(dbconn, original_path) |>dplyr::collect() |>tibble::glimpse() ```::::::::::::::::::: {.d-grid .gap-0}::: {.btn .btn-outline-warning .mb-3 .mt-3 type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas100" aria-controls="offcanvas"}R console output of the raw data:::::::## DesignAnother worthy objective of this analysis is to achieve reproducibility and efficiency. To facilitate future research and enable subsequent analyst teams to build upon this work, the project aimed to provide adequate code documentation and adhere to best practices regarding clean and modular code.For instance, certain design decisions were incorporated to eliminate the need for re-downloading and re-processing data. 
For analysts conducting analysis over an extended period, such as days or months, on this dataset, it is now possible to simply reconnect to the single database file containing all the original data, including tables generated throughout the analysis process, following the initial download and subsequent processing.The underlying code incorporates an if-else decision, which includes a source code script responsible for handling the initial data processing and establishing the database filesystem. Opting for a persistent DuckDB filesystem (as opposed to a purely in-memory solution) appeared optimal in terms of simplicity, cost-effectiveness of SQL database queries, and retaining progress made over extended periods. [@whyduck]To streamline the process, reduce code duplication, and maintain consistent formatting throughout the project, reusable functions were developed for generating most of the tables and figures. These functions are located in the "Scripts" folder within the working directory. Their modular design not only simplifies the implementation of formatting changes but also facilitates the integration of additional code snippets when necessary. For instance, certain plots might require limiting the range of the axes, which can be achieved by combining these functions with appropriate code addendum. By leveraging these functions, the project benefits from reduced redundancy, improved efficiency, and cohesive formatting across all visualizations and data representations.## Initial Database Table List::: {.d-flex .justify-content-center .tableScroller}```{r}#| label: tbl-dbList#| tbl-cap: "The list of tables created in data.db so far."#| tidy: true# Since we already know the paths, we'll manually define these # Otherwise, might have to delete db folder and start overdbList <-list(Paths =c("db/original_data.db", "db/complete_data.db") ) |>as.data.frame()dbList |>tabler(title = gt::md("Which tables have<br>been created so far?"),note_list =list(gt::md("Tables in `db/data.db` at this stage")),location_list =list("Paths"),noteColumns =TRUE,source_note = gt::md("**Source**: `db/data.db`"),label_n =NULL,align_parameter ="left") |>gt::cols_align(align ="left")```:::::: column-margin{#fig-dbList_1}A view of the filesystem directory. Notice there are not separate files for tables. The file, data.db, in this view does not represent a table name but the name of the database.:::# Tidying## Duplicates:::::::: {#offcanvas2 .offcanvas .offcanvas-end tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Code to Remove Duplicates:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: flex-code```{r}#| label: originalNobs#| code-summary: First, record original observations from the raw data.#| tidy: true# Need to save this count for the summary table lateroriginal_nobs <- dplyr::tbl(dbconn, original_path) |>dplyr::collect() |>nrow()``````{r}#| label: duplicates_gt#| code-summary: "Create a table containing the duplicated observations."#| tidy: true# This is a separate table used to analyze the observations# returned as not distinct (n > 1). 
This adds an extra column, labeled "n".dupeTable <- dplyr::tbl(dbconn, complete_path) |>dplyr::select(started_at:end_station_name) |># Counts of unique rows added for column 'n'dplyr::add_count(started_at, ended_at, start_station_name, end_station_name) |># Only observations that have been duplicated 1 or more times are shown.dplyr::filter(n >1) |># To see all rows, not just one row for each obs.dplyr::ungroup() |>dplyr::arrange(started_at) |>dplyr::collect()``````{r}#| label: duplicateObs count#| code-summary: "Record a count of distinct duplicates and total observations."#| tidy: truedistinctCopiesCount <- dupeTable |>dplyr::distinct(n) |>as.integer()duplicateObs <-length(dupeTable[[1]])``````{r}#| label: undupedTable#| code-summary: "Create a table of the now unduplicated observations seen earlier."#| tidy: true# The issue is, we need to get rid of not all of these rows, but just the extra duplicate observations.# If there were 2 rows of duplicates, one would want to end up with 1 row after removing the extras.undupedTable <-dupeTable |>dplyr::distinct(started_at, start_station_name, ended_at, end_station_name)``````{r}#| label: incorrect distinct obs count#| code-summary: Record a count of the incorrect observations.#| tidy: true# Run an incorrect count on how many rows or observations there are in the dataset.count_incorrectDists <- dplyr::tbl(dbconn, complete_path) |>dplyr::distinct(dplyr::pick("ride_id")) |>dplyr::count(name ="Incorrect Distinct Observations") |>dplyr::collect() |>as.integer()``````{r}#| label: count_correctDists count#| code-summary: "Record a count of the correct observations."#| tidy: true# For the correct count of obscount_correctDists <- dplyr::tbl(dbconn, complete_path) |>dplyr::distinct(dplyr::pick("started_at","start_station_name","ended_at","end_station_name")) |>dplyr::count() |>dplyr::collect() |>as.integer()``````{r}#| label: writeUnduplicated#| code-summary: "Lastly, write the unduplicated data to the database."#| tidy: trueif (isFALSE(duckdb::dbExistsTable(dbconn, "db/data_unduped.db"))) {dplyr::tbl(dbconn, complete_path) |>dplyr::distinct(started_at,start_station_name,ended_at,end_station_name,.keep_all =TRUE) |>dplyr::arrange(started_at) |>dplyr::collect() |>duckdb::dbWriteTable(conn = dbconn,name ="db/data_unduped.db",overwrite =TRUE)}```:::::::::::::::::: {.mt-2 .mb-2}A crucial question arises: How can one identify and handle duplicate data? This section covers the process of checking for duplicates and selectively removing them while exercising caution. It is essential to recognize that the presence of unique values in a single column does not necessarily guarantee the uniqueness of each observation or row.While all values in the **ride_id** column were found to be unique, not all observations were truly distinct. To verify the uniqueness of each observation, additional columns such as **start_time**, **end_time**, **start_station**, and **end_station** were utilized. These columns provide more granular information, including the precise starting and ending times down to the second, as well as the starting and ending locations. 
It was assumed that observations with identical starting and ending date-times and stations, despite having different rider IDs, were potentially erroneous duplicates.:::::: tableScroller```{r}#| label: tbl-duplicates#| tbl-cap: "Duplicates Table"gtDupes <-dupeTable |>dplyr::group_by(started_at) |>gt::gt(rowname_col ="row",groupname_col ="started_at",row_group_as_column =TRUE) |>gt::tab_style(style =list( gt::cell_text(weight ="bold", align ="center"), gt::cell_borders(sides =c("bottom")) ),locations = gt::cells_column_labels(gt::everything())) |>gt::tab_style(style =list(gt::cell_borders(sides =c("left", "right"), color ="transparent"), gt::cell_text(align ="center", v_align ="middle")), locations = gt::cells_body(gt::everything())) |>gt::data_color(columns = start_station_name,target_columns = gt::everything(),method ="auto",palette ="basetheme::brutal") |>gt::tab_header(title ="A view of duplicated observations", subtitle ="Grouping follows the starting date-time value") |>gt::tab_options(heading.title.font.weight ="bolder",heading.subtitle.font.weight ="lighter",heading.align ="center",table.background.color ="transparent",table.font.color ="SeaShell",table.font.size = gt::pct(75),) |>gt::tab_source_note(source_note = gt::md("**Source**: `db/data_complete.db`") )gtDupes```:::::: {.mt-3 .mb-3}Although the cause of such duplication errors is unknown, it could be assumed that one person checked out multiple bikes simultaneously. In that scenario, each bike would be assigned a unique **ride_id**. However, this occurrence was relatively rare, happening only **18** times over the course of a year. Since there is only one duplicate for each instance, it raises concerns and warrants further investigation. It is possible that trips could be grouped where one person pays for another rider's fare. However, if that were the case, it raises the question of why there is always precisely one duplicate.In @tbl-duplicates, duplicate observations are listed and grouped by color for visual clarity. In contrast, @tbl-unduplicated presents the data after removing the extra copy of each duplicate observation while preserving the unique observations. Of the duplicates identified, each had one extra copy. It was noted that the number of rows in the duplicates table is 36. Each duplicated observation has one duplicate, where **n** (the count) is always 2. Therefore, the expected number of observations to be removed was 18. 
A complication arose in determining how to remove not all observations but only the extra duplicate observation from each group.:::::: tableScroller```{r}#| label: tbl-unduplicated#| tbl-cap: Un-duplicated Tablegt_undupes <- undupedTable |>dplyr::collect() |>dplyr::group_by(started_at) |>gt::gt(rowname_col ="row",groupname_col ="started_at",row_group_as_column =TRUE) |>gt::fmt_number(decimals =0) |>gt::tab_style(style =list(gt::cell_text(weight ="bold", align ="center"),gt::cell_borders(sides =c("bottom"))),locations = gt::cells_column_labels(gt::everything())) |>gt::tab_style(style =list(gt::cell_borders(sides =c("left", "right")),gt::cell_text(align ="center", v_align ="middle")),locations = gt::cells_body(gt::everything())) |>gt::data_color(columns = start_station_name,target_columns = gt::everything(),method ="auto",palette ="basetheme::brutal") |>gt::tab_header(title ="After duplicates were removed", subtitle ="Same grouping") |>gt::tab_options(heading.title.font.weight ="bolder",heading.subtitle.font.weight ="lighter",heading.align ="center",table.background.color ="transparent",table.font.color ="SeaShell",table.font.size = gt::pct(75)) |>gt::tab_source_note(source_note = gt::md("**Source**: `db/data_complete.db`"))gt_undupes```:::::: {.mb-2 .mt-2}To ensure the accurate removal of duplicates, the count of distinct n-values (representing the number of occurrences) for the un-duplicated table was computed, confirming the expected 18 unique instances. Subsequently, the total number of observations in the dataset was recorded, initially standing at 4,331,707. After removing the identified duplicate observations, the correct count of observations was 4,331,689. In summary, 18 additional observations were successfully removed, aligning with the expected number of duplicates identified earlier. These steps are documented in @tbl-observationHistory for reference.By carefully analyzing the count of distinct n-values and the total occurrences before and after reduplication, it was ensured that only the precise number of duplicate observations was removed, preserving the integrity of the unique data while eliminating the identified duplicates. 
This meticulous approach to data cleaning is crucial for maintaining data quality and reliability throughout the analysis process.::::::: {.d-grid .gap-0}::: {.btn .btn-outline-info .mt-4 .mb-3 type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas2" aria-controls="offcanvas"}Code used for handling duplicates in this section:::::::## Outliers::::::::: {#offcanvas33 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Transform and Filter the Database:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}::::::::::::: offcanvas-body::: flex-code```{r}#| label: filterDecisions#| code-summary: "If you happen to be re-using this code - <br>this is so you do not have to re-download<br>or re-filter after making further adjustments."#| tidy: truefiltered_path <-"db/filtered_data.db"# Do we still need to filter the database?if (duckdb::dbExistsTable(dbconn, filtered_path) ==FALSE) {source("Scripts/transFilter.R")transFilter(conxn = dbconn, oldPath ="db/data_unduped.db", newPath = filtered_path)}```:::::: flex-code```{r}#| label: filterScript#| code-summary: "This would execute if the conditions were met<br>to filter the db/data.db database table"#| file: "Scripts/transFilter.R"#| eval: false```:::::::::::::::::Observations deemed erroneous or irrelevant for identifying usage trends among members and casual users were filtered out. Keeping track of these errors is a good practice, as they might provide insights into the differences in how members and casuals utilize the service.Trips with negative duration were flagged as errors and removed. Additionally, trips lasting less than a minute but greater than zero were noted and removed, as they could potentially skew the derived statistics. These extremely short trips might be attributed to users briefly trying out the service before committing or quickly realizing their dissatisfaction with it. While some observations seemed nonsensical, most of the data was retained.Consistent with the previous approach, an **if-else** decision was employed to facilitate testing. An external database filtering script was utilized to streamline the code within the main Quarto document. 
The resulting filtered data served as the foundation for subsequent analysis and table generation.::: {.p-2 .flex-code}```{r}#| label: countFiltered#| code-summary: To get a count of the new total observations after filtering.#| tidy: truecount_filtered <- dplyr::tbl(dbconn, filtered_path) |>dplyr::select(ride_id) |>dplyr::distinct() |>dplyr::count() |>dplyr::collect() |>as.integer()```:::::: {.d-flex .justify-content-center}```{r}#| label: tbl-observationHistory#| tbl-cap: "Documenting observation count history."# To see the history of obs in our dataset.summaryProcessTable <- tidyr::tribble(~" ",~" ","Original ",original_nobs,"Complete Observations ",count_incorrectDists,"Duplicates ",(count_incorrectDists - count_correctDists),"Filtered ",(count_correctDists - count_filtered),"Total Corrected ",count_filtered) |>gt::gt(rownames_to_stub =FALSE) |>gt::tab_header(title ="Documenting transformation results", subtitle = gt::md("Noting the change<br>in overall observations")) |>gt::cols_label(" "="n") |>gt::tab_footnote(footnote = gt::md("Row counts throughout the cleaning steps."),locations = gt::cells_column_labels(columns =" ")) |>gt::tab_style(style =list(gt::cell_borders(sides ="bottom"),gt::cell_text(align ="left",stretch ="semi-expanded")),locations = gt::cells_body(gt::everything())) |>gt::tab_style(gt::cell_text(align ="center",stretch ="semi-expanded"),locations =list(gt::cells_title(groups =c("title", "subtitle")),gt::cells_column_labels(gt::everything()))) |>gt::fmt_number(decimals =0) |>gt::tab_options(column_labels.font.weight ="bold",table.background.color ="transparent",table.font.color ="SeaShell",row.striping.background_color ="gray10",row.striping.include_table_body =TRUE) |>gt::tab_source_note(source_note = gt::md("**Sources**: `db/data_original.db`, `db/data_complete.db`,<br>`db/data_unduped.db`, `db/data_filtered.db`"))summaryProcessTable```::::::: {.d-grid .gap-0}::: {.btn .btn-outline-success .mb-3 type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas33" aria-controls="offcanvas"}Outlier and additional transformations:::::::# Exploratory Analysis::::::::::: {#offcanvas8979 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}EDA Scripts:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}::::::::::::::: offcanvas-body::: flex-code```{r}#| label: tablerScript#| code-summary: Reduces duplicate code and improves conistency in table formatting across this report.#| file: "Scripts/tabler.R"#| eval: false ```:::::: flex-code```{r}#| label: plotterScript#| code-summary: Provides a useful script for consistent formatting of many of the plots. #| file: "Scripts/plotter.R"#| eval: false ```:::::: flex-code```{r}#| label: transformDataScript#| eval: false#| file: "Scripts/transformData.R"#| code-summary: Performs many of the more backend query transformations I needed for this document. 
```:::::: flex-code```{r}#| label: chisqTestScript#| eval: false#| file: "Scripts/chisqTest.R"#| code-summary: Helpfully, this contributes to the Chi-Squared tables by allowing me to add the chi-squared statistic and degrees of freedom information in the footnotes.```::::::::::::::::::::::::: {.d-grid .gap-0}::: {.btn .btn-outline-warning .mt-3 .mb-3 type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas8979" aria-controls="offcanvas"}Scripts used frequently in this section:::::::## MethodsThis section explains certain methodologies that might be less obvious to the reader.### Chi-Square#### Set up {.unnumbered}**Null Hypothesis (H0)** There is no association between membership status and travel behavior.**Alternative Hypothesis (H1)** There is an association between membership status and travel behavior.**Significance level**, $(\alpha) = 0.05$#### Collect data {.unnumbered}Tables are transformed and stored for later use. Where needed, the appropriate table is extracted from the database for analysis.#### Expected frequencies {.unnumbered}:::{.p-2 style="font-size: 125%; text-align: center"}$\text Expected_{for\ each\ cell} = (\text Row\ Total \times Column\ Total) / \text Grand\ Total$:::#### Chi-square statistic {.unnumbered}:::{.p-2 style="font-size: 125%; text-align: center"}$\chi^2=\sum[(O-E)^2/E]$:::$O$ - Observed frequency$E$ - Expected frequency$\sum$ - Sum over all cells in the table#### Degrees of freedom {.unnumbered}:::{.p-1 style="font-size: 125%; text-align: center"}$\text df = (rows - 1) \times (columns - 1)$:::#### The critical value {.unnumbered}The critical value can be determined from the chi-square distribution table or software using $α$ and the df.#### The p-value {.unnumbered}The p-value is calculated from the chi-square statistic and the df. When calculating the p-value, statistical software or functions (like the ones later in this report) internally use this integral::::{.p-1 style="font-size: 125%; text-align: center"}$\text p-value = \int[\chi^2\ \ to\ \infty]\ f(x,\ df)dx$:::$\chi^2$ is the calculated chi-square statistic (the lower bound of integration)$\infty$ is the upper bound of integration$f(x,\ df)$ is the chi-square PDF (probability density function)$x$ is the variable of integration#### Interpreting results {.unnumbered}Our calculated chi-square statistic is compared to the critical value (greater or less than). We then would be checking for consistency by comparing the p-value to the significance level value. Then, we either reject, where $p < \alpha$ , or fail to reject, where $p>\alpha$ , the null hypothesis. This informs us if there is a significant association between the membership status and the given travel behavior parameter.#### Calculation Process {.unnumbered}The software used to generate the Chi-Square tables in this report makes it so a human doesn't need to work with the integrals directly. The software calculates the chi-square statistic and df from the data and uses that information as input to the chi-square calculator, which handles the integration to return the p-value.### Binary Logistic Regression#### Model Fitting {.unnumbered}The logistic regression model is fitted using Maximum Likelihood Estimation (MLE) through an Iteratively Reweighted Least Squares (IRLS) algorithm. This process determines the optimal coefficients $(\beta)$ for the predictors.#### Odds Ratio Calculation {.unnumbered}For each predictor variable, the odds ratio is calculated as $exp(β)$, where $\beta$ is the coefficient estimated by the model. 
This represents the change in odds of the outcome for a one-unit increase in the predictor.#### Statistical Significance {.unnumbered}p-values are computed for each predictor using the Wald test. This test calculates a z-statistic, $(\beta/SE(\beta))$, and compares it to a standard normal distribution to determine the probability of observing such a result under the null hypothesis.#### Results Compiled As {.unnumbered}- Predictor variables (Characteristics)- Odds Ratios (OR)- p-values- Sometimes confidence intervals for the ORs (not shown in this example)#### Reference Category {.unnumbered}For categorical predictors, one category is set as the reference (marked with a dash in the OR column). This dashed-out value is very close to 1.00. This process transforms the raw model output into an interpretable format, allowing readers to quickly assess the direction, magnitude, and significance of each predictor's effect on the outcome variable.## Membership {#sec-membership}:::::: {.d-grid .mt-3}::::: {.btn-group role="group" aria-label="third"}::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas771" aria-controls="offcanvas"}Database Operations:::::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas772" aria-controls="offcanvas"}Table Preview:::::::::::::::::::::: {#offcanvas771 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Database Code:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: flex-code```{r}#| label: writeMembership#| code-summary: Write to .db#| tidy: trueif (isFALSE(duckdb::dbExistsTable(dbconn, "db/membership.db"))) {dplyr::tbl(dbconn, filtered_path) |>dplyr::select(member_casual) |>dplyr::arrange(member_casual) |>dplyr::collect() |>duckdb::dbWriteTable(conn = dbconn,name ="db/membership.db",overwrite =TRUE)}```::::::::::::::::::::::: {#offcanvas772 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Tables:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: {layout="[[1]]"}```{r}#| label: tbl-kableMembership#| tbl-cap: Kable output#| tidy: truedplyr::tbl(dbconn, "db/membership.db") |>dplyr::collect() |>head() |>kableExtra::kable()```::::::::::::::::::::: panel-tabset### [Overall]{.panel-tabset-label}::: p-3@tbl-memberTotals and @fig-overallFreq give an idea of the overall member to casual travel distribution.:::::: plot-container```{r}#| label: fig-overallFreq#| fig-cap: "Plotting the overall membership distribution"#| tidy: truegplot <-transformData(conn = dbconn, path ="db/membership.db", select_cols ="member_casual",group_cols ="member_casual", doWeights =TRUE) |>plotter(x_col = member_casual, y_col = n, geomType ="column", title ="How often do members travel?",x_label ="Rider Groups", y_label ="n")gplot```:::::: {.table-container .column-margin}```{r}#| label: tbl-memberTotals#| tbl-cap: "Tabularizing membership distribution across all observations"#| tbl-cap-location: top#| tidy: truetransformData(conn = dbconn, path ="db/membership.db", select_cols ="member_casual",group_cols ="member_casual", doWeights =TRUE) |>tabler (title =NULL,note_list =list("membership status", "occurrences"),location_list =c("member_casual", "n"),source_note = gt::md("**Source**: `db/membership.db`"),noteRows =TRUE,hide_column_labels 
=TRUE,row_index =1)```:::::::::## Cycle Types {#sec-btypes}:::::: {.d-grid .mt-3}::::: {.btn-group role="group" aria-label="third"}::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas771" aria-controls="offcanvas"}Database Operations:::::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas772" aria-controls="offcanvas"}Table Preview::::::::::::::::::::::: {#offcanvas871 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}DB Operations:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}::::::::::::: offcanvas-body::: flex-code```{r}#| label: write bType to duckdb#| code-summary: Write bType.db to the database.#| tidy: trueif (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType.db"))) {dplyr::tbl(dbconn, filtered_path) |>dplyr::select(rideable_type, member_casual) |>dplyr::arrange(rideable_type, member_casual) |>dplyr::collect() |>dplyr::mutate(rideable_type = forcats::as_factor(rideable_type)) |>duckdb::dbWriteTable(conn = dbconn,name ="db/bType.db",overwrite =TRUE)}```:::::: flex-code```{r}#| label: btypeTransform#| code-summary: Transform and write as a weighted binary table for modeling.#| tidy: trueif (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType_wb.db"))) {transformData(conn = dbconn,path ="db/bType.db",select_cols =c("rideable_type", "member_casual"),group_cols =c("rideable_type", "member_casual"),binary_col ="member_casual",zero_val ="casual",one_val ="member",doWeights =TRUE) |>duckdb::dbWriteTable(conn = dbconn,name ="db/bType_wb.db",overwrite =TRUE)}```::::::::::::::::::::::::: {#offcanvas872 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}::::: offcanvas-header::: {.h5 .offcanvas-title}Tables:::::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}:::::::::::: offcanvas-body::: {layout="[[1,2]]"}```{r}#| label: tbl-kableBtype#| tbl-cap: Kable output#| tidy: truedplyr::tbl(dbconn, "db/bType.db") |>dplyr::collect() |>head() |>kableExtra::kable()``````{r}#| label: tbl-kableBtypeW#| tbl-cap: Kable output#| tidy: truedplyr::tbl(dbconn, "db/bType_wb.db") |>dplyr::collect() |>head() |>kableExtra::kable()```:::::::::::::::::::::::::: panel-tabset### [Overall]{.panel-tabset-label}::: p-3@tbl-btypeTotal and @fig-btypeTotal present the overall frequency distribution for choice of bicycle.:::```{r}#| label: fig-btypeTotal#| fig-cap: "The overall distribution of trips taken by bicycle choice"#| tidy: truegplot <-transformData(conn = dbconn, path ="db/bType.db", select_cols ="rideable_type",group_cols ="rideable_type", doWeights =TRUE) |>plotter(x_col = rideable_type, y_col = n, geomType ="column", title =paste0("Which bicycles are preferred?"),x_label ="Type", y_label ="n")gplot```::: {.table-container .column-margin}```{r}#| label: tbl-btypeTotal#| tbl-cap: "The overall distribution of bicycle choice"#| tbl-cap-location: top#| tidy: truetransformData(conn = dbconn, path ="db/bType.db", select_cols ="rideable_type",group_cols ="rideable_type", doWeights =TRUE) |>tabler( title =NULL, note_list =list("bicycle type","occurrences"),location_list =list("rideable_type", "n"),source_note = gt::md("**Source**: `db/bType.db`"),noteRows =TRUE,hide_column_labels =TRUE,row_index =1)```:::### [Comparative]{.panel-tabset-label}:::::::: panel-tabset#### [Frequency]{.panel-tabset-label}::: p-3Likewise, @tbl-btypeGroups and @fig-btypeGroups summarize the choice of 
:::

```{r}
#| label: fig-btypeGroups
#| fig-cap: "The membership distribution of bicycle choice"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/bType.db",
  select_cols = c("rideable_type", "member_casual"),
  group_cols = c("rideable_type", "member_casual"), doWeights = TRUE) |>
  plotter(title = paste0("Which bicycles are members choosing?"),
    x_label = "Type", y_label = "n",
    x_col = rideable_type, y_col = n, group_col = member_casual,
    geomType = "column", is_colGroup = TRUE, color_col = "black",
    colPosition = "dodge", colGroup_palette = "Paired")
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-btypeGroups
#| tbl-cap: "The membership distribution of bicycle choice"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/bType.db",
  select_cols = c("rideable_type", "member_casual"),
  group_cols = c("rideable_type", "member_casual"), doWeights = TRUE) |>
  tabler(title = NULL, groupName = "rideable_type",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/bType.db`"),
    noteRows = TRUE, row_index = 1,
    isStub = TRUE, stub_label = "Type", stub_note = "bicycle type") |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Chi-Square]{.panel-tabset-label}

::: p-3
@tbl-btypeChiSquare presents a Chi-Square analysis of bicycle type usage among casual users and members. In summary, members show a higher preference for classic bikes *(65%)* compared to casual users *(59%)*. Casual users have a higher proportion of electric bike usage *(41%)* compared to members *(35%)*.

There is a statistically significant association between bicycle type and membership status *(p \< 0.001)*. The very low p-value indicates strong evidence against the null hypothesis of no association between bicycle type and membership status. This suggests that the choice of bicycle type is not independent of membership status.

The large $\chi^2$ value (14762.37) with just 1 degree of freedom (calculated as *\[rows - 1\] \* \[columns - 1\]*) results in the very small p-value *(\< 0.001)*. This combination strongly suggests that the difference in bicycle preference between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million trips), even small differences can produce statistically significant results.
:::
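For readers who want to see the degrees-of-freedom arithmetic directly, here is a standalone sketch on a made-up 2x2 table with roughly the proportions reported above; the analysis itself uses the project's `chisqTest()` helper.

```{r}
#| label: chiSketch
#| eval: false
#| code-summary: Illustrative only; made-up counts, not the project's data.
# Made-up counts per 1,000 riders, mirroring the proportions above.
tab <- matrix(c(590, 410, 650, 350), nrow = 2,
              dimnames = list(type = c("classic", "electric"),
                              rider = c("casual", "member")))
res <- chisq.test(tab, correct = FALSE)
res$statistic  # chi-square value
res$parameter  # df = (rows - 1) * (columns - 1) = 1
res$expected   # expected counts under independence
```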
```{r}
#| label: bikesChiResult
#| code-summary: Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
#| tidy: true
data_tibble <- dplyr::tbl(dbconn, "db/bType.db") |>
  dplyr::select(rideable_type, member_casual) |>
  dplyr::collect()
chiResult <- chisqTest(data = data_tibble, variable = "rideable_type", by = "member_casual")
```

```{r}
#| label: tbl-btypeChiSquare
#| tbl-cap: "Testing independence of bicycle choice and membership variables"
#| tidy: true
chi_table <- tabler(
  title = gt::md("Chi-Square:<br>The significance of bicycle choice and membership"),
  source_note = gt::md("**Source**: `db/bType.db`"),
  label = list(rideable_type = "Bicycle Type"),
  by = member_casual, isSummary = TRUE,
  chiVar = "rideable_type", chiBy = "member_casual",
  tbl_name = data_tibble, chi_result = chiResult)
chi_table
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-btypeModel presents the results of a binary logistic regression analyzing the relationship between bicycle type and membership status. The analysis compares classic bikes and electric bikes, with classic bikes serving as the reference category.

The odds of membership for users of electric bikes were *0.76* times the odds for users of classic bikes. The difference in membership likelihood between electric and classic bike users is highly statistically significant *(p \< 0.001)*.
:::

```{r}
#| label: btypeModel
#| code-summary: "Predicting the log-odds of being a member versus being a casual user."
#| tidy: true
model <- dplyr::tbl(dbconn, "db/bType_wb.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ rideable_type, weights = n, family = binomial)
```

::: table-container
```{r}
#| label: tbl-btypeModel
#| tbl-cap: "Modeling the probability of member travel by the choice of bicycle."
#| tidy: true
regression_tbl <- model |>
  gtsummary::tbl_regression(label = list(rideable_type = "Bicycle Choice"),
    conf.int = FALSE, exponentiate = TRUE)
regression_tbl |>
  tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' bicycle choices"),
    source_note = gt::md("**Source**: `db/bType_wb.db`"),
    isBinary = TRUE)
```
:::
::::::
::::::::

## Duration {#sec-duration}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas671" aria-controls="offcanvas"}
Database Operations
:::
::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas672" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

:::::::: {#offcanvas671 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: flex-code
```{r}
#| label: writeDuration
#| code-summary: Write to .db.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(trip_time, member_casual) |>
    dplyr::arrange(trip_time, member_casual) |>
    dplyr::collect() |>
    dplyr::mutate(
      trip_time = round(trip_time, digits = 2),
      mins = round(trip_time, digits = 0),
      mins = forcats::as_factor(mins)) |>
    dplyr::arrange(trip_time, member_casual) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/duration.db", overwrite = TRUE)
}
```
:::
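Before the next chunk runs, it may help to see the shape that `transformData()` produces when `doQuantile` and `doWeights` are enabled: one weighted row per (quartile, membership) pair. Below is a rough dplyr sketch of that idea on simulated data; `transformData()` is this project's own helper, so this is illustrative, not its source.

```{r}
#| label: quartileWeightsSketch
#| eval: false
#| code-summary: Illustrative only; simulated data, not the project's helper.
# Simulated trip durations; bin into quartiles, then count rows per
# (quartile, membership) pair to get the weights column `n`.
demo <- data.frame(
  trip_time     = rexp(1000, rate = 1 / 12),
  member_casual = sample(c("casual", "member"), 1000, replace = TRUE)
)
demo |>
  dplyr::mutate(
    quartile      = dplyr::ntile(trip_time, 4),
    member_casual = as.integer(member_casual == "member")
  ) |>
  dplyr::count(quartile, member_casual, name = "n")
```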
::: flex-code
```{r}
#| label: durationWeightedQuantiles
#| code-summary: Query, transform, and write weighted quartile data.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration_wq.db"))) {
  transformData(conn = dbconn, path = "db/duration.db",
    select_cols = c("trip_time", "member_casual"),
    group_cols = c("trip_time", "member_casual"),
    binary_col = "member_casual", pred_col = "trip_time",
    ntile_col = "quartile", zero_val = "casual", one_val = "member",
    qtile_levels = c("Q1 (1.02 - 5.73]", "Q2 (5.73 - 9.55]",
                     "Q3 (9.55 - 16.13]", "Q4 (16.13 - 475.22]"),
    doQuantile = TRUE, doWeights = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/duration_wq.db", overwrite = TRUE)
}
```
:::
::::
::::::::

:::::::: {#offcanvas672 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: {layout="[[1,2]]"}
```{r}
#| label: tbl-kableDuration
#| tbl-cap: Kable output
#| tidy: true
dplyr::tbl(dbconn, "db/duration.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```

```{r}
#| label: tbl-kableDurationWQ
#| tbl-cap: Kable output
#| tidy: true
dplyr::tbl(dbconn, "db/duration_wq.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
::::
::::::::

:::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-2
@tbl-triptimeTotals and @fig-triptimeTotals give an idea of the overall travel duration distribution. The data were rounded to the nearest minute for these figures.
:::

```{r}
#| label: fig-triptimeTotals
#| fig-cap: "The overall distribution of trip durations"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/duration.db",
  select_cols = "mins", group_cols = "mins", doWeights = TRUE) |>
  plotter(x_col = as.integer(mins), y_col = n, geomType = "column",
    title = paste0("How long are the trips?"),
    x_label = "Minutes", y_label = "n", color_col = "black") +
  ggplot2::scale_x_continuous(limits = c(0, 60), breaks = seq(0, 60, by = 5),
    guide = ggplot2::guide_axis(n.dodge = 1, angle = 45))
gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-triptimeTotals
#| tbl-cap: "The overall distribution of trip durations"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/duration.db",
  select_cols = "mins", group_cols = "mins", doWeights = TRUE) |>
  tabler(title = NULL,
    note_list = list("trip duration", "occurrences"),
    location_list = list("mins", "n"),
    source_note = gt::md("**Source**: `db/duration.db`"),
    noteRows = TRUE, hide_column_labels = TRUE, row_index = 1)
```
:::

### [Comparative]{.panel-tabset-label}

:::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-2
Likewise, @tbl-triptimeCompare and @fig-triptimeCompare summarize the travel duration distribution, where frequencies were summed by membership status.
:::

```{r}
#| label: fig-triptimeCompare
#| fig-cap: "The membership distribution of trip durations"
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/duration.db") |>
  dplyr::select(mins, member_casual) |>
  dplyr::filter(as.integer(mins) <= 100) |>
  dplyr::collect() |>
  transformData(conn = NULL, path = NULL,
    select_cols = c("mins", "member_casual"),
    group_cols = c("mins", "member_casual"),
    doWeights = TRUE, isDF = TRUE) |>
  plotter(title = paste0("How long do members ride for?"),
    x_label = "Minutes", y_label = "n",
    x_col = mins, y_col = n, group_col = member_casual,
    geomType = "column", is_colGroup = TRUE,
    colPosition = ggplot2::position_stack(reverse = TRUE),
    color_col = "black") +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45),
    limits = forcats::as_factor(seq(0, 60)),
    breaks = forcats::as_factor(seq(0, 60, by = 5)))
gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-triptimeCompare
#| tbl-cap: "The membership distribution of trip durations"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/duration.db",
  select_cols = c("mins", "member_casual"),
  group_cols = c("mins", "member_casual"), doWeights = TRUE) |>
  tabler(title = NULL, groupName = "mins", label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/duration.db`"),
    isStub = TRUE, stub_label = "Time",
    stub_note = "trip duration (minutes)",
    row_index = 1, noteRows = TRUE) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Summary Stats]{.panel-tabset-label}

::: p-3
@tbl-durationSummary sheds light on the variability, range, and quartile information about the duration data.
:::

```{r}
#| label: tbl-durationSummary
#| tbl-cap: "The mean with standard deviation, median with interquartile range, and range with min and max of member travel duration."
#| tidy: false
gtTable <- dplyr::tbl(dbconn, "db/duration.db") |>
  dplyr::select(mins, member_casual) |>
  dplyr::mutate(mins = as.numeric(mins)) |>
  dplyr::collect() |>
  gtsummary::tbl_summary(by = member_casual,
    type = mins ~ "continuous2",
    label = list(mins ~ "Duration (mins)"),
    digits = list(mins ~ c(1, 1)),
    statistic = gtsummary::all_continuous() ~
      c("{median} ({p25}, {p75})", "{mean} ({sd})", "{min}, {max}")) |>
  gtsummary::italicize_levels() |>
  tabler(title = gt::md("Summary Stats:<br>Member travel duration"),
    source_note = gt::md("**Source**: `db/duration.db`"),
    isBinary = TRUE)
gtTable
```

#### [Histogram]{.panel-tabset-label}

::: p-3
To accompany the summary table, two plots, @fig-durationCasualHistogram and @fig-durationMemberHistogram, present a highly granular view of the duration dataset (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.
:::
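The anonymous function passed as `binwidth` in the histogram chunks below is the Freedman-Diaconis rule, which scales bin width to the interquartile range and sample size; a small standalone sketch:

```{r}
#| label: fdBinwidthSketch
#| eval: false
#| code-summary: Illustrative only; simulated data.
# Freedman-Diaconis rule: bin width = 2 * IQR(x) / n^(1/3).
fd_binwidth <- function(x) 2 * IQR(x) / (length(x)^(1 / 3))

# Example on simulated right-skewed durations (made-up data).
set.seed(1)
fd_binwidth(rexp(10000, rate = 1 / 12))
```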
```{r}
#| label: durationQuartiles
#| code-summary: "Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/duration.db") |>
  dplyr::select(trip_time, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()
qdf_casual <- dplyr::tbl(dbconn, "db/duration.db") |>
  dplyr::select(trip_time, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()
quartiles_member <- quantile(qdf_member$trip_time, probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(qdf_casual$trip_time, probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-durationCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of casual travel duration."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel duration"),
    x_label = "Minutes", y_label = "n",
    x_col = trip_time, geomType = "column", isHistogram = TRUE,
    angle = 45, color_col = "transparent",
    vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    limits = c(0, 100), breaks = seq(0, 100, by = 5),
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_casual)
gplot
```

```{r}
#| label: fig-durationMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of member travel duration."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' travel duration"),
    x_label = "Minutes", y_label = "n",
    x_col = trip_time, geomType = "column", isHistogram = TRUE,
    angle = 45, color_col = "transparent",
    vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    limits = c(0, 100), breaks = seq(0, 100, by = 5),
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_member)
gplot
```

#### [Density]{.panel-tabset-label}

::: p-3
Supplementing the histogram, @fig-durationDensity shows a density plot comparing the duration of trips for ***casual*** users and ***members***. The x-axis represents time in minutes, limited to a range of 0 to 100 for presentation purposes, while the y-axis shows the density (a measure of relative frequency).

The **member** group (darker blue) has a higher and narrower peak compared to the **casual** group (lighter blue), indicating that members' trip durations are more concentrated around the most common length. The **casual** group has a slightly heavier right tail, extending further than the *member* group, which suggests that casual users occasionally take longer trips, even if such trips are less common.

Both groups exhibit right-skewed distributions, with a peak near the left side and a long tail extending to the right: shorter durations are more common, while longer durations occur less frequently but can extend quite far. Both groups peak at around 5-10 minutes, with members peaking slightly earlier than casual users. There is significant overlap between the two distributions, indicating that while there are differences, there is also considerable similarity in duration patterns between casual users and members.
:::
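`plotter()` is this project's wrapper, but the kind of grouped density comparison it draws can be approximated directly with ggplot2. The sketch below uses simulated data and is illustrative, not the wrapper's actual implementation:

```{r}
#| label: densitySketch
#| eval: false
#| code-summary: Illustrative only; simulated data, not plotter()'s source.
library(ggplot2)
# Simulated trip durations for the two rider groups.
demo <- data.frame(
  trip_time     = c(rexp(5000, 1 / 10), rexp(5000, 1 / 15)),
  member_casual = rep(c("member", "casual"), each = 5000)
)
ggplot(demo, aes(x = trip_time, fill = member_casual)) +
  geom_density(alpha = 0.75, color = "black") +
  coord_cartesian(xlim = c(0, 100)) +
  labs(x = "Minutes", y = "density")
```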
```{r}
#| label: fig-durationDensity
#| fig-cap: "Visualizing the probability-density distribution between duration and membership variables."
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/duration.db",
  select_cols = c("trip_time", "member_casual")) |>
  plotter(title = "Visualizing relative frequency\n across travel duration groups",
    x_label = paste0("Minutes"),
    x_col = trip_time, group_col = member_casual,
    geomType = "column", angle = 45, color_col = "black",
    density_alpha = 0.75, isDensity = TRUE, is_colGroup = TRUE,
    breaks = seq(0, 100, by = 5), limits = c(0, 100))
gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modeldurationQ presents the results of a binary logistic regression model analyzing the relationship between ride duration and membership status. The analysis divides ride duration into quartiles, with ***Q1 (1.02 - 5.73 minutes)*** serving as the reference category.

Compared to *Q1*, the odds of being a member versus a casual rider varied significantly across the other duration quartiles (*p \< 0.001* for all comparisons). With ***Q2 (5.73 - 9.55 minutes)***, the odds of membership were 0.38 times as high as in *Q1*, a substantial decrease (62%) in the likelihood of membership for slightly longer rides. With ***Q3 (9.55 - 16.13 minutes)***, the odds of membership were 0.08 times as high as in *Q1*, a dramatic decrease (92%) in membership likelihood for medium-length rides. With ***Q4 (16.13 - 475.22 minutes)***, the odds of membership were 0.06 times as high as in *Q1*, an even more pronounced decrease (94%) in membership likelihood for the longest rides.
:::
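The percentage decreases quoted above are direct transformations of the exponentiated coefficients; a one-line sketch using the odds ratios from the table (the model itself is fit in the next chunk):

```{r}
#| label: orToPctSketch
#| eval: false
# Odds ratios as reported above; percent decrease = (1 - OR) * 100.
or <- c(Q2 = 0.38, Q3 = 0.08, Q4 = 0.06)
(1 - or) * 100   # 62%, 92%, 94%
```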
```{r}
#| label: modeldurationQ
#| code-summary: Query, process, and create a model R object for duration based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/duration_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

::: table-container
```{r}
#| label: tbl-modeldurationQ
#| tbl-cap: "Modeling the probability of members' travel by the duration."
#| tidy: true
model |>
  gtsummary::tbl_regression(label = list(quartile = "Duration Ranges"),
    conf.int = FALSE, exponentiate = TRUE) |>
  tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel durations"),
    source_note = gt::md("**Source**: `db/duration_wq.db`"),
    isBinary = TRUE)
```
:::
::::::
::::::::

## Month {#sec-moy}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas571" aria-controls="offcanvas"}
Database Operations
:::
::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas572" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

:::::::: {#offcanvas571 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: flex-code
```{r}
#| label: writeMonths
#| code-summary: Write moy.db to the database.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(started_at, member_casual) |>
    dplyr::arrange(started_at) |>
    dplyr::collect() |>
    dplyr::mutate(
      member_casual = factor(member_casual, levels = c("casual", "member")),
      abbMonths = lubridate::month(started_at, label = TRUE, abbr = TRUE),
      abbMonths = forcats::as_factor(abbMonths)) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/moy.db", overwrite = TRUE)
}
```
:::
::: flex-code
```{r}
#| label: monthsWeightedQuantiles
#| code-summary: Query, transform, and write weighted quartile data.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy_wq.db"))) {
  transformData(conn = dbconn, path = "db/moy.db",
    select_cols = c("started_at", "member_casual"),
    group_cols = c("started_at", "member_casual"),
    binary_col = "member_casual", pred_col = "started_at",
    ntile_col = "quartile", zero_val = "casual", one_val = "member",
    qtile_levels = c("Q1 (Jan 01 - May 20]", "Q2 (May 20 - Jul 21]",
                     "Q3 (Jul 21 - Sep 18]", "Q4 (Sep 18 - Dec 31]"),
    doQuantile = TRUE, doWeights = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/moy_wq.db", overwrite = TRUE)
}
```
:::
::::
::::::::

:::::::: {#offcanvas572 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: {layout="[[1,2]]"}
```{r}
#| label: tbl-kableMoy
#| tbl-cap: Kable output for months of the year
#| tidy: true
dplyr::tbl(dbconn, "db/moy.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```

```{r}
#| label: tbl-kableMoyW
#| tbl-cap: Kable output for weighted months of the year.
#| tidy: true
dplyr::tbl(dbconn, "db/moy_wq.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
::::
::::::::

:::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-3
@tbl-monthTotals and @fig-monthTotals show the overall monthly travel distribution.
:::

```{r}
#| label: fig-monthTotals
#| fig-cap: "The overall distribution of trips taken by month"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/moy.db",
  select_cols = "abbMonths", group_cols = "abbMonths", doWeights = TRUE) |>
  plotter(x_col = abbMonths, y_col = n, geomType = "column",
    title = paste0("Which months do people tend to ride?"),
    x_label = paste0("Months", "\n", "(2023)"), y_label = "n")
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-monthTotals
#| tbl-cap: "The overall distribution of trips taken by month"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/moy.db",
  select_cols = "abbMonths", group_cols = "abbMonths", doWeights = TRUE) |>
  tabler(title = NULL,
    note_list = list("month of year", "occurrences"),
    location_list = c("abbMonths", "n"),
    source_note = gt::md("**Source**: `db/moy.db`"),
    noteRows = TRUE, hide_column_labels = TRUE, row_index = 1)
```
:::

### [Comparative]{.panel-tabset-label}

:::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
Similarly, @tbl-monthCompare and @fig-monthCompare summarize the monthly travel, where frequencies were summed by membership status.
:::

```{r}
#| label: fig-monthCompare
#| fig-cap: "The membership distribution of trips taken by month"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/moy.db",
  select_cols = c("abbMonths", "member_casual"),
  group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
  plotter(title = paste0("Which months do members tend to travel?"),
    x_label = "Months", y_label = "n",
    x_col = abbMonths, y_col = n, group_col = member_casual,
    geomType = "column", isFaceted = TRUE, is_colGroup = TRUE)
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-monthCompare
#| tbl-cap: "The membership distribution of trips taken by month"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/moy.db",
  select_cols = c("abbMonths", "member_casual"),
  group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
  tabler(title = NULL, groupName = "abbMonths", location = n, label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/moy.db`"),
    isStub = TRUE, stub_label = "Month",
    stub_note = "months of the year (abbreviated)",
    row_index = 1, noteRows = TRUE) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Chi-Square]{.panel-tabset-label}

::: p-3
@tbl-chiMonths presents a Chi-Square analysis of monthly travel among members and casual users. In summary, members exhibit relatively higher usage in the periods Jan - Mar and Oct - Dec. Both groups prefer the months of Apr - Sep, though casual riders show proportionally higher usage of those months within their own group.

There is a statistically significant association between monthly travel and membership status *(p \< 0.001)*. The very low p-value indicates strong evidence against the null hypothesis of no association between monthly travel and membership status. The frequency of monthly travel is not likely to be independent of membership status.

The large $\chi^2$ value (71802.26) with 11 degrees of freedom (calculated as *\[rows - 1\] \* \[columns - 1\]*) results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million trips), even small differences can produce statistically significant results.
:::

```{r}
#| label: monthsChiResult
#| code-summary: Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
#| tidy: true
data_tibble <- dplyr::tbl(dbconn, "db/moy.db") |>
  dplyr::select(abbMonths, member_casual) |>
  dplyr::arrange(abbMonths, member_casual) |>
  dplyr::collect()
chiResult <- chisqTest(data = data_tibble, variable = "abbMonths", by = "member_casual")
```

```{r}
#| label: tbl-chiMonths
#| tbl-cap: "Testing the independence between month and membership variables"
#| tidy: true
chi_table <- tabler(
  title = gt::md("Chi-Square:<br>The significance of monthly travel on membership"),
  source_note = gt::md("**Source**: `db/moy.db`"),
  label = list(abbMonths = "Month"),
  by = member_casual, isSummary = TRUE,
  tbl_name = data_tibble, chi_result = chiResult)
chi_table
```

#### [Histogram]{.panel-tabset-label}

::: p-3
Two plots, @fig-monthCasualHistogram and @fig-monthMemberHistogram, present a highly granular view of the monthly data (the solid yellow line represents the mean).
Quartile ranges are likewise shown to help understand the variability.
:::

```{r}
#| label: monthlyQuartiles
#| code-summary: "Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/moy.db") |>
  dplyr::select(started_at, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()
qdf_casual <- dplyr::tbl(dbconn, "db/moy.db") |>
  dplyr::select(started_at, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()
quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-monthCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of monthly casual travel."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' monthly travel"),
    x_label = "Months", y_label = "n",
    x_col = started_at, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "1 month", date_labels = "%b", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_casual, qformat = "%b-%d")
gplot
```

```{r}
#| label: fig-monthMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of monthly member travel."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' monthly travel"),
    x_label = "Months", y_label = "n",
    x_col = started_at, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "1 month", date_labels = "%b", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_member, qformat = "%b-%d")
gplot
```

#### [Density]{.panel-tabset-label}

::: p-3
To visualize monthly users through the lens of their respective concentrations, see @fig-monthDensity. The plot looks a little different because the date-time data were used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the middle of the distribution, while the member group has higher peaks near the left and right sides. Both groups exhibit somewhat normal, multi-modal distributions, and they overlap each other significantly, suggesting similarity in travel patterns.
:::

```{r}
#| label: fig-monthDensity
#| fig-cap: "Visualizing the probability-density distribution of months by membership."
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/moy.db") |>
  dplyr::collect() |>
  plotter(title = "Visualizing relative frequency\nacross monthly travel groups",
    x_label = paste0("Months"),
    x_col = started_at, group_col = member_casual,
    geomType = "other", angle = 45, color_col = "black",
    density_alpha = 0.75, isTime = TRUE,
    date_breaks = "1 month", date_labels = "%b")
gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modelMonthsQ presents the results of a binary logistic regression analyzing the relationship between months of the year and membership status. The analysis divides the year into quartiles, with ***Q1 (January 01 - May 20)*** serving as the reference category.
Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p \< 0.001 for all comparisons).

With ***Q2 (May 20 - Jul 21)***, the odds of membership were 0.57 times as high as in Q1. This indicates a substantial decrease (43%) in the likelihood of membership during late spring and early summer. With ***Q3 (Jul 21 - Sep 18)***, the odds of membership were 0.58 times as high as in Q1. This shows a similar decrease (42%) in membership likelihood during late summer and early fall, nearly identical to Q2. And with ***Q4 (Sep 18 - Dec 31)***, the odds of membership were 0.87 times as high as in Q1. While still lower than Q1, this represents a less pronounced decrease (13%) in membership likelihood during fall and early winter.
:::

```{r}
#| label: modelMonthsQ
#| code-summary: Query, process, and create a model R object for month based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/moy_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

::: table-container
```{r}
#| label: tbl-modelMonthsQ
#| tbl-cap: "Modeling the probability of monthly member travel."
#| tidy: true
model |>
  gtsummary::tbl_regression(label = list(quartile = "Month Ranges"),
    conf.int = FALSE, exponentiate = TRUE) |>
  tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of monthly member travel"),
    source_note = gt::md("**Source**: `db/moy_wq.db`"),
    isBinary = TRUE)
```
:::
::::::
::::::::

## Day {#sec-dow}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas471" aria-controls="offcanvas"}
Database Operations
:::
::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas472" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

:::::::: {#offcanvas471 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: flex-code
```{r}
#| label: writeDow
#| code-summary: Write dow.db to the database.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(started_at, member_casual) |>
    dplyr::arrange(started_at) |>
    dplyr::collect() |>
    dplyr::mutate(
      wkdays = lubridate::wday(started_at, week_start = 7),
      member_casual = factor(member_casual, levels = c("casual", "member")),
      # September 2024 begins on a Sunday, so writing wkdays (1 = Sun ... 7 = Sat)
      # into the day slot projects every trip onto the week of Sep 1-7, 2024,
      # giving the date-based plots a common axis.
      started_at = update(started_at, year = 2024, month = 9, day = wkdays),
      abbDays = lubridate::wday(started_at, label = TRUE, abbr = TRUE),
      abbDays = forcats::as_factor(abbDays)) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/dow.db", overwrite = TRUE)
}
```
:::
::: flex-code
```{r}
#| label: writeDaysWQ
#| code-summary: Query, transform, and write weighted quartile data.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow_wq.db"))) {
  transformData(conn = dbconn, path = "db/dow.db",
    select_cols = c("started_at", "member_casual"),
    group_cols = c("started_at", "member_casual"),
    binary_col = "member_casual", pred_col = "started_at",
    ntile_col = "quartile", zero_val = "casual", one_val = "member",
    qtile_levels = c("Q1 (Sun 12:00 am - Mon 11:40 am]",
                     "Q2 (Mon 11:40 am - Wed 05:14 am]",
                     "Q3 (Wed 05:14 am - Fri 12:19 pm]",
                     "Q4 (Fri 12:19 pm - Sat 11:59 pm]"),
    doQuantile = TRUE, doWeights = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/dow_wq.db", overwrite = TRUE)
}
```
:::
::::
::::::::
:::::::: {#offcanvas472 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: {layout="[[1]]"}
```{r}
#| label: tbl-kableDay
#| tbl-cap: Days of the week
#| tidy: true
dplyr::tbl(dbconn, "db/dow.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
::::
::::::::

:::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-3
@tbl-wkdayTotals and @fig-wkdayTotals show an aggregated distribution of the overall daily travel.
:::

```{r}
#| label: fig-wkdayTotals
#| fig-cap: "The overall distribution of trips taken by day of the week"
#| tidy: true
# Values were too similar to visualize differences; see coord_cartesian()
gplot <- transformData(conn = dbconn, path = "db/dow.db",
  select_cols = "abbDays", group_cols = "abbDays", doWeights = TRUE) |>
  plotter(x_col = abbDays, y_col = n, geomType = "column",
    title = paste0("Which days\ndo people tend to ride?"),
    x_label = "Days of the Week", y_label = "n") +
  ggplot2::coord_cartesian(ylim = c(4.5 * 10^5, NA))
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-wkdayTotals
#| tbl-cap: "The overall distribution of trips taken by day of the week"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/dow.db",
  select_cols = "abbDays", group_cols = "abbDays", doWeights = TRUE) |>
  tabler(title = NULL,
    note_list = list("day of the week", "occurrences"),
    location_list = list("abbDays", "n"),
    source_note = gt::md("**Source**: `db/dow.db`"),
    noteRows = TRUE, hide_column_labels = TRUE, row_index = 1)
```
:::

### [Comparative]{.panel-tabset-label}

:::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
Similarly, @tbl-wkdayCompare and @fig-wkdayCompare summarize the daily travel distribution, grouping sums by membership status.
:::

```{r}
#| label: fig-wkdayCompare
#| fig-cap: "The membership distribution of trips taken by day of the week"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/dow.db",
  select_cols = c("abbDays", "member_casual"),
  group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
  plotter(title = "Which days do members tend to ride?",
    x_label = "Days", y_label = "n",
    x_col = abbDays, y_col = n, group_col = member_casual,
    geomType = "column", isFaceted = TRUE, is_colGroup = TRUE)
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-wkdayCompare
#| tbl-cap: "The membership distribution of trips taken by day of the week"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/dow.db",
  select_cols = c("abbDays", "member_casual"),
  group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
  tabler(title = NULL, groupName = "abbDays", label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/dow.db`"),
    isStub = TRUE, stub_label = "Day",
    stub_note = "days of the week (abbreviated)",
    row_index = 1, noteRows = TRUE) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Chi-Square]{.panel-tabset-label}

::: p-3
@tbl-chiDays presents a Chi-Square analysis of daily travel among members and casual users. Members show higher usage Mon - Thu compared to casual users, and their weekday pattern is more regular. Casual users show higher usage Fri - Sun compared to members; their preferences center on the weekend.

There is a statistically significant association between daily travel and membership status *(p \< 0.001)*.
The very low p-value indicates strong evidence against the null hypothesis of no association between daily travel and membership status. The frequency of daily travel is not likely to be independent of membership status.

The large $\chi^2$ value (76305.71) with 6 degrees of freedom (calculated as *\[rows - 1\] \* \[columns - 1\]*) results in the very small p-value. This combination strongly suggests that the difference in daily preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million trips), even small differences can produce statistically significant results.
:::

```{r}
#| label: daysChiResult
#| code-summary: Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
#| tidy: true
data_tibble <- dplyr::tbl(dbconn, "db/dow.db") |>
  dplyr::select(abbDays, member_casual) |>
  dplyr::arrange(abbDays, member_casual) |>
  dplyr::collect()
chiResult <- chisqTest(data = data_tibble, variable = "abbDays", by = "member_casual")
```

```{r}
#| label: tbl-chiDays
#| tbl-cap: "Testing the independence between day and membership variables"
#| tidy: true
tabler(title = gt::md("Chi-Square:<br>The significance of daily travel on membership"),
  source_note = gt::md("**Source**: `db/dow.db`"),
  label = list(abbDays = "Day"),
  by = member_casual, isSummary = TRUE,
  tbl_name = data_tibble, chi_result = chiResult)
```

#### [Histogram]{.panel-tabset-label}

::: p-3
Two plots, @fig-dayCasualHistogram and @fig-dayMemberHistogram, present a highly granular view of the daily data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
:::

```{r}
#| label: dayQuartiles
#| code-summary: "Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/dow.db") |>
  dplyr::select(started_at, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()
qdf_casual <- dplyr::tbl(dbconn, "db/dow.db") |>
  dplyr::select(started_at, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()
quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-dayCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of daily casual travel."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' daily travel"),
    x_label = "Days", y_label = "n",
    x_col = started_at, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "1 day", date_labels = "%a", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_casual, qformat = "%a %I %p")
gplot
```

```{r}
#| label: fig-dayMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of daily member travel."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' daily travel"),
    x_label = "Days", y_label = "n",
    x_col = started_at, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "1 day", date_labels = "%a", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_member, qformat = "%a %I %p")
gplot
```
#### [Density]{.panel-tabset-label}

::: p-3
To visualize daily users through the lens of their respective concentrations, see @fig-dayDensity. The plot looks a little different because the date-time data were used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the left and right sides of the distribution, with its highest peak on Saturday. The member group has higher peaks near the middle of the distribution and shows less day-to-day variability during weekdays than casual users.

On weekdays, both groups show a bimodal distribution, with two peaks each day; on weekends, the distribution is unimodal. The groups overlap each other significantly, suggesting similarity in travel patterns.
:::

```{r}
#| label: fig-dayDensity
#| fig-cap: "Visualizing the probability-density distribution of the day by membership variables."
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/dow.db") |>
  dplyr::select(started_at, member_casual) |>
  dplyr::collect() |>
  plotter(title = "Visualizing relative frequency\nacross daily travel groups",
    x_label = paste0("Day"),
    x_col = started_at, group_col = member_casual,
    geomType = "other", angle = 45, color_col = "black",
    density_alpha = 0.75, isTime = TRUE,
    date_breaks = "1 day", date_labels = "%a")
gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modelDaysQ presents the results of a binary logistic regression analyzing the relationship between days of the week and membership status. The analysis divides the week into quartiles, with Q1 (Sunday 12:00 am - Monday 11:40 am) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p \< 0.001 for all comparisons).

With Q2 (Monday 11:40 am - Wednesday 05:14 am), the odds of membership were 1.52 times as high as in Q1, suggesting a substantial increase in the likelihood of members riding during the early part of the work week. With Q3 (Wednesday 05:14 am - Friday 12:19 pm), the odds of membership were 1.36 times as high as in Q1, a continued higher likelihood of membership during the latter part of the work week, though slightly lower than Q2. With Q4 (Friday 12:19 pm - Saturday 11:59 pm), the odds of membership were 0.80 times as high as in Q1, a significant decrease in the likelihood of membership during the weekend period.
:::

```{r}
#| label: modelDaysQ
#| code-summary: Query, process, and create a model object for day of the week based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/dow_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

```{r}
#| label: tbl-modelDaysQ
#| tbl-cap: "Modeling the probability of daily member travel."
#| tidy: true
model |>
  gtsummary::tbl_regression(label = list(quartile = "Weekday Ranges"),
    conf.int = FALSE, exponentiate = TRUE) |>
  tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of daily member travel"),
    source_note = gt::md("**Source**: `db/dow_wq.db`"),
    isBinary = TRUE)
```
::::::
::::::::

## Hour {#sec-hod}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas371" aria-controls="offcanvas"}
Database Operations
:::
::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas372" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

:::::::: {#offcanvas371 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: flex-code
```{r}
#| label: writeHod
#| code-summary: Write to .db
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(started_at, member_casual) |>
    dplyr::arrange(started_at) |>
    dplyr::collect() |>
    dplyr::mutate(
      # Collapse every timestamp onto a single reference date so that only
      # the time of day varies.
      started_at_time = update(started_at, year = 2023, month = 1, day = 1),
      # Round to the hour, minute, and second, then format as 12-hour clock labels.
      hr = stringr::str_to_lower(format(lubridate::round_date(started_at, unit = "hour"), "%I %p")),
      hrMin = stringr::str_to_lower(format(lubridate::round_date(started_at, unit = "minute"), "%I:%M %p")),
      hrminSec = stringr::str_to_lower(format(lubridate::round_date(started_at, unit = "second"), "%r")),
      hr = forcats::as_factor(hr),
      hrMin = forcats::as_factor(hrMin)) |>
    dplyr::select(member_casual:hrminSec) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/hod.db", overwrite = TRUE)
}
```
:::
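The chunk above leans on `lubridate::round_date()` plus `format()` to turn each timestamp into 12-hour clock labels; a minimal sketch of that pairing on a single made-up timestamp:

```{r}
#| label: hourLabelSketch
#| eval: false
#| code-summary: Illustrative only; a made-up timestamp.
# Round a timestamp, then format it as a 12-hour clock label.
x <- as.POSIXct("2023-06-15 17:42:10", tz = "UTC")
format(lubridate::round_date(x, unit = "hour"), "%I %p")                              # "06 PM"
stringr::str_to_lower(format(lubridate::round_date(x, unit = "minute"), "%I:%M %p"))  # "05:42 pm"
```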
::: flex-code
```{r}
#| label: hoursWeightedQuantiles
#| code-summary: Query, transform, and write weighted quartile data to hod_wq.db.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod_wq.db"))) {
  transformData(conn = dbconn, path = "db/hod.db",
    select_cols = c("started_at_time", "member_casual"),
    group_cols = c("started_at_time", "member_casual"),
    binary_col = "member_casual", pred_col = "started_at_time",
    ntile_col = "quartile", zero_val = "casual", one_val = "member",
    qtile_levels = c("Q1 (12:00 am - 10:59 am]", "Q2 (10:59 am - 03:24 pm]",
                     "Q3 (03:24 pm - 06:05 pm]", "Q4 (06:05 pm - 11:59 pm]"),
    doQuantile = TRUE, doWeights = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/hod_wq.db", overwrite = TRUE)
}
```
:::
::::
::::::::

:::::::: {#offcanvas372 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: {layout="[[1,2]]"}
```{r}
#| label: tbl-hodKable
#| tbl-cap: Kable output of hod.db
#| tidy: true
dplyr::tbl(dbconn, "db/hod.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```

```{r}
#| label: tbl-hoursKableWQ
#| tbl-cap: Kable output of hod_wq.db
#| tidy: true
dplyr::tbl(dbconn, "db/hod_wq.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
::::
::::::::

:::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-3
@tbl-hourTotals and @fig-hourTotals give an aggregated view of the overall hourly travel distribution.
:::

```{r}
#| label: fig-hourTotals
#| fig-cap: "The overall distribution of trips taken by hour of the day"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/hod.db",
  select_cols = "hr", group_cols = "hr", doWeights = TRUE) |>
  plotter(x_col = hr, y_col = n, geomType = "column",
    title = paste0("Which hours do people tend to ride?"),
    x_label = "Hour", y_label = "n") +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-hourTotals
#| tbl-cap: "The overall distribution of trips taken by hour of the day"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/hod.db",
  select_cols = "hr", group_cols = "hr", doWeights = TRUE) |>
  tabler(title = NULL,
    note_list = list("hour of the day", "occurrences"),
    location_list = list("hr", "n"),
    source_note = gt::md("**Source**: `db/hod.db`"),
    noteRows = TRUE, hide_column_labels = TRUE, row_index = 1)
```
:::

### [Comparative]{.panel-tabset-label}

:::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
Similarly, @tbl-hourMembership and @fig-hourCompare summarize the distribution, grouping the sums by membership.
:::

```{r}
#| label: fig-hourCompare
#| fig-cap: "The membership distribution of trips taken by hour of the day"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/hod.db",
  select_cols = c("hr", "member_casual"),
  group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
  plotter(title = paste0("Which hours of the day do members ride?"),
    x_label = "Hour of Day", y_label = "n",
    x_col = hr, y_col = n, group_col = member_casual,
    geomType = "column", isFaceted = TRUE, is_colGroup = TRUE) +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-hourMembership
#| tbl-cap: "The membership distribution of trips taken by hour of the day"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/hod.db",
  select_cols = c("hr", "member_casual"),
  group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
  tabler(title = NULL, groupName = "hr", location = n, label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/hod.db`"),
    isStub = TRUE, stub_label = "Hour",
    stub_note = "hour of the day (12-hour clock)",
    row_index = 1, noteRows = TRUE) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Chi-Square]{.panel-tabset-label}

::: p-3
@tbl-chiHours presents a Chi-Square analysis of hourly travel among members and casual users. In summary:

Compared to casual users, members show relatively higher service usage in the intervals \[5:30am - 9:30am\] and \[4:30pm - 6:30pm\]. Casual users show relatively higher rates in the intervals \[10:30am - 4:30pm\] and \[6:30pm - 5:30am\].

There is a statistically significant association between hourly travel and membership status *(p \< 0.001)*. The very low p-value indicates strong evidence against the null hypothesis of no association between hourly travel and membership status. The frequency of hourly travel is not likely to be independent of membership status.

The large $\chi^2$ value (72733.38) with 23 degrees of freedom (calculated as *\[rows - 1\] \* \[columns - 1\]*) results in the very small p-value. This combination strongly suggests that the difference in hourly preferences between casual users and members is not due to random chance.
However, with such a large sample size (nearly 4 million trips), even small differences can produce statistically significant results.
:::

```{r}
#| label: hoursChiResult
#| code-summary: Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
#| tidy: true
data_tibble <- transformData(conn = dbconn, path = "db/hod.db",
  select_cols = c("hr", "member_casual"))
chiResult <- chisqTest(data = data_tibble, variable = "hr", by = "member_casual")
```

```{r}
#| label: tbl-chiHours
#| tbl-cap: "Testing the independence between hour and membership variables"
#| tidy: true
tabler(title = gt::md("Chi-Square:<br>The significance of hourly travel on membership"),
  source_note = gt::md("**Source**: `db/hod.db`"),
  label = list(hr = "Hour"),
  by = member_casual, isSummary = TRUE,
  tbl_name = data_tibble, chi_result = chiResult)
```

#### [Histogram]{.panel-tabset-label}

::: p-3
Two plots, @fig-hourCasualHistogram and @fig-hourMemberHistogram, present a highly granular view of the hourly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
:::

```{r}
#| label: hourQuartiles
#| code-summary: "Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/hod.db") |>
  dplyr::select(started_at_time, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()
qdf_casual <- dplyr::tbl(dbconn, "db/hod.db") |>
  dplyr::select(started_at_time, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()
quartiles_member <- quantile(qdf_member$started_at_time, probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(qdf_casual$started_at_time, probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-hourCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of hourly casual travel."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' hourly travel"),
    x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n",
    x_col = started_at_time, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "2 hour", date_labels = "%I %p", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_casual, qformat = "%I:%M %p")
gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(.9)))
```

```{r}
#| label: fig-hourMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of hourly member travel."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' hourly travel"),
    x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n",
    x_col = started_at_time, geomType = "column", isHistogram = TRUE, isTimeHist = TRUE,
    date_breaks = "2 hour", date_labels = "%I %p", angle = 45,
    color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
    low = "blue", high = "red",
    binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
    quartiles = quartiles_member, qformat = "%I:%M %p")
gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(.9)))
```

#### [Density]{.panel-tabset-label}

::: p-3
To visualize hourly users through the lens of their respective concentrations, see @fig-hourDensity. The plot looks a little different because the date-time data were used directly when plotting the x-axis.
The y-axis shows the density (a measure of relative frequency).

The member group parallels the bimodal weekday patterns seen in the daily density plot; the casual group does likewise, though to a lesser extent. The two groups largely overlap, with the highest densities for both falling around 5:00 pm.
:::

```{r}
#| label: fig-hourDensity
#| fig-cap: "Visualizing the probability-density distribution of the hour by membership."
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/hod.db") |>
  dplyr::collect() |>
  plotter(title = "Visualizing relative frequency\nacross hourly travel groups",
    x_label = paste0("Hours", "\n", "(12-hour clock)"),
    x_col = started_at_time, group_col = member_casual,
    geomType = "other", angle = 45, color_col = "black",
    density_alpha = 0.75, isTime = TRUE,
    date_breaks = "1 hour", date_labels = "%I %p")
gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modelHourQ presents the results of a binary logistic regression analyzing the relationship between hour of the day and membership status. The analysis divides the day into quartiles, with Q1 (12:00 am - 10:59 am) serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p \< 0.001 for all comparisons). Specifically, the odds of membership were 1.44 times as high in Q2 (10:59 am - 03:24 pm), 1.04 times as high in Q3 (03:24 pm - 06:05 pm), and 0.97 times as high in Q4 (06:05 pm - 11:59 pm).

These results reveal a non-linear relationship between time of day and membership status. The highest likelihood of membership occurs during Q2, corresponding to midday hours. There is a slight increase in membership likelihood during Q3 (late afternoon) compared to the reference period, while evening hours (Q4) show a slight decrease in membership likelihood.
:::
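To connect these odds ratios back to probabilities, the inverse-logit (`plogis()`) can be applied to the model's log-odds. The sketch below uses an illustrative intercept, not a fitted value; the actual coefficients come from the model fit in the next chunk.

```{r}
#| label: hourProbSketch
#| eval: false
# Hypothetical Q1 log-odds (intercept); ORs as reported above.
eta_q1 <- 0.10
or <- c(Q2 = 1.44, Q3 = 1.04, Q4 = 0.97)
plogis(eta_q1)            # predicted P(member) in Q1
plogis(eta_q1 + log(or))  # predicted P(member) in Q2-Q4
```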
```{r}
#| label: modelHourQ
#| code-summary: Query hod_wq.db, process, and create a model R object for hour based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/hod_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

```{r}
#| label: tbl-modelHourQ
#| tbl-cap: "Modeling the probability of hourly member travel."
#| tidy: true
model |>
  gtsummary::tbl_regression(label = list(quartile = "Hour Ranges"),
    conf.int = FALSE, exponentiate = TRUE) |>
  tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of hourly member travel"),
    source_note = gt::md("**Source**: `db/hod_wq.db`"),
    isBinary = TRUE)
```
::::::
::::::::

## Distance {#sec-distance}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas271" aria-controls="offcanvas"}
Database Operations
:::
::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas272" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

:::::::: {#offcanvas271 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: flex-code
```{r}
#| label: writeDistance
#| code-summary: Write distance.db to the database.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(miles, member_casual) |>
    dplyr::arrange(miles) |>
    dplyr::collect() |>
    dplyr::mutate(
      milesR = miles,
      # Round to whole miles at or above 1 mile; keep one decimal place below 1 mile.
      milesR = dplyr::case_when(
        milesR >= 1 ~ round(milesR, digits = 0),
        miles < 1 ~ round(signif(milesR, 3), digits = 1)),
      milesR = forcats::as_factor(milesR)) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/distance.db", overwrite = TRUE)
}
```
:::
::: flex-code
```{r}
#| label: writeDistanceWeightedQuantiles
#| code-summary: Query, transform, and write weighted quartile data.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance_wq.db"))) {
  transformData(conn = dbconn, path = "db/distance.db",
    select_cols = c("miles", "member_casual"),
    group_cols = c("miles", "member_casual"),
    binary_col = "member_casual", pred_col = "miles",
    ntile_col = "quartile", zero_val = "casual", one_val = "member",
    qtile_levels = c("Q1 (0.10 - 0.63]", "Q2 (0.63 - 1.02]",
                     "Q3 (1.02 - 1.76]", "Q4 (1.76 - 20.5]"),
    doQuantile = TRUE, doWeights = TRUE) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/distance_wq.db", overwrite = TRUE)
}
```
:::
::::
::::::::

:::::::: {#offcanvas272 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::
::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::
:::: offcanvas-body
::: {layout="[[1,2]]"}
```{r}
#| label: tbl-kableDistance
#| tbl-cap: Distance
#| tidy: true
dplyr::tbl(dbconn, "db/distance.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```

```{r}
#| label: tbl-kableDistanceWQ
#| tbl-cap: Distance, Weighted Quantiles
#| tidy: true
dplyr::tbl(dbconn, "db/distance_wq.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
::::
::::::::

:::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-3
@tbl-milesTotals and @fig-milesTotals show the overall distribution of distances traveled.
:::

```{r}
#| label: fig-milesTotals
#| fig-cap: "The overall distribution of trips taken by the estimated distance traveled"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/distance.db",
  select_cols = "milesR", group_cols = "milesR", doWeights = TRUE) |>
  plotter(x_col = milesR, y_col = n, geomType = "column",
    title = paste0("How far do people ride?"),
    x_label = "Miles", y_label = "n")
gplot +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 7))
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-milesTotals
#| tbl-cap: "The overall distribution of trips taken by the estimated distance traveled"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/distance.db",
  select_cols = "milesR", group_cols = "milesR", doWeights = TRUE) |>
  tabler(title = NULL,
    note_list = list("distance (miles)", "occurrences"),
    location_list = list("milesR", "n"),
    source_note = gt::md("**Source**: `db/distance.db`"),
    noteRows = TRUE, hide_column_labels = TRUE, row_index = 1,
    label_parameter = "milesR", align_parameter = "right")
```
:::

### [Comparative]{.panel-tabset-label}

:::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
@tbl-milesCompare and @fig-membershipMiles likewise summarize the distance distribution by membership status.
:::

```{r}
#| label: fig-membershipMiles
#| fig-cap: "The membership distribution of trips taken by the estimated distance traveled"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/distance.db",
  select_cols = c("milesR", "member_casual"),
  group_cols = c("milesR", "member_casual"), doWeights = TRUE) |>
  plotter(title = paste0("How far do members travel?"),
    x_label = "Miles", y_label = "n",
    x_col = milesR, y_col = n, group_col = member_casual,
    geomType = "column", isFaceted = TRUE, is_colGroup = TRUE)
gplot +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 7))
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-milesCompare
#| tbl-cap: "The membership distribution of trips taken by the estimated distance traveled"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/distance.db",
### [Comparative]{.panel-tabset-label}

::::::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
@tbl-milesCompare and @fig-membershipMiles likewise summarize the distance distribution by membership status.
:::

```{r}
#| label: fig-membershipMiles
#| fig-cap: "The membership distribution of trips taken by the estimated distance traveled"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/distance.db",
                       select_cols = c("milesR", "member_casual"),
                       group_cols = c("milesR", "member_casual"),
                       doWeights = TRUE) |>
  plotter(title = paste0("How far do members travel?"),
          x_label = "Miles", y_label = "n",
          x_col = milesR, y_col = n, group_col = member_casual,
          geomType = "column", isFaceted = TRUE, is_colGroup = TRUE)

gplot +
  ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
  ggplot2::theme(axis.text = ggplot2::element_text(size = 7))
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-milesCompare
#| tbl-cap: "The membership distribution of trips taken by the estimated distance traveled"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/distance.db",
              select_cols = c("milesR", "member_casual"),
              group_cols = c("milesR", "member_casual"),
              doWeights = TRUE) |>
  tabler(
    title = NULL,
    groupName = "milesR",
    location = n,
    label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/distance.db`"),
    isStub = TRUE,
    stub_label = "Distance",
    stub_note = "miles traveled per trip",
    row_index = 1,
    noteRows = TRUE
  ) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Summary Stats]{.panel-tabset-label}

::: p-3
@tbl-distanceSummary sheds light on the variability, range, and quartile information of the distance-by-membership data.
:::

```{r}
#| label: tbl-distanceSummary
#| tbl-cap: "The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel distance."
#| tidy: false
gTable <- dplyr::tbl(dbconn, "db/distance.db") |>
  dplyr::select(miles, member_casual) |>
  dplyr::collect() |>
  gtsummary::tbl_summary(
    by = member_casual,
    type = miles ~ "continuous2",
    label = list(miles ~ "Distance (miles)"),
    digits = list(miles ~ c(2, 2)),
    statistic = gtsummary::all_continuous() ~
      c("{median} ({p25}, {p75})", "{mean} ({sd})", "{min}, {max}")
  ) |>
  gtsummary::italicize_levels() |>
  tabler(
    title = gt::md("Summary Stats:<br>Member travel distance"),
    source_note = gt::md("**Source**: `db/distance.db`"),
    isBinary = TRUE
  )

gTable
```

#### [Histogram]{.panel-tabset-label}

::: p-3
Two plots, @fig-distanceCasualHistogram and @fig-distanceMemberHistogram, present a highly granular view of the distance data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
:::

```{r}
#| label: distanceQuartiles
#| code-summary: "Creates data frames, then extracts the quartile info that supplements the histogram visualizations."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/distance.db") |>
  dplyr::select(miles, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/distance.db") |>
  dplyr::select(miles, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()

quartiles_member <- quantile(as.numeric(qdf_member$miles), probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(as.numeric(qdf_casual$miles), probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-distanceCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of casual travel distance."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel distance"),
          x_label = "Miles", y_label = "n",
          x_col = miles, geomType = "column", isHistogram = TRUE,
          angle = 45, color_col = "transparent",
          vline_color = "lightyellow", vline_size = 0.5,
          low = "blue", high = "red",
          binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
          limits = c(0.1, 5), breaks = seq(0.1, 5, by = 0.5),
          quartiles = quartiles_casual)

gplot
```

```{r}
#| label: fig-distanceMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of member travel distance."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' travel distance"),
          x_label = "Miles", y_label = "n",
          x_col = miles, geomType = "column", isHistogram = TRUE,
          angle = 45, color_col = "transparent",
          vline_color = "lightyellow", vline_size = 0.5,
          low = "blue", high = "red",
          binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
          # binwidth = 0.2,
          limits = c(0.1, 5), breaks = seq(1, 5, by = 1),
          quartiles = quartiles_member)

gplot
```
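The anonymous `binwidth` function passed to both histograms is the Freedman-Diaconis rule, bin width = 2·IQR·n^(-1/3). It can be checked on its own, reusing `qdf_member` from the chunk above:

```{r}
#| label: demo-fd-binwidth
#| code-summary: "Standalone check of the Freedman-Diaconis binwidth."
# Freedman-Diaconis: robust to outliers because it uses the IQR, and it
# shrinks the bins as the sample size grows.
fd_binwidth <- function(x) 2 * IQR(x) / (length(x)^(1 / 3))
fd_binwidth(as.numeric(qdf_member$miles))
```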
#### [Density]{.panel-tabset-label}

::: p-3
To visualize travel distance patterns through the lens of their respective concentrations, see @fig-distanceDensity. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike between the 0.1 and 1 mile distances, and less concentration than the casual group across the 1-3 mile range. The casual group displays a broader spike that overlaps the member group substantially, but it peaks closer to the 1 mile mark; casual riders are more concentrated than members between 1 and 3 miles, and less concentrated between 0.1 and 1 mile. Both groups overlap nearly completely from around the 3 mile mark onwards.
:::

```{r}
#| label: fig-distanceDensity
#| fig-cap: "Visualizing the probability-density distribution of distance by membership."
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/distance.db") |>
  dplyr::select(miles, member_casual) |>
  dplyr::collect() |>
  plotter(title = "Visualizing relative frequency\nacross travel distance groups",
          x_label = "Miles",
          x_col = miles, group_col = member_casual,
          geomType = "column", angle = 45,
          color_col = "black", density_alpha = 0.75,
          isDensity = TRUE, is_colGroup = TRUE,
          breaks = seq(0, 11, by = 1), limits = c(0.1, 11))

gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modeldistanceQ presents the odds ratios for membership status across distance quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly lower in all other quartiles (p \< 0.001 for all comparisons). Specifically, the odds of membership were 0.63 times as high in Q2, 0.59 times as high in Q3, and 0.65 times as high in Q4.

These results indicate an inverse relationship between ride distance and membership status, with members generally associated with shorter ride distances. Interestingly, the lowest odds of membership were observed in Q3, rather than Q4, suggesting a non-linear relationship between distance and membership likelihood.
:::
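Because the model has a single categorical predictor, each reported odds ratio can be reproduced directly from the weighted counts. A hypothetical cross-check (assuming `member_casual` is coded with casual as the zero level and member as the one level, as set up by `transformData()` above):

```{r}
#| label: demo-distance-or
#| code-summary: "Hypothetical by-hand check of the Q2 odds ratio."
wq <- dplyr::tbl(dbconn, "db/distance_wq.db") |> dplyr::collect()

# Sum the weights into a quartile-by-membership contingency table;
# column 1 is the casual/zero level, column 2 the member/one level.
tab <- xtabs(n ~ quartile + member_casual, data = wq)

# OR(Q2 vs Q1) = (members_Q2 / casuals_Q2) / (members_Q1 / casuals_Q1);
# this should match the ~0.63 reported in the regression table below.
(tab[2, 2] / tab[2, 1]) / (tab[1, 2] / tab[1, 1])
```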
```{r}
#| label: modeldistanceQ
#| code-summary: Query, process, and create the model R object for distance based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/distance_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

```{r}
#| label: tbl-modeldistanceQ
#| tbl-cap: "Modeling the probability of members' travel by distance."
#| tidy: true
model |>
  gtsummary::tbl_regression(
    label = list(quartile = "Distance Ranges"),
    conf.int = FALSE,
    exponentiate = TRUE
  ) |>
  tabler(
    title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel distance"),
    source_note = gt::md("**Source**: `db/distance_wq.db`"),
    isBinary = TRUE
  )
```
:::::::::
:::::::::::

## Speed {#sec-speed}

:::::: {.d-grid .mt-3}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-dark type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas171" aria-controls="offcanvas"}
Database Operations
:::

::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas172" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::::

::: {#offcanvas171 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
DB Operations
:::

::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::

::::: offcanvas-body
::: flex-code
```{r}
#| label: writeSpeed
#| code-summary: "Write speed.db to the database."
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed.db"))) {
  dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(mph, member_casual) |>
    dplyr::collect() |>
    dplyr::mutate(mphR = round(mph, digits = 0)) |>
    dplyr::arrange(mph, member_casual) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/speed.db", overwrite = TRUE)
}
```
:::
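::: flex-code
Throughout these regressions, one row per (quartile, membership) group with `weights = n` stands in for millions of individual trip rows. A toy demonstration, on synthetic data, that the two forms fit identical coefficients:

```{r}
#| label: demo-weighted-glm
#| code-summary: "Toy check: a frequency-weighted glm equals the expanded-row glm."
# Synthetic counts: outcome y (0/1) by group g, with frequencies n.
agg <- data.frame(
  y = c(0, 1, 0, 1),
  g = factor(c("Q1", "Q1", "Q2", "Q2")),
  n = c(30, 10, 15, 25)
)
fit_weighted <- glm(y ~ g, family = binomial, data = agg, weights = n)

# Expand to one row per observation and refit without weights.
expanded <- agg[rep(seq_len(nrow(agg)), agg$n), c("y", "g")]
fit_expanded <- glm(y ~ g, family = binomial, data = expanded)

all.equal(coef(fit_weighted), coef(fit_expanded))  # TRUE
```
:::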
::: flex-code
```{r}
#| label: speedWeightedQuantiles
#| code-summary: Query, transform, and write weighted quartile data.
#| tidy: true
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed_wq.db"))) {
  transformData(
    conn = dbconn,
    path = "db/speed.db",
    select_cols = c("mph", "member_casual"),
    group_cols = c("mph", "member_casual"),
    binary_col = "member_casual",
    pred_col = "mph",
    ntile_col = "quartile",
    zero_val = "casual",
    one_val = "member",
    qtile_levels = c("Q1 (1.0 - 5.4]", "Q2 (5.4 - 7.0]", "Q3 (7.0 - 8.6]", "Q4 (8.6 - 20]"),
    doQuantile = TRUE,
    doWeights = TRUE
  ) |>
    duckdb::dbWriteTable(conn = dbconn, name = "db/speed_wq.db", overwrite = TRUE)
}
```
:::
:::::
:::

::: {#offcanvas172 .offcanvas .offcanvas-start tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::

::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::

::::: offcanvas-body
::: {layout="[[1,2]]"}
```{r}
#| label: tbl-kableSpeed
#| tbl-cap: Speed
#| tidy: true
dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```

```{r}
#| label: tbl-kableSpeedWQ
#| tbl-cap: Speed with weighted quartile groups.
#| tidy: true
dplyr::tbl(dbconn, "db/speed_wq.db") |>
  dplyr::collect() |>
  head() |>
  kableExtra::kable()
```
:::
:::::
:::

::::::::::: panel-tabset
### [Overall]{.panel-tabset-label}

::: p-3
@tbl-mphTotals and @fig-mphTotals show the overall distribution of travel speeds.
:::

```{r}
#| label: fig-mphTotals
#| fig-cap: "The overall distribution of trips taken by the estimated average speed"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR",
                       group_cols = "mphR", doWeights = TRUE) |>
  plotter(x_col = mphR, y_col = n, geomType = "column",
          title = paste0("How fast do people ride?"),
          x_label = "Miles per Hour", y_label = "n")

gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-mphTotals
#| tbl-cap: "The overall distribution of trips taken by the estimated average speed"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR",
              group_cols = "mphR", doWeights = TRUE) |>
  tabler(
    title = NULL,
    note_list = list("travel speed (miles per hour)", "occurrences"),
    location_list = list("mphR", "n"),
    source_note = gt::md("**Source**: `db/speed.db`"),
    noteRows = TRUE,
    hide_column_labels = TRUE,
    row_index = 1,
    label_parameter = "mphR",
    align_parameter = "right"
  )
```
:::
### [Comparative]{.panel-tabset-label}

::::::::: panel-tabset
#### [Frequency]{.panel-tabset-label}

::: p-3
@tbl-mphCompare and @fig-mphCompare summarize the travel speed distribution by membership.
:::

```{r}
#| label: fig-mphCompare
#| fig-cap: "The membership distribution of trips taken by the estimated average speed"
#| tidy: true
gplot <- transformData(conn = dbconn, path = "db/speed.db",
                       select_cols = c("mphR", "member_casual"),
                       group_cols = c("mphR", "member_casual"),
                       doWeights = TRUE) |>
  plotter(title = paste0("How fast do members travel?"),
          x_label = "Miles per Hour", y_label = "n",
          x_col = mphR, y_col = n, color_col = member_casual,
          geomType = "column", is_colGroup = TRUE, isFaceted = TRUE)

gplot
```

::: {.table-container .column-margin}
```{r}
#| label: tbl-mphCompare
#| tbl-cap: "The membership distribution of trips taken by the estimated average speed"
#| tbl-cap-location: top
#| tidy: true
transformData(conn = dbconn, path = "db/speed.db",
              select_cols = c("mphR", "member_casual"),
              group_cols = c("mphR", "member_casual"),
              doWeights = TRUE) |>
  tabler(
    title = NULL,
    groupName = "mphR",
    location = n,
    label_n = "n",
    note_list = list("membership status", "occurrences"),
    location_list = list("member_casual", "n"),
    source_note = gt::md("**Source**: `db/speed.db`"),
    isStub = TRUE,
    stub_label = "Speed",
    stub_note = "estimated average speed traveled per trip (mph)",
    row_index = 1,
    noteRows = TRUE
  ) |>
  gt::cols_label(member_casual = " ", n = " ")
```
:::

#### [Summary Stats]{.panel-tabset-label}

::: p-3
@tbl-speedSummary summarizes the variability, range, and quartile information of the speed-by-membership data.
:::

```{r}
#| label: tbl-speedSummary
#| tbl-cap: "The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel speed."
#| tidy: false
dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::select(mph, member_casual) |>
  dplyr::collect() |>
  gtsummary::tbl_summary(
    by = member_casual,
    type = mph ~ "continuous2",
    label = list(mph ~ "Speed"),
    digits = list(mph ~ c(1, 1)),
    statistic = gtsummary::all_continuous() ~
      c("{median} ({p25}, {p75})", "{mean} ({sd})", "{min}, {max}")
  ) |>
  gtsummary::italicize_levels() |>
  tabler(
    title = gt::md("Summary Stats:<br>Member travel speed"),
    source_note = gt::md("**Source**: `db/speed.db`"),
    isBinary = TRUE
  )
```

#### [Histogram]{.panel-tabset-label}

::: p-3
Two plots, @fig-speedCasualHistogram and @fig-speedMemberHistogram, present a highly granular view of the travel speed data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
:::

```{r}
#| label: speedQuartiles
#| code-summary: "Creates data frames, then extracts the quartile info that supplements the histogram visualizations."
#| tidy: true
qdf_member <- dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::select(mph, member_casual) |>
  dplyr::filter(member_casual == "member") |>
  dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::select(mph, member_casual) |>
  dplyr::filter(member_casual == "casual") |>
  dplyr::collect()

quartiles_member <- quantile(qdf_member$mph, probs = c(0.25, 0.5, 0.75))
quartiles_casual <- quantile(qdf_casual$mph, probs = c(0.25, 0.5, 0.75))
```

```{r}
#| label: fig-speedCasualHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of casual travel speed."
#| tidy: true
gplot <- qdf_casual |>
  plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel speed"),
          x_label = "Miles per Hour", y_label = "n",
          x_col = mph, geomType = "column", isHistogram = TRUE,
          angle = 45, color_col = "transparent",
          vline_color = "lightyellow", vline_size = 0.5,
          low = "blue", high = "red",
          binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
          breaks = seq(1, 20, by = 2), limits = c(1, 20),
          quartiles = quartiles_casual)

gplot
```

```{r}
#| label: fig-speedMemberHistogram
#| fig-cap: "A more detailed look that illustrates the quartile distribution of member travel speed."
#| tidy: true
gplot <- qdf_member |>
  plotter(title = paste0("The quartile", "\n", "distribution of members' travel speed"),
          x_label = "Miles per Hour", y_label = "n",
          x_col = mph, geomType = "column", isHistogram = TRUE,
          angle = 45, color_col = "transparent",
          vline_color = "lightyellow", vline_size = 0.5,
          low = "blue", high = "red",
          binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
          # binwidth = 0.5,
          breaks = seq(1, 20, by = 2), limits = c(1, 20),
          quartiles = quartiles_member)

gplot
```
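The quartile cutoffs drawn on these histograms, like the hard-coded `qtile_levels` written to `db/speed_wq.db` earlier, can be verified against the empirical quantiles of the speed data:

```{r}
#| label: demo-speed-quartiles
#| code-summary: "Check the hard-coded speed quartile labels against the data."
# The breakpoints should line up with Q1 (1.0 - 5.4], ..., Q4 (8.6 - 20].
dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::pull(mph) |>
  quantile(probs = seq(0, 1, by = 0.25))
```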
#### [Density]{.panel-tabset-label}

::: p-3
To visualize travel speed patterns through the lens of their respective concentrations, see @fig-speedDensity. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike centering around 7 mph, with a higher concentration than the casual group between roughly 5 and 15 mph. The casual group has a broader, shorter spike centering closer to 6 mph, with a higher concentration than the member group between roughly 1 and 5 mph. Alongside significant overlap, both groups exhibit a unimodal distribution.
:::

```{r}
#| label: fig-speedDensity
#| fig-cap: "Visualizing the probability-density distribution of speed by membership."
#| tidy: true
gplot <- dplyr::tbl(dbconn, "db/speed.db") |>
  dplyr::select(mph, member_casual) |>
  dplyr::collect() |>
  plotter(title = "Visualizing relative frequency\nacross travel speed groups",
          x_label = "Miles per Hour",
          x_col = mph, group_col = member_casual,
          geomType = "column", angle = 45,
          color_col = "black", density_alpha = 0.75,
          isDensity = TRUE, is_colGroup = TRUE,
          breaks = seq(1, 20, by = 1), limits = c(1, 20))

gplot
```

#### [Binary Logistic Regression]{.panel-tabset-label}

::: p-3
@tbl-modelspeedQ presents the odds ratios for membership status across speed quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly higher in all other quartiles (p \< 0.001 for all comparisons). Specifically, the odds of membership were 2.09 times as high in Q2, 2.50 times as high in Q3, and 2.69 times as high in Q4.

These results suggest a strong positive association between riding speed and membership status, with the likelihood of membership increasing monotonically across speed quartiles.
:::

```{r}
#| label: modelspeedQ
#| code-summary: Query, process, and create the model R object for speed based on quartile range.
#| tidy: true
model <- dplyr::tbl(dbconn, "db/speed_wq.db") |>
  dplyr::collect() |>
  glm(formula = member_casual ~ quartile, family = binomial, weights = n)
```

```{r}
#| label: tbl-modelspeedQ
#| tbl-cap: "Modeling the probability of members' travel by speed."
#| tidy: true
model |>
  gtsummary::tbl_regression(
    label = list(quartile = "Speed Ranges"),
    conf.int = FALSE,
    exponentiate = TRUE
  ) |>
  tabler(
    title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel speed"),
    source_note = gt::md("**Source**: `db/speed_wq.db`"),
    isBinary = TRUE
  )
```
:::::::::
:::::::::::

## Summarizing Tabsets

The EDA (exploratory data analysis) sections employ various statistical methods to uncover patterns in user behavior and preferences. A chi-square analysis reveals a significant association between bicycle type and membership status (p \< 0.001). Binary logistic regression further quantifies this relationship, showing that electric bike users have lower odds of being members compared to classic bike users. @sec-btypes
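A minimal sketch of that chi-square test, recomputed from the complete-observations table. This assumes `complete_path` and the `rideable_type` column are in scope as set up during import, so the chunk is shown without being evaluated:

```{r}
#| label: demo-chisq-btypes
#| code-summary: "Hypothetical re-run of the bike type vs. membership chi-square test."
#| eval: false
btype_tab <- dplyr::tbl(dbconn, complete_path) |>
  dplyr::count(rideable_type, member_casual) |>
  dplyr::collect() |>
  xtabs(n ~ rideable_type + member_casual, data = _)

chisq.test(btype_tab)
```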
Trip duration analysis, also utilizing binary logistic regression, uncovers a notable trend: the odds of membership decrease substantially as trip duration increases. This model, using quartiles of trip duration, indicates that members generally take shorter, more concentrated trips, while casual users are more likely to engage in longer rides. @sec-duration

Seasonal trends emerge when examining monthly ridership patterns through another logistic regression model. The odds of membership fluctuate throughout the year, with the highest proportion of members riding during the colder months and early spring. As the weather warms, there is a noticeable shift towards more casual ridership, as evidenced by lower odds ratios in the summer months. @sec-moy

Weekly and daily patterns, analyzed using similar regression techniques, provide further insights into user behavior. Weekdays, @sec-dow, particularly during typical commuting hours, @sec-hod, see higher odds of member rides. In contrast, weekends and evenings show decreased odds of membership, suggesting an increased likelihood of casual ridership during these times.

The analysis of trip distances, again using logistic regression with distance quartiles, reveals an inverse relationship with membership status. Members are more likely to take shorter trips, while casual users tend to embark on longer journeys. This aligns with the duration findings and reinforces the idea that members use the service for quick, routine travel. @sec-distance

Interestingly, trip speed shows a strong positive association with membership status in the logistic regression model. The odds of membership increase monotonically across speed quartiles, indicating that members generally ride at higher speeds compared to casual users. @sec-speed

These findings, derived from a combination of chi-square tests for independence and multiple binary logistic regression models, paint a picture of two distinct user groups: members who typically use the bike share for short, fast, routine trips during weekdays, and casual users who tend to take longer, slower rides, often during leisure hours or weekends.

Contingency tables and visualizations, including density plots and histograms, supplement these statistical analyses, providing a comprehensive view of the data distribution across various parameters such as bike type, trip duration, time of day, and day of the week.

The robust statistical approach, combining hypothesis testing (chi-square) with predictive modeling (logistic regression), provides strong evidence for the observed patterns in user behavior. These insights could prove valuable for tailoring marketing strategies, optimizing bike distribution, and enhancing service offerings to better serve both user segments.

## Geographic Data

### Traffic Flow {#sec-epiflow}

::: p-1
@fig-epiflowNetwork presents an intriguing bird's-eye view of trip behaviors through an interactive *epiflows* graph. [@moraga] The R package used to create this graph was repurposed from its original intent of visualizing the spread of disease. The visualization employs a network of nodes (circles) connected by lines, where the thickness of a line roughly corresponds to the volume of trips between the nodes it connects, with thicker lines indicating a higher number of trips. The top 34 most frequently traveled stations are depicted in this network diagram.

The interactive nature of the epiflows graph allows users to click on individual nodes and lines to access more detailed information. Additionally, a drop-down window provides further exploration capabilities, enabling users to delve deeper into the data.

These stations represent the most active locations within the system.
Fortunately, @sec-mapview explores a potential approach to gaining insight into typical high-traffic station locations and the underlying reasons behind their elevated activity levels.
:::

::: {#offcanvas13 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Creating an EpiFlow
:::

::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::

::::: offcanvas-body
::: flex-code
```{r}
#| label: createFlows-dataframe
#| code-summary: "First, creates the frequency of trips taken to and from pairs of stations.<br>We are only going to look deeper into the top 50 most traveled pairs."
#| tidy: true
flowData <- dplyr::tbl(dbconn, filtered_path) |>
  dplyr::select(start_station_name, end_station_name) |>
  dplyr::group_by(start_station_name, end_station_name) |>
  dplyr::summarize(n = n()) |>
  dplyr::ungroup() |>
  dplyr::arrange(desc(n)) |>
  dplyr::rename("from_station" = start_station_name, "to_station" = end_station_name) |>
  dplyr::collect() |>
  dplyr::slice_head(n = 50)
```

```{r}
#| label: location-stats
#| code-summary: "Second, we need statistics, but also to combine the statistics for every unique station name."
#| tidy: true
locationData <- dplyr::tbl(dbconn, filtered_path) |>
  dplyr::select(start_station_name, end_station_name, started_at, ended_at, trip_time) |>
  dplyr::group_by(start_station_name, end_station_name) |>
  dplyr::mutate("trip_time" = round(trip_time, digits = 0)) |>
  dplyr::summarize(
    "trip_count" = dplyr::n(),
    "first_date" = min(started_at),
    "last_date" = max(ended_at)
  ) |>
  dplyr::ungroup() |>
  dplyr::rename("from_station" = start_station_name, "to_station" = end_station_name) |>
  dplyr::arrange(desc(trip_count)) |>
  dplyr::collect()

# Combine all names into a single column, then recalculate or
# retain the other stats.
locationData_pivoted <- locationData |>
  tidyr::pivot_longer(cols = 1:2, values_to = "allNames") |>
  dplyr::group_by(allNames) |>
  dplyr::summarize(
    "trips_toAndfrom" = sum(trip_count),
    first_date = min(first_date),
    last_date = max(last_date)
  ) |>
  dplyr::arrange(trips_toAndfrom)
```

```{r}
#| label: MakeEpiflows
#| code-summary: Third, creates the epiflows object, which takes in a pair of dataframes and creates the flows between them.
#| tidy: true
# For all the station pairs
ef_test <- epiflows::make_epiflows(
  flows = flowData,
  locations = locationData_pivoted,
  num_cases = "trips_toAndfrom"
)
```
:::
:::::
:::

::: {#offcanvas14 .offcanvas .offcanvas-end tabindex="1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Tables
:::

::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::

::::: offcanvas-body
::: flex-code
```{r}
#| label: summaryFlowData
#| code-summary: First, just a quick view of the flow data table we made earlier.
#| title: Flow Data View
#| tidy: true
flowData
```

```{r}
#| label: pivotedLocations
#| code-summary: Second, another quick view, but for the location data we pivoted earlier.
#| title: Pivoted Location Data
#| tidy: true
locationData_pivoted |>
  dplyr::arrange(desc(trips_toAndfrom))
```
:::
:::::
:::

::: {.article style="color: Black"}
```{r}
#| label: fig-epiflowNetwork
#| fig-cap: EpiFlow Network
#| echo: false
#| tidy: true
epiflows::vis_epiflows(ef_test)
```
:::
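To inspect the object behind the graph, the epiflows package provides accessor functions that return the underlying data frames; a brief, optional sketch (not evaluated here):

```{r}
#| label: demo-epiflows-accessors
#| code-summary: "Optional: pull the flow and location data back out of ef_test."
#| eval: false
# Accessors from the epiflows package return the two input data frames.
epiflows::get_flows(ef_test) |> head()
epiflows::get_locations(ef_test) |> head()
```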
:::: {.d-grid .gap-0}
::::: {.btn-group role="group" aria-label="third"}
::: {.btn .btn-primary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas13" aria-controls="offcanvas"}
Code Steps
:::

::: {.btn .btn-secondary type="button" data-bs-toggle="offcanvas" data-bs-target="#offcanvas14" aria-controls="offcanvas"}
Table Preview
:::
:::::
::::

### Checking the Map {#sec-mapview}

::: p-1
This section was made possible by the latitude and longitude coordinates provided alongside the station names. Coming from the epiflow diagram, it should help make the data less abstract. The accordion below expands and collapses four *OpenStreetMap* maps, which were split up for viewing logistics; together they contain the stations from the epiflow in the section above. The maps are interactive, so the default views can be zoomed and panned, and the transparent burst buttons enable snappy zooming in on the station groups.
:::

::: {#offcanvas20 .offcanvas .offcanvas-start tabindex="-1" aria-labelledby="offcanvas" style="width: auto"}
::::: offcanvas-header
::: {.h5 .offcanvas-title}
Code for Mapping
:::

::: {.btn-close type="button" data-bs-dismiss="offcanvas" ariaLabel="Close"}
:::
:::::

::::: offcanvas-body
::: flex-code
```{r}
#| label: mapData
#| code-summary: "Processing 'flowData' created earlier to include geolocation data for mapview plots."
#| tidy: true
# All distinct stations in one column
names <- flowData |>
  dplyr::select(from_station, to_station) |>
  tidyr::pivot_longer(cols = 1:2, names_to = NULL, values_to = "station_names") |>
  dplyr::distinct()

# The important geo-coordinates corresponding to station names
mapData <- dplyr::tbl(dbconn, filtered_path, check_from = FALSE) |>
  dplyr::select(start_station_name, start_lat, start_lng,
                end_station_name, end_lat, end_lng)

# Filter to include all observations that match the station names listed
# in 'names'. We need the geo-coordinates alongside the names.
mapData1 <- mapData |>
  dplyr::collect() |>
  # Filter, but through a vector of conditions.
  dplyr::filter(start_station_name %in% names[[1]],
                end_station_name %in% names[[1]]) |>
  dplyr::select(start_station_name:start_lng)

# Had to split 'mapData' into two and pivot into a single table.
mapData2 <- mapData |>
  dplyr::collect() |>
  dplyr::filter(start_station_name %in% names[[1]],
                end_station_name %in% names[[1]]) |>
  dplyr::select(end_station_name:end_lng)

# Nice grouping
stations_groupMap <- dplyr::bind_rows(mapData1, mapData2) |>
  dplyr::select(start_station_name, start_lat, start_lng) |>
  dplyr::rename("station_names" = start_station_name,
                "lat" = start_lat,
                "lng" = start_lng) |>
  dplyr::distinct() |>
  dplyr::group_by(station_names)

# Setting seed for sampling
set.seed(113)

# Taking 10 random samples from each station_name group
sampled_stations <- stations_groupMap |>
  dplyr::slice_sample(n = 10) |>
  tidyr::drop_na()
```

```{r}
#| label: mapColors
#| code-summary: "Creates a map coloring palette excluding grays."
#| tidy: true
# All of the R colors
allPalette <- colors()

# The grays are vast, so we don't want those watering down the samples.
colorfulPal <- purrr::discard(allPalette, stringr::str_detect(allPalette, "gr(a|e)y"))

# When we sample the colors, 10 should be slightly more than needed.
n_colors <- 10
```
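A quick check of the filtering logic on a throwaway vector; the pattern `gr(a|e)y` matches both the American and British spellings anywhere in a color name:

```{r}
#| label: demo-palette-filter
#| code-summary: "Sanity check of the gray-excluding regex."
demo_colors <- c("red", "gray50", "grey80", "darkslategray", "steelblue")
# Only the non-gray colors survive the filter.
purrr::discard(demo_colors, stringr::str_detect(demo_colors, "gr(a|e)y"))
```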
```{r}
#| label: mapViewer
#| code-summary: "First, sourcing the script needed to generate the maps and creating the list of vectors used as input. These vectors are the slices of the top most traveled stations."
#| tidy: true
slicerVector <- list(c(1:9), c(10:18), c(19:27), c(28:34))

source("Scripts/mapViewer.R")
```

```{r}
#| file: "Scripts/mapViewer.R"
#| eval: false
#| code-summary: "The script used to generate the maps."
#| label: mapViewerScript
```
:::
:::::
:::

::: {#accordionParent .accordion .mt-3 .mb-3}
::::::: accordion-item
:::: {#headingOne .accordion-header}
::: {.accordion-button .collapsed type="button" data-bs-toggle="collapse" data-bs-target="#collapseOne" aria-expanded="true" aria-controls="collapseOne" style="background-color: #222"}
Benson Ave & Church St ... Ellis Ave & 60th St
:::
::::

:::: {#collapseOne .accordion-collapse .collapse aria-labelledby="headingOne" data-bs-parent="#accordionParent"}
::: accordion-body
```{r}
#| label: fig-map1
#| fig-cap: "Benson Ave & Church St - Ellis Ave & 60th St"
#| tidy: true
set.seed(240)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[1]])
```
:::
::::
:::::::

::::::: accordion-item
:::: {#headingTwo .accordion-header}
::: {.accordion-button .collapsed type="button" data-bs-toggle="collapse" data-bs-target="#collapseTwo" aria-expanded="false" aria-controls="collapseTwo" style="background-color: #222"}
Greenview Ave & Fullerton Ave ... Loomis Ave & Lexington St
:::
::::

:::: {#collapseTwo .accordion-collapse .collapse aria-labelledby="headingTwo" data-bs-parent="#accordionParent"}
::: accordion-body
```{r}
#| label: fig-map2
#| fig-cap: "Greenview Ave & Fullerton Ave - Loomis Ave & Lexington St"
#| tidy: true
set.seed(241)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[2]])
```
:::
::::
:::::::

::::::: accordion-item
:::: {#headingThree .accordion-header}
::: {.accordion-button .collapsed type="button" data-bs-toggle="collapse" data-bs-target="#collapseThree" aria-expanded="false" aria-controls="collapseThree" style="background-color: #222"}
Michigan Ave & Oak St ... State St & 33rd St
:::
::::

:::: {#collapseThree .accordion-collapse .collapse aria-labelledby="headingThree" data-bs-parent="#accordionParent"}
::: accordion-body
```{r}
#| label: fig-map3
#| fig-cap: "Michigan Ave & Oak St - State St & 33rd St"
#| tidy: true
set.seed(242)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[3]])
```
:::
::::
:::::::

::::::: accordion-item
:::: {#headingFour .accordion-header}
::: {.accordion-button .collapsed type="button" data-bs-toggle="collapse" data-bs-target="#collapseFour" aria-expanded="false" aria-controls="collapseFour" style="background-color: #222"}
Streeter Dr & Grand Ave ... Woodlawn Ave & 55th St
:::
::::

:::: {#collapseFour .accordion-collapse .collapse aria-labelledby="headingFour" data-bs-parent="#accordionParent"}
::: accordion-body
```{r}
#| label: fig-map4
#| fig-cap: "Streeter Dr & Grand Ave - Woodlawn Ave & 55th St"
#| tidy: true
set.seed(243)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[4]])
```
:::
::::
:::::::
:::

### Summarizing Geographic EDA

::: p-1
For example, suppose the user selects *University Ave & 57th St* in the epiflow visualization. This intersection happens to be at the heart of the University of Chicago campus. The natural next question is: where does the traffic to and from this location typically flow? By selecting one of the other nodes highlighted with flows directing away from the previous node, the user can identify *Kimbark Ave & 53rd St*. As seen in the map view, this location is situated adjacent to the *Vue 53 Apartments* complex.
By analyzing such connections between nodes, the user can gain insight into common routes and destinations originating from a particular point of interest, potentially revealing patterns related to student housing, campus facilities, or other points of interest in the vicinity.

The data suggests individual members utilize the service multiple times weekly. However, further analysis is needed to determine whether a significantly larger volume of unique individuals are annual members. Verifying associations between specific locations and higher or lower traffic could be a next step. Preliminary observations indicate universities, shopping centers, major companies, and nearby apartment complexes tend to have the highest ridership volumes.

To improve membership, addressing factors deterring individuals from becoming annual members could be key. These may include a lack of stations within walking distance of residences or destinations, or concerns over electric bicycle battery life and charging station availability, which could explain their lower utilization compared to classic bikes. Offering trial periods could allow casual users to experience the service's reliability and convenience, encouraging conversion to annual memberships.
:::

## Updated Database Tables List

::: table-container
```{r}
#| label: tbl-dbList2
#| tbl-cap: "The list of tables created by the end of the analysis."
#| tidy: true
dbList2 <- duckdb::dbListTables(dbconn) |>
  as.data.frame() |>
  tabler(
    title = gt::md("Which tables have<br>been created so far?"),
    note_list = list(gt::md("Tables in `db/data.db` at this stage")),
    location_list = list("duckdb::dbListTables(dbconn)"),
    noteColumns = TRUE,
    source_note = gt::md("**Source**: `db/data.db`"),
    label_n = NULL
  ) |>
  gt::cols_label("duckdb::dbListTables(dbconn)" = "Table Paths") |>
  gt::cols_align(align = "left")

dbList2
```
:::

::: {#fig-dbList_2 .column-margin}
Notice the size difference from the previous image. The database is still represented in the data folder as one file.
:::

## Export the Final Database

To save time in future sessions, the database is exported to Parquet files with ZSTD compression, so the original data does not have to be re-downloaded and the tables do not have to be recreated.

```{r}
#| label: exporter
#| code-summary: "A view of the data exporting script."
#| file: "Scripts/dbExporter.R"
#| eval: false
```

```{r}
#| label: runExporter
#| code-summary: "Execute the above script, with a custom SQL query string parameter, to the `db_exported` directory."
#| tidy: true
#| output: false
if (!dir.exists("db_exported")) {
  source("Scripts/dbExporter.R")

  queryString <- paste0(
    "EXPORT DATABASE 'db_exported' (",
    "FORMAT PARQUET, ",
    "COMPRESSION ZSTD",
    ")"
  )

  dbExporter(dbdir = "db/data.db", query = queryString)
}
```
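Should the database ever need rebuilding, DuckDB's companion `IMPORT DATABASE` statement reconstructs the tables from that exported directory. A hypothetical restore, using an illustrative `db/restored.db` path (not evaluated here):

```{r}
#| label: demo-importDatabase
#| code-summary: "Hypothetical restore from the exported Parquet files."
#| eval: false
conn2 <- DBI::dbConnect(duckdb::duckdb(), dbdir = "db/restored.db")
DBI::dbExecute(conn2, "IMPORT DATABASE 'db_exported'")
duckdb::dbListTables(conn2)
DBI::dbDisconnect(conn2, shutdown = TRUE)
```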
# Conclusion

::: p-1
These findings and recommendations are based on robust statistical analyses, including chi-square tests, binary logistic regression models, and visualization techniques. They provide a data-driven foundation for enhancing the Divvy bike-sharing service and better serving the residents of Chicago.
:::

## Key Findings

::: p-1
- Membership status significantly influences bike usage patterns:
  - Members prefer classic bikes over electric bikes.
  - Casual users have a higher electric bike usage compared to members.
  - Members typically take shorter, faster trips.
  - Casual users tend to engage in longer, slower rides.
- Temporal patterns reveal distinct user behaviors:
  - Weekdays and typical commuting hours see higher member activity.
  - Weekends and evenings show increased casual ridership.
  - Membership likelihood is highest during colder months and early spring.
  - Summer months see a shift towards more casual ridership.
- Trip characteristics vary by user type:
  - Members are associated with shorter ride distances.
  - Trip speed shows a strong positive association with membership status.
  - The likelihood of membership decreases as trip duration increases.
- High-traffic stations are often near universities, shopping centers, major companies, and apartment complexes.
- The large sample size (nearly 4 million trips) means even modest differences between groups reach high statistical significance.
:::

## Recommendations

1. Tailor marketing strategies to target potential members for short, frequent trips, especially for commuting purposes.
2. Optimize bike distribution to meet the demand for classic bikes among members and electric bikes among casual users.
3. Implement promotional campaigns during summer months to convert casual users to members.
4. Consider offering trial periods to allow casual users to experience the benefits of membership.
5. Investigate factors deterring individuals from becoming annual members, such as station proximity to residences or destinations.
6. Address potential concerns over electric bicycle battery life and charging station availability.
7. Focus on improving service near high-traffic areas like universities, shopping centers, and residential complexes.
8. Develop targeted strategies to encourage casual users who take longer, leisure rides to consider membership benefits.
9. Utilize the epiflows visualization tool to identify and optimize popular routes and destinations.
10. Continue to collect and analyze data to refine understanding of user behavior and preferences over time.

```{r}
#| eval: false
#| include: false
# If you need to drop any tables without deleting the entire database.
source("Scripts/duckDrops.R")
```