Bike-Sharing in the Streets of Chicago

Author

Eric Mossotti

Published

May 23, 2024

Modified

Oct 14, 2024

Objective

Communicating reproducible, data-driven insight into the questions relating but not limited to, increasing annual membership subscriptions.

1 Intro

A view the dl_toDirectory.R script used below.

# ----
# CC BY-SA, Eric Mossotti
# ----
# Description ----
# 
# Downloads files from the cloud, calls the file relocating function and cleans
# up the environment.
# 
# ----
source("Scripts/unz_relocate.R")

dl_toDirectory <- function (
        durls,
        tmpzip_dir,
        tmpfile_dir,
        tmpzip_paths,
        tmpfile_paths,
        tmpfile_names
        ) {

  # Create a directory to store the temporary files
  dir.create(tmpzip_dir)
  
  # A simple way to download and relocate those files from the working directory to the file-folder paths created earlier.
  curl::multi_download(durls, 
                       destfiles = tmpzip_paths)
  
  # Create tempFile directory
  dir.create(tmpfile_dir)
  
  # Execute sourced custom function from the unz_relocate.R file
  unz_relocate(fPaths = tmpfile_paths,
               zPaths = tmpzip_paths,
               fNames = tmpfile_names)
  
  # To remove the directory and contents thereof after having finished using
  unlink(tmpzip_dir, 
         recursive = TRUE)
  
}

A view of the csv_toDB.R script used below.

# ----
# CC BY-SA, Eric Mossotti
#  ----
# Description ----
# 
# Reads the downloaded files into a dataframe. Then writes the df to local 
# storage and cleans up the environment.
# 
# ---- 

library(duckdb)

csv_toDB <- function (
        tmpfile_dir,
        tmpfile_paths,
        db_dir,
        database_path,
        original_path,
        complete_path
) {
    # Would like to read in all unzipped files to a single dataframe
    dataTibble <-
        purrr::map(tmpfile_paths[1:12], 
                   arrow::read_csv_arrow) |>
        purrr::list_rbind()
    
    # duckDB for data storage ----
    
    # Folder to hold database files
    dir.create(db_dir)
    
    # Initialize a duckDB database connection
    dbconn <- DBI::dbConnect(duckdb::duckdb(), 
                             dbdir = database_path, 
                             read_only = FALSE)
    
    # Original data table
    dataTibble |>
        as.data.frame() |>
        duckdb::dbWriteTable(conn = dbconn,
                             name = original_path,
                             overwrite = TRUE)
    
    # Clean data and working environment ----
    rm(dataTibble)
    
    # Complete observations write
    dplyr::tbl(dbconn, original_path) |>
        dplyr::collect() |>
        tidyr::drop_na() |>
        duckdb::dbWriteTable(conn = dbconn,
                             name = complete_path,
                             overwrite = TRUE)
    
    unlink(tmpfile_dir, recursive = TRUE)
}

Execute initial download and removal of incomplete observations, then
store original data and complete data separately.

Or, just connect to existing db,
then source frequently used scripts used later.

if (exists("dbconn") == FALSE && dir.exists("db") == FALSE) {

    source("Scripts/dl_toDirectory.R")

    dl_toDirectory(durls = sprintf("https://divvy-tripdata.s3.amazonaws.com/%d-divvy-tripdata.zip",
        202301:202312), tmpzip_dir = "tempZips", tmpzip_paths = sprintf("tempZips/%d-divvy-tripdata.zip",
        202301:202312), tmpfile_dir = "tempFiles", tmpfile_paths = sprintf("tempFiles/%d-divvy-tripdata.csv",
        202301:202312), tmpfile_names = sprintf("%d-divvy-tripdata.csv", 202301:202312))

    source("Scripts/csv_toDB.R")

    csv_toDB(tmpfile_dir = "tempFiles", tmpfile_paths = sprintf("tempFiles/%d-divvy-tripdata.csv",
        202301:202312), db_dir = "db", database_path = "db/data.db", original_path = "db/original_data.db",
        complete_path = "db/complete_data.db")
} else {
    database_path <- "db/data.db"
    original_path <- "db/original_data.db"
    complete_path <- "db/complete_data.db"
    filtered_path <- "db/filtered_data.db"

    dbconn <- DBI::dbConnect(duckdb::duckdb(), dbdir = database_path, read_only = FALSE)
}

source("Scripts/tabler.R")
source("Scripts/plotter.R")
source("Scripts/transformData.R")
source("Scripts/chisqTest.R")

1.1 Stakeholders

The primary stakeholders in this analysis are Divvy, Lyft (the parent company of Divvy), and the City of Chicago Department of Transportation. The analysis aims to provide these stakeholders with data-driven insights to enhance the Divvy bike-sharing service, better serving the residents of Chicago and its users. The initial rationale behind Divvy’s implementation included improving air quality, promoting economic recovery, and reducing traffic congestion within the city. (About Divvy)

About divvy: Company & history | divvy bikes. https://divvybikes-marketing-staging.lyft.net/about

1.2 Data Import

Off-Canvas Buttons

Code processing steps are accessible via buttons like the one below. Drop-down code summaries and tables therein add context and transparency regarding the presented findings to enhance understanding.

Data import scripts and initial analysis setup decision

1.3 Source

The raw 2023 dataset was imported from Divvy Data. (Divvy Data)

Divvy data. https://divvybikes.com/system-data

Code

# List of column labels to feed tabler() and add_multiple_footnotes()
location_list <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    colnames() |>
    as.list()

# A simple list of footnotes to feed tabler() and add_multiple_footnotes().
note_list <- list("Anonymized trip identifier.", "The bicycle type.", "Starting date-time (to the second).",
    "Ending date-time (to the second).", "Station name of where the trip started.",
    "Station ID of where the trip started.", "Station name of where the trip ended.",
    "Station ID of where the trip ended.", "Latitude associated with the starting location.",
    "Longitude associated with the starting location.", "Latitude associated with the ending location.",
    "Longitude associated with the ending location.", "If someone is an annual subscriber or not.")

dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    dplyr::slice_head(n = 10) |>
    tabler(title = "What data did we start with?", source_note = gt::md("**Source**: Divvy Data"),
        note_list = note_list, location_list = location_list, noteColumns = TRUE,
        label_n = NULL) |>
    gt::tab_options(table.font.size = gt::pct(75), footnotes.multiline = FALSE)

Table 1: Unaltered data (other than being combined and formatted for presentation).

ride_id¹	rideable_type²	started_at³	ended_at⁴	start_station_name⁵	start_station_id⁶	end_station_name⁷	end_station_id⁸	start_lat⁹	start_lng¹⁰	end_lat¹¹	end_lng¹²	member_casual¹³
What data did we start with?
F96D5A74A3E41399	electric_bike	2023-01-21 20:05:42	2023-01-21 20:16:33	Lincoln Ave & Fullerton Ave	TA1309000058	Hampden Ct & Diversey Ave	202480.0	41.92407	-87.64628	41.93000	-87.64000	member
13CB7EB698CEDB88	classic_bike	2023-01-10 15:37:36	2023-01-10 15:46:05	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79957	-87.59475	41.80983	-87.59938	member
BD88A2E670661CE5	electric_bike	2023-01-02 07:51:57	2023-01-02 08:05:11	Western Ave & Lunt Ave	RP-005	Valli Produce - Evanston Plaza	599	42.00857	-87.69048	42.03974	-87.69941	casual
C90792D034FED968	classic_bike	2023-01-22 10:52:58	2023-01-22 11:01:44	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79957	-87.59475	41.80983	-87.59938	member
3397017529188E8A	classic_bike	2023-01-12 13:58:01	2023-01-12 14:13:20	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79957	-87.59475	41.80983	-87.59938	member
58E68156DAE3E311	electric_bike	2023-01-31 07:18:03	2023-01-31 07:21:16	Lakeview Ave & Fullerton Pkwy	TA1309000019	Hampden Ct & Diversey Ave	202480.0	41.92607	-87.63886	41.93000	-87.64000	member
2F7194B6012A98D4	electric_bike	2023-01-15 21:18:36	2023-01-15 21:32:36	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79955	-87.59462	41.80983	-87.59938	member
DB1CF84154D6A049	classic_bike	2023-01-25 10:49:01	2023-01-25 10:58:22	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79957	-87.59475	41.80983	-87.59938	member
34EAB943F88C4C5D	electric_bike	2023-01-25 20:49:47	2023-01-25 21:02:14	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79959	-87.59467	41.80983	-87.59938	member
BC8AB1AA51DA9115	classic_bike	2023-01-06 16:37:19	2023-01-06 16:49:52	Kimbark Ave & 53rd St	TA1309000037	Greenwood Ave & 47th St	TA1308000002	41.79957	-87.59475	41.80983	-87.59938	member
Source: Divvy Data
¹ Anonymized trip identifier. ² The bicycle type. ³ Starting date-time (to the second). ⁴ Ending date-time (to the second). ⁵ Station name of where the trip started. ⁶ Station ID of where the trip started. ⁷ Station name of where the trip ended. ⁸ Station ID of where the trip ended. ⁹ Latitude associated with the starting location. ¹⁰ Longitude associated with the starting location. ¹¹ Latitude associated with the ending location. ¹² Longitude associated with the ending location. ¹³ If someone is an annual subscriber or not.

A ‘glimpse’ function output of the raw data with data type information.

dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    tibble::glimpse()

Rows: 5,719,877
Columns: 13
$ ride_id            <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
$ rideable_type      <chr> "electric_bike", "classic_bike", "electric_bike", "…
$ started_at         <dttm> 2023-01-21 20:05:42, 2023-01-10 15:37:36, 2023-01-…
$ ended_at           <dttm> 2023-01-21 20:16:33, 2023-01-10 15:46:05, 2023-01-…
$ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
$ start_station_id   <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
$ end_station_name   <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
$ end_station_id     <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
$ start_lat          <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957, 4…
$ start_lng          <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.594…
$ end_lat            <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983, 4…
$ end_lng            <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.599…
$ member_casual      <chr> "member", "member", "casual", "member", "member", "…

R console output of the raw data

1.4 Design

Another worthy objective of this analysis is to achieve reproducibility and efficiency. To facilitate future research and enable subsequent analyst teams to build upon this work, the project aimed to provide adequate code documentation and adhere to best practices regarding clean and modular code.

For instance, certain design decisions were incorporated to eliminate the need for re-downloading and re-processing data. For analysts conducting analysis over an extended period, such as days or months, on this dataset, it is now possible to simply reconnect to the single database file containing all the original data, including tables generated throughout the analysis process, following the initial download and subsequent processing.

The underlying code incorporates an if-else decision, which includes a source code script responsible for handling the initial data processing and establishing the database filesystem. Opting for a persistent DuckDB filesystem (as opposed to a purely in-memory solution) appeared optimal in terms of simplicity, cost-effectiveness of SQL database queries, and retaining progress made over extended periods. (Why DuckDB)

Why DuckDB. https://duckdb.org/why_duckdb.html

To streamline the process, reduce code duplication, and maintain consistent formatting throughout the project, reusable functions were developed for generating most of the tables and figures. These functions are located in the “Scripts” folder within the working directory. Their modular design not only simplifies the implementation of formatting changes but also facilitates the integration of additional code snippets when necessary. For instance, certain plots might require limiting the range of the axes, which can be achieved by combining these functions with appropriate code addendum. By leveraging these functions, the project benefits from reduced redundancy, improved efficiency, and cohesive formatting across all visualizations and data representations.

1.5 Initial Database Table List

Code

# Since we already know the paths, we'll manually define these Otherwise, might
# have to delete db folder and start over

dbList <- list(Paths = c("db/original_data.db", "db/complete_data.db")) |>
    as.data.frame()

dbList |>
    tabler(title = gt::md("Which tables have<br>been created so far?"), note_list = list(gt::md("Tables in `db/data.db` at this stage")),
        location_list = list("Paths"), noteColumns = TRUE, source_note = gt::md("**Source**: `db/data.db`"),
        label_n = NULL, align_parameter = "left") |>
    gt::cols_align(align = "left")

Table 2: The list of tables created in data.db so far.

Paths¹
Which tables have been created so far?
db/original_data.db
db/complete_data.db
Source: `db/data.db`
¹ Tables in `db/data.db` at this stage

A view of the filesystem directory. Notice there are not separate files for tables. The file, data.db, in this view does not represent a table name but the name of the database.

2 Tidying

2.1 Duplicates

First, record original observations from the raw data.

# Need to save this count for the summary table later
original_nobs <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    nrow()

Create a table containing the duplicated observations.

# This is a separate table used to analyze the observations returned as not
# distinct (n > 1). This adds an extra column, labeled 'n'.
dupeTable <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::select(started_at:end_station_name) |>
    # Counts of unique rows added for column 'n'
dplyr::add_count(started_at, ended_at, start_station_name, end_station_name) |>
    # Only observations that have been duplicated 1 or more times are shown.
dplyr::filter(n > 1) |>
    # To see all rows, not just one row for each obs.
dplyr::ungroup() |>
    dplyr::arrange(started_at) |>
    dplyr::collect()

Record a count of distinct duplicates and total observations.

distinctCopiesCount <- dupeTable |>
    dplyr::distinct(n) |>
    as.integer()

duplicateObs <- length(dupeTable[[1]])

Create a table of the now unduplicated observations seen earlier.

# The issue is, we need to get rid of not all of these rows, but just the extra
# duplicate observations.

# If there were 2 rows of duplicates, one would want to end up with 1 row after
# removing the extras.
undupedTable <- dupeTable |>
    dplyr::distinct(started_at, start_station_name, ended_at, end_station_name)

Record a count of the incorrect observations.

# Run an incorrect count on how many rows or observations there are in the
# dataset.
count_incorrectDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("ride_id")) |>
    dplyr::count(name = "Incorrect Distinct Observations") |>
    dplyr::collect() |>
    as.integer()

Record a count of the correct observations.

# For the correct count of obs
count_correctDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("started_at", "start_station_name", "ended_at", "end_station_name")) |>
    dplyr::count() |>
    dplyr::collect() |>
    as.integer()

Lastly, write the unduplicated data to the database.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/data_unduped.db"))) {
    dplyr::tbl(dbconn, complete_path) |>
        dplyr::distinct(started_at, start_station_name, ended_at, end_station_name,
            .keep_all = TRUE) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/data_unduped.db", overwrite = TRUE)
}

A crucial question arises: How can one identify and handle duplicate data? This section covers the process of checking for duplicates and selectively removing them while exercising caution. It is essential to recognize that the presence of unique values in a single column does not necessarily guarantee the uniqueness of each observation or row.

While all values in the ride_id column were found to be unique, not all observations were truly distinct. To verify the uniqueness of each observation, additional columns such as start_time, end_time, start_station, and end_station were utilized. These columns provide more granular information, including the precise starting and ending times down to the second, as well as the starting and ending locations. It was assumed that observations with identical starting and ending date-times and stations, despite having different rider IDs, were potentially erroneous duplicates.

Code

gtDupes <- 
dupeTable |>
dplyr::group_by(started_at) |>
gt::gt(
rowname_col = "row",
groupname_col = "started_at",
row_group_as_column = TRUE
) |>
gt::tab_style(style = list(
    gt::cell_text(weight = "bold", align = "center"),
    gt::cell_borders(sides = c("bottom"))
    ),
    locations = gt::cells_column_labels(gt::everything())) |>
gt::tab_style(
    style = list(gt::cell_borders(
        sides = c("left", "right"), 
        color = "transparent"), 
        gt::cell_text(align = "center", v_align = "middle")), 
    locations = gt::cells_body(gt::everything())) |> 
gt::data_color(
        columns = start_station_name,
        target_columns = gt::everything(),
        method = "auto",
        palette = "basetheme::brutal") |> 
gt::tab_header(title = "A view of duplicated observations", 
               subtitle = "Grouping follows the starting date-time value") |>
gt::tab_options(
heading.title.font.weight = "bolder",
heading.subtitle.font.weight = "lighter",
heading.align = "center",
table.background.color = "transparent",
table.font.color = "SeaShell",
table.font.size = gt::pct(75),
) |>
gt::tab_source_note(source_note = gt::md("**Source**: `db/data_complete.db`")
                    )
gtDupes

Table 3: Duplicates Table

A view of duplicated observations
Grouping follows the starting date-time value
	ended_at	start_station_name	start_station_id	end_station_name	n
2023-02-19 12:10:52	2023-02-19 12:24:04	Orleans St & Merchandise Mart Plaza	TA1305000022	Green St & Randolph St*	2
2023-02-19 12:10:52	2023-02-19 12:24:04	Orleans St & Merchandise Mart Plaza	TA1305000022	Green St & Randolph St*	2
2023-04-15 15:56:18	2023-04-15 16:01:54	Kedzie Ave & 45th St	342	Fairfield Ave & 44th St	2
2023-04-15 15:56:18	2023-04-15 16:01:54	Kedzie Ave & 45th St	342	Fairfield Ave & 44th St	2
2023-04-21 09:45:26	2023-04-21 10:01:23	Michigan Ave & 8th St	623	Wentworth Ave & Cermak Rd*	2
2023-04-21 09:45:26	2023-04-21 10:01:23	Michigan Ave & 8th St	623	Wentworth Ave & Cermak Rd*	2
2023-07-08 18:22:07	2023-07-08 18:37:31	Milwaukee Ave & Grand Ave	13033	Bissell St & Armitage Ave*	2
2023-07-08 18:22:07	2023-07-08 18:37:31	Milwaukee Ave & Grand Ave	13033	Bissell St & Armitage Ave*	2
2023-07-08 18:58:14	2023-07-08 19:08:47	Clark St & Schiller St	TA1309000024	Bissell St & Armitage Ave*	2
2023-07-08 18:58:14	2023-07-08 19:08:47	Clark St & Schiller St	TA1309000024	Bissell St & Armitage Ave*	2
2023-07-09 18:00:17	2023-07-09 18:40:49	Wells St & Hubbard St	TA1307000151	Fort Dearborn Dr & 31st St*	2
2023-07-09 18:00:17	2023-07-09 18:40:49	Wells St & Hubbard St	TA1307000151	Fort Dearborn Dr & 31st St*	2
2023-07-10 20:10:41	2023-07-10 20:19:59	Clark St & Newport St	632	Lincoln Ave & Roscoe St*	2
2023-07-10 20:10:41	2023-07-10 20:19:59	Clark St & Newport St	632	Lincoln Ave & Roscoe St*	2
2023-07-15 10:48:09	2023-07-15 10:58:22	Racine Ave & Wrightwood Ave	TA1309000059	Bissell St & Armitage Ave*	2
2023-07-15 10:48:09	2023-07-15 10:58:22	Racine Ave & Wrightwood Ave	TA1309000059	Bissell St & Armitage Ave*	2
2023-07-15 19:38:51	2023-07-15 19:55:04	Avondale Ave & Irving Park Rd	15624	Public Rack - Hamlin Ave & Fullerton Ave	2
2023-07-15 19:38:51	2023-07-15 19:55:04	Avondale Ave & Irving Park Rd	15624	Public Rack - Hamlin Ave & Fullerton Ave	2
2023-07-23 11:41:36	2023-07-23 12:07:13	Burnham Harbor	15545	Fort Dearborn Dr & 31st St*	2
2023-07-23 11:41:36	2023-07-23 12:07:13	Burnham Harbor	15545	Fort Dearborn Dr & 31st St*	2
2023-07-25 18:08:47	2023-07-25 18:21:45	Wabash Ave & Roosevelt Rd	TA1305000002	Wentworth Ave & Cermak Rd*	2
2023-07-25 18:08:47	2023-07-25 18:21:45	Wabash Ave & Roosevelt Rd	TA1305000002	Wentworth Ave & Cermak Rd*	2
2023-07-26 21:10:55	2023-07-26 21:28:44	Burnham Harbor	15545	Fort Dearborn Dr & 31st St*	2
2023-07-26 21:10:55	2023-07-26 21:28:44	Burnham Harbor	15545	Fort Dearborn Dr & 31st St*	2
2023-08-05 19:40:37	2023-08-05 20:08:35	Morgan St & Lake St*	chargingstx4	Morgan St & Lake St*	2
2023-08-05 19:40:37	2023-08-05 20:08:35	Morgan St & Lake St*	chargingstx4	Morgan St & Lake St*	2
2023-08-12 17:46:53	2023-08-12 17:57:40	Calumet Ave & 18th St	13102	Wentworth Ave & Cermak Rd*	2
2023-08-12 17:46:53	2023-08-12 17:57:40	Calumet Ave & 18th St	13102	Wentworth Ave & Cermak Rd*	2
2023-08-17 12:23:50	2023-08-17 12:37:45	DuSable Lake Shore Dr & Monroe St	13300	Streeter Dr & Grand Ave	2
2023-08-17 12:23:50	2023-08-17 12:37:45	DuSable Lake Shore Dr & Monroe St	13300	Streeter Dr & Grand Ave	2
2023-09-03 14:55:59	2023-09-03 15:58:21	Bissell St & Armitage Ave*	chargingstx1	Bissell St & Armitage Ave*	2
2023-09-03 14:55:59	2023-09-03 15:58:21	Bissell St & Armitage Ave*	chargingstx1	Bissell St & Armitage Ave*	2
2023-09-25 17:38:05	2023-09-25 17:52:29	Fairbanks Ct & Grand Ave	TA1305000003	DuSable Lake Shore Dr & North Blvd	2
2023-09-25 17:38:05	2023-09-25 17:52:29	Fairbanks Ct & Grand Ave	TA1305000003	DuSable Lake Shore Dr & North Blvd	2
2023-10-10 13:22:51	2023-10-10 13:29:37	Loomis St & Lexington St	13332	Morgan St & Polk St	2
2023-10-10 13:22:51	2023-10-10 13:29:37	Loomis St & Lexington St	13332	Morgan St & Polk St	2
Source: `db/data_complete.db`

Although the cause of such duplication errors is unknown, it could be assumed that one person checked out multiple bikes simultaneously. In that scenario, each bike would be assigned a unique ride_id. However, this occurrence was relatively rare, happening only 18 times over the course of a year. Since there is only one duplicate for each instance, it raises concerns and warrants further investigation. It is possible that trips could be grouped where one person pays for another rider’s fare. However, if that were the case, it raises the question of why there is always precisely one duplicate.

In Table 3, duplicate observations are listed and grouped by color for visual clarity. In contrast, Table 4 presents the data after removing the extra copy of each duplicate observation while preserving the unique observations. Of the duplicates identified, each had one extra copy. It was noted that the number of rows in the duplicates table is 36. Each duplicated observation has one duplicate, where n (the count) is always 2. Therefore, the expected number of observations to be removed was 18. A complication arose in determining how to remove not all observations but only the extra duplicate observation from each group.

Code

gt_undupes <- undupedTable |>
dplyr::collect() |>
dplyr::group_by(started_at) |>
gt::gt(
rowname_col = "row",
groupname_col = "started_at",
row_group_as_column = TRUE
) |>
gt::fmt_number(decimals = 0) |>
gt::tab_style(
style = list(
gt::cell_text(weight = "bold", align = "center"),
gt::cell_borders(sides = c("bottom"))
),
locations = gt::cells_column_labels(gt::everything())
) |>
gt::tab_style(
style = list(
gt::cell_borders(sides = c("left", "right")),
gt::cell_text(align = "center", v_align = "middle")
),
locations = gt::cells_body(gt::everything())
) |>
gt::data_color(
columns = start_station_name,
target_columns = gt::everything(),
method = "auto",
palette = "basetheme::brutal"
) |>
gt::tab_header(title = "After duplicates were removed", subtitle = "Same grouping") |>
gt::tab_options(
heading.title.font.weight = "bolder",
heading.subtitle.font.weight = "lighter",
heading.align = "center",
table.background.color = "transparent",
table.font.color = "SeaShell",
table.font.size = gt::pct(75)
) |>
gt::tab_source_note(source_note = gt::md("**Source**: `db/data_complete.db`"))

gt_undupes

Table 4: Un-duplicated Table

After duplicates were removed
Same grouping
	start_station_name	ended_at	end_station_name
2023-02-19 12:10:52	Orleans St & Merchandise Mart Plaza	2023-02-19 12:24:04	Green St & Randolph St*
2023-04-15 15:56:18	Kedzie Ave & 45th St	2023-04-15 16:01:54	Fairfield Ave & 44th St
2023-04-21 09:45:26	Michigan Ave & 8th St	2023-04-21 10:01:23	Wentworth Ave & Cermak Rd*
2023-07-08 18:22:07	Milwaukee Ave & Grand Ave	2023-07-08 18:37:31	Bissell St & Armitage Ave*
2023-07-08 18:58:14	Clark St & Schiller St	2023-07-08 19:08:47	Bissell St & Armitage Ave*
2023-07-09 18:00:17	Wells St & Hubbard St	2023-07-09 18:40:49	Fort Dearborn Dr & 31st St*
2023-07-10 20:10:41	Clark St & Newport St	2023-07-10 20:19:59	Lincoln Ave & Roscoe St*
2023-07-15 10:48:09	Racine Ave & Wrightwood Ave	2023-07-15 10:58:22	Bissell St & Armitage Ave*
2023-07-15 19:38:51	Avondale Ave & Irving Park Rd	2023-07-15 19:55:04	Public Rack - Hamlin Ave & Fullerton Ave
2023-07-23 11:41:36	Burnham Harbor	2023-07-23 12:07:13	Fort Dearborn Dr & 31st St*
2023-07-25 18:08:47	Wabash Ave & Roosevelt Rd	2023-07-25 18:21:45	Wentworth Ave & Cermak Rd*
2023-07-26 21:10:55	Burnham Harbor	2023-07-26 21:28:44	Fort Dearborn Dr & 31st St*
2023-08-05 19:40:37	Morgan St & Lake St*	2023-08-05 20:08:35	Morgan St & Lake St*
2023-08-12 17:46:53	Calumet Ave & 18th St	2023-08-12 17:57:40	Wentworth Ave & Cermak Rd*
2023-08-17 12:23:50	DuSable Lake Shore Dr & Monroe St	2023-08-17 12:37:45	Streeter Dr & Grand Ave
2023-09-03 14:55:59	Bissell St & Armitage Ave*	2023-09-03 15:58:21	Bissell St & Armitage Ave*
2023-09-25 17:38:05	Fairbanks Ct & Grand Ave	2023-09-25 17:52:29	DuSable Lake Shore Dr & North Blvd
2023-10-10 13:22:51	Loomis St & Lexington St	2023-10-10 13:29:37	Morgan St & Polk St
Source: `db/data_complete.db`

To ensure the accurate removal of duplicates, the count of distinct n-values (representing the number of occurrences) for the un-duplicated table was computed, confirming the expected 18 unique instances. Subsequently, the total number of observations in the dataset was recorded, initially standing at 4,331,707. After removing the identified duplicate observations, the correct count of observations was 4,331,689. In summary, 18 additional observations were successfully removed, aligning with the expected number of duplicates identified earlier. These steps are documented in Table 5 for reference.

By carefully analyzing the count of distinct n-values and the total occurrences before and after reduplication, it was ensured that only the precise number of duplicate observations was removed, preserving the integrity of the unique data while eliminating the identified duplicates. This meticulous approach to data cleaning is crucial for maintaining data quality and reliability throughout the analysis process.

Code used for handling duplicates in this section

2.2 Outliers

If you happen to be re-using this code -
this is so you do not have to re-download
or re-filter after making further adjustments.

filtered_path <- "db/filtered_data.db"

# Do we still need to filter the database?
if (duckdb::dbExistsTable(dbconn, filtered_path) == FALSE) {
    source("Scripts/transFilter.R")
    transFilter(conxn = dbconn, oldPath = "db/data_unduped.db", newPath = filtered_path)
}

This would execute if the conditions were met
to filter the db/data.db database table

# ----
# Author: Eric Mossotti
# CC-BY SA
# ----
# Performs data transformations and filters to enforce consistency across
# all of the analyses. It also will simplify query syntax in extended analyses.
# Filtering can be based on flexible, but sensible criteria.
# ----
transFilter <- function(conxn, oldPath, newPath) {
    dplyr::tbl(conxn, oldPath) |>
        dplyr::collect() |>
        dplyr::mutate(
            "trip_time" = lubridate::time_length(lubridate::interval(started_at, ended_at), 
                                                 unit = "minute"),
            miles = geosphere::distGeo(
                p1 = cbind(start_lng, start_lat),
                p2 = cbind(end_lng, end_lat)
            ) / 1000 * 0.62137119,
            mph = (miles / (trip_time / 60)),
            .keep = "all",
        ) |> 
        # Floor rationale -  less than 0.1 miles are distances easily walked
        # Speed ceiling rationale - pro cyclists average around 20 mph
        # Speed floor rationale - accounts for trips possibly spent idling
        # rideable_type rationale - docked_bike stopped being recorded as a distinct
        # category within the time being analyzed (2023)
        # docked_bike was phased out and not much info about what it means
        dplyr::filter(
            trip_time > 1,
            trip_time < 480,
            rideable_type != "docked_bike",
            miles >= 0.1,
            mph <= 20,
            mph >= 1
        ) |>
        duckdb::dbWriteTable(conn = conxn,
                             name = newPath,
                             overwrite = TRUE)
}

Observations deemed erroneous or irrelevant for identifying usage trends among members and casual users were filtered out. Keeping track of these errors is a good practice, as they might provide insights into the differences in how members and casuals utilize the service.

Trips with negative duration were flagged as errors and removed. Additionally, trips lasting less than a minute but greater than zero were noted and removed, as they could potentially skew the derived statistics. These extremely short trips might be attributed to users briefly trying out the service before committing or quickly realizing their dissatisfaction with it. While some observations seemed nonsensical, most of the data was retained.

Consistent with the previous approach, an if-else decision was employed to facilitate testing. An external database filtering script was utilized to streamline the code within the main Quarto document. The resulting filtered data served as the foundation for subsequent analysis and table generation.

To get a count of the new total observations after filtering.

count_filtered <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(ride_id) |>
    dplyr::distinct() |>
    dplyr::count() |>
    dplyr::collect() |>
    as.integer()

Code

# To see the history of obs in our dataset.
summaryProcessTable <- tidyr::tribble(
~ " ",
~ "  ",
"Original   ",
original_nobs,
"Complete Observations   ",
count_incorrectDists,
"Duplicates   ",
(count_incorrectDists - count_correctDists),
"Filtered     ",
(count_correctDists - count_filtered),
"Total Corrected   ",
count_filtered
) |>
gt::gt(rownames_to_stub = FALSE) |>
gt::tab_header(title = "Documenting transformation results", 
subtitle = gt::md("Noting the change<br>in overall observations")) |>
gt::cols_label("  " = "n") |>
gt::tab_footnote(
footnote = gt::md("Row counts throughout the cleaning steps."),
locations = gt::cells_column_labels(columns = "  ")
) |>
gt::tab_style(
style = list(
gt::cell_borders(sides = "bottom"),
gt::cell_text(
align = "left",
stretch = "semi-expanded"
)
),
locations = gt::cells_body(gt::everything())
) |>
gt::tab_style(
gt::cell_text(
align = "center",
stretch = "semi-expanded"),
locations = list(
gt::cells_title(groups = c("title", "subtitle")),
gt::cells_column_labels(gt::everything())
)
) |>
gt::fmt_number(decimals = 0) |>
gt::tab_options(
column_labels.font.weight = "bold",
table.background.color = "transparent",
table.font.color = "SeaShell",
row.striping.background_color = "gray10",
row.striping.include_table_body = TRUE
) |>
gt::tab_source_note(source_note = gt::md("**Sources**: `db/data_original.db`, `db/data_complete.db`,<br>`db/data_unduped.db`, `db/data_filtered.db`"))

summaryProcessTable

Table 5: Documenting observation count history.

	n¹
Documenting transformation results
Noting the change in overall observations
Original	5,719,877
Complete Observations	4,331,707
Duplicates	18
Filtered	434,291
Total Corrected	3,897,398
Sources: `db/data_original.db`, `db/data_complete.db`, `db/data_unduped.db`, `db/data_filtered.db`
¹ Row counts throughout the cleaning steps.

Outlier and additional transformations

3 Exploratory Analysis

Reduces duplicate code and improves conistency in table formatting across this report.

# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# Reduces duplicate code for generating publication ready tables.
# ----
tabler <- function(tbl_name,
                   source_note = NULL,
                   title = NULL,
                   subtitle = NULL,
                   groupName = NULL,
                   by = NULL,
                   label = NULL,
                   label_n = "n",
                   label_member = "member_casual",
                   location = NULL,
                   footnote = NULL,
                   footer_column = NULL,
                   value_columns = gt::everything(),
                   decimals = 0,
                   font_color = "seashell",
                   bg_color = "transparent",
                   vline_color = "gray20",
                   hline_color = "gray20",
                   hide_column_labels = FALSE,
                   isSummary = FALSE,
                   isBinary = FALSE,
                   note_list = NULL,
                   location_list = NULL,
                   noteColumns = FALSE,
                   noteRows = FALSE,
                   chi_result = NULL,
                   chiVar = NULL,
                   chiBy = NULL,
                   stub_label = NULL,
                   stub_note = NULL,
                   stubName = NULL,
                   isStub = FALSE,
                   row_index = NULL,
                   label_parameter = NULL,
                   align_parameter = "center"
                   ) {
    # Need to account for grouped tables and non-grouped tables
    # Grouped ----
    if (!is.null(groupName)) {
        tbl <- gt::gt(tbl_name,
                      groupname_col = groupName,
                      row_group_as_column = TRUE) |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_stubhead(label = stub_label) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table.margin.left = 0,
                table.margin.right = 0,
                table_body.vlines.color = "transparent",
                table_body.hlines.color = hline_color,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
                column_labels.hidden = hide_column_labels
            ) |>
            gt::tab_source_note(source_note = {{ source_note }})
    }
    
    # Chi Square ----
    if (isSummary == TRUE) {
        tbl <- tbl_name |>
            gtsummary::tbl_summary(by = {{ by }}, label = label) |>
            gtsummary::modify_footnote(
                gtsummary::all_stat_cols() ~ paste0(
                    "n (%);   χ²  = ",
                    round(chi_result$statistic, 2),
                    ";   df = ",
                    chi_result$parameter
                )
            ) |>
            gtsummary::add_p() |>
            gtsummary::as_gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = "bottom"),
                    gt::cell_text(
                        align = "left",
                        stretch = "semi-expanded",
                        whitespace = "break-spaces"
                    )
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                gt::cell_text(align = "center", stretch = "semi-expanded"),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle")),
                    gt::cells_column_labels(gt::everything())
                )
            ) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table_body.vlines.color = vline_color,
                table_body.hlines.color = hline_color,
                column_labels.hidden = hide_column_labels,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    
    # Binary Logistic ----
    if (isBinary == TRUE) {
        tbl <- tbl_name |>
            gtsummary::as_gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = "bottom"),
                    gt::cell_text(align = "left", stretch = "semi-expanded")
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                gt::cell_text(align = "center", stretch = "semi-expanded"),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle")),
                    gt::cells_column_labels(gt::everything())
                )
            ) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table_body.vlines.color = vline_color,
                table_body.hlines.color = hline_color,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.hidden = hide_column_labels,
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    # non - grouped, non - binary, non - summary ----
    if (is.null(groupName) &&
        isFALSE(isBinary) &&
        isFALSE(isSummary)) {
        tbl <- tbl_name |>
            gt::gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = c("right", "left"), color = vline_color),
                    gt::cell_text(stretch = "semi-expanded")
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(
                        sides = c("right", "left", "top"),
                        color = vline_color
                    ),
                    gt::cell_text(align = "center", stretch = "semi-expanded")
                ),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle"))
                )
            ) |>
            gt::tab_style(
                style = list(
                    gt::cell_text(align = align_parameter, stretch = "semi-expanded")
                ),
                locations = list(gt::cells_column_labels(columns = label_parameter))
            )
        # Need to avoid this part for non-frequency tables?
        if (!is.null(label_n)) {
            tbl <- tbl |> gt::tab_style(
                style = list(
                    gt::cell_text(align = "right", stretch = "semi-expanded")
                ),
                locations = list(gt::cells_column_labels(columns = label_n))
            )
        }
        tbl <- tbl |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table.margin.left = 0,
                table.margin.right = 0,
                table_body.vlines.color = "transparent",
                table_body.hlines.color = "transparent",
                column_labels.vlines.color = vline_color,
                column_labels.hidden = hide_column_labels,
                heading.border.lr.color = "transparent",
                container.padding.x = 0,
                container.padding.y = 0,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    # To add footnotes to the various table types ----
    if (!is.null(note_list) && !is.null(location_list)) {
        tbl <- tbl |>
            add_multiple_footnotes(note_list,
                                   location_list,
                                   noteColumns,
                                   noteRows,
                                   row_index)
    }
    
    if (isTRUE(isStub)) {
        tbl <- tbl |>
            gt::tab_stubhead(label = stub_label) |>
            gt::cols_align(columns = groupName, align = "center") |>
            gt::tab_footnote(
                footnote = stub_note,
                location = gt::cells_stubhead(),
                placement = "auto"
            )
    }
    return(tbl)
}


# Footnotes function ----
add_multiple_footnotes <- function(tbl,
                                   note_list,
                                   location_list,
                                   noteColumns,
                                   noteRows,
                                   row_index) {
    if (length(note_list) != length(location_list)) {
        stop("The lengths of note_list and location_list must be equal.")
    }
    
    if (isTRUE(noteColumns)) {
        for (i in seq_along(note_list)) {
            tbl <- tbl |>
                gt::tab_footnote(
                    footnote = note_list[[i]],
                    location = gt::cells_column_labels(columns = location_list[[i]]),
                    placement = "left"
                )
        }
    }
    
    if (isTRUE(noteRows)) {
        for (i in seq_along(note_list)) {
            tbl <- tbl |>
                gt::tab_footnote(
                    footnote = note_list[[i]],
                    location = gt::cells_body(columns = location_list[[i]], rows = row_index),
                    placement = "auto"
                )
        }
    }
    
    return(tbl)
}

Provides a useful script for consistent formatting of many of the plots.

#  ----
# Author: Eric Mossotti
# CC BY-SA
#  ----
# To help reduce duplicate code and implement a consistent theme throughout a
# markdown project. This doesn't account for all plot types possible yet.
# ----
plotter <- function(data,
                    x_col,
                    y_col,
                    group_col = NULL,
                    color_col = NULL,
                    title = NULL,
                    subtitle = NULL,
                    x_label = NULL,
                    y_label = NULL,
                    geomType = NULL,
                    lineGroup_palette = "Paired",
                    colGroup_palette = "Paired",
                    line_palette = "YlOrRd",
                    col_palette = "YlOrRd",
                    facetCol_palette = "YlGnBu",
                    fill_col = "YlOrRd",
                    axis_ticks_color = "lavenderblush",
                    bg_color = "#222222",
                    text_color = "seashell",
                    grid_color = "grey30",
                    x_lim = c(NA, NA),
                    isFaceted = FALSE,
                    is_lineGroup = FALSE,
                    is_colGroup = FALSE,
                    isHistogram = FALSE,
                    isDensity = FALSE,
                    breaks = NULL,
                    limits = NULL,
                    angle = ggplot2::waiver(),
                    n.dodge = 1,
                    binwidth = NULL,
                    areaFill = NULL,
                    density_fill = ggplot2::waiver(),
                    density_color = ggplot2::waiver(),
                    alpha = ggplot2::waiver(),
                    isROC = FALSE,
                    roc_color = NULL,
                    vline_color = NULL,
                    vline_size = NULL,
                    density_alpha = ggplot2::waiver(),
                    dnorm_color = ggplot2::wavier(),
                    bins = NULL,
                    histo_breaks = NULL,
                    low = NULL,
                    high = NULL,
                    isTime = FALSE,
                    date_breaks = ggplot2::waiver(),
                    date_labels = ggplot2::waiver(),
                    date_minor_breaks = ggplot2::waiver(),
                    lineColor = NULL,
                    colPosition = "stack",
                    labels = ggplot2::waiver(),
                    isTimeHist = FALSE,
                    quartiles = NULL,
                    qformat = NULL) {
    # line ----
    if (geomType == "line") {
        if (is_lineGroup == TRUE) {
            plot <- data |> ggplot2::ggplot(mapping = ggplot2::aes(
                x = {{ x_col }},
                y = {{ y_col }},
                color = {{ color_col }}
            )) +
                ggplot2::geom_line(show.legend = TRUE) +
                ggplot2::scale_color_brewer(palette = lineGroup_palette, name = "")
        } else {
            plot <- data |> ggplot2::ggplot(mapping = ggplot2::aes(x = {{ x_col }}, y = {{ y_col }})) +
                ggplot2::geom_line(show.legend = TRUE, color = lineColor)
        }
    }
    # col ----
    else if (geomType == "column") {
        # non-grouped, non-faceted, non-density, non-histogram ----
        if (isFALSE(is_colGroup) &&
            isFALSE(isDensity) &&
            isFALSE(isDensity) &&
            isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ y_col }}
                )) +
                ggplot2::geom_col(show.legend = FALSE) +
                ggplot2::scale_fill_distiller(palette = col_palette)
        }
        # grouped, non - faceted, non - density, non - histogram----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_col(
                    show.legend = TRUE,
                    position = colPosition,
                    color = color_col
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
        # grouped, faceted, non-density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ y_col }}
                )) +
                ggplot2::geom_col(show.legend = FALSE) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_fill_distiller(palette = facetCol_palette)
        }
        # grouped, faceted, density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ..density..,
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_density(alpha = alpha) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = 1, angle = angle)
                )
        }
        # grouped, non-faceted, density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ..density..,
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    alpha = alpha,
                    breaks = histo_breaks
                ) +
                ggplot2::geom_density(alpha = alpha,
                                      color = density_color,
                                      fill = density_fill) +
                ggplot2::stat_function(fun = dnorm,
                                       args = list(mean = mean({{ x_col }}), sd = sd({{ x_col }}))) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = 1, angle = angle)
                )
        }
        # grouped, non-faceted, density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ggplot2::after_stat(density),
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_density(alpha = density_alpha, color = color_col) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
        # non-grouped, non-faceted, density, non-histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(
                    ggplot2::aes(
                        x = {{ x_col }},
                        y = ..density..,
                        fill = density_fill,
                        color = density_color
                    )
                ) +
                ggplot2::geom_density(show.legend = FALSE) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, non-faceted, non-density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    alpha = alpha,
                    breaks = histo_breaks
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, non-density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_col(show.legend = TRUE, position = "dodge") +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "") +
                ggplot2::scale_x_discrete(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, non-density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    bins = bins,
                    breaks = breaks
                ) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # non-grouped, non-faceted, non-density, histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }})) +
                ggplot2::geom_histogram(
                    show.legend = FALSE,
                    color = color_col,
                    binwidth = binwidth,
                    ggplot2::aes(fill = ..count..)
                )
            if (!isFALSE(isTimeHist)) {
                plot <- plot +
                    ggplot2::scale_x_datetime(
                        date_breaks = date_breaks,
                        date_labels = date_labels,
                        date_minor_breaks = date_minor_breaks,
                        guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
                        sec.axis = ggplot2::sec_axis(
                            ~ .,
                            breaks = quartiles,
                            labels = scales::date_format(qformat),
                            guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                        )
                    )
            } else {
                plot <- plot +
                    ggplot2::scale_x_continuous(
                        breaks = breaks,
                        limits = limits,
                        guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
                        sec.axis = ggplot2::sec_axis(
                            ~ .,
                            breaks = quartiles,
                            labels = scales::label_number(),
                            guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                        )
                    )
            }
            plot <- plot +
                ggplot2::scale_y_continuous() +
                ggplot2::scale_fill_gradient(low = low, high = high) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::geom_vline(
                    data = data.frame(q = quartiles),
                    ggplot2::aes(xintercept = q, color = factor(q)),
                    linetype = "dashed",
                    size = 1
                ) +
                ggplot2::scale_color_manual(
                    values = c("orange", "green", "purple"),
                    labels = c("25th", "50th", "75th")
                )
        }
        # non-grouped, non-faceted, density, histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = ..density..)) +
                ggplot2::geom_histogram(color = color_col,
                                        bins = bins,
                                        breaks = breaks) +
                ggplot2::geom_density(color = density_color,
                                      fill = density_fill,
                                      alpha = density_alpha) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{ x_col }})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, density, histogram ----
        else {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                             fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    bins = bins,
                    breaks = histo_breaks
                ) +
                ggplot2::geom_density(color = density_color,
                                      fill = density_fill,
                                      alpha = density_alpha) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{ x_col }})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # other misc plot types ----
    } else {
        # pROC objects ----
        if (!isFALSE(isROC)) {
            plot <- data |>
                pROC::ggroc(aes = "linetype", color = roc_color)
        }
        # time-series data ----
        # grouped, non-faceted, density, non-histogram
        if (!isFALSE(isTime)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                             fill = {{ group_col }})) +
                ggplot2::geom_density(alpha = density_alpha, color = color_col) +
                ggplot2::scale_x_datetime(
                    date_breaks = date_breaks,
                    date_labels = date_labels,
                    date_minor_breaks = date_minor_breaks,
                    labels = labels,
                    guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
    }
    # For the rest of the otherwise likely duplicated plot settings ----
    plot <- plot +
        ggplot2::labs(title = title, subtitle = subtitle, x = x_label, y = y_label) +
        ggplot2::theme(
            panel.background = ggplot2::element_rect(fill = bg_color, color = NA),
            plot.background = ggplot2::element_rect(fill = bg_color, color = NA),
            text = ggplot2::element_text(color = text_color),
            panel.grid = ggplot2::element_blank(),
            axis.title.x = ggplot2::element_text(margin = grid::unit(c(5, 5, 5, 5), "mm")),
            axis.title.y = ggplot2::element_text(margin = grid::unit(c(5, 5, 5, 5), "mm")),
            axis.text.x = ggplot2::element_text(color = "Snow", margin = grid::unit(c(1, 1, 1, 1), "mm")),
            axis.text.y = ggplot2::element_text(color = "Snow", margin = grid::unit(c(2, 2, 2, 2), "mm")),
            axis.ticks = ggplot2::element_line(color = axis_ticks_color),
            axis.ticks.y = ggplot2::element_blank(),
            panel.grid.major.x = ggplot2::element_line(color = grid_color, linetype = "dotted"),
            panel.grid.major.y = ggplot2::element_line(color = grid_color),
            legend.background = ggplot2::element_rect(fill = bg_color),
            legend.title = ggplot2::element_blank()
        )
}

Performs many of the more backend query transformations I needed for this document.

# Author: Eric Mossotti
# CC BY-SA
# ----
# Programmatic db extraction and df transformation, utilizing tidy evaluation.
#
# Allows one to specify which columns to select, which columns to group by for 
# counting, and which column (if any) to transform. It be used for a variety of 
# similar data transformation tasks.
# 
# Results can be weighted to help certain modeling functions run faster while 
# not altering the results one would expect from using non-aggregated datasets 
# as input.
# ---- 
transformData <- function(conn,
                          path,
                          df,
                          select_cols,
                          group_cols = NULL,
                          binary_col = NULL,
                          ntile_col = NULL,
                          pred_col = NULL,
                          zero_val = NULL,
                          one_val = NULL,
                          qtile_levels = NULL,
                          doWeights = FALSE,
                          doQuantile = FALSE,
                          isDF = FALSE) {
    # Weight the data? ----
    if (isTRUE(doWeights)) {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::collect()
        }
    } else {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(!!select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(!!select_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(select_cols))) |>
                dplyr::collect()
        }
    }
    # Do we want to transform a column to binary for modeling?
    if (!is.null(binary_col) &&
        binary_col %in% names(freq_data)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                # Tidy-selects data with ':=' and bang-bang operators
                !!binary_col := factor(.data[[binary_col]],
                                       levels = c(zero_val, one_val))
            )
    }
    # Do we want to predict anything?
    if (isTRUE(doQuantile)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                !!ntile_col := dplyr::ntile(.data[[pred_col]], n = 4),
                !!ntile_col := factor(.data[[ntile_col]],
                                      levels = c(1, 2, 3, 4),
                                      labels = as.vector(qtile_levels))
            )
    }
    
    return(freq_data)
}

Helpfully, this contributes to the Chi-Squared tables by allowing me to add the chi-squared statistic and degrees of freedom information in the footnotes.

# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# The code for returning chi-square test results as a tibble for use in tables.
# ----
chisqTest <- function(data, variable, by) {
    test_result <- chisq.test(x = data[[variable]], y = data[[by]]) |>
        broom::tidy() |>
        dplyr::select(statistic, parameter, p.value)
    
    return(test_result)
}

Scripts used frequently in this section

3.1 Methods

This section explains certain methodologies that might be less obvious to the reader.

3.1.1 Chi-Square

Set up

Null Hypothesis (H0) There is no association between membership status and travel behavior.

Alternative Hypothesis (H1) There is an association between membership status and travel behavior.

Significance level, \((\alpha) = 0.05\)

Collect data

Tables are transformed and stored for later use. Where needed, the appropriate table is extracted from the database for analysis.

Expected frequencies

\(\text Expected_{for\ each\ cell} = (\text Row\ Total \times Column\ Total) / \text Grand\ Total\)

Chi-square statistic

\(\chi^2=\sum[(O-E)^2/E]\)

\(O\) - Observed frequency

\(E\) - Expected frequency

\(\sum\) - Sum over all cells in the table

Degrees of freedom

\(\text df = (rows - 1) \times (columns - 1)\)

The critical value

The critical value can be determined from the chi-square distribution table or software using \(α\) and the df.

The p-value

The p-value is calculated from the chi-square statistic and the df. When calculating the p-value, statistical software or functions (like the ones later in this report) internally use this integral:

\(\text p-value = \int[\chi^2\ \ to\ \infty]\ f(x,\ df)dx\)

\(\chi^2\) is the calculated chi-square statistic (the lower bound of integration)

\(\infty\) is the upper bound of integration

\(f(x,\ df)\) is the chi-square PDF (probability density function)

\(x\) is the variable of integration

Interpreting results

Our calculated chi-square statistic is compared to the critical value (greater or less than). We then would be checking for consistency by comparing the p-value to the significance level value. Then, we either reject, where \(p < \alpha\) , or fail to reject, where \(p>\alpha\) , the null hypothesis. This informs us if there is a significant association between the membership status and the given travel behavior parameter.

Calculation Process

The software used to generate the Chi-Square tables in this report makes it so a human doesn’t need to work with the integrals directly. The software calculates the chi-square statistic and df from the data and uses that information as input to the chi-square calculator, which handles the integration to return the p-value.

3.1.2 Binary Logistic Regression

Model Fitting

The logistic regression model is fitted using Maximum Likelihood Estimation (MLE) through an Iteratively Reweighted Least Squares (IRLS) algorithm. This process determines the optimal coefficients \((\beta)\) for the predictors.

Odds Ratio Calculation

For each predictor variable, the odds ratio is calculated as \(exp(β)\), where \(\beta\) is the coefficient estimated by the model. This represents the change in odds of the outcome for a one-unit increase in the predictor.

Statistical Significance

p-values are computed for each predictor using the Wald test. This test calculates a z-statistic, \((\beta/SE(\beta))\), and compares it to a standard normal distribution to determine the probability of observing such a result under the null hypothesis.

Results Compiled As

Predictor variables (Characteristics)
Odds Ratios (OR)
p-values
Sometimes confidence intervals for the ORs (not shown in this example)

Reference Category

For categorical predictors, one category is set as the reference (marked with a dash in the OR column). This dashed-out value is very close to 1.00. This process transforms the raw model output into an interpretable format, allowing readers to quickly assess the direction, magnitude, and significance of each predictor’s effect on the outcome variable.

3.2 Membership

Database Operations

Table Preview

Write to .db

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/membership.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(member_casual) |>
        dplyr::arrange(member_casual) |>
        dplyr::collect() |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/membership.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/membership.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 6: Kable output

member_casual
casual
casual
casual
casual
casual
casual

Code

transformData(conn = dbconn, path = "db/membership.db", select_cols = "member_casual",
    group_cols = "member_casual", doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("membership status", "occurrences"), location_list = c("member_casual",
        "n"), source_note = gt::md("**Source**: `db/membership.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)

Table 7: Tabularizing membership distribution across all observations

casual¹	² 1260621
member	2636777
Source: `db/membership.db`
¹ membership status
² occurrences

Overall

Table 7 and Figure 2 give an idea of the overall member to casual travel distribution.

Code

gplot <- transformData(conn = dbconn, path = "db/membership.db", select_cols = "member_casual",
    group_cols = "member_casual", doWeights = TRUE) |>
    plotter(x_col = member_casual, y_col = n, geomType = "column", title = "How often do members travel?",
        x_label = "Rider Groups", y_label = "n")

gplot

Figure 2: Plotting the overall membership distribution

3.3 Cycle Types

Database Operations

Table Preview

Write bType.db to the database.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(rideable_type, member_casual) |>
        dplyr::arrange(rideable_type, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(rideable_type = forcats::as_factor(rideable_type)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/bType.db", overwrite = TRUE)
}

Transform and write as a weighted binary table for modeling.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType_wb.db"))) {
    transformData(conn = dbconn, path = "db/bType.db", select_cols = c("rideable_type",
        "member_casual"), group_cols = c("rideable_type", "member_casual"), binary_col = "member_casual",
        zero_val = "casual", one_val = "member", doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/bType_wb.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/bType.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/bType_wb.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 8: Kable output

rideable_type	member_casual
classic_bike	casual
classic_bike	casual
classic_bike	casual
classic_bike	casual
classic_bike	casual
classic_bike	casual

Table 9: Kable output

rideable_type	member_casual	n
classic_bike	casual	739475
classic_bike	member	1714250
electric_bike	casual	521146
electric_bike	member	922527

Code

transformData(conn = dbconn, path = "db/bType.db", select_cols = "rideable_type",
    group_cols = "rideable_type", doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("bicycle type", "occurrences"), location_list = list("rideable_type",
        "n"), source_note = gt::md("**Source**: `db/bType.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)

Table 10: The overall distribution of bicycle choice

classic_bike¹	² 2453725
electric_bike	1443673
Source: `db/bType.db`
¹ bicycle type
² occurrences

Code

transformData(
conn = dbconn, 
path = "db/bType.db", 
select_cols = c("rideable_type", "member_casual"),
group_cols = c("rideable_type", "member_casual"),
doWeights = TRUE
) |>
tabler(
title = NULL, 
groupName = "rideable_type", 
note_list =  list("membership status", "occurrences"),
location_list = list("member_casual", "n"),
source_note = gt::md("**Source**: `db/bType.db`"),
noteRows = TRUE,
#hide_column_labels = TRUE,
row_index = 1,
isStub = TRUE,
stub_label = "Type",
stub_note = "bicycle type"
) |>       
gt::cols_label(member_casual = " ", n = " ")

Table 11: The membership distribution of bicycle choice

Type¹
classic_bike	casual²	³ 739475
classic_bike	member	1714250
electric_bike	casual	521146
electric_bike	member	922527
Source: `db/bType.db`
¹ bicycle type
² membership status
³ occurrences

Overall
Comparative

Table 10 and Figure 3 present the overall frequency distribution for choice of bicycle.

Code

gplot <- transformData(conn = dbconn, path = "db/bType.db", select_cols = "rideable_type",
    group_cols = "rideable_type", doWeights = TRUE) |>
    plotter(x_col = rideable_type, y_col = n, geomType = "column", title = paste0("Which bicycles are preferred?"),
        x_label = "Type", y_label = "n")

gplot

Figure 3: The overall distribution of trips taken by bicycle choice

Likewise, Table 11 and Figure 4 summarize the choice of bicycle with the addtion of a grouped sum by the membership status.

Code

gplot <- transformData(conn = dbconn, path = "db/bType.db", select_cols = c("rideable_type",
    "member_casual"), group_cols = c("rideable_type", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which bicycles are members choosing?"), x_label = "Type",
        y_label = "n", x_col = rideable_type, y_col = n, group_col = member_casual,
        geomType = "column", is_colGroup = TRUE, color_col = "black", colPosition = "dodge",
        colGroup_palette = "Paired")

gplot

Figure 4: The membership distribution of bicycle choice

Table 12 presents a Chi-Square analysis of bicycle type usage among casual users and members. In summary, members show a higher preference for classic bikes (65%) compared to casual users (59%). Casual users have a higher proportion of electric bike usage (41%) compared to members (35%).

There is a statistically significant association between bicycle type and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between bicycle type and membership status. This suggests that the choice of bicycle type is not independent of membership status.

The large \(\chi^2\) value (14762.37) with just 1 degree of freedom (calculated as [rows - 1] * [columns - 1])** results in the very small p-value (< 0.001). This combination strongly suggests that the difference in bicycle preference between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total users), even small differences can produce statistically significant results.

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.

data_tibble <- dplyr::tbl(dbconn, "db/bType.db") |>
    dplyr::select(rideable_type, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "rideable_type", by = "member_casual")

Code

chi_table <- tabler(title = gt::md("Chi-Square:<br>The signficance of bicycle choice and membership"),
    source_note = gt::md("**Source**: `db/bType.db`"), label = list(rideable_type = "Bicycle Type"),
    by = member_casual, isSummary = TRUE, chiVar = "rideable_type", chiBy = "member_casual",
    tbl_name = data_tibble, chi_result = chiResult)

chi_table

Table 12: Testing indepdence of bicycle choice and membership variables

Characteristic	casual N = 1,260,621¹	member N = 2,636,777¹	p-value²
Chi-Square: The signficance of bicycle choice and membership
Bicycle Type			<0.001
classic_bike	739,475 (59%)	1,714,250 (65%)
electric_bike	521,146 (41%)	922,527 (35%)
Source: `db/bType.db`
¹ n (%); χ² = 14762.37; df = 1
² Pearson’s Chi-squared test

Table 13 presents the results of a binary logistic regression analyzing the relationship between bicycle type and membership status. The analysis compares classic bikes and electric bikes, with classic bikes serving as the reference category.

The odds of membership for users of electric bikes were 0.76 times the odds for users of classic bikes. The difference in membership likelihood between electric and classic bike users is highly statistically significant (p < 0.001).

Predicting the log-odds of being a member versus being a casual user.

model <- dplyr::tbl(dbconn, "db/bType_wb.db") |>
    glm(formula = member_casual ~ rideable_type, weights = n, family = binomial)

Code

regression_tbl <- model |>
    gtsummary::tbl_regression(label = list(rideable_type = "Bicycle Choice"), conf.int = FALSE,
        exponentiate = TRUE)

regression_tbl |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' bicycle choices"),
        source_note = gt::md("**Source**: `db/bType_wb.db`"), isBinary = TRUE)

Table 13: Modeling the probability of member travel by the choice of bicycle.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of members’ bicycle choices
Bicycle Choice
classic_bike	—
electric_bike	0.76	<0.001
Source: `db/bType_wb.db`
¹ OR = Odds Ratio

3.4 Duration

Database Operations

Table Preview

Write to .db.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(trip_time, member_casual) |>
        dplyr::arrange(trip_time, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(trip_time = round(trip_time, digits = 2), mins = round(trip_time,
            digits = 0), mins = forcats::as_factor(mins)) |>
        dplyr::arrange(trip_time, member_casual) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/duration.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration_wq.db"))) {
    transformData(conn = dbconn, path = "db/duration.db", select_cols = c("trip_time",
        "member_casual"), group_cols = c("trip_time", "member_casual"), binary_col = "member_casual",
        pred_col = "trip_time", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (1.02 - 5.73]", "Q2 (5.73 - 9.55]", "Q3 (9.55 - 16.13]",
            "Q4 (16.13 - 475.22]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/duration_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/duration_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 14: Kable output

trip_time	member_casual	mins
1.02	casual	1
1.02	casual	1
1.02	casual	1
1.02	casual	1
1.02	member	1
1.02	member	1

Table 15: Kable output

trip_time	member_casual	n	quartile
1.02	casual	4	Q1 (1.02 - 5.73]
1.02	member	153	Q1 (1.02 - 5.73]
1.03	casual	7	Q1 (1.02 - 5.73]
1.03	member	149	Q1 (1.02 - 5.73]
1.05	casual	8	Q1 (1.02 - 5.73]
1.05	member	182	Q1 (1.02 - 5.73]

Code

transformData(conn = dbconn, path = "db/duration.db", select_cols = "mins", group_cols = "mins",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("trip duration", "occurrences"), location_list = list("mins",
        "n"), source_note = gt::md("**Source**: `db/duration.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)

Table 16: The overall distribution of trip durations

1¹	² 11712
2	109700
3	212213
4	282122
5	288960
6	291295
7	264824
8	252680
9	221983
10	206307
11	176905
12	162289
13	139963
14	128094
15	109254
16	100585
17	85815
18	79498
19	69038
20	63928
21	55184
22	51877
23	45435
24	42596
25	37041
26	34652
27	30606
28	28605
29	25154
30	23712
31	21143
32	19696
33	17808
34	16396
35	14606
36	13646
37	12219
38	11851
39	10433
40	9795
41	8683
42	8167
43	7069
44	6685
45	5842
46	5360
47	4589
48	4425
49	3936
50	3726
51	3274
52	3090
53	2859
54	2790
55	2458
56	2351
57	2136
58	2068
59	1897
60	1876
61	1626
62	1578
63	1417
64	1455
65	1256
66	1236
67	1115
68	1167
69	984
70	961
71	908
72	872
73	829
74	835
75	755
76	745
77	657
78	689
79	622
80	632
81	585
82	566
83	500
84	479
85	443
86	478
87	432
88	382
89	416
90	388
91	337
92	354
93	301
94	309
95	260
96	273
97	238
98	266
99	229
100	217
101	241
102	245
103	203
104	215
105	190
106	190
107	169
108	167
109	154
110	148
111	161
112	161
113	160
114	129
115	125
116	147
117	153
118	120
119	106
120	96
121	116
122	85
123	104
124	86
125	99
126	119
127	87
128	84
129	94
130	74
131	62
132	90
133	77
134	64
135	66
136	67
137	84
138	72
139	62
140	60
141	53
142	36
143	46
144	37
145	55
146	44
147	65
148	44
149	54
150	39
151	47
152	43
153	45
154	37
155	36
156	42
157	29
158	33
159	32
160	33
161	27
162	37
163	28
164	26
165	35
166	31
167	30
168	38
169	12
170	26
171	19
172	25
173	20
174	20
175	30
176	21
177	11
178	18
179	13
180	19
181	17
182	13
183	12
184	11
185	17
186	9
187	9
188	10
189	13
190	13
191	5
192	14
193	9
194	11
195	7
196	5
197	3
198	8
199	6
200	8
201	3
202	7
203	8
204	5
205	3
206	4
207	7
208	9
209	4
210	1
211	4
212	5
213	3
214	3
215	1
216	5
217	4
218	1
219	7
220	2
221	1
222	3
223	1
224	2
225	5
226	3
227	6
228	2
229	2
230	2
231	1
232	2
233	2
234	3
235	2
236	3
238	3
239	4
240	1
241	1
242	1
243	2
244	1
246	1
247	1
248	4
249	3
250	3
252	3
254	1
256	4
257	1
259	1
260	2
261	1
262	1
263	1
264	1
265	3
266	7
267	3
268	1
269	3
270	1
271	1
272	1
273	2
274	3
275	1
279	2
281	2
284	1
286	1
287	1
292	1
293	1
294	1
295	1
299	1
301	1
303	1
305	1
312	1
313	1
316	1
320	2
321	1
322	1
325	1
326	1
327	1
333	1
334	1
335	1
336	1
340	4
341	1
346	1
349	1
368	1
369	1
377	1
398	1
406	1
424	1
471	1
475	1
Source: `db/duration.db`
¹ trip duration
² occurrences

Code

transformData(conn = dbconn, path = "db/duration.db", select_cols = c("mins", "member_casual"),
    group_cols = c("mins", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "mins", label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/duration.db`"),
        isStub = TRUE, stub_label = "Time", stub_note = "trip duration (minutes)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 17: The membership distribution of trip durations

Time¹
1	casual²	³ 880
1	member	10832
2	casual	14808
2	member	94892
3	casual	39423
3	member	172790
4	casual	62621
4	member	219501
5	casual	71625
5	member	217335
6	casual	78289
6	member	213006
7	casual	76446
7	member	188378
8	casual	76284
8	member	176396
9	casual	69783
9	member	152200
10	casual	66354
10	member	139953
11	casual	58888
11	member	118017
12	casual	55308
12	member	106981
13	casual	48937
13	member	91026
14	casual	45360
14	member	82734
15	casual	39607
15	member	69647
16	casual	36901
16	member	63684
17	casual	32199
17	member	53616
18	casual	30249
18	member	49249
19	casual	27066
19	member	41972
20	casual	25171
20	member	38757
21	casual	22183
21	member	33001
22	casual	21114
22	member	30763
23	casual	18764
23	member	26671
24	casual	17793
24	member	24803
25	casual	15523
25	member	21518
26	casual	14652
26	member	20000
27	casual	13050
27	member	17556
28	casual	12451
28	member	16154
29	casual	10983
29	member	14171
30	casual	10291
30	member	13421
31	casual	9203
31	member	11940
32	casual	8770
32	member	10926
33	casual	8081
33	member	9727
34	casual	7357
34	member	9039
35	casual	6715
35	member	7891
36	casual	6414
36	member	7232
37	casual	5827
37	member	6392
38	casual	5600
38	member	6251
39	casual	5002
39	member	5431
40	casual	4787
40	member	5008
41	casual	4331
41	member	4352
42	casual	4117
42	member	4050
43	casual	3691
43	member	3378
44	casual	3628
44	member	3057
45	casual	3229
45	member	2613
46	casual	3166
46	member	2194
47	casual	2833
47	member	1756
48	casual	2795
48	member	1630
49	casual	2554
49	member	1382
50	casual	2470
50	member	1256
51	casual	2299
51	member	975
52	casual	2170
52	member	920
53	casual	1971
53	member	888
54	casual	2040
54	member	750
55	casual	1764
55	member	694
56	casual	1742
56	member	609
57	casual	1561
57	member	575
58	casual	1556
58	member	512
59	casual	1492
59	member	405
60	casual	1468
60	member	408
61	casual	1285
61	member	341
62	casual	1263
62	member	315
63	casual	1115
63	member	302
64	casual	1115
64	member	340
65	casual	983
65	member	273
66	casual	1022
66	member	214
67	casual	907
67	member	208
68	casual	954
68	member	213
69	casual	815
69	member	169
70	casual	789
70	member	172
71	casual	758
71	member	150
72	casual	730
72	member	142
73	casual	692
73	member	137
74	casual	692
74	member	143
75	casual	636
75	member	119
76	casual	635
76	member	110
77	casual	556
77	member	101
78	casual	586
78	member	103
79	casual	538
79	member	84
80	casual	531
80	member	101
81	casual	484
81	member	101
82	casual	480
82	member	86
83	casual	431
83	member	69
84	casual	413
84	member	66
85	casual	377
85	member	66
86	casual	407
86	member	71
87	casual	380
87	member	52
88	casual	324
88	member	58
89	casual	361
89	member	55
90	casual	331
90	member	57
91	casual	301
91	member	36
92	casual	302
92	member	52
93	casual	268
93	member	33
94	casual	268
94	member	41
95	casual	232
95	member	28
96	casual	237
96	member	36
97	casual	212
97	member	26
98	casual	238
98	member	28
99	casual	204
99	member	25
100	casual	189
100	member	28
101	casual	213
101	member	28
102	casual	212
102	member	33
103	casual	175
103	member	28
104	casual	195
104	member	20
105	casual	164
105	member	26
106	casual	170
106	member	20
107	casual	150
107	member	19
108	casual	140
108	member	27
109	casual	132
109	member	22
110	casual	129
110	member	19
111	casual	145
111	member	16
112	casual	143
112	member	18
113	casual	143
113	member	17
114	casual	113
114	member	16
115	casual	118
115	member	7
116	casual	132
116	member	15
117	casual	137
117	member	16
118	casual	110
118	member	10
119	casual	92
119	member	14
120	casual	88
120	member	8
121	casual	102
121	member	14
122	casual	76
122	member	9
123	casual	95
123	member	9
124	casual	79
124	member	7
125	casual	86
125	member	13
126	casual	108
126	member	11
127	casual	76
127	member	11
128	casual	78
128	member	6
129	casual	84
129	member	10
130	casual	68
130	member	6
131	casual	54
131	member	8
132	casual	85
132	member	5
133	casual	64
133	member	13
134	casual	56
134	member	8
135	casual	61
135	member	5
136	casual	57
136	member	10
137	casual	74
137	member	10
138	casual	68
138	member	4
139	casual	56
139	member	6
140	casual	56
140	member	4
141	casual	48
141	member	5
142	casual	33
142	member	3
143	casual	42
143	member	4
144	casual	35
144	member	2
145	casual	45
145	member	10
146	casual	41
146	member	3
147	casual	62
147	member	3
148	casual	39
148	member	5
149	casual	52
149	member	2
150	casual	37
150	member	2
151	casual	45
151	member	2
152	casual	41
152	member	2
153	casual	41
153	member	4
154	casual	34
154	member	3
155	casual	34
155	member	2
156	casual	35
156	member	7
157	casual	24
157	member	5
158	casual	27
158	member	6
159	casual	28
159	member	4
160	casual	29
160	member	4
161	casual	22
161	member	5
162	casual	28
162	member	9
163	casual	26
163	member	2
164	casual	23
164	member	3
165	casual	29
165	member	6
166	casual	28
166	member	3
167	casual	25
167	member	5
168	casual	33
168	member	5
169	casual	11
169	member	1
170	casual	25
170	member	1
171	casual	16
171	member	3
172	casual	22
172	member	3
173	casual	19
173	member	1
174	casual	17
174	member	3
175	casual	28
175	member	2
176	casual	19
176	member	2
177	casual	10
177	member	1
178	casual	16
178	member	2
179	casual	13
180	casual	16
180	member	3
181	casual	13
181	member	4
182	casual	11
182	member	2
183	casual	10
183	member	2
184	casual	8
184	member	3
185	casual	13
185	member	4
186	casual	8
186	member	1
187	casual	7
187	member	2
188	casual	9
188	member	1
189	casual	10
189	member	3
190	casual	10
190	member	3
191	casual	5
192	casual	13
192	member	1
193	casual	8
193	member	1
194	casual	9
194	member	2
195	casual	6
195	member	1
196	casual	5
197	casual	3
198	casual	7
198	member	1
199	casual	5
199	member	1
200	casual	8
201	casual	3
202	casual	6
202	member	1
203	casual	8
204	casual	4
204	member	1
205	casual	3
206	casual	4
207	casual	6
207	member	1
208	casual	8
208	member	1
209	casual	4
210	casual	1
211	casual	3
211	member	1
212	casual	4
212	member	1
213	casual	3
214	casual	2
214	member	1
215	casual	1
216	casual	5
217	casual	2
217	member	2
218	casual	1
219	casual	7
220	casual	2
221	casual	1
222	casual	3
223	casual	1
224	casual	2
225	casual	5
226	casual	3
227	casual	6
228	casual	2
229	casual	1
229	member	1
230	casual	2
231	casual	1
232	casual	2
233	casual	2
234	casual	2
234	member	1
235	casual	1
235	member	1
236	casual	2
236	member	1
238	casual	1
238	member	2
239	casual	2
239	member	2
240	casual	1
241	casual	1
242	casual	1
243	casual	2
244	casual	1
246	member	1
247	casual	1
248	casual	3
248	member	1
249	casual	3
250	casual	3
252	casual	3
254	casual	1
256	casual	3
256	member	1
257	member	1
259	casual	1
260	casual	2
261	casual	1
262	member	1
263	casual	1
264	casual	1
265	casual	3
266	casual	5
266	member	2
267	casual	2
267	member	1
268	member	1
269	casual	3
270	casual	1
271	member	1
272	casual	1
273	casual	2
274	casual	3
275	casual	1
279	casual	1
279	member	1
281	casual	2
284	casual	1
286	member	1
287	casual	1
292	casual	1
293	member	1
294	casual	1
295	casual	1
299	casual	1
301	casual	1
303	casual	1
305	casual	1
312	casual	1
313	casual	1
316	casual	1
320	casual	1
320	member	1
321	casual	1
322	member	1
325	member	1
326	casual	1
327	member	1
333	member	1
334	casual	1
335	casual	1
336	member	1
340	casual	1
340	member	3
341	member	1
346	casual	1
349	casual	1
368	casual	1
369	casual	1
377	member	1
398	casual	1
406	member	1
424	member	1
471	casual	1
475	casual	1
Source: `db/duration.db`
¹ trip duration (minutes)
² membership status
³ occurrences

Overall
Comparative

Table 16 and Figure 5 give an idea of the overall travel duration distribution. The data were rounded to the nearest minute for these figures.

Code

gplot <- transformData(conn = dbconn, path = "db/duration.db", select_cols = "mins",
    group_cols = "mins", doWeights = TRUE) |>
    plotter(x_col = as.integer(mins), y_col = n, geomType = "column", title = paste0("How long are the trips?"),
        x_label = "Minutes", y_label = "n", color_col = "black") + ggplot2::scale_x_continuous(limits = c(0,
    60), breaks = seq(0, 60, by = 5), guide = ggplot2::guide_axis(n.dodge = 1, angle = 45))

gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

Figure 5: The overall distribution of trip durations

Likewise, Table 17 and Figure 6 summarize the travel duration distribution, where frequencies were summed by membership status.

Code

gplot <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(mins, member_casual) |>
    dplyr::filter(as.integer(mins) <= 100) |>
    dplyr::collect() |>
    transformData(conn = NULL, path = NULL, select_cols = c("mins", "member_casual"),
        group_cols = c("mins", "member_casual"), doWeights = TRUE, isDF = TRUE) |>
    plotter(title = paste0("How long do members ride for?"), x_label = "Minutes",
        y_label = "n", x_col = mins, y_col = n, group_col = member_casual, geomType = "column",
        is_colGroup = TRUE, colPosition = ggplot2::position_stack(reverse = TRUE),
        color_col = "black") + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45), limits = forcats::as_factor(seq(0, 60)), breaks = forcats::as_factor(seq(0,
    60, by = 5)))

gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

Figure 6: The membership distribution of trip durations

Table 18 sheds light on the variability, range, and quartile information about the duration data.

Code

gtTable <- dplyr::tbl(dbconn, "db/duration.db") |>
dplyr::select(mins, member_casual) |>
dplyr::mutate(
mins = as.numeric(mins)
) |>
dplyr::collect() |>
gtsummary::tbl_summary(
by = member_casual,
type = mins ~ "continuous2",
label = list(mins ~ "Duration (mins)"),
digits = list(
mins ~ c(1, 1)),
statistic = 
gtsummary::all_continuous() ~ c(
"{median} ({p25}, {p75})", 
"{mean} ({sd})", 
"{min}, {max}")
) |>
gtsummary::italicize_levels() |>
tabler(
title = gt::md("Summary Stats:<br>Member travel duration"),
source_note = gt::md("**Source**: `db/duration.db`"),
isBinary = TRUE
)

gtTable

Table 18: The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel duration.

Summary Stats: Member travel duration
Characteristic	casual N = 1,260,621	member N = 2,636,777
Duration (mins)
Median (Q1, Q3)	12.0 (7.0, 20.0)	9.0 (5.0, 14.0)
Mean (SD)	16.5 (15.8)	11.3 (9.2)
Min, Max	1.0, 475.0	1.0, 424.0
Source: `db/duration.db`

To accompany the summary table, two plots, Figure 7 and Figure 8, present a highly granular view of the duration dataset (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.

Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(trip_time, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(trip_time, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$trip_time, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$trip_time, probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel duration"),
        x_label = "Minutes", y_label = "n", x_col = trip_time, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", limits = c(0, 100), breaks = seq(0,
            100, by = 5), binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)), quartiles = quartiles_casual)

gplot

Figure 7: A more detailed look that illustrates the quartile distribution of casual travel duration.

Code

# tried 2.5 for binwidth but not sure if want to keep

gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' travel duration"),
        x_label = "Minutes", y_label = "n", x_col = trip_time, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", limits = c(0, 100), breaks = seq(0,
            100, by = 5), binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)), quartiles = quartiles_member)

gplot

Figure 8: A more detailed look that illustrates the quartile distribution of member travel duration.

Supplementing the histogram, Figure 9 shows a density plot comparing the duration of trips for casual users and members. The x-axis represents time in minutes, limited to a range of 0 to 100 for presentation purposes, while the y-axis shows the density (a measure of relative frequency).

The member group (darker blue) has a higher and narrower peak compared to the casual group (lighter blue). This indicates that members tend to have more concentrated distribution of a session duration around their most common length. The casual group appears to have a slightly fatter tail, extending further to the right than the member group. This suggests that casual users might occasionally have longer sessions than members, even if it’s less common.

Both groups exhibit right-skewed distributions, with a peak near the left side and a long tail extending to the right. This suggests that for both groups, a shorter duration is more common, while a longer duration occurs less frequently but can extend quite far. They seem to have their peak density around 5-10 minutes, with members peaking slightly earlier than casual users. There is significant overlap between the two distributions, indicating that while there are differences, there is also considerable similarity in duration patterns between casual users and members.

Code

gplot <- transformData(conn = dbconn, path = "db/duration.db", select_cols = c("trip_time",
    "member_casual")) |>
    plotter(title = "Visualizing relative frequency\n across travel duration groups",
        x_label = paste0("Minutes"), x_col = trip_time, group_col = member_casual,
        geomType = "column", angle = 45, color_col = "black", density_alpha = 0.75,
        isDensity = TRUE, is_colGroup = TRUE, breaks = seq(0, 100, by = 5), limits = c(0,
            100))

gplot

Figure 9: Visualizing the probability-density distribution between duration and membership variables.

Table 19 presents the results of a binary logistic regression model, analyzing the relationship between ride duration and membership status. The analysis divides ride duration into quartiles, with Q1 (1.02 - 5.73 minutes) serving as the reference category.

Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other duration quartiles ( p < 0.001 for all comparisons). With Q2 (5.73 - 9.55 minutes), the odds of membership were 0.38 times as high as with Q1. This indicates a substantial decrease (62%) in the likelihood of membership for slightly longer rides. With Q3 (9.55 - 16.13 minutes), the odds of membership were 0.08 times as high as in Q1. This shows a dramatic decrease (92%) in membership likelihood for medium-length rides. With Q4 (16.13 - 475.22 minutes), the odds of membership were 0.06 times as high as in Q1. This represents an even more pronounced decrease (94%) in membership likelihood for the longest rides.

Query, process, and create model R object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/duration_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Duration Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel durations"),
        source_note = gt::md("**Source**: `db/duration_wq.db`"), isBinary = TRUE)

Table 19: Modeling the probability of members’ travel by the duration.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of members’ travel durations
Duration Ranges
Q1 (1.02 - 5.73]	—
Q2 (5.73 - 9.55]	0.38	<0.001
Q3 (9.55 - 16.13]	0.08	<0.001
Q4 (16.13 - 475.22]	0.06	<0.001
Source: `db/duration_wq.db`
¹ OR = Odds Ratio

3.5 Month

Database Operations

Table Preview

Write moy.db to the database.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(member_casual = factor(member_casual, levels = c("casual",
            "member")), abbMonths = lubridate::month(started_at, label = TRUE, abbr = TRUE),
            abbMonths = forcats::as_factor(abbMonths)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/moy.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy_wq.db"))) {
    transformData(conn = dbconn, path = "db/moy.db", select_cols = c("started_at",
        "member_casual"), group_cols = c("started_at", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (Jan 01 - May 20]", "Q2 (May 20 - Jul 21]", "Q3 (Jul 21 - Sep 18]",
            "Q4 (Sep 18 - Dec 31]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/moy_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/moy_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 20: Kable output for months of the year

started_at	member_casual	abbMonths
2023-01-01 00:04:07	casual	Jan
2023-01-01 00:04:54	member	Jan
2023-01-01 00:05:43	member	Jan
2023-01-01 00:09:33	member	Jan
2023-01-01 00:09:53	member	Jan
2023-01-01 00:10:45	member	Jan

Table 21: Kable output for weighted months of the year.

started_at	member_casual	n	quartile
2023-01-01 00:04:07	casual	1	Q1 (Jan 01 - May 20]
2023-01-01 00:04:54	member	1	Q1 (Jan 01 - May 20]
2023-01-01 00:05:43	member	1	Q1 (Jan 01 - May 20]
2023-01-01 00:09:33	member	1	Q1 (Jan 01 - May 20]
2023-01-01 00:09:53	member	1	Q1 (Jan 01 - May 20]
2023-01-01 00:10:45	member	1	Q1 (Jan 01 - May 20]

Code

transformData(conn = dbconn, path = "db/moy.db", select_cols = "abbMonths", group_cols = "abbMonths",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("month of year", "occurrences"), location_list = c("abbMonths",
        "n"), source_note = gt::md("**Source**: `db/moy.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)

Table 22: The overall distribution of trips taken by month

Jan¹	² 136886
Feb	136818
Mar	183129
Apr	286343
May	408373
Jun	474756
Jul	502519
Aug	519423
Sep	461164
Oct	374268
Nov	257513
Dec	156206
Source: `db/moy.db`
¹ month of year
² occurrences

Code

transformData(conn = dbconn, path = "db/moy.db", select_cols = c("abbMonths", "member_casual"),
    group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "abbMonths", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/moy.db`"),
        isStub = TRUE, stub_label = "Month", stub_note = "months of the year (abbreviated)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 23: The membership distribution of trips taken by month

Month¹
Jan	casual²	³ 25262
Jan	member	111624
Feb	casual	27111
Feb	member	109707
Mar	casual	39137
Mar	member	143992
Apr	casual	87211
Apr	member	199132
May	casual	139998
May	member	268375
Jun	casual	177745
Jun	member	297011
Jul	casual	194030
Jul	member	308489
Aug	casual	189563
Aug	member	329860
Sep	casual	168881
Sep	member	292283
Oct	casual	114475
Oct	member	259793
Nov	casual	64508
Nov	member	193005
Dec	casual	32700
Dec	member	123506
Source: `db/moy.db`
¹ months of the year (abbreviated)
² membership status
³ occurrences

Overall
Comparative

Table 22 and Figure 10 shows the overall monthly travel distribution.

Code

gplot <- transformData(conn = dbconn, path = "db/moy.db", select_cols = "abbMonths",
    group_cols = "abbMonths", doWeights = TRUE) |>
    plotter(x_col = abbMonths, y_col = n, geomType = "column", title = paste0("Which months do people tend to ride?"),
        x_label = paste0("Months", "\n", "(2023)"), y_label = "n")

gplot

Figure 10: The overall distribution of trips taken by month

Similarly, Table 23 and Figure 11 summarize the monthly travel, where frequencies were summed by membership status.

Code

gplot <- transformData(conn = dbconn, path = "db/moy.db", select_cols = c("abbMonths",
    "member_casual"), group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which months do members tend to travel?"), x_label = "Months",
        y_label = "n", x_col = abbMonths, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE)

gplot

Figure 11: The membership distribution of trips taken by month

Table 24 presents a Chi-Square of monthly travel among members and casual users. In summary, members exhibit higher usage for periods Jan - Mar and Oct - Dec. Both groups prefer the months of Apr - Sep, where casuals show a higher interest, proportionally for their respective group.

There is a statistically significant association between monthly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between monthly travel and membership status. The frequency of monthly travel is not likely to be independent of membership status.

The large \(\chi^2\) value (71802.26) with 11 degrees of freedom (calculated as [rows - 1] [columns - 1])* results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total users), even small differences can produce statistically significant results.

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.

data_tibble <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(abbMonths, member_casual) |>
    dplyr::arrange(abbMonths, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "abbMonths", by = "member_casual")

Code

chi_table <- tabler(title = gt::md("Chi-Square:<br>The signficance of monthly travel on membership"),
    source_note = gt::md("**Source**: `db/moy.db`"), label = list(abbMonths = "Month"),
    by = member_casual, isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)

chi_table

Table 24: Testing the independence between month and membership variables

Characteristic	casual N = 1,260,621¹	member N = 2,636,777¹	p-value²
Chi-Square: The signficance of monthly travel on membership
Month			<0.001
Jan	25,262 (2.0%)	111,624 (4.2%)
Feb	27,111 (2.2%)	109,707 (4.2%)
Mar	39,137 (3.1%)	143,992 (5.5%)
Apr	87,211 (6.9%)	199,132 (7.6%)
May	139,998 (11%)	268,375 (10%)
Jun	177,745 (14%)	297,011 (11%)
Jul	194,030 (15%)	308,489 (12%)
Aug	189,563 (15%)	329,860 (13%)
Sep	168,881 (13%)	292,283 (11%)
Oct	114,475 (9.1%)	259,793 (9.9%)
Nov	64,508 (5.1%)	193,005 (7.3%)
Dec	32,700 (2.6%)	123,506 (4.7%)
Source: `db/moy.db`
¹ n (%); χ² = 71802.26; df = 11
² Pearson’s Chi-squared test

Two plots, Figure 12 and Figure 13, present a highly granular view of the monthly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.

Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' monthly travel"),
        x_label = "Months", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 month", date_labels = "%b",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_casual, qformat = "%b-%d")

gplot

Figure 12: A more detailed look that illustrates the quartile distribution of monthly casual travel.

Code

gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' monthly travel"),
        x_label = "Months", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 month", date_labels = "%b",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_member, qformat = "%b-%d")

gplot

Figure 13: A more detailed look that illustrates the quartile distribution of monthly member travel.

To visualize monthly users through the lens of their respective concentrations, see Figure 14. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the middle of the distribution. The member group has higher peaks near the left and right sides of the distribution. Both of the groups exhibit somewhat normal, multi-modal distributions. They overlap each other significantly, suggesting similarity in travel patterns.

Code

gplot <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross monthly travel groups",
        x_label = paste0("Months"), x_col = started_at, group_col = member_casual,
        geomType = "other", angle = 45, color_col = "black", density_alpha = 0.75,
        isTime = TRUE, date_breaks = "1 month", date_labels = "%b", )

gplot

Figure 14: Visualizing the probability-density distribution of months by membership.

Table 25 presents the results of a binary logistic regression analyzing the relationship between months of the year and membership status. The analysis divides the year into quartiles, with Q1 (January 01 - May 20) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).

With Q2 (May 20 - Jul 21), the odds of membership were 0.57 times as high as in Q1. This indicates a substantial decrease (43%) in the likelihood of membership during late spring and early summer. With Q3 (Jul 21 - Sep 18), the odds of membership were 0.58 times as high as in Q1. This shows a similar decrease (42%) in membership likelihood during late summer and early fall, nearly identical to Q2. And with Q4 (Sep 18 - Dec 31), the odds of membership were 0.87 times as high as in Q1. While still lower than Q1, this represents a less pronounced decrease (13%) in membership likelihood during fall and early winter.

Query, process and create model R object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/moy_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Months Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of monthly member travel"),
        source_note = gt::md("**Source**: `db/moy.db`"), isBinary = TRUE)

Table 25: Modeling the probability of monthly member travel.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of monthly member travel
Months Ranges
Q1 (Jan 01 - May 20]	—
Q2 (May 20 - Jul 21]	0.57	<0.001
Q3 (Jul 21 - Sep 18]	0.58	<0.001
Q4 (Sep 18 - Dec 31]	0.87	<0.001
Source: `db/moy.db`
¹ OR = Odds Ratio

3.6 Day

Database Operations

Table Preview

Write dow.db to the database.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(wkdays = lubridate::wday(started_at, week_start = 7), member_casual = factor(member_casual,
            levels = c("casual", "member")), started_at = update(started_at, year = 2024,
            month = 9, day = wkdays), abbDays = lubridate::wday(started_at, label = TRUE,
            abbr = TRUE), abbDays = forcats::as_factor(abbDays)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/dow.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow_wq.db"))) {
    transformData(conn = dbconn, path = "db/dow.db", select_cols = c("started_at",
        "member_casual"), group_cols = c("started_at", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (Sun 12:00 am - Mon 11:40 am]", "Q2 (Mon 11:40 am - Wed 05:14 am]",
            "Q3 (Wed 05:14 am - Fri 12:19 pm]", "Q4 (Fri 12:19 pm - Sat 11:59 pm]"),
        doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/dow_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 26: Days of the week

started_at	member_casual	wkdays	abbDays
2024-09-01 00:04:07	casual	1	Sun
2024-09-01 00:04:54	member	1	Sun
2024-09-01 00:05:43	member	1	Sun
2024-09-01 00:09:33	member	1	Sun
2024-09-01 00:09:53	member	1	Sun
2024-09-01 00:10:45	member	1	Sun

Code

transformData(conn = dbconn, path = "db/dow.db", select_cols = "abbDays", group_cols = "abbDays",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("day of the week", "occurrences"), location_list = list("abbDays",
        "n"), source_note = gt::md("**Source**: `db/dow.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)

Table 27: The overall distribution of trips taken by day of the week

Sun¹	² 486275
Mon	507355
Tue	578095
Wed	585266
Thu	597363
Fri	566008
Sat	577036
Source: `db/dow.db`
¹ day of the week
² occurrences

Code

transformData(conn = dbconn, path = "db/dow.db", select_cols = c("abbDays", "member_casual"),
    group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "abbDays", label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/dow.db`"),
        isStub = TRUE, stub_label = "Day", stub_note = "days of the week (abbreviated)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 28: The membership distribution of trips taken by day of the week

Day¹
Sun	casual²	³ 200588
Sun	member	285687
Mon	casual	142860
Mon	member	364495
Tue	casual	152604
Tue	member	425491
Wed	casual	155978
Wed	member	429288
Thu	casual	169117
Thu	member	428246
Fri	casual	189429
Fri	member	376579
Sat	casual	250045
Sat	member	326991
Source: `db/dow.db`
¹ days of the week (abbreviated)
² membership status
³ occurrences

Overall
Comparative

Table 27 and Figure 15 show an aggregated distribution of the overall daily travel.

Code

# Values were too similar to visualize differences, see coord_cartesion()
gplot <- transformData(conn = dbconn, path = "db/dow.db", select_cols = "abbDays",
    group_cols = "abbDays", doWeights = TRUE) |>
    plotter(x_col = abbDays, y_col = n, geomType = "column", title = paste0("Which days\ndo people tend to ride?"),
        x_label = "Days of the Week", y_label = "n") + ggplot2::coord_cartesian(ylim = c(4.5 *
    10^5, NA))

gplot

Figure 15: The overall distribution of trips taken by day of the week

Similarly, Table 28 and Figure 16 summarize the duration distribution, grouping sums by membership status.

Code

gplot <- transformData(conn = dbconn, path = "db/dow.db", select_cols = c("abbDays",
    "member_casual"), group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
    plotter(title = "Day Groups", x_label = "Days", y_label = "n", x_col = abbDays,
        y_col = n, group_col = member_casual, geomType = "column", isFaceted = TRUE,
        is_colGroup = TRUE)

gplot

Figure 16: The membership distribution of trips taken by day of the week

In Table 29, present a Chi-Square of daily travel among members and casual users. Members show higher usage Mon - Thu compared to casual users. The pattern is more regular compared to casual users. Casual users show higher usage Fri - Sun compared to members. Their preferences fall within the weekend.

There is a statistically significant association between daily travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between daily travel and membership status. The frequency of monthly travel is not likely to be independent of membership status.

The large \(\chi^2\) value (76305.71) with 6 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total users), even small differences can produce statistically significant results.

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.

data_tibble <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(abbDays, member_casual) |>
    dplyr::arrange(abbDays, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "abbDays", by = "member_casual")

Code

tabler(title = gt::md("Chi-Square:<br>The signficance of daily travel on membership"),
    source_note = gt::md("**Source**: `db/moy.db`"), label = list(abbDays = "Day"),
    by = member_casual, isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)

Table 29: Testing the independence between month and membership variables

Characteristic	casual N = 1,260,621¹	member N = 2,636,777¹	p-value²
Chi-Square: The signficance of daily travel on membership
Day			<0.001
Sun	200,588 (16%)	285,687 (11%)
Mon	142,860 (11%)	364,495 (14%)
Tue	152,604 (12%)	425,491 (16%)
Wed	155,978 (12%)	429,288 (16%)
Thu	169,117 (13%)	428,246 (16%)
Fri	189,429 (15%)	376,579 (14%)
Sat	250,045 (20%)	326,991 (12%)
Source: `db/moy.db`
¹ n (%); χ² = 76305.71; df = 6
² Pearson’s Chi-squared test

Two plots, Figure 17 and Figure 18, present a highly granular view of the daily data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Creates a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' daily travel"),
        x_label = "Days", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 day", date_labels = "%a",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_casual, qformat = "%a %I %p")

gplot

Figure 17: A more detailed look that illustrates the quartile distribution of daily casual travel.

Code

gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' daily travel"),
        x_label = "Days", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 day", date_labels = "%a",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_member, qformat = "%a %I %p")

gplot

Figure 18: A more detailed look that illustrates the quartile distribution of daily member travel.

To visualize daily users through the lens of their respective concentrations, see Figure 19 The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the left and right sides of the distribution. Their highest peak appears to be on Saturday. The member group has higher peaks near the middle of the distribution. They also show less day-to-day variability during weekdays compared to casual users.

On weekdays, both groups show a bimodal distribution, with two peaks each day. On weekends, there is a unimodal distribution. They overlap each other significantly, suggesting similarity in travel patterns.

Code

gplot <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross daily travel groups",
        x_label = paste0("Day"), x_col = started_at, group_col = member_casual, geomType = "other",
        angle = 45, color_col = "black", density_alpha = 0.75, isTime = TRUE, date_breaks = "1 day",
        date_labels = "%a")

gplot

Figure 19: Visualizing the probability-density distribution of the day by membership variables.

Table 30 presents the results of a binary logistic regression analyzing the relationship between days of the week and membership status. The analysis divides the week into quartiles, with Q1 (Sunday 12:00 am - Monday 11:40 am) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).

With Q2 (Monday 11:40 am - Wednesday 05:14 am), the odds of membership were 1.52 times higher than in Q1. This suggests a substantial increase in the likelihood of members riding during the early part of the work week.

With Q3 (Wednesday 05:14 am - Friday 12:19 pm), the odds of membership were 1.36 times higher than in Q1. This indicates a continued higher likelihood of membership during the latter part of the work week, though slightly lower than Q2. With Q4 (Friday 12:19 pm - Saturday 11:59 pm), the odds of membership were 0.80 times as high as in Q1. This represents a significant decrease in the likelihood of membership during the weekend period.

Query, process, and create a model object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/dow_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Weekday Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of daily member travel"),
        source_note = gt::md("**Source**: `db/dow_wq.db`"), isBinary = TRUE)

Table 30: Modeling the probability of daily member travel.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of daily member travel
Weekday Ranges
Q1 (Sun 12:00 am - Mon 11:40 am]	—
Q2 (Mon 11:40 am - Wed 05:14 am]	1.52	<0.001
Q3 (Wed 05:14 am - Fri 12:19 pm]	1.36	<0.001
Q4 (Fri 12:19 pm - Sat 11:59 pm]	0.80	<0.001
Source: `db/dow_wq.db`
¹ OR = Odds Ratio

3.7 Hour

Database Operations

Table Preview

Write to .db

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(started_at_time = update(started_at, year = 2023, month = 1,
            day = 1), hr = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "hour"), "%I %p")), hrMin = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "minute"), "%I:%M %p")), hrminSec = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "second"), "%r")), hr = forcats::as_factor(hr), hrMin = forcats::as_factor(hrMin)) |>
        dplyr::select(member_casual:hrminSec) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data to hod_wq.db.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod_wq.db"))) {
    transformData(conn = dbconn, path = "db/hod.db", select_cols = c("started_at_time",
        "member_casual"), group_cols = c("started_at_time", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at_time", ntile_col = "quartile", zero_val = "casual",
        one_val = "member", qtile_levels = c("Q1 (12:00 am - 10:59 am]", "Q2 (10:59 am - 03:24 pm]",
            "Q3 (03:24 pm - 06:05 pm]", "Q4 (06:05 pm - 11:59 pm]"), doQuantile = TRUE,
        doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/hod_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 31: Kable output of hod.db

member_casual	started_at_time	hr	hrMin	hrminSec
casual	2023-01-01 00:04:07	12 am	12:04 am	12:04:07 am
member	2023-01-01 00:04:54	12 am	12:05 am	12:04:54 am
member	2023-01-01 00:05:43	12 am	12:06 am	12:05:43 am
member	2023-01-01 00:09:33	12 am	12:10 am	12:09:33 am
member	2023-01-01 00:09:53	12 am	12:10 am	12:09:53 am
member	2023-01-01 00:10:45	12 am	12:11 am	12:10:45 am

Table 32: Kable output of hod_wq.db

started_at_time	member_casual	n	quartile
2023-01-01 00:00:00	casual	6	Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:00	member	6	Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:01	casual	2	Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:01	member	10	Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:02	casual	14	Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:02	member	6	Q1 (12:00 am - 10:59 am]

Code

transformData(conn = dbconn, path = "db/hod.db", select_cols = "hr", group_cols = "hr",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("hour of the day", "occurrences"), location_list = list("hr",
        "n"), source_note = gt::md("**Source**: `db/hod.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)

Table 33: The overall distribution of trips taken by hour of the day

12 am¹	² 52336
01 am	31598
02 am	20378
03 am	10309
04 am	7179
05 am	14866
06 am	61027
07 am	141249
08 am	227964
09 am	199884
10 am	155951
11 am	172204
12 pm	213544
01 pm	223804
02 pm	224467
03 pm	246680
04 pm	313374
05 pm	408606
06 pm	371909
07 pm	277477
08 pm	189106
09 pm	139743
10 pm	112692
11 pm	81051
Source: `db/hod.db`
¹ hour of the day
² occurrences

Code

transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr", "member_casual"),
    group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "hr", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/hod.db`"),
        isStub = TRUE, stub_label = "Hour", stub_note = "hour of the day (12-hour clock)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 34: The membership distribution of trips taken by hour of the day

Hour¹
12 am	casual²	³ 24468
12 am	member	27868
01 am	casual	15391
01 am	member	16207
02 am	casual	10859
02 am	member	9519
03 am	casual	5053
03 am	member	5256
04 am	casual	3164
04 am	member	4015
05 am	casual	3918
05 am	member	10948
06 am	casual	12330
06 am	member	48697
07 am	casual	27046
07 am	member	114203
08 am	casual	43590
08 am	member	184374
09 am	casual	45413
09 am	member	154471
10 am	casual	47734
10 am	member	108217
11 am	casual	59590
11 am	member	112614
12 pm	casual	75634
12 pm	member	137910
01 pm	casual	81646
01 pm	member	142158
02 pm	casual	83794
02 pm	member	140673
03 pm	casual	90150
03 pm	member	156530
04 pm	casual	105809
04 pm	member	207565
05 pm	casual	124466
05 pm	member	284140
06 pm	casual	118031
06 pm	member	253878
07 pm	casual	91667
07 pm	member	185810
08 pm	casual	63696
08 pm	member	125410
09 pm	casual	48455
09 pm	member	91288
10 pm	casual	44013
10 pm	member	68679
11 pm	casual	34704
11 pm	member	46347
Source: `db/hod.db`
¹ hour of the day (12-hour clock)
² membership status
³ occurrences

Overall
Comparative

Table 33 and Figure 20 give an aggregated distribution of the overall hourly travel distribution.

Code

gplot <- transformData(conn = dbconn, path = "db/hod.db", select_cols = "hr", group_cols = "hr",
    doWeights = TRUE) |>
    plotter(x_col = hr, y_col = n, geomType = "column", title = paste0("Which hours do people tend to ride?"),
        x_label = "Hour", y_label = "n", ) + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45)) + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

gplot

Figure 20: The overall distribution of trips taken by hour of the day

Similarly, Table 34 and Figure 21 summarize the distribution, grouping the sums by membership.

Code

gplot <- transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr", "member_casual"),
    group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which hours of the day do members ride?"), x_label = "Hour of Day",
        y_label = "n", x_col = hr, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE) + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45)) + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

gplot

Figure 21: The membership distribution of trips taken by hour of the day

Table 35 presents a Chi-Square of hourly travel among members and casual users. In summary:

Compared to casual users, members show relatively higher service usage at intervals [5:30am - 9:30am] and [4:30pm - 6:30pm]. Casual users show relatively higher rates at intervals [10:30am - 4:30pm] and [6:30pm - 5:30am].

There is a statistically significant association between hourly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between hourly travel and membership status. The frequency of hourly travel is not likely to be independent of membership status.

The large \(\chi^2\) value (72733.38) with 23 degrees of freedom (calculated as [rows - 1] [columns - 1])* results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total users), even small differences can produce statistically significant results.

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.

data_tibble <- transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr",
    "member_casual"))

chiResult <- chisqTest(data = data_tibble, variable = "hr", by = "member_casual")

Code

tabler(title = gt::md("Chi-Square:<br>The signficance of hourly travel on membership"),
    source_note = gt::md("**Source**: `db/hod.db`"), label = list(hr = "Hour"), by = member_casual,
    isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)

Table 35: Testing the independence between hour and membership variables

Characteristic	casual N = 1,260,621¹	member N = 2,636,777¹	p-value²
Chi-Square: The signficance of hourly travel on membership
Hour			<0.001
12 am	24,468 (1.9%)	27,868 (1.1%)
01 am	15,391 (1.2%)	16,207 (0.6%)
02 am	10,859 (0.9%)	9,519 (0.4%)
03 am	5,053 (0.4%)	5,256 (0.2%)
04 am	3,164 (0.3%)	4,015 (0.2%)
05 am	3,918 (0.3%)	10,948 (0.4%)
06 am	12,330 (1.0%)	48,697 (1.8%)
07 am	27,046 (2.1%)	114,203 (4.3%)
08 am	43,590 (3.5%)	184,374 (7.0%)
09 am	45,413 (3.6%)	154,471 (5.9%)
10 am	47,734 (3.8%)	108,217 (4.1%)
11 am	59,590 (4.7%)	112,614 (4.3%)
12 pm	75,634 (6.0%)	137,910 (5.2%)
01 pm	81,646 (6.5%)	142,158 (5.4%)
02 pm	83,794 (6.6%)	140,673 (5.3%)
03 pm	90,150 (7.2%)	156,530 (5.9%)
04 pm	105,809 (8.4%)	207,565 (7.9%)
05 pm	124,466 (9.9%)	284,140 (11%)
06 pm	118,031 (9.4%)	253,878 (9.6%)
07 pm	91,667 (7.3%)	185,810 (7.0%)
08 pm	63,696 (5.1%)	125,410 (4.8%)
09 pm	48,455 (3.8%)	91,288 (3.5%)
10 pm	44,013 (3.5%)	68,679 (2.6%)
11 pm	34,704 (2.8%)	46,347 (1.8%)
Source: `db/hod.db`
¹ n (%); χ² = 72733.38; df = 23
² Pearson’s Chi-squared test

Two plots, Figure 22 and Figure 23, present a highly granular view of the hourly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Creates a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::select(started_at_time, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::select(started_at_time, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at_time, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at_time, probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' hourly travel"),
        x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n", x_col = started_at_time,
        geomType = "column", isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "2 hour",
        date_labels = "%I %p", angle = 45, color_col = "black", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), quartiles = quartiles_casual, qformat = "%I:%M %p")

gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(0.9)))

Figure 22: A more detailed look that illustrates the quartile distribution of hourly casual travel.

Code

gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' hourly travel"),
        x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n", x_col = started_at_time,
        geomType = "column", isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "2 hour",
        date_labels = "%I %p", angle = 45, color_col = "black", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), quartiles = quartiles_member, qformat = "%I:%M %p")

gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(0.9)))

Figure 23: A more detailed look that illustrates the quartile distribution of hourly member travel.

To visualize hourly users through the lens of their respective concentrations, see Figure 24. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The member group parallels the daily bimodal patterns seen in daily density plot. The casual group likewise, but to less extent parallels the bimodal patterns seen in the daily density plot. The two groups, largely, are overlapping, with the highest densities for both falling around 5:00pm.

Code

gplot <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross hourly travel groups",
        x_label = paste0("Hours", "\n", "(12-hour clock)"), x_col = started_at_time,
        group_col = member_casual, geomType = "other", angle = 45, color_col = "black",
        density_alpha = 0.75, isTime = TRUE, date_breaks = "1 hour", date_labels = "%I %p",
        )

gplot

Figure 24: Visualizing the probability-density distribution of the hour by membership.

Table 36 presents the results of a binary logistic regression analyzing the relationship between hour of the day and membership status. The analysis divides the day into quartiles, with Q1 (12:00 am - 10:59 am) serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 1.44 times as high in Q2 (10:59 am - 03:24 pm), 1.04 times as high in Q3 (03:24 pm - 06:05 pm), and 0.97 times as high in Q4 (06:05 pm - 11:59 pm).

These results reveal a non-linear relationship between time of day and membership status. The highest likelihood of membership occurs during \(Q2\), corresponding to midday hours. There is a slight increase in membership likelihood during Q3 (late afternoon) compared to the reference period, while evening hours (Q4) show a slight decrease in membership likelihood.

Query hod_wq.db, process, and create model R object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/hod_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Hour Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of hourly member travel"),
        source_note = gt::md("**Source**: `db/hod_wq.db`"), isBinary = TRUE)

Table 36: Modeling the probability of hourly member travel.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of hourly member travel
Hour Ranges
Q1 (12:00 am - 10:59 am]	—
Q2 (10:59 am - 03:24 pm]	1.44	<0.001
Q3 (03:24 pm - 06:05 pm]	1.04	<0.001
Q4 (06:05 pm - 11:59 pm]	0.97	<0.001
Source: `db/hod_wq.db`
¹ OR = Odds Ratio

3.8 Distance

Database Operations

Table Preview

Write distance.db to the database.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(miles, member_casual) |>
        dplyr::arrange(miles) |>
        dplyr::collect() |>
        dplyr::mutate(milesR = miles, milesR = dplyr::case_when(milesR >= 1 ~ round(milesR,
            digits = 0), miles < 1 ~ round(signif(milesR, 3), digits = 1)), milesR = forcats::as_factor(milesR)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/distance.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance_wq.db"))) {
    transformData(conn = dbconn, path = "db/distance.db", select_cols = c("miles",
        "member_casual"), group_cols = c("miles", "member_casual"), binary_col = "member_casual",
        pred_col = "miles", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (0.10 - 0.63]", "Q2 (0.63 - 1.02]", "Q3 (1.02 - 1.76]",
            "Q4 (1.76 - 20.5]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/distance_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/distance_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 37: Distance

miles	member_casual	milesR
0.1000100	member	0.1
0.1000367	casual	0.1
0.1000491	member	0.1
0.1003780	casual	0.1
0.1004468	member	0.1
0.1005323	member	0.1

Table 38: Distance, Weighted Quantiles

miles	member_casual	n	quartile
0.1000100	member	1	Q1 (0.10 - 0.63]
0.1000367	casual	1	Q1 (0.10 - 0.63]
0.1000491	member	1	Q1 (0.10 - 0.63]
0.1003780	casual	1	Q1 (0.10 - 0.63]
0.1004468	member	1	Q1 (0.10 - 0.63]
0.1005323	member	1	Q1 (0.10 - 0.63]

Code

transformData(
conn = dbconn, 
path = "db/distance.db", 
select_cols = "milesR",
group_cols = "milesR", 
doWeights = TRUE
) |>
tabler( 
title = NULL,
note_list = list("distance (miles)", "occurrences"),
location_list = list("milesR", "n"),
source_note = gt::md("**Source**: `db/distance.db`"),
noteRows = TRUE,
hide_column_labels = TRUE,
row_index = 1,
label_parameter = "milesR",
align_parameter = "right"

)

Table 39: The overall distribution of trips taken by the esimated distance traveled

0.1¹	² 11429
0.2	70621
0.3	178150
0.4	225723
0.5	280367
0.6	288164
0.7	250369
0.8	262418
0.9	227646
1	867678
2	720842
3	292483
4	121920
5	55748
6	25499
7	11924
8	3275
9	1520
10	686
11	415
12	256
13	139
14	67
15	24
16	31
17	1
18	1
19	1
21	1
Source: `db/distance.db`
¹ distance (miles)
² occurrences

Code

transformData(conn = dbconn, path = "db/distance.db", select_cols = c("milesR", "member_casual"),
    group_cols = c("milesR", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "milesR", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/distance.db`"),
        isStub = TRUE, stub_label = "Distance", stub_note = "miles traveled per trip",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 40: The membership distribution of trips taken by the estimated distance traveled

Distance¹
0.1	casual²	³ 1410
0.1	member	10019
0.2	casual	11262
0.2	member	59359
0.3	casual	34957
0.3	member	143193
0.4	casual	54395
0.4	member	171328
0.5	casual	75269
0.5	member	205098
0.6	casual	83008
0.6	member	205156
0.7	casual	76686
0.7	member	173683
0.8	casual	91896
0.8	member	170522
0.9	casual	77635
0.9	member	150011
1	casual	316382
1	member	551296
2	casual	261540
2	member	459302
3	casual	99452
3	member	193031
4	casual	39690
4	member	82230
5	casual	18589
5	member	37159
6	casual	9393
6	member	16106
7	casual	5492
7	member	6432
8	casual	1706
8	member	1569
9	casual	789
9	member	731
10	casual	433
10	member	253
11	casual	273
11	member	142
12	casual	170
12	member	86
13	casual	92
13	member	47
14	casual	49
14	member	18
15	casual	22
15	member	2
16	casual	27
16	member	4
17	casual	1
18	casual	1
19	casual	1
21	casual	1
Source: `db/distance.db`
¹ miles traveled per trip
² membership status
³ occurrences

Overall
Comparative

Table 39 and Figure 25 show the overall distribution of distances traveled.

Code

gplot <- transformData(conn = dbconn, path = "db/distance.db", select_cols = "milesR",
    group_cols = "milesR", doWeights = TRUE) |>
    plotter(x_col = milesR, y_col = n, geomType = "column", title = paste0("How far do people ride?"),
        x_label = "Miles", y_label = "n")

gplot + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
    ggplot2::theme(axis.text = ggplot2::element_text(size = 7))

Figure 25: The overall distribution of trips taken by the estimated distance traveled

Table 40 and Figure 26 likewise summarize the distance distribution by membership status.

Code

gplot <- transformData(conn = dbconn, path = "db/distance.db", select_cols = c("milesR",
    "member_casual"), group_cols = c("milesR", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("How far do members travel?"), x_label = "Miles", y_label = "n",
        x_col = milesR, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE)

gplot + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
    ggplot2::theme(axis.text = ggplot2::element_text(size = 7))

Figure 26: The membership distribution of trips taken by the estimated distance traveled

Table 41 sheds light on the variability, range, and quartile information about the distance to membership data.

Code

gTable <- 
dplyr::tbl(dbconn, "db/distance.db") |>
dplyr::select(miles, member_casual) |>
dplyr::collect() |>
gtsummary::tbl_summary(
by = member_casual,
type = miles ~ "continuous2",
label = list(miles ~ "Distance (miles)"),
digits = list(
miles ~ c(2, 2)),
statistic = 
gtsummary::all_continuous() ~ c(
"{median} ({p25}, {p75})", 
"{mean} ({sd})", 
"{min}, {max}")
) |>
gtsummary::italicize_levels() |>
tabler(
title = gt::md("Summary Stats:<br>Member travel distance"),
source_note = gt::md("**Source**: `db/distance.db`"),
isBinary = TRUE
)

gTable

Table 41: The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel distance.

Summary Stats: Member travel distance
Characteristic	casual N = 1,260,621	member N = 2,636,777
Distance (miles)
Median (Q1, Q3)	1.12 (0.72, 1.83)	0.97 (0.59, 1.71)
Mean (SD)	1.48 (1.17)	1.34 (1.11)
Min, Max	0.10, 20.54	0.10, 16.02
Source: `db/distance.db`

Two plots, Figure 27 and Figure 28, present a highly granular view of the distance data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Creates a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(as.numeric(qdf_member$miles), probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(as.numeric(qdf_casual$miles), probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel distance"),
        x_label = "Miles", y_label = "n", x_col = miles, geomType = "column", isHistogram = TRUE,
        angle = 45, color_col = "transparent", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        limits = c(0.1, 5), breaks = seq(0.1, 5, by = 0.5), quartiles = quartiles_casual)

gplot

Figure 27: A more detailed look that illustrates the quartile distribution of casual travel distance.

Code

gplot <- 
qdf_member |>
plotter(
title = paste0("The quartile", "\n", "distribution of members' travel distance"),
x_label = "Miles",
y_label = "n",
x_col = miles, 
geomType = "column", 
isHistogram = TRUE,
angle = 45,
color_col = "transparent",
vline_color = "lightyellow",
vline_size = 0.5,
low = "blue",
high = "red",
binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
#binwidth = 0.2,
limits = c(0.1, 5),
breaks = seq(1, 5, by = 1),
quartiles = quartiles_member
) 

gplot

Figure 28: A more detailed look that illustrates the quartile distribution of member travel distance.

To visualize monthly users through the lens of their respective concentrations, see Figure 29. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike in between the 0.1 and 1 mile distance. There is less of a concentration than the casual group between the 1-3 mile distance. The casual group displays a broader spike with much overlap to the member group, but peaks closer to the 1 mile mark than for members. There is greater concentration than members between the 1-3 miles distance, but less around the 0.1-1 mile distance. Both groups overlap nearly completely from around the 3 mile mark and onwards.

Code

gplot <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross travel distance groups",
        x_label = "Miles", x_col = miles, group_col = member_casual, geomType = "column",
        angle = 45, color_col = "black", density_alpha = 0.75, isDensity = TRUE,
        is_colGroup = TRUE, breaks = seq(0, 11, by = 1), limits = c(0.1, 11))

gplot

Figure 29: Visualizing the probability-density distribution of distance by membership.

Table 42 presents the odds ratios for membership status across distance quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly lower in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 0.63 times as high in Q2, 0.59 times as high in Q3, and 0.65 times as high in Q4.

These results indicate an inverse relationship between ride distance and membership status, with members generally associated with shorter ride distances. Interestingly, the lowest odds of membership were observed in Q3, rather than Q4, suggesting a non-linear relationship between distance and membership likelihood.

Query, process, and create model R object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/distance_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Distance Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel distance"),
        source_note = gt::md("**Source**: `db/distance_wq.db`"), isBinary = TRUE)

Table 42: Modeling the probability of members’ travel by distance.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of members’ travel distance
Distance Ranges
Q1 (0.10 - 0.63]	—
Q2 (0.63 - 1.02]	0.63	<0.001
Q3 (1.02 - 1.76]	0.59	<0.001
Q4 (1.76 - 20.5]	0.65	<0.001
Source: `db/distance_wq.db`
¹ OR = Odds Ratio

3.9 Speed

Database Operations

Table Preview

Write to the database

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(mph, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(mphR = round(mph, digits = 0)) |>
        dplyr::arrange(mph, member_casual) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/speed.db", overwrite = TRUE)
}

Query, transform, and write weighted quartile data.

if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed_wq.db"))) {
    transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mph", "member_casual"),
        group_cols = c("mph", "member_casual"), binary_col = "member_casual", pred_col = "mph",
        ntile_col = "quartile", zero_val = "casual", one_val = "member", qtile_levels = c("Q1 (1.0 - 5.4]",
            "Q2 (5.4 - 7.0]", "Q3 (7.0 - 8.6]", "Q4 (8.6 - 20]"), doQuantile = TRUE,
        doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/speed_wq.db", overwrite = TRUE)
}

Code

dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/speed_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()

Table 43: Speed

mph	member_casual	mphR
1.000007	casual	1
1.000008	casual	1
1.000050	casual	1
1.000075	casual	1
1.000088	member	1
1.000098	member	1

Table 44: Speed with weighted quartile groups.

mph	member_casual	n	quartile
1.000007	casual	1	Q1 (1.0 - 5.4]
1.000008	casual	1	Q1 (1.0 - 5.4]
1.000050	casual	1	Q1 (1.0 - 5.4]
1.000075	casual	1	Q1 (1.0 - 5.4]
1.000088	member	1	Q1 (1.0 - 5.4]
1.000098	member	1	Q1 (1.0 - 5.4]

Code

transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR", group_cols = "mphR",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("travel speed (miles per hour)", "occurrences"),
        location_list = list("mphR", "n"), source_note = gt::md("**Source**: `db/speed.db`"),
        noteRows = TRUE, hide_column_labels = TRUE, row_index = 1, label_parameter = "mphR",
        align_parameter = "right")

Table 45: The overall distribution of trips taken by the estimated average speed

¹ 1	² 45485
2	106779
3	161520
4	278819
5	460475
6	622275
7	647253
8	546353
9	399237
10	265438
11	165825
12	99529
13	55660
14	27274
15	11165
16	3354
17	714
18	138
19	69
20	36
Source: `db/speed.db`
¹ travel speed (miles per hour)
² occurrences

Code

transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mphR", "member_casual"),
    group_cols = c("mphR", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "mphR", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/speed.db`"),
        isStub = TRUE, stub_label = "Speed", stub_note = "estimated average speed traveled per trip (mph)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")

Table 46: The membership distribution of trips taken by the estimated average speed

Speed¹
1	casual²	³ 28675
1	member	16810
2	casual	67464
2	member	39315
3	casual	93694
3	member	67826
4	casual	128134
4	member	150685
5	casual	166759
5	member	293716
6	casual	191146
6	member	431129
7	casual	179540
7	member	467713
8	casual	144296
8	member	402057
9	casual	102522
9	member	296715
10	casual	67961
10	member	197477
11	casual	42345
11	member	123480
12	casual	24981
12	member	74548
13	casual	13475
13	member	42185
14	casual	6305
14	member	20969
15	casual	2402
15	member	8763
16	casual	693
16	member	2661
17	casual	171
17	member	543
18	casual	38
18	member	100
19	casual	12
19	member	57
20	casual	8
20	member	28
Source: `db/speed.db`
¹ estimated average speed traveled per trip (mph)
² membership status
³ occurrences

Overall
Comparative

Table 45 and Figure 30 show the overall distribution of travel speeds.

Code

gplot <- transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR",
    group_cols = "mphR", doWeights = TRUE) |>
    plotter(x_col = mphR, y_col = n, geomType = "column", title = paste0("How fast do people ride?"),
        x_label = "Miles per Hour", y_label = "n")

gplot

Figure 30: The overall distribution of trips taken by the estimated average speed

Table 46 and Figure 31 summarize the travel speed distribution by membership.

Code

gplot <- transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mphR",
    "member_casual"), group_cols = c("mphR", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("How fast do members travel?"), x_label = "Miles per Hour",
        y_label = "n", x_col = mphR, y_col = n, color_col = member_casual, geomType = "column",
        is_colGroup = TRUE, isFaceted = TRUE)

gplot

Figure 31: The membership distribution of trips taken by the estimated average speed

Table 47 gives the reader some idea of the variability, range, and quartile information about the distance data.

Code

dplyr::tbl(dbconn, "db/speed.db") |>
dplyr::select(mph, member_casual) |>
dplyr::collect() |>
gtsummary::tbl_summary(
by = member_casual,
type = mph ~ "continuous2",
label = list(mph ~ "Speed"),
digits = list(
mph ~ c(1, 1)),
statistic = 
gtsummary::all_continuous() ~ c(
"{median} ({p25}, {p75})", 
"{mean} ({sd})", 
"{min}, {max}")
) |>
gtsummary::italicize_levels() |>
tabler(
title = gt::md("Summary Stats:<br>Member travel speed"),
source_note = gt::md("**Source**: `db/speed.db`"),
isBinary = TRUE
)

Table 47: The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel speed.

Summary Stats: Member travel speed
Characteristic	casual N = 1,260,621	member N = 2,636,777
Speed
Median (Q1, Q3)	6.3 (4.5, 8.1)	7.2 (5.7, 8.8)
Mean (SD)	6.4 (2.7)	7.3 (2.4)
Min, Max	1.0, 20.0	1.0, 20.0
Source: `db/speed.db`

Two plots, Figure 32 and Figure 33, present a highly granular view of the travel speed data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Creates a data frame, then extract the desired quartile info to supplement histogram visualization for the data.

qdf_member <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$mph, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$mph, probs = c(0.25, 0.5, 0.75))

Code

gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel speed"),
        x_label = "Miles per Hour", y_label = "n", x_col = mph, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), breaks = seq(1, 20, by = 2), limits = c(1,
            20), quartiles = quartiles_casual)

gplot

Figure 32: A more detailed look that illustrates the quartile distribution of casual travel speed.

Code

gplot <- 
qdf_member |>
plotter(
title = paste0("The quartile", "\n", "distribution of members' travel speed"),
x_label = "Miles per Hour",
y_label = "n",
x_col = mph, 
geomType = "column", 
isHistogram = TRUE,
angle = 45,
color_col = "transparent",
vline_color = "lightyellow",
vline_size = 0.5,
low = "blue",
high = "red",
binwidth = \(x) 2 * IQR(x) / (length(x)^(1/3)),
#binwidth = 0.5,
breaks = seq(1, 20, by = 2),
limits = c(1, 20),
quartiles = quartiles_member
) 

gplot

Figure 33: A more detailed look that illustrates the quartile distribution of member travel speed.

To visualize travel speed patterns through the lens of their respective concentrations, see Figure 34. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike centering around 7mph. There is higher concentration than the casual group between ~ 5-15mph. The casual group has a broader, shorter spike centering closer to 6mph. There is higher concentration than the member group between ~ 1-5mph. Alongside significant overlap, both groups exhibit a unimodal distribution.

Code

gplot <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross travel speed groups",
        x_label = "Miles per Hour", x_col = mph, group_col = member_casual, geomType = "column",
        angle = 45, color_col = "black", density_alpha = 0.75, isDensity = TRUE,
        is_colGroup = TRUE, breaks = seq(1, 20, by = 1), limits = c(1, 20))

gplot

Figure 34: Visualizing the probability-density distribution of speed by membership.

Table 48 presents the odds ratios for membership status across speed quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly higher in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 2.09 times higher in Q2, 2.50 times higher in Q3, and 2.69 times higher in Q4.

These results suggest a strong positive association between riding speed and membership status, with the likelihood of membership increasing monotonically across speed quartiles.

Query, process, and create model R object for hour based on quartile range.

model <- dplyr::tbl(dbconn, "db/speed_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)

Code

model |>
    gtsummary::tbl_regression(label = list(quartile = "Speed Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel speed"),
        source_note = gt::md("**Source**: `db/speed_wq.db`"), isBinary = TRUE)

Table 48: Modeling the probability of members’ travel by speed.

Characteristic	OR¹	p-value
Binary Logistic Regression: Modeling the likelihood of members’ travel speed
Speed Ranges
Q1 (1.0 - 5.4]	—
Q2 (5.4 - 7.0]	2.09	<0.001
Q3 (7.0 - 8.6]	2.50	<0.001
Q4 (8.6 - 20]	2.69	<0.001
Source: `db/speed_wq.db`
¹ OR = Odds Ratio

3.10 Summarizing Tabsets

The EDA (exploratory data analysis) sections employ various statistical methods to uncover patterns in user behavior and preferences. A chi-square analysis reveals a significant association between bicycle type and membership status (p < 0.001). Binary logistic regression further quantifies this relationship, showing that electric bike users have lower odds of being members compared to classic bike users. Section 3.3

Trip duration analysis, also utilizing binary logistic regression, uncovers a notable trend: the odds of membership decrease substantially as trip duration increases. This model, using quartiles of trip duration, indicates that members generally take shorter, more concentrated trips, while casual users are more likely to engage in longer rides. Section 3.4

Seasonal trends emerge when examining monthly ridership patterns through another logistic regression model. The odds of membership fluctuate throughout the year, with the highest proportion of members riding during the colder months and early spring. As the weather warms, there’s a noticeable shift towards more casual ridership, as evidenced by lower odds ratios in the summer months. Section 3.5

Weekly and daily patterns, analyzed using similar regression techniques, provide further insights into user behavior. Weekdays, Section 3.6, particularly during typical commuting hours, Section 3.7, see higher odds of member rides. In contrast, weekends and evenings show decreased odds of membership, suggesting an increased likelihood of casual ridership during these times.

The analysis of trip distances, again using logistic regression with distance quartiles, reveals an inverse relationship with membership status. Members are more likely to take shorter trips, while casual users tend to embark on longer journeys. This aligns with the duration findings and reinforces the idea that members use the service for quick, routine travel. Section 3.4

Interestingly, trip speed shows a strong positive association with membership status in the logistic regression model. The odds of membership increase monotonically across speed quartiles, indicating that members generally ride at higher speeds compared to casual users. Section 3.9

These findings, derived from a combination of chi-square tests for independence and multiple binary logistic regression models, paint a picture of two distinct user groups: members who typically use the bike share for short, fast, routine trips during weekdays, and casual users who tend to take longer, slower rides, often during leisure hours or weekends.

Contingency tables and visualizations, including density plots and histograms, supplement these statistical analyses, providing a comprehensive view of the data distribution across various parameters such as bike type, trip duration, time of day, and day of the week.

The robust statistical approach, combining hypothesis testing (chi-square) with predictive modeling (logistic regression), provides strong evidence for the observed patterns in user behavior. These insights could prove valuable for tailoring marketing strategies, optimizing bike distribution, and enhancing service offerings to better serve both user segments.

3.11 Geographic Data

3.11.1 Traffic Flow

Figure 35 presents an intriguing bird’s-eye view of trip behaviors through an interactive epiflows graph. ]Moraga et al. (2019)] This R package used for creating this graph was re-purposed from its original intent for visualizing the spread of disease. This visualization employs a network of nodes (circles) connected by lines, where the thickness of the lines roughly corresponds to the volume of trips between the nodes, with thicker lines indicating a higher number of trips. The top 34 most frequently traveled stations are depicted in this visual network diagram.

Moraga, P., Dorigatti, I., Kamvar, Z. N., Piatkowski, P., Toikkanen, S. E., Nagraj, V. P., Donnelly, C. A., & Jombart, T. (2019). epiflows: an R package for risk assessment of travel-related spread of disease. https://doi.org/10.12688/f1000research.16032.3

The interactive nature of the epiflows allows users to click on individual nodes and lines to access more detailed information. Additionally, a drop-down window provides further exploration capabilities, enabling users to delve deeper into the data.

These stations represent the most active locations within the system. Fortunately, Section 3.11.2 explores a potential approach to gain insights into the typical high-traffic station locations and the underlying reasons behind their elevated activity levels.

First, creates the frequency of trips taken to and from pairs of stations.
We are only going to look deeper into the top 50 most traveled pairs.

flowData <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(start_station_name, end_station_name) |>
    dplyr::group_by(start_station_name, end_station_name) |>
    dplyr::summarize(n = n()) |>
    dplyr::ungroup() |>
    dplyr::arrange(desc(n)) |>
    dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
    dplyr::collect() |>
    dplyr::slice_head(n = 50)

Second, we need statistics but also to combine the statistics for every unique station name.

locationData <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(start_station_name, end_station_name, started_at, ended_at, trip_time) |>
    dplyr::group_by(start_station_name, end_station_name) |>
    dplyr::mutate(trip_time = round(trip_time, digits = 0)) |>
    dplyr::summarize(trip_count = dplyr::n(), first_date = min(started_at), last_date = max(ended_at),
        ) |>
    dplyr::ungroup() |>
    dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
    dplyr::arrange(desc(trip_count)) |>
    dplyr::collect()

# Need to combine all names to single column and recalculate or retain other
# stats.
locationData_pivoted <- locationData |>
    tidyr::pivot_longer(cols = 1:2, values_to = "allNames") |>
    dplyr::group_by(allNames) |>
    dplyr::summarize(trips_toAndfrom = sum(trip_count), first_date = min(first_date),
        last_date = max(last_date), ) |>
    dplyr::arrange(trips_toAndfrom)

Third, creates epiflow objects, which take in a pair of dataframes and creates the flows between them.

# for all the pairs
ef_test <- epiflows::make_epiflows(flows = flowData, locations = locationData_pivoted,
    num_cases = "trips_toAndfrom")

First, just a quick view of the flow data table we made earlier.

flowData

# A tibble: 50 × 3
   from_station                      to_station                   n
   <chr>                             <chr>                    <dbl>
 1 Ellis Ave & 60th St               Ellis Ave & 55th St       6927
 2 Ellis Ave & 60th St               University Ave & 57th St  6600
 3 Ellis Ave & 55th St               Ellis Ave & 60th St       6349
 4 University Ave & 57th St          Ellis Ave & 60th St       6168
 5 Calumet Ave & 33rd St             State St & 33rd St        5417
 6 State St & 33rd St                Calumet Ave & 33rd St     5343
 7 DuSable Lake Shore Dr & Monroe St Streeter Dr & Grand Ave   4023
 8 Loomis St & Lexington St          Morgan St & Polk St       3719
 9 Morgan St & Polk St               Loomis St & Lexington St  3379
10 University Ave & 57th St          Kimbark Ave & 53rd St     3112
# ℹ 40 more rows

Second, another quick view, but for thethe location data we pivoted earlier.

locationData_pivoted |>
    dplyr::arrange(desc(trips_toAndfrom))

# A tibble: 1,567 × 4
   allNames              trips_toAndfrom first_date          last_date          
   <chr>                           <dbl> <dttm>              <dttm>             
 1 Streeter Dr & Grand …           86422 2023-01-01 00:05:43 2024-01-01 00:19:01
 2 Kingsbury St & Kinzi…           61277 2023-01-01 01:21:59 2023-12-31 21:30:50
 3 DuSable Lake Shore D…           60808 2023-01-01 02:12:09 2023-12-31 23:34:53
 4 Clark St & Elm St               60552 2023-01-01 01:06:48 2023-12-31 23:29:33
 5 Clinton St & Washing…           58278 2023-01-01 00:44:39 2023-12-31 18:03:02
 6 Wells St & Concord Ln           57642 2023-01-01 01:15:27 2023-12-31 23:51:50
 7 Michigan Ave & Oak St           54000 2023-01-01 00:59:17 2023-12-31 23:09:35
 8 Wells St & Elm St               52315 2023-01-01 00:59:22 2023-12-31 23:51:48
 9 DuSable Lake Shore D…           48833 2023-01-01 00:14:47 2023-12-31 16:49:56
10 Theater on the Lake             48349 2023-01-01 03:14:22 2023-12-31 22:53:53
# ℹ 1,557 more rows

Figure 35: EpiFlow Network

Code Steps

Table Preview

3.11.2 Checking the Map

This section was made possible thanks to the latitude and longitude coordinates data provided alongside the stations names. Coming from the epiflow diagram, this should help make the data less abstract. The accordion below expands and collapses four OpenStreet maps found in the callout section below. These maps were split for viewing logistics. They contain from the epiflow in the section above. These maps are interactive, so the default views are zoom-able and movable. The transparent burst buttons enable snappy zooming-in of the station groups.

Processing ‘flowData’ created earlier to include geolocation data for mapview plots.

# All distinct stations in one column
names <- flowData |>
    dplyr::select(from_station, to_station) |>
    tidyr::pivot_longer(cols = 1:2, names_to = NULL, values_to = "station_names") |>
    dplyr::distinct()

# The important geo-coordinates corresponding to station names
mapData <- dplyr::tbl(dbconn, filtered_path, check_from = FALSE) |>
    dplyr::select(start_station_name, start_lat, start_lng, end_station_name, end_lat,
        end_lng)

# Filter to include all observations that match the station names listed in
# 'names'. We need the geo-coordinates alongside the names.
mapData1 <- mapData |>
    dplyr::collect() |>
    # Filter, but through a vector of conditions.
dplyr::filter(start_station_name %in% names[[1]], end_station_name %in% names[[1]]) |>
    dplyr::select(start_station_name:start_lng)

# Had to split 'mapData' into two and pivot into a single table.
mapData2 <- mapData |>
    dplyr::collect() |>
    dplyr::filter(start_station_name %in% names[[1]], end_station_name %in% names[[1]]) |>
    dplyr::select(end_station_name:end_lng)

# Nice grouping
stations_groupMap <- dplyr::bind_rows(mapData1, mapData2) |>
    dplyr::select(start_station_name, start_lat, start_lng) |>
    dplyr::rename(station_names = start_station_name, lat = start_lat, lng = start_lng) |>
    dplyr::distinct() |>
    dplyr::group_by(station_names)

# Setting seed for sampling
set.seed(113)

# Taking 10 random samples from each station_name group
sampled_stations <- stations_groupMap |>
    dplyr::slice_sample(n = 10) |>
    tidyr::drop_na()

Creates a map coloring palette excluding grays.

# All of the r-colors
allPalette <- colors()

# The grays are vast so we don't want those watering down the samples.
colorfulPal <- purrr::discard(allPalette, stringr::str_detect(allPalette, "gr(a|e)y"))

# When we sample the colors, 10 should be slightly more than needed.
n_colors <- 10

First, sourcing the script needed to generate the maps and creating the list of vectors used as input. These vectors are the slices of the top most traveled stations.

slicerVector <- list(c(1:9), c(10:18), c(19:27), c(28:34))
source("Scripts/mapViewer.R")

The script used to generate the maps.

# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# I needed the stations groups' burst buttons to fit the viewing window in my 
# document and the only way I could think of is to split the stations into 
# multiple maps. This reduces duplicate code.
# ----
mapViewer <- function(x) {
    
    nameSlice <- sampled_stations |>
        dplyr::ungroup() |>
        dplyr::distinct(station_names) |>
        dplyr::slice(x)
    
    viewMap <- sampled_stations |>
        dplyr::filter(station_names %in% nameSlice$station_names) |>
        sf::st_as_sf(coords = c(3:2), crs = 4326) |>
        mapview::mapview(
            zcol = "station_names",
            col.regions = randomColors,
            map.types = "OpenStreetMap",
            burst = TRUE,
            legend = FALSE)
    
    return(viewMap)
}

Benson Ave & Church St … Ellis Ave & 60th St

Code

set.seed(240)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[1]])

Figure 36: Benson Ave & Church St - Ellis Ave & 60th St

Greenview Ave & Fullteron Ave … Loomis Ave & Lexington St

Code

set.seed(241)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[2]])

Figure 37: Greenview Ave & Fullteron Ave - Loomis Ave & Lexington St

Michigan Ave & Oak St … State St & 33rd St

Code

set.seed(242)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[3]])

Figure 38: Michigan Ave & Oak St - State St & 33rd St

Street Dr & Grand Ave … Woodlawn Ave & 55th St

Code

set.seed(243)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[4]])

Figure 39: Street Dr & Grand Ave - Woodlawn Ave & 55th St

3.11.3 Summarzing Geographic EDA

For example, suppose the user selects University Ave & 57th St in the epiflow visualization. This intersection happens to be at the heart of the University of Chicago campus. The natural next question is: where does the traffic to and from this location typically flow? By selecting one of the other nodes highlighted with flows directing away from the previous node, the user can identify Kimbark Ave and 53rd St. As seen in the map view, this location is situated adjacent to the Vue 53 Apartments complex. By analyzing such connections between nodes, the user can gain insights into common routes and destinations originating from a particular point of interest, potentially revealing patterns related to student housing, campus facilities, or other points of interest in the vicinity.

The data suggests individual members utilize the service multiple times weekly. However, further analysis is needed to determine if a significantly larger volume of unique individuals are annual members. Verifying associations between specific locations and higher or lower traffic could be a next step. Preliminary observations indicate universities, shopping centers, major companies, and nearby apartment complexes tend to have the highest ridership volumes.

To improve membership, addressing factors deterring individuals from becoming annual members could be key. These may include a lack of stations within walking distance of residences or destinations, or concerns over electric bicycle battery life and charging station availability, potentially explaining their lower utilization compared to classic bikes. Offering trial periods could allow casual users to experience the service’s reliability and convenience, encouraging conversion to annual memberships.

3.12 Updated Database Tables List

Code

dbList2 <- duckdb::dbListTables(dbconn) |>
    as.data.frame() |>
    tabler(title = gt::md("Which tables have<br>been created so far?"), note_list = list(gt::md("Tables in `db/data.db` at this stage")),
        location_list = list("duckdb::dbListTables(dbconn)"), noteColumns = TRUE,
        source_note = gt::md("**Source**: `db/data.db`"), label_n = NULL) |>
    gt::cols_label(`duckdb::dbListTables(dbconn)` = "Table Paths") |>
    gt::cols_align(align = "left")

dbList2

Table 49: The list of tables created by the end of the analysis.

Table Paths¹
Which tables have been created so far?
db/bType.db
db/bType_wb.db
db/complete_data.db
db/data_unduped.db
db/distance.db
db/distance_wq.db
db/dow.db
db/dow_wq.db
db/duration.db
db/duration_wq.db
db/filtered_data.db
db/hod.db
db/hod_wq.db
db/membership.db
db/moy.db
db/moy_wq.db
db/original_data.db
db/speed.db
db/speed_wq.db
Source: `db/data.db`
¹ Tables in `db/data.db` at this stage

Notice the size difference from the previous image. The database is still represented in the data folder as one file.

3.13 Export the Final Database

For future use, it might save a lot of time to not have to re-download the original data or recreate tables. The database is exported and compressed to parquet files.

A view of the data exporting script.

# ----
# CC BY-SA, Eric Mossotti 
# ----
# Description ----
# 
# Simplifies duckDB database export to a separate directory. Prevents the need
# for re-downloading and recreating db tables in some circumstances.

dbExporter <- function(dbdir, query) {
    
    if (dir.exists("db_exported") == FALSE ) {
        dir.create("db_exported")
    }
    
    conn <- DBI::dbConnect(duckdb::duckdb(), dbdir)
    
    DBI::dbExecute(conn, query)
    
}

Execute above script with a custom SQL query string parameter to the db_exported directory.

if (dir.exists("db_exported") == FALSE) {

    source("Scripts/dbExporter.R")

    queryString <- paste0("EXPORT DATABASE 'db_exported' (", "FORMAT PARQUET, ",
        "COMPRESSION ZSTD", ")")

    dbExporter(dbdir = "db/data.db", query = queryString)

}

4 Conclusion

These findings and recommendations are based on robust statistical analyses, including chi-square tests, binary logistic regression models, and visualization techniques. They provide a data-driven foundation for enhancing the Divvy bike-sharing service and better serving the residents of Chicago.

4.1 Key Findings

Membership status significantly influences bike usage patterns:
- Members prefer classic bikes over electric bikes.
- Casual users have a higher electric bike usage compared to members.
- Members typically take shorter, faster trips.
- Casual users tend to engage in longer, slower rides.
Temporal patterns reveal distinct user behaviors:
- Weekdays and typical commuting hours see higher member activity.
- Weekends and evenings show increased casual ridership.
- Membership likelihood is highest during colder months and early spring.
- Summer months see a shift towards more casual ridership.
Trip characteristics vary by user type:
- Members are associated with shorter ride distances.
- Trip speed shows a strong positive association with membership status.
- The likelihood of membership decreases as trip duration increases.
High-traffic stations are often near universities, shopping centers, major companies, and apartment complexes.
The large sample size (nearly 4 million users) allows for high statistical significance in observed differences.

4.2 Recommendations

Tailor marketing strategies to target potential members for short, frequent trips, especially for commuting purposes.
Optimize bike distribution to meet the demand for classic bikes among members and electric bikes among casual users.
Implement promotional campaigns during summer months to convert casual users to members.
Consider offering trial periods to allow casual users to experience the benefits of membership.
Investigate factors deterring individuals from becoming annual members, such as station proximity to residences or destinations.
Address potential concerns over electric bicycle battery life and charging station availability.
Focus on improving service near high-traffic areas like universities, shopping centers, and residential complexes.
Develop targeted strategies to encourage casual users of longer, leisure rides to consider membership benefits.
Utilize the epiflows visualization tool to identify and optimize popular routes and destinations.
Continue to collect and analyze data to refine understanding of user behavior and preferences over time.

Reuse

CC BY-SA 4.0