Bike-Sharing in the Streets of Chicago

Author

Eric Mossotti

Published

May 23, 2024

Modified

Oct 14, 2024

Objective
Communicating reproducible, data-driven insights into questions relating to, but not limited to, increasing annual membership subscriptions.

1 Intro

Download Files and DB Creation

A view of the dl_toDirectory.R script used below.
# ----
# CC BY-SA, Eric Mossotti
# ----
# Description ----
# 
# Downloads files from the cloud, calls the file relocating function and cleans
# up the environment.
# 
# ----
source("Scripts/unz_relocate.R")

dl_toDirectory <- function (
        durls,
        tmpzip_dir,
        tmpfile_dir,
        tmpzip_paths,
        tmpfile_paths,
        tmpfile_names
        ) {

  # Create a directory to store the temporary files
  dir.create(tmpzip_dir)
  
  # Download the archives to the temporary zip paths created above
  curl::multi_download(durls, 
                       destfiles = tmpzip_paths)
  
  # Create tempFile directory
  dir.create(tmpfile_dir)
  
  # Execute sourced custom function from the unz_relocate.R file
  unz_relocate(fPaths = tmpfile_paths,
               zPaths = tmpzip_paths,
               fNames = tmpfile_names)
  
  # Remove the temporary zip directory and its contents when finished
  unlink(tmpzip_dir, 
         recursive = TRUE)
  
}
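The unz_relocate() helper sourced above lives in Scripts/unz_relocate.R, which is not reproduced in this report. A minimal sketch of what it might look like, assuming each archive contains a single CSV named as in tmpfile_names:
# Hypothetical sketch of unz_relocate() - the real script is not shown here
unz_relocate <- function(fPaths, zPaths, fNames) {
    purrr::walk(seq_along(zPaths), function(i) {
        # Extract the CSV from the archive into the working directory
        utils::unzip(zPaths[[i]], files = fNames[[i]])
        # Relocate it to its destination path
        file.rename(fNames[[i]], fPaths[[i]])
    })
}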
A view of the csv_toDB.R script used below.
# ----
# CC BY-SA, Eric Mossotti
#  ----
# Description ----
# 
# Reads the downloaded files into a dataframe. Then writes the df to local 
# storage and cleans up the environment.
# 
# ---- 

library(duckdb)

csv_toDB <- function (
        tmpfile_dir,
        tmpfile_paths,
        db_dir,
        database_path,
        original_path,
        complete_path
) {
    # Read all unzipped files into a single dataframe
    dataTibble <-
        purrr::map(tmpfile_paths[1:12], 
                   arrow::read_csv_arrow) |>
        purrr::list_rbind()
    
    # duckDB for data storage ----
    
    # Folder to hold database files
    dir.create(db_dir)
    
    # Initialize a duckDB database connection
    dbconn <- DBI::dbConnect(duckdb::duckdb(), 
                             dbdir = database_path, 
                             read_only = FALSE)
    
    # Original data table
    dataTibble |>
        as.data.frame() |>
        duckdb::dbWriteTable(conn = dbconn,
                             name = original_path,
                             overwrite = TRUE)
    
    # Clean data and working environment ----
    rm(dataTibble)
    
    # Write the complete (non-missing) observations
    dplyr::tbl(dbconn, original_path) |>
        dplyr::collect() |>
        tidyr::drop_na() |>
        duckdb::dbWriteTable(conn = dbconn,
                             name = complete_path,
                             overwrite = TRUE)
    
    unlink(tmpfile_dir, recursive = TRUE)
}
Execute the initial download, remove incomplete observations, and store the
original data and complete data separately.

Or, if the database already exists, simply connect to it. Either way,
frequently used scripts are sourced afterward.
if (exists("dbconn") == FALSE && dir.exists("db") == FALSE) {

    source("Scripts/dl_toDirectory.R")

    dl_toDirectory(durls = sprintf("https://divvy-tripdata.s3.amazonaws.com/%d-divvy-tripdata.zip",
        202301:202312), tmpzip_dir = "tempZips", tmpzip_paths = sprintf("tempZips/%d-divvy-tripdata.zip",
        202301:202312), tmpfile_dir = "tempFiles", tmpfile_paths = sprintf("tempFiles/%d-divvy-tripdata.csv",
        202301:202312), tmpfile_names = sprintf("%d-divvy-tripdata.csv", 202301:202312))

    source("Scripts/csv_toDB.R")

    csv_toDB(tmpfile_dir = "tempFiles", tmpfile_paths = sprintf("tempFiles/%d-divvy-tripdata.csv",
        202301:202312), db_dir = "db", database_path = "db/data.db", original_path = "db/original_data.db",
        complete_path = "db/complete_data.db")
} else {
    database_path <- "db/data.db"
    original_path <- "db/original_data.db"
    complete_path <- "db/complete_data.db"
    filtered_path <- "db/filtered_data.db"

    dbconn <- DBI::dbConnect(duckdb::duckdb(), dbdir = database_path, read_only = FALSE)
}

source("Scripts/tabler.R")
source("Scripts/plotter.R")
source("Scripts/transformData.R")
source("Scripts/chisqTest.R")

1.1 Stakeholders

The primary stakeholders in this analysis are Divvy, Lyft (the parent company of Divvy), and the City of Chicago Department of Transportation. The analysis aims to provide these stakeholders with data-driven insights to enhance the Divvy bike-sharing service, better serving the residents of Chicago and its users. The initial rationale behind Divvy’s implementation included improving air quality, promoting economic recovery, and reducing traffic congestion within the city. (About Divvy)

About Divvy: Company & History | Divvy Bikes. https://divvybikes-marketing-staging.lyft.net/about

1.2 Data Import

Off-Canvas Buttons

Code processing steps are accessible via buttons like the one below. The drop-down code summaries and tables within them add context and transparency to the presented findings.

Data import scripts and initial analysis setup decision

1.3 Source

The raw 2023 dataset was imported from Divvy Data. (Divvy Data)

Code
# List of column labels to feed tabler() and add_multiple_footnotes()
location_list <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    colnames() |>
    as.list()

# A simple list of footnotes to feed tabler() and add_multiple_footnotes().
note_list <- list("Anonymized trip identifier.", "The bicycle type.", "Starting date-time (to the second).",
    "Ending date-time (to the second).", "Station name of where the trip started.",
    "Station ID of where the trip started.", "Station name of where the trip ended.",
    "Station ID of where the trip ended.", "Latitude associated with the starting location.",
    "Longitude associated with the starting location.", "Latitude associated with the ending location.",
    "Longitude associated with the ending location.", "If someone is an annual subscriber or not.")

dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    dplyr::slice_head(n = 10) |>
    tabler(title = "What data did we start with?", source_note = gt::md("**Source**: Divvy Data"),
        note_list = note_list, location_list = location_list, noteColumns = TRUE,
        label_n = NULL) |>
    gt::tab_options(table.font.size = gt::pct(75), footnotes.multiline = FALSE)
Table 1: Unaltered data (other than being combined and formatted for presentation).
What data did we start with?
ride_id1 rideable_type2 started_at3 ended_at4 start_station_name5 start_station_id6 end_station_name7 end_station_id8 start_lat9 start_lng10 end_lat11 end_lng12 member_casual13
F96D5A74A3E41399 electric_bike 2023-01-21 20:05:42 2023-01-21 20:16:33 Lincoln Ave & Fullerton Ave TA1309000058 Hampden Ct & Diversey Ave 202480.0 41.92407 -87.64628 41.93000 -87.64000 member
13CB7EB698CEDB88 classic_bike 2023-01-10 15:37:36 2023-01-10 15:46:05 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
BD88A2E670661CE5 electric_bike 2023-01-02 07:51:57 2023-01-02 08:05:11 Western Ave & Lunt Ave RP-005 Valli Produce - Evanston Plaza 599 42.00857 -87.69048 42.03974 -87.69941 casual
C90792D034FED968 classic_bike 2023-01-22 10:52:58 2023-01-22 11:01:44 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
3397017529188E8A classic_bike 2023-01-12 13:58:01 2023-01-12 14:13:20 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
58E68156DAE3E311 electric_bike 2023-01-31 07:18:03 2023-01-31 07:21:16 Lakeview Ave & Fullerton Pkwy TA1309000019 Hampden Ct & Diversey Ave 202480.0 41.92607 -87.63886 41.93000 -87.64000 member
2F7194B6012A98D4 electric_bike 2023-01-15 21:18:36 2023-01-15 21:32:36 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79955 -87.59462 41.80983 -87.59938 member
DB1CF84154D6A049 classic_bike 2023-01-25 10:49:01 2023-01-25 10:58:22 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
34EAB943F88C4C5D electric_bike 2023-01-25 20:49:47 2023-01-25 21:02:14 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79959 -87.59467 41.80983 -87.59938 member
BC8AB1AA51DA9115 classic_bike 2023-01-06 16:37:19 2023-01-06 16:49:52 Kimbark Ave & 53rd St TA1309000037 Greenwood Ave & 47th St TA1308000002 41.79957 -87.59475 41.80983 -87.59938 member
Source: Divvy Data
1 Anonymized trip identifier. 2 The bicycle type. 3 Starting date-time (to the second). 4 Ending date-time (to the second). 5 Station name of where the trip started. 6 Station ID of where the trip started. 7 Station name of where the trip ended. 8 Station ID of where the trip ended. 9 Latitude associated with the starting location. 10 Longitude associated with the starting location. 11 Latitude associated with the ending location. 12 Longitude associated with the ending location. 13 If someone is an annual subscriber or not.

Glimpse

A ‘glimpse’ function output of the raw data with data type information.
dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    tibble::glimpse()
Rows: 5,719,877
Columns: 13
$ ride_id            <chr> "F96D5A74A3E41399", "13CB7EB698CEDB88", "BD88A2E670…
$ rideable_type      <chr> "electric_bike", "classic_bike", "electric_bike", "…
$ started_at         <dttm> 2023-01-21 20:05:42, 2023-01-10 15:37:36, 2023-01-…
$ ended_at           <dttm> 2023-01-21 20:16:33, 2023-01-10 15:46:05, 2023-01-…
$ start_station_name <chr> "Lincoln Ave & Fullerton Ave", "Kimbark Ave & 53rd …
$ start_station_id   <chr> "TA1309000058", "TA1309000037", "RP-005", "TA130900…
$ end_station_name   <chr> "Hampden Ct & Diversey Ave", "Greenwood Ave & 47th …
$ end_station_id     <chr> "202480.0", "TA1308000002", "599", "TA1308000002", …
$ start_lat          <dbl> 41.92407, 41.79957, 42.00857, 41.79957, 41.79957, 4…
$ start_lng          <dbl> -87.64628, -87.59475, -87.69048, -87.59475, -87.594…
$ end_lat            <dbl> 41.93000, 41.80983, 42.03974, 41.80983, 41.80983, 4…
$ end_lng            <dbl> -87.64000, -87.59938, -87.69941, -87.59938, -87.599…
$ member_casual      <chr> "member", "member", "casual", "member", "member", "…

R console output of the raw data

1.4 Design

Another worthy objective of this analysis is to achieve reproducibility and efficiency. To facilitate future research and enable subsequent analyst teams to build upon this work, the project aimed to provide adequate code documentation and adhere to best practices regarding clean and modular code.

For instance, certain design decisions eliminate the need to re-download and re-process the data. Analysts working with this dataset over an extended period, such as days or months, can simply reconnect to the single database file, which holds the original data along with every table generated during the analysis, once the initial download and processing are complete.

The underlying code incorporates an if-else decision, which includes a source code script responsible for handling the initial data processing and establishing the database filesystem. Opting for a persistent DuckDB filesystem (as opposed to a purely in-memory solution) appeared optimal in terms of simplicity, cost-effectiveness of SQL database queries, and retaining progress made over extended periods. (Why DuckDB)
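For contrast, a minimal sketch of the two connection modes (variable names are illustrative; only the persistent form is used in this project):
# Persistent database: tables survive across R sessions
conn_disk <- DBI::dbConnect(duckdb::duckdb(), dbdir = "db/data.db")

# In-memory alternative: fast, but any progress is lost when the session ends
conn_mem <- DBI::dbConnect(duckdb::duckdb(), dbdir = ":memory:")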

To streamline the process, reduce code duplication, and maintain consistent formatting throughout the project, reusable functions were developed for generating most of the tables and figures. These functions are located in the "Scripts" folder within the working directory. Their modular design not only simplifies the implementation of formatting changes but also facilitates the integration of additional code when necessary. For instance, certain plots might require limiting the range of an axis, which can be achieved by combining these functions with small code additions, as sketched below. By leveraging these functions, the project benefits from reduced redundancy, improved efficiency, and cohesive formatting across all visualizations and data representations.
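For example, a hypothetical call (the data frame and column names are illustrative, not from this project) that extends a plotter() figure with an axis limit:
# plotter() returns a ggplot object, so standard ggplot2 layers can be added
plotter(monthly_counts, x_col = month, y_col = n, geomType = "line") +
    ggplot2::coord_cartesian(ylim = c(0, 500000))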

1.5 Initial Database Table List

Code
# Since we already know the paths, we'll manually define these. Otherwise, we
# might have to delete the db folder and start over.

dbList <- list(Paths = c("db/original_data.db", "db/complete_data.db")) |>
    as.data.frame()

dbList |>
    tabler(title = gt::md("Which tables have<br>been created so far?"), note_list = list(gt::md("Tables in `db/data.db` at this stage")),
        location_list = list("Paths"), noteColumns = TRUE, source_note = gt::md("**Source**: `db/data.db`"),
        label_n = NULL, align_parameter = "left") |>
    gt::cols_align(align = "left")
Table 2: The list of tables created in data.db so far.
Which tables have
been created so far?
Paths1
db/original_data.db
db/complete_data.db
Source: db/data.db
1 Tables in db/data.db at this stage
Figure 1: A view of the filesystem directory

Notice there are no separate files for individual tables; data.db in this view is the name of the database itself, not a table.
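The table list can also be pulled from the live connection instead of being hard-coded; dbListTables() is standard DBI and works with duckdb connections:
# Query the connection for the tables it currently holds
DBI::dbListTables(dbconn)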

2 Tidying

2.1 Duplicates

Code to Remove Duplicates

First, record the number of observations in the raw data.
# Need to save this count for the summary table later
original_nobs <- dplyr::tbl(dbconn, original_path) |>
    dplyr::collect() |>
    nrow()
Create a table containing the duplicated observations.
# This is a separate table used to analyze the observations returned as not
# distinct (n > 1). This adds an extra column, labeled 'n'.
dupeTable <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::select(started_at:end_station_name) |>
    # Counts of unique rows added as column 'n'
    dplyr::add_count(started_at, ended_at, start_station_name, end_station_name) |>
    # Only observations that occur more than once are kept
    dplyr::filter(n > 1) |>
    # To see all rows, not just one row for each obs.
    dplyr::ungroup() |>
    dplyr::arrange(started_at) |>
    dplyr::collect()
Record the distinct duplication count (n) and the total number of duplicated rows.
distinctCopiesCount <- dupeTable |>
    dplyr::distinct(n) |>
    as.integer()

# Total number of rows in the duplicates table
duplicateObs <- nrow(dupeTable)
Create a table of the now unduplicated observations seen earlier.
# The issue is, we need to get rid of not all of these rows, but just the extra
# duplicate observations.

# If there were 2 rows of duplicates, one would want to end up with 1 row after
# removing the extras.
undupedTable <- dupeTable |>
    dplyr::distinct(started_at, start_station_name, ended_at, end_station_name)
Record a count of the 'incorrect' observations.
# Count distinct rows using ride_id alone. This count is 'incorrect' because
# ride_id is unique even for duplicated trips, so it overstates the number of
# distinct observations.
count_incorrectDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("ride_id")) |>
    dplyr::count(name = "Incorrect Distinct Observations") |>
    dplyr::collect() |>
    as.integer()
Record a count of the correct observations.
# For the correct count of obs
count_correctDists <- dplyr::tbl(dbconn, complete_path) |>
    dplyr::distinct(dplyr::pick("started_at", "start_station_name", "ended_at", "end_station_name")) |>
    dplyr::count() |>
    dplyr::collect() |>
    as.integer()
Lastly, write the unduplicated data to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/data_unduped.db"))) {
    dplyr::tbl(dbconn, complete_path) |>
        dplyr::distinct(started_at, start_station_name, ended_at, end_station_name,
            .keep_all = TRUE) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/data_unduped.db", overwrite = TRUE)
}

A crucial question arises: How can one identify and handle duplicate data? This section covers the process of checking for duplicates and selectively removing them while exercising caution. It is essential to recognize that the presence of unique values in a single column does not necessarily guarantee the uniqueness of each observation or row.

While all values in the ride_id column were found to be unique, not all observations were truly distinct. To verify the uniqueness of each observation, additional columns (started_at, ended_at, start_station_name, and end_station_name) were used. These columns provide more granular information: the precise starting and ending times down to the second, plus the starting and ending locations. It was assumed that observations with identical starting and ending date-times and stations, despite having different ride IDs, were potentially erroneous duplicates.
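A toy illustration of the check (hypothetical three-row sample, with times abbreviated; rows A and B describe the same trip despite distinct ride IDs):
# Hypothetical sample: unique ride_ids, but rows A and B are the same trip
trips <- tibble::tibble(
    ride_id    = c("A", "B", "C"),
    started_at = c("08:00:00", "08:00:00", "09:00:00"),
    ended_at   = c("08:10:00", "08:10:00", "09:05:00")
)

# Same logic as the dupeTable step above: count matching trips, keep n > 1
trips |>
    dplyr::add_count(started_at, ended_at) |>
    dplyr::filter(n > 1)
# Rows A and B are flagged (n = 2) despite their distinct ride_ids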

Code
gtDupes <- dupeTable |>
    dplyr::group_by(started_at) |>
    gt::gt(
        rowname_col = "row",
        groupname_col = "started_at",
        row_group_as_column = TRUE
    ) |>
    gt::tab_style(
        style = list(
            gt::cell_text(weight = "bold", align = "center"),
            gt::cell_borders(sides = c("bottom"))
        ),
        locations = gt::cells_column_labels(gt::everything())
    ) |>
    gt::tab_style(
        style = list(
            gt::cell_borders(sides = c("left", "right"), color = "transparent"),
            gt::cell_text(align = "center", v_align = "middle")
        ),
        locations = gt::cells_body(gt::everything())
    ) |>
    gt::data_color(
        columns = start_station_name,
        target_columns = gt::everything(),
        method = "auto",
        palette = "basetheme::brutal"
    ) |>
    gt::tab_header(
        title = "A view of duplicated observations",
        subtitle = "Grouping follows the starting date-time value"
    ) |>
    gt::tab_options(
        heading.title.font.weight = "bolder",
        heading.subtitle.font.weight = "lighter",
        heading.align = "center",
        table.background.color = "transparent",
        table.font.color = "SeaShell",
        table.font.size = gt::pct(75)
    ) |>
    gt::tab_source_note(source_note = gt::md("**Source**: `db/complete_data.db`"))

gtDupes
Table 3: Duplicates Table
A view of duplicated observations
Grouping follows the starting date-time value
ended_at start_station_name start_station_id end_station_name n
2023-02-19 12:10:52 2023-02-19 12:24:04 Orleans St & Merchandise Mart Plaza TA1305000022 Green St & Randolph St* 2
2023-02-19 12:24:04 Orleans St & Merchandise Mart Plaza TA1305000022 Green St & Randolph St* 2
2023-04-15 15:56:18 2023-04-15 16:01:54 Kedzie Ave & 45th St 342 Fairfield Ave & 44th St 2
2023-04-15 16:01:54 Kedzie Ave & 45th St 342 Fairfield Ave & 44th St 2
2023-04-21 09:45:26 2023-04-21 10:01:23 Michigan Ave & 8th St 623 Wentworth Ave & Cermak Rd* 2
2023-04-21 10:01:23 Michigan Ave & 8th St 623 Wentworth Ave & Cermak Rd* 2
2023-07-08 18:22:07 2023-07-08 18:37:31 Milwaukee Ave & Grand Ave 13033 Bissell St & Armitage Ave* 2
2023-07-08 18:37:31 Milwaukee Ave & Grand Ave 13033 Bissell St & Armitage Ave* 2
2023-07-08 18:58:14 2023-07-08 19:08:47 Clark St & Schiller St TA1309000024 Bissell St & Armitage Ave* 2
2023-07-08 19:08:47 Clark St & Schiller St TA1309000024 Bissell St & Armitage Ave* 2
2023-07-09 18:00:17 2023-07-09 18:40:49 Wells St & Hubbard St TA1307000151 Fort Dearborn Dr & 31st St* 2
2023-07-09 18:40:49 Wells St & Hubbard St TA1307000151 Fort Dearborn Dr & 31st St* 2
2023-07-10 20:10:41 2023-07-10 20:19:59 Clark St & Newport St 632 Lincoln Ave & Roscoe St* 2
2023-07-10 20:19:59 Clark St & Newport St 632 Lincoln Ave & Roscoe St* 2
2023-07-15 10:48:09 2023-07-15 10:58:22 Racine Ave & Wrightwood Ave TA1309000059 Bissell St & Armitage Ave* 2
2023-07-15 10:58:22 Racine Ave & Wrightwood Ave TA1309000059 Bissell St & Armitage Ave* 2
2023-07-15 19:38:51 2023-07-15 19:55:04 Avondale Ave & Irving Park Rd 15624 Public Rack - Hamlin Ave & Fullerton Ave 2
2023-07-15 19:55:04 Avondale Ave & Irving Park Rd 15624 Public Rack - Hamlin Ave & Fullerton Ave 2
2023-07-23 11:41:36 2023-07-23 12:07:13 Burnham Harbor 15545 Fort Dearborn Dr & 31st St* 2
2023-07-23 12:07:13 Burnham Harbor 15545 Fort Dearborn Dr & 31st St* 2
2023-07-25 18:08:47 2023-07-25 18:21:45 Wabash Ave & Roosevelt Rd TA1305000002 Wentworth Ave & Cermak Rd* 2
2023-07-25 18:21:45 Wabash Ave & Roosevelt Rd TA1305000002 Wentworth Ave & Cermak Rd* 2
2023-07-26 21:10:55 2023-07-26 21:28:44 Burnham Harbor 15545 Fort Dearborn Dr & 31st St* 2
2023-07-26 21:28:44 Burnham Harbor 15545 Fort Dearborn Dr & 31st St* 2
2023-08-05 19:40:37 2023-08-05 20:08:35 Morgan St & Lake St* chargingstx4 Morgan St & Lake St* 2
2023-08-05 20:08:35 Morgan St & Lake St* chargingstx4 Morgan St & Lake St* 2
2023-08-12 17:46:53 2023-08-12 17:57:40 Calumet Ave & 18th St 13102 Wentworth Ave & Cermak Rd* 2
2023-08-12 17:57:40 Calumet Ave & 18th St 13102 Wentworth Ave & Cermak Rd* 2
2023-08-17 12:23:50 2023-08-17 12:37:45 DuSable Lake Shore Dr & Monroe St 13300 Streeter Dr & Grand Ave 2
2023-08-17 12:37:45 DuSable Lake Shore Dr & Monroe St 13300 Streeter Dr & Grand Ave 2
2023-09-03 14:55:59 2023-09-03 15:58:21 Bissell St & Armitage Ave* chargingstx1 Bissell St & Armitage Ave* 2
2023-09-03 15:58:21 Bissell St & Armitage Ave* chargingstx1 Bissell St & Armitage Ave* 2
2023-09-25 17:38:05 2023-09-25 17:52:29 Fairbanks Ct & Grand Ave TA1305000003 DuSable Lake Shore Dr & North Blvd 2
2023-09-25 17:52:29 Fairbanks Ct & Grand Ave TA1305000003 DuSable Lake Shore Dr & North Blvd 2
2023-10-10 13:22:51 2023-10-10 13:29:37 Loomis St & Lexington St 13332 Morgan St & Polk St 2
2023-10-10 13:29:37 Loomis St & Lexington St 13332 Morgan St & Polk St 2
Source: db/complete_data.db

Although the cause of such duplication errors is unknown, one possibility is that a single person checked out multiple bikes simultaneously; each bike would then be assigned a unique ride_id. However, this occurrence was rare, happening only 18 times over the course of the year. The fact that each instance has exactly one duplicate raises concerns and warrants further investigation. Trips could be grouped, with one person paying for another rider's fare, but if that were the case, it is unclear why there is always precisely one duplicate.

In Table 3, duplicate observations are listed and grouped by color for visual clarity. In contrast, Table 4 presents the data after removing the extra copy of each duplicated observation while preserving the unique ones. The duplicates table contains 36 rows; each duplicated observation has exactly one extra copy, so n (the count) is always 2, and the expected number of observations to remove was 18. The complication was removing only the extra copy from each group rather than every row in the group.
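Continuing the toy sample from above, dplyr::distinct() with .keep_all = TRUE keeps only the first copy of each duplicated trip, the same approach used to write db/data_unduped.db:
# One row per distinct trip; all other columns are retained
trips |>
    dplyr::distinct(started_at, ended_at, .keep_all = TRUE)
# Row A is kept, its extra copy (row B) is dropped, and row C remains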

Code
gt_undupes <- undupedTable |>
    dplyr::collect() |>
    dplyr::group_by(started_at) |>
    gt::gt(
        rowname_col = "row",
        groupname_col = "started_at",
        row_group_as_column = TRUE
    ) |>
    gt::fmt_number(decimals = 0) |>
    gt::tab_style(
        style = list(
            gt::cell_text(weight = "bold", align = "center"),
            gt::cell_borders(sides = c("bottom"))
        ),
        locations = gt::cells_column_labels(gt::everything())
    ) |>
    gt::tab_style(
        style = list(
            gt::cell_borders(sides = c("left", "right")),
            gt::cell_text(align = "center", v_align = "middle")
        ),
        locations = gt::cells_body(gt::everything())
    ) |>
    gt::data_color(
        columns = start_station_name,
        target_columns = gt::everything(),
        method = "auto",
        palette = "basetheme::brutal"
    ) |>
    gt::tab_header(title = "After duplicates were removed", subtitle = "Same grouping") |>
    gt::tab_options(
        heading.title.font.weight = "bolder",
        heading.subtitle.font.weight = "lighter",
        heading.align = "center",
        table.background.color = "transparent",
        table.font.color = "SeaShell",
        table.font.size = gt::pct(75)
    ) |>
    gt::tab_source_note(source_note = gt::md("**Source**: `db/complete_data.db`"))

gt_undupes
Table 4: Un-duplicated Table
After duplicates were removed
Same grouping
start_station_name ended_at end_station_name
2023-02-19 12:10:52 Orleans St & Merchandise Mart Plaza 2023-02-19 12:24:04 Green St & Randolph St*
2023-04-15 15:56:18 Kedzie Ave & 45th St 2023-04-15 16:01:54 Fairfield Ave & 44th St
2023-04-21 09:45:26 Michigan Ave & 8th St 2023-04-21 10:01:23 Wentworth Ave & Cermak Rd*
2023-07-08 18:22:07 Milwaukee Ave & Grand Ave 2023-07-08 18:37:31 Bissell St & Armitage Ave*
2023-07-08 18:58:14 Clark St & Schiller St 2023-07-08 19:08:47 Bissell St & Armitage Ave*
2023-07-09 18:00:17 Wells St & Hubbard St 2023-07-09 18:40:49 Fort Dearborn Dr & 31st St*
2023-07-10 20:10:41 Clark St & Newport St 2023-07-10 20:19:59 Lincoln Ave & Roscoe St*
2023-07-15 10:48:09 Racine Ave & Wrightwood Ave 2023-07-15 10:58:22 Bissell St & Armitage Ave*
2023-07-15 19:38:51 Avondale Ave & Irving Park Rd 2023-07-15 19:55:04 Public Rack - Hamlin Ave & Fullerton Ave
2023-07-23 11:41:36 Burnham Harbor 2023-07-23 12:07:13 Fort Dearborn Dr & 31st St*
2023-07-25 18:08:47 Wabash Ave & Roosevelt Rd 2023-07-25 18:21:45 Wentworth Ave & Cermak Rd*
2023-07-26 21:10:55 Burnham Harbor 2023-07-26 21:28:44 Fort Dearborn Dr & 31st St*
2023-08-05 19:40:37 Morgan St & Lake St* 2023-08-05 20:08:35 Morgan St & Lake St*
2023-08-12 17:46:53 Calumet Ave & 18th St 2023-08-12 17:57:40 Wentworth Ave & Cermak Rd*
2023-08-17 12:23:50 DuSable Lake Shore Dr & Monroe St 2023-08-17 12:37:45 Streeter Dr & Grand Ave
2023-09-03 14:55:59 Bissell St & Armitage Ave* 2023-09-03 15:58:21 Bissell St & Armitage Ave*
2023-09-25 17:38:05 Fairbanks Ct & Grand Ave 2023-09-25 17:52:29 DuSable Lake Shore Dr & North Blvd
2023-10-10 13:22:51 Loomis St & Lexington St 2023-10-10 13:29:37 Morgan St & Polk St
Source: db/complete_data.db

To ensure the accurate removal of duplicates, the distinct n-values in the duplicates table were checked (n is always 2), and the un-duplicated table was confirmed to contain the expected 18 unique trips. The total number of observations stood at 4,331,707 before deduplication and 4,331,689 after, a difference of exactly 18, matching the number of duplicates identified earlier. These steps are documented in Table 5 for reference.

By comparing the distinct n-values and the total counts before and after deduplication, only the extra duplicate observations were removed, preserving the integrity of the unique data. This careful approach to data cleaning is crucial for maintaining data quality and reliability throughout the analysis.

Code used for handling duplicates in this section

2.2 Outliers

Transform and Filter the Database

If you happen to be re-using this code, this check avoids having to
re-download or re-filter the data after making further adjustments.
filtered_path <- "db/filtered_data.db"

# Do we still need to filter the database?
if (!duckdb::dbExistsTable(dbconn, filtered_path)) {
    source("Scripts/transFilter.R")
    transFilter(conxn = dbconn, oldPath = "db/data_unduped.db", newPath = filtered_path)
}
This script executes only when the filtered table does not yet exist in the
db/data.db database.
# ----
# Author: Eric Mossotti
# CC-BY SA
# ----
# Performs data transformations and filters to enforce consistency across
# all of the analyses. It also will simplify query syntax in extended analyses.
# Filtering can be based on flexible, but sensible criteria.
# ----
transFilter <- function(conxn, oldPath, newPath) {
    dplyr::tbl(conxn, oldPath) |>
        dplyr::collect() |>
        dplyr::mutate(
            "trip_time" = lubridate::time_length(lubridate::interval(started_at, ended_at), 
                                                 unit = "minute"),
            miles = geosphere::distGeo(
                p1 = cbind(start_lng, start_lat),
                p2 = cbind(end_lng, end_lat)
            ) / 1000 * 0.62137119,
            mph = (miles / (trip_time / 60)),
            .keep = "all",
        ) |> 
        # Floor rationale: distances under 0.1 miles are easily walked.
        # Speed ceiling rationale: professional cyclists average around 20 mph.
        # Speed floor rationale: accounts for trips possibly spent idling.
        # rideable_type rationale: docked_bike stopped being recorded as a
        # distinct category during the period analyzed (2023); it was phased
        # out, and little information exists about what it means.
        dplyr::filter(
            trip_time > 1,
            trip_time < 480,
            rideable_type != "docked_bike",
            miles >= 0.1,
            mph <= 20,
            mph >= 1
        ) |>
        duckdb::dbWriteTable(conn = conxn,
                             name = newPath,
                             overwrite = TRUE)
}
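A quick sanity check of the derived-speed formula, with illustrative numbers:
# A 2-mile trip lasting 12 minutes implies 2 / (12 / 60) = 10 mph,
# which falls inside the 1-20 mph band kept by the filter
miles <- 2
trip_time <- 12  # minutes
mph <- miles / (trip_time / 60)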

Observations deemed erroneous or irrelevant for identifying usage trends among members and casual users were filtered out. Keeping track of these errors is a good practice, as they might provide insights into the differences in how members and casuals utilize the service.

Trips with negative duration were flagged as errors and removed. Additionally, trips lasting less than a minute but greater than zero were noted and removed, as they could potentially skew the derived statistics. These extremely short trips might be attributed to users briefly trying out the service before committing or quickly realizing their dissatisfaction with it. While some observations seemed nonsensical, most of the data was retained.
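A small sketch of how such cases could be flagged before removal (hypothetical sample; trip_time is in minutes):
# Hypothetical durations: negative, under a minute, and normal
sample_trips <- tibble::tibble(trip_time = c(-3, 0.5, 25))

sample_trips |>
    dplyr::mutate(flag = dplyr::case_when(
        trip_time < 0  ~ "error: negative duration",
        trip_time <= 1 ~ "removed: under one minute",
        TRUE           ~ "kept"
    ))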

Consistent with the previous approach, an if-else decision was employed to facilitate testing. An external database filtering script was utilized to streamline the code within the main Quarto document. The resulting filtered data served as the foundation for subsequent analysis and table generation.

To get a count of the total observations remaining after filtering.
count_filtered <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(ride_id) |>
    dplyr::distinct() |>
    dplyr::count() |>
    dplyr::collect() |>
    as.integer()
Code
# To see the history of obs in our dataset.
summaryProcessTable <- tibble::tribble(
    ~ " ",
    ~ "  ",
    "Original   ",
    original_nobs,
    "Complete Observations   ",
    count_incorrectDists,
    "Duplicates   ",
    (count_incorrectDists - count_correctDists),
    "Filtered     ",
    (count_correctDists - count_filtered),
    "Total Corrected   ",
    count_filtered
) |>
    gt::gt(rownames_to_stub = FALSE) |>
    gt::tab_header(
        title = "Documenting transformation results",
        subtitle = gt::md("Noting the change<br>in overall observations")
    ) |>
    gt::cols_label("  " = "n") |>
    gt::tab_footnote(
        footnote = gt::md("Row counts throughout the cleaning steps."),
        locations = gt::cells_column_labels(columns = "  ")
    ) |>
    gt::tab_style(
        style = list(
            gt::cell_borders(sides = "bottom"),
            gt::cell_text(align = "left", stretch = "semi-expanded")
        ),
        locations = gt::cells_body(gt::everything())
    ) |>
    gt::tab_style(
        gt::cell_text(align = "center", stretch = "semi-expanded"),
        locations = list(
            gt::cells_title(groups = c("title", "subtitle")),
            gt::cells_column_labels(gt::everything())
        )
    ) |>
    gt::fmt_number(decimals = 0) |>
    gt::tab_options(
        column_labels.font.weight = "bold",
        table.background.color = "transparent",
        table.font.color = "SeaShell",
        row.striping.background_color = "gray10",
        row.striping.include_table_body = TRUE
    ) |>
    gt::tab_source_note(source_note = gt::md("**Sources**: `db/original_data.db`, `db/complete_data.db`,<br>`db/data_unduped.db`, `db/filtered_data.db`"))

summaryProcessTable
Table 5: Documenting observation count history.
Documenting transformation results
Noting the change
in overall observations
n1
Original 5,719,877
Complete Observations 4,331,707
Duplicates 18
Filtered 434,291
Total Corrected 3,897,398
Sources: db/original_data.db, db/complete_data.db,
db/data_unduped.db, db/filtered_data.db
1 Row counts throughout the cleaning steps.

Outlier and additional transformations

3 Exploratory Analysis

EDA Scripts

Reduces duplicate code and improves consistency in table formatting across this report.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# Reduces duplicate code for generating publication ready tables.
# ----
tabler <- function(tbl_name,
                   source_note = NULL,
                   title = NULL,
                   subtitle = NULL,
                   groupName = NULL,
                   by = NULL,
                   label = NULL,
                   label_n = "n",
                   label_member = "member_casual",
                   location = NULL,
                   footnote = NULL,
                   footer_column = NULL,
                   value_columns = gt::everything(),
                   decimals = 0,
                   font_color = "seashell",
                   bg_color = "transparent",
                   vline_color = "gray20",
                   hline_color = "gray20",
                   hide_column_labels = FALSE,
                   isSummary = FALSE,
                   isBinary = FALSE,
                   note_list = NULL,
                   location_list = NULL,
                   noteColumns = FALSE,
                   noteRows = FALSE,
                   chi_result = NULL,
                   chiVar = NULL,
                   chiBy = NULL,
                   stub_label = NULL,
                   stub_note = NULL,
                   stubName = NULL,
                   isStub = FALSE,
                   row_index = NULL,
                   label_parameter = NULL,
                   align_parameter = "center"
                   ) {
    # Need to account for grouped tables and non-grouped tables
    # Grouped ----
    if (!is.null(groupName)) {
        tbl <- gt::gt(tbl_name,
                      groupname_col = groupName,
                      row_group_as_column = TRUE) |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_stubhead(label = stub_label) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table.margin.left = 0,
                table.margin.right = 0,
                table_body.vlines.color = "transparent",
                table_body.hlines.color = hline_color,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
                column_labels.hidden = hide_column_labels
            ) |>
            gt::tab_source_note(source_note = {{ source_note }})
    }
    
    # Chi Square ----
    if (isSummary == TRUE) {
        tbl <- tbl_name |>
            gtsummary::tbl_summary(by = {{ by }}, label = label) |>
            gtsummary::modify_footnote(
                gtsummary::all_stat_cols() ~ paste0(
                    "n (%);   χ²  = ",
                    round(chi_result$statistic, 2),
                    ";   df = ",
                    chi_result$parameter
                )
            ) |>
            gtsummary::add_p() |>
            gtsummary::as_gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = "bottom"),
                    gt::cell_text(
                        align = "left",
                        stretch = "semi-expanded",
                        whitespace = "break-spaces"
                    )
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                gt::cell_text(align = "center", stretch = "semi-expanded"),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle")),
                    gt::cells_column_labels(gt::everything())
                )
            ) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table_body.vlines.color = vline_color,
                table_body.hlines.color = hline_color,
                column_labels.hidden = hide_column_labels,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    
    # Binary Logistic ----
    if (isBinary == TRUE) {
        tbl <- tbl_name |>
            gtsummary::as_gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = "bottom"),
                    gt::cell_text(align = "left", stretch = "semi-expanded")
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                gt::cell_text(align = "center", stretch = "semi-expanded"),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle")),
                    gt::cells_column_labels(gt::everything())
                )
            ) |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table_body.vlines.color = vline_color,
                table_body.hlines.color = hline_color,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.hidden = hide_column_labels,
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    # non - grouped, non - binary, non - summary ----
    if (is.null(groupName) &&
        isFALSE(isBinary) &&
        isFALSE(isSummary)) {
        tbl <- tbl_name |>
            gt::gt() |>
            gt::tab_header(title = title, subtitle = subtitle) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(sides = c("right", "left"), color = vline_color),
                    gt::cell_text(stretch = "semi-expanded")
                ),
                locations = gt::cells_body(gt::everything())
            ) |>
            gt::tab_style(
                style = list(
                    gt::cell_borders(
                        sides = c("right", "left", "top"),
                        color = vline_color
                    ),
                    gt::cell_text(align = "center", stretch = "semi-expanded")
                ),
                locations = list(
                    gt::cells_title(groups = c("title", "subtitle"))
                )
            ) |>
            gt::tab_style(
                style = list(
                    gt::cell_text(align = align_parameter, stretch = "semi-expanded")
                ),
                locations = list(gt::cells_column_labels(columns = label_parameter))
            )
        # Need to avoid this part for non-frequency tables?
        if (!is.null(label_n)) {
            tbl <- tbl |> gt::tab_style(
                style = list(
                    gt::cell_text(align = "right", stretch = "semi-expanded")
                ),
                locations = list(gt::cells_column_labels(columns = label_n))
            )
        }
        tbl <- tbl |>
            gt::tab_options(
                table.background.color = bg_color,
                table.font.color = font_color,
                table.margin.left = 0,
                table.margin.right = 0,
                table_body.vlines.color = "transparent",
                table_body.hlines.color = "transparent",
                column_labels.vlines.color = vline_color,
                column_labels.hidden = hide_column_labels,
                heading.border.lr.color = "transparent",
                container.padding.x = 0,
                container.padding.y = 0,
                row.striping.include_table_body = TRUE,
                row.striping.background_color = "gray10",
                column_labels.font.weight = "bold",
            ) |>
            gt::tab_source_note(source_note = source_note)
    }
    # To add footnotes to the various table types ----
    if (!is.null(note_list) && !is.null(location_list)) {
        tbl <- tbl |>
            add_multiple_footnotes(note_list,
                                   location_list,
                                   noteColumns,
                                   noteRows,
                                   row_index)
    }
    
    if (isTRUE(isStub)) {
        tbl <- tbl |>
            gt::tab_stubhead(label = stub_label) |>
            gt::cols_align(columns = groupName, align = "center") |>
            gt::tab_footnote(
                footnote = stub_note,
                locations = gt::cells_stubhead(),
                placement = "auto"
            )
    }
    return(tbl)
}


# Footnotes function ----
add_multiple_footnotes <- function(tbl,
                                   note_list,
                                   location_list,
                                   noteColumns,
                                   noteRows,
                                   row_index) {
    if (length(note_list) != length(location_list)) {
        stop("The lengths of note_list and location_list must be equal.")
    }
    
    if (isTRUE(noteColumns)) {
        for (i in seq_along(note_list)) {
            tbl <- tbl |>
                gt::tab_footnote(
                    footnote = note_list[[i]],
                    locations = gt::cells_column_labels(columns = location_list[[i]]),
                    placement = "left"
                )
        }
    }
    
    if (isTRUE(noteRows)) {
        for (i in seq_along(note_list)) {
            tbl <- tbl |>
                gt::tab_footnote(
                    footnote = note_list[[i]],
                    locations = gt::cells_body(columns = location_list[[i]], rows = row_index),
                    placement = "auto"
                )
        }
    }
    
    return(tbl)
}
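A minimal usage sketch of tabler(); the two-row data frame here is hypothetical:
# Hypothetical frequency table styled with tabler()
data.frame(Type = c("classic_bike", "electric_bike"), n = c(2L, 1L)) |>
    tabler(
        title = "Example frequency table",
        label_parameter = "Type",
        source_note = gt::md("**Source**: illustrative data")
    )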
Provides a useful script for consistent formatting of many of the plots.
#  ----
# Author: Eric Mossotti
# CC BY-SA
#  ----
# To help reduce duplicate code and implement a consistent theme throughout a
# markdown project. This doesn't account for all plot types possible yet.
# ----
plotter <- function(data,
                    x_col,
                    y_col,
                    group_col = NULL,
                    color_col = NULL,
                    title = NULL,
                    subtitle = NULL,
                    x_label = NULL,
                    y_label = NULL,
                    geomType = NULL,
                    lineGroup_palette = "Paired",
                    colGroup_palette = "Paired",
                    line_palette = "YlOrRd",
                    col_palette = "YlOrRd",
                    facetCol_palette = "YlGnBu",
                    fill_col = "YlOrRd",
                    axis_ticks_color = "lavenderblush",
                    bg_color = "#222222",
                    text_color = "seashell",
                    grid_color = "grey30",
                    x_lim = c(NA, NA),
                    isFaceted = FALSE,
                    is_lineGroup = FALSE,
                    is_colGroup = FALSE,
                    isHistogram = FALSE,
                    isDensity = FALSE,
                    breaks = NULL,
                    limits = NULL,
                    angle = ggplot2::waiver(),
                    n.dodge = 1,
                    binwidth = NULL,
                    areaFill = NULL,
                    density_fill = ggplot2::waiver(),
                    density_color = ggplot2::waiver(),
                    alpha = ggplot2::waiver(),
                    isROC = FALSE,
                    roc_color = NULL,
                    vline_color = NULL,
                    vline_size = NULL,
                    density_alpha = ggplot2::waiver(),
                    dnorm_color = ggplot2::waiver(),
                    bins = NULL,
                    histo_breaks = NULL,
                    low = NULL,
                    high = NULL,
                    isTime = FALSE,
                    date_breaks = ggplot2::waiver(),
                    date_labels = ggplot2::waiver(),
                    date_minor_breaks = ggplot2::waiver(),
                    lineColor = NULL,
                    colPosition = "stack",
                    labels = ggplot2::waiver(),
                    isTimeHist = FALSE,
                    quartiles = NULL,
                    qformat = NULL) {
    # line ----
    if (geomType == "line") {
        if (is_lineGroup == TRUE) {
            plot <- data |> ggplot2::ggplot(mapping = ggplot2::aes(
                x = {{ x_col }},
                y = {{ y_col }},
                color = {{ color_col }}
            )) +
                ggplot2::geom_line(show.legend = TRUE) +
                ggplot2::scale_color_brewer(palette = lineGroup_palette, name = "")
        } else {
            plot <- data |> ggplot2::ggplot(mapping = ggplot2::aes(x = {{ x_col }}, y = {{ y_col }})) +
                ggplot2::geom_line(show.legend = TRUE, color = lineColor)
        }
    }
    # col ----
    else if (geomType == "column") {
        # non-grouped, non-faceted, non-density, non-histogram ----
        if (isFALSE(is_colGroup) &&
            isFALSE(isFaceted) &&
            isFALSE(isDensity) &&
            isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ y_col }}
                )) +
                ggplot2::geom_col(show.legend = FALSE) +
                ggplot2::scale_fill_distiller(palette = col_palette)
        }
        # grouped, non - faceted, non - density, non - histogram----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_col(
                    show.legend = TRUE,
                    position = colPosition,
                    color = color_col
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
        # grouped, faceted, non-density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ y_col }}
                )) +
                ggplot2::geom_col(show.legend = FALSE) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_fill_distiller(palette = facetCol_palette)
        }
        # grouped, faceted, density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ggplot2::after_stat(density),
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_density(alpha = alpha) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = 1, angle = angle)
                )
        }
        # grouped, non-faceted, density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ggplot2::after_stat(density),
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    alpha = alpha,
                    breaks = histo_breaks
                ) +
                ggplot2::geom_density(alpha = alpha,
                                      color = density_color,
                                      fill = density_fill) +
                ggplot2::stat_function(
                    fun = stats::dnorm,
                    args = list(mean = mean(dplyr::pull(data, {{ x_col }})),
                                sd = stats::sd(dplyr::pull(data, {{ x_col }})))
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = 1, angle = angle)
                )
        }
        # grouped, non-faceted, density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = ggplot2::after_stat(density),
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_density(alpha = density_alpha, color = color_col) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
        # non-grouped, non-faceted, density, non-histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(
                    ggplot2::aes(
                        x = {{ x_col }},
                        y = ggplot2::after_stat(density),
                        fill = density_fill,
                        color = density_color
                    )
                ) +
                ggplot2::geom_density(show.legend = FALSE) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, non-faceted, non-density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    alpha = alpha,
                    breaks = histo_breaks
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, non-density, non-histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(
                    x = {{ x_col }},
                    y = {{ y_col }},
                    fill = {{ group_col }}
                )) +
                ggplot2::geom_col(show.legend = TRUE, position = "dodge") +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "") +
                ggplot2::scale_x_discrete(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, non-density, histogram ----
        else if (!isFALSE(is_colGroup) &&
                 !isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    bins = bins,
                    breaks = breaks
                ) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # non-grouped, non-faceted, non-density, histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }})) +
                ggplot2::geom_histogram(
                    show.legend = FALSE,
                    color = color_col,
                    binwidth = binwidth,
                    ggplot2::aes(fill = ggplot2::after_stat(count))
                )
            if (!isFALSE(isTimeHist)) {
                plot <- plot +
                    ggplot2::scale_x_datetime(
                        date_breaks = date_breaks,
                        date_labels = date_labels,
                        date_minor_breaks = date_minor_breaks,
                        guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
                        sec.axis = ggplot2::sec_axis(
                            ~ .,
                            breaks = quartiles,
                            labels = scales::date_format(qformat),
                            guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                        )
                    )
            } else {
                plot <- plot +
                    ggplot2::scale_x_continuous(
                        breaks = breaks,
                        limits = limits,
                        guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle),
                        sec.axis = ggplot2::sec_axis(
                            ~ .,
                            breaks = quartiles,
                            labels = scales::label_number(),
                            guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                        )
                    )
            }
            plot <- plot +
                ggplot2::scale_y_continuous() +
                ggplot2::scale_fill_gradient(low = low, high = high) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{x_col}})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::geom_vline(
                    data = data.frame(q = quartiles),
                    ggplot2::aes(xintercept = q, color = factor(q)),
                    linetype = "dashed",
                    size = 1
                ) +
                ggplot2::scale_color_manual(
                    values = c("orange", "green", "purple"),
                    labels = c("25th", "50th", "75th")
                )
        }
        # non-grouped, non-faceted, density, histogram ----
        else if (isFALSE(is_colGroup) &&
                 isFALSE(isFaceted) &&
                 !isFALSE(isDensity) &&
                 !isFALSE(isHistogram)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }}, y = ggplot2::after_stat(density))) +
                ggplot2::geom_histogram(color = color_col,
                                        bins = bins,
                                        breaks = breaks) +
                ggplot2::geom_density(color = density_color,
                                      fill = density_fill,
                                      alpha = density_alpha) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{ x_col }})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # grouped, faceted, density, histogram ----
        else {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                             fill = {{ group_col }})) +
                ggplot2::geom_histogram(
                    binwidth = binwidth,
                    color = color_col,
                    bins = bins,
                    breaks = histo_breaks
                ) +
                ggplot2::geom_density(color = density_color,
                                      fill = density_fill,
                                      alpha = density_alpha) +
                ggplot2::geom_vline(
                    ggplot2::aes(xintercept = mean({{ x_col }})),
                    color = vline_color,
                    size = vline_size,
                    linetype = "solid"
                ) +
                ggplot2::facet_grid(rows = "member_casual") +
                ggplot2::scale_x_continuous(
                    breaks = breaks,
                    limits = limits,
                    guide = ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                )
        }
        # other misc plot types ----
    } else {
        # pROC objects ----
        if (!isFALSE(isROC)) {
            plot <- data |>
                pROC::ggroc(aes = "linetype", color = roc_color)
        }
        # time-series data ----
        # grouped, non-faceted, density, non-histogram
        if (!isFALSE(isTime)) {
            plot <- data |>
                ggplot2::ggplot(ggplot2::aes(x = {{ x_col }},
                                             fill = {{ group_col }})) +
                ggplot2::geom_density(alpha = density_alpha, color = color_col) +
                ggplot2::scale_x_datetime(
                    date_breaks = date_breaks,
                    date_labels = date_labels,
                    date_minor_breaks = date_minor_breaks,
                    labels = labels,
                    guide =  ggplot2::guide_axis(n.dodge = n.dodge, angle = angle)
                ) +
                ggplot2::scale_fill_brewer(palette = colGroup_palette, name = "")
        }
    }
    # For the rest of the otherwise likely duplicated plot settings ----
    plot <- plot +
        ggplot2::labs(title = title, subtitle = subtitle, x = x_label, y = y_label) +
        ggplot2::theme(
            panel.background = ggplot2::element_rect(fill = bg_color, color = NA),
            plot.background = ggplot2::element_rect(fill = bg_color, color = NA),
            text = ggplot2::element_text(color = text_color),
            panel.grid = ggplot2::element_blank(),
            axis.title.x = ggplot2::element_text(margin = grid::unit(c(5, 5, 5, 5), "mm")),
            axis.title.y = ggplot2::element_text(margin = grid::unit(c(5, 5, 5, 5), "mm")),
            axis.text.x = ggplot2::element_text(color = "Snow", margin = grid::unit(c(1, 1, 1, 1), "mm")),
            axis.text.y = ggplot2::element_text(color = "Snow", margin = grid::unit(c(2, 2, 2, 2), "mm")),
            axis.ticks = ggplot2::element_line(color = axis_ticks_color),
            axis.ticks.y = ggplot2::element_blank(),
            panel.grid.major.x = ggplot2::element_line(color = grid_color, linetype = "dotted"),
            panel.grid.major.y = ggplot2::element_line(color = grid_color),
            legend.background = ggplot2::element_rect(fill = bg_color),
            legend.title = ggplot2::element_blank()
        )
}
A view of the data-transformation script used below. It performs many of the backend query transformations needed for this document.
# Author: Eric Mossotti
# CC BY-SA
# ----
# Programmatic db extraction and df transformation, utilizing tidy evaluation.
#
# Allows one to specify which columns to select, which columns to group by for 
# counting, and which column (if any) to transform. It can be used for a variety of 
# similar data transformation tasks.
# 
# Results can be weighted to help certain modeling functions run faster while 
# not altering the results one would expect from using non-aggregated datasets 
# as input.
# ---- 
transformData <- function(conn,
                          path,
                          df,
                          select_cols,
                          group_cols = NULL,
                          binary_col = NULL,
                          ntile_col = NULL,
                          pred_col = NULL,
                          zero_val = NULL,
                          one_val = NULL,
                          qtile_levels = NULL,
                          doWeights = FALSE,
                          doQuantile = FALSE,
                          isDF = FALSE) {
    # Weight the data? ----
    if (isTRUE(doWeights)) {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::add_count(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::distinct() |>
                dplyr::arrange(dplyr::across(dplyr::all_of(group_cols))) |>
                dplyr::collect()
        }
    } else {
        # Is the data from a df or db? ----
        if (isTRUE(isDF)) {
            freq_data <- df |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(select_cols)))
        } else {
            freq_data <- dplyr::tbl(conn, path) |>
                dplyr::select(dplyr::all_of(select_cols)) |>
                dplyr::arrange(dplyr::across(dplyr::all_of(select_cols))) |>
                dplyr::collect()
        }
    }
    # Do we want to transform a column to binary for modeling?
    if (!is.null(binary_col) &&
        binary_col %in% names(freq_data)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                # Tidy-selects data with ':=' and bang-bang operators
                !!binary_col := factor(.data[[binary_col]],
                                       levels = c(zero_val, one_val))
            )
    }
    # Do we want to predict anything?
    if (isTRUE(doQuantile)) {
        freq_data <- freq_data |>
            dplyr::mutate(
                !!ntile_col := dplyr::ntile(.data[[pred_col]], n = 4),
                !!ntile_col := factor(.data[[ntile_col]],
                                      levels = c(1, 2, 3, 4),
                                      labels = as.vector(qtile_levels))
            )
    }
    
    return(freq_data)
}
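As a minimal sketch of the weighting behavior (the toy data frame below is hypothetical), doWeights = TRUE collapses duplicate rows into a frequency column n:

toy <- data.frame(member_casual = c("casual", "member", "member"))

# Returns two rows: casual (n = 1) and member (n = 2);
# conn and path are unused when isDF = TRUE
transformData(df = toy,
              select_cols = "member_casual",
              group_cols = "member_casual",
              doWeights = TRUE,
              isDF = TRUE)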
A view of the chi-square helper used below. It supports the Chi-Square tables by returning the chi-squared statistic and degrees of freedom, which are added to the table footnotes.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# The code for returning chi-square test results as a tibble for use in tables.
# ----
chisqTest <- function(data, variable, by) {
    test_result <- chisq.test(x = data[[variable]], y = data[[by]]) |>
        broom::tidy() |>
        dplyr::select(statistic, parameter, p.value)
    
    return(test_result)
}

Scripts used frequently in this section

3.1 Methods

This section explains certain methodologies that might be less obvious to the reader.

3.1.1 Chi-Square

Set up

Null Hypothesis (H0) There is no association between membership status and travel behavior.

Alternative Hypothesis (H1) There is an association between membership status and travel behavior.

Significance level: \(\alpha = 0.05\)

Collect data

Tables are transformed and stored for later use. Where needed, the appropriate table is extracted from the database for analysis.

Expected frequencies

\(E_{\text{cell}} = \dfrac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}\)

Chi-square statistic

\(\chi^2=\sum[(O-E)^2/E]\)

\(O\) - Observed frequency

\(E\) - Expected frequency

\(\sum\) - Sum over all cells in the table
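As a concrete base-R sketch of these two formulas, using the bicycle-type counts reported later in Table 9:

# Observed counts: rows = bicycle type, columns = membership status
obs <- rbind(classic_bike  = c(casual = 739475, member = 1714250),
             electric_bike = c(casual = 521146, member = 922527))

# Expected count per cell: (row total * column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Chi-square statistic: sum of (O - E)^2 / E over all cells
chi_sq <- sum((obs - expected)^2 / expected)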

Degrees of freedom

\(\text{df} = (\text{rows} - 1) \times (\text{columns} - 1)\)

The critical value

The critical value can be determined from a chi-square distribution table or with software, using \(\alpha\) and the df.
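In R, for instance, the critical value at \(\alpha = 0.05\) with 1 degree of freedom comes from the chi-square quantile function:

qchisq(1 - 0.05, df = 1)  # ~3.84; reject H0 when the statistic exceeds this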

The p-value

The p-value is calculated from the chi-square statistic and the df. When calculating the p-value, statistical software or functions (like the ones later in this report) internally use this integral:

\(\text{p-value} = \int_{\chi^2}^{\infty} f(x,\ \text{df})\,dx\)

\(\chi^2\) is the calculated chi-square statistic (the lower bound of integration)

\(\infty\) is the upper bound of integration

\(f(x,\ df)\) is the chi-square PDF (probability density function)

\(x\) is the variable of integration

Interpreting results

The calculated chi-square statistic is compared to the critical value (greater or less than). As a consistency check, the p-value is then compared to the significance level: we reject the null hypothesis where \(p < \alpha\), or fail to reject it where \(p \geq \alpha\). This informs us whether there is a significant association between membership status and the given travel-behavior parameter.

Calculation Process

The software used to generate the Chi-Square tables in this report means a human never has to work with the integrals directly. It calculates the chi-square statistic and df from the data, then passes them to the chi-square routine, which handles the integration and returns the p-value.
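In base R, that integral is the upper tail of the chi-square CDF. Continuing the sketch from above:

df_val <- (nrow(obs) - 1) * (ncol(obs) - 1)      # (rows - 1) x (columns - 1)
pchisq(chi_sq, df = df_val, lower.tail = FALSE)  # upper-tail area = p-value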

3.1.2 Binary Logistic Regression

Model Fitting

The logistic regression model is fitted using Maximum Likelihood Estimation (MLE) through an Iteratively Reweighted Least Squares (IRLS) algorithm. This process determines the optimal coefficients \((\beta)\) for the predictors.
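Concretely, the models in this report are fit with base R's glm(); here is a minimal sketch of the weighted-count form, where weighted_counts is a hypothetical stand-in for one of the *_wb.db tables:

model <- glm(member_casual ~ rideable_type,
             family = binomial,
             weights = n,  # frequency weights from the aggregated table
             data = weighted_counts)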

Odds Ratio Calculation

For each predictor variable, the odds ratio is calculated as \(\exp(\beta)\), where \(\beta\) is the coefficient estimated by the model. This represents the change in odds of the outcome for a one-unit increase in the predictor.
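Given such a fitted model object, the odds ratios are simply the exponentiated coefficients:

exp(coef(model))  # one value per term; the intercept exponentiates to the baseline odds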

Statistical Significance

p-values are computed for each predictor using the Wald test. This test calculates a z-statistic, \((\beta/SE(\beta))\), and compares it to a standard normal distribution to determine the probability of observing such a result under the null hypothesis.
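A sketch of that calculation by hand, from a glm summary:

wald <- coef(summary(model))                    # Estimate, Std. Error, z value, Pr(>|z|)
z <- wald[, "Estimate"] / wald[, "Std. Error"]  # Wald z-statistic
p <- 2 * pnorm(-abs(z))                         # two-sided p-value from the standard normal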

Results Compiled As

  • Predictor variables (Characteristics)
  • Odds Ratios (OR)
  • p-values
  • Sometimes confidence intervals for the ORs (not shown in this example)

Reference Category

For categorical predictors, one category is set as the reference (marked with a dash in the OR column); its odds ratio is 1.00 by definition. This process transforms the raw model output into an interpretable format, allowing readers to quickly assess the direction, magnitude, and significance of each predictor’s effect on the outcome variable.
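The reference is simply the first level of the factor, so a different baseline could be set before fitting, e.g.:

# Hypothetical: use electric_bike rather than classic_bike as the reference
weighted_counts$rideable_type <- stats::relevel(factor(weighted_counts$rideable_type),
                                                ref = "electric_bike")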

3.2 Membership

Database Operations

Table Preview

Database Code

Write to .db
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/membership.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(member_casual) |>
        dplyr::arrange(member_casual) |>
        dplyr::collect() |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/membership.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/membership.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 6: Kable output
member_casual
casual
casual
casual
casual
casual
casual
Code
transformData(conn = dbconn, path = "db/membership.db", select_cols = "member_casual",
    group_cols = "member_casual", doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("membership status", "occurrences"), location_list = c("member_casual",
        "n"), source_note = gt::md("**Source**: `db/membership.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)
Table 7: Tabularizing membership distribution across all observations
casual1 2 1260621
member 2636777
Source: db/membership.db
1 membership status
2 occurrences

Table 7 and Figure 2 give an idea of the overall member-to-casual travel distribution.

Code
gplot <- transformData(conn = dbconn, path = "db/membership.db", select_cols = "member_casual",
    group_cols = "member_casual", doWeights = TRUE) |>
    plotter(x_col = member_casual, y_col = n, geomType = "column", title = "How often do members travel?",
        x_label = "Rider Groups", y_label = "n")

gplot
Figure 2: Plotting the overall membership distribution

3.3 Cycle Types

Database Operations

Table Preview

DB Operations

Write bType.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(rideable_type, member_casual) |>
        dplyr::arrange(rideable_type, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(rideable_type = forcats::as_factor(rideable_type)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/bType.db", overwrite = TRUE)
}
Transform and write as a weighted binary table for modeling.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/bType_wb.db"))) {
    transformData(conn = dbconn, path = "db/bType.db", select_cols = c("rideable_type",
        "member_casual"), group_cols = c("rideable_type", "member_casual"), binary_col = "member_casual",
        zero_val = "casual", one_val = "member", doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/bType_wb.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/bType.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/bType_wb.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 8: Kable output
rideable_type member_casual
classic_bike casual
classic_bike casual
classic_bike casual
classic_bike casual
classic_bike casual
classic_bike casual
Table 9: Kable output
rideable_type member_casual n
classic_bike casual 739475
classic_bike member 1714250
electric_bike casual 521146
electric_bike member 922527
Code
transformData(conn = dbconn, path = "db/bType.db", select_cols = "rideable_type",
    group_cols = "rideable_type", doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("bicycle type", "occurrences"), location_list = list("rideable_type",
        "n"), source_note = gt::md("**Source**: `db/bType.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)
Table 10: The overall distribution of bicycle choice
classic_bike1 2 2453725
electric_bike 1443673
Source: db/bType.db
1 bicycle type
2 occurrences
Code
transformData(
    conn = dbconn,
    path = "db/bType.db",
    select_cols = c("rideable_type", "member_casual"),
    group_cols = c("rideable_type", "member_casual"),
    doWeights = TRUE
) |>
    tabler(
        title = NULL,
        groupName = "rideable_type",
        note_list = list("membership status", "occurrences"),
        location_list = list("member_casual", "n"),
        source_note = gt::md("**Source**: `db/bType.db`"),
        noteRows = TRUE,
        #hide_column_labels = TRUE,
        row_index = 1,
        isStub = TRUE,
        stub_label = "Type",
        stub_note = "bicycle type"
    ) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 11: The membership distribution of bicycle choice
Type1
classic_bike casual2 3 739475
member 1714250
electric_bike casual 521146
member 922527
Source: db/bType.db
1 bicycle type
2 membership status
3 occurrences

Table 10 and Figure 3 present the overall frequency distribution for choice of bicycle.

Code
gplot <- transformData(conn = dbconn, path = "db/bType.db", select_cols = "rideable_type",
    group_cols = "rideable_type", doWeights = TRUE) |>
    plotter(x_col = rideable_type, y_col = n, geomType = "column", title = paste0("Which bicycles are preferred?"),
        x_label = "Type", y_label = "n")

gplot
Figure 3: The overall distribution of trips taken by bicycle choice

Likewise, Table 11 and Figure 4 summarize the choice of bicycle with the addition of sums grouped by membership status.

Code
gplot <- transformData(conn = dbconn, path = "db/bType.db", select_cols = c("rideable_type",
    "member_casual"), group_cols = c("rideable_type", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which bicycles are members choosing?"), x_label = "Type",
        y_label = "n", x_col = rideable_type, y_col = n, group_col = member_casual,
        geomType = "column", is_colGroup = TRUE, color_col = "black", colPosition = "dodge",
        colGroup_palette = "Paired")

gplot
Figure 4: The membership distribution of bicycle choice

Table 12 presents a Chi-Square analysis of bicycle type usage among casual users and members. In summary, members show a higher preference for classic bikes (65%) compared to casual users (59%). Casual users have a higher proportion of electric bike usage (41%) compared to members (35%).

There is a statistically significant association between bicycle type and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between bicycle type and membership status. This suggests that the choice of bicycle type is not independent of membership status.

The large \(\chi^2\) value (14762.37) with just 1 degree of freedom (calculated as \([\text{rows} - 1] \times [\text{columns} - 1]\)) results in the very small p-value (< 0.001). This combination strongly suggests that the difference in bicycle preference between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
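As a quick arithmetic check, the percentages above follow directly from the counts in Table 9:

739475 / (739475 + 521146)    # ~0.59, casual riders' classic-bike share
1714250 / (1714250 + 922527)  # ~0.65, members' classic-bike share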

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/bType.db") |>
    dplyr::select(rideable_type, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "rideable_type", by = "member_casual")
Code
chi_table <- tabler(title = gt::md("Chi-Square:<br>The significance of bicycle choice and membership"),
    source_note = gt::md("**Source**: `db/bType.db`"), label = list(rideable_type = "Bicycle Type"),
    by = member_casual, isSummary = TRUE, chiVar = "rideable_type", chiBy = "member_casual",
    tbl_name = data_tibble, chi_result = chiResult)

chi_table
Table 12: Testing independence of bicycle choice and membership variables
Chi-Square:
The significance of bicycle choice and membership
Characteristic casual (N = 1,260,621)1 member (N = 2,636,777)1 p-value2
Bicycle Type

<0.001
    classic_bike 739,475 (59%) 1,714,250 (65%)
    electric_bike 521,146 (41%) 922,527 (35%)
Source: db/bType.db
1 n (%); χ² = 14762.37; df = 1
2 Pearson’s Chi-squared test

Table 13 presents the results of a binary logistic regression analyzing the relationship between bicycle type and membership status. The analysis compares classic bikes and electric bikes, with classic bikes serving as the reference category.

The odds of membership for users of electric bikes were 0.76 times the odds for users of classic bikes. The difference in membership likelihood between electric and classic bike users is highly statistically significant (p < 0.001).
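That 0.76 can be sanity-checked by hand from the Table 9 counts:

odds_classic <- 1714250 / 739475   # odds of membership on classic bikes (~2.32)
odds_electric <- 922527 / 521146   # odds of membership on electric bikes (~1.77)
odds_electric / odds_classic       # ~0.76, matching the modeled odds ratio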

Predicting the log-odds of being a member versus being a casual user.
model <- dplyr::tbl(dbconn, "db/bType_wb.db") |>
    glm(formula = member_casual ~ rideable_type, weights = n, family = binomial)
Code
regression_tbl <- model |>
    gtsummary::tbl_regression(label = list(rideable_type = "Bicycle Choice"), conf.int = FALSE,
        exponentiate = TRUE)

regression_tbl |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' bicycle choices"),
        source_note = gt::md("**Source**: `db/bType_wb.db`"), isBinary = TRUE)
Table 13: Modeling the probability of member travel by the choice of bicycle.
Binary Logistic Regression:
Modeling the likelihood of members’ bicycle choices
Characteristic OR1 p-value
Bicycle Choice

    classic_bike
    electric_bike 0.76 <0.001
Source: db/bType_wb.db
1 OR = Odds Ratio

3.4 Duration

Database Operations

Table Preview

DB Operations

Write to .db.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(trip_time, member_casual) |>
        dplyr::arrange(trip_time, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(trip_time = round(trip_time, digits = 2), mins = round(trip_time,
            digits = 0), mins = forcats::as_factor(mins)) |>
        dplyr::arrange(trip_time, member_casual) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/duration.db", overwrite = TRUE)
}
Query, transform, and write weighted quartile data.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/duration_wq.db"))) {
    transformData(conn = dbconn, path = "db/duration.db", select_cols = c("trip_time",
        "member_casual"), group_cols = c("trip_time", "member_casual"), binary_col = "member_casual",
        pred_col = "trip_time", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (1.02 - 5.73]", "Q2 (5.73 - 9.55]", "Q3 (9.55 - 16.13]",
            "Q4 (16.13 - 475.22]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/duration_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/duration_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 14: Kable output
trip_time member_casual mins
1.02 casual 1
1.02 casual 1
1.02 casual 1
1.02 casual 1
1.02 member 1
1.02 member 1
Table 15: Kable output
trip_time member_casual n quartile
1.02 casual 4 Q1 (1.02 - 5.73]
1.02 member 153 Q1 (1.02 - 5.73]
1.03 casual 7 Q1 (1.02 - 5.73]
1.03 member 149 Q1 (1.02 - 5.73]
1.05 casual 8 Q1 (1.02 - 5.73]
1.05 member 182 Q1 (1.02 - 5.73]
Code
transformData(conn = dbconn, path = "db/duration.db", select_cols = "mins", group_cols = "mins",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("trip duration", "occurrences"), location_list = list("mins",
        "n"), source_note = gt::md("**Source**: `db/duration.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1)
Table 16: The overall distribution of trip durations
11 2 11712
2 109700
3 212213
4 282122
5 288960
6 291295
7 264824
8 252680
9 221983
10 206307
11 176905
12 162289
13 139963
14 128094
15 109254
16 100585
17 85815
18 79498
19 69038
20 63928
21 55184
22 51877
23 45435
24 42596
25 37041
26 34652
27 30606
28 28605
29 25154
30 23712
31 21143
32 19696
33 17808
34 16396
35 14606
36 13646
37 12219
38 11851
39 10433
40 9795
41 8683
42 8167
43 7069
44 6685
45 5842
46 5360
47 4589
48 4425
49 3936
50 3726
51 3274
52 3090
53 2859
54 2790
55 2458
56 2351
57 2136
58 2068
59 1897
60 1876
61 1626
62 1578
63 1417
64 1455
65 1256
66 1236
67 1115
68 1167
69 984
70 961
71 908
72 872
73 829
74 835
75 755
76 745
77 657
78 689
79 622
80 632
81 585
82 566
83 500
84 479
85 443
86 478
87 432
88 382
89 416
90 388
91 337
92 354
93 301
94 309
95 260
96 273
97 238
98 266
99 229
100 217
101 241
102 245
103 203
104 215
105 190
106 190
107 169
108 167
109 154
110 148
111 161
112 161
113 160
114 129
115 125
116 147
117 153
118 120
119 106
120 96
121 116
122 85
123 104
124 86
125 99
126 119
127 87
128 84
129 94
130 74
131 62
132 90
133 77
134 64
135 66
136 67
137 84
138 72
139 62
140 60
141 53
142 36
143 46
144 37
145 55
146 44
147 65
148 44
149 54
150 39
151 47
152 43
153 45
154 37
155 36
156 42
157 29
158 33
159 32
160 33
161 27
162 37
163 28
164 26
165 35
166 31
167 30
168 38
169 12
170 26
171 19
172 25
173 20
174 20
175 30
176 21
177 11
178 18
179 13
180 19
181 17
182 13
183 12
184 11
185 17
186 9
187 9
188 10
189 13
190 13
191 5
192 14
193 9
194 11
195 7
196 5
197 3
198 8
199 6
200 8
201 3
202 7
203 8
204 5
205 3
206 4
207 7
208 9
209 4
210 1
211 4
212 5
213 3
214 3
215 1
216 5
217 4
218 1
219 7
220 2
221 1
222 3
223 1
224 2
225 5
226 3
227 6
228 2
229 2
230 2
231 1
232 2
233 2
234 3
235 2
236 3
238 3
239 4
240 1
241 1
242 1
243 2
244 1
246 1
247 1
248 4
249 3
250 3
252 3
254 1
256 4
257 1
259 1
260 2
261 1
262 1
263 1
264 1
265 3
266 7
267 3
268 1
269 3
270 1
271 1
272 1
273 2
274 3
275 1
279 2
281 2
284 1
286 1
287 1
292 1
293 1
294 1
295 1
299 1
301 1
303 1
305 1
312 1
313 1
316 1
320 2
321 1
322 1
325 1
326 1
327 1
333 1
334 1
335 1
336 1
340 4
341 1
346 1
349 1
368 1
369 1
377 1
398 1
406 1
424 1
471 1
475 1
Source: db/duration.db
1 trip duration
2 occurrences
Code
transformData(conn = dbconn, path = "db/duration.db", select_cols = c("mins", "member_casual"),
    group_cols = c("mins", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "mins", label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/duration.db`"),
        isStub = TRUE, stub_label = "Time", stub_note = "trip duration (minutes)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 17: The membership distribution of trip durations
Time1
1 casual2 3 880
member 10832
2 casual 14808
member 94892
3 casual 39423
member 172790
4 casual 62621
member 219501
5 casual 71625
member 217335
6 casual 78289
member 213006
7 casual 76446
member 188378
8 casual 76284
member 176396
9 casual 69783
member 152200
10 casual 66354
member 139953
11 casual 58888
member 118017
12 casual 55308
member 106981
13 casual 48937
member 91026
14 casual 45360
member 82734
15 casual 39607
member 69647
16 casual 36901
member 63684
17 casual 32199
member 53616
18 casual 30249
member 49249
19 casual 27066
member 41972
20 casual 25171
member 38757
21 casual 22183
member 33001
22 casual 21114
member 30763
23 casual 18764
member 26671
24 casual 17793
member 24803
25 casual 15523
member 21518
26 casual 14652
member 20000
27 casual 13050
member 17556
28 casual 12451
member 16154
29 casual 10983
member 14171
30 casual 10291
member 13421
31 casual 9203
member 11940
32 casual 8770
member 10926
33 casual 8081
member 9727
34 casual 7357
member 9039
35 casual 6715
member 7891
36 casual 6414
member 7232
37 casual 5827
member 6392
38 casual 5600
member 6251
39 casual 5002
member 5431
40 casual 4787
member 5008
41 casual 4331
member 4352
42 casual 4117
member 4050
43 casual 3691
member 3378
44 casual 3628
member 3057
45 casual 3229
member 2613
46 casual 3166
member 2194
47 casual 2833
member 1756
48 casual 2795
member 1630
49 casual 2554
member 1382
50 casual 2470
member 1256
51 casual 2299
member 975
52 casual 2170
member 920
53 casual 1971
member 888
54 casual 2040
member 750
55 casual 1764
member 694
56 casual 1742
member 609
57 casual 1561
member 575
58 casual 1556
member 512
59 casual 1492
member 405
60 casual 1468
member 408
61 casual 1285
member 341
62 casual 1263
member 315
63 casual 1115
member 302
64 casual 1115
member 340
65 casual 983
member 273
66 casual 1022
member 214
67 casual 907
member 208
68 casual 954
member 213
69 casual 815
member 169
70 casual 789
member 172
71 casual 758
member 150
72 casual 730
member 142
73 casual 692
member 137
74 casual 692
member 143
75 casual 636
member 119
76 casual 635
member 110
77 casual 556
member 101
78 casual 586
member 103
79 casual 538
member 84
80 casual 531
member 101
81 casual 484
member 101
82 casual 480
member 86
83 casual 431
member 69
84 casual 413
member 66
85 casual 377
member 66
86 casual 407
member 71
87 casual 380
member 52
88 casual 324
member 58
89 casual 361
member 55
90 casual 331
member 57
91 casual 301
member 36
92 casual 302
member 52
93 casual 268
member 33
94 casual 268
member 41
95 casual 232
member 28
96 casual 237
member 36
97 casual 212
member 26
98 casual 238
member 28
99 casual 204
member 25
100 casual 189
member 28
101 casual 213
member 28
102 casual 212
member 33
103 casual 175
member 28
104 casual 195
member 20
105 casual 164
member 26
106 casual 170
member 20
107 casual 150
member 19
108 casual 140
member 27
109 casual 132
member 22
110 casual 129
member 19
111 casual 145
member 16
112 casual 143
member 18
113 casual 143
member 17
114 casual 113
member 16
115 casual 118
member 7
116 casual 132
member 15
117 casual 137
member 16
118 casual 110
member 10
119 casual 92
member 14
120 casual 88
member 8
121 casual 102
member 14
122 casual 76
member 9
123 casual 95
member 9
124 casual 79
member 7
125 casual 86
member 13
126 casual 108
member 11
127 casual 76
member 11
128 casual 78
member 6
129 casual 84
member 10
130 casual 68
member 6
131 casual 54
member 8
132 casual 85
member 5
133 casual 64
member 13
134 casual 56
member 8
135 casual 61
member 5
136 casual 57
member 10
137 casual 74
member 10
138 casual 68
member 4
139 casual 56
member 6
140 casual 56
member 4
141 casual 48
member 5
142 casual 33
member 3
143 casual 42
member 4
144 casual 35
member 2
145 casual 45
member 10
146 casual 41
member 3
147 casual 62
member 3
148 casual 39
member 5
149 casual 52
member 2
150 casual 37
member 2
151 casual 45
member 2
152 casual 41
member 2
153 casual 41
member 4
154 casual 34
member 3
155 casual 34
member 2
156 casual 35
member 7
157 casual 24
member 5
158 casual 27
member 6
159 casual 28
member 4
160 casual 29
member 4
161 casual 22
member 5
162 casual 28
member 9
163 casual 26
member 2
164 casual 23
member 3
165 casual 29
member 6
166 casual 28
member 3
167 casual 25
member 5
168 casual 33
member 5
169 casual 11
member 1
170 casual 25
member 1
171 casual 16
member 3
172 casual 22
member 3
173 casual 19
member 1
174 casual 17
member 3
175 casual 28
member 2
176 casual 19
member 2
177 casual 10
member 1
178 casual 16
member 2
179 casual 13
180 casual 16
member 3
181 casual 13
member 4
182 casual 11
member 2
183 casual 10
member 2
184 casual 8
member 3
185 casual 13
member 4
186 casual 8
member 1
187 casual 7
member 2
188 casual 9
member 1
189 casual 10
member 3
190 casual 10
member 3
191 casual 5
192 casual 13
member 1
193 casual 8
member 1
194 casual 9
member 2
195 casual 6
member 1
196 casual 5
197 casual 3
198 casual 7
member 1
199 casual 5
member 1
200 casual 8
201 casual 3
202 casual 6
member 1
203 casual 8
204 casual 4
member 1
205 casual 3
206 casual 4
207 casual 6
member 1
208 casual 8
member 1
209 casual 4
210 casual 1
211 casual 3
member 1
212 casual 4
member 1
213 casual 3
214 casual 2
member 1
215 casual 1
216 casual 5
217 casual 2
member 2
218 casual 1
219 casual 7
220 casual 2
221 casual 1
222 casual 3
223 casual 1
224 casual 2
225 casual 5
226 casual 3
227 casual 6
228 casual 2
229 casual 1
member 1
230 casual 2
231 casual 1
232 casual 2
233 casual 2
234 casual 2
member 1
235 casual 1
member 1
236 casual 2
member 1
238 casual 1
member 2
239 casual 2
member 2
240 casual 1
241 casual 1
242 casual 1
243 casual 2
244 casual 1
246 member 1
247 casual 1
248 casual 3
member 1
249 casual 3
250 casual 3
252 casual 3
254 casual 1
256 casual 3
member 1
257 member 1
259 casual 1
260 casual 2
261 casual 1
262 member 1
263 casual 1
264 casual 1
265 casual 3
266 casual 5
member 2
267 casual 2
member 1
268 member 1
269 casual 3
270 casual 1
271 member 1
272 casual 1
273 casual 2
274 casual 3
275 casual 1
279 casual 1
member 1
281 casual 2
284 casual 1
286 member 1
287 casual 1
292 casual 1
293 member 1
294 casual 1
295 casual 1
299 casual 1
301 casual 1
303 casual 1
305 casual 1
312 casual 1
313 casual 1
316 casual 1
320 casual 1
member 1
321 casual 1
322 member 1
325 member 1
326 casual 1
327 member 1
333 member 1
334 casual 1
335 casual 1
336 member 1
340 casual 1
member 3
341 member 1
346 casual 1
349 casual 1
368 casual 1
369 casual 1
377 member 1
398 casual 1
406 member 1
424 member 1
471 casual 1
475 casual 1
Source: db/duration.db
1 trip duration (minutes)
2 membership status
3 occurrences

Table 16 and Figure 5 give an idea of the overall travel duration distribution. The data were rounded to the nearest minute for these figures.

Code
gplot <- transformData(conn = dbconn, path = "db/duration.db", select_cols = "mins",
    group_cols = "mins", doWeights = TRUE) |>
    plotter(x_col = as.integer(mins), y_col = n, geomType = "column", title = paste0("How long are the trips?"),
        x_label = "Minutes", y_label = "n", color_col = "black") + ggplot2::scale_x_continuous(limits = c(0,
    60), breaks = seq(0, 60, by = 5), guide = ggplot2::guide_axis(n.dodge = 1, angle = 45))

gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
Figure 5: The overall distribution of trip durations

Likewise, Table 17 and Figure 6 summarize the travel duration distribution, where frequencies were summed by membership status.

Code
gplot <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(mins, member_casual) |>
    dplyr::filter(as.integer(mins) <= 100) |>
    dplyr::collect() |>
    transformData(conn = NULL, path = NULL, select_cols = c("mins", "member_casual"),
        group_cols = c("mins", "member_casual"), doWeights = TRUE, isDF = TRUE) |>
    plotter(title = paste0("How long do members ride for?"), x_label = "Minutes",
        y_label = "n", x_col = mins, y_col = n, group_col = member_casual, geomType = "column",
        is_colGroup = TRUE, colPosition = ggplot2::position_stack(reverse = TRUE),
        color_col = "black") + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45), limits = forcats::as_factor(seq(0, 60)), breaks = forcats::as_factor(seq(0,
    60, by = 5)))

gplot + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))
Figure 6: The membership distribution of trip durations

Table 18 sheds light on the variability, range, and quartile information about the duration data.

Code
gtTable <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(mins, member_casual) |>
    dplyr::mutate(mins = as.numeric(mins)) |>
    dplyr::collect() |>
    gtsummary::tbl_summary(
        by = member_casual,
        type = mins ~ "continuous2",
        label = list(mins ~ "Duration (mins)"),
        digits = list(mins ~ c(1, 1)),
        statistic = gtsummary::all_continuous() ~ c(
            "{median} ({p25}, {p75})",
            "{mean} ({sd})",
            "{min}, {max}"
        )
    ) |>
    gtsummary::italicize_levels() |>
    tabler(
        title = gt::md("Summary Stats:<br>Member travel duration"),
        source_note = gt::md("**Source**: `db/duration.db`"),
        isBinary = TRUE
    )

gtTable
Table 18: The mean with standard deviation, median with interquartile range, and min/max of member travel duration.
Summary Stats:
Member travel duration
Characteristic casual (N = 1,260,621) member (N = 2,636,777)
Duration (mins)

    Median (Q1, Q3) 12.0 (7.0, 20.0) 9.0 (5.0, 14.0)
    Mean (SD) 16.5 (15.8) 11.3 (9.2)
    Min, Max 1.0, 475.0 1.0, 424.0
Source: db/duration.db

To accompany the summary table, two plots, Figure 7 and Figure 8, present a highly granular view of the duration dataset (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.

Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.
qdf_member <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(trip_time, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/duration.db") |>
    dplyr::select(trip_time, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$trip_time, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$trip_time, probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel duration"),
        x_label = "Minutes", y_label = "n", x_col = trip_time, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", limits = c(0, 100), breaks = seq(0,
            100, by = 5), binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)), quartiles = quartiles_casual)

gplot
Figure 7: A more detailed look that illustrates the quartile distribution of casual travel duration.
Code
# Also tried a fixed binwidth of 2.5; keeping the Freedman-Diaconis rule below for now

gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' travel duration"),
        x_label = "Minutes", y_label = "n", x_col = trip_time, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", limits = c(0, 100), breaks = seq(0,
            100, by = 5), binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)), quartiles = quartiles_member)

gplot
Figure 8: A more detailed look that illustrates the quartile distribution of member travel duration.

Supplementing the histogram, Figure 9 shows a density plot comparing the duration of trips for casual users and members. The x-axis represents time in minutes, limited to a range of 0 to 100 for presentation purposes, while the y-axis shows the density (a measure of relative frequency).

The member group (darker blue) has a higher and narrower peak compared to the casual group (lighter blue). This indicates that members tend to have a more concentrated distribution of trip durations around the most common length. The casual group appears to have a slightly fatter tail, extending further to the right than the member group. This suggests that casual users might occasionally take longer trips than members, even if it’s less common.

Both groups exhibit right-skewed distributions, with a peak near the left side and a long tail extending to the right. This suggests that for both groups, a shorter duration is more common, while a longer duration occurs less frequently but can extend quite far. They seem to have their peak density around 5-10 minutes, with members peaking slightly earlier than casual users. There is significant overlap between the two distributions, indicating that while there are differences, there is also considerable similarity in duration patterns between casual users and members.

Code
gplot <- transformData(conn = dbconn, path = "db/duration.db", select_cols = c("trip_time",
    "member_casual")) |>
    plotter(title = "Visualizing relative frequency\n across travel duration groups",
        x_label = paste0("Minutes"), x_col = trip_time, group_col = member_casual,
        geomType = "column", angle = 45, color_col = "black", density_alpha = 0.75,
        isDensity = TRUE, is_colGroup = TRUE, breaks = seq(0, 100, by = 5), limits = c(0,
            100))

gplot
Figure 9: Visualizing the probability-density distribution between duration and membership variables.

Table 19 presents the results of a binary logistic regression model, analyzing the relationship between ride duration and membership status. The analysis divides ride duration into quartiles, with Q1 (1.02 - 5.73 minutes) serving as the reference category.

Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other duration quartiles ( p < 0.001 for all comparisons). With Q2 (5.73 - 9.55 minutes), the odds of membership were 0.38 times as high as with Q1. This indicates a substantial decrease (62%) in the likelihood of membership for slightly longer rides. With Q3 (9.55 - 16.13 minutes), the odds of membership were 0.08 times as high as in Q1. This shows a dramatic decrease (92%) in membership likelihood for medium-length rides. With Q4 (16.13 - 475.22 minutes), the odds of membership were 0.06 times as high as in Q1. This represents an even more pronounced decrease (94%) in membership likelihood for the longest rides.

Query, process, and create the model R object for the duration quartile ranges.
model <- dplyr::tbl(dbconn, "db/duration_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Duration Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel durations"),
        source_note = gt::md("**Source**: `db/duration_wq.db`"), isBinary = TRUE)
Table 19: Modeling the probability of members’ travel by the duration.
Binary Logistic Regression:
Modeling the likelihood of members’ travel durations
Characteristic OR1 p-value
Duration Ranges

    Q1 (1.02 - 5.73]
    Q2 (5.73 - 9.55] 0.38 <0.001
    Q3 (9.55 - 16.13] 0.08 <0.001
    Q4 (16.13 - 475.22] 0.06 <0.001
Source: db/duration_wq.db
1 OR = Odds Ratio

3.5 Month

Database Operations

Table Preview

DB Operations

Write moy.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(member_casual = factor(member_casual, levels = c("casual",
            "member")), abbMonths = lubridate::month(started_at, label = TRUE, abbr = TRUE),
            abbMonths = forcats::as_factor(abbMonths)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/moy.db", overwrite = TRUE)
}
Query, transform, and write weighted quartile data.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/moy_wq.db"))) {
    transformData(conn = dbconn, path = "db/moy.db", select_cols = c("started_at",
        "member_casual"), group_cols = c("started_at", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (Jan 01 - May 20]", "Q2 (May 20 - Jul 21]", "Q3 (Jul 21 - Sep 18]",
            "Q4 (Sep 18 - Dec 31]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/moy_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/moy_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 20: Kable output for months of the year
started_at member_casual abbMonths
2023-01-01 00:04:07 casual Jan
2023-01-01 00:04:54 member Jan
2023-01-01 00:05:43 member Jan
2023-01-01 00:09:33 member Jan
2023-01-01 00:09:53 member Jan
2023-01-01 00:10:45 member Jan
Table 21: Kable output for weighted months of the year.
started_at member_casual n quartile
2023-01-01 00:04:07 casual 1 Q1 (Jan 01 - May 20]
2023-01-01 00:04:54 member 1 Q1 (Jan 01 - May 20]
2023-01-01 00:05:43 member 1 Q1 (Jan 01 - May 20]
2023-01-01 00:09:33 member 1 Q1 (Jan 01 - May 20]
2023-01-01 00:09:53 member 1 Q1 (Jan 01 - May 20]
2023-01-01 00:10:45 member 1 Q1 (Jan 01 - May 20]
Code
transformData(conn = dbconn, path = "db/moy.db", select_cols = "abbMonths", group_cols = "abbMonths",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("month of year", "occurrences"), location_list = c("abbMonths",
        "n"), source_note = gt::md("**Source**: `db/moy.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)
Table 22: The overall distribution of trips taken by month
Jan1 2 136886
Feb 136818
Mar 183129
Apr 286343
May 408373
Jun 474756
Jul 502519
Aug 519423
Sep 461164
Oct 374268
Nov 257513
Dec 156206
Source: db/moy.db
1 month of year
2 occurrences
Code
transformData(conn = dbconn, path = "db/moy.db", select_cols = c("abbMonths", "member_casual"),
    group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "abbMonths", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/moy.db`"),
        isStub = TRUE, stub_label = "Month", stub_note = "months of the year (abbreviated)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 23: The membership distribution of trips taken by month
Month1
Jan casual2 3 25262
member 111624
Feb casual 27111
member 109707
Mar casual 39137
member 143992
Apr casual 87211
member 199132
May casual 139998
member 268375
Jun casual 177745
member 297011
Jul casual 194030
member 308489
Aug casual 189563
member 329860
Sep casual 168881
member 292283
Oct casual 114475
member 259793
Nov casual 64508
member 193005
Dec casual 32700
member 123506
Source: db/moy.db
1 months of the year (abbreviated)
2 membership status
3 occurrences

Table 22 and Figure 10 show the overall monthly travel distribution.

Code
gplot <- transformData(conn = dbconn, path = "db/moy.db", select_cols = "abbMonths",
    group_cols = "abbMonths", doWeights = TRUE) |>
    plotter(x_col = abbMonths, y_col = n, geomType = "column", title = paste0("Which months do people tend to ride?"),
        x_label = paste0("Months", "\n", "(2023)"), y_label = "n")

gplot
Figure 10: The overall distribution of trips taken by month

Similarly, Table 23 and Figure 11 summarize the monthly travel, where frequencies were summed by membership status.

Code
gplot <- transformData(conn = dbconn, path = "db/moy.db", select_cols = c("abbMonths",
    "member_casual"), group_cols = c("abbMonths", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which months do members tend to travel?"), x_label = "Months",
        y_label = "n", x_col = abbMonths, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE)

gplot
Figure 11: The membership distribution of trips taken by month

Table 24 presents a Chi-Square analysis of monthly travel among members and casual users. In summary, members exhibit proportionally higher usage during Jan - Mar and Oct - Dec. Both groups favor the months of Apr - Sep, where casual users show the higher proportional interest within their respective group.

There is a statistically significant association between monthly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between monthly travel and membership status. The frequency of monthly travel is not likely to be independent of membership status.

The large \(\chi^2\) value (71802.26) with 11 degrees of freedom (calculated as \([\text{rows} - 1] \times [\text{columns} - 1]\)) results in the very small p-value. This combination strongly suggests that the difference in monthly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(abbMonths, member_casual) |>
    dplyr::arrange(abbMonths, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "abbMonths", by = "member_casual")
Code
chi_table <- tabler(title = gt::md("Chi-Square:<br>The significance of monthly travel on membership"),
    source_note = gt::md("**Source**: `db/moy.db`"), label = list(abbMonths = "Month"),
    by = member_casual, isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)

chi_table
Table 24: Testing the independence between month and membership variables
Chi-Square:
The significance of monthly travel on membership
Characteristic casual (N = 1,260,621)1 member (N = 2,636,777)1 p-value2
Month

<0.001
    Jan 25,262 (2.0%) 111,624 (4.2%)
    Feb 27,111 (2.2%) 109,707 (4.2%)
    Mar 39,137 (3.1%) 143,992 (5.5%)
    Apr 87,211 (6.9%) 199,132 (7.6%)
    May 139,998 (11%) 268,375 (10%)
    Jun 177,745 (14%) 297,011 (11%)
    Jul 194,030 (15%) 308,489 (12%)
    Aug 189,563 (15%) 329,860 (13%)
    Sep 168,881 (13%) 292,283 (11%)
    Oct 114,475 (9.1%) 259,793 (9.9%)
    Nov 64,508 (5.1%) 193,005 (7.3%)
    Dec 32,700 (2.6%) 123,506 (4.7%)
Source: db/moy.db
1 n (%); χ² = 71802.26; df = 11
2 Pearson’s Chi-squared test

Two plots, Figure 12 and Figure 13, present a highly granular view of the monthly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help understand the variability.

Create a data frame, then extract the desired quartile info to supplement histogram visualization for the data.
qdf_member <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' monthly travel"),
        x_label = "Months", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 month", date_labels = "%b",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_casual, qformat = "%b-%d")

gplot
Figure 12: A more detailed look that illustrates the quartile distribution of monthly casual travel.
Code
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' monthly travel"),
        x_label = "Months", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 month", date_labels = "%b",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_member, qformat = "%b-%d")

gplot
Figure 13: A more detailed look that illustrates the quartile distribution of monthly member travel.

Figure 14 visualizes monthly travel through each group's relative concentration. The plot looks a little different because the date-time data were used directly on the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the middle of the distribution. The member group has higher peaks near the left and right sides of the distribution. Both of the groups exhibit somewhat normal, multi-modal distributions. They overlap each other significantly, suggesting similarity in travel patterns.

Code
gplot <- dplyr::tbl(dbconn, "db/moy.db") |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross monthly travel groups",
        x_label = paste0("Months"), x_col = started_at, group_col = member_casual,
        geomType = "other", angle = 45, color_col = "black", density_alpha = 0.75,
        isTime = TRUE, date_breaks = "1 month", date_labels = "%b")

gplot
Figure 14: Visualizing the probability-density distribution of months by membership.

Table 25 presents the results of a binary logistic regression analyzing the relationship between months of the year and membership status. The analysis divides the year into quartiles, with Q1 (January 01 - May 20) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).

With Q2 (May 20 - Jul 21), the odds of membership were 0.57 times as high as in Q1. This indicates a substantial decrease (43%) in the likelihood of membership during late spring and early summer. With Q3 (Jul 21 - Sep 18), the odds of membership were 0.58 times as high as in Q1. This shows a similar decrease (42%) in membership likelihood during late summer and early fall, nearly identical to Q2. And with Q4 (Sep 18 - Dec 31), the odds of membership were 0.87 times as high as in Q1. While still lower than Q1, this represents a less pronounced decrease (13%) in membership likelihood during fall and early winter.

Query, process, and create the model R object for the month quartile ranges.
model <- dplyr::tbl(dbconn, "db/moy_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Month Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of monthly member travel"),
        source_note = gt::md("**Source**: `db/moy_wq.db`"), isBinary = TRUE)
Table 25: Modeling the probability of monthly member travel.
Binary Logistic Regression:
Modeling the likelihood of monthly member travel
Characteristic OR1 p-value
Month Ranges

    Q1 (Jan 01 - May 20]
    Q2 (May 20 - Jul 21] 0.57 <0.001
    Q3 (Jul 21 - Sep 18] 0.58 <0.001
    Q4 (Sep 18 - Dec 31] 0.87 <0.001
Source: db/moy_wq.db
1 OR = Odds Ratio

3.6 Day

Database Operations

Table Preview

DB Operations

Write dow.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(wkdays = lubridate::wday(started_at, week_start = 7), member_casual = factor(member_casual,
            levels = c("casual", "member")), started_at = update(started_at, year = 2024,
            month = 9, day = wkdays), abbDays = lubridate::wday(started_at, label = TRUE,
            abbr = TRUE), abbDays = forcats::as_factor(abbDays)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/dow.db", overwrite = TRUE)
}
Query, transform, and write weighted quartile data.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/dow_wq.db"))) {
    transformData(conn = dbconn, path = "db/dow.db", select_cols = c("started_at",
        "member_casual"), group_cols = c("started_at", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (Sun 12:00 am - Mon 11:40 am]", "Q2 (Mon 11:40 am - Wed 05:14 am]",
            "Q3 (Wed 05:14 am - Fri 12:19 pm]", "Q4 (Fri 12:19 pm - Sat 11:59 pm]"),
        doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/dow_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 26: Days of the week
started_at member_casual wkdays abbDays
2024-09-01 00:04:07 casual 1 Sun
2024-09-01 00:04:54 member 1 Sun
2024-09-01 00:05:43 member 1 Sun
2024-09-01 00:09:33 member 1 Sun
2024-09-01 00:09:53 member 1 Sun
2024-09-01 00:10:45 member 1 Sun
Code
transformData(conn = dbconn, path = "db/dow.db", select_cols = "abbDays", group_cols = "abbDays",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("day of the week", "occurrences"), location_list = list("abbDays",
        "n"), source_note = gt::md("**Source**: `db/dow.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)
Table 27: The overall distribution of trips taken by day of the week
Sun1 2 486275
Mon 507355
Tue 578095
Wed 585266
Thu 597363
Fri 566008
Sat 577036
Source: db/dow.db
1 day of the week
2 occurrences
Code
transformData(conn = dbconn, path = "db/dow.db", select_cols = c("abbDays", "member_casual"),
    group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "abbDays", label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/dow.db`"),
        isStub = TRUE, stub_label = "Day", stub_note = "days of the week (abbreviated)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 28: The membership distribution of trips taken by day of the week
Day1
Sun casual2 3 200588
member 285687
Mon casual 142860
member 364495
Tue casual 152604
member 425491
Wed casual 155978
member 429288
Thu casual 169117
member 428246
Fri casual 189429
member 376579
Sat casual 250045
member 326991
Source: db/dow.db
1 days of the week (abbreviated)
2 membership status
3 occurrences

Table 27 and Figure 15 show the aggregated distribution of overall daily travel.

Code
# Values were too similar to visualize differences, see coord_cartesian()
gplot <- transformData(conn = dbconn, path = "db/dow.db", select_cols = "abbDays",
    group_cols = "abbDays", doWeights = TRUE) |>
    plotter(x_col = abbDays, y_col = n, geomType = "column", title = paste0("Which days\ndo people tend to ride?"),
        x_label = "Days of the Week", y_label = "n") + ggplot2::coord_cartesian(ylim = c(4.5 *
    10^5, NA))

gplot
Figure 15: The overall distribution of trips taken by day of the week

Similarly, Table 28 and Figure 16 summarize the daily distribution, grouping counts by membership status.

Code
gplot <- transformData(conn = dbconn, path = "db/dow.db", select_cols = c("abbDays",
    "member_casual"), group_cols = c("abbDays", "member_casual"), doWeights = TRUE) |>
    plotter(title = "Day Groups", x_label = "Days", y_label = "n", x_col = abbDays,
        y_col = n, group_col = member_casual, geomType = "column", isFaceted = TRUE,
        is_colGroup = TRUE)

gplot
Figure 16: The membership distribution of trips taken by day of the week

Table 29 presents a Chi-Square test of daily travel among members and casual users. Members show higher usage Mon - Thu compared to casual users, and their pattern is more regular. Casual users show higher usage Fri - Sun compared to members, with their preferences falling on the weekend.

There is a statistically significant association between daily travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between daily travel and membership status. The frequency of daily travel is not likely to be independent of membership status.

The large \(\chi^2\) value (76305.71) with 6 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in daily preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
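For readers who want to verify those mechanics, the same statistic can be reproduced with base R's chisq.test() on the counts from Table 28 (a sketch; the report itself uses the custom chisqTest() wrapper):
counts <- matrix(c(200588, 285687,   # Sun
                   142860, 364495,   # Mon
                   152604, 425491,   # Tue
                   155978, 429288,   # Wed
                   169117, 428246,   # Thu
                   189429, 376579,   # Fri
                   250045, 326991),  # Sat
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
                                 c("casual", "member")))

chisq.test(counts)$statistic  # X-squared, approx. 76305.71
chisq.test(counts)$parameter  # df = (7 - 1) * (2 - 1) = 6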

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(abbDays, member_casual) |>
    dplyr::arrange(abbDays, member_casual) |>
    dplyr::collect()

chiResult <- chisqTest(data = data_tibble, variable = "abbDays", by = "member_casual")
Code
tabler(title = gt::md("Chi-Square:<br>The significance of daily travel on membership"),
    source_note = gt::md("**Source**: `db/dow.db`"), label = list(abbDays = "Day"),
    by = member_casual, isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)
Table 29: Testing the independence between day and membership variables
Chi-Square:
The significance of daily travel on membership
Characteristic casual
N = 1,260,621
1
member
N = 2,636,777
1
p-value2
Day

<0.001
    Sun 200,588 (16%) 285,687 (11%)
    Mon 142,860 (11%) 364,495 (14%)
    Tue 152,604 (12%) 425,491 (16%)
    Wed 155,978 (12%) 429,288 (16%)
    Thu 169,117 (13%) 428,246 (16%)
    Fri 189,429 (15%) 376,579 (14%)
    Sat 250,045 (20%) 326,991 (12%)
Source: db/dow.db
1 n (%); χ² = 76305.71; df = 6
2 Pearson’s Chi-squared test

Two plots, Figure 17 and Figure 18, present a highly granular view of the daily data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.
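The binwidth function passed to plotter() in these histogram blocks is the Freedman-Diaconis rule, which scales bin width to the spread and size of the data. As a standalone sketch (the simulated input is purely illustrative):
# Freedman-Diaconis rule: bin width = 2 * IQR(x) / n^(1/3).
# Wider spread widens the bins; more observations narrow them.
fd_binwidth <- function(x) 2 * IQR(x)/(length(x)^(1/3))

set.seed(1)
fd_binwidth(rnorm(10000))  # about 0.125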

Create the data frames, then extract the desired quartile info to supplement the histogram visualizations.
qdf_member <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at, probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' daily travel"),
        x_label = "Days", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 day", date_labels = "%a",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_casual, qformat = "%a %I %p")

gplot
Figure 17: A more detailed look that illustrates the quartile distribution of daily casual travel.
Code
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' daily travel"),
        x_label = "Days", y_label = "n", x_col = started_at, geomType = "column",
        isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "1 day", date_labels = "%a",
        angle = 45, color_col = "black", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        quartiles = quartiles_member, qformat = "%a %I %p")

gplot
Figure 18: A more detailed look that illustrates the quartile distribution of daily member travel.

To visualize daily users through the lens of their respective concentrations, see Figure 19. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The casual group has higher peaks near the left and right sides of the distribution. Their highest peak appears to be on Saturday. The member group has higher peaks near the middle of the distribution. They also show less day-to-day variability during weekdays compared to casual users.

On weekdays, both groups show a bimodal distribution, with two peaks each day. On weekends, the distribution is unimodal. The two groups overlap significantly, suggesting similar travel patterns.

Code
gplot <- dplyr::tbl(dbconn, "db/dow.db") |>
    dplyr::select(started_at, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross daily travel groups",
        x_label = paste0("Day"), x_col = started_at, group_col = member_casual, geomType = "other",
        angle = 45, color_col = "black", density_alpha = 0.75, isTime = TRUE, date_breaks = "1 day",
        date_labels = "%a")

gplot
Figure 19: Visualizing the probability-density distribution of the day by membership variables.

Table 30 presents the results of a binary logistic regression analyzing the relationship between days of the week and membership status. The analysis divides the week into quartiles, with Q1 (Sunday 12:00 am - Monday 11:40 am) serving as the reference category. Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons).

With Q2 (Monday 11:40 am - Wednesday 05:14 am), the odds of membership were 1.52 times higher than in Q1. This suggests a substantial increase in the likelihood of members riding during the early part of the work week.

With Q3 (Wednesday 05:14 am - Friday 12:19 pm), the odds of membership were 1.36 times higher than in Q1. This indicates a continued higher likelihood of membership during the latter part of the work week, though slightly lower than Q2. With Q4 (Friday 12:19 pm - Saturday 11:59 pm), the odds of membership were 0.80 times as high as in Q1. This represents a significant decrease in the likelihood of membership during the weekend period.

Query, process, and create a model object for day of the week based on quartile range.
model <- dplyr::tbl(dbconn, "db/dow_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
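Because the model is fit on aggregated counts (weights = n), each row contributes n Bernoulli observations. The odds ratios reported in Table 30 are simply the exponentiated model coefficients, equivalent to:
exp(coef(model))  # what gtsummary's exponentiate = TRUE displays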
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Weekday Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of daily member travel"),
        source_note = gt::md("**Source**: `db/dow_wq.db`"), isBinary = TRUE)
Table 30: Modeling the probability of daily member travel.
Binary Logistic Regression:
Modeling the likelihood of daily member travel
Characteristic OR1 p-value
Weekday Ranges

    Q1 (Sun 12:00 am - Mon 11:40 am]
    Q2 (Mon 11:40 am - Wed 05:14 am] 1.52 <0.001
    Q3 (Wed 05:14 am - Fri 12:19 pm] 1.36 <0.001
    Q4 (Fri 12:19 pm - Sat 11:59 pm] 0.80 <0.001
Source: db/dow_wq.db
1 OR = Odds Ratio

3.7 Hour

Database Operations

Table Preview

DB Operations

Write hod.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(started_at, member_casual) |>
        dplyr::arrange(started_at) |>
        dplyr::collect() |>
        dplyr::mutate(started_at_time = update(started_at, year = 2023, month = 1,
            day = 1), hr = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "hour"), "%I %p")), hrMin = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "minute"), "%I:%M %p")), hrminSec = stringr::str_to_lower(format(lubridate::round_date(started_at,
            unit = "second"), "%r")), hr = forcats::as_factor(hr), hrMin = forcats::as_factor(hrMin)) |>
        dplyr::select(member_casual:hrminSec) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod.db", overwrite = TRUE)
}
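As a quick sanity check, here is what the three derived time columns look like for a single timestamp (a sketch, assuming lubridate and stringr are available):
x <- as.POSIXct("2023-06-15 17:42:31", tz = "UTC")

stringr::str_to_lower(format(lubridate::round_date(x, unit = "hour"), "%I %p"))      # "06 pm"
stringr::str_to_lower(format(lubridate::round_date(x, unit = "minute"), "%I:%M %p")) # "05:43 pm"
stringr::str_to_lower(format(lubridate::round_date(x, unit = "second"), "%r"))       # "05:42:31 pm"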
Query, transform, and write weighted quartile data to hod_wq.db.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/hod_wq.db"))) {
    transformData(conn = dbconn, path = "db/hod.db", select_cols = c("started_at_time",
        "member_casual"), group_cols = c("started_at_time", "member_casual"), binary_col = "member_casual",
        pred_col = "started_at_time", ntile_col = "quartile", zero_val = "casual",
        one_val = "member", qtile_levels = c("Q1 (12:00 am - 10:59 am]", "Q2 (10:59 am - 03:24 pm]",
            "Q3 (03:24 pm - 06:05 pm]", "Q4 (06:05 pm - 11:59 pm]"), doQuantile = TRUE,
        doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/hod_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/hod_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 31: Kable output of hod.db
member_casual started_at_time hr hrMin hrminSec
casual 2023-01-01 00:04:07 12 am 12:04 am 12:04:07 am
member 2023-01-01 00:04:54 12 am 12:05 am 12:04:54 am
member 2023-01-01 00:05:43 12 am 12:06 am 12:05:43 am
member 2023-01-01 00:09:33 12 am 12:10 am 12:09:33 am
member 2023-01-01 00:09:53 12 am 12:10 am 12:09:53 am
member 2023-01-01 00:10:45 12 am 12:11 am 12:10:45 am
Table 32: Kable output of hod_wq.db
started_at_time member_casual n quartile
2023-01-01 00:00:00 casual 6 Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:00 member 6 Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:01 casual 2 Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:01 member 10 Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:02 casual 14 Q1 (12:00 am - 10:59 am]
2023-01-01 00:00:02 member 6 Q1 (12:00 am - 10:59 am]
Code
transformData(conn = dbconn, path = "db/hod.db", select_cols = "hr", group_cols = "hr",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("hour of the day", "occurrences"), location_list = list("hr",
        "n"), source_note = gt::md("**Source**: `db/hod.db`"), noteRows = TRUE, hide_column_labels = TRUE,
        row_index = 1)
Table 33: The overall distribution of trips taken by hour of the day
12 am1 2 52336
01 am 31598
02 am 20378
03 am 10309
04 am 7179
05 am 14866
06 am 61027
07 am 141249
08 am 227964
09 am 199884
10 am 155951
11 am 172204
12 pm 213544
01 pm 223804
02 pm 224467
03 pm 246680
04 pm 313374
05 pm 408606
06 pm 371909
07 pm 277477
08 pm 189106
09 pm 139743
10 pm 112692
11 pm 81051
Source: db/hod.db
1 hour of the day
2 occurrences
Code
transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr", "member_casual"),
    group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "hr", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/hod.db`"),
        isStub = TRUE, stub_label = "Hour", stub_note = "hour of the day (12-hour clock)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 34: The membership distribution of trips taken by hour of the day
Hour1
12 am casual2 3 24468
member 27868
01 am casual 15391
member 16207
02 am casual 10859
member 9519
03 am casual 5053
member 5256
04 am casual 3164
member 4015
05 am casual 3918
member 10948
06 am casual 12330
member 48697
07 am casual 27046
member 114203
08 am casual 43590
member 184374
09 am casual 45413
member 154471
10 am casual 47734
member 108217
11 am casual 59590
member 112614
12 pm casual 75634
member 137910
01 pm casual 81646
member 142158
02 pm casual 83794
member 140673
03 pm casual 90150
member 156530
04 pm casual 105809
member 207565
05 pm casual 124466
member 284140
06 pm casual 118031
member 253878
07 pm casual 91667
member 185810
08 pm casual 63696
member 125410
09 pm casual 48455
member 91288
10 pm casual 44013
member 68679
11 pm casual 34704
member 46347
Source: db/hod.db
1 hour of the day (12-hour clock)
2 membership status
3 occurrences

Table 33 and Figure 20 give the aggregated distribution of overall hourly travel.

Code
gplot <- transformData(conn = dbconn, path = "db/hod.db", select_cols = "hr", group_cols = "hr",
    doWeights = TRUE) |>
    plotter(x_col = hr, y_col = n, geomType = "column", title = paste0("Which hours do people tend to ride?"),
        x_label = "Hour", y_label = "n", ) + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45)) + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

gplot
Figure 20: The overall distribution of trips taken by hour of the day

Similarly, Table 34 and Figure 21 summarize the distribution, grouping the sums by membership.

Code
gplot <- transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr", "member_casual"),
    group_cols = c("hr", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("Which hours of the day do members ride?"), x_label = "Hour of Day",
        y_label = "n", x_col = hr, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE) + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1,
    angle = 45)) + ggplot2::theme(axis.text = ggplot2::element_text(size = 8))

gplot
Figure 21: The membership distribution of trips taken by hour of the day

Table 35 presents a Chi-Square test of hourly travel among members and casual users. In summary:

Compared to casual users, members show relatively higher service usage at intervals [5:30am - 9:30am] and [4:30pm - 6:30pm]. Casual users show relatively higher rates at intervals [10:30am - 4:30pm] and [6:30pm - 5:30am].

There is a statistically significant association between hourly travel and membership status (p < 0.001). The very low p-value indicates strong evidence against the null hypothesis of no association between hourly travel and membership status. The frequency of hourly travel is not likely to be independent of membership status.

The large \(\chi^2\) value (72733.38) with 23 degrees of freedom (calculated as [rows - 1] * [columns - 1]) results in the very small p-value. This combination strongly suggests that the difference in hourly preferences between casual users and members is not due to random chance. However, with such a large sample size (nearly 4 million total trips), even small differences can produce statistically significant results.
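One way to put that caveat on a common scale is an effect-size measure such as Cramér's V, which normalizes the chi-square statistic by the sample size. A quick sketch using the values from Table 35:
chi2 <- 72733.38
n <- 1260621 + 2636777     # total trips
k <- min(24, 2)            # the smaller of (rows, columns)
sqrt(chi2/(n * (k - 1)))   # about 0.14 -- a modest association despite the tiny p-value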

Save the chi-square statistic and degrees of freedom values in a tibble format to add to the gtsummary table.
data_tibble <- transformData(conn = dbconn, path = "db/hod.db", select_cols = c("hr",
    "member_casual"))

chiResult <- chisqTest(data = data_tibble, variable = "hr", by = "member_casual")
Code
tabler(title = gt::md("Chi-Square:<br>The significance of hourly travel on membership"),
    source_note = gt::md("**Source**: `db/hod.db`"), label = list(hr = "Hour"), by = member_casual,
    isSummary = TRUE, tbl_name = data_tibble, chi_result = chiResult)
Table 35: Testing the independence between hour and membership variables
Chi-Square:
The significance of hourly travel on membership
Characteristic casual
N = 1,260,621
1
member
N = 2,636,777
1
p-value2
Hour

<0.001
    12 am 24,468 (1.9%) 27,868 (1.1%)
    01 am 15,391 (1.2%) 16,207 (0.6%)
    02 am 10,859 (0.9%) 9,519 (0.4%)
    03 am 5,053 (0.4%) 5,256 (0.2%)
    04 am 3,164 (0.3%) 4,015 (0.2%)
    05 am 3,918 (0.3%) 10,948 (0.4%)
    06 am 12,330 (1.0%) 48,697 (1.8%)
    07 am 27,046 (2.1%) 114,203 (4.3%)
    08 am 43,590 (3.5%) 184,374 (7.0%)
    09 am 45,413 (3.6%) 154,471 (5.9%)
    10 am 47,734 (3.8%) 108,217 (4.1%)
    11 am 59,590 (4.7%) 112,614 (4.3%)
    12 pm 75,634 (6.0%) 137,910 (5.2%)
    01 pm 81,646 (6.5%) 142,158 (5.4%)
    02 pm 83,794 (6.6%) 140,673 (5.3%)
    03 pm 90,150 (7.2%) 156,530 (5.9%)
    04 pm 105,809 (8.4%) 207,565 (7.9%)
    05 pm 124,466 (9.9%) 284,140 (11%)
    06 pm 118,031 (9.4%) 253,878 (9.6%)
    07 pm 91,667 (7.3%) 185,810 (7.0%)
    08 pm 63,696 (5.1%) 125,410 (4.8%)
    09 pm 48,455 (3.8%) 91,288 (3.5%)
    10 pm 44,013 (3.5%) 68,679 (2.6%)
    11 pm 34,704 (2.8%) 46,347 (1.8%)
Source: db/hod.db
1 n (%); χ² = 72733.38; df = 23
2 Pearson’s Chi-squared test

Two plots, Figure 22 and Figure 23, present a highly granular view of the hourly data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Create the data frames, then extract the desired quartile info to supplement the histogram visualizations.
qdf_member <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::select(started_at_time, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::select(started_at_time, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$started_at_time, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$started_at_time, probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' hourly travel"),
        x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n", x_col = started_at_time,
        geomType = "column", isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "2 hour",
        date_labels = "%I %p", angle = 45, color_col = "black", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), quartiles = quartiles_casual, qformat = "%I:%M %p")

gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(0.9)))
Figure 22: A more detailed look that illustrates the quartile distribution of hourly casual travel.
Code
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' hourly travel"),
        x_label = paste0("Hours\n", "(12-hour clock)"), y_label = "n", x_col = started_at_time,
        geomType = "column", isHistogram = TRUE, isTimeHist = TRUE, date_breaks = "2 hour",
        date_labels = "%I %p", angle = 45, color_col = "black", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), quartiles = quartiles_member, qformat = "%I:%M %p")

gplot + ggplot2::theme(axis.text.x = ggplot2::element_text(size = ggplot2::rel(0.9)))
Figure 23: A more detailed look that illustrates the quartile distribution of hourly member travel.

To visualize hourly users through the lens of their respective concentrations, see Figure 24. The plot looks a little different because date-time data was used directly when plotting the x-axis. The y-axis shows the density (a measure of relative frequency).

The member group parallels the bimodal patterns seen in the daily density plot. The casual group does likewise, though to a lesser extent. The two groups largely overlap, with the highest densities for both falling around 5:00pm.

Code
gplot <- dplyr::tbl(dbconn, "db/hod.db") |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross hourly travel groups",
        x_label = paste0("Hours", "\n", "(12-hour clock)"), x_col = started_at_time,
        group_col = member_casual, geomType = "other", angle = 45, color_col = "black",
        density_alpha = 0.75, isTime = TRUE, date_breaks = "1 hour", date_labels = "%I %p")

gplot
Figure 24: Visualizing the probability-density distribution of the hour by membership.

Table 36 presents the results of a binary logistic regression analyzing the relationship between hour of the day and membership status. The analysis divides the day into quartiles, with Q1 (12:00 am - 10:59 am) serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider varied significantly across the other time quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 1.44 times as high in Q2 (10:59 am - 03:24 pm), 1.04 times as high in Q3 (03:24 pm - 06:05 pm), and 0.97 times as high in Q4 (06:05 pm - 11:59 pm).

These results reveal a non-linear relationship between time of day and membership status. The highest likelihood of membership occurs during Q2, corresponding to midday hours. There is a slight increase in membership likelihood during Q3 (late afternoon) compared to the reference period, while evening hours (Q4) show a slight decrease in membership likelihood.

Query hod_wq.db, process, and create the model object for hour based on quartile range.
model <- dplyr::tbl(dbconn, "db/hod_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Hour Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of hourly member travel"),
        source_note = gt::md("**Source**: `db/hod_wq.db`"), isBinary = TRUE)
Table 36: Modeling the probability of hourly member travel.
Binary Logistic Regression:
Modeling the likelihood of hourly member travel
Characteristic OR1 p-value
Hour Ranges

    Q1 (12:00 am - 10:59 am]
    Q2 (10:59 am - 03:24 pm] 1.44 <0.001
    Q3 (03:24 pm - 06:05 pm] 1.04 <0.001
    Q4 (06:05 pm - 11:59 pm] 0.97 <0.001
Source: db/hod_wq.db
1 OR = Odds Ratio

3.8 Distance

Database Operations

Table Preview

DB Operations

Write distance.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(miles, member_casual) |>
        dplyr::arrange(miles) |>
        dplyr::collect() |>
        dplyr::mutate(milesR = miles, milesR = dplyr::case_when(milesR >= 1 ~ round(milesR,
            digits = 0), miles < 1 ~ round(signif(milesR, 3), digits = 1)), milesR = forcats::as_factor(milesR)) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/distance.db", overwrite = TRUE)
}
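The case_when() above rounds distances of a mile or more to whole miles and shorter distances to one decimal place. A quick check of that logic on a few hypothetical values:
miles <- c(0.1234, 0.987, 1.4, 2.6)
dplyr::case_when(miles >= 1 ~ round(miles, digits = 0),
                 miles < 1 ~ round(signif(miles, 3), digits = 1))
#> 0.1 1.0 1.0 3.0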
Query, transform, and write weighted quartile data.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/distance_wq.db"))) {
    transformData(conn = dbconn, path = "db/distance.db", select_cols = c("miles",
        "member_casual"), group_cols = c("miles", "member_casual"), binary_col = "member_casual",
        pred_col = "miles", ntile_col = "quartile", zero_val = "casual", one_val = "member",
        qtile_levels = c("Q1 (0.10 - 0.63]", "Q2 (0.63 - 1.02]", "Q3 (1.02 - 1.76]",
            "Q4 (1.76 - 20.5]"), doQuantile = TRUE, doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/distance_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/distance_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 37: Distance
miles member_casual milesR
0.1000100 member 0.1
0.1000367 casual 0.1
0.1000491 member 0.1
0.1003780 casual 0.1
0.1004468 member 0.1
0.1005323 member 0.1
Table 38: Distance, Weighted Quantiles
miles member_casual n quartile
0.1000100 member 1 Q1 (0.10 - 0.63]
0.1000367 casual 1 Q1 (0.10 - 0.63]
0.1000491 member 1 Q1 (0.10 - 0.63]
0.1003780 casual 1 Q1 (0.10 - 0.63]
0.1004468 member 1 Q1 (0.10 - 0.63]
0.1005323 member 1 Q1 (0.10 - 0.63]
Code
transformData(conn = dbconn, path = "db/distance.db", select_cols = "milesR", group_cols = "milesR",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("distance (miles)", "occurrences"), location_list = list("milesR",
        "n"), source_note = gt::md("**Source**: `db/distance.db`"), noteRows = TRUE,
        hide_column_labels = TRUE, row_index = 1, label_parameter = "milesR", align_parameter = "right")
Table 39: The overall distribution of trips taken by the estimated distance traveled
0.11 2 11429
0.2 70621
0.3 178150
0.4 225723
0.5 280367
0.6 288164
0.7 250369
0.8 262418
0.9 227646
1 867678
2 720842
3 292483
4 121920
5 55748
6 25499
7 11924
8 3275
9 1520
10 686
11 415
12 256
13 139
14 67
15 24
16 31
17 1
18 1
19 1
21 1
Source: db/distance.db
1 distance (miles)
2 occurrences
Code
transformData(conn = dbconn, path = "db/distance.db", select_cols = c("milesR", "member_casual"),
    group_cols = c("milesR", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "milesR", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/distance.db`"),
        isStub = TRUE, stub_label = "Distance", stub_note = "miles traveled per trip",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 40: The membership distribution of trips taken by the estimated distance traveled
Distance1
0.1 casual2 3 1410
member 10019
0.2 casual 11262
member 59359
0.3 casual 34957
member 143193
0.4 casual 54395
member 171328
0.5 casual 75269
member 205098
0.6 casual 83008
member 205156
0.7 casual 76686
member 173683
0.8 casual 91896
member 170522
0.9 casual 77635
member 150011
1 casual 316382
member 551296
2 casual 261540
member 459302
3 casual 99452
member 193031
4 casual 39690
member 82230
5 casual 18589
member 37159
6 casual 9393
member 16106
7 casual 5492
member 6432
8 casual 1706
member 1569
9 casual 789
member 731
10 casual 433
member 253
11 casual 273
member 142
12 casual 170
member 86
13 casual 92
member 47
14 casual 49
member 18
15 casual 22
member 2
16 casual 27
member 4
17 casual 1
18 casual 1
19 casual 1
21 casual 1
Source: db/distance.db
1 miles traveled per trip
2 membership status
3 occurrences

Table 39 and Figure 25 show the overall distribution of distances traveled.

Code
gplot <- transformData(conn = dbconn, path = "db/distance.db", select_cols = "milesR",
    group_cols = "milesR", doWeights = TRUE) |>
    plotter(x_col = milesR, y_col = n, geomType = "column", title = paste0("How far do people ride?"),
        x_label = "Miles", y_label = "n")

gplot + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
    ggplot2::theme(axis.text = ggplot2::element_text(size = 7))
Figure 25: The overall distribution of trips taken by the estimated distance traveled

Table 40 and Figure 26 likewise summarize the distance distribution by membership status.

Code
gplot <- transformData(conn = dbconn, path = "db/distance.db", select_cols = c("milesR",
    "member_casual"), group_cols = c("milesR", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("How far do members travel?"), x_label = "Miles", y_label = "n",
        x_col = milesR, y_col = n, group_col = member_casual, geomType = "column",
        isFaceted = TRUE, is_colGroup = TRUE)

gplot + ggplot2::scale_x_discrete(guide = ggplot2::guide_axis(n.dodge = 1, angle = 45)) +
    ggplot2::theme(axis.text = ggplot2::element_text(size = 7))
Figure 26: The membership distribution of trips taken by the estimated distance traveled

Table 41 sheds light on the variability, range, and quartile information of the distance data by membership status.

Code
gTable <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::collect() |>
    gtsummary::tbl_summary(by = member_casual, type = miles ~ "continuous2",
        label = list(miles ~ "Distance (miles)"), digits = list(miles ~ c(2, 2)),
        statistic = gtsummary::all_continuous() ~ c("{median} ({p25}, {p75})",
            "{mean} ({sd})", "{min}, {max}")) |>
    gtsummary::italicize_levels() |>
    tabler(title = gt::md("Summary Stats:<br>Member travel distance"),
        source_note = gt::md("**Source**: `db/distance.db`"), isBinary = TRUE)

gTable
Table 41: The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel distance.
Summary Stats:
Member travel distance
Characteristic casual
N = 1,260,621
member
N = 2,636,777
Distance (miles)

    Median (Q1, Q3) 1.12 (0.72, 1.83) 0.97 (0.59, 1.71)
    Mean (SD) 1.48 (1.17) 1.34 (1.11)
    Min, Max 0.10, 20.54 0.10, 16.02
Source: db/distance.db

Two plots, Figure 27 and Figure 28, present a highly granular view of the distance data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Create the data frames, then extract the desired quartile info to supplement the histogram visualizations.
qdf_member <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(as.numeric(qdf_member$miles), probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(as.numeric(qdf_casual$miles), probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel distance"),
        x_label = "Miles", y_label = "n", x_col = miles, geomType = "column", isHistogram = TRUE,
        angle = 45, color_col = "transparent", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        limits = c(0.1, 5), breaks = seq(0.1, 5, by = 0.5), quartiles = quartiles_casual)

gplot
Figure 27: A more detailed look that illustrates the quartile distribution of casual travel distance.
Code
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' travel distance"),
        x_label = "Miles", y_label = "n", x_col = miles, geomType = "column", isHistogram = TRUE,
        angle = 45, color_col = "transparent", vline_color = "lightyellow", vline_size = 0.5,
        low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        limits = c(0.1, 5), breaks = seq(1, 5, by = 1), quartiles = quartiles_member)

gplot
Figure 28: A more detailed look that illustrates the quartile distribution of member travel distance.

To visualize travel distances through the lens of their respective concentrations, see Figure 29. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike between the 0.1 and 1 mile distances, with less concentration than the casual group in the 1-3 mile range. The casual group displays a broader spike that largely overlaps the member group's but peaks closer to the 1 mile mark; it shows greater concentration than members between 1 and 3 miles and less between 0.1 and 1 mile. Both groups overlap nearly completely from around the 3 mile mark onwards.

Code
gplot <- dplyr::tbl(dbconn, "db/distance.db") |>
    dplyr::select(miles, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross travel distance groups",
        x_label = "Miles", x_col = miles, group_col = member_casual, geomType = "column",
        angle = 45, color_col = "black", density_alpha = 0.75, isDensity = TRUE,
        is_colGroup = TRUE, breaks = seq(0, 11, by = 1), limits = c(0.1, 11))

gplot
Figure 29: Visualizing the probability-density distribution of distance by membership.

Table 42 presents the odds ratios for membership status across distance quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly lower in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 0.63 times as high in Q2, 0.59 times as high in Q3, and 0.65 times as high in Q4.

These results indicate an inverse relationship between ride distance and membership status, with members generally associated with shorter ride distances. Interestingly, the lowest odds of membership were observed in Q3, rather than Q4, suggesting a non-linear relationship between distance and membership likelihood.

Query, process, and create the model object for distance based on quartile range.
model <- dplyr::tbl(dbconn, "db/distance_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Distance Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel distance"),
        source_note = gt::md("**Source**: `db/distance_wq.db`"), isBinary = TRUE)
Table 42: Modeling the probability of members’ travel by distance.
Binary Logistic Regression:
Modeling the likelihood of members’ travel distance
Characteristic OR1 p-value
Distance Ranges

    Q1 (0.10 - 0.63]
    Q2 (0.63 - 1.02] 0.63 <0.001
    Q3 (1.02 - 1.76] 0.59 <0.001
    Q4 (1.76 - 20.5] 0.65 <0.001
Source: db/distance_wq.db
1 OR = Odds Ratio

3.9 Speed

Database Operations

Table Preview

DB Operations

Write speed.db to the database.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed.db"))) {
    dplyr::tbl(dbconn, filtered_path) |>
        dplyr::select(mph, member_casual) |>
        dplyr::collect() |>
        dplyr::mutate(mphR = round(mph, digits = 0)) |>
        dplyr::arrange(mph, member_casual) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/speed.db", overwrite = TRUE)
}
Query, transform, and write weighted quartile data.
if (isFALSE(duckdb::dbExistsTable(dbconn, "db/speed_wq.db"))) {
    transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mph", "member_casual"),
        group_cols = c("mph", "member_casual"), binary_col = "member_casual", pred_col = "mph",
        ntile_col = "quartile", zero_val = "casual", one_val = "member", qtile_levels = c("Q1 (1.0 - 5.4]",
            "Q2 (5.4 - 7.0]", "Q3 (7.0 - 8.6]", "Q4 (8.6 - 20]"), doQuantile = TRUE,
        doWeights = TRUE) |>
        duckdb::dbWriteTable(conn = dbconn, name = "db/speed_wq.db", overwrite = TRUE)
}

Tables

Code
dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
dplyr::tbl(dbconn, "db/speed_wq.db") |>
    dplyr::collect() |>
    head() |>
    kableExtra::kable()
Table 43: Speed
mph member_casual mphR
1.000007 casual 1
1.000008 casual 1
1.000050 casual 1
1.000075 casual 1
1.000088 member 1
1.000098 member 1
Table 44: Speed with weighted quartile groups.
mph member_casual n quartile
1.000007 casual 1 Q1 (1.0 - 5.4]
1.000008 casual 1 Q1 (1.0 - 5.4]
1.000050 casual 1 Q1 (1.0 - 5.4]
1.000075 casual 1 Q1 (1.0 - 5.4]
1.000088 member 1 Q1 (1.0 - 5.4]
1.000098 member 1 Q1 (1.0 - 5.4]
Code
transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR", group_cols = "mphR",
    doWeights = TRUE) |>
    tabler(title = NULL, note_list = list("travel speed (miles per hour)", "occurrences"),
        location_list = list("mphR", "n"), source_note = gt::md("**Source**: `db/speed.db`"),
        noteRows = TRUE, hide_column_labels = TRUE, row_index = 1, label_parameter = "mphR",
        align_parameter = "right")
Table 45: The overall distribution of trips taken by the estimated average speed
1 1 2 45485
2 106779
3 161520
4 278819
5 460475
6 622275
7 647253
8 546353
9 399237
10 265438
11 165825
12 99529
13 55660
14 27274
15 11165
16 3354
17 714
18 138
19 69
20 36
Source: db/speed.db
1 travel speed (miles per hour)
2 occurrences
Code
transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mphR", "member_casual"),
    group_cols = c("mphR", "member_casual"), doWeights = TRUE) |>
    tabler(title = NULL, groupName = "mphR", location = n, label_n = "n", note_list = list("membership status",
        "occurrences"), location_list = list("member_casual", "n"), source_note = gt::md("**Source**: `db/speed.db`"),
        isStub = TRUE, stub_label = "Speed", stub_note = "estimated average speed traveled per trip (mph)",
        row_index = 1, noteRows = TRUE) |>
    gt::cols_label(member_casual = " ", n = " ")
Table 46: The membership distribution of trips taken by the estimated average speed
Speed1
1 casual2 3 28675
member 16810
2 casual 67464
member 39315
3 casual 93694
member 67826
4 casual 128134
member 150685
5 casual 166759
member 293716
6 casual 191146
member 431129
7 casual 179540
member 467713
8 casual 144296
member 402057
9 casual 102522
member 296715
10 casual 67961
member 197477
11 casual 42345
member 123480
12 casual 24981
member 74548
13 casual 13475
member 42185
14 casual 6305
member 20969
15 casual 2402
member 8763
16 casual 693
member 2661
17 casual 171
member 543
18 casual 38
member 100
19 casual 12
member 57
20 casual 8
member 28
Source: db/speed.db
1 estimated average speed traveled per trip (mph)
2 membership status
3 occurrences

Table 45 and Figure 30 show the overall distribution of travel speeds.

Code
gplot <- transformData(conn = dbconn, path = "db/speed.db", select_cols = "mphR",
    group_cols = "mphR", doWeights = TRUE) |>
    plotter(x_col = mphR, y_col = n, geomType = "column", title = paste0("How fast do people ride?"),
        x_label = "Miles per Hour", y_label = "n")

gplot
Figure 30: The overall distribution of trips taken by the estimated average speed

Table 46 and Figure 31 summarize the travel speed distribution by membership.

Code
gplot <- transformData(conn = dbconn, path = "db/speed.db", select_cols = c("mphR",
    "member_casual"), group_cols = c("mphR", "member_casual"), doWeights = TRUE) |>
    plotter(title = paste0("How fast do members travel?"), x_label = "Miles per Hour",
        y_label = "n", x_col = mphR, y_col = n, color_col = member_casual, geomType = "column",
        is_colGroup = TRUE, isFaceted = TRUE)

gplot
Figure 31: The membership distribution of trips taken by the estimated average speed

Table 47 gives the reader some idea of the variability, range, and quartile information about the speed data.

Code
dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::collect() |>
    gtsummary::tbl_summary(by = member_casual, type = mph ~ "continuous2",
        label = list(mph ~ "Speed"), digits = list(mph ~ c(1, 1)),
        statistic = gtsummary::all_continuous() ~ c("{median} ({p25}, {p75})",
            "{mean} ({sd})", "{min}, {max}")) |>
    gtsummary::italicize_levels() |>
    tabler(title = gt::md("Summary Stats:<br>Member travel speed"),
        source_note = gt::md("**Source**: `db/speed.db`"), isBinary = TRUE)
Table 47: The mean with standard deviation, median with inter-quartile distance, and range with min and max of member travel speed.
Summary Stats:
Member travel speed
Characteristic casual
N = 1,260,621
member
N = 2,636,777
Speed

    Median (Q1, Q3) 6.3 (4.5, 8.1) 7.2 (5.7, 8.8)
    Mean (SD) 6.4 (2.7) 7.3 (2.4)
    Min, Max 1.0, 20.0 1.0, 20.0
Source: db/speed.db

Two plots, Figure 32 and Figure 33, present a highly granular view of the travel speed data (the solid yellow line represents the mean). Quartile ranges are likewise shown to help visualize the variability.

Create the data frames, then extract the desired quartile info to supplement the histogram visualizations.
qdf_member <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::filter(member_casual == "member") |>
    dplyr::collect()

qdf_casual <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::filter(member_casual == "casual") |>
    dplyr::collect()

quartiles_member <- quantile(qdf_member$mph, probs = c(0.25, 0.5, 0.75))

quartiles_casual <- quantile(qdf_casual$mph, probs = c(0.25, 0.5, 0.75))
Code
gplot <- qdf_casual |>
    plotter(title = paste0("The quartile", "\n", "distribution of casuals' travel speed"),
        x_label = "Miles per Hour", y_label = "n", x_col = mph, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 *
            IQR(x)/(length(x)^(1/3)), breaks = seq(1, 20, by = 2), limits = c(1,
            20), quartiles = quartiles_casual)

gplot
Figure 32: A more detailed look that illustrates the quartile distribution of casual travel speed.
Code
gplot <- qdf_member |>
    plotter(title = paste0("The quartile", "\n", "distribution of members' travel speed"),
        x_label = "Miles per Hour", y_label = "n", x_col = mph, geomType = "column",
        isHistogram = TRUE, angle = 45, color_col = "transparent", vline_color = "lightyellow",
        vline_size = 0.5, low = "blue", high = "red", binwidth = \(x) 2 * IQR(x)/(length(x)^(1/3)),
        breaks = seq(1, 20, by = 2), limits = c(1, 20), quartiles = quartiles_member)

gplot
Figure 33: A more detailed look that illustrates the quartile distribution of member travel speed.

To visualize travel speed patterns through the lens of their respective concentrations, see Figure 34. The y-axis shows the density (a measure of relative frequency).

The member group has a narrower, taller spike centered around 7 mph, with a higher concentration than the casual group between roughly 5-15 mph. The casual group has a broader, shorter spike centered closer to 6 mph, with a higher concentration than the member group between roughly 1-5 mph. Alongside significant overlap, both groups exhibit a unimodal distribution.

Code
gplot <- dplyr::tbl(dbconn, "db/speed.db") |>
    dplyr::select(mph, member_casual) |>
    dplyr::collect() |>
    plotter(title = "Visualizing relative frequency\nacross travel speed groups",
        x_label = "Miles per Hour", x_col = mph, group_col = member_casual, geomType = "column",
        angle = 45, color_col = "black", density_alpha = 0.75, isDensity = TRUE,
        is_colGroup = TRUE, breaks = seq(1, 20, by = 1), limits = c(1, 20))

gplot
Figure 34: Visualizing the probability-density distribution of speed by membership.

Table 48 presents the odds ratios for membership status across speed quartiles, with Q1 serving as the reference category. In summary:

Compared to Q1, the odds of being a member versus a casual rider were significantly higher in all other quartiles (p < 0.001 for all comparisons). Specifically, the odds of membership were 2.09 times as high in Q2, 2.50 times as high in Q3, and 2.69 times as high in Q4.

These results suggest a strong positive association between riding speed and membership status, with the likelihood of membership increasing monotonically across speed quartiles.

Query, process, and create the model object for speed based on quartile range.
model <- dplyr::tbl(dbconn, "db/speed_wq.db") |>
    dplyr::collect() |>
    glm(formula = member_casual ~ quartile, family = binomial, weights = n)
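To read the monotonic trend as probabilities rather than odds, the fitted log-odds can be converted with predict(..., type = "response"). A sketch using the model object just created:
# Predicted probability of membership within each speed quartile
qs <- sort(unique(model$model$quartile))
data.frame(quartile = qs,
           p_member = predict(model, newdata = data.frame(quartile = qs), type = "response"))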
Code
model |>
    gtsummary::tbl_regression(label = list(quartile = "Speed Ranges"), conf.int = FALSE,
        exponentiate = TRUE) |>
    tabler(title = gt::md("Binary Logistic Regression:<br>Modeling the likelihood of members' travel speed"),
        source_note = gt::md("**Source**: `db/speed_wq.db`"), isBinary = TRUE)
Table 48: Modeling the probability of members’ travel by speed.
Binary Logistic Regression:
Modeling the likelihood of members’ travel speed
Characteristic OR1 p-value
Speed Ranges

    Q1 (1.0 - 5.4]
    Q2 (5.4 - 7.0] 2.09 <0.001
    Q3 (7.0 - 8.6] 2.50 <0.001
    Q4 (8.6 - 20] 2.69 <0.001
Source: db/speed_wq.db
1 OR = Odds Ratio

3.10 Summarizing Tabsets

The EDA (exploratory data analysis) sections employ various statistical methods to uncover patterns in user behavior and preferences. A chi-square analysis reveals a significant association between bicycle type and membership status (p < 0.001). Binary logistic regression further quantifies this relationship, showing that electric bike users have lower odds of being members compared to classic bike users. Section 3.3

Trip duration analysis, also utilizing binary logistic regression, uncovers a notable trend: the odds of membership decrease substantially as trip duration increases. This model, using quartiles of trip duration, indicates that members generally take shorter, more concentrated trips, while casual users are more likely to engage in longer rides. Section 3.4

Seasonal trends emerge when examining monthly ridership patterns through another logistic regression model. The odds of membership fluctuate throughout the year, with the highest proportion of members riding during the colder months and early spring. As the weather warms, there’s a noticeable shift towards more casual ridership, as evidenced by lower odds ratios in the summer months. Section 3.5

Weekly and daily patterns, analyzed using similar regression techniques, provide further insights into user behavior. Weekdays (Section 3.6), particularly during typical commuting hours (Section 3.7), see higher odds of member rides. In contrast, weekends and evenings show decreased odds of membership, suggesting an increased likelihood of casual ridership during these times.

The analysis of trip distances, again using logistic regression with distance quartiles, reveals an inverse relationship with membership status. Members are more likely to take shorter trips, while casual users tend to embark on longer journeys. This aligns with the duration findings and reinforces the idea that members use the service for quick, routine travel. Section 3.8

Interestingly, trip speed shows a strong positive association with membership status in the logistic regression model. The odds of membership increase monotonically across speed quartiles, indicating that members generally ride at higher speeds compared to casual users. Section 3.9

These findings, derived from a combination of chi-square tests for independence and multiple binary logistic regression models, paint a picture of two distinct user groups: members who typically use the bike share for short, fast, routine trips during weekdays, and casual users who tend to take longer, slower rides, often during leisure hours or weekends.

Contingency tables and visualizations, including density plots and histograms, supplement these statistical analyses, providing a comprehensive view of the data distribution across various parameters such as bike type, trip duration, time of day, and day of the week.

The robust statistical approach, combining hypothesis testing (chi-square) with predictive modeling (logistic regression), provides strong evidence for the observed patterns in user behavior. These insights could prove valuable for tailoring marketing strategies, optimizing bike distribution, and enhancing service offerings to better serve both user segments.

3.11 Geographic Data

3.11.1 Traffic Flow

Figure 35 presents an intriguing bird’s-eye view of trip behaviors through an interactive epiflows graph (Moraga et al. 2019). The R package used to create this graph was repurposed from its original intent of visualizing the spread of disease. This visualization employs a network of nodes (circles) connected by lines, where the thickness of a line roughly corresponds to the volume of trips between the nodes it connects, with thicker lines indicating a higher number of trips. The top 34 most frequently traveled stations are depicted in this visual network diagram.

Moraga, P., Dorigatti, I., Kamvar, Z. N., Piatkowski, P., Toikkanen, S. E., Nagraj, V. P., Donnelly, C. A., & Jombart, T. (2019). epiflows: an R package for risk assessment of travel-related spread of disease. https://doi.org/10.12688/f1000research.16032.3

The interactive nature of the epiflows allows users to click on individual nodes and lines to access more detailed information. Additionally, a drop-down window provides further exploration capabilities, enabling users to delve deeper into the data.

These stations represent the most active locations within the system. Fortunately, Section 3.11.2 explores a potential approach to gain insights into the typical high-traffic station locations and the underlying reasons behind their elevated activity levels.

Creating an EpiFlow

First, compute the frequency of trips taken to and from pairs of stations.
We only look deeper into the top 50 most traveled pairs.
flowData <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(start_station_name, end_station_name) |>
    dplyr::group_by(start_station_name, end_station_name) |>
    dplyr::summarize(n = dplyr::n()) |>
    dplyr::ungroup() |>
    dplyr::arrange(desc(n)) |>
    dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
    dplyr::collect() |>
    dplyr::slice_head(n = 50)
Second, compute statistics for each station pair; these are then combined for every unique station name.
locationData <- dplyr::tbl(dbconn, filtered_path) |>
    dplyr::select(start_station_name, end_station_name, started_at, ended_at, trip_time) |>
    dplyr::group_by(start_station_name, end_station_name) |>
    dplyr::mutate(trip_time = round(trip_time, digits = 0)) |>
    dplyr::summarize(trip_count = dplyr::n(), first_date = min(started_at), last_date = max(ended_at)) |>
    dplyr::ungroup() |>
    dplyr::rename(from_station = start_station_name, to_station = end_station_name) |>
    dplyr::arrange(desc(trip_count)) |>
    dplyr::collect()

# Need to combine all names to single column and recalculate or retain other
# stats.
locationData_pivoted <- locationData |>
    tidyr::pivot_longer(cols = 1:2, values_to = "allNames") |>
    dplyr::group_by(allNames) |>
    dplyr::summarize(trips_toAndfrom = sum(trip_count), first_date = min(first_date),
        last_date = max(last_date)) |>
    dplyr::arrange(trips_toAndfrom)
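To make the reshape concrete, here is the same pivot on a toy two-row flow table (the station names and counts are hypothetical):
toy <- tibble::tibble(from_station = c("A", "B"), to_station = c("B", "C"),
    trip_count = c(5, 3), first_date = as.Date(c("2023-01-01", "2023-02-01")),
    last_date = as.Date(c("2023-06-01", "2023-07-01")))

toy |>
    tidyr::pivot_longer(cols = 1:2, values_to = "allNames") |>
    dplyr::group_by(allNames) |>
    dplyr::summarize(trips_toAndfrom = sum(trip_count), first_date = min(first_date),
        last_date = max(last_date))
#> A sees 5 trips, B sees 5 + 3 = 8, and C sees 3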
Third, create the epiflows object, which takes in a pair of data frames and creates the flows between them.
# for all the pairs
ef_test <- epiflows::make_epiflows(flows = flowData, locations = locationData_pivoted,
    num_cases = "trips_toAndfrom")

Tables

First, just a quick view of the flow data table we made earlier.
flowData
# A tibble: 50 × 3
   from_station                      to_station                   n
   <chr>                             <chr>                    <dbl>
 1 Ellis Ave & 60th St               Ellis Ave & 55th St       6927
 2 Ellis Ave & 60th St               University Ave & 57th St  6600
 3 Ellis Ave & 55th St               Ellis Ave & 60th St       6349
 4 University Ave & 57th St          Ellis Ave & 60th St       6168
 5 Calumet Ave & 33rd St             State St & 33rd St        5417
 6 State St & 33rd St                Calumet Ave & 33rd St     5343
 7 DuSable Lake Shore Dr & Monroe St Streeter Dr & Grand Ave   4023
 8 Loomis St & Lexington St          Morgan St & Polk St       3719
 9 Morgan St & Polk St               Loomis St & Lexington St  3379
10 University Ave & 57th St          Kimbark Ave & 53rd St     3112
# ℹ 40 more rows
Second, another quick view, but for the location data we pivoted earlier.
locationData_pivoted |>
    dplyr::arrange(desc(trips_toAndfrom))
# A tibble: 1,567 × 4
   allNames              trips_toAndfrom first_date          last_date          
   <chr>                           <dbl> <dttm>              <dttm>             
 1 Streeter Dr & Grand …           86422 2023-01-01 00:05:43 2024-01-01 00:19:01
 2 Kingsbury St & Kinzi…           61277 2023-01-01 01:21:59 2023-12-31 21:30:50
 3 DuSable Lake Shore D…           60808 2023-01-01 02:12:09 2023-12-31 23:34:53
 4 Clark St & Elm St               60552 2023-01-01 01:06:48 2023-12-31 23:29:33
 5 Clinton St & Washing…           58278 2023-01-01 00:44:39 2023-12-31 18:03:02
 6 Wells St & Concord Ln           57642 2023-01-01 01:15:27 2023-12-31 23:51:50
 7 Michigan Ave & Oak St           54000 2023-01-01 00:59:17 2023-12-31 23:09:35
 8 Wells St & Elm St               52315 2023-01-01 00:59:22 2023-12-31 23:51:48
 9 DuSable Lake Shore D…           48833 2023-01-01 00:14:47 2023-12-31 16:49:56
10 Theater on the Lake             48349 2023-01-01 03:14:22 2023-12-31 22:53:53
# ℹ 1,557 more rows
Figure 35: EpiFlow Network


3.11.2 Checking the Map

This section was made possible by the latitude and longitude coordinates provided alongside the station names. Coming from the epiflow diagram, this should help make the data less abstract. The accordion in the callout section below expands and collapses four OpenStreetMap maps, which were split for viewing logistics. They contain the stations from the epiflow in the section above. The maps are interactive, so the default views are zoomable and movable, and the transparent burst buttons enable snappy zooming in on the station groups.

Code for Mapping

Processing ‘flowData’ created earlier to include geolocation data for mapview plots.
# All distinct stations in one column
names <- flowData |>
    dplyr::select(from_station, to_station) |>
    tidyr::pivot_longer(cols = 1:2, names_to = NULL, values_to = "station_names") |>
    dplyr::distinct()

# The important geo-coordinates corresponding to station names
mapData <- dplyr::tbl(dbconn, filtered_path, check_from = FALSE) |>
    dplyr::select(start_station_name, start_lat, start_lng, end_station_name, end_lat,
        end_lng)

# Filter to include all observations that match the station names listed in
# 'names'. We need the geo-coordinates alongside the names.
mapData1 <- mapData |>
    dplyr::collect() |>
    # Filter, but through a vector of conditions.
    dplyr::filter(start_station_name %in% names[[1]], end_station_name %in% names[[1]]) |>
    dplyr::select(start_station_name:start_lng)

# Had to split 'mapData' into two and pivot into a single table.
mapData2 <- mapData |>
    dplyr::collect() |>
    dplyr::filter(start_station_name %in% names[[1]], end_station_name %in% names[[1]]) |>
    dplyr::select(end_station_name:end_lng)

# Nice grouping
stations_groupMap <- dplyr::bind_rows(mapData1, mapData2) |>
    dplyr::select(start_station_name, start_lat, start_lng) |>
    dplyr::rename(station_names = start_station_name, lat = start_lat, lng = start_lng) |>
    dplyr::distinct() |>
    dplyr::group_by(station_names)

# Setting seed for sampling
set.seed(113)

# Taking 10 random samples from each station_name group
sampled_stations <- stations_groupMap |>
    dplyr::slice_sample(n = 10) |>
    tidyr::drop_na()
Create a map coloring palette excluding grays.
# All of the r-colors
allPalette <- colors()

# The grays are vast so we don't want those watering down the samples.
colorfulPal <- purrr::discard(allPalette, stringr::str_detect(allPalette, "gr(a|e)y"))

# When we sample the colors, 10 should be slightly more than needed.
n_colors <- 10
First, source the script needed to generate the maps and create the list of vectors used as input. These vectors are the slices of the most traveled stations.
slicerVector <- list(c(1:9), c(10:18), c(19:27), c(28:34))
source("Scripts/mapViewer.R")
The script used to generate the maps.
# ----
# Author: Eric Mossotti
# CC BY-SA
# ----
# I needed the station groups' burst buttons to fit the viewing window in my
# document, and the only way I could think of was to split the stations across
# multiple maps. Wrapping the plotting in a function reduces duplicate code.
# ----
mapViewer <- function(x) {
    
    # The station names belonging to this map's slice of the sampled stations
    nameSlice <- sampled_stations |>
        dplyr::ungroup() |>
        dplyr::distinct(station_names) |>
        dplyr::slice(x)
    
    # Convert the matching rows to an sf object (columns 3:2 are lng/lat,
    # WGS 84); 'randomColors' is sampled in the calling environment first
    viewMap <- sampled_stations |>
        dplyr::filter(station_names %in% nameSlice$station_names) |>
        sf::st_as_sf(coords = c(3:2), crs = 4326) |>
        mapview::mapview(
            zcol = "station_names",
            col.regions = randomColors,
            map.types = "OpenStreetMap",
            burst = TRUE,
            legend = FALSE)
    
    return(viewMap)
}
Code
set.seed(240)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[1]])
Figure 36: Benson Ave & Church St - Ellis Ave & 60th St
Code
set.seed(241)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[2]])
Figure 37: Greenview Ave & Fullerton Ave - Loomis Ave & Lexington St
Code
set.seed(242)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[3]])
Figure 38: Michigan Ave & Oak St - State St & 33rd St
Code
set.seed(243)
randomColors <- sample(colorfulPal, n_colors)
mapViewer(slicerVector[[4]])
Figure 39: Streeter Dr & Grand Ave - Woodlawn Ave & 55th St

3.11.3 Summarizing Geographic EDA

For example, suppose the user selects University Ave & 57th St in the epiflow visualization. This intersection happens to be at the heart of the University of Chicago campus. The natural next question is: where does the traffic to and from this location typically flow? By selecting one of the other nodes highlighted with flows directed away from that node, the user can identify Kimbark Ave & 53rd St. As seen in the map view, this location sits adjacent to the Vue 53 Apartments complex. By analyzing such connections between nodes, the user can gain insight into common routes and destinations originating from a particular point of interest, potentially revealing patterns related to student housing, campus facilities, or other destinations in the vicinity.

The data suggests individual members use the service multiple times weekly. However, further analysis is needed to determine whether annual members account for a significantly larger share of unique riders. Verifying associations between specific locations and higher or lower traffic could be a next step. Preliminary observations indicate that universities, shopping centers, major companies, and nearby apartment complexes tend to see the highest ridership volumes.

To grow annual membership, addressing the factors that deter individuals from subscribing could be key. These may include a lack of stations within walking distance of residences or destinations, or concerns over electric bicycle battery life and charging station availability, which could explain electric bikes' lower utilization compared to classic bikes. Offering trial periods could let casual users experience the service's reliability and convenience, encouraging conversion to annual memberships.

3.12 Updated Database Tables List

Code
dbList2 <- duckdb::dbListTables(dbconn) |>
    as.data.frame() |>
    tabler(title = gt::md("Which tables have<br>been created so far?"), note_list = list(gt::md("Tables in `db/data.db` at this stage")),
        location_list = list("duckdb::dbListTables(dbconn)"), noteColumns = TRUE,
        source_note = gt::md("**Source**: `db/data.db`"), label_n = NULL) |>
    gt::cols_label(`duckdb::dbListTables(dbconn)` = "Table Paths") |>
    gt::cols_align(align = "left")

dbList2
Table 49: The list of tables created by the end of the analysis.
Which tables have been created so far?
Table Paths¹
db/bType.db
db/bType_wb.db
db/complete_data.db
db/data_unduped.db
db/distance.db
db/distance_wq.db
db/dow.db
db/dow_wq.db
db/duration.db
db/duration_wq.db
db/filtered_data.db
db/hod.db
db/hod_wq.db
db/membership.db
db/moy.db
db/moy_wq.db
db/original_data.db
db/speed.db
db/speed_wq.db
Source: db/data.db
¹ Tables in db/data.db at this stage
Figure 40: dbList_2

Notice the size difference from the previous listing. The database is still represented on disk as a single file.

3.13 Export the Final Database

For future use, exporting avoids having to re-download the original data or recreate the tables. The database is exported and compressed to Parquet files.

A view of the data exporting script.
# ----
# CC BY-SA, Eric Mossotti 
# ----
# Description ----
# 
# Simplifies duckDB database export to a separate directory. Prevents the need
# for re-downloading and recreating db tables in some circumstances.

dbExporter <- function(dbdir, query) {
    
    if (dir.exists("db_exported") == FALSE) {
        dir.create("db_exported")
    }
    
    conn <- DBI::dbConnect(duckdb::duckdb(), dbdir)
    
    # Run the caller-supplied EXPORT DATABASE statement
    DBI::dbExecute(conn, query)
    
    # Close the connection and shut down the embedded duckDB instance
    DBI::dbDisconnect(conn, shutdown = TRUE)
    
}
Execute the script above with a custom SQL query string that exports the database to the db_exported directory.
if (dir.exists("db_exported") == FALSE) {

    source("Scripts/dbExporter.R")

    queryString <- paste0("EXPORT DATABASE 'db_exported' (", "FORMAT PARQUET, ",
        "COMPRESSION ZSTD", ")")

    dbExporter(dbdir = "db/data.db", query = queryString)

}
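In a later session, the exported Parquet files can be restored with DuckDB's IMPORT DATABASE statement. A minimal sketch, assuming the db_exported directory created above; the db/restored.db path is hypothetical:
# Restore the exported tables into a fresh database file
# ('db/restored.db' is a hypothetical path for illustration)
conn <- DBI::dbConnect(duckdb::duckdb(), dbdir = "db/restored.db")

DBI::dbExecute(conn, "IMPORT DATABASE 'db_exported'")

# Confirm the tables came back, then disconnect cleanly
DBI::dbListTables(conn)
DBI::dbDisconnect(conn, shutdown = TRUE)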

4 Conclusion

These findings and recommendations are based on robust statistical analyses, including chi-square tests, binary logistic regression models, and visualization techniques. They provide a data-driven foundation for enhancing the Divvy bike-sharing service and better serving the residents of Chicago.

4.1 Key Findings

  • Membership status significantly influences bike usage patterns:

    • Members prefer classic bikes over electric bikes.
    • Casual users ride electric bikes at a higher rate than members.
    • Members typically take shorter, faster trips.
    • Casual users tend to engage in longer, slower rides.
  • Temporal patterns reveal distinct user behaviors:

    • Weekdays and typical commuting hours see higher member activity.
    • Weekends and evenings show increased casual ridership.
    • Membership likelihood is highest during colder months and early spring.
    • Summer months see a shift towards more casual ridership.
  • Trip characteristics vary by user type:

    • Members are associated with shorter ride distances.
    • Trip speed shows a strong positive association with membership status.
    • The likelihood of membership decreases as trip duration increases.
  • High-traffic stations are often near universities, shopping centers, major companies, and apartment complexes.

  • The large sample size (nearly 4 million trips) lends high statistical significance to the observed differences.

4.2 Recommendations

  1. Tailor marketing strategies to target potential members for short, frequent trips, especially for commuting purposes.

  2. Optimize bike distribution to meet the demand for classic bikes among members and electric bikes among casual users.

  3. Implement promotional campaigns during summer months to convert casual users to members.

  4. Consider offering trial periods to allow casual users to experience the benefits of membership.

  5. Investigate factors deterring individuals from becoming annual members, such as station proximity to residences or destinations.

  6. Address potential concerns over electric bicycle battery life and charging station availability.

  7. Focus on improving service near high-traffic areas like universities, shopping centers, and residential complexes.

  8. Develop targeted strategies to encourage casual users who take longer, leisurely rides to consider membership benefits.

  9. Utilize the epiflows visualization tool to identify and optimize popular routes and destinations.

  10. Continue to collect and analyze data to refine understanding of user behavior and preferences over time.