---
title: "google_analytics_cyclistic_case_study"
author: "Pete Martinez"
date: "2024-08-07"
output:
  html_document:
    toc: true
    theme: united
---
# Introduction
### Purpose
* Identify how annual members and casual riders use Cyclistic bikes differently, and use those insights to support a marketing strategy aimed at converting casual riders into annual members.
### Scenario
* The scenario from the case-study brief is quoted in full in the Ask section below.
# Ask
You are a junior data analyst working on the marketing analyst team at Cyclistic, a bike-share
company in Chicago. The director of marketing believes the company’s future success
depends on maximizing the number of annual memberships. Therefore, your team wants to
understand how casual riders and annual members use Cyclistic bikes differently. From these
insights, your team will design a new marketing strategy to convert casual riders into annual
members. But first, Cyclistic executives must approve your recommendations, so they must be
backed up with compelling data insights and professional data visualizations.
[@Case-Study-1_How-does-a-bike-shared-navigate-speedy-success_.pdf]
### Stakeholders
| Stakeholder name | Stakeholder Description |
| ---------------- | ----------------------- |
| Cyclistic | A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. |
| Lily Moreno | The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.|
| Cyclistic Marketing Team | A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. |
| Cyclistic Executive Team | The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program. |
### Scope / Major Project Activities
| Activity | Description |
| -------- | ----------- |
| Collection of data from sources | The data is available in an AWS S3 Bucket at https://divvy-tripdata.s3.amazonaws.com/index.html|
| Clean Data | - Verify that each column has the correct type for the values it contains (see the sketch after this table) <br/>- Remove duplicate rows <br/>- Remove rows with missing required values (null or invalid) <br/>- Make sure dates, phone numbers, and addresses share a common format throughout the data <br/> |
| Data Analysis | - Using the R programming language, manipulate the data in ways that provide insight into the business question being asked <br/>- Document all functions used in the analysis so they can be referred to in the future <br/>- Summarize the results of the analysis |
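One of the cleaning steps above is verifying column types. Below is a minimal sketch of what that check could look like once the data is loaded (the chunk is not evaluated here, because `bike_share_data` is only created in the Process section):
```{r eval=FALSE}
# Sketch only: bike_share_data is created later, in the Process section.
spec(bike_share_data)     # the column specification readr used while parsing
glimpse(bike_share_data)  # one line per column: name, type, and sample values
```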
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Prepare
* Where is the data being stored?
+ The original data files are stored in an AWS S3 bucket at [this link](https://divvy-tripdata.s3.amazonaws.com/index.html). <br/> The downloaded data files are then uploaded to Kaggle cloud storage.
* How is the data organized?
+ The provided data files are in the CSV format, and all 12 contain the same 13 column names (see the sketch at the end of this list).
* How was data integrity verified?
+ The document outlining the case study provided a link to a secure AWS S3 bucket. <br/> The link was provided by Google, which is widely considered a credible source of data.
* Will this data help answer the business question being asked?
+ The data provided contains all the information needed to answer the business question being asked.
* Are there problems with the data?
+ There are a few problems:
- started_at and ended_at date columns have dates in different formats
- some of the latitude and longitude coordinates do not coincide with a start_station_name or an end_station_name
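To back up the claim that all 12 files share the same 13 column names, a quick header check like the one below could be run (a sketch; `data_files` is defined in the Process section, so the chunk is not evaluated here):
```{r eval=FALSE}
# Read only the header row of each CSV and confirm the column names match
headers <- lapply(data_files, function(f) {
  names(read_csv(f, n_max = 0, show_col_types = FALSE))
})
length(unique(headers)) == 1  # TRUE if every file has identical columns
```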
# Process
+ Tools used:
  - On my computer, Excel is buggy with the amount of data in each file, so I will be using the R programming language instead.
+ Data Cleaning Process
### <u>Load packages needed for data cleaning and manipulation</u>
```{r message=FALSE, warning=FALSE}
library(tidyverse)
library(lubridate)
library(janitor)
```
### <u> Load data from provided CSV files. </u>
```{r}
data_files <- list.files(path=r"{/kaggle/input/cyclistic-raw-data}",full.names = TRUE)
data_files
# as you can see from the output we have our 12 files in the data_files variable
# now we can load our data
bike_share_data <- read_csv(data_files)
```
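Note that read_csv() row-binds all 12 files for us. If a row ever needed to be traced back to its source file, readr's `id` argument records the file name in an extra column; a sketch (not evaluated, to avoid reloading the data):
```{r eval=FALSE}
# Same load as above, plus a source_file column recording each row's origin
bike_share_data <- read_csv(data_files, id = "source_file")
```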
### <u> Clean the names of each column </u>
As you can see from the printout of the colnames() function below, the current column names are already valid, so no renaming is needed.
```{r}
colnames(bike_share_data)
```
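Had any of the names needed standardizing, clean_names() from the janitor package loaded above would convert them to snake_case; a sketch of the call, not needed for this dataset:
```{r eval=FALSE}
# Not needed here: the names are already valid, so this would be a no-op
bike_share_data <- bike_share_data %>% clean_names()
```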
### <u>Set a common date pattern for all date columns</u>
We would like the dates in every date column to follow the '%Y-%m-%d %H:%M:%S' pattern, but in some rows the dates follow the '%m-%d-%Y %H:%M:%S' pattern instead, as shown below.
```{r warning=FALSE}
head(bike_share_data %>%
  filter(is.na(as_date(started_at, format = "%Y-%m-%d %H:%M:%S"))) %>%
  select(started_at) %>%
  arrange(started_at))
```
To fix this, we first test whether the current value parses in the desired format with as_date(); where it does not, we reparse it with mdy_hms() from the lubridate library; finally, we use mutate() to update the date field.
```{r warning=FALSE}
bike_share_cleaned <- bike_share_data %>%
  mutate(started_at = case_when(
    # rows that fail the desired format get reparsed as month-day-year
    is.na(as_date(started_at, format = "%Y-%m-%d %H:%M:%S")) ~ mdy_hms(started_at),
    TRUE ~ as_datetime(started_at))) %>%
  mutate(ended_at = case_when(
    is.na(as_date(ended_at, format = "%Y-%m-%d %H:%M:%S")) ~ mdy_hms(ended_at),
    TRUE ~ as_datetime(ended_at)))
```
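As a quick sanity check, each lubridate parser recovers the same instant from its matching format (the date strings below are made-up examples):
```{r}
# Hypothetical examples, one in each of the two formats seen in the data
ymd_hms("2023-05-31 23:59:58")  # '%Y-%m-%d %H:%M:%S' pattern
mdy_hms("05-31-2023 23:59:58")  # '%m-%d-%Y %H:%M:%S' pattern
```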
As you can see in the output below, the dates in the started_at and ended_at columns now follow the '%Y-%m-%d %H:%M:%S' pattern.
```{r warning=FALSE}
head(bike_share_cleaned %>%
  filter(month(started_at) == 6 | (month(started_at) == 5 & day(started_at) == 31)) %>%
  select(started_at, ended_at) %>%
  arrange(started_at))
```
### <u> Find and remove duplicates from dataset </u>
Using the get_dupes function from the janitor package, we will look for duplicate ride_ids, as each trip is supposed to have a unique ride_id. We determined that there are 104 duplicate rows, as shown in the output below.
```{r}
duplicates <- bike_share_cleaned %>% get_dupes(ride_id)
nrow(duplicates)
head(duplicates,n=8)
```
Now we will use the distinct function from the dplyr package to remove these duplicates.
```{r}
bike_share_cleaned <- bike_share_cleaned %>% distinct(ride_id,.keep_all = TRUE)
```
As you can see from the output below, the duplicates have been removed.
```{r}
bike_share_cleaned %>% get_dupes(ride_id)
```
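As an extra cross-check, base R's duplicated() should now also count zero repeated ride_ids:
```{r}
# Base R cross-check: number of ride_ids appearing more than once
sum(duplicated(bike_share_cleaned$ride_id))
```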
### <u>Create calculated time columns</u>
The amount of time each rider keeps a bike is a valuable piece of information, so we will calculate the total ride length in minutes, along with the ride length expressed in hours and in days, and the day of the week the ride started.
```{r}
bike_share_cleaned <- bike_share_cleaned %>%
  # total ride length in minutes; pin the units so difftime cannot auto-pick them
  mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "mins"))) %>%
  # ride length in hours, for rides longer than an hour
  mutate(ride_hours = case_when(
    ride_length > 60 ~ ride_length / 60,
    TRUE ~ 0)) %>%
  # ride length in days, for rides of 24 hours or more
  mutate(ride_days = case_when(
    ride_hours >= 24 ~ ride_hours / 24,
    TRUE ~ 0)) %>%
  # day of week the ride started (1 = Sunday by default; see the note below)
  mutate(day_of_week = wday(started_at))
```
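Note that wday() numbers the days 1 through 7 starting from Sunday by default; passing `label = TRUE` returns the day name instead, which can be more readable in charts (the date below is an arbitrary example):
```{r}
# wday() defaults: 1 = Sunday, ..., 7 = Saturday
wday(ymd("2024-08-18"))                # 2024-08-18 is a Sunday, so this is 1
wday(ymd("2024-08-18"), label = TRUE)  # returns the ordered factor label "Sun"
```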
### <u>Remove invalid rows</u>
#### <u>Remove Started_at > Ended_at </u>
There are some invalid rows where the started_at time is later than the ended_at time; these rows can be removed with a dplyr filter.
As you can see in the output below, there are 434 rows where started_at > ended_at.
```{r}
nrow(bike_share_cleaned %>% filter(started_at > ended_at))
```
So now we can filter them out.
```{r}
bike_share_cleaned <- bike_share_cleaned %>%
filter(started_at<ended_at)
```
Now we have zero rows where started_at > ended_at.
```{r}
nrow(bike_share_cleaned %>% filter(started_at > ended_at))
```
#### <u>Remove invalid ride_id</u>
We will remove all ride_ids with a character count of less than 16. Some ride_ids were long numbers rendered in scientific notation, so I have decided to treat those rows as invalid. The output below shows an example of the invalid ride_ids.
```{r}
invalid_ride_id <- bike_share_cleaned %>% filter(nchar(ride_id)<16)
nrow(invalid_ride_id)
head(invalid_ride_id,n=6)
```
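For context, the example below (using a made-up 16-digit value) shows how a long numeric ID collapses into scientific notation when it is stored as a number instead of text, which is why these rows fail the 16-character check:
```{r}
# A made-up 16-digit ID: formatted as a number, it becomes scientific notation
format(1234567890123456, scientific = TRUE)
nchar(format(1234567890123456, scientific = TRUE))  # well under 16 characters
```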
Now we filter out these invalid ride_ids.
```{r}
bike_share_cleaned <- bike_share_cleaned %>% filter(nchar(ride_id)==16)
```
As you can see below, the rows with the invalid ride_ids have been removed.
```{r}
bike_share_cleaned %>% filter(nchar(ride_id)<16)
```
#### <u>Remove rows missing start_station_id and end_station_id </u>
Some of the rows have invalid coordinates that do not correspond to any start station or end station. I have chosen to remove these rows, since the sample size is more than adequate without them. As you can see below, there are over 400,000 rows with missing station data.
```{r}
nrow(bike_share_cleaned %>% filter(is.na(start_station_id), is.na(end_station_id)))
```
Now we will remove these rows from bike_share_cleaned.
```{r}
bike_share_cleaned <- bike_share_cleaned %>% filter(!is.na(start_station_id),!is.na(end_station_id))
```
Now we have zero rows without a start_station_id and end_station_id.
```{r}
nrow(bike_share_cleaned %>% filter(is.na(start_station_id), is.na(end_station_id)))
```
### <u>Remove outliers</u>
Using ggplot2, we will visualize the data to look for outliers. First we will group and summarize ride_length and ride_date (derived from started_at).
```{r warning=FALSE}
search_for_outliers <- bike_share_cleaned %>%
  group_by(ride_length, ride_date = as_date(started_at)) %>%
  summarize() %>%
  arrange(desc(ride_length))
```
Next we will create a boxplot, which will show us whether there are outliers in the dataset.
```{r warning=FALSE}
search_for_outliers %>% ggplot(aes(x = ride_date,y=ride_length, group=1)) + geom_boxplot()
```
As you can see from the boxplot, there are some major outliers well above the top whisker that skew the plot's shape. The box itself appears flat because the y-axis must stretch to accommodate the extreme values.
We will use the quantile() function to find the value below which 99.9% of all ride lengths fall, and then filter out every row above it.
```{r warning=FALSE}
remove_row_above <- quantile(bike_share_cleaned$ride_length, 0.999)
print(remove_row_above)
```
As you can see above, 99.9% of all ride lengths fall below 490.2285 minutes, so we will filter out all rows with values above that.
```{r warning=FALSE}
bike_share_cleaned <- bike_share_cleaned %>% filter(ride_length <= as.numeric(remove_row_above))
```
Now we will take a look at the boxplot one more time.
```{r warning=FALSE}
bike_share_cleaned %>%
group_by(ride_length,ride_date=as_date(format(started_at,"%Y-%m-%d"))) %>%
summarize() %>%
ggplot(aes(x = ride_date,y=ride_length, group=1)) + geom_boxplot()
```
The boxplot now shows that the outliers have been removed and the y-axis range is much smaller. The long upper whisker shows that there is still a large variance among the higher values in the dataset.
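As a cross-check on the 99.9th-percentile cutoff, we can compute the classic 1.5 × IQR upper fence, the same rule geom_boxplot() uses to flag outliers:
```{r}
# Tukey's upper fence: Q3 + 1.5 * IQR (the rule geom_boxplot applies)
quantile(bike_share_cleaned$ride_length, 0.75) +
  1.5 * IQR(bike_share_cleaned$ride_length)
```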
# Analysis
### <u>Mean ride length of members and casual riders</u>
```{r}
mean_member_casual <- bike_share_cleaned %>%
  group_by(member_casual) %>%
  summarize(mean_ride_length_minutes = mean(ride_length))
print(mean_member_casual)
```
### <u>Median ride length of members and casual riders</u>
```{r}
median_member_casual <- bike_share_cleaned %>%
  group_by(member_casual) %>%
  summarize(median_ride_length_minutes = median(ride_length))
print(median_member_casual)
```
### <u>Max ride length of members and casual riders</u>
```{r}
max_member_casual <- bike_share_cleaned %>%
  group_by(member_casual) %>%
  summarize(max_ride_length_minutes = max(ride_length))
print(max_member_casual)
```
### <u>Percentage of total rides by rider type</u>
```{r}
member_casual_percent_dist <- bike_share_cleaned %>%
  group_by(member_casual) %>%
  summarize(count = n(),
            percent_of_rides = (n() / nrow(bike_share_cleaned)) * 100)
member_casual_percent_dist
```
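The three summaries above could also be produced in a single grouped pass, which keeps the related statistics side by side:
```{r}
# One pass over the data: mean, median, and max ride length per rider type
bike_share_cleaned %>%
  group_by(member_casual) %>%
  summarize(mean_ride_minutes   = mean(ride_length),
            median_ride_minutes = median(ride_length),
            max_ride_minutes    = max(ride_length),
            rides               = n())
```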