Bellabeat Data Analysis Capstone Project
1. Summary
1.1 Background
This is a capstone project for the Google Data Analytics Professional Certificate. The scenario provided for the case study is as follows.
Urška Sršen and Sando Mur founded Bellabeat, a high-tech company that manufactures health-focused smart products. Sršen used her background as an artist to develop beautifully designed technology that informs and inspires women around the world. Collecting data on activity, sleep, stress, and reproductive health has allowed Bellabeat to empower women with knowledge about their own health and habits.
1.2 This Project
This study analyzes smart device usage data to gain insight into how consumers use non-Bellabeat smart devices. The insights will be used to identify growth opportunities for Bellabeat products: primarily the Time smart watch and, secondarily, the Membership.
- Time: This wellness watch combines the timeless look of a classic timepiece with smart technology to track user activity, sleep, and stress. The Time watch connects to the Bellabeat app to provide you with insights into your daily wellness.
- Bellabeat membership: Bellabeat also offers a subscription-based membership program for users. Membership gives users 24/7 access to fully personalized guidance on nutrition, activity, sleep, health and beauty, and mindfulness based on their lifestyle and goals.
2. Ask Phase
2.1 Business task statement
Garner insight from public data on use of wearable health-tracking technology that could influence and direct Bellabeat's marketing strategy, specifically for the Time smart watch and, subsequently, the Membership guidance.
Stakeholders
- Urška Sršen: Bellabeat's cofounder and Chief Creative Officer
- Sando Mur: Mathematician and Bellabeat's cofounder
- Bellabeat marketing analytics team
3. Prepare Phase
3.1 Data Source
The data source used for this case study is the FitBit Fitness Tracker Data, a dataset hosted on Kaggle and made available by Möbius.
3.2 Accessibility and privacy of data:
The data source is verified to be available for public use under the CC0 1.0 public domain dedication. The author has waived all rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. Anyone can copy, modify, distribute, and perform the work, even for commercial purposes, without asking permission.
3.3 Information about our dataset:
- FitBit Fitness Tracker Data
- The dataset was generated by 30 Fitbit users who responded to a survey distributed via Amazon Mechanical Turk between 03.12.2016 and 05.12.2016.
- Column information is available in Fitbit's data dictionary.
3.4 Data Organization:
Eighteen CSV files are available for analysis, each containing different quantitative data tracked by Fitbit. The data is organized in a long format, where each row represents a single time point per subject, resulting in multiple rows for each user. Each user has a unique ID, and the data is tracked by day and time.
3.5 Data Integrity and Credibility:
The dataset has limitations, including a small sample size (30 users) and a lack of demographic information. Gender data in particular is missing, which matters because Bellabeat's products are designed for women, and this may lead to sampling bias. These gaps raise concerns about how well the sample represents the general population. Additionally, the dataset is not current, and the survey period was limited to two months. Therefore, the case study will adopt an operational approach.
4. Process Phase
For this analysis, I will primarily use R because it is easy to use, handles the amount of data to be processed, keeps documentation alongside the code, and generates data visualizations for sharing results with stakeholders.
4.1 Installing packages and opening libraries
Let's start by loading the necessary libraries that will aid our analysis.
library(tidyverse) # For data manipulation
library(skimr) # For data summary
library(janitor) # For cleaning column names
library(lubridate) # For date manipulation
library(readr) # For reading CSV files
library(dplyr) # For data manipulation
4.2 Loading the data
The data is stored in 18 CSV files. We will load the six files used in this analysis into separate data frames, and later merge the daily tables and the hourly tables into two combined data frames.
# Load the data
daily_activity <- read_csv("data/dailyActivity_merged.csv") %>%
as.data.frame()
daily_sleep <- read_csv("data/sleepDay_merged.csv") %>%
as.data.frame()
hourly_intensities <- read_csv("data/hourlyIntensities_merged.csv") %>%
as.data.frame()
hourly_calories <- read_csv("data/hourlyCalories_merged.csv") %>%
as.data.frame()
hourly_steps <- read_csv("data/hourlySteps_merged.csv") %>%
as.data.frame()
weight <- read_csv("data/weightLogInfo_merged.csv") %>%
as.data.frame()
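The six read_csv() calls above repeat one pattern. As a sketch only, the same step can be driven by a named vector of file paths. This version uses base read.csv() and writes two tiny stand-in CSV files to a temporary directory so that it runs on its own; the helper name load_csvs and the stand-in values are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: read every CSV in a named character vector of paths
# and return a named list of data frames.
load_csvs <- function(paths) {
  lapply(paths, read.csv, stringsAsFactors = FALSE)
}

# Stand-in files (made-up values) so the sketch runs without the Kaggle download.
tmp_dir <- tempfile("fitbit_demo_")
dir.create(tmp_dir)
write.csv(data.frame(Id = c(1, 1, 2), TotalSteps = c(100, 200, 300)),
          file.path(tmp_dir, "dailyActivity_merged.csv"), row.names = FALSE)
write.csv(data.frame(Id = c(1, 2), TotalMinutesAsleep = c(400, 420)),
          file.path(tmp_dir, "sleepDay_merged.csv"), row.names = FALSE)

paths <- c(
  daily_activity = file.path(tmp_dir, "dailyActivity_merged.csv"),
  daily_sleep    = file.path(tmp_dir, "sleepDay_merged.csv")
)

tables <- load_csvs(paths)
names(tables)         # list names carry over from the path vector
sapply(tables, nrow)  # row count per table
```

Keeping the tables in a named list makes it easy to apply the same check to all of them with sapply(), at the cost of writing tables$daily_activity instead of daily_activity.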
4.3 Preview the data
Let's take a look at the first few rows of each data frame to understand the structure of the data.
head(daily_activity)
head(daily_sleep)
head(hourly_intensities)
head(hourly_calories)
head(hourly_steps)
head(weight)
4.4 Check the data structure
Let's check the structure of each data frame to understand the variables and data types.
str(daily_activity)
str(daily_sleep)
str(hourly_intensities)
str(hourly_calories)
str(hourly_steps)
str(weight)
4.5 Data Cleaning
We will clean the data by addressing missing values, renaming columns, and converting data types to facilitate analysis.
4.5.1 Check number of participants
Let's check the number of participants in the dataset to ensure that the sample size is consistent across all data frames.
# Check # of participants in each data frame
length(unique(daily_activity$Id))
length(unique(daily_sleep$Id))
length(unique(hourly_intensities$Id))
length(unique(hourly_calories$Id))
length(unique(hourly_steps$Id))
length(unique(weight$Id))
## [1] 33
## [1] 24
## [1] 33
## [1] 33
## [1] 33
## [1] 8
The weight data has too few participants (8) compared to the other data frames, so we will exclude it from the analysis to avoid bias from such a small sample. All other data frames have 33 participants, except for daily_sleep, which has 24. Note that 33 is more than the 30 respondents stated in the dataset description.
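The participant check can also be written once and applied to every table. The following is a minimal sketch on toy data (the Id values are made up), using base R; it also lists which participants are missing from a table, which is useful when one table has fewer IDs than the others.

```r
# Toy stand-ins for the real tables; the Id values are made up.
tables <- list(
  daily_activity = data.frame(Id = c(1, 1, 2, 3)),
  daily_sleep    = data.frame(Id = c(1, 2, 2)),
  weight         = data.frame(Id = c(3))
)

# Number of distinct participants per table.
n_ids <- sapply(tables, function(df) length(unique(df$Id)))
n_ids

# Participants present in daily_activity but absent from daily_sleep.
missing_sleep <- setdiff(unique(tables$daily_activity$Id),
                         unique(tables$daily_sleep$Id))
missing_sleep
```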
4.5.2 Check for Duplicates
Let's check for duplicates in each data frame to ensure data integrity.
# Check for duplicates in each data frame
sum(duplicated(daily_activity))
sum(duplicated(daily_sleep))
sum(duplicated(hourly_intensities))
sum(duplicated(hourly_calories))
sum(duplicated(hourly_steps))
4.5.3 Remove Duplicates & Missing Values
Let's remove duplicates and address missing values in each data frame.
# Remove duplicates and missing values
daily_activity <- daily_activity %>% distinct() %>% drop_na()
daily_sleep <- daily_sleep %>% distinct() %>% drop_na()
hourly_intensities <- hourly_intensities %>% distinct() %>% drop_na()
hourly_calories <- hourly_calories %>% distinct() %>% drop_na()
hourly_steps <- hourly_steps %>% distinct() %>% drop_na()
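To see how much the cleaning step removes, it can help to record counts before and after. Here is a small sketch on a toy table (all values are made up) that uses base R equivalents of distinct() and drop_na().

```r
# Toy table: row 2 duplicates row 1, and row 4 has a missing value.
df <- data.frame(
  id    = c(1, 1, 2, 3),
  steps = c(100, 100, 250, NA)
)

n_before <- nrow(df)
n_dupes  <- sum(duplicated(df))       # fully duplicated rows
n_na     <- sum(!complete.cases(df))  # rows with at least one NA

# Base R equivalent of distinct() followed by drop_na().
cleaned <- unique(df)
cleaned <- cleaned[complete.cases(cleaned), , drop = FALSE]

c(before = n_before, duplicates = n_dupes, with_na = n_na, after = nrow(cleaned))
```

Reporting these counts alongside the cleaning code documents how many rows each step discarded.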
4.5.4 Rename Columns
Let's standardize the column names in each data frame to ensure consistency and ease of analysis.
# Lowercase all column names so they are consistent across data frames.
# Note: janitor::clean_names() would instead produce snake_case names
# (for example activity_date), which the rest of this analysis does not use.
daily_activity <- rename_with(daily_activity, tolower)
daily_sleep <- rename_with(daily_sleep, tolower)
hourly_intensities <- rename_with(hourly_intensities, tolower)
hourly_calories <- rename_with(hourly_calories, tolower)
hourly_steps <- rename_with(hourly_steps, tolower)
4.5.5 Convert Date Columns
Let's convert the date columns to the appropriate date format for analysis.
daily_activity <- daily_activity %>%
rename(date = activitydate) %>%
mutate(date = mdy(date))
daily_sleep <- daily_sleep %>%
rename(date = sleepday) %>%
# Convert to Date (not POSIXct) so the key has the same type as daily_activity$date
mutate(date = as_date(mdy_hms(date)))
hourly_intensities <- hourly_intensities %>%
rename(date_time = activityhour) %>%
mutate(date_time = mdy_hms(date_time))
hourly_calories <- hourly_calories %>%
rename(date_time = activityhour) %>%
mutate(date_time = mdy_hms(date_time))
hourly_steps <- hourly_steps %>%
rename(date_time = activityhour) %>%
mutate(date_time = mdy_hms(date_time))
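The daily tables are later joined on id and date, so it helps if both date columns share one class. The sketch below illustrates the difference between a Date and a POSIXct key. It uses base R parsing in place of lubridate's mdy() and mdy_hms() so that it runs without extra packages, and the two raw strings are assumed examples of the file formats, not values copied from the data.

```r
# Assumed examples of the raw date strings in the daily files.
activity_raw <- "4/12/2016"
sleep_raw    <- "4/12/2016 12:00:00 AM"

# Parse one as a calendar date and the other as a timestamp.
activity_date <- as.Date(activity_raw, format = "%m/%d/%Y")
sleep_time    <- as.POSIXct(sleep_raw, format = "%m/%d/%Y %I:%M:%S %p", tz = "UTC")

class(activity_date)  # "Date"
class(sleep_time)     # "POSIXct" "POSIXt"

# Convert the timestamp to a Date so both join keys share one class.
sleep_date <- as.Date(sleep_time, tz = "UTC")
activity_date == sleep_date
```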
4.6 Merge Data Sets
Let's merge the daily data sets into a single data frame for simplicity during analysis.
daily_data <- merge(daily_activity, daily_sleep, by = c("id", "date"))
Now let's merge the hourly data sets into a single data frame as well.
hourly_data <- merge(hourly_intensities, hourly_calories, by = c("id", "date_time")) %>%
merge(hourly_steps, by = c("id", "date_time"))
# See column structures
str(daily_data)
## 'data.frame': 410 obs. of 18 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date : Date, format: "2016-04-12" "2016-04-13" ...
## $ totalsteps : num 13162 10735 9762 12669 9705 ...
## $ totaldistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ trackerdistance : num 8.5 6.97 6.28 8.16 6.48 ...
## $ loggedactivitiesdistance: num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactivedistance : num 1.88 1.57 2.14 2.71 3.19 ...
## $ moderatelyactivedistance: num 0.55 0.69 1.26 0.41 0.78 ...
## $ lightactivedistance : num 6.06 4.71 2.83 5.04 2.51 ...
## $ sedentaryactivedistance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ veryactiveminutes : num 25 21 29 36 38 50 28 19 41 39 ...
## $ fairlyactiveminutes : num 13 19 34 10 20 31 12 8 21 5 ...
## $ lightlyactiveminutes : num 328 217 209 221 164 264 205 211 262 238 ...
## $ sedentaryminutes : num 728 776 726 773 539 775 818 838 732 709 ...
## $ calories : num 1985 1797 1745 1863 1728 ...
## $ totalsleeprecords : num 1 2 1 2 1 1 1 1 1 1 ...
## $ totalminutesasleep : num 327 384 412 340 700 304 360 325 361 430 ...
## $ totaltimeinbed : num 346 407 442 367 712 320 377 364 384 449 ...
str(hourly_data)
## 'data.frame': 22099 obs. of 6 variables:
## $ id : num 1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
## $ date_time : POSIXct, format: "2016-04-12 00:00:00" "2016-04-12 01:00:00" ...
## $ totalintensity : num 20 8 7 0 0 0 0 0 13 30 ...
## $ averageintensity: num 0.333 0.133 0.117 0 0 ...
## $ calories : num 81 61 59 47 48 48 48 47 68 141 ...
## $ steptotal : num 373 160 151 0 0 ...
# Preview the merged data sets
head(daily_data)
## id date totalsteps totaldistance trackerdistance
## 1 1503960366 2016-04-12 13162 8.50 8.50
## 2 1503960366 2016-04-13 10735 6.97 6.97
## 3 1503960366 2016-04-15 9762 6.28 6.28
## 4 1503960366 2016-04-16 12669 8.16 8.16
## 5 1503960366 2016-04-17 9705 6.48 6.48
## 6 1503960366 2016-04-19 15506 9.88 9.88
## loggedactivitiesdistance veryactivedistance moderatelyactivedistance
## 1 0 1.88 0.55
## 2 0 1.57 0.69
## 3 0 2.14 1.26
## 4 0 2.71 0.41
## 5 0 3.19 0.78
## 6 0 3.53 1.32
## lightactivedistance sedentaryactivedistance veryactiveminutes
## 1 6.06 0 25
## 2 4.71 0 21
## 3 2.83 0 29
## 4 5.04 0 36
## 5 2.51 0 38
## 6 5.03 0 50
## fairlyactiveminutes lightlyactiveminutes sedentaryminutes calories
## 1 13 328 728 1985
## 2 19 217 776 1797
## 3 34 209 726 1745
## 4 10 221 773 1863
## 5 20 164 539 1728
## 6 31 264 775 2035
## totalsleeprecords totalminutesasleep totaltimeinbed
## 1 1 327 346
## 2 2 384 407
## 3 1 412 442
## 4 2 340 367
## 5 1 700 712
## 6 1 304 320
head(hourly_data)
## id date_time totalintensity averageintensity calories
## 1 1503960366 2016-04-12 00:00:00 20 0.333333 81
## 2 1503960366 2016-04-12 01:00:00 8 0.133333 61
## 3 1503960366 2016-04-12 02:00:00 7 0.116667 59
## 4 1503960366 2016-04-12 03:00:00 0 0.000000 47
## 5 1503960366 2016-04-12 04:00:00 0 0.000000 48
## 6 1503960366 2016-04-12 05:00:00 0 0.000000 48
## steptotal
## 1 373
## 2 160
## 3 151
## 4 0
## 5 0
## 6 0
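Note that merge() keeps only rows whose keys appear in both tables by default, which is why daily_data has 410 rows while daily_activity has 940. The toy sketch below (mostly made-up values) contrasts that default with all.x = TRUE, which keeps every activity row and fills missing sleep values with NA.

```r
activity <- data.frame(
  id    = c(1, 1, 2),
  date  = as.Date(c("2016-04-12", "2016-04-13", "2016-04-12")),
  steps = c(13162, 10735, 5000)
)
sleep <- data.frame(
  id     = c(1, 2),
  date   = as.Date(c("2016-04-12", "2016-04-12")),
  asleep = c(327, 400)
)

# Default: keep only id/date pairs found in both tables.
inner <- merge(activity, sleep, by = c("id", "date"))
# all.x = TRUE: keep every activity row, with NA where no sleep record exists.
left  <- merge(activity, sleep, by = c("id", "date"), all.x = TRUE)

nrow(inner)              # 2
nrow(left)               # 3
sum(is.na(left$asleep))  # 1
```

The default is a reasonable choice when every analysis of the merged table needs both activity and sleep values. Activity-only questions are better answered from daily_activity, as the later sections do.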
5. Analyze & Share Phase
Let's conduct exploratory data analysis to gain insights into the data and identify trends that could inform Bellabeat's marketing strategy.
5.1 Daily Activity
Let's start by analyzing daily activity data to understand user behavior.
# Summary statistics for daily activity data
daily_activity %>%
select(totalsteps, calories, sedentaryminutes, lightlyactiveminutes, fairlyactiveminutes, veryactiveminutes) %>%
skim()
Table: Data summary
Name | Piped data |
Number of rows | 940 |
Number of columns | 6 |
_______________________ | |
Column type frequency: | |
numeric | 6 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
totalsteps | 0 | 1 | 7637.91 | 5087.15 | 0 | 3789.75 | 7405.5 | 10727.00 | 36019 | ▇▇▁▁▁ |
calories | 0 | 1 | 2303.61 | 718.17 | 0 | 1828.50 | 2134.0 | 2793.25 | 4900 | ▁▆▇▃▁ |
sedentaryminutes | 0 | 1 | 991.21 | 301.27 | 0 | 729.75 | 1057.5 | 1229.50 | 1440 | ▁▁▇▅▇ |
lightlyactiveminutes | 0 | 1 | 192.81 | 109.17 | 0 | 127.00 | 199.0 | 264.00 | 518 | ▅▇▇▃▁ |
fairlyactiveminutes | 0 | 1 | 13.56 | 19.99 | 0 | 0.00 | 6.0 | 19.00 | 143 | ▇▁▁▁▁ |
veryactiveminutes | 0 | 1 | 21.16 | 32.84 | 0 | 0.00 | 4.0 | 32.00 | 210 | ▇▁▁▁▁ |
Insights:
- Total Steps: Users average 7,638 steps per day. Walking 10,000 steps daily is associated with several health benefits, including improved cardiovascular health, weight management, better mood, and enhanced joint health. Regular walking can lower the risk of heart disease, diabetes, and high blood pressure, while also helping to reduce stress and improve overall mental well-being [1][2][3]. The average falls well short of 10,000, which suggests that users are, on average, not meeting this commonly cited daily target.
- Activity Levels: While some participants meet recommended physical activity levels, many do not. Physical activity varies widely: the median day includes only 4 very active minutes, while the most active day logs 210, and sedentary time averages about 991 minutes per day. This indicates that there is an opportunity to encourage more users to engage in physical activity.
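Averages alone do not show how often a target is met. As a sketch, the share of days at or above a step threshold can be computed directly. The six values below are the first six daily totals from the daily_data preview; they belong to a single user, so the resulting shares are illustrative only.

```r
# First six daily step totals shown in the preview of daily_data.
steps <- c(13162, 10735, 9762, 12669, 9705, 15506)

# A logical vector averages to the fraction of TRUE values.
share_10k  <- mean(steps >= 10000)  # fraction of days at or above 10,000 steps
share_7500 <- mean(steps >= 7500)   # fraction of days at or above 7,500 steps

round(c(share_10k = share_10k, share_7500 = share_7500), 2)
```

Applied to the full totalsteps column, the same two lines would give the share of all recorded days that reach each target.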
5.2 Daily Sleep
Next, let's analyze daily sleep data to understand user sleep patterns.
# Summary statistics for daily sleep data
daily_sleep %>%
select(totalminutesasleep, totaltimeinbed) %>%
skim()
Table: Data summary
Name | Piped data |
Number of rows | 410 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
totalminutesasleep | 0 | 1 | 419.17 | 118.64 | 58 | 361.00 | 432.5 | 490 | 796 | ▁▂▇▃▁ |
totaltimeinbed | 0 | 1 | 458.48 | 127.46 | 61 | 403.75 | 463.0 | 526 | 961 | ▁▃▇▁▁ |
Let's create a visualization grouped by weekday.
# Create a new column for the weekday
daily_sleep <- daily_sleep %>%
mutate(weekday = wday(date, label = TRUE))
# Plot total minutes asleep by weekday
daily_sleep %>%
ggplot(aes(x = weekday, y = totalminutesasleep, fill = weekday)) +
geom_boxplot() +
labs(title = "Total Minutes Asleep by Weekday",
x = "Weekday",
y = "Total Minutes Asleep") +
theme_minimal()
Summary of the data by weekday
# Summary of total minutes asleep by weekday
daily_sleep %>%
group_by(weekday) %>%
summarize(avg_total_minutes_asleep = mean(totalminutesasleep))
## # A tibble: 7 × 2
## weekday avg_total_minutes_asleep
## <ord> <dbl>
## 1 Sun 453.
## 2 Mon 420.
## 3 Tue 405.
## 4 Wed 435.
## 5 Thu 401.
## 6 Fri 405.
## 7 Sat 419.
Insights:
- Total Minutes Asleep: The average total minutes asleep is 419.2 (just under 7 hours), which sits at the very bottom of the recommended 7-9 hours of sleep per night for adults. Sleep is essential for overall health and well-being, with insufficient sleep linked to various health issues, including obesity, heart disease, and mental health problems [4][5].
- Weekday vs. Weekend Sleep: Sunday has the longest average sleep (about 453 minutes), while Saturday (about 419 minutes) is close to Monday. Tuesday, Thursday, and Friday all average below the 7-hour minimum (420 minutes). This suggests that users may be catching up on sleep on Sunday, and that they may not be getting enough sleep during much of the week.
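The weekday comparison can be checked against the 7-hour minimum with a few lines of base R. This sketch uses the rounded weekday averages from the summary table above, so values near the threshold (Monday and Saturday) are sensitive to rounding. The weekend and weekday figures are unweighted means of the day-level averages, not means over individual nights.

```r
# Rounded averages (minutes asleep) copied from the weekday summary table.
avg_asleep <- c(Sun = 453, Mon = 420, Tue = 405, Wed = 435,
                Thu = 401, Fri = 405, Sat = 419)

# The same averages expressed in hours.
round(avg_asleep / 60, 2)

# Days whose average is below the 7-hour minimum (420 minutes).
below_7h <- names(avg_asleep)[avg_asleep < 7 * 60]
below_7h

# Unweighted comparison of weekend and weekday averages.
weekend <- mean(avg_asleep[c("Sat", "Sun")])
weekday <- mean(avg_asleep[c("Mon", "Tue", "Wed", "Thu", "Fri")])
c(weekend = weekend, weekday = weekday)
```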
5.3 Daily Steps v. Calories Burned
Let's analyze the relationship between daily steps and calories burned to understand the impact of physical activity on energy expenditure.
# Summary statistics for daily steps and calories
daily_activity %>%
select(totalsteps, calories) %>%
skim()
Table: Data summary
Name | Piped data |
Number of rows | 940 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
numeric | 2 |
________________________ | |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
totalsteps | 0 | 1 | 7637.91 | 5087.15 | 0 | 3789.75 | 7405.5 | 10727.00 | 36019 | ▇▇▁▁▁ |
calories | 0 | 1 | 2303.61 | 718.17 | 0 | 1828.50 | 2134.0 | 2793.25 | 4900 | ▁▆▇▃▁ |
Let's create a visualization to check correlation between steps and calories burned.
# Create a scatter plot of steps vs. calories
ggplot(data = daily_activity, aes(x = totalsteps, y = calories)) +
geom_point() +
geom_smooth() +
labs(title = "Total Steps vs. Calories") +
theme_minimal()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
Insights:
- Steps vs. Calories: The scatter plot shows a positive relationship between the number of steps taken and the number of calories burned. Users who take more steps tend to burn more calories, which supports weight management and overall health. Encouraging users to increase their daily step count could help improve their overall health and well-being.
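The plot shows the direction of the relationship; a correlation coefficient puts a number on its strength. The sketch below uses cor() on the first six rows of the daily_data preview. With only six days from one user, the result illustrates the calculation and does not estimate the correlation for the full dataset.

```r
# Steps and calories from the first six rows of the daily_data preview.
steps    <- c(13162, 10735, 9762, 12669, 9705, 15506)
calories <- c(1985, 1797, 1745, 1863, 1728, 2035)

pearson  <- cor(steps, calories)                       # linear association
spearman <- cor(steps, calories, method = "spearman")  # rank-based association

c(pearson = pearson, spearman = spearman)
```

On the full data the equivalent call would be cor(daily_activity$totalsteps, daily_activity$calories). A positive coefficient describes association only and does not show that steps alone drive calorie expenditure.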
5.4 Hourly Intensity
Let's now take a look at data on hourly intensity to understand activity patterns. We first need to split date and time values.
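# Format the timestamp explicitly before splitting; otherwise midnight values lose their time part and hour becomes NA
hourly_intensities <- hourly_intensities %>%
mutate(date_time = format(date_time, "%Y-%m-%d %H:%M:%S")) %>%
separate(date_time, into = c("date", "hour"), sep = " ")
head(hourly_intensities)
## id date hour totalintensity averageintensity
## 1 1503960366 2016-04-12 00:00:00 20 0.333333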
## 2 1503960366 2016-04-12 01:00:00 8 0.133333
## 3 1503960366 2016-04-12 02:00:00 7 0.116667
## 4 1503960366 2016-04-12 03:00:00 0 0.000000
## 5 1503960366 2016-04-12 04:00:00 0 0.000000
## 6 1503960366 2016-04-12 05:00:00 0 0.000000
hourly_intensities <- hourly_intensities %>%
group_by(hour) %>%
drop_na() %>%
summarise(avg_total_int = mean(totalintensity))
Let's make a visualization from this data.
ggplot(data = hourly_intensities, aes(x = hour, y = avg_total_int)) +
geom_col(fill = "#350352") +
labs(title = "Average Total Intensity vs Hour") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Insights:
- Hourly Intensity: The average total intensity varies throughout the day, with peaks in the morning and evening. This suggests that users are more active during these times, which could be due to work schedules, exercise routines, or other factors. Understanding these patterns can help Bellabeat tailor their marketing strategies to target users during peak activity times.
- Peak Activity Times: The data shows that users are most active in the morning and evening, which are common times for exercise and physical activity. The evening peak, around 5:00pm to 7:00pm, coincides with the time when people usually get off work. This information can be used to target users with marketing messages promoting physical activity during these peak times.
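When averaging by hour, midnight deserves care: converting a POSIXct value at exactly 00:00:00 to text without an explicit format can drop the time part in some R versions, which would leave the hour empty after splitting on a space. The sketch below extracts the hour with an explicit format string and averages by hour in base R. The first two intensity values come from the hourly_data preview; the second day is made up.

```r
ts <- as.POSIXct(c("2016-04-12 00:00:00", "2016-04-12 01:00:00",
                   "2016-04-13 00:00:00", "2016-04-13 01:00:00"), tz = "UTC")
intensity <- c(20, 8, 10, 4)  # first two from the preview; last two are made up

# An explicit format string always includes the time, even at midnight.
hour <- format(ts, "%H:%M:%S")
hour

# Mean intensity per hour of day, across all dates.
avg_by_hour <- tapply(intensity, hour, mean)
avg_by_hour
```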
5.5 Hourly Steps
Let's analyze hourly steps data to understand user step patterns throughout the day. We first need to split date and time values.
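# Format the timestamp explicitly before splitting; otherwise midnight values lose their time part and hour becomes NA
hourly_steps <- hourly_steps %>%
mutate(date_time = format(date_time, "%Y-%m-%d %H:%M:%S")) %>%
separate(date_time, into = c("date", "hour"), sep = " ")
head(hourly_steps)
## id date hour steptotal
## 1 1503960366 2016-04-12 00:00:00 373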
## 2 1503960366 2016-04-12 01:00:00 160
## 3 1503960366 2016-04-12 02:00:00 151
## 4 1503960366 2016-04-12 03:00:00 0
## 5 1503960366 2016-04-12 04:00:00 0
## 6 1503960366 2016-04-12 05:00:00 0
hourly_steps <- hourly_steps %>%
group_by(hour) %>%
drop_na() %>%
summarise(avg_total_steps = mean(steptotal))
Let's make a visualization from this data.
ggplot(data = hourly_steps, aes(x = hour, y = avg_total_steps, fill = avg_total_steps)) +
geom_col() +
labs(title = "Average Total Steps vs Hour") +
theme_minimal() +
scale_fill_gradient(low = "red", high = "green")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Insights:
- Hourly Steps: This data shows the same pattern as hourly intensity, with peaks in the morning and evening. Users tend to take more steps during these times, so marketing messages that promote step-count goals could also be timed around these peaks.
5.6 Steps by Weekday
Let's analyze the average number of steps taken by users on each weekday to understand weekly activity patterns.
# Create a new column for the weekday
daily_activity <- daily_activity %>%
mutate(weekday = wday(date, label = TRUE))
Let's create a visualization to show the distribution of daily steps for each weekday (the center line of each box is the median, not the mean), with dashed horizontal lines at 7.5k and 10k steps.
# Plot daily steps by weekday
daily_activity %>%
ggplot(aes(x = weekday, y = totalsteps, fill = weekday)) +
geom_boxplot() +
geom_hline(yintercept = 7500, linetype = "dashed", color = "red") +
geom_hline(yintercept = 10000, linetype = "dashed", color = "green") +
labs(title = "Daily Steps by Weekday",
x = "Weekday",
y = "Total Steps") +
theme_minimal()
Insights:
- Steps by Weekday: Users tend to take more steps on weekends compared to weekdays. This suggests that users may be more active on weekends, which could be due to having more free time to engage in physical activities. Bellabeat could leverage this information to encourage users to maintain their activity levels during the week.
- Average Steps: Although 10k steps is the most commonly cited daily target, a minimum of about 7.5k steps is also beneficial for health [6][7][8]. The median day in this dataset (about 7,406 steps) falls just below the 7.5k mark, so on at least half of the recorded days users do not reach even this lower target.
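A boxplot shows distributions, so a numeric weekend versus weekday comparison is a useful companion. Here is a minimal sketch in base R using the six dates and step totals from the daily_data preview. They come from one user, so the direction of the difference here says nothing about the full dataset.

```r
# Dates and step totals from the first six rows of the daily_data preview.
dates <- as.Date(c("2016-04-12", "2016-04-13", "2016-04-15",
                   "2016-04-16", "2016-04-17", "2016-04-19"))
steps <- c(13162, 10735, 9762, 12669, 9705, 15506)

# "%u" gives the ISO weekday number (1 = Monday ... 7 = Sunday),
# which does not depend on the locale's day names.
iso_day    <- as.integer(format(dates, "%u"))
is_weekend <- iso_day >= 6

# Mean steps for weekend days and for weekdays.
means <- tapply(steps, ifelse(is_weekend, "weekend", "weekday"), mean)
means
```

With the analysis data frame, the same grouping could be applied to daily_activity$date and daily_activity$totalsteps.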
6. Recommendations
Bellabeat's mission is to empower women's health through technology and data. Based on the data analysis, here are key marketing strategy recommendations:
- Monthly Events: Organize monthly challenges or events to encourage users to increase their daily step count and physical activity levels. Offer rewards or incentives to motivate participation when they use Bellabeat products.
- Target Peak Activity Times: Use notifications to engage users during peak times (morning and evening) to encourage physical activity. Weekends are also a good time to promote wellness activities since users tend to be more active during this time.
- Goal Setting: Encourage users to set daily step goals and track progress to motivate them to stay active.
Specifically For Bellabeat's Time Smart Watch:
- Improve Activity Tracking: Provide real-time feedback and encourage daily activity. For example, a vibration alert could prompt users who have been inactive for too long, and a notification could celebrate reaching the daily step goal.
- Enhance Sleep Monitoring: Offer insights and recommendations to improve sleep quality. Provide bedtime reminders to help users establish a healthy sleep routine.
- Introduce Stress Management: Provide tools to help manage stress and promote relaxation. Offer guided breathing exercises or mindfulness activities to reduce stress levels.
Specifically For Bellabeat's App:
- Personalized Guidance: Offer tailored advice on wellness, as well as data visualization to help users understand their health and wellness trends.
- Resources and Tips: Provide articles, videos, and resources on physical activity, sleep, nutrition, and mental health to educate and motivate users. This content could also become another revenue stream through partnerships with health and wellness brands.
- Community Support: Create a user community for shared experiences and motivation. Encourage users to share their progress, challenges, and successes and provide a platform for peer support.
7. References
- Mayo Clinic - Walking: Trim your waistline, improve your health
- American Heart Association - Is 10,000 steps really a magic number for health?
- Cleveland Clinic - Do You Really Need 10,000 Steps a Day?
- CDC - How Much Sleep Do I Need?
- National Sleep Foundation - How Much Sleep Do We Really Need?
- JAMA Network - Association of Step Volume and Intensity With All-Cause Mortality in Older Women
- NIH Research Matters - How Many Steps Are Better for Health?
- Harvard Health - 10,000 steps a day — or fewer?