
Google Data Analytics Capstone, Bellabeat


INTRODUCTION

In this case study, we act as a junior data analyst on the marketing analytics team at Bellabeat, a high-tech manufacturer of health-focused products for women, tasked with analyzing smart device data to gain insight into how consumers use their smart devices. The insights we discover will then help guide marketing strategy for the company. We present our analysis, along with the insights and high-level recommendations, to the executive team at Bellabeat.



INSIGHTS

Insight #1: Distribution of participants based on steps

On average, a person in the dataset walks 8338 steps a day, with a median of 8054 steps. Research by the Mayo Clinic suggests the average American walks 3000 to 4000 steps per day. The people in the dataset are therefore considerably more active than the general population, but still below the recommended goal of 10000 steps per day, which helps reduce the risk of a number of common health problems, including:

  • Heart disease
  • Obesity
  • Diabetes
  • High blood pressure
  • Depression

The average alone does not tell us much about the distribution. Taking a look at how the number of people varies across daily step counts for the 33 participants -

We can see that the distribution is close to a normal distribution. Instead of analysing the entire population as a single group, it will be better to divide it into three groups:

  • Group A: Less than 6000 steps per day.
  • Group B: Between 6000 and 12000 steps per day.
  • Group C: More than 12000 steps per day.
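These cutoffs map directly onto R's `cut()`. A minimal sketch, assuming a data frame with an `AvgSteps` column of average daily steps per participant; the frame and its values here are purely illustrative:

```r
library(dplyr)

# Illustrative stand-in for the per-participant averages computed later
avg_steps <- data.frame(AvgSteps = c(3500, 8200, 12500, 5900, 11800))

# Bucket participants into the three activity groups by the cutoffs above:
# <6000 = A, 6000-12000 = B, >12000 = C
avg_steps <- avg_steps %>%
  mutate(stepGroup = cut(AvgSteps,
                         breaks = c(-Inf, 6000, 12000, Inf),
                         labels = c("A", "B", "C")))

table(avg_steps$stepGroup)
```

`cut()` returns the groups as a factor, which makes later `group_by(stepGroup)` summaries straightforward.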

Taking a look at the distribution of each group -

This tells us that close to 30% of the participants (Group A) are not very active and lead a near-sedentary lifestyle, around 60% (Group B) are moderately active, which is the biggest chunk, and a little over 10% (Group C) are very active.

This will help us analyze the population better and make suitable recommendations for each group, instead of making blanket recommendations for everyone at once.


Insight #2: Analysis of steps on the basis of days of the week

We were given the date of each observation, a very valuable piece of information, since it lets us add granularity to our analysis. Instead of looking at the dataset as a whole, we can split it by day of the week and find out how usage patterns vary across the week. Taking a look at the average steps again, now split by day of the week -

This gives us very interesting insights! The most active day on average is Saturday, with almost 9000 steps, and the least active is Sunday, with only 7400. This information is gold, because it tells us that users of the fitness device engage in physical activity not only on weekdays but also on the weekend. One hypothesis is that most people in this dataset give special attention to their health on Saturday after a week of work, and take a rest day on Sunday. This will potentially help us make business recommendations later on.

Taking a glance at the same data for individual participants for any surprises -

The data is quite varied across the population, so applying our groupings here helps make much better sense of it -

For Group A, the most active days are Monday and Saturday; for B, Tuesday and Saturday; and for C, Tuesday, Wednesday and Saturday. This tells us that Saturday is an important day for people in all three groups, alongside one high-activity day during the week. Also, Sunday is the least active day, supporting our assumption that it is a rest day for everyone.

To drive the point home and make this data a little easier to read, we can combine the weekdays into a single group and show the same results -
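Collapsing Monday through Friday into one label is a one-line recode. A sketch with `case_when()`, using the day coding from the methodology (1 = Sunday ... 7 = Saturday); the per-day step values here are illustrative:

```r
library(dplyr)

# Illustrative per-weekday averages (1 = Sunday ... 7 = Saturday)
week_avg <- data.frame(weekday  = 1:7,
                       avgSteps = c(7406, 8488, 8953, 8207, 8247, 7853, 8979))

# Collapse Mon-Fri into a single "WEEK" label, then average within each label
three_bars <- week_avg %>%
  mutate(dayGroup = case_when(weekday == 1 ~ "SUN",
                              weekday == 7 ~ "SAT",
                              TRUE         ~ "WEEK")) %>%
  group_by(dayGroup) %>%
  summarize(avgSteps = mean(avgSteps))
```

Note that averaging the five weekday averages weights each day equally; with the real data it is better to aggregate from the raw observations so days with more records count proportionally.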


Insight #3: Sleep vs Activity of the day

It would be easy (and wrong!) to assume that more activity, i.e. more steps in a day, leads to more time asleep. Sleep is a very subjective quality: it varies not just across individuals but also for the same person, and can depend on numerous factors including health, work, psychological and mental state, and much more.

Checking the correlation between these variables gives a value of -0.24, and this is further supported by this scatter plot -

This does not indicate any meaningful correlation between the two, and it is safe to conclude that total time slept does not depend on the activity (steps taken) that day.
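The check itself is a single `cor()` call plus a scatter plot. A self-contained sketch with made-up rows standing in for the merged activity and sleep data (on the real data the coefficient came out near -0.24):

```r
library(ggplot2)

# Hypothetical rows standing in for the joined activity + sleep data
merged_df <- data.frame(
  TotalSteps         = c(4200, 11500, 8300, 6100, 13800, 7600),
  TotalMinutesAsleep = c(430, 390, 445, 410, 420, 400)
)

# Pearson correlation coefficient, always in [-1, 1]
cor(merged_df$TotalSteps, merged_df$TotalMinutesAsleep)

# Scatter plot with a linear trend line to eyeball the relationship
ggplot(merged_df, aes(x = TotalSteps, y = TotalMinutesAsleep)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```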


Insight #4: Analysis of sleep on the basis of days of the week

Continuing the same analysis we did for the number of steps, we take a look at the average time slept on each day of the week -

Further supporting our hypothesis of the rest day, we see that participants slept the most on Sunday, with the average being around 8 hours. This was followed by around 7.5 hours on Saturday and less than 7 hours on the weekdays.

Applying our groupings based on steps and further weeks on this analysis -

This chart gives us wonderful insights into how the sleeping pattern varies across the different groups-

  • For the least active Group A, sleeping time was almost constant throughout the week, around 7.4 hours.
  • For moderately active Group B, sleep was lowest on the weekdays around 7 hours, and around 8 hours on the weekends, with Sundays taking a slight edge.
  • For the very active Group C, sleep was very low on the weekdays at around 5.4 hours, around 6 hours on Saturday, which was their most active day, and, in stark contrast to the others, around 9 hours on Sunday, the rest day!

Insight #5: Sleep vs Sedentary time of the day

The column containing Sedentary data had a number of errors and was quite misleading. Upon inspection, 28% of the observations had Sedentary time above 20 hours of the 24-hour day, and more than 55% had it above 16 hours. This led me to the hypothesis that this value is in fact the complete sedentary time for the whole day, including sleep. This was further supported by the fact that over 8% of the observations recorded Sedentary time as a full 24 hours, even though the participants slept for a few hours.

Therefore, to get the actual sedentary time, i.e., time spent in sedentary activity apart from the time spent in bed sleeping, I subtracted the time spent in bed from the given sedentary time.
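The correction is a simple column subtraction, dropping rows that go negative (those are handled in the cleaning notes in the methodology). A sketch with hypothetical values; column names follow the Fitbit export:

```r
library(dplyr)

# Hypothetical rows standing in for the merged activity + sleep data
merged_df <- data.frame(
  SedentaryMinutes = c(1200, 1350, 700, 400),
  TotalTimeInBed   = c(480, 510, 430, 520)
)

# SedentaryMinutes appears to include time in bed, so subtract it out;
# rows where the result is negative are treated as errors and dropped
merged_df <- merged_df %>%
  mutate(correctedSedentary = SedentaryMinutes - TotalTimeInBed) %>%
  filter(correctedSedentary >= 0)
```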

Next, I plotted a scatterplot for Sleep and the corrected Sedentary time for each observation -

This again proved to be a very important result: the correlation between corrected sedentary time and sleep hours came out to -0.76 on a scale of -1 to +1. This is a strong negative correlation, indicating that the more time spent in sedentary activity throughout the day, the less time actually slept. In other words, people engaged in some form of activity slept more (and arguably better) than people who chose a sedentary lifestyle. This will also be used to guide the business recommendations later.



RECOMMENDATIONS

Based on the insights above, we have come up with the following recommendations. These features could be added to the companion mobile application of these devices, delivered as vibration reminders on current devices, and shown as on-screen messages on future devices that have a display.

  • Since we divided the population into three groups as A-Sedentary, B-Moderately Active and C-Very Active, it would be best to tailor different recommendations and marketing strategies for each group.
    • For group A, where the number of steps is much lower than the average, we can motivate them to be more active by raising awareness of the benefits of activity and the positive effects it will have on their lives, as well as periodically informing them of the risks associated with a sedentary lifestyle.
    • For group B, who are moderately active, we can give them positive reinforcement that they are doing well, and add timely messages about the little extra they need to do to get above the recommended numbers. Also, as these are regular users, we can add streaks for consecutive active days, which would encourage them to stay regular with their workouts and use the device to maintain the streak.
    • For group C, the most active group, participants regularly engage in physical activities spanning categories like the gym, trekking, and running. If we surveyed these users to collect the top categories and incorporated them as dedicated workout modes, they would be more likely to take the device with them the next time they engage in those activities.
  • The majority of people across all groups were most active on Saturday, followed by a mid-week day, and least active on Sunday. We can motivate them with positive messages and quotes on Saturday morning, making it more likely they go ahead with their workout and helping those still deciding between working out and skipping that day. Also, since most people rested on Sunday, we can send timely reminders throughout that day about the benefits of a good weekly rest day.
  • Since there was a clear negative correlation between sedentary time during the day and time slept, this can be used to motivate participants towards a more active lifestyle so that they sleep more and sleep better. Their sleep schedule can also be analyzed, with a reminder an hour before their usual bedtime to reduce screen time and wind down for better sleep at night.


METHODOLOGY

Here, we will summarize the process we went through to complete this analysis. This is divided into six phases - Ask, Prepare, Process, Analyze, Share and Act.

Ask

Q: What is the problem we are trying to solve?
A: The big-picture problem we are trying to solve is to unlock Bellabeat’s potential and convert it to a large player in the global smart device market. The company needs marketing strategies and a complete business plan for this to happen. The analytics team needs to come up with some of those strategies, mainly - analyzing how non-Bellabeat smart-device owners are using their devices, and implementing the insights from the trends in that data to improve Bellabeat products.


Primary Stakeholder: Urška Sršen, Bellabeat’s cofounder and Chief Creative Officer. Secondary Stakeholders: Sando Mur, mathematician and Bellabeat’s cofounder (a key member of the Bellabeat executive team), as well as the Bellabeat marketing analytics team.


Business task: Analyze the smart device data provided to draw insights into device usage patterns, and make recommendations to drive marketing strategies.


Prepare

The data for this analysis has been downloaded from ‘https://www.kaggle.com/arashnic/fitbit’.

The dataset contains personal fitness tracker data from thirty Fitbit users, including information about daily activity, steps, and heart rate that can be used to explore users’ habits. Some of these metrics are recorded once per day, while others are recorded once per minute.

This data is available in the Public Domain (CC0), and adheres to the standards of data ethics and privacy. The users consented to the submission of personal tracker data, including minute-level output for physical activity, heart rate, and sleep monitoring, and no personal information was included in any of these datasets.

Apart from this, sadly the data does not ROCCC (Reliable, Original, Comprehensive, Current, Cited). It wasn’t very reliable, with multiple errors and inconsistencies within the dataset. Although it was original, as Fitbit collected it using their own devices, it wasn’t comprehensive, since observations for several users were very limited or not available at all. It covered only one month of 2016, so it wasn’t current. Finally, it wasn’t vetted or cited, as there is no mention of Fitbit checking or approving this dataset.


Process

For this analysis, we will be using R for the complete data analysis process. The ‘tidyverse’ package bundles a number of useful packages and is extremely versatile, covering all the tools needed to work with this dataset. For visualizations, we will use ggplot2, and later Tableau to create a dashboard.

The Data Cleaning Process:

  • For the first dataset DayActivity:
    • There were 33 participants, but the number of records per participant was not equal. Participant Id 4057192912 had only 4 observations, which do not even span a week; keeping it would have biased our analysis towards those 4 values, so it was removed from the dataset. The remaining participants had at least 18 records each.
    • There were a number of observations where TotalSteps and all the corresponding variables were 0, and SedentaryTime was 1440 minutes, or 24 hours. These days must have been skipped or mis-recorded during collection, so these rows had to be removed too.
    • Similarly, there were observations where the Sedentary Time was 24 hours, indicating no movement throughout the day, while the recorded steps ranged from 3000 to 12000. This was clearly an error, and these records were removed too. The same was done where sleep was more than a minute and SedentaryTime was 0.
    • An outlier with steps = 36019 from participantId = 2 was removed, as the mean and median for that participant were below 5000 steps and this single record would have skewed the averages considerably.
    • Date was converted from “m/d/y” to “m-d-y” to make it compatible with further functions.
  • For the second dataset SleepDay:
    • There were 24 participants, out of which 8 participants had 5 records or less, and one had 8 records. All these participants were removed and the new minimum records for a participant was 15, which was at least two weeks worth of records.
    • Outliers for the sleep dataset were not removed because they did seem plausible, as sleep is a very subjective quality in itself.
    • While plotting SedentaryTime vs TotalTimeAsleep, a handful of records had a higher value for TotalTimeInBed than SedentaryTime. This was again part of the ambiguity in the Sedentary time column, and in accordance with our previous assumption, supported by a number of key factors, all those records were removed too.
    • Sleep data had a date-time instead of a date. String functions were used to split on “ ” (space), and the extracted date was used to merge with the activity dataset.
    • Observations where the corrected sedentary time was less than 0 were not considered for the correlation.
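The activity-side filters above can be sketched as one dplyr pipeline. The tiny frame below only illustrates the shape of the raw data; the dropped Id, the zero-step rule, the 1440-minute rule, and the 36019-step outlier are the ones listed in the cleaning notes:

```r
library(dplyr)

# Tiny illustrative stand-in for dailyActivity_merged.csv
activity <- data.frame(
  Id               = c(4057192912, 1503960366, 1503960366, 1503960366),
  ActivityDate     = c("4/12/2016", "4/13/2016", "4/14/2016", "4/15/2016"),
  TotalSteps       = c(5000, 0, 36019, 10735),
  SedentaryMinutes = c(700, 1440, 800, 776)
)

activity_clean <- activity %>%
  filter(Id != 4057192912,            # participant with too few records
         TotalSteps > 0,              # all-zero rows are collection errors
         SedentaryMinutes != 1440,    # full-day sedentary time is an error
         SedentaryMinutes != 0,
         TotalSteps != 36019) %>%     # the single extreme outlier
  mutate(ActivityDate = gsub("/", "-", ActivityDate))   # m/d/y -> m-d-y
```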

Analyze and Share

I have combined the analysis and visualization part, because one often gives insights into the other.

  • Starting with the clean data, the first step was to get an idea of the distribution of data. The summary() function was of help here to get information about the mean, median, maximum and minimum values, etc of all the required columns.
  • To get more information than just a mean, Total steps were plotted against their frequency in a binned bar chart. Three groups of width 6000 steps were made from this - A, B and C, and these were used for all further analysis.
  • wday function of lubridate was used to calculate which day of the week that date belonged to, and a column with a label for each date was added, with 1=Sunday, 2=Monday, and so on till 7=Saturday.
  • The dataset was grouped on these weekdays to get a more granular analysis of activity throughout the week.
  • Using the previously assigned groups, a column to assign a group label (A, B, C) was added, and the data was grouped into these, then again divided on the weekday.
  • To make it a bit easier to understand, a new label was assigned where Mon-Fri was assigned WEEK, and the same analysis was carried out as above. It has three bars now - Saturday, Sunday and Weekday.
  • Activity and sleep datasets were merged on Id and ActivityDate.
  • Correlation scatter-plot was drawn for time slept and steps taken.
  • Sleep distribution was calculated over the days of the week, and was further divided into our groups and again plotted on a bar chart.
  • A new column for corrected Sedentary time was introduced, with the value being the given Sedentary time minus the total time in bed, and a scatter plot was drawn for the same, along with a correlation value.
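The date extraction and merge steps above can be sketched as follows; the rows are hypothetical, and `SleepDay` is assumed to hold a combined "date time" string as in the Fitbit export (e.g. "4/12/2016 12:00:00 AM"):

```r
library(dplyr)

# Hypothetical sleep rows; SleepDay holds "date time" in one string
sleep <- data.frame(
  Id                 = c(1503960366, 1503960366),
  SleepDay           = c("4/12/2016 12:00:00 AM", "4/13/2016 12:00:00 AM"),
  TotalMinutesAsleep = c(327, 384)
)

# Split each SleepDay on the space and keep only the date part
sleep <- sleep %>%
  mutate(ActivityDate = sapply(strsplit(SleepDay, " "), `[`, 1))

# The extracted date can then be used to join with the activity frame:
# merged <- inner_join(activity, sleep, by = c("Id", "ActivityDate"))
```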

A detailed analysis with all the code and the output is available in the appendix.


Act

  • Several insights were drawn from this analysis, and are shared comprehensively in the Insights Section of this page.
  • 3 key recommendations were also made for the mobile app or the display of future smart products, shared in the Recommendations Section.


APPENDIX: WALKTHROUGH

Step 0: Installing and importing packages

install.packages("tidyverse")
install.packages("skimr")
install.packages("lubridate")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.0     v forcats 0.5.1
library(skimr)
library(tibble)
library(lubridate)

Step 1: Importing and Cleaning Activity

This will be the first dataset we will be analyzing.

Importing the dataset

activity <- read.csv('C://Users/hamda/Desktop/Portfolio Projects/3 Google/data/dailyActivity_merged.csv')

Quick summary of the data

str(activity)
## 'data.frame':    940 obs. of  15 variables:
##  $ Id                      : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ ActivityDate            : chr  "4/12/2016" "4/13/2016" "4/14/2016" "4/15/2016" ...
##  $ TotalSteps              : int  13162 10735 10460 9762 12669 9705 13019 15506 10544 9819 ...
##  $ TotalDistance           : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ TrackerDistance         : num  8.5 6.97 6.74 6.28 8.16 ...
##  $ LoggedActivitiesDistance: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveDistance      : num  1.88 1.57 2.44 2.14 2.71 ...
##  $ ModeratelyActiveDistance: num  0.55 0.69 0.4 1.26 0.41 ...
##  $ LightActiveDistance     : num  6.06 4.71 3.91 2.83 5.04 ...
##  $ SedentaryActiveDistance : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ VeryActiveMinutes       : int  25 21 30 29 36 38 42 50 28 19 ...
##  $ FairlyActiveMinutes     : int  13 19 11 34 10 20 16 31 12 8 ...
##  $ LightlyActiveMinutes    : int  328 217 181 209 221 164 233 264 205 211 ...
##  $ SedentaryMinutes        : int  728 776 1218 726 773 539 1149 775 818 838 ...
##  $ Calories                : int  1985 1797 1776 1745 1863 1728 1921 2035 1786 1775 ...

Working with the dataset

Number of participants:

n_distinct(activity$Id)
## [1] 33

This tells us that there are 33 distinct participants in this dataset.

To make it easier to work with and visualize the data, we will be adding an identifier for the ID, from 1-33

activity <- activity %>% 
  mutate(partID = as.integer(as.factor(Id)))
activity %>% 
  group_by(partID) %>% 
  summarize(Id = mean(Id))
## # A tibble: 33 x 2
##    partID         Id
##     <int>      <dbl>
##  1      1 1503960366
##  2      2 1624580081
##  3      3 1644430081
##  4      4 1844505072
##  5      5 1927972279
##  6      6 2022484408
##  7      7 2026352035
##  8      8 2320127002
##  9      9 2347167796
## 10     10 2873212765
## # ... with 23 more rows

Before we proceed with the analysis, we need to count the number of observations for each participant. This will ensure that the results are fair and don’t skew in favour of one participant.

activity %>% 
  group_by(Id, partID) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 33 x 3
## # Groups:   Id, partID [33]
##            Id partID     n
##         <dbl>  <int> <int>
##  1 4057192912     14     4
##  2 2347167796      9    18
##  3 8253242879     29    19
##  4 3372868164     11    20
##  5 6775888955     24    26
##  6 7007744171     26    26
##  7 6117666160     22    28
##  8 6290855005     23    29
##  9 8792009665     32    29
## 10 1644430081      3    30
## # ... with 23 more rows

We listed the number of observations for each participant in ascending order and found that for partID = 14, there were only 4 observations in the dataset. This will therefore not be useful in further analysis and has to be removed. There are a few more with 18-20 days of data, which although not ideal, will give us their information for around 3 weeks and will enable us to find trends in their usage pattern.

Removing observations for participant-14.

activity <- activity[ !(activity$partID == 14), ]

Checking removed observations

activity %>% 
  group_by(Id, partID) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 32 x 3
## # Groups:   Id, partID [32]
##            Id partID     n
##         <dbl>  <int> <int>
##  1 2347167796      9    18
##  2 8253242879     29    19
##  3 3372868164     11    20
##  4 6775888955     24    26
##  5 7007744171     26    26
##  6 6117666160     22    28
##  7 6290855005     23    29
##  8 8792009665     32    29
##  9 1644430081      3    30
## 10 3977333714     12    30
## # ... with 22 more rows

Taking a look at the statistics of the dataset now

activity %>% 
  select(TotalSteps, Calories, SedentaryMinutes) %>% 
  summary()
##    TotalSteps       Calories    SedentaryMinutes
##  Min.   :    0   Min.   :   0   Min.   :   0.0  
##  1st Qu.: 3790   1st Qu.:1830   1st Qu.: 729.0  
##  Median : 7441   Median :2134   Median :1057.0  
##  Mean   : 7654   Mean   :2305   Mean   : 990.2  
##  3rd Qu.:10734   3rd Qu.:2794   3rd Qu.:1226.8  
##  Max.   :36019   Max.   :4900   Max.   :1440.0

A closer look at the summary statistics tells us that there are observations where TotalSteps for the day is zero. This is not plausible, and we hypothesise that these must be erroneous observations. We need to take a closer look at those rows.

activity %>% 
  filter(TotalSteps==0) %>% 
  head()
##           Id ActivityDate TotalSteps TotalDistance TrackerDistance
## 1 1503960366    5/12/2016          0             0               0
## 2 1844505072    4/24/2016          0             0               0
## 3 1844505072    4/25/2016          0             0               0
## 4 1844505072    4/26/2016          0             0               0
## 5 1844505072     5/2/2016          0             0               0
## 6 1844505072     5/7/2016          0             0               0
##   LoggedActivitiesDistance VeryActiveDistance ModeratelyActiveDistance
## 1                        0                  0                        0
## 2                        0                  0                        0
## 3                        0                  0                        0
## 4                        0                  0                        0
## 5                        0                  0                        0
## 6                        0                  0                        0
##   LightActiveDistance SedentaryActiveDistance VeryActiveMinutes
## 1                   0                       0                 0
## 2                   0                       0                 0
## 3                   0                       0                 0
## 4                   0                       0                 0
## 5                   0                       0                 0
## 6                   0                       0                 0
##   FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes Calories partID
## 1                   0                    0             1440        0      1
## 2                   0                    0             1440     1347      4
## 3                   0                    0             1440     1347      4
## 4                   0                    0             1440     1347      4
## 5                   0                    0             1440     1348      4
## 6                   0                    0             1440     1347      4

This gives us a very important insight into the dataset. The mean of SedentaryMinutes (990 mins = 16.5 hours) already hinted that these values are too large to be waking sedentary time alone, and the rows above all record the value as 1440 minutes, i.e. the full 24 hours. Therefore:

  • These observations have to be removed.
  • Sedentary minutes cover the complete duration of the day, including sleep.

Removing such rows:

activity <- activity[ !(activity$TotalSteps == 0), ]

Again taking a look at the summary statistics:

activity %>% 
  select(TotalSteps, Calories, SedentaryMinutes) %>% 
  summary()
##    TotalSteps       Calories    SedentaryMinutes
##  Min.   :    4   Min.   :  52   Min.   :   0.0  
##  1st Qu.: 4924   1st Qu.:1856   1st Qu.: 721.0  
##  Median : 8056   Median :2220   Median :1020.5  
##  Mean   : 8331   Mean   :2362   Mean   : 955.1  
##  3rd Qu.:11100   3rd Qu.:2833   3rd Qu.:1189.0  
##  Max.   :36019   Max.   :4900   Max.   :1440.0

Checking other statistics

activity %>% 
  arrange(desc(SedentaryMinutes)) %>% 
  select(Id, partID, TotalSteps, SedentaryMinutes) %>% 
  head()
##           Id partID TotalSteps SedentaryMinutes
## 1 4319703577     15       7753             1440
## 2 4388161847     16      10122             1440
## 3 8583815059     31       5319             1440
## 4 8583815059     31       3008             1440
## 5 8583815059     31       8469             1440
## 6 8583815059     31      12015             1440

These are further erroneous observations where SedentaryMinutes is recorded as 1440, or 24 hours, meaning there was no movement at all throughout the day, yet the user took at least 3000 steps.

This was probably some bug in the device or database, and these need to be removed too.

There are several more observations with only a minute of movement in the entire 24-hour duration. At first glance, we wanted to remove these as they don’t seem plausible, but we have to put our biases aside and keep them, since their step counts match that minute mark too.

activity <- activity[ !(activity$SedentaryMinutes == 1440 | activity$SedentaryMinutes == 0), ]

Checking data after removing observations

activity %>% 
  group_by(Id, partID) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 32 x 3
## # Groups:   Id, partID [32]
##            Id partID     n
##         <dbl>  <int> <int>
##  1 1927972279      5    17
##  2 4020332650     13    17
##  3 6775888955     24    17
##  4 2347167796      9    18
##  5 8253242879     29    18
##  6 8792009665     32    19
##  7 3372868164     11    20
##  8 1844505072      4    21
##  9 6117666160     22    23
## 10 6290855005     23    24
## # ... with 22 more rows

Our total observation set has decreased by quite a bit, but at least it will not guide us towards wrong outcomes. This highlights the importance of clean data.

activity %>% 
  select(TotalSteps, Calories, SedentaryMinutes) %>%
  summary()
1
2
3
4
5
6
7
##    TotalSteps       Calories    SedentaryMinutes
##  Min.   :    4   Min.   :  52   Min.   :   2.0  
##  1st Qu.: 4928   1st Qu.:1855   1st Qu.: 720.8  
##  Median : 8062   Median :2214   Median :1019.5  
##  Mean   : 8350   Mean   :2362   Mean   : 952.2  
##  3rd Qu.:11102   3rd Qu.:2832   3rd Qu.:1187.0  
##  Max.   :36019   Max.   :4900   Max.   :1439.0

The data is mainly clean now, but we should still take special care of the corner cases when doing our analysis, and do a sanity check on the outcomes we get.

A box plot will be ideal now to get a visual sense of the distribution of data and spotting the remaining outliers.

activity %>% 
  ggplot(mapping = aes(
    x=as.character(partID), 
    y=TotalSteps,
    fill="Red")
    ) +
  geom_boxplot() + 
  guides(fill="none") + 
  labs(x="Participants", y="Total Number of Steps",
       title = "Statistical analysis of Average Steps vs Participants")

# Remove the outlier to condense the plot
activity <- activity %>% 
  filter(TotalSteps != 36019)

The data looks statistically alright now, not ideal but we can work with this.

Step 2: Analysis and Viz of Activity

The mean of average steps came out to be 8350. This number alone doesn’t tell us much, and we will have to dive deeper to make sense of the distribution of this number.

activity_avg <- activity %>% 
  group_by(partID) %>% 
  summarize(Id = mean(Id),
            AvgSteps = mean(TotalSteps),
            AvgCalories = mean(Calories),
            )
activity_avg
## # A tibble: 32 x 4
##    partID         Id AvgSteps AvgCalories
##     <int>      <dbl>    <dbl>       <dbl>
##  1      1 1503960366   12521.       1877.
##  2      2 1624580081    4735.       1443.
##  3      3 1644430081    7283.       2811.
##  4      4 1844505072    3809.       1714.
##  5      5 1927972279    1671.       2303.
##  6      6 2022484408   11371.       2510.
##  7      7 2026352035    5567.       1541.
##  8      8 2320127002    4717.       1724.
##  9      9 2347167796    9520.       2043.
## 10     10 2873212765    7556.       1917.
## # ... with 22 more rows
activity_avg %>% 
  ggplot(mapping = aes(x=AvgSteps/1000, fill="Red")) + 
  geom_bar() +
  scale_x_binned() +
  guides(fill="none") + 
  labs(x="Average steps taken per day x1000", 
       y = "Number of participants",
       title = "Distribution of participants based on average steps")

This gives the first concrete information about the users-

Out of the 32 users, 9 took less than 6000 steps per day, 19 took between 6000-12000 steps per day, and 4 took more than 12000 steps per day.

Adding more granularity to our analysis, we will add a weekday to the observations based on the date and see how their analysis varies from weekday to weekend.

# First we have to convert date to standard %m-%d-%y format
dashed <- gsub("/", "-", activity$ActivityDate)
head(dashed)
## [1] "4-12-2016" "4-13-2016" "4-14-2016" "4-15-2016" "4-16-2016" "4-17-2016"
# Then add a label to it, SUN = 1, MON = 2 using lubridate's wday
daylabel <- 
  wday((mdy(dashed)),week_start=getOption("lubridate.week.start", 7))
head(daylabel)
## [1] 3 4 5 6 7 1
#Appending those values back to the table
activity <- mutate(activity, weekday = daylabel)
activity %>% 
  select(partID, Id, ActivityDate, weekday) %>% 
  head()
##   partID         Id ActivityDate weekday
## 1      1 1503960366    4/12/2016       3
## 2      1 1503960366    4/13/2016       4
## 3      1 1503960366    4/14/2016       5
## 4      1 1503960366    4/15/2016       6
## 5      1 1503960366    4/16/2016       7
## 6      1 1503960366    4/17/2016       1

We can indeed confirm that 4/12/2016 (label 3) was a Tuesday and 4/17/2016 (label 1) was a Sunday.

Creating a summarized dataset based on the weekdays:

weekActivity <- activity %>% 
  group_by(weekday) %>% 
  summarize(
    avgWsteps = mean(TotalSteps), 
    sedWmins = mean(SedentaryMinutes),
    avgWveryact = mean(VeryActiveMinutes),
    avgWcals = mean(Calories)
  )
weekActivity
## # A tibble: 7 x 5
##   weekday avgWsteps sedWmins avgWveryact avgWcals
##     <dbl>     <dbl>    <dbl>       <dbl>    <dbl>
## 1       1     7406.     940.        20.7    2306.
## 2       2     8488.     986.        25.4    2381.
## 3       3     8953.     961.        26.0    2435.
## 4       4     8207.     948.        22.8    2338.
## 5       5     8247.     928.        21.6    2290.
## 6       6     7853.     979.        21.2    2359.
## 7       7     8979.     925.        24.0    2426.

Plotting average steps against each day of the week:

weekActivity %>% 
  ggplot(mapping = aes(x=weekday, y=avgWsteps, fill=-avgWsteps)) +
  geom_col() +
  labs(y="Average steps per day", 
       x="Day of the week",
       title = "Average steps taken on each day of the week") + guides(fill="none") + 
  geom_text(aes(label = as.integer(avgWsteps)), vjust=2, colour = "white") +
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7), labels = c("SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"))

A number of insights can be derived from this data, but first we need to verify that the observations are evenly distributed across the days of the week, so that they don’t introduce bias.

activity %>% 
  count(weekday)
##   weekday   n
## 1       1 108
## 2       2 109
## 3       3 134
## 4       4 137
## 5       5 132
## 6       6 119
## 7       7 112

There are somewhat more observations for the middle days of the week, but since every day has over 100 observations and the differences are small, the comparison remains reasonable.
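To put a rough number on that imbalance, we can compare the busiest and quietest days using the counts from the table above:

```r
# Day-of-week observation counts from the table above (SUN..SAT)
counts <- c(108, 109, 134, 137, 132, 119, 112)
max(counts) / min(counts)  # ~1.27, a modest imbalance
```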

weekPartActivity <- activity %>% 
  group_by(partID, weekday) %>% 
  summarize(
    Id = first(Id),  # Id is constant within each partID
    steps = mean(TotalSteps),
    seden = mean(SedentaryMinutes),
    cals = mean(Calories)
  )
## `summarise()` has grouped output by 'partID'. You can override using the `.groups` argument.
head(weekPartActivity)
## # A tibble: 6 x 6
## # Groups:   partID [1]
##   partID weekday         Id  steps seden  cals
##    <int>   <dbl>      <dbl>  <dbl> <dbl> <dbl>
## 1      1       1 1503960366 10102.  638  1769 
## 2      1       2 1503960366 13781.  899  1939.
## 3      1       3 1503960366 13947.  780. 1968.
## 4      1       4 1503960366 12657.  910  1869.
## 5      1       5 1503960366 11876.  924. 1852 
## 6      1       6 1503960366 11466.  878  1826.
#Plotting these:
weekPartActivity %>% 
  ggplot(mapping = aes(
    x=as.character(weekday), 
    y=steps, 
    fill=-steps)
    ) +
  guides(fill="none") +
  geom_col() + facet_wrap(~partID, nrow = 4) + 
  labs(
    x="Days of the week (1-SUN)", 
    y="Average steps taken per day",
    title = "A look at steps per weekday for each participant"
    )

groupings <- activity %>% 
  group_by(partID) %>% 
  summarize(avgstep = mean(TotalSteps)) %>% 
  mutate(stepGroup = case_when(
    avgstep < 6000 ~ 'A',
    avgstep < 12000 ~ 'B',  # 6000 <= avgstep < 12000
    TRUE ~ 'C'              # avgstep >= 12000
  ))
groupings %>% 
  head()
## # A tibble: 6 x 3
##   partID avgstep stepGroup
##    <int>   <dbl> <chr>    
## 1      1  12521. C        
## 2      2   4735. A        
## 3      3   7283. B        
## 4      4   3809. A        
## 5      5   1671. A        
## 6      6  11371. B

Merging the group labels back into the activity table, similar to a VLOOKUP in spreadsheets:

activity <- merge(groupings, activity, by = 'partID')
activity %>% 
  select(partID, TotalSteps, avgstep, stepGroup) %>% 
  head()
##   partID TotalSteps  avgstep stepGroup
## 1      1      13162 12520.63         C
## 2      1      10735 12520.63         C
## 3      1      10460 12520.63         C
## 4      1       9762 12520.63         C
## 5      1      12669 12520.63         C
## 6      1       9705 12520.63         C
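For reference, the same lookup could be written with dplyr's left_join(), which keeps every activity row and attaches the matching group label. This is a sketch of an alternative to the merge() call above, not code the analysis depends on:

```r
library(dplyr)

# Equivalent to merge(groupings, activity, by = 'partID') when every
# partID in activity has exactly one row in groupings
activity_labeled <- activity %>% 
  left_join(groupings, by = "partID")
```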
groupedActivity <- activity %>% 
  group_by(stepGroup, weekday) %>% 
  summarize( avgsteps = mean(TotalSteps))
## `summarise()` has grouped output by 'stepGroup'. You can override using the `.groups` argument.
groupedActivity %>% 
  head()
## # A tibble: 6 x 3
## # Groups:   stepGroup [1]
##   stepGroup weekday avgsteps
##   <chr>       <dbl>    <dbl>
## 1 A               1    3545 
## 2 A               2    4694.
## 3 A               3    4223.
## 4 A               4    4520.
## 5 A               5    4273.
## 6 A               6    3780.
groupedActivity %>% 
  ggplot(mapping = aes(
    x=as.character(weekday), 
    y=avgsteps/1000,
    fill=stepGroup)
    ) +
  geom_col() + facet_wrap(~stepGroup) +
  labs(x = "Weekday", 
       y = "Average Steps x1000",
       title = "Grouped distribution of average steps per weekday",
       subtitle = "A: <6000,                      B: 6000-12000,           C: >12000")

msun <- mean(activity$TotalSteps[activity$weekday %in% c(1)])
msat <- mean(activity$TotalSteps[activity$weekday %in% c(7)])
mweek <- mean(activity$TotalSteps[activity$weekday %in% c(2:6)])

data.frame("Day" = c("Sunday", "Weekday", "Saturday"),
           "average" = c(msun, mweek, msat)) %>% 
  ggplot(mapping = aes(x=Day, y=average, fill=-average))+
  geom_col() + 
  geom_text(aes(label = as.integer(average)), 
            vjust=2, colour = "white") + 
  guides(fill="none") + 
  labs(y="Average steps taken",
       title = "Average steps taken per weekend/weekday")

activity <- activity %>% 
  mutate(
    breaklabel = case_when(
      weekday == 1 ~ 'SUN',
      weekday == 7 ~ 'SAT',
      weekday %in% 2:6 ~ 'WEEK'
    )
  )
activity %>% 
  select(weekday, breaklabel) %>% 
  head()
##   weekday breaklabel
## 1       3       WEEK
## 2       4       WEEK
## 3       5       WEEK
## 4       6       WEEK
## 5       7        SAT
## 6       1        SUN
activ_step_week <- activity %>% 
  group_by(stepGroup, breaklabel) %>% 
  summarize(
    steps = mean(TotalSteps)
  )
## `summarise()` has grouped output by 'stepGroup'. You can override using the `.groups` argument.
activ_step_week
## # A tibble: 9 x 3
## # Groups:   stepGroup [3]
##   stepGroup breaklabel  steps
##   <chr>     <chr>       <dbl>
## 1 A         SAT         4605.
## 2 A         SUN         3545 
## 3 A         WEEK        4288.
## 4 B         SAT         9669.
## 5 B         SUN         7823.
## 6 B         WEEK        8715.
## 7 C         SAT        14307 
## 8 C         SUN        12238.
## 9 C         WEEK       14265.
activ_step_week %>% 
  ggplot(mapping = aes(
    x = breaklabel,
    y = steps,
    fill = stepGroup
  )) +
  geom_col() +
  geom_text(aes(label = as.integer(steps)), 
            vjust=2, size=3.5, colour = "white") + 
  facet_wrap(~stepGroup) +
  labs(x = "Days", 
       y = "Average steps for that day",
       title = "Grouped distribution of average steps per weekday/weekend",
       subtitle = "A: <6000,                      B: 6000-12000,           C: >12000")

Step 3: Importing and Cleaning Sleep Data

Reading the dataset

sleep <- read.csv('C://Users/hamda/Desktop/Portfolio Projects/3 Google/data/sleepDay_merged.csv')
head(sleep)
##           Id              SleepDay TotalSleepRecords TotalMinutesAsleep
## 1 1503960366 4/12/2016 12:00:00 AM                 1                327
## 2 1503960366 4/13/2016 12:00:00 AM                 2                384
## 3 1503960366 4/15/2016 12:00:00 AM                 1                412
## 4 1503960366 4/16/2016 12:00:00 AM                 2                340
## 5 1503960366 4/17/2016 12:00:00 AM                 1                700
## 6 1503960366 4/19/2016 12:00:00 AM                 1                304
##   TotalTimeInBed
## 1            346
## 2            407
## 3            442
## 4            367
## 5            712
## 6            320
str(sleep)
## 'data.frame':    413 obs. of  5 variables:
##  $ Id                : num  1.5e+09 1.5e+09 1.5e+09 1.5e+09 1.5e+09 ...
##  $ SleepDay          : chr  "4/12/2016 12:00:00 AM" "4/13/2016 12:00:00 AM" "4/15/2016 12:00:00 AM" "4/16/2016 12:00:00 AM" ...
##  $ TotalSleepRecords : int  1 2 1 2 1 1 1 1 1 1 ...
##  $ TotalMinutesAsleep: int  327 384 412 340 700 304 360 325 361 430 ...
##  $ TotalTimeInBed    : int  346 407 442 367 712 320 377 364 384 449 ...

Cleaning and analyzing the dataset

sleep %>% 
  count(Id) %>% 
  arrange(n)
##            Id  n
## 1  2320127002  1
## 2  7007744171  2
## 3  1844505072  3
## 4  6775888955  3
## 5  8053475328  3
## 6  1644430081  4
## 7  1927972279  5
## 8  4558609924  5
## 9  4020332650  8
## 10 2347167796 15
## 11 8792009665 15
## ..

This dataset contains only 24 distinct participants, which is another reason why joining the tables at the start was not preferred: the number of individuals and the observations per individual are both already small, and an early join would have reduced them further.

Out of the 24 participants with sleep data, 8 have less than a week of records. We will remove these, along with the one participant with only 8 observations, as such sparse data would only hinder our analysis.

One immediate difference is that this dataset has a timestamp, while the activity dataset has only a date. We will need to strip the time component in order to merge with activity.

#Removing participants with less than 9 records
sleep <- sleep %>% 
  group_by(Id) %>% 
  filter(n() > 8)

#Checking new table
sleep %>% 
  count(Id) %>% 
  arrange(n)
## # A tibble: 15 x 2
## # Groups:   Id [15]
##            Id     n
##         <dbl> <int>
##  1 2347167796    15
##  2 8792009665    15
##  3 6117666160    18
##  4 4388161847    24
##  5 7086361926    24
##  6 1503960366    25
##  7 4319703577    26
##  8 5577150313    26
##  9 2026352035    28
## 10 3977333714    28
## 11 4445114986    28
## 12 4702921684    28
## 13 5553957443    31
## 14 6962181067    31
## 15 8378563200    32

The sleep data now contains 15 participants with at least 2 weeks of records for each.

Converting the datetime to a date:

sleep <- sleep %>% 
  separate(SleepDay, c("ActivityDate", "Time", "AM/PM"), " ") %>% 
  select(Id, ActivityDate, TotalMinutesAsleep, TotalTimeInBed)
head(sleep)
## # A tibble: 6 x 4
## # Groups:   Id [1]
##           Id ActivityDate TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                     <int>          <int>
## 1 1503960366 4/12/2016                   327            346
## 2 1503960366 4/13/2016                   384            407
## 3 1503960366 4/15/2016                   412            442
## 4 1503960366 4/16/2016                   340            367
## 5 1503960366 4/17/2016                   700            712
## 6 1503960366 4/19/2016                   304            320
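An alternative sketch: instead of splitting the string, parse the full timestamp with lubridate and convert both tables to proper Date columns before merging, which sidesteps string-format mismatches (e.g. leading zeros). This assumes the original SleepDay column, i.e. it would run in place of the separate() call above:

```r
library(lubridate)

# Hypothetical alternative to separate(): parse timestamps like
# "4/12/2016 12:00:00 AM" and dates like "4/12/2016" into Date objects
sleep$ActivityDate    <- as_date(mdy_hms(sleep$SleepDay))
activity$ActivityDate <- mdy(activity$ActivityDate)
```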

Combining this dataset with the activity dataset:

head(activity)
##   partID  avgstep stepGroup         Id ActivityDate TotalSteps TotalDistance
## 1      1 12520.63         C 1503960366    4/12/2016      13162          8.50
## 2      1 12520.63         C 1503960366    4/13/2016      10735          6.97
## 3      1 12520.63         C 1503960366    4/14/2016      10460          6.74
## 4      1 12520.63         C 1503960366    4/15/2016       9762          6.28
## 5      1 12520.63         C 1503960366    4/16/2016      12669          8.16
## 6      1 12520.63         C 1503960366    4/17/2016       9705          6.48
##   TrackerDistance LoggedActivitiesDistance VeryActiveDistance
## 1            8.50                        0               1.88
## 2            6.97                        0               1.57
## 3            6.74                        0               2.44
## 4            6.28                        0               2.14
## 5            8.16                        0               2.71
## 6            6.48                        0               3.19
##   ModeratelyActiveDistance LightActiveDistance SedentaryActiveDistance
## 1                     0.55                6.06                       0
## 2                     0.69                4.71                       0
## 3                     0.40                3.91                       0
## 4                     1.26                2.83                       0
## 5                     0.41                5.04                       0
## 6                     0.78                2.51                       0
##   VeryActiveMinutes FairlyActiveMinutes LightlyActiveMinutes SedentaryMinutes
## 1                25                  13                  328              728
## 2                21                  19                  217              776
## 3                30                  11                  181             1218
## 4                29                  34                  209              726
## 5                36                  10                  221              773
## 6                38                  20                  164              539
##   Calories weekday breaklabel
## 1     1985       3       WEEK
## 2     1797       4       WEEK
## 3     1776       5       WEEK
## 4     1745       6       WEEK
## 5     1863       7        SAT
## 6     1728       1        SUN
head(sleep)
## # A tibble: 6 x 4
## # Groups:   Id [1]
##           Id ActivityDate TotalMinutesAsleep TotalTimeInBed
##        <dbl> <chr>                     <int>          <int>
## 1 1503960366 4/12/2016                   327            346
## 2 1503960366 4/13/2016                   384            407
## 3 1503960366 4/15/2016                   412            442
## 4 1503960366 4/16/2016                   340            367
## 5 1503960366 4/17/2016                   700            712
## 6 1503960366 4/19/2016                   304            320

Step 4: Merging Activity and Sleep

activ_sleep <- merge(sleep, activity, 
                     by = c('Id', 'ActivityDate'))
activ_sleep %>% 
  select(
    Id, partID, TotalSteps, TotalMinutesAsleep
  ) %>% 
  head()
##           Id partID TotalSteps TotalMinutesAsleep
## 1 1503960366      1      13162                327
## 2 1503960366      1      10735                384
## 3 1503960366      1       9762                412
## 4 1503960366      1      12669                340
## 5 1503960366      1       9705                700
## 6 1503960366      1      15506                304

The tables have been merged on Id and date, so the sleep records now carry the weekday information derived earlier. Before proceeding, we check the number of observations per participant and per weekday to guard against bias.

# Participants
activ_sleep %>% 
  group_by(partID) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 15 x 2
## # Groups:   partID [15]
##    partID     n
##     <int> <int>
##  1      9    15
##  2     32    15
##  3     22    18
##  4     16    24
##  5     27    24
##  6      1    25
##  7     15    25
##  8     21    26
##  9      7    28
## 10     12    28
## 11     17    28
## 12     19    28
## 13     20    31
## 14     25    31
## 15     30    32
# Weekdays
activ_sleep %>% 
  group_by(weekday) %>% 
  count() %>% 
  arrange(n)
## # A tibble: 7 x 2
## # Groups:   weekday [7]
##   weekday     n
##     <dbl> <int>
## 1       2    46
## 2       1    49
## 3       6    51
## 4       7    51
## 5       3    59
## 6       5    60
## 7       4    62

We have a good number of observations across participants and weekdays, and can now proceed with our analysis.

activ_sleep %>% 
  select(TotalSteps, TotalTimeInBed, TotalMinutesAsleep) %>% 
  summary()
##    TotalSteps    TotalTimeInBed  TotalMinutesAsleep
##  Min.   :   42   Min.   : 65.0   Min.   : 59.0     
##  1st Qu.: 5655   1st Qu.:414.2   1st Qu.:374.8     
##  Median : 9170   Median :468.0   Median :437.5     
##  Mean   : 8733   Mean   :467.4   Mean   :429.0     
##  3rd Qu.:11422   3rd Qu.:529.2   3rd Qu.:492.0     
##  Max.   :22770   Max.   :843.0   Max.   :775.0

On average, a participant in this dataset sleeps 429 minutes, or about 7.15 hours, per night.

Checking for outliers

activ_sleep %>% 
  ggplot(mapping = aes(y=TotalMinutesAsleep, 
                       x=as.character(partID),
                       fill="Red")) +
  guides(fill="none") +
  geom_boxplot() +
  labs(x = "Participants", 
       y="Total Minutes Asleep",
       title = "Statistical distribution of Sleep vs Participants")

The boxplot suggests some outliers are present, but we should not discard them outright: sleep duration varies widely between individuals, and both extremes are plausible for this column.
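To quantify what the boxplot shows, here is a sketch of the standard 1.5 * IQR rule (the same rule geom_boxplot uses for its whiskers) applied per participant. flag_outliers is a helper defined here for illustration, not part of the original analysis:

```r
library(dplyr)

# TRUE where a value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)
}

activ_sleep %>% 
  group_by(partID) %>% 
  summarize(outliers = sum(flag_outliers(TotalMinutesAsleep)))
```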

Step 5: Analysis and Visualization of the Merged Data

Checking whether there is a correlation between steps taken and time slept:

activ_sleep %>% 
  ggplot(mapping = aes(
    x = TotalSteps,
    y = TotalMinutesAsleep,
    color = partID
  )) +
  geom_point() + guides(color="none") + 
  geom_smooth(color="Red") + 
  labs(x = "Total Steps Taken", 
       y="Total Minutes Asleep",
       title = "Correlation between time slept and steps taken")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Looking at this graph, there doesn’t appear to be any correlation between the number of steps walked in a day and the minutes slept. This seems plausible, as sleep is highly individual and depends on a range of physical and psychological factors.
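We can back this visual impression with a number; a quick sketch using Pearson's correlation coefficient, where a value near 0 would support the no-correlation reading:

```r
# Pearson correlation between daily steps and minutes asleep
cor(activ_sleep$TotalSteps, activ_sleep$TotalMinutesAsleep)
```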

Examining sleep patterns across the days of the week:

activ_sleep %>% 
  group_by(weekday) %>% 
  summarize(avgsleep = mean(TotalMinutesAsleep)) %>% 
  ggplot(mapping = aes( 
    x = weekday,
    y = avgsleep/60,
    fill = -avgsleep
  )) +
  geom_col() + guides(fill="none") + 
  labs(x = "Weekday", 
       y = "Average sleep in hours",
       title = "Sleep distribution on each day of the week") +
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7),
                     labels = c("SUN", "MON", "TUE", 
                                "WED", "THU", "FRI", "SAT"))

activ_sleep %>% 
  group_by(partID, weekday) %>% 
  summarize(
    sleepav = mean(TotalMinutesAsleep)
  ) %>% 
  ggplot(mapping = aes(
    x=as.character(weekday), 
    y=sleepav/60, 
    fill=-sleepav)
    ) +
  guides(fill="none") +
  geom_col() + facet_wrap(~partID, nrow = 3) + 
  labs(
    x = "Days of the week (1-SUN)", 
    y = "Average sleep per day in hours",
    title = "A look at the distribution of sleep vs weekday for each participant"
    )
## `summarise()` has grouped output by 'partID'. You can override using the `.groups` argument.

activ_sleep %>% 
  group_by(stepGroup, weekday) %>% 
  summarize(
    sleep = mean(TotalMinutesAsleep)
  ) %>% 
  ggplot(mapping = aes(
    x = weekday,
    y = sleep/60,
    fill = stepGroup
  )) +
  geom_col() + 
  scale_x_continuous(breaks = c(1,2,3,4,5,6,7)) + 
  facet_wrap(~stepGroup) + 
  labs(x = "Days of the week", 
       y = "Average sleep in hours",
       title = "Grouped distribution of average sleep per weekday",
       subtitle = "A: <6000,                      B: 6000-12000,           C: >12000")
## `summarise()` has grouped output by 'stepGroup'. You can override using the `.groups` argument.

# SedentaryMinutes appears to include time in bed, so subtracting
# TotalTimeInBed approximates waking sedentary time; negative
# differences are dropped as inconsistent records
activ_sleep %>% 
  filter( SedentaryMinutes - TotalTimeInBed > 0) %>% 
  ggplot(mapping = aes(
    x = SedentaryMinutes-TotalTimeInBed,
    y = TotalMinutesAsleep/60,
    color = partID
  )) +
  geom_point() + guides(color="none") + 
  geom_smooth(color="Red") + 
  labs(x = "Corrected Sedentary Time", 
       y = "Total sleep in hours",
       title = "Correlation between time slept vs sedentary minutes in the day")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

as1 <- activ_sleep %>% 
  mutate(sed = SedentaryMinutes-TotalTimeInBed,
         slp = TotalMinutesAsleep) %>% 
  filter(sed > 0)
cor(as1$sed, as1$slp)
## [1] -0.7607567

This tells us there is a strong negative correlation (r ≈ -0.76) between waking sedentary time and time slept: participants who spend more of their waking day sedentary tend to sleep less. Correlation, of course, does not establish causation.
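To check that this correlation is not a small-sample artifact, a sketch using base R's cor.test(), which reports a p-value and a confidence interval alongside the estimate:

```r
# Test the sedentary-time vs sleep correlation for significance
cor.test(as1$sed, as1$slp)
```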

This post is licensed under CC BY 4.0 by the author.