The Project

My husband has various temperature and humidity sensors scattered throughout the house, recording data points to a MySQL server. The data is stored in a table that looks like this:

| row | id (int) | date (chr) | sensorname (chr) | sensorvalue (dbl) |
|---|---|---|---|---|
| 1 | 31 | 2016-12-18 22:20:23 | temp5 | 63.6116 |
| 2 | 32 | 2016-12-18 22:20:23 | finalDHTTempF2 | 68.0000 |
| 3 | 33 | 2016-12-18 22:20:23 | humidity2 | 36.0000 |
| 4 | 34 | 2016-12-18 22:25:23 | temp5 | 64.1750 |
| 5 | 35 | 2016-12-18 22:25:23 | finalDHTTempF2 | 68.0000 |
| 6 | 36 | 2016-12-18 22:25:23 | humidity2 | 36.0000 |
| 7 | 37 | 2016-12-18 22:30:23 | temp5 | 63.7250 |
| 8 | 38 | 2016-12-18 22:30:23 | finalDHTTempF2 | 69.8000 |
| 9 | 39 | 2016-12-18 22:30:23 | humidity2 | 35.0000 |
| 10 | 40 | 2016-12-18 22:35:23 | temp5 | 63.3866 |

I wanted to use his dataset as a testing ground for my adventures in applying R.

Our current dataset, data, is a data frame with 198,164 rows.

The Problem

Looking at this data, my first thought was "untidy." There has to be a better way. When I think of tidy data, I think of the tidyr package, which helps make data tidy and easier to work with. Specifically, I thought of the spread() function, which could break the sensor names out into their own columns. Once the data was spread into appropriate columns, I figured I could operate on it a bit better.

The Adventures so far…

As seen in the date field, the values are logged with their times. This is why we have so many data points. The first thing I wanted to do was group the values into daily means.

Cleaning up Dates

I am using lubridate to make some of my date management a bit easier. I am using dplyr to do the chaining with %>%. I grouped my data by sensor then by date parts – year, month, and day. After grouping the data, I summarized the data to get daily means. Once the data was summarized, I spread it out to make it more meaningful:

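| row | year(date) (dbl) | month(date) (dbl) | day(date) (int) | finalDHTTempF1 (dbl) | finalDHTTempF2 (dbl) | finalDHTTempF3 (dbl) | humidity1 (dbl) |
|---|---|---|---|---|---|---|---|
| 1 | 2016 | 12 | 18 | NA | 68.34286 | NA | NA |
| 2 | 2016 | 12 | 19 | NA | 67.77578 | NA | NA |
| 3 | 2016 | 12 | 20 | NA | 67.88750 | NA | NA |
| 4 | 2016 | 12 | 21 | NA | 68.95625 | NA | NA |
| 5 | 2016 | 12 | 22 | NA | 69.74375 | NA | NA |
| 6 | 2016 | 12 | 23 | NA | 69.71875 | NA | NA |
| 7 | 2016 | 12 | 24 | NA | 70.97500 | NA | NA |
| 8 | 2016 | 12 | 25 | NA | 70.85625 | NA | NA |
| 9 | 2016 | 12 | 26 | NA | 71.78750 | NA | NA |
| 10 | 2016 | 12 | 27 | NA | 71.08750 | NA | NA |

The sensor columns in full, for the first six rows:

| finalDHTTempF1 (dbl) | finalDHTTempF2 (dbl) | finalDHTTempF3 (dbl) | humidity1 (dbl) | humidity2 (dbl) | humidity3 (dbl) | temp4 (dbl) | temp5 (dbl) |
|---|---|---|---|---|---|---|---|
| NA | 68.34286 | NA | NA | 35.80952 | NA | NA | 63.08703 |
| NA | 67.77578 | NA | NA | 35.55709 | NA | NA | 62.37841 |
| NA | 67.88750 | NA | NA | 35.50347 | NA | NA | 62.41281 |
| NA | 68.95625 | NA | NA | 35.46528 | NA | NA | 63.40109 |
| NA | 69.74375 | NA | NA | 35.24306 | NA | NA | 64.36713 |
| NA | 69.71875 | NA | NA | 35.25000 | NA | NA | 64.33000 |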
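
The grouping, summarizing, and spreading steps described above could look roughly like the sketch below. It is not necessarily the exact code behind these tables: the small data frame is rebuilt from the first six sample rows shown earlier, and the column names year, month, day, and mean_value are assumptions made for illustration (the tables above show the unnamed versions, such as year(date)).

```r
library(dplyr)
library(tidyr)
library(lubridate)

# A few rows in the same shape as the sensor table (copied from the sample above)
data <- data.frame(
  id = 31:36,
  date = c("2016-12-18 22:20:23", "2016-12-18 22:20:23", "2016-12-18 22:20:23",
           "2016-12-18 22:25:23", "2016-12-18 22:25:23", "2016-12-18 22:25:23"),
  sensorname = c("temp5", "finalDHTTempF2", "humidity2",
                 "temp5", "finalDHTTempF2", "humidity2"),
  sensorvalue = c(63.6116, 68.0, 36.0, 64.1750, 68.0, 36.0),
  stringsAsFactors = FALSE
)

daily_means <- data %>%
  # the date column arrives as text, so parse it into a date-time first
  mutate(date = ymd_hms(date)) %>%
  # group by sensor, then by date parts
  group_by(sensorname, year = year(date), month = month(date), day = day(date)) %>%
  # one mean per sensor per day
  summarise(mean_value = mean(sensorvalue)) %>%
  ungroup() %>%
  # one column per sensor, one row per day
  spread(key = sensorname, value = mean_value)

print(daily_means)
```

With the full dataset, any sensor that reported nothing on a given day ends up as NA in that day's row, which is where the NA values in the tables come from.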

Cleaning up NAs

Now some of the data shows NA. If there’s anything I’ve learned about data, it’s that NULL and NA can be problematic, depending on the data tool and the user operating said tool. In this case, I can easily convert my NA values to 0 without ruining the data’s meaning:

| finalDHTTempF1 (dbl) | finalDHTTempF2 (dbl) | finalDHTTempF3 (dbl) | humidity1 (dbl) | humidity2 (dbl) | humidity3 (dbl) | temp4 (dbl) | temp5 (dbl) |
|---|---|---|---|---|---|---|---|
| 0 | 68.34286 | 0 | 0 | 35.80952 | 0 | 0 | 63.08703 |
| 0 | 67.77578 | 0 | 0 | 35.55709 | 0 | 0 | 62.37841 |
| 0 | 67.88750 | 0 | 0 | 35.50347 | 0 | 0 | 62.41281 |
| 0 | 68.95625 | 0 | 0 | 35.46528 | 0 | 0 | 63.40109 |
| 0 | 69.74375 | 0 | 0 | 35.24306 | 0 | 0 | 64.36713 |
| 0 | 69.71875 | 0 | 0 | 35.25000 | 0 | 0 | 64.33000 |
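
One way to do the NA-to-0 conversion is plain base R indexing, sketched below. This is an assumption about the approach rather than the code that produced the table above; the data frame daily_means and its reduced set of columns are made up for the example, with values taken from the first two rows shown earlier.

```r
# Two days of summarized data, including a sensor column that is all NA
daily_means <- data.frame(
  year = c(2016, 2016),
  month = c(12, 12),
  day = c(18L, 19L),
  finalDHTTempF1 = c(NA_real_, NA_real_),
  finalDHTTempF2 = c(68.34286, 67.77578),
  humidity2 = c(35.80952, 35.55709)
)

# is.na() on a data frame returns a logical matrix;
# assigning through it replaces every NA cell with 0
daily_means_filled <- daily_means
daily_means_filled[is.na(daily_means_filled)] <- 0

print(daily_means_filled)
```

tidyr also offers replace_na() for the same job when only specific columns should be filled.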

Presentation

So now that I have daily averages in a format that I can work with, let’s do something meaningful with the data – let’s plot it! I am using ggplot2 for plotting.
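
A minimal ggplot2 sketch of such a plot follows. Which sensors to plot and how to style them are guesses for illustration, as is the name plot_data; the values come from the first three days of the daily means above, and the date column is rebuilt from the year, month, and day parts with lubridate's make_date().

```r
library(ggplot2)
library(lubridate)

# First three days of daily means for two of the sensors (values from the tables above)
plot_data <- data.frame(
  year = c(2016, 2016, 2016),
  month = c(12, 12, 12),
  day = c(18L, 19L, 20L),
  finalDHTTempF2 = c(68.34286, 67.77578, 67.88750),
  temp5 = c(63.08703, 62.37841, 62.41281)
)

# Recombine the date parts into a single Date for the x axis
plot_data$date <- make_date(plot_data$year, plot_data$month, plot_data$day)

# One line per sensor column; the colour aesthetic doubles as the legend label
p <- ggplot(plot_data, aes(x = date)) +
  geom_line(aes(y = finalDHTTempF2, colour = "finalDHTTempF2")) +
  geom_line(aes(y = temp5, colour = "temp5")) +
  labs(x = "Date", y = "Daily mean temperature", colour = "Sensor")

print(p)
```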

Conclusion

So far, I’m having fun putting my skills to work, especially with this dataset. I’m at the tail end of the second course of an R specialization on Coursera. Between CodeMash and Coursera, I’ve been enjoying my exploRation into R. Here’s to many adventures ahead!