Cleaning/preparing personal weight data
02 Jan 2020
In case you don’t remember my post from a couple of months ago about my custom Bluetooth scale: I’ve been collecting a large amount of fine-grained information about my weight since then.
In this post, I’ll walk through my initial look at it, some problems I had with cleaning the data, and what I did to fix them.
Part 2: Cleaning/preparing personal weight data
Data cleaning and sanity checking
Let’s load up the data and put it into the right format: converting the time column into actual time data (the numbers in that column are Unix epoch times) and making sure each weight measurement is tagged and comes from me.
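The original analysis is done in R, but the same preparation step can be sketched in plain Python. This is only an illustration: the rows, the column names (`time`, `weight_median`, `tagged`, `user`), and the assumption that the epoch values are in seconds are all hypothetical, since the real data frame isn’t shown here.

```python
from datetime import datetime, timezone

# Hypothetical rows; the real column names and values are not shown in the post.
rows = [
    {"time": 1577923200, "weight_median": 70.2, "tagged": True, "user": "me"},
    {"time": 1577926800, "weight_median": 55.0, "tagged": True, "user": "guest"},
    {"time": 1577930400, "weight_median": 70.4, "tagged": False, "user": "me"},
]


def prepare(rows, user="me"):
    """Keep tagged rows from one user and convert epoch seconds to UTC datetimes."""
    cleaned = []
    for row in rows:
        # Skip untagged measurements and measurements from other people.
        if not row["tagged"] or row["user"] != user:
            continue
        converted = dict(row)  # copy so the input rows are left untouched
        converted["time"] = datetime.fromtimestamp(row["time"], tz=timezone.utc)
        cleaned.append(converted)
    return cleaned


prepared = prepare(rows)
```

Only the first hypothetical row survives the filter, and its `time` value becomes a timezone-aware datetime instead of a bare number.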
The first thing you’ll notice is that there are a lot of columns in this data frame (35)—the way my setup currently converts its JSON data into R-readable files just dumps all the values into their own columns.
Let’s remove a lot of those right now to make things easier for you:
So as not to ruin any surprises, let’s just look at the first couple of rows and columns:
Notice that the first column is named “X”, R’s default for unnamed columns. This comes from the unnamed index column of the pandas data frame that generated it. “ID” is a uniquely generated ID for each measurement, “sleepwake” is a factor we’ll talk about later, and the rest of the columns are self-explanatory.
Outliers: medians are more robust
Each weight “measurement” (i.e., row) is the aggregate of a hundred or so weight samples from the Wii Fit board over the few seconds I was standing on it. These samples are a bit noisy, and probably include samples where I was getting on or off the board. The columns we saw above (such as the mean) reflect the summary statistics of these samples.
Here are the mean values:
As you can see, there are some values that seem very suspect. But what about the medians?
Yes, what you are taught in intro stats classes is true: the median IS more robust to outliers than the mean! From the plot above, you might want to move on and just stick to using the median values, but something about those huge outliers of the means seems suspicious.
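To make the mean-versus-median point concrete, here is a tiny Python sketch (not the R code behind the plots). The sample values are made up: five steady readings plus two near-zero readings of the kind that might be recorded while stepping on or off the board.

```python
from statistics import mean, median

# Made-up samples (kg): mostly steady readings plus two near-zero readings.
samples = [70.1, 70.3, 70.2, 70.0, 70.4, 0.5, 1.2]

sample_mean = mean(samples)
sample_median = median(samples)

print(f"mean:   {sample_mean:.2f} kg")
print(f"median: {sample_median:.2f} kg")
```

Two bad readings out of seven drag the mean down by roughly 20 kg, while the median stays at one of the steady readings.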
A deeper cleaning
Let’s look at the individual samples, which are stored as comma-separated strings in weight_values for each measurement, to see what’s going on:
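Unpacking one of those comma-separated strings might look roughly like this in Python (the original does it in R). The helper name `parse_samples` and the example string are hypothetical, and the exact formatting of the real `weight_values` strings is an assumption.

```python
def parse_samples(weight_values):
    """Split a comma-separated string of readings into floats, skipping blanks."""
    # float() tolerates surrounding whitespace, so only empty pieces need skipping.
    return [float(piece) for piece in weight_values.split(",") if piece.strip()]


# Hypothetical string in the assumed format, including a trailing comma.
raw = "70.1, 70.3,0.4,70.2,"
samples = parse_samples(raw)
```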
Clearly, many of these samples are just wrong, and one measurement’s interquartile range extends down to almost 0 kg!
A go-to move for cleaning outliers might be to exclude samples by z-score thresholds, but this won’t always correctly clean the data: when a measurement contains many bad samples, they inflate the standard deviation, so even wildly wrong samples can have modest z-scores. For that particularly outlier-filled measurement, no sample has a z-score beyond 3 standard deviations:
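A small Python sketch shows how a z-score filter can miss everything when bad samples are common. The numbers are invented for illustration (60 plausible readings and 40 near-zero ones) and are not the real measurement.

```python
from statistics import mean, stdev


def z_scores(samples):
    """Return each sample's distance from the mean in sample standard deviations."""
    m = mean(samples)
    s = stdev(samples)
    return [(x - m) / s for x in samples]


# Invented measurement: 60 plausible readings and 40 near-zero readings (kg).
samples = [70.0] * 60 + [1.0] * 40

scores = z_scores(samples)
flagged = [x for x, z in zip(samples, scores) if abs(z) > 3]
largest = max(abs(z) for z in scores)
```

Here `flagged` comes back empty: the bad readings pull the mean toward themselves and widen the standard deviation so much that the largest absolute z-score is only a little above 1.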
However, I know roughly how much I weigh, and I’m confident I will never weigh less than 60 kg (~132 lb). Therefore, I can treat any sample < 60 kg as an error and exclude it before calculating z-scores.
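The two-step cleaning can be sketched in Python as follows. This is an illustration rather than the R code used for the post: the function name, the made-up samples, and the choice of 3 as the z-score cutoff are assumptions; only the 60 kg floor comes from the text.

```python
from statistics import mean, stdev

MIN_PLAUSIBLE_KG = 60.0  # floor taken from the post


def clean_samples(samples, floor=MIN_PLAUSIBLE_KG, z_cutoff=3.0):
    """Drop implausibly low samples, then drop z-score outliers among the rest."""
    # Step 1: domain knowledge. Anything below the floor is an error.
    plausible = [x for x in samples if x >= floor]
    if len(plausible) < 2:
        return plausible  # too few samples to estimate a standard deviation
    m = mean(plausible)
    s = stdev(plausible)
    if s == 0:
        return plausible  # identical samples, nothing to exclude
    # Step 2: z-scores computed only on the plausible samples.
    return [x for x in plausible if abs((x - m) / s) <= z_cutoff]


# Made-up measurement with two obviously wrong readings (kg).
samples = [70.0, 70.2, 69.9, 70.1, 0.3, 12.0, 70.0]
cleaned = clean_samples(samples)
```

Applying the floor first matters because the mean and standard deviation used for the z-scores are then no longer distorted by the impossible readings.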
When we apply these cleaning steps to all the measurements, the individual samples look much better.
Importantly, after I noticed this problem, I made a few changes to my weight-capturing system, so I won’t have to clean new measurements.
So let’s get cracking: what kind of insights can we glean from this data?
Background: clothing matters for micro-measurements
When you’re weighed at the doctor’s, the nurses don’t care if you take your shoes off (it’s just a few pounds), but in my case I’m making enough measurements that a few pounds of shoes, wallet, and phone are going to make a difference.
In order to get around this, my weight-capturing system lets me “tag” weights, giving them additional metadata/context. Since I know the amount of clothing I wear will influence my weight, I’ve broken my clothing status into three categories:
When I’m wearing “full” outfits (what I would wear outside, shoes, phone, jacket, etc.)
When I’m wearing lighter outfits (loungewear, sweatpants, no shoes)
When I’m wearing only my birthday suit 😳
These are pretty loose categories, but right away, we can see that they explain a lot of the variance of the data. Since being naked is the closest I can get to measuring my “true” weight and varying clothing per day adds extra noise to the data, I’ll generally stick with my “naked” measurements as being the gold standard.
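One simple way to compare the categories is to group measurements by their clothing tag and take a median per group. The Python sketch below uses invented weights, and the tag label “light” is a placeholder for the middle category, whose real label isn’t given here.

```python
from collections import defaultdict
from statistics import median

# Invented (tag, weight in kg) pairs; "light" is a placeholder label.
measurements = [
    ("full", 72.4), ("light", 71.0), ("naked", 70.1),
    ("full", 72.9), ("light", 70.8), ("naked", 70.3),
    ("full", 72.6), ("light", 70.9), ("naked", 70.2),
]

# Collect the weights recorded under each clothing tag.
by_tag = defaultdict(list)
for tag, kg in measurements:
    by_tag[tag].append(kg)

median_by_tag = {tag: median(values) for tag, values in by_tag.items()}
```

With these made-up numbers, the per-tag medians are clearly separated, which is the kind of offset that makes mixing categories in one analysis noisy.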
The Actually Interesting Questions
Now that we’ve gotten the boring bits out of the way, we’re ready to get to the juicier questions. Sadly, this post is getting too long, so you’re going to have to wait until I publish the next installment. Look forward to it soon!
If you want to see how I did what I did in this post, check out the source code of the R Markdown file!