Update: with three additional months of data (80% training, 20% test), our results have improved significantly:
The random forest is still the most accurate, with a mean absolute error of about 0.347.
The linear model comes in a close second, with a mean absolute error of about 0.363.
And the decision tree is the least accurate, with a mean absolute error of 0.5.
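For anyone curious what a comparison like this looks like in code, here is a minimal sketch. It is not the actual notebook: the data below is synthetic, and the use of scikit-learn's `RandomForestRegressor`, `LinearRegression`, and `DecisionTreeRegressor` with default settings is an assumption. Only the 80/20 split and the mean absolute error metric come from the numbers above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (hypothetical): 100 rows, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

# 80% training, 20% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Assumed model choices; hyperparameters are left at their defaults.
models = {
    "random forest": RandomForestRegressor(random_state=0),
    "linear model": LinearRegression(),
    "decision tree": DecisionTreeRegressor(random_state=0),
}

# Fit each model on the training set and score it on the held-out test set.
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {errors[name]:.3f}")
```

The ranking on synthetic data says nothing about the ranking on the real data; the point is only the shape of the workflow (split, fit, predict, score).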
For Part 1 of the documentation on data collection and analysis:
After going through the pandas/scikit-learn wringer this past week, I now realize that collecting and visualizing my data were the first baby steps of this arduous journey. My first foray into analysis was met immediately with a big obstacle: my data was a disparate, misaligned mess. So there was a lot of preprocessing to do, for every dataset, before I could merge them all into one pandas dataframe for analysis.
After each dataset was cleaned and nicely tied up with a dataframe bow, I needed to merge them all into one. Pandas has a convenient merge method that does such things in one fell swoop, but the problems were 1) I had a ton of missing values, and 2) discarding incomplete rows reduced my dataset to a mere shadow of its former self. Since there was no way that pandas could make reasonable decisions about combining rows with corresponding missing values, I had to do it manually.
Or rather, my boyfriend did:
Not that I totally understand this function, let alone am capable of reproducing it myself, but the crux of it is that it merges incomplete rows that fall within 3 hours of each other. The result is this beauty:
The merged dataframe, with all of my data, ended up being 78 rows × 45 columns. Not too bad.
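To make the "merge incomplete rows within 3 hours of each other" idea concrete, here is a rough sketch. It is not the function used above, just one way the idea could be implemented. The column names (`timestamp`, `mood`, `sleep_hours`), the toy values, and the helper `coalesce_within_window` are all hypothetical, and the rule for grouping rows (everything within 3 hours of the first row in a group) is an assumption.

```python
import numpy as np
import pandas as pd

def coalesce_within_window(df, time_col="timestamp", window=pd.Timedelta(hours=3)):
    """Collapse rows that fall within `window` of the first row of their group,
    keeping the first non-missing value found in each column."""
    df = df.sort_values(time_col).reset_index(drop=True)
    group_ids = []
    group_id = -1
    group_start = None
    for ts in df[time_col]:
        # Start a new group once a row is more than `window` past the group's first row.
        if group_start is None or ts - group_start > window:
            group_id += 1
            group_start = ts
        group_ids.append(group_id)
    # GroupBy.first() skips missing values, so each column gets its first real entry.
    return df.groupby(np.asarray(group_ids)).first().reset_index(drop=True)

# Two toy datasets whose timestamps never line up exactly (hypothetical values).
mood = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-01-01 09:00", "2017-01-01 20:00"]),
    "mood": [4, 2],
})
sleep = pd.DataFrame({
    "timestamp": pd.to_datetime(["2017-01-01 10:30", "2017-01-02 08:00"]),
    "sleep_hours": [7.5, 6.0],
})

# An outer merge keeps everything, but every row is missing something...
merged = pd.merge(mood, sleep, on="timestamp", how="outer")
# ...so dropping incomplete rows leaves nothing at all.
complete_only = merged.dropna()

# Coalescing nearby rows recovers one complete row from the 09:00 and 10:30 entries.
combined = coalesce_within_window(merged)
print(combined)
```

In this toy case the 09:00 mood entry and the 10:30 sleep entry collapse into a single complete row, while the two entries with no neighbor inside the window stay incomplete.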
Now that the data was all in one big, happy dataframe, I could finally start on the analysis. Granted, 78 rows of data is not nearly enough to find relationships that are statistically significant, but consider the following a proof of concept.
Here’s a heat map of correlations between every pair of variables.
The strongest correlations are what one intuitively might expect, like the number of interactions I have being positively correlated with being at school, my mood being positively correlated with my boyfriend’s company, and stress being positively correlated with productivity (aka, the only motivator). So that’s good.
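For reference, a correlation matrix like the one behind the heat map can be computed with pandas' `DataFrame.corr`, which uses Pearson correlation by default. The sketch below uses made-up data with hypothetical column names (`mood`, `boyfriend_company`, `stress`, `productivity`) rather than the real 45 columns, and then lists the variable pairs with the strongest correlations.

```python
import numpy as np
import pandas as pd

# Made-up data with 78 rows, loosely mimicking the relationships described above.
rng = np.random.default_rng(0)
n = 78
company = rng.integers(0, 2, size=n)
stress = rng.integers(1, 6, size=n)
df = pd.DataFrame({
    "mood": 3 + company + rng.normal(scale=0.5, size=n),
    "boyfriend_company": company,
    "stress": stress,
    "productivity": stress + rng.normal(scale=1.0, size=n),
})

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank pairs by absolute correlation, strongest first.
pairs = pairs.sort_values(key=np.abs, ascending=False)
print(pairs.head(3))
```

A single column of the matrix, such as `corr["mood"]`, gives the correlations of every variable with mood, and something like seaborn's `heatmap(corr)` can draw the full matrix.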
Since my greater goal for this project is predicting mood, this column is the most relevant: