Final pt 2: Data Analysis

Update: with three additional months of data (split 80% training, 20% test), the results have improved significantly:

The random forest is still the most accurate, with a mean absolute error of about 0.347.

The linear model comes in a close second, with a mean absolute error of about 0.363.

And the decision tree is the least accurate, with an error of 0.5.
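For reference, the comparison boils down to a train/test split plus a mean-absolute-error check for each model, something along these lines (the "mood" column name and the models' default settings are placeholders):

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# df is the merged dataframe (assumed all-numeric by this point);
# "mood" as the target column name is an assumption
X = df.drop(columns=["mood"])
y = df["mood"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for name, model in [("linear model", LinearRegression()),
                    ("decision tree", DecisionTreeRegressor()),
                    ("random forest", RandomForestRegressor())]:
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(name, round(mae, 3))
```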

For Part 1 of the documentation on data collection and analysis:

Final

After going through the pandas/scikit-learn wringer this past week, I now realize that collecting and visualizing my data were the first baby steps of this arduous journey. My first foray into analysis was met immediately with a big obstacle: my data was a disparate, misaligned mess. So there was a lot of preprocessing to do, for every dataset, before I could merge them all into one pandas dataframe for analysis.

After the datasets were cleaned and nicely tied up with dataframe bows, I needed to merge them all into one. Pandas has a convenient merge method that does such things in one fell swoop, but the problems were 1) I had a ton of missing values, and 2) discarding incomplete rows reduced my dataset to a mere shadow of its former self. Since there was no way that pandas could make reasonable decisions about combining rows with corresponding missing values, I had to do it manually.

Or rather, my boyfriend did:

Not that I totally understand this function, let alone could reproduce it myself, but the crux of it is that it merges incomplete rows that fall within 3 hours of each other (a rough sketch of the idea is below). The result is this beauty:

The merged dataframe, with all of my data, ended up being 78 rows × 45 columns. Not too bad.
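The gist of it (sketched roughly here, not his actual code): sort by timestamp, then collapse any rows that land within three hours of each other, letting one fill in the other's blanks.

```python
import pandas as pd

def merge_within(df, hours=3):
    """Collapse rows whose timestamps fall within `hours` of each other,
    filling one row's missing values from the other. A sketch of the idea,
    not the original function."""
    df = df.sort_values("timestamp").reset_index(drop=True)
    merged = []
    current = df.iloc[0]
    for _, row in df.iloc[1:].iterrows():
        if row["timestamp"] - current["timestamp"] <= pd.Timedelta(hours=hours):
            # close enough in time: fill current's NaNs with this row's values
            current = current.combine_first(row)
        else:
            merged.append(current)
            current = row
    merged.append(current)
    return pd.DataFrame(merged)
```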

Now that the data was all in one big, happy dataframe, I could finally start on the analysis. Granted, 78 rows of data is not nearly enough to find relationships that are statistically significant, but consider the following a proof of concept.

Here’s a heat map of correlations between every combination of variables.
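Generating it takes only a couple of lines on top of pandas' corr(), something like this (using seaborn's heatmap, though any plotting route works):

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()  # pairwise correlations between every pair of columns
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()
```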

The strongest correlations are what one might intuitively expect, like the number of interactions I have being positively correlated with being at school, my mood being positively correlated with my boyfriend’s company, and stress being positively correlated with productivity (aka, the only motivator). So that’s good.

Since my greater goal for this project is predicting mood, this column is the most relevant:
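Pulling that column out of the correlation matrix is a one-liner (assuming the column is literally named "mood"):

```python
# df.corr() as above; "mood" as the column name is an assumption
mood_corr = df.corr()["mood"].drop("mood").sort_values(ascending=False)
print(mood_corr)
```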

From here, I was eager to do some preliminary training on machine learning models. Here are the results for a linear model:
Decision tree:
And random forests:
The random forest clearly made the most accurate predictions. Here are the predictions, along with the test values, displayed in a lovely data frame:
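The table is just the test values and the random forest's predictions side by side, along these lines (rf, X_test, and y_test being a fitted RandomForestRegressor and the held-out split from the sketch earlier):

```python
import pandas as pd

# rf is the fitted random forest; X_test / y_test come from the train/test split
predictions = rf.predict(X_test)
results = pd.DataFrame({"test value": y_test, "prediction": predictions})
print(results)
```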
Pretty exciting already! Hopefully the results will continue to improve as I continue to gather more data.

Impossible Maps Final: Mapping as Autobiography

For this final, I wanted to visualize my Google Maps location history. I’ve been using different Android devices since I acquired my first in December 2012, less than three months after I moved to NYC. Being a creature of habit/obsessive compulsions, I figured my location history had captured the passing fancies and preoccupations that shaped my development into an independent adult, and my (reluctantly assumed) identity as a New Yorker.

So I downloaded this history from takeout.google.com as a 325MB json file (lol):


(Shoutout to emacs for being the only text editor on my computer capable of opening it.)

Since I wanted to practice my newfound mapbox gl skills, my second obstacle was simply using this file at all, as it wasn’t in geojson format. What I ended up doing was using the d3 library to load the json (for no reason other than being a creature of habit), looping through the data, and pushing the relevant, correctly formatted data into a javascript array.
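The gist of that loop, sketched in Python here rather than the browser-side d3 version: the Takeout file has a top-level "locations" list, with coordinates stored as integers multiplied by 10^7 and timestamps in milliseconds (the file name and field names below follow the usual Takeout format, so treat them as assumptions).

```python
import json

# file name is the typical Takeout default, not necessarily mine
with open("Location History.json") as f:
    locations = json.load(f)["locations"]

# convert the E7 integers into plain lng/lat degrees, and timestamps into seconds
points = [
    {"lng": p["longitudeE7"] / 1e7,
     "lat": p["latitudeE7"] / 1e7,
     "ts": int(p["timestampMs"]) / 1000}
    for p in locations
]
```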

Here’s what I got when I logged the data in the console:


So that wasn’t going to happen. To make it easier on the browser, I ended up filtering out coordinates outside of (approximately calculated) NY bounds. I also divided the data into six arrays, one for each year:

Apparently, the period from 2013 to the present accounts for 982,154 locations out of a total of 1,170,453, which means 188,299 locations (16% of the total) were filtered out for being beyond NYC. The reason array[2], array[3], and array[4] contain less than half of what array[0] and array[1] do is precisely that: I spent the majority of those years traveling. Array[5] is even smaller because it only contains 2018 data.
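The filter plus the year split amounts to something like this (again a Python sketch of the browser-side logic; the bounds are rough, eyeballed numbers):

```python
from datetime import datetime

# approximate NY-area bounding box (illustrative values, not precise city limits)
MIN_LNG, MAX_LNG = -74.3, -73.6
MIN_LAT, MAX_LAT = 40.4, 41.0

years = {year: [] for year in range(2013, 2019)}  # one array per year, 2013-2018
for p in points:
    if MIN_LNG <= p["lng"] <= MAX_LNG and MIN_LAT <= p["lat"] <= MAX_LAT:
        year = datetime.fromtimestamp(p["ts"]).year
        if year in years:
            years[year].append([p["lng"], p["lat"]])
```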

Okay, so the next challenge was injecting this data into a mapbox source layer. Since mapbox expects geojson formatting, I had to hack it a little (i.e., steal someone else’s hack from stackoverflow):
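The hack essentially wraps each year's plain coordinate array in the GeoJSON structure mapbox expects, a FeatureCollection of Point features, shaped roughly like this:

```python
def to_feature_collection(coords):
    """Wrap a list of [lng, lat] pairs in a GeoJSON FeatureCollection,
    the structure a mapbox gl source expects."""
    return {
        "type": "FeatureCollection",
        "features": [
            {"type": "Feature",
             "geometry": {"type": "Point", "coordinates": c},
             "properties": {}}
            for c in coords
        ],
    }
```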

Then, I adapted the filtering legend from this mapbox demo to my page. Here’s what I ended up with:

Here’s the breakdown by year: