Final pt 2: Data Analysis

Update: with three additional months of data (80% training, 20% test), our results have improved significantly:

The random forest is still the most accurate, with a mean absolute error of 0.34736842105263155.

The linear model comes in close second, with a mean absolute error of 0.3631925298277533.

And the decision tree is the least accurate, with an error of 0.5.



For Part 1 of the documentation on data collection and analysis:


After going through the pandas/scikit-learn wringer this past week, I now realize that collecting and visualizing my data were the first baby steps of this arduous journey. My first foray into analysis was met immediately with a big obstacle: my data was a disparate, misaligned mess. So there was a lot of preprocessing to do, for every dataset, before I could merge them all into one pandas dataframe for analysis.

After each dataset was cleaned and nicely tied up with a dataframe bow, I needed to merge them all into one. Pandas has a convenient merge method that does such things in one fell swoop, but the problems were 1) I had a ton of missing values, and 2) discarding incomplete rows reduced my dataset to a mere shadow of its former self. Since there was no way that pandas could make reasonable decisions about combining rows with corresponding missing values, I had to do it manually.

Or rather, my boyfriend did:

Not that I totally understand this function, let alone am capable of reproducing it myself, but the crux of it is that it merges incomplete rows that fall within 3 hours of each other. The result is this beauty:

The merged dataframe, with all of my data, ended up being 78 rows × 45 columns. Not too bad.
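
For illustration, here is a minimal sketch of that kind of merge. It is not the original function: the name `merge_nearby_rows`, the `timestamp` column, and the example values are all made up, and it assumes "within 3 hours of each other" means within 3 hours of the first row in a group.

```python
import numpy as np
import pandas as pd


def merge_nearby_rows(df, time_col="timestamp", window=pd.Timedelta(hours=3)):
    """Coalesce rows whose timestamps lie within `window` of their group's first row."""
    df = df.sort_values(time_col).reset_index(drop=True)

    group_ids = []
    group_start = None
    current = -1
    for ts in df[time_col]:
        # Start a new group once a row is more than `window` past the group's first row.
        if group_start is None or ts - group_start > window:
            current += 1
            group_start = ts
        group_ids.append(current)

    # GroupBy.first() keeps the first non-null value of each column per group,
    # so partial rows fill in each other's gaps.
    groups = pd.Series(group_ids, index=df.index)
    return df.groupby(groups).first().reset_index(drop=True)


# Made-up example: the first two rows are 1.5 hours apart and get combined.
example = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2018-04-01 09:00", "2018-04-01 10:30", "2018-04-01 15:00"]
    ),
    "mood": [3.0, np.nan, 4.0],
    "productivity": [np.nan, 62.0, np.nan],
})
merged = merge_nearby_rows(example)
```

Values that are missing from every row in a group stay missing, so the merged dataframe can still contain gaps.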

Now that the data was all in one big, happy dataframe, I could finally start on the analysis. Granted, 78 rows of data is not nearly enough to find relationships that are statistically significant, but consider the following a proof of concept.

Here’s a heat map of correlations between every combination of variables.

The strongest correlations are what one intuitively might expect, like the number of interactions I have being positively correlated with being at school, my mood being positively correlated with my boyfriend’s company, and stress being positively correlated with productivity (aka, the only motivator). So that’s good.
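
The numbers behind a heat map like this come from a correlation matrix. A rough sketch of computing one with pandas and pulling out a single column of interest, assuming a numeric column called `mood` (the column names and values here are invented for the example, and the plotting itself is left out):

```python
import pandas as pd


def correlations_with(df, target="mood"):
    """Pearson correlation of every other numeric column with `target`, strongest first."""
    corr = df.select_dtypes(include="number").corr()
    return corr[target].drop(target).sort_values(ascending=False)


# Tiny invented dataset, just to show the shape of the result.
toy = pd.DataFrame({
    "mood": [1, 2, 3, 4, 5],
    "boyfriend_company": [2, 4, 6, 8, 10],
    "stress": [5, 4, 3, 2, 1],
})
mood_column = correlations_with(toy)
```

The full matrix returned by `corr()` is what a heat map would display; the function above only keeps the one column.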

Since my greater goal for this project is predicting mood, this column is the most relevant:

From here, I was eager to do some preliminary training on machine learning models. Here are the results for a linear model:
Decision tree:
And random forests:
The random forest clearly made the most accurate predictions. Here are the predictions, along with the test values, displayed in a lovely data frame:
Pretty exciting already! Hopefully the results will continue to improve as I continue to gather more data.
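
For anyone curious, a comparison like this can be sketched in scikit-learn roughly as follows. This is not my exact notebook: it assumes all three models are regressors, uses the 80/20 split mentioned in the update at the top, and runs on random synthetic data (`X`, `y`) instead of my real dataframe.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_models(X, y, test_size=0.2, seed=0):
    """Fit three regressors on the same split and return each one's mean absolute error."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    models = {
        "linear model": LinearRegression(),
        "decision tree": DecisionTreeRegressor(random_state=seed),
        "random forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    errors = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        errors[name] = mean_absolute_error(y_test, model.predict(X_test))
    return errors


# Synthetic stand-in data: 78 rows to match the size of the merged dataframe.
rng = np.random.default_rng(0)
X = rng.normal(size=(78, 5))
y = X @ np.array([1.0, -0.5, 0.0, 0.3, 0.0]) + rng.normal(scale=0.1, size=78)
errors = compare_models(X, y)
```

With so few rows, the ranking can easily change from one random split to the next, so a single run like this is only a rough indication.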



This semester, I began work on a system of trackers for a whole host of potential/evidenced metrics of depression, in hopes of monitoring its cyclical nature and identifying correlations with my activity and environment. Because I had done a lot of prior research, there were specific metrics that I had in mind, but oftentimes appropriate apps were either only available for iOS, didn’t provide an API, didn’t track with enough granularity, or didn’t exist at all.

Being a grad student, I have not the funds for an iPhone (new or old), and so I decided to put my newly acquired python skills to the test.


Data collection with homemade trackers:

  1. Mood Reporter: Because affect is difficult to measure, psychiatry traditionally employs self-administered questionnaires as diagnostic tools for mood disorders; these usually attempt to quantify the severity of DSM-IV criteria. The module for depression is called the PHQ-9, and I’ve adapted several of its questions into my own questionnaire, which python deploys every hour via the command line:

    The responses are then appended to a tsv:
  2. Productivity: via python and the RescueTime API, my productivity score is appended to a json every hour:
  3. Facial analysis: Via my laptop’s webcam, the Affectiva API analyses my face for a minute every hour; all its responses are saved to a json file. My python script grabs the min and max attention and valence values, as well as the expressions made (plotted with emoji) and the number of times I blinked (calculated by dividing the number of times the eyeClosure variable hit 99.9% by 2). These calculations are then appended to another JSON file that feeds into my visualization. The final entry for each hour looks like this:

  4. Keylogger Sentiment Analysis: The idea for this is simply to discern the sentiment of everything I type. I wrote a keylogger in python, which collects any coherent phrase to be sent to IBM Watson’s Tone Analyzer every hour. The response looks like this:

    The API provides several sentiment categories: joy, confidence, analysis, tentativeness, sadness, fear, and anger.
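
Two of the steps above are simple enough to sketch in a few lines: appending a questionnaire response to a tsv, and boiling a minute of facial analysis down to a summary. These are simplified stand-ins rather than my actual scripts. The function names are hypothetical, and the shape of the per-frame data (a list of dicts with `attention`, `valence`, and `eyeClosure` keys) is an assumption made for the example, not the Affectiva response format.

```python
from datetime import datetime


def format_mood_row(answers, now):
    """Build one tab-separated line: an ISO timestamp followed by the answers."""
    return "\t".join([now.isoformat()] + [str(a) for a in answers])


def append_mood_response(path, answers, now=None):
    """Append one questionnaire response to the tsv at `path`."""
    row = format_mood_row(answers, now or datetime.now())
    with open(path, "a", encoding="utf-8") as f:
        f.write(row + "\n")


def summarize_face_frames(frames, closed_threshold=99.9):
    """Reduce a minute of per-frame readings to min/max values and a blink estimate."""
    attention = [f["attention"] for f in frames]
    valence = [f["valence"] for f in frames]
    # Count the frames where eyeClosure reached the threshold, then halve the
    # count, following the calculation described in the list above.
    closures = sum(1 for f in frames if f["eyeClosure"] >= closed_threshold)
    return {
        "attention_min": min(attention),
        "attention_max": max(attention),
        "valence_min": min(valence),
        "valence_max": max(valence),
        "blinks": closures / 2,
    }


# Made-up readings for illustration only.
frames = [
    {"attention": 91.0, "valence": -3.0, "eyeClosure": 0.0},
    {"attention": 88.5, "valence": 12.0, "eyeClosure": 99.9},
    {"attention": 95.2, "valence": 4.5, "eyeClosure": 100.0},
    {"attention": 97.0, "valence": 0.0, "eyeClosure": 12.5},
    {"attention": 90.1, "valence": 7.0, "eyeClosure": 99.95},
    {"attention": 93.3, "valence": 1.0, "eyeClosure": 99.9},
]
summary = summarize_face_frames(frames)
```

Keeping the row formatting separate from the file write makes it easy to check the output without touching the disk.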


The Dashboard:

In order to understand any of this data, I would need to create a dashboard. What was important to me was to create an environment where potential correlations could be seen; since much of this is speculative, this basically meant doing a big data dump into the browser. I visualized everything in d3js.

My local dashboard has access to the hourly updated data, which is unbelievably satisfying; the public version has about 2.5 weeks’ worth.


Next steps:

I’m in the process of building yet another tracker: a Chrome extension which will record my tab/window activity (the amount of which is probably/definitely positively correlated with stress and anxiety in my life!).

I would also like to add a chart that allows me to compare the trendlines of all the metrics, as a preliminary attempt to guess at correlations. This will definitely require me to do a lot of data reformatting.

I also need to visualize the data from the tracking apps I did download (such as Google Fit), and include other environmental information like weather, calendar events, etc.

Honestly, I will probably be working on this for the rest of my life lol

Project Development Studio W1 HW: Dream, Vision, Goal, Plan

Dream: My dream is to quantify depression, and use those numbers to establish a centralized and comprehensive system that empowers both the afflicted and their medical professionals to better understand, manage, and treat the cyclical nature of the disorder. I imagine a tool that will put an ever-on-call psychologist, neurologist, psychiatrist, and personal assistant in the pocket of patients who lack the energy and concern to care for themselves.

Vision: I would like to build a system of wearables that monitor the tracked biometrics and self-reported markers of depression. With the user’s baseline state as reference, the system would employ machine learning and the user’s self-reported corroborations to label biometric deviations. Once the system learns to read the user’s mood, it will provide recommendations on self-care and subsequently learn which methods work best for the user, and when. The system will also retain an archive of visualized data for medical professionals to assess during appointments.

Goal: My goal for this course is to create an EEG wearable that I can use on a daily basis. The headset will be 3D-printed and use a Bluetooth Arduino to send data to my computer or phone, but if this proves unreliable I will just save the data on an SD card to upload at the end of each day. The EEG will record my brainwaves, which will hopefully reveal when I blink (an indicator of mind-wandering) and whether I’m focused (theta dominance in the prefrontal ACC).

In addition, I will be using some ready-made and beta trackers for my Quant Humanists class, and hope to export all the tracked data into one system that visualizes them together. This endeavor will also be supported by the course API of You, which will start in the second half of the semester.

Plan – What is your game plan to achieve the goal in 10 weeks, what is the research required, what are the milestones. Try to come up with about 5 milestone dates towards a completion beginning of May.