Final pt 2: Data Analysis

Update: with three additional months of data (80% training, 20% test), our results have improved significantly:

The random forest is still the most accurate, with a mean absolute error of 0.347.

The linear model comes in a close second, with a mean absolute error of 0.363.

And the decision tree is the least accurate, with a mean absolute error of 0.5.

For Part 1 of the documentation on data collection and analysis:

Final

After going through the pandas/scikit-learn wringer this past week, I now realize that collecting and visualizing my data were just the first baby steps of this arduous journey. My first foray into analysis was met immediately with a big obstacle: my data was a disparate, misaligned mess. So there was a lot of preprocessing to do, for every dataset, before I could merge them all into one pandas dataframe for analysis.

After the data was cleaned and nicely tied up with a dataframe bow, I needed to merge them all into one. Pandas has a convenient merge method that does such things in one fell swoop, but the problems were 1) I had a ton of missing values, and 2) discarding incomplete rows reduced my dataset to a mere shadow of its former self. Since there was no way that pandas could make reasonable decisions about combining rows with corresponding missing values, I had to do it manually.

Or rather, my boyfriend did:

Not that I totally understand this function, let alone feel capable of reproducing it myself, but the crux of it is that it merges incomplete rows that fall within 3 hours of each other.
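As far as I can tell, the closest built-in analogy to what he wrote is pandas' merge_asof with a 3-hour tolerance. A minimal sketch of the idea (not his actual function):

```python
import pandas as pd

# Rough stand-in for the real merge: pair rows from two dataframes whose
# timestamps fall within 3 hours of each other. merge_asof requires both
# frames to be sorted on the key column.
def merge_within_three_hours(left, right, time_col="timestamp"):
    left = left.sort_values(time_col)
    right = right.sort_values(time_col)
    return pd.merge_asof(
        left,
        right,
        on=time_col,
        direction="nearest",
        tolerance=pd.Timedelta(hours=3),
    )
```

The result is this beauty: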

The merged dataframe, with all of my data, ended up being 78 rows × 45 columns. Not too bad.

Now that the data was all in one big, happy dataframe, I could finally start on the analysis. Granted, 78 rows of data is not nearly enough to find relationships that are statistically significant, but consider the following a proof of concept.

Here’s a heat map of correlations between every combination of variables.
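Producing it takes only a few lines, assuming df is the merged dataframe and seaborn is installed:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# df = ...  # the 78 x 45 merged dataframe from above

# Pairwise correlations across all 45 columns, rendered as a heat map.
corr = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm", center=0, square=True)
plt.show()
```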

The strongest correlations are what one might intuitively expect, like the number of interactions I have being positively correlated with being at school, my mood being positively correlated with my boyfriend’s company, and stress being positively correlated with productivity (aka, the only motivator). So that’s good.

Since my greater goal for this project is predicting mood, this column is the most relevant:

From here, I was eager to do some preliminary training on machine learning models.
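The setup for all three was the same scikit-learn routine, roughly like this sketch (assuming the target column is named "mood"):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# df = ...  # the merged dataframe; "mood" is the assumed target column

# 80/20 train/test split, then compare mean absolute error across models.
X = df.drop(columns=["mood"])
y = df["mood"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

for model in (LinearRegression(), DecisionTreeRegressor(), RandomForestRegressor()):
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(type(model).__name__, mean_absolute_error(y_test, predictions))
```

Here are the results for a linear model: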
Decision tree:
And random forests:
The random forest clearly made the most accurate predictions. Here are the predictions, along with the test values, displayed in a lovely data frame:
Pretty exciting already! Hopefully the results will continue to improve as I continue to gather more data.

Impossible Maps Final: Mapping as Autobiography

For this final, I wanted to visualize my Google Maps location history. I’ve been using different Android devices since I acquired my first in December 2012, less than three months after I moved to NYC. Being a creature of habit/obsessive compulsions, I figured my location history had captured the passing fancies and preoccupations that shaped my development into an independent adult, and my (reluctantly assumed) identity as a New Yorker.

So I downloaded this history from takeout.google.com as a 325MB json file (lol):


(Shoutout to emacs for being the only text editor on my computer capable of opening it.)

Since I wanted to practice my newfound mapbox gl skills, my second obstacle was simply using this file at all, as it wasn’t in geojson format. What I ended up doing was loading the json with the d3 library (for no reason other than being a creature of habit), looping through the data, and pushing the relevant, correctly formatted fields into a javascript array.

Here’s what I got when I logged the data in the console:


So that wasn’t going to happen. To make it easier on the browser, I ended up filtering out coordinates outside of (approximately calculated) NY bounds. I also divided the data into six arrays, one for each year:

Apparently, the period from 2013 to the present accounts for 982,154 locations out of a total of 1,170,453, which means 188,299 locations (16% of the total) were filtered out for being beyond NYC. The reason array[2], array[3], and array[4] contain fewer than half as many locations as array[0] and array[1] is precisely that filter: I spent the majority of those years traveling, so most of my recorded locations fell outside New York. Array[5] is even smaller because it only contains 2018 data so far.

Okay, so the next challenge was injecting this data into a mapbox source layer. Since mapbox expects geojson formatting, I had to hack it a little (i.e., steal someone else’s hack from stackoverflow):
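The gist of that hack, transposed to Python here for clarity (the original was JavaScript): wrap each point in a GeoJSON Feature and collect them into a FeatureCollection. The latitudeE7/longitudeE7 fields are how takeout stores coordinates; the NY bounds are my rough approximations.

```python
import json

# Rough NY bounding box (approximate, as in the post).
NY_BOUNDS = {"min_lat": 40.4, "max_lat": 41.0, "min_lng": -74.3, "max_lng": -73.6}

with open("Location History.json") as f:
    locations = json.load(f)["locations"]

features = []
for loc in locations:
    lat = loc["latitudeE7"] / 1e7  # takeout stores coordinates as integers scaled by 1e7
    lng = loc["longitudeE7"] / 1e7
    if (NY_BOUNDS["min_lat"] <= lat <= NY_BOUNDS["max_lat"]
            and NY_BOUNDS["min_lng"] <= lng <= NY_BOUNDS["max_lng"]):
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lng, lat]},
            "properties": {"timestampMs": loc.get("timestampMs")},
        })

# This FeatureCollection is the shape a mapbox geojson source expects.
geojson = {"type": "FeatureCollection", "features": features}
```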

Then, I adapted the filtering legend from this mapbox demo to my page. Here’s what I ended up with:

Here’s the breakdown by year:

Final

Background:

This semester, I began work on a system of trackers for a whole host of potential/evidenced metrics of depression, in hopes of monitoring its cyclical nature and identifying correlations with my activity and environment. Because I had done a lot of prior research, I had specific metrics in mind, but oftentimes the appropriate apps were only available for iOS, didn’t provide an API, didn’t track with enough granularity, or didn’t exist at all.

Being a grad student, I have not the funds for an iPhone (new or old), and so I decided to put my newly acquired python skills to the test.

 

Data collection with homemade trackers:

  1. Mood Reporter: Because affect is difficult to measure, psychiatry traditionally employs self-administered questionnaires as diagnostic tools for mood disorders; these usually attempt to quantify the severity of DSM-IV criteria. The module for depression is called the PHQ-9, and I’ve adapted several of its questions into my own questionnaire, which a python script deploys every hour via the command line:

    The responses are then appended to a tsv (see the sketch after this list):
  2. Productivity: via python and the RescueTime API, my productivity score is appended to a json every hour:
  3. Facial analysis: via my laptop’s webcam, the Affectiva API analyzes my face for a minute every hour; all its responses are saved to a json file. My python script grabs the min and max attention and valence values, as well as the expressions made (plotted with emoji) and the number of times I blinked (calculated by counting how many times the eyeClosure variable hit 99.9% and dividing by two). These calculations are then appended to another json file that feeds into my visualization. The final entry for each hour looks like this:

  4. Keylogger Sentiment Analysis: The idea for this is simply to discern the sentiment of everything I type. I wrote a keylogger in python, which collects any coherent phrase to be sent to IBM Watson’s Tone Analyzer every hour. The response looks like this:

    The API provides several sentiment categories: joy, confidence, analysis, tentativeness, sadness, fear, and anger.
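Since Mood Reporter is the simplest of these trackers, here is a stripped-down sketch of its prompt-and-append step. The questions are placeholders, not my actual PHQ-9 adaptation, and a scheduler (cron or similar) would run it hourly:

```python
import csv
import datetime

# Placeholder questions standing in for my adapted PHQ-9 items.
QUESTIONS = [
    "How is your mood right now (0-5)? ",
    "How is your energy right now (0-5)? ",
]

def prompt_and_log(path="mood_log.tsv"):
    # Ask each question on the command line and collect the answers.
    answers = [input(question) for question in QUESTIONS]
    # Append a timestamped row to the tsv.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow([datetime.datetime.now().isoformat()] + answers)

if __name__ == "__main__":
    prompt_and_log()
```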

 

The Dashboard:

In order to understand any of this data, I would need to create a dashboard. What was important to me was to create an environment where potential correlations could be seen; since much of this is speculative, that basically meant doing a big data dump into the browser. I visualized everything with d3.js.

My local dashboard has access to the hourly updated data, which is unbelievably satisfying; the public version has about 2.5 weeks’ worth.

 

Next steps:

I’m in the process of building yet another tracker: a Chrome extension which will record my tab/window activity (the amount of which is probably/definitely positively correlated with stress and anxiety in my life!).

I would also like to add a chart that allows me to compare the trendlines of all the metrics, as a preliminary attempt to guess at correlations. This will definitely require me to do a lot of data reformatting.
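That reformatting will probably amount to something like this pandas sketch: resample every metric onto a shared hourly index, then rescale, so different units can share one y-axis. (The file and column names here are invented.)

```python
import pandas as pd

# Hypothetical file/column names; the real logs live in assorted json/tsv files.
mood = pd.read_csv("mood_log.tsv", sep="\t", parse_dates=["timestamp"])
productivity = pd.read_json("productivity.json", convert_dates=["timestamp"])

series = {
    "mood": mood.set_index("timestamp")["score"],
    "productivity": productivity.set_index("timestamp")["score"],
}

# Align everything on a common hourly index...
aligned = pd.DataFrame({name: s.resample("H").mean() for name, s in series.items()})
# ...then rescale each column to [0, 1] so the trendlines can be overlaid.
normalized = (aligned - aligned.min()) / (aligned.max() - aligned.min())
```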

I also need to visualize the data from the tracking apps I did download (Google Fit and Exist.io), and include other environmental information like weather, calendar events, etc.

Honestly, I will probably be working on this for the rest of my life lol

Hacking the Browser W5 HW

For my Hacking the Browser final, I would like to create a Chrome extension that can monitor my browser activity (to add to my suite of trackers) and produce hourly values for:

  1. the total number of tabs open at the end of the hour
    • chrome.tabs.query(object queryInfo, function callback)
  2. the total number of windows open at the end of the hour
    • chrome.windows.getAll(object getInfo, function callback)
  3. the total number of tabs opened during the hour
    • chrome.tabs.onCreated.addListener(function callback)
  4. the total number of windows opened during the hour
    • chrome.windows.onCreated.addListener(function callback)
  5. the total number of tabs looked at during the hour
    • chrome.tabs.onActivated.addListener(function callback)
  6. the favicon from every updated tab
    • chrome.tabs.onUpdated.addListener(function callback)
    • tab.favIconUrl (this requires the “tabs” permission)

I believe I’ll only require a background script for this project, as I won’t be inserting any code into the pages I visit, and won’t need a browser or page action. The difficult part will be figuring out how to access the data every hour. There must be an easier way, but my only idea at the moment is to do an AJAX post to MongoDB (sketched below)…
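For what it’s worth, the receiving end of that post could be as small as this hypothetical Flask + pymongo sketch:

```python
from flask import Flask, request
from pymongo import MongoClient

# Hypothetical endpoint for the extension's hourly POST: each payload is
# dropped into a local MongoDB collection as-is.
app = Flask(__name__)
collection = MongoClient()["trackers"]["browser_activity"]

@app.route("/log", methods=["POST"])
def log():
    doc = request.get_json()  # e.g. {"tabsOpen": 42, "windowsOpen": 3, ...}
    collection.insert_one(doc)
    return "", 204

if __name__ == "__main__":
    app.run(port=5000)
```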

Impossible Maps W4 HW

Feminist data viz notes:

  • Feminist standpoint theory: all knowledge is socially situated; the perspectives of oppressed groups (women, minorities, etc) are systematically excluded from “general” knowledge
  • Feminist data viz could:
    1. invent new ways to represent uncertainty, outsides, missing data, and flawed methods
      • can we collect and represent data that was never collected?
      • can we find the population that was excluded?
      • can we critically examine the methods of study rather than accepting the JSON as is?
    2. invent new ways to reference the material economy behind the data
      • what are the conditions that make data viz possible?
      • who are the funders?
      • who collected the data?
      • interested parties/stakeholders behind the data?
    3. make dissent possible
      • data viz = stable images/facts
      • re-situate data viz by destabilizing, ie making dissent possible
        • how can we talk back to the data?
        • how can we question the facts?
        • how can we present alternative views and realities?

Representation and the Necessity of Interpretation notes:

  • satellite imagery was, until recently, a military secret
  • in 2000, the NYT for the first time used the newly available Ikonos satellite “as a sort of alternative investigative journalist in Chechnya” but “failed to arouse public sympathy or outrage”; before/after images have nevertheless become commonplace in reporting from conflict zones
  • Sept 1999: Space Imaging launched Ikonos, the first satellite to make high-res image data publicly available
  • We need to be alert to what is being highlighted and pointed toward, to the ways in which satellite evidence is used in making assertions and arguments; for every image, we should be able to inquire about its technology, location data, ownership, legibility, and source

 

Response: 

  • I never realized that satellite imagery was born from the agenda of the US military, though it’s not surprising. What struck me most from the latter reading was learning that Colin Powell used satellite images as incontrovertible proof that there were weapons of mass destruction in Iraq; I don’t think you can get a much better example of “interpreted data”.
  • One year later, in 2003, Ross McNutt’s team put a 44-megapixel camera on a small plane to watch over Fallujah, Iraq. Its images were high-def enough to track the sources of roadside bombs, and it was on all day, every day. After the war, McNutt piloted this technology in Dayton, Ohio, as a way for the local police to identify criminals and gang members.
  • When I first heard this story, I didn’t feel too conflicted about it: bad guys were being caught and brought to justice, so what’s the problem? However, after reading Laura Kurgan’s chapter on representation and interpretation, it now feels like McNutt was just thinking locally about persecuting people of color, especially considering that a program like his would only be implemented in larger urban areas, i.e., where most minorities live.

 

Final project idea:

  • I’d like to download my location history from Google, and visualize it to get a sense of my navigational habits/biases and identify opportunities for breaking out of my comfort zone
  • I thought this was a nice use of satellite imagery; this view shows the dramatic urbanization of Shanghai over 30 years, particularly the waterfront along the Huangpu River. Also fascinating is the expanding, presumably manmade coastline

 

API of You W2 + W3 Homework

For the third week’s assignment, I finessed last week’s viz into something much more coherent.

For the final, I would like to create a meaningful, comprehensive dashboard for all the data I’ve collected with my homemade trackers. I’ve chosen to measure several facets of my life, motivated by scientific evidence and/or personal belief that they may be metrics for stress, anxiety, and/or depression. Currently, this data is either scattered in isolated visualizations, or just sitting around in json/csv/tsv files. Additionally, this data is only tracked and available on my local machine.

First, we have my aforementioned keylogger data:

This “mind wandering” viz that receives data from my chrome history and the RescueTime API:

Data from Affectiva’s emotion recognition model, which I am mostly using for valence and engagement (the viz for which clearly needs work):

Most importantly, I’d like to figure out some way to visualize this self-reported mood data, which prompts me hourly:

Time allowing, I would also like to include a report on my daily photo subjects, similar to this flickr archive analysis I did with the Clarifai API:

There’s also geolocation and physical activity/sleep data that I’d like to include, which is being tracked by apps on my phone.

 

API of You W1 HW

I’ve worked a lot with JSONs before, so for this week’s assignment, I decided to just visualize a big JSON file that I’d been putting off for a while: the sentiment analysis results of about 20 days’ worth of key logs. Below is an example of what a (smaller) object might look like for an hour’s worth of selected logs:

I’ve been procrastinating on visualizing this data because it’s nested enough to require effort, but here it is:

Not sure why I have a four-day gap in the middle of my data!

There are eight possible “tones” with IBM Watson’s Tone Analyzer API (color-coded above), and sentences are often assigned multiple tones. I added a rollover tooltip that displays the sentence in question, which is why I’m not yet ready to put the live viz online 🙂
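For the curious, untangling the nesting into flat rows for d3 amounts to something like this sketch (assuming Watson’s sentences_tone response shape; the file name and outer structure are guesses at my own format):

```python
import json

# Flatten nested Tone Analyzer results into (timestamp, sentence, tone, score)
# rows, one row per tone assignment.
rows = []
with open("keylog_tones.json") as f:
    entries = json.load(f)

for entry in entries:
    for sentence in entry.get("sentences_tone", []):
        for tone in sentence.get("tones", []):
            rows.append({
                "timestamp": entry.get("timestamp"),
                "sentence": sentence.get("text"),
                "tone": tone.get("tone_name"),
                "score": tone.get("score"),
            })
```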

Impossible Maps W1 HW

So I have some comments.

I may be a bit biased, being a near-daily Google Maps user myself, but I like Google Maps quite a bit better than Apple Maps. In Part 1, the author points out that far more cities are labeled in Apple Maps, particularly at zoom 8. Labeling 44 cities in such a small space completely clutters up the tile and renders it all but illegible. You don’t need that much detail at a higher-level view. Also, at a higher-level view, chances are you’re driving and not walking, which is probably why Google prioritized “shields” over cities.

However, at a lower-level view, Apple has these interesting high-fidelity, individual landmark markers rather than a generic marker for each type of POI. As a person who navigates by landmark and gets confused by street names, I actually do appreciate this detail.

Because of this, and because Google Maps tends to label far more roads and “shields” than Apple, I’d hypothesize that Apple Maps is prioritizing the pedestrian, while Google Maps is prioritizing the driver. Yet Apple Maps seems to give you more information at higher-level zooms, then dissolves into minimalism just as you zoom in expecting more detail. As a Manhattanite, I need those subway station markers!

I would also like to express my horror at this “Frankenstein map”:

What good is that much information when you can’t read it? And when do you ever need that much information?

If users do indeed crave “the whole picture”, perhaps there should be two map modes: one for navigation, which emphasizes roads and their labels; and another for general exploration, which emphasizes cities and POIs. As a chronic pedestrian and global traveler, I honestly have no need for the former: I’m either walking to a building or subway station and therefore only need street names at a low-level zoom, or I’m zoomed all the way out planning my next vacation and therefore only need political borders and major city names.

QH W7 HW: Final Project Proposal Outline

Digital phenotyping + self-care/intervention

  • Background research/ project landscape:
    • Mindstrong: This is basically what I came to ITP to do: develop a tracking system that can detect/predict/prevent the onset of depression. Mindstrong is aiming to tackle many mental illnesses with only smartphone data, whereas I am primarily gathering data from laptop use. Co-founded by Dr. Tom Insel, who formerly led Verily’s mental health team.
    • PRIME app: an app developed by UCSF researchers for clinical trials studying the effect of social support on the severity of schizophrenia
    • Fine: a mood reporter app that tracks self-reported data (not available for use)
    • trackyourhappiness.org: an ongoing doctoral research project by Matt Killingsworth at Harvard, which prompts you throughout the day to find out what factors are associated with happiness.
    • PHQ-9: standard self-reported questionnaire for depression severity
    • Exist: links all your tracking apps to find correlations
    • I feel like shit game: an interactive self-care flow chart; asks you questions about your state and offers self-care suggestions
    • Headspace: there’s a meditative exercise tailored for nearly every mood possible
  • Hypothesis / Definition of question(s):
    • What factors in my life contribute to stress, anxiety, low morale/motivation, and negative affect in general?
    • What factors contribute to high morale, motivation, positive affect, and a more balanced feeling of well-being?
    • How might I facilitate the latter factors?
    • What interventions are appropriate?
  • Objectives:
    1. a system of trackers
    2. a dashboard for data viz
    3. machine-learned correlations (ultimately)
    4. self-care recommendations
  • Goals:
    • To address issues with current treatments:
      • Therapy:
        • lag time (“A therapist, the joke goes, knows in great detail how a patient is doing every Thursday at 3 o’clock.”)
        • No performance feedback outside of potentially dishonest self-reporting
        • patient fear of disappointing therapist
        • variations in individual therapist efficacy
      • Medication:
    • To catalyze self-awareness of emotions and their triggers
    • To facilitate self-care
    • To encourage healthier digital habits
  • Technical considerations/next steps:
    • Finish trackers:
      • tab/window counter: chrome extension
      • new affectiva approach: chrome extension background page?
      • unreturned messages
      • self-reported mood
    • Research visualizations
    • Research more behavioral metrics

SLIDES HERE