A2Z W3 Class Notes

Some JavaScript functions take a regex as an argument:

paragraph.match(/quick/g);

replace() + regex + callback: https://github.com/shiffman/A2Z-F18/blob/master/week2-regex/08_replace_with_callback/sketch.js
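The gist of that sketch's technique, as a minimal example (not its exact code): the second argument to replace() can be a callback that receives each match and returns its replacement.

let paragraph = "The quick brown fox jumps over the lazy dog.";
// the callback runs once per match of /\w+/g
let shouted = paragraph.replace(/\w+/g, match => match.toUpperCase());
console.log(shouted); // "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."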

 

fetch(url)
  .then(response => response.json())
  .then(data => console.log(data))
  .catch(error => console.error(error));

 

CORS workaround: CORS Anywhere, https://github.com/Rob--W/cors-anywhere

GMOs: genetically modified organisms

Gene editing agriculture:

  • USDA regulations on GMOs apply only to those constructed using plant pathogens like bacteria, or their DNA; gene-edited plants are not regulated
  • Calyxt: startup that edits the genes of thousands of plants
    • scientists create designer plants that don’t have foreign DNA; they add or delete snippets of genes (“accelerated breeding”)
    • uses TALEN, co-developed by Calyxt’s founder; it was developed two years earlier than CRISPR and has therefore advanced further toward commercial crops
    • has designed 19 plants, including:
      • soybeans edited for healthier oils (without trans fat)
        • will face competition from similar beans, e.g. a Monsanto GMO
      • a wheat plant that grinds into a white flour with 3x more fiber
    • fast-to-market business model
  • obstacles:
    • easier to design and make DNA strands than to get them inside plants
    • uncertainty over which genes should be edited
      • scientists know how oils are synthesized and why fruit turns brown, but few other plant traits are both well understood genetically and easy to alter

GMOs

  • 90% of the US soybean crop is genetically modified, engineered to tolerate Roundup
  • stigma: 40% of US adults think GMOs are less healthy
    • warring messages from scientists, agriculture lobbies, and NGOs like Greenpeace
  • legal in the US, Brazil, Argentina, and India, but banned throughout much of the rest of the world
  • unclear whether gene-edited crops are considered GMOs
    • no way to tell a gene-edited plant from a natural one
    • lack of scrutiny of whether the plants could harm insects, spread their genetic enhancements to wild populations, or create superweeds
    • New Zealand and the USDA’s organic council decided they are GMOs; the Netherlands and Sweden decided they weren’t; China and the EU have yet to decide

https://www.technologyreview.com/s/609230/these-are-not-your-fathers-gmos/

Neural Aesthetic W3 Class Notes

convolutional neural networks allow you to look for patterns anywhere in the image

Measuring cost

  • look at the shape of the loss function for all combinations of m and b
  • bottom of the “bowl” is the best fit
  • gradient descent (a rough sketch follows this list)
    • start at a random point
    • calculate its gradient (generalization of a slope in multiple dimensions; which direction is the slope going?)
    • go down the gradient until the loss stops decreasing
  • gradient descent for NN
    • backpropagation
    • calculate gradient using chain rule
    • relates the gradient to the individual activations; the error is distributed across the weights
    • problem: local minima; no way of finding the global minimum (“batch gradient descent” is not used because of this)
      • how to deal:
        • calculate the gradient on subsets of the dataset: stochastic gradient descent, mini-batch gradient descent
        • momentum: able to roll out of a local minimum to the next
          • Nesterov momentum
        • adaptive methods: AdaGrad, AdaDelta, RMSprop, ADAM (when in doubt, use ADAM)
    • overfitting
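A minimal sketch of vanilla gradient descent for the m-and-b line fit above, with mean squared error as the loss (all names here are illustrative):

function fitLine(xs, ys, learningRate = 0.05, steps = 1000) {
  let m = Math.random(); // start at a random point
  let b = Math.random();
  for (let i = 0; i < steps; i++) {
    let dm = 0;
    let db = 0;
    for (let j = 0; j < xs.length; j++) {
      let err = (m * xs[j] + b) - ys[j]; // prediction minus truth
      dm += (2 / xs.length) * err * xs[j]; // d(loss)/dm
      db += (2 / xs.length) * err; // d(loss)/db
    }
    m -= learningRate * dm; // go down the gradient
    b -= learningRate * db;
  }
  return { m, b };
}

console.log(fitLine([0, 1, 2, 3], [1, 3, 5, 7])); // ≈ { m: 2, b: 1 }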

Hello, Computer Week 1 / A2Z Week 2 Homework

https://xujenna.github.io/a2z/wk2/index.html

For this week’s homework, I decided to rebuild a markov model with RiTa.js that I had previously created in Python with markovify and NLTK. This time, it would respond (loosely) to a user’s input, and speak its response via the Web Speech API.

I had initially experimented with markov models in python because I had the idea to create a sort of self-care assistant as the final phase of my mood prediction project, and had dreams of it being this omnipotent and omnipresent keeper. While I have yet to figure out how to implement such a presence, I did have an idea of what I wanted it to sound like: a mixture of the exercises in Berkeley’s Greater Good in Action, NY Mag’s Madame Clairevoyant, and Oprah. I had assembled corpuses for each of these personalities manually.

It was incredibly easy to build this markov model with RiTa, and the results were surprisingly coherent; with markovify, it was necessary to POS-ify the text with NLTK in order to force some semblance of grammar into the model. However, RiTa didn’t seem to offer a native option for seed text, so to make the model responsive to a user’s input, I used RiTa’s KWIC model to gather all of the sentences from the source text containing each stemmed word of the input, then loaded what the KWIC returned back into the markov model as an additional, heavily weighted source. The resulting text was consistent about making subtle references to the user’s input.
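A rough sketch of that flow, assuming the circa-2018 RiTa v1 API (RiMarkov, loadText() with an optional weight multiplier, RiTa.kwic(), RiTa.stem()); method names have since changed in RiTa v2, so treat this as v1 pseudocode:

let markov = new RiMarkov(3); // n-gram order 3
markov.loadText(corpusText); // the assembled corpuses

function respond(input) {
  for (let word of RiTa.tokenize(input)) {
    let stem = RiTa.stem(word);
    // every sentence in the corpus containing the stemmed word...
    let hits = RiTa.kwic(corpusText, stem);
    // ...gets fed back in with a high weight, nudging the output
    // toward sentences related to the user's input
    if (hits.length > 0) markov.loadText(hits.join(' '), 5);
  }
  return markov.generateSentences(2).join(' ');
}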

The last step was to feed the markov’s response into the speech synthesizer, which was pretty straightforward, but the creepy, male, pixelated voice gives this experience the uncanny feeling that every divine being deserves.
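The synthesis step itself is just two lines (markovResponse here stands in for the generated text):

let utterance = new SpeechSynthesisUtterance(markovResponse);
window.speechSynthesis.speak(utterance);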

GMOs

Two gene drive approaches:

  • replacement: alters a specific trait
  • suppression: suppresses a gene

CRISPR breaks DNA at a targeted location; the DNA heals itself in two ways:

  • nonhomologous end joining: two ends that were broken get stitched together in a random way
    • eventually confuses CRISPR, which is designed to locate a specific stretch of DNA
  • homology-directed repair: DNA uses a genetic template to heal

CRISPR potential:

  • could stop the spread of disease
  • could correct genes for inherited diseases or disabilities
  • could treat or prevent disease or disability
  • unlimited possibilities

CRISPR concerns:

  • no way to undo a gene drive once it is released in a wild population
  • uncertainty over how it may affect an ecosystem
  • population would likely develop a resistance to the gene drive
  • if carrier populations are edited to withstand diseases, the parasites may mutate
  • can damage DNA that is far from the target location
  • potential cell death after DNA editing
  • the p53 protein could be activated by the stress of CRISPR activity and thwart the edit
  • some people may have already developed resistance to Cas9 (the bacterial protein CRISPR relies on) during common bacterial infections
  • use for “enhancements” that could exacerbate social inequities

A2Z W2 Class Notes

New way of loading data (JSONs) to avoid callback hell:

fetch(url).then(gotData).catch(handleError); // gotData and handleError are callback functions

async/await for sequential execution; avoids promise hell (example below)
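For instance, a hypothetical loader written with async/await, so the steps read top to bottom:

async function loadData(url) {
  try {
    let response = await fetch(url);
    let data = await response.json();
    gotData(data); // same callback as above
  } catch (err) {
    console.error(err);
  }
}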

() => is a replacement for an anonymous function

=> with one line of code, the braces can be omitted:

button.mousePressed(() => background(255,0,0));

loadJSON('data.json', data => console.log(data));

for…of loop:

for (let word of words) {
  let span = createSpan(word);
  span.mouseOver(() => span.style("background-color", "red"));
}

 

REGEX

  • Match a word character: \w (a whole word: \w+)
  • Match beginning of the line: ^
  • Match first word of a line: ^\w+
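Trying those out with match():

let line = "the quick brown fox";
console.log(line.match(/\w+/g)); // ["the", "quick", "brown", "fox"]
console.log(line.match(/^\w+/)); // first word: "the"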

Neural Aesthetic W2 Class Notes

Features are:

  1. patterns in data
  2. implicit
  3. indicative of salient aspects of objects
  4. closely related to bias

Fitting

  • Linear regression doesn’t give much flexibility; you can give a neuron more by passing its output through a non-linearity, e.g. a sigmoid function
    • ReLU (rectified linear unit) is preferred over a sigmoid function
  • adding a hidden layer gives y (the output) even more flexibility
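The two non-linearities, written as plain functions:

let sigmoid = x => 1 / (1 + Math.exp(-x)); // squashes any input into (0, 1)
let relu = x => Math.max(0, x); // 0 for negatives, identity for positives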

Convolutional NNs scan for certain patterns throughout the entire image

activation = the value of a neuron

weights live on the connections between neurons

Final pt 2: Data Analysis

Update: with three additional months of data (80% training, 20% test), our results have improved significantly:

The random forest is still the most accurate, with a mean absolute error of about 0.347.

The linear model comes in a close second, with a mean absolute error of about 0.363.

And the decision tree is the least accurate, with an error of 0.5.
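For reference, mean absolute error is just the average absolute difference between a prediction and its test value. The real scoring used scikit-learn; a toy version of the metric:

function meanAbsoluteError(actual, predicted) {
  let total = 0;
  for (let i = 0; i < actual.length; i++) {
    total += Math.abs(actual[i] - predicted[i]);
  }
  return total / actual.length;
}

console.log(meanAbsoluteError([1, 2, 3], [1, 3, 5])); // 1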


For Part 1 of the documentation on data collection and analysis:

Final

After going through the pandas/scikit-learn wringer this past week, I now realize that collecting and visualizing my data were just the first baby steps of this arduous journey. My first foray into analysis immediately met a big obstacle: my data was a disparate, misaligned mess. So there was a lot of preprocessing to do, for every dataset, before I could merge them all into one pandas dataframe for analysis.

After the datasets were cleaned and nicely tied up with dataframe bows, I needed to merge them all into one. Pandas has a convenient merge method that does such things in one fell swoop, but the problems were that 1) I had a ton of missing values, and 2) discarding incomplete rows reduced my dataset to a mere shadow of its former self. Since there was no way pandas could make reasonable decisions about combining rows with corresponding missing values, I had to do it manually.

Or rather, my boyfriend did:

Not that I totally understand this function, let alone am capable of reproducing it myself, but the crux of it is that it merges incomplete rows that fall within three hours of each other. The result is this beauty:

The merged dataframe, with all of my data, ended up being 78 rows × 45 columns. Not too bad.
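For what it’s worth, here’s a rough sketch of that merging logic in JavaScript (the actual version was a pandas function; the row shape here is made up):

// rows are objects like { timestamp: <ms>, mood: 3, stress: null, ... }
const THREE_HOURS = 3 * 60 * 60 * 1000;

function mergeRows(rows) {
  let sorted = [...rows].sort((a, b) => a.timestamp - b.timestamp);
  let merged = [];
  for (let row of sorted) {
    let last = merged[merged.length - 1];
    if (last && row.timestamp - last.timestamp <= THREE_HOURS) {
      // close enough in time: fill the previous row's gaps from this one
      for (let key of Object.keys(row)) {
        if (last[key] == null) last[key] = row[key];
      }
    } else {
      merged.push({ ...row }); // too far apart: start a new row
    }
  }
  return merged;
}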

Now that the data was all in one big, happy dataframe, I could finally start on the analysis. Granted, 78 rows of data is not nearly enough to find relationships that are statistically significant, but consider the following a proof of concept.

Here’s a heat map of correlations between every combination of variables.

The strongest correlations are what one might intuitively expect, like the number of interactions I have being positively correlated with being at school, my mood being positively correlated with my boyfriend’s company, and stress being positively correlated with productivity (aka, the only motivator). So that’s good.

Since my greater goal for this project is predicting mood, this column is the most relevant:

From here, I was eager to do some preliminary training on machine learning models. Here are the results for a linear model:
Decision tree:
And random forests:
The random forest clearly made the most accurate predictions. Here are the predictions, along with the test values, displayed in a lovely data frame:
Pretty exciting already! Hopefully the results will continue to improve as I continue to gather more data.

Impossible Maps Final: Mapping as Autobiography

For this final, I wanted to visualize my Google Maps location history. I’ve been using different android devices since I acquired my first in December 2012, less than three months after I moved to NYC. Being a creature of habit/obsessive compulsions, I figured my location history had captured the passing fancies and preoccupations that shaped my development into an independent adult, and my (reluctantly assumed) identity as a New Yorker.

So I downloaded this history from takeout.google.com as a 325MB json file (lol):


(Shoutout to emacs for being the only text editor on my computer capable of opening it.)

Since I wanted to practice my newfound mapbox gl skills, the next obstacle was simply using this file at all, as it wasn’t in geojson format. What I ended up doing was loading the json with the d3 library (for no reason other than being a creature of habit), looping through the data, and pushing the relevant, correctly formatted fields into a javascript array.

Here’s what I got when I logged the data in the console:


So that wasn’t going to happen. To make it easier on the browser, I ended up filtering out coordinates outside of (approximately calculated) NY bounds. I also divided the data into six arrays, one for each year:

Apparently, the period from 2013 to present accounts for 982,154 locations out of a total of 1,170,453, which means 188,299 locations (16% of the total) were filtered out for falling outside NYC. The reason array[2], array[3], and array[4] contain fewer than half as many locations as array[0] and array[1] is exactly that: I spent the majority of those years traveling. Array[5] is even smaller because it only contains 2018 data.

Okay, so the next challenge was injecting this data into a mapbox source layer. Since mapbox expects geojson formatting, I had to hack it a little (i.e., steal someone else’s hack from Stack Overflow):
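The gist of the hack: wrap each coordinate pair in a GeoJSON Feature and hand mapbox a FeatureCollection. A sketch, with yearArrays, the layer names, and the styling standing in for my actual code:

let features = [];
yearArrays.forEach((coords, i) => {
  coords.forEach(([lng, lat]) => {
    features.push({
      type: 'Feature',
      properties: { year: 2013 + i },
      geometry: { type: 'Point', coordinates: [lng, lat] }
    });
  });
});

map.on('load', () => {
  map.addSource('history', {
    type: 'geojson',
    data: { type: 'FeatureCollection', features: features }
  });
  map.addLayer({
    id: 'history-points',
    type: 'circle',
    source: 'history',
    paint: { 'circle-radius': 1.5, 'circle-color': '#ff5500' }
  });
  // the year legend can then toggle points with something like:
  // map.setFilter('history-points', ['==', ['get', 'year'], 2015]);
});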

Then, I adapted the filtering legend from this mapbox demo to my page. Here’s what I ended up with:

Here’s the breakdown by year: