The Future of Data Analysis, Tukey 1962

“Individual parts of mathematical statistics must look for their justification toward either data analysis or pure mathematics.”

Large parts of data analysis are (though data analysis as a whole is larger and more varied than these):

  • inferential in the sample-to-population sense
  • incisive, revealing indications imperceptible by simple examination of raw data
  • allocation, guiding us in observation, experimentation, or analysis

How can new data analysis be initiated?

  1. seek out wholly new questions to be answered
  2. tackle old problems in more realistic frameworks
  3. seek out unfamiliar summaries of observational material, and establish their useful properties
  4. still more novelty can come from finding, and evading, still deeper lying constraints

data analysis is a science because it has 1) intellectual content, 2) organization into an understandable form, and 3) reliance upon the test of experience as the ultimate standard of validity

(Mathematics is not a science: its standard of validity is agreed-upon logical consistency and provability)

Data analysis, and the parts of statistics which adhere to it, must…take on the characteristics of science rather than those of mathematics:

  1. must seek for scope and usefulness rather than security
  2. must be willing to err moderately often in order that inadequate evidence shall more often suggest the right answer
  3. must use mathematical argument and mathematical results as bases for judgment rather than as bases for proof or stamps of validity

data analysis is intrinsically an empirical science

data analysis must look to a very heavy emphasis on judgement:

  1. judgement based upon the experience of the particular field of subject matter from which data come
  2. judgement based upon a broad experience with how particular techniques of data analysis have worked out in a variety of fields of application
  3. judgement based upon abstract results about the properties of particular techniques, whether obtained by mathematical proofs or empirical sampling

a scientist’s actions are guided, not determined, by what has been derived from theory or established by experiment

scientists know that they will sometimes be wrong; they try not to err too often, but they accept some insecurity as the price of wider scope; data analysts must do the same

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” Data analysis must progress by approximate answers, at best, since its knowledge of what the problem really is will at best be approximate.

electronic rituals wk1 notes

phenomenology: the study of subjective experience; what can’t be observed externally but only reported by the person experiencing it

bracketing/epoche: act of setting aside whether the world exists in favor of focusing on subjective experiences; no need to prove what happened

hermeneutic(s): ways of experiencing something; methods of interpretation; approaches to extracting information; e.g. different tarot readings of the same hand

ethnography: methodology for conducting social science; emphasizes field work and case studies, rather than lab experiments and statistical conclusions

“the numinous”: a set-apart space, operating under a different set of rules, in order to evoke a phenomenological experience; e.g. a labyrinth

Rituals:

  • rites of passage (baptism)
  • seasonal/calendrical (halloween)
  • political (elections)
  • religious (prayers)
  • interpersonal (TNO)

Rituals are related to performance, play, gesture, discourse, studies, etc.

Divination: ritualized practices of finding things out; “subjunctive”

  • cleromancy: casting and reading (dice, cards, i ching)
  • augury, prophecy and prediction: interpreting natural phenomena
  • mediums and messages: reading at a distance; Ouija board
  • spells, hexes

How New York are you? [documentation]


“So, do you feel like a real New Yorker yet?”

How can a recent transplant possibly answer this question without sounding like as much of a jerk as the other recent transplant who just asked it? For the past six years, my go-to has been “fuck that, I’m from Chicago”—but as a wise friend recently advised me, if you don’t have anything nice to say, just respond with a number.

How New York are you? is a voice-controlled browser game where two players compete to be crowned the realest New Yorker. The computer volleys hot topic keywords from the past year, and each player has one shot per topic to prove how aligned they are with the most common New York opinions. The quicker and closer the response, the more points earned.

To make this game, I first used twint, a Twitter-scraping Python module, to gather tweets originating from New York during 2018 that were relevant to popular topics on Twitter that year. Then I used this corpus to train a word2vec model for each topic using gensim.

When building my initial idea, I had uploaded word2vec models directly to the browser with tensorflowjs/some code stolen from ml5js, then used tensorflowjs’s tsne library to reduce the vectors to two dimensions for visualization (beware your array types when using this library!). However, these calculations proved to be too burdensome to perform before each game, so for the final iteration, I ended up doing the tsne reduction in python (adapting a script from Yuli Cai’s workshop last year)—then uploading the two dimensional vectors to the browser instead. On Gene’s suggestion, I plan to reduce the models to three dimensions instead, then reduce to two dimensions with tensorflowjs during gameplay, in order to get more accurate results.
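The offline reduction step described above looks roughly like this; here scikit-learn's t-SNE stands in for the adapted workshop script, and random vectors stand in for real word2vec embeddings:

```python
# Sketch: project high-dimensional word vectors down to 2-D offline,
# so the browser only ever loads the reduced coordinates.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 100))   # 50 words x 100-dim embeddings

tsne = TSNE(n_components=2, perplexity=10, random_state=0)
coords = tsne.fit_transform(vectors)   # shape (50, 2), ready for d3
```

The plan mentioned at the end (reduce to three dimensions offline, then to two in the browser) would just change `n_components` here and add a second reduction client-side.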

I used Chrome’s Speech Synthesis API to announce the topic for each round, as well as its Speech Recognition API to capture each player’s responses (recognition.interimResults is everything). I hope to someday make a version for Firefox as well.

Once a player responds to a topic and the API transcribes the response, tensorflowjs calculates the distance between each word in the response and the original keyword, then averages the distances to produce a final score for the turn. The greater the distance and the slower the response, the lower the score.
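A hedged sketch of that scoring logic, in Python rather than the game's actual tensorflowjs code; the vectors, the time cap, and the way speed and closeness are combined are all illustrative assumptions:

```python
# Average cosine distance between each response word and the keyword,
# penalized by response time: far words and slow answers both lower the score.
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

def score_turn(keyword_vec, response_vecs, seconds, max_seconds=10.0):
    avg_dist = sum(cosine_distance(keyword_vec, v) for v in response_vecs) / len(response_vecs)
    closeness = max(0.0, 1.0 - avg_dist)           # closer words -> higher
    speed = max(0.0, 1.0 - seconds / max_seconds)  # faster answers -> higher
    return closeness * speed

fast = score_turn([1.0, 0.0], [[0.9, 0.1]], seconds=2.0)
slow = score_turn([1.0, 0.0], [[0.9, 0.1]], seconds=8.0)
```

With the same response, answering in 2 seconds scores higher than answering in 8.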

d3js then plots the respective embeddings in the browser. At the end, if the winner’s score surpasses the tenth highest score in history, they can add their name to the high score board for eternal fame and glory.

Play the game here.

NLP (neural aesthetic class notes)

skip-gram: predicts next/prev word(s) based on present word

CBOW (continuous bag of words): opposite of skip-gram; input is the surrounding context words, output is the center word

embedding size = possible relational directions

universal sentence encoder: colab, arxiv

hierarchical neural story generator (fairseq): repo

tracking the drift of words

wiki-tSNE: groups wikipedia articles by topic

python library wikipedia

  • text = wikipedia.page("New York University").content

spacy: better than nltk? can parse entities, e.g. organizations (New York University), times (12pm), etc

Final Project Proposal

See the final here

For my final project, I’d like to create a browser game in which two players compete to think of words as unrelated to each other as possible, as quickly as possible. The browser will keep score, which is determined by 1) the distance between two words as defined by word2vec models, and 2) the time it takes for the player to think of their word. The browser will also map the players’ words based on a tsne reduction of the word2vec model, in order to provide a visual indicator of performance.

Collect inspirations: How did you become interested in this idea? 

I love the idea of statistically analyzing text, and have really enjoyed building Markov models and training LSTMs in the past. Word2vec is especially interesting because it’s able to map words semantically, and does this solely through the analysis of large corpora. Depending on the dataset, visualizing these relationships can reveal a lot about how the source perceives the world.

Collect source material:

  1. text sources: Wikimedia dump, Google News (pre-trained word2vec), kanye tweets, wiki-tSNE for different topics (art movements, periods of history, celebrities, movies, etc)
  2. nltk + punkt to clean data, remove stop words
  3. gensim to train word2vec model
  4. tensorflowjs to calculate distance
  5. tensorflowjs tsne library to visualize

Collect questions for your classmates.

  • What should the title be?
  • Game features?
  • Text sources?

What are you unsure of? Conceptually and technically.

  • How to use pre-trained GloVe models with tensorflowjs/ml5?
  • Is this a fun game??

Class Notes:

  • show averages between words (as explanations)
  • narrative

Neural Aesthetic Class Notes wk8

limitations of feed-forward NNs:

  • static, does not change over time
  • does not take advantage of context
  • inputs and outputs are fixed length

sequence to sequence: language translation

unit to sequence: image captioning

skip-thought vectors: arbitrary sequences of words (image to story)

dense captioning: multiple captioning for objects within images

text to image (StackGAN)

Talking and Storytelling w1 reading notes


  1. Monomyth (the hero’s journey):
    • Structure:
      • leaves home to a threatening, unknown place
      • overcomes a great trial
      • returns home with newfound wisdom
    • Good for:
      • explaining what has brought you to the wisdom you’re sharing
      • bringing the message alive
      • showing the benefits of taking risks
  2. The Mountain
    • way of mapping the tension and drama in a story
    • doesn’t necessarily have a happy ending
    • Structure
      • scene is set
      • series of small challenges and rising action
      • climactic conclusion
    • like a TV series: each episode has ups and downs, all building up to a big season finale
    • Good for:
      • showing how you overcame a series of challenges
      • slowly building tension
      • delivering a satisfying conclusion
  3. Nested Loops
    • three or more narratives are layered within each other
    • Structure
      • the center = the most important story with the core of your message
      • outside layers elaborate or explain the central principle
      • the first story you begin with is the last story you finish, the second story you start is the penultimate you finish, etc
    • Good for:
      • explaining the process of how you were inspired/came to a conclusion
      • using analogies to explain a central concept
      • showing how a piece of wisdom was passed to you
  4. Sparklines
    • way of mapping presentation structures
    • very best speeches succeed because they contrast our ordinary world with an ideal, improved world—comparing what is with what could be
    • Good for:
      • inspiring the audience to action
      • creating hope and excitement
      • creating a following
    • MLK’s I Have a Dream speech
  5. In Medias Res
    • Structure
      • narrative begins in the heat of the action
      • starts over at the beginning to explain how you got there
    • try hinting at something bizarre or unexpected, something that needs more explanation, to hook the audience
    • only works for shorter presentations
    • Good for:
      • grabbing attention from the start
      • keeping an audience craving resolution
      • focusing attention on a pivotal moment in your story
  6. Converging Ideas
    • shows the audience how different strands of thinking came together to form one idea
    • can be used to show the birth of a movement, explain how a single idea was the culmination of several minds working towards one goal
    • Good for:
      • showing how great minds came together
      • demonstrating how a development occurred at a certain point in history
      • showing how symbiotic relationships formed
  7.  False Start
    • begin to tell a seemingly predictable story, before unexpectedly disrupting it and beginning it over again
    • good for talking about failures where you were forced to go back to square one and reassess; ideal for talking about the things that you learned from the experience, or some innovative way you solved a problem
    • quick attention hack which will disrupt your audience’s expectations
    • Good for:
      • disrupting audience expectations
      • showing the benefits of a flexible approach
      • keeping the audience engaged
  8. Petal Structure
    • organizes multiple speakers or stories around one central concept
    • useful if you have several unconnected stories you want to tell, or things you want to reveal, that all relate back to a single message
    • each petal should be a complete narrative in itself; evidence around your central theory
    • Good for:
      • demonstrating how strands of a story or process are interconnected
      • showing how several scenarios relate back to one idea
      • letting multiple speakers talk around a central theme

citizen science final proposal

Frontal alpha asymmetry neurofeedback:

  1. test positive memory recall for alpha idling in right frontal area
  2. test negative memory recall for alpha idling in left frontal area
  3. test neurofeedback protocols for frontal alpha asymmetry
  4. implement best protocol as regular training to test influence on mood


Other methods of altering brain activity

  • active:
    • alpha/theta training
    • meditation (mindfulness vs focused attention)
    • gratitude logging
    • positive autobiographical recall
  • passive (alpha band vs gamma):
    • photic driving
    • binaural beats
    • aromatherapy