LM W6: Class Notes


(Adam optimizer: adapts the step size per weight and adds momentum, which can help training get past shallow local minima)


  • sessions are run-time environments (what happens in your session, stays in your session)

convolutional neural network

  • https://en.wikipedia.org/wiki/Kernel_(image_processing)

recurrent neural network:

  • sequence to sequence


  • language in vector space; can perform algebra on concepts
    • king – man + woman = queen
  • https://en.wikipedia.org/wiki/Principal_component_analysis
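The king - man + woman = queen idea can be sketched with a few made-up vectors. Everything here is a toy assumption: the two dimensions (royalty, gender), the numbers, and the helper name `nearest_word` are invented for illustration and do not come from a trained model.

```python
import numpy as np

# Hypothetical 2-D "embeddings": [royalty, gender]. Real word vectors are
# learned from text and usually have hundreds of dimensions.
toy_vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine_similarity(a, b):
    """Similarity of direction between two vectors (1 = same direction)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest_word(target, vectors, exclude=()):
    """Return the word whose vector points most nearly the same way as target."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

# Algebra on concepts: remove "man" from "king", add "woman".
result = toy_vectors["king"] - toy_vectors["man"] + toy_vectors["woman"]
print(nearest_word(result, toy_vectors, exclude=("king", "man", "woman")))  # prints queen
```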

LM W6: Class Notes

unsupervised learning

  • unsupervised learning can be framed as supervised learning where the target output is the same as the input (the network learns to reconstruct what it is given)


  • https://github.com/Hebali/learning_machines/tree/master/hyperparameter_hunt
  • main.py: choose the hyperparameters that yield the best accuracy
    • 700 > 600 (700 inputs into 600 hidden units) will result in a more accurate reconstruction than 700 > 5
    • batch size: won’t have too much influence
    • think of a value that you think is too small and one that is too large, try both plus the middle, keep the best two values, and repeat
    • can layer RBMs (stack one restricted Boltzmann machine on top of another)
    • continuous version keeps the probabilities as outputs; binary version compares each probability to a random number between 0 and 1 and outputs 1 if the random number is less than the probability, otherwise 0
  • Docker: runs code in containers (like a lightweight virtual machine)
  • launchbot: web interface for that
  • jupyter
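The binary-versus-continuous note above can be sketched in a few lines of NumPy. This is only an illustration of the "compare the probability to a random number" step; the function name `sample_binary` and the example probabilities are assumptions, not code from the hyperparameter_hunt repo.

```python
import numpy as np

def sample_binary(probabilities, rng):
    """Turn unit probabilities into 0/1 states by comparing to random numbers."""
    random_numbers = rng.random(probabilities.shape)  # uniform in [0, 1)
    # A unit switches on (1) when its random number falls below its probability.
    return (random_numbers < probabilities).astype(int)

rng = np.random.default_rng(0)
probabilities = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
print(sample_binary(probabilities, rng))  # first entry is always 0, last is always 1
```

The continuous version would simply pass `probabilities` along unchanged instead of sampling.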

Learning Machine W4: Class Notes

(Grapher: macOS app)

Multilayer Perceptron:

  • Perceptron changes the weights to get better answers; multilayer perceptron gets better outputs with hidden weights that represent facets of a problem that the machine determines
    • rather than inputs going straight into the output, there is a hidden layer of factors of indeterminate length that sit between input and output
    • inputs are what are visible to the computer; computer assumes there’s more than what is visible and compensates with the hidden layer to explain what’s visible
      • “wires” are layers; each layer has inputs and outputs
      • “input” is output of the previous layer
  • backpropagation: the “error” of one layer is a function of the error of the layer after it, so the for loop runs backward through the layers (output first) to adjust the weights
    • relates to gradient descent: moves toward lowest point (error rate) one step at a time
    • HW visualizing error rate is helpful (over epochs, aka training iterations)
      • if error rate flattens out well before zero, it’s hit a local minimum or the network is too small to fit the data (underfit)
      • if error rate nears zero then suddenly increases, it’s overfit to your training examples
        • solutions: give more examples, lower the learning rate; most effective are dropout (on every training pass, we randomly switch off some neurons so the network can’t lean on any one of them) and regularization (Occam’s razor: the simplest solution is usually the correct one; penalizes extreme weights)
          • can attach an unsupervised learner to the supervised learner
  • learning rate is just a multiplier on each weight update so we don’t learn too much too quickly (a high rate needs fewer examples/iterations, but the network develops perceptions too early that are hard to back out of)
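A minimal sketch of the ideas above, assuming one hidden layer, sigmoid activations, and mean squared error. The function name `train_mlp`, the default values, and the XOR example are my own choices for illustration, not the class's reference implementation. Note the backward `for` loop over layers and the learning rate acting as a plain multiplier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(inputs, targets, hidden_nodes=4, learning_rate=0.5, epochs=5000, seed=0):
    rng = np.random.default_rng(seed)
    # One weight matrix per layer; the extra row holds the bias weights.
    weights = [
        rng.normal(size=(inputs.shape[1] + 1, hidden_nodes)),
        rng.normal(size=(hidden_nodes + 1, targets.shape[1])),
    ]
    bias_column = np.ones((len(inputs), 1))
    errors = []
    for _ in range(epochs):
        # Forward pass: each layer's output is the next layer's input.
        activations = [inputs]
        for w in weights:
            activations.append(sigmoid(np.hstack([activations[-1], bias_column]) @ w))
        output_error = activations[-1] - targets
        errors.append(float(np.mean(output_error ** 2)))
        # Backward pass: start at the output and run backward through the layers.
        delta = output_error * activations[-1] * (1 - activations[-1])
        for layer in reversed(range(len(weights))):
            layer_input = np.hstack([activations[layer], bias_column])
            gradient = layer_input.T @ delta / len(inputs)
            if layer > 0:
                # This layer's error is a function of the error of the layer after it.
                delta = (delta @ weights[layer][:-1].T) * activations[layer] * (1 - activations[layer])
            weights[layer] -= learning_rate * gradient  # one small step downhill
    return weights, errors

xor_inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
xor_targets = np.array([[0], [1], [1], [0]], dtype=float)
weights, errors = train_mlp(xor_inputs, xor_targets)
print("first error:", errors[0], "last error:", errors[-1])
```

Plotting `errors` against the epoch number gives the error-rate curve mentioned in the notes.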

Supervised Learning:

  • two categories of problems: regression and classification problems
    • regression: function (curve, e.g., stock prices)
      • linear regression: fit a straight line that is the best approximation of the data
      • nonlinear regression: fit a curve that is the best approximation of the trend
      • deals with a continuous function
    • classification: discrete (as opposed to continuous) outputs
      • one-hot encoding: each category has its own dimension
        • as many output dimensions as there are categories; training data = 1 for the thing it’s representing, 0 for things it’s not
      • one-cold encoding: inverse of one-hot
  • 90-95% accuracy is a good target (close to 100% on the training data usually means it’s overfit)
  • can create sub-datasets within a historical data set and loop over them as training examples: three-day input / one-day output
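One-hot and one-cold encoding from the list above can be sketched like this. The function names and the three-category example are assumptions for illustration; categories are assumed to be numbered 0, 1, 2, and so on.

```python
import numpy as np

def one_hot(labels, num_categories):
    """One row per example, one column per category: 1 for its category, 0 elsewhere."""
    encoded = np.zeros((len(labels), num_categories), dtype=int)
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

def one_cold(labels, num_categories):
    """Inverse of one-hot: 0 for the example's category, 1 elsewhere."""
    return 1 - one_hot(labels, num_categories)

labels = [0, 2, 1]  # hypothetical category numbers for three examples
print(one_hot(labels, 3))
print(one_cold(labels, 3))
```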

Activation functions:

  • Sigmoid (0 to 1) and tanh (-1 to 1)
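    • squashes any input into that range; inputs far above or below zero saturate near the max or min
    • target outputs need to be within the range of the activation function (use a mapping function to rescale them)
    • tanh covers twice the range (i.e., twice as many numbers between min and max)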
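A small sketch of both activation functions plus a mapping function for rescaling targets into the activation's range. The helper name `map_range` and the sample values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Squashes any input into the range 0 to 1."""
    return 1.0 / (1.0 + np.exp(-z))

def map_range(values, in_min, in_max, out_min, out_max):
    """Linearly rescale values from [in_min, in_max] to [out_min, out_max]."""
    return (values - in_min) / (in_max - in_min) * (out_max - out_min) + out_min

inputs = np.array([-10.0, 0.0, 10.0])
print(sigmoid(inputs))  # near 0, exactly 0.5, near 1
print(np.tanh(inputs))  # near -1, exactly 0, near 1

# Hypothetical targets on a 0-100 scale, rescaled to fit tanh's -1 to 1 range.
raw_targets = np.array([0.0, 50.0, 100.0])
print(map_range(raw_targets, 0, 100, -1, 1))
```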

HW data sets: http://archive.ics.uci.edu/ml/index.php

  • Just use one hidden layer, variation can be how many nodes
    • number of nodes = somewhere between the average of the number of input dimensions (x) and output dimensions (y) and twice the largest dimension, i.e., between (x+y)/2 and 2x when x is the larger one

Learning Machines W3 HW: Perceptron Implementation

This week’s homework was to implement the Perceptron algorithm and train it on data sets based on the AND, OR, and XOR logic gates. Since the outputs we want the Perceptron to predict are known, this is considered supervised training. Here is my spaghetti code:

This code returns the results once its predictions match the known outputs. I had it print each step to see what it was thinking:

AND results

OR results

Then, as expected, my Perceptron was not able to reach 100% accuracy for the XOR dataset, and thus never exited the loop. Here it is trying very hard:

Full code here: https://github.com/xujenna/learning_machines/blob/master/perceptron.py

Learning Machines: W3 Class Notes

Graph theory describes hierarchy of related elements:

  • vertices (nodes) are entities; edges represent the relationship between nodes
  • can represent grouped paths in Photoshop, a computer program, etc.
  • perceptron is a directed graph (info flows in one direction)
  • recurrent neural networks allow cycles (loops) in their flow of info

Perceptron Implementation Notes:

  • sign activation function: if number is greater than zero, output is 1; less than zero, output is -1
  • bias input: always equal to one
  • supervised training procedure:
    • start with random weights (i.e., the first predictions are random guesses)
    • have perceptron guess outputs; compare to actual known outputs
    • compute the error; adjust all weights accordingly
    • repeat
  • HW:
    • construct data sets, train it on all three (AND and OR should be 100% accurate)
      • extra column of 1’s for bias input
      • input is 2 columns (3 for bias input), output is 1 column
      • AND set: true = 1, false = -1; two column pairs
        • input [1,1], output = [1]; input [1, -1], output [-1], etc
      • OR set
      • XOR set will not reach 100% accuracy (75% at best, since a single line can only get 3 of the 4 points right; often 50%)
    • class Perceptron
      • initializer function (number_of_input_dimensions, num_of_output_dimensions)
        • weights = np.random.rand(num_input)
      • predict(inputs)
        • return array of output predictions
      • training function(iterations, inputs, known outputs)
        • for iter in range(num_iters):
          • predict = (
      • myPerceptron = Perceptron()
        • myPerceptron.train()
        • myPerceptron.predict()
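The outline above could be fleshed out roughly as follows. This is a sketch under my own assumptions, not the homework solution linked earlier: it uses a single output, a `learning_rate` and `seed` parameter I added, and stops early once a full pass makes no mistakes.

```python
import numpy as np

class Perceptron:
    def __init__(self, num_inputs, learning_rate=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.random(num_inputs + 1)  # random start; +1 for the bias input
        self.learning_rate = learning_rate

    def _add_bias(self, inputs):
        # Extra column of 1's for the bias input.
        return np.hstack([inputs, np.ones((len(inputs), 1))])

    def predict(self, inputs):
        # Sign activation: greater than zero gives 1, otherwise -1.
        return np.where(self._add_bias(inputs) @ self.weights > 0, 1, -1)

    def train(self, inputs, known_outputs, iterations=1000):
        biased = self._add_bias(inputs)
        for _ in range(iterations):
            mistakes = 0
            for row, target in zip(biased, known_outputs):
                guess = 1 if row @ self.weights > 0 else -1
                if guess != target:
                    # Compute the error and nudge every weight accordingly.
                    self.weights += self.learning_rate * (target - guess) * row
                    mistakes += 1
            if mistakes == 0:  # every known output matched, so stop
                break
        return self

inputs = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]])  # true = 1, false = -1
gates = {
    "AND": np.array([1, -1, -1, -1]),
    "OR": np.array([1, 1, 1, -1]),
    "XOR": np.array([-1, 1, 1, -1]),
}
accuracies = {}
for name, known_outputs in gates.items():
    perceptron = Perceptron(num_inputs=2).train(inputs, known_outputs)
    accuracies[name] = float(np.mean(perceptron.predict(inputs) == known_outputs))
    print(name, accuracies[name])
```

Capping `iterations` keeps the XOR run from looping forever, since it can never reach a mistake-free pass.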

Linear separation:

  • Exclusive Or: (a OR b) AND (NOT (a AND b))
    • Both variables are dependent on each other, whereas in the AND and OR models neither node needs to know about the other
      • not linearly separable like AND and OR
  • If machine-learning about whether pixels compose a picture of a person, the perceptron asks each pixel if it may be part of a picture of a person, and if more than 50% say yes, then the output is yes, the picture is of a person
    • does not account for interdependency of pixels

Calculus Primer:

  • Calculus is about approximating the analog world
  • derivative: rate of change in some phenomenon
    • power rule: multiply the coefficient by the power, then reduce the power by 1
      • derivative of x^2 is 2x
    • chain rule: the derivative of f(g(x)) is f'(g(x)) * g'(x) >> for a nested function, able to compute the derivative by splaying it out into its outer and inner parts
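Both rules can be sanity-checked numerically by comparing them to the slope over a tiny step. The example functions (g(x) = x^2 + 1 inside f(u) = u^3) and the helper name `numerical_derivative` are my own picks for illustration.

```python
def numerical_derivative(func, x, step=1e-5):
    """Approximate the rate of change of func at x using a tiny step on each side."""
    return (func(x + step) - func(x - step)) / (2 * step)

# Power rule: the derivative of x^2 is 2x.
square = lambda x: x ** 2
x = 3.0
print(numerical_derivative(square, x), 2 * x)

# Chain rule on the nested function f(g(x)) with g(x) = x^2 + 1 and f(u) = u^3.
g = lambda x: x ** 2 + 1
f = lambda u: u ** 3
nested = lambda x: f(g(x))

x = 1.5
f_prime_at_g = 3 * g(x) ** 2  # power rule on the outer function, evaluated at g(x)
g_prime = 2 * x               # power rule on the inner function
chain_rule_value = f_prime_at_g * g_prime
print(numerical_derivative(nested, x), chain_rule_value)
```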

Learning Machines w2 hw: k-means clustering

Not gonna lie, I pair-programmed this with a software engineer who specializes in machine learning. I was able to code the initial setup just fine, but somehow failed to make the connection that the data points were vectors, and all the calculation required was just vector math. Being new to Python, and having forgotten all my high school math, I found this assignment pretty overwhelming, but with some help I feel like I now understand k-means clustering pretty well.

Would like to update this later so that the data points change color based on which cluster they belong to. The plots currently are not very legible.

Steps taken to group 50 data points into 4 clusters.

Steps taken to group 100 data points into 5 clusters.

Full code here: https://github.com/xujenna/learning_machines/blob/master/kmeans_clustering.py

Learning Machines: w2 notes


Huffman encoding: scan the entire doc, count how often each symbol occurs in the overall document, then encode based on those frequencies (the most frequent symbols get the shortest codes)
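A compact sketch of that idea using Python's standard library. The function name `huffman_codes` and the sample string are assumptions for illustration; a full encoder would also need to store the code table alongside the encoded bits.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Map each symbol in text to a bit string; frequent symbols get shorter codes."""
    frequencies = Counter(text)  # scan the whole document once
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {symbol: "0" for symbol in frequencies}
    # Heap entries: (frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(freq, i, {symbol: ""}) for i, (symbol, freq) in enumerate(frequencies.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent groups, prefixing a bit to tell them apart.
        freq_a, _, codes_a = heapq.heappop(heap)
        freq_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (freq_a + freq_b, next_id, merged))
        next_id += 1
    return heap[0][2]

codes = huffman_codes("aaaabbc")
print(codes)
encoded = "".join(codes[symbol] for symbol in "aaaabbc")
print(encoded, len(encoded), "bits")
```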

  • scalar: individual number
  • vector: one dimensional list of numbers
  • matrix: a list of vectors, two dimensional list of numbers
  • tensor: a list of matrices is one example of a tensor
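In NumPy terms, the four definitions above differ only in how many dimensions the array has. The variable names and values are arbitrary examples.

```python
import numpy as np

scalar = np.array(5.0)                       # individual number
vector = np.array([1.0, 2.0, 3.0])           # one-dimensional list of numbers
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # a list of vectors
tensor = np.stack([matrix, matrix * 10])     # a list of matrices

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("tensor", tensor)]:
    print(name, "dimensions:", value.ndim, "shape:", value.shape)
```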

k-means clustering (viz demonstration here)

  1. choose random points to serve as center of cluster
  2. measure distances between “center” points and data points
  3. sort points into clusters based on distances
  4. move center points to actual center of clusters
  5. measure new distances between “center” points and data points (newly adjusted points may change which points belong to which cluster)
  6. repeat until stable (points no longer jump between clusters)

machine learning day 1—class notes

entropy: natural tendency of nature toward chaos (milk stirs into coffee forever); randomness is the likely outcome of things

math of machine learning is rooted in contorting the physical world to imitate what we want it to represent (bringing organization to entropy)

rationalism: knowledge comes from truth; anything we know, we know through the ether (or the gods)
empiricism: all knowledge is derived from experience; you can’t know anything without experiencing it—no truth, just opinions

kant integrated both by saying that minds have been constructed in a similar way, bodies are born in a similar way

time, space, causality; we have to operate in a spatial, temporal, causal world; truth comes from these conditions

truth in machine learning outputs is relative to its experience; not objectively correct

pixels > face is too great of a leap; start with surfaces, edges, shapes

machine learning = induction