(Grapher: macOS app)
Multilayer Perceptron:
 a perceptron changes its weights to get better answers; a multilayer perceptron gets better outputs with hidden weights that represent facets of the problem that the machine determines itself
 rather than inputs going straight into the output, there is a hidden layer of factors of indeterminate length that sit between input and output
 inputs are what are visible to the computer; computer assumes there’s more than what is visible and compensates with the hidden layer to explain what’s visible

 “wires” are layers; each layer has inputs and outputs
 “input” is output of the previous layer
 backpropagation: “error” of one layer is a function of previous layer, so for loop runs backward through layers to adjust weights
 relates to gradient descent: moves toward lowest point (error rate) one step at a time
 HW visualizing error rate is helpful (over epochs, aka training iterations)
 if error rate flattens out well before zero, it’s stuck in a local minimum or is underfitting
 if error rate nears zero then suddenly increases, it’s overfit to your training examples
 solutions: give more examples, lower the learning rate; most effective are dropout (every time we train, we randomly block some neurons from changing) and regularization (Occam’s razor: the simplest solution is usually the correct one; penalizes extreme weights)
 can attach an unsupervised learner to the supervised learner
 learning rate is just a multiplier so we don’t learn too much too quickly (learning too fast would require fewer examples/iterations, but develops preconceptions too early that are hard to back out of)
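The ideas above can be sketched as a tiny one-hidden-layer network trained by backpropagation. This is a minimal illustration, not from the notes: the XOR dataset, hidden-layer size of 4, sigmoid activation, and learning rate of 0.5 are all assumed choices. Note how the backward pass computes each layer's error from the layer after it, and the learning rate scales every weight update.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR is not linearly separable, so a single perceptron can't learn it,
# but a hidden layer of intermediate factors can.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4))   # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 1, (4, 1))   # hidden -> output weights
b2 = np.zeros((1, 1))

learning_rate = 0.5             # multiplier so we don't learn too much too quickly
mse_history = []                # error per epoch, for visualizing the error rate

for epoch in range(5000):       # epochs = training iterations
    # forward pass: each layer's "input" is the previous layer's output
    hidden = sigmoid(X @ W1 + b1)
    out = sigmoid(hidden @ W2 + b2)
    mse_history.append(float(np.mean((out - y) ** 2)))

    # backpropagation: error of one layer is a function of the layer
    # after it, so we work backward through the layers
    out_err = (out - y) * out * (1 - out)
    hidden_err = (out_err @ W2.T) * hidden * (1 - hidden)

    # gradient descent: one step toward a lower error rate
    W2 -= learning_rate * hidden.T @ out_err
    b2 -= learning_rate * out_err.sum(axis=0, keepdims=True)
    W1 -= learning_rate * X.T @ hidden_err
    b1 -= learning_rate * hidden_err.sum(axis=0, keepdims=True)
```

Plotting `mse_history` over epochs gives the error curve the homework asks for; a curve that flattens early or that dips and then rises shows the failure modes listed above.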
Supervised Learning:
 two categories of problems: regression and classification problems
 regression: function (curve, ie stock prices)
 linear regression: take a curve and fit it to a straight line that is the best approximation
 nonlinear regression: try to fit trend to a curve that is the best approximation
 deals with a continuous function
 classification: discrete (as opposed to continuous) outputs
 one-hot encoding: each category has its own dimension
 as many output dimensions as there are categories; training data = 1 for the category it represents, 0 for the rest
 one-cold encoding: inverse of one-hot (0 for the category, 1 everywhere else)
 90–95% accuracy is best
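A quick sketch of one-hot and one-cold encoding; the category names here are hypothetical examples, not from the notes:

```python
# hypothetical example categories
categories = ["cat", "dog", "bird"]

def one_hot(label, categories):
    # 1 in the dimension for this category, 0 everywhere else
    return [1 if c == label else 0 for c in categories]

def one_cold(label, categories):
    # inverse of one-hot: 0 for the category, 1 everywhere else
    return [1 - v for v in one_hot(label, categories)]

print(one_hot("dog", categories))   # -> [0, 1, 0]
print(one_cold("dog", categories))  # -> [1, 0, 1]
```

There are as many output dimensions as categories, so adding a category means adding a dimension.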
 can create sub-datasets within a historic data set and loop inside the subsets as training: three-day input / one-day output
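The sub-dataset idea above can be sketched as a sliding window over a series. The `prices` values and the window sizes are illustrative assumptions:

```python
def sliding_windows(series, n_in=3, n_out=1):
    """Split one historic series into (input, target) sub-datasets:
    n_in consecutive values as input, the next n_out as the output."""
    pairs = []
    for i in range(len(series) - n_in - n_out + 1):
        pairs.append((series[i:i + n_in], series[i + n_in:i + n_in + n_out]))
    return pairs

prices = [10, 11, 12, 13, 14, 15]   # hypothetical daily prices
windows = sliding_windows(prices)   # three-day input / one-day output
print(windows[0])                   # -> ([10, 11, 12], [13])
```

Each window is one training example, so a single historic series yields many examples to loop over.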
Activation functions:
 Sigmoid (0 to 1) and tanh (−1 to 1)
 squashes any values above max and below min
 outputs need to be within the domain (use mapping function) of activation function
 tanh has twice the output range, so twice as much precision (i.e. twice as many representable values)
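A small sketch of the squashing behavior and of mapping targets into an activation's range; the `map_to_unit` helper is an assumed illustration, not a standard function:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# large-magnitude inputs are squashed toward the bounds
print(round(sigmoid(10), 4))     # close to 1
print(round(sigmoid(-10), 4))    # close to 0
print(round(math.tanh(10), 4))   # close to 1
print(round(math.tanh(-10), 4))  # close to -1

# hypothetical mapping function: targets must lie within the
# activation's range, e.g. a linear map from [lo, hi] into [0, 1]
def map_to_unit(v, lo, hi):
    return (v - lo) / (hi - lo)
```

For tanh, the analogous map would land in [−1, 1] instead of [0, 1].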
HW data sets: http://archive.ics.uci.edu/ml/index.php
 Just use one hidden layer, variation can be how many nodes
 number of nodes = somewhere between the average of the input and output dimensions and twice the largest dimension ((x + y) / 2 and 2x)
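The rule of thumb above can be written out directly; the example dimensions (4 inputs, 3 outputs) are assumed for illustration:

```python
def hidden_node_range(n_inputs, n_outputs):
    """Rule-of-thumb bounds from the notes: between the average of the
    input and output dimensions, (x + y) / 2, and twice the larger, 2x."""
    low = (n_inputs + n_outputs) / 2
    high = 2 * max(n_inputs, n_outputs)
    return low, high

print(hidden_node_range(4, 3))   # -> (3.5, 8)
```

With one hidden layer, the node count inside this range is the main thing left to vary.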