# Learning Machine W4: Class Notes

(Grapher: macOS app)

Multilayer Perceptron:

• Perceptron changes the weights to get better answers; multilayer perceptron gets better outputs with hidden weights that represent facets of a problem that the machine determines
• rather than inputs going straight into the output, there is a hidden layer of factors of indeterminate length that sit between input and output
• inputs are what are visible to the computer; computer assumes there’s more than what is visible and compensates with the hidden layer to explain what’s visible
• “wires” are layers; each layer has inputs and outputs
• “input” is output of the previous layer
• backpropagation: the “error” of each layer is computed from the layer after it, so a for loop runs backward through the layers to adjust the weights
• relates to gradient descent: moves toward lowest point (error rate) one step at a time
• HW: visualizing the error rate over epochs (aka training iterations) is helpful
• if error rate flattens out well before zero, it’s stuck in a local minimum or is underfitting
• if error rate nears zero then suddenly increases, it’s overfit to your training examples
• solutions: give more examples, lower the learning rate; most effective are dropout (each training pass, randomly block some neurons from changing) and regularization (Occam’s razor: the simplest solution is usually the correct one; penalizes extreme conclusions)
• can attach an unsupervised learner to the supervised learner
• learning rate is just a multiplier so we don’t learn too much too quickly (a high rate needs fewer examples/iterations, but it locks in perceptions too early that are then hard to back out of)
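A minimal sketch of the training loop these notes describe, assuming one sigmoid hidden layer trained on XOR with plain NumPy (the layer sizes, seed, epoch count, and learning rate are illustrative choices, not from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# XOR: the inputs are what's visible to the computer; the hidden layer
# learns the extra factors that explain what's visible
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(scale=0.5, size=(2, 4))  # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(4, 1))  # hidden -> output weights
learning_rate = 0.5  # multiplier so we don't learn too much too quickly

losses = []
for epoch in range(5000):  # epochs = training iterations
    # forward pass: each layer's "input" is the previous layer's output
    h = sigmoid(X @ W1)
    out = sigmoid(h @ W2)
    losses.append(np.mean((out - y) ** 2))

    # backpropagation: each layer's error is computed from the layer
    # after it, so the deltas flow backward through the layers
    out_delta = (out - y) * out * (1 - out)
    h_delta = (out_delta @ W2.T) * h * (1 - h)

    # gradient descent: move toward lower error, one step at a time
    W2 -= learning_rate * h.T @ out_delta
    W1 -= learning_rate * X.T @ h_delta

print(losses[0], losses[-1])  # error rate should fall over the epochs
```

Plotting `losses` against the epoch index gives exactly the error-rate curve the homework asks to visualize.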

Supervised Learning:

• two categories of problems: regression and classification problems
• regression: fit a function (a curve, e.g. stock prices over time)
• linear regression: take a curve and fit it to a straight line that is the best approximation
• nonlinear regression: try to fit trend to a curve that is the best approximation
• deals with a continuous function
• classification: discrete (as opposed to continuous) outputs
• one-hot encoding: each category has its own dimension
• as many output dimensions as there are categories; training data: 1 for the category it represents, 0 for the categories it doesn’t
• one-cold encoding: inverse of one-hot
• 90–95% accuracy is a realistic target
• can create sub-datasets within a historic dataset and loop over the subsets as training examples, e.g. three-day input / one-day output
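The one-hot/one-cold encodings and the sliding-window sub-datasets above can be sketched like this (the category names and the toy series are illustrative stand-ins):

```python
import numpy as np

# One-hot encoding: each category gets its own output dimension,
# 1 for the category an example represents, 0 for the others.
categories = ["cat", "dog", "bird"]  # illustrative categories

def one_hot(label):
    vec = np.zeros(len(categories))
    vec[categories.index(label)] = 1.0
    return vec

print(one_hot("dog"))  # [0. 1. 0.]

# One-cold encoding: simply the inverse of one-hot.
one_cold = 1.0 - one_hot("dog")  # [1. 0. 1.]

# Sliding-window sub-datasets from a historic series:
# three days of input, one day of output.
series = np.arange(10.0)  # stand-in for e.g. daily stock prices
X = np.array([series[i:i + 3] for i in range(len(series) - 3)])
y = np.array([series[i + 3] for i in range(len(series) - 3)])
print(X[0], y[0])  # [0. 1. 2.] 3.0
```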

Activation functions:

• Sigmoid (0 to 1) and tanh (-1 to 1)
• squashes any values above max and below min
• training outputs need to be mapped into the range of the activation function (use a mapping function)
• tanh offers twice the precision of sigmoid (i.e. twice as many values, since its range is twice as wide)
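The squashing behavior and the output mapping can be seen in a few lines; the min-max mapping function here is one common choice, assumed rather than taken from the lecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Both squash extreme values: sigmoid into (0, 1), tanh into (-1, 1)
print(sigmoid(10.0))   # very close to 1
print(np.tanh(-10.0))  # very close to -1

# Training outputs must be mapped into the activation's range;
# a simple min-max mapping into [0, 1] for sigmoid:
def map_to_range(values, lo, hi):
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    return lo + (values - values.min()) / span * (hi - lo)

prices = [12.0, 15.0, 20.0]  # illustrative raw output values
print(map_to_range(prices, 0.0, 1.0))  # [0.    0.375 1.   ]
```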

HW data sets: http://archive.ics.uci.edu/ml/index.php

• Just use one hidden layer; vary the number of nodes
• number of nodes: somewhere between the average of the input and output dimensions and twice the largest dimension, i.e. between (x + y)/2 and 2·max(x, y)
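The rule of thumb above, written out (the 8-input / 3-output example is illustrative, e.g. 8 features and 3 one-hot categories):

```python
# Rule-of-thumb range for hidden-node count:
# between the mean of input and output dimensions, (x + y) / 2,
# and twice the largest of the two, 2 * max(x, y).
def hidden_node_range(n_inputs, n_outputs):
    low = (n_inputs + n_outputs) / 2
    high = 2 * max(n_inputs, n_outputs)
    return low, high

print(hidden_node_range(8, 3))  # (5.5, 16)
```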