## Class overview

You are learning to teach the computer to learn from data

• How?
• Find a general rule (i.e., improve generalization and avoid overfitting)
• That explains data given only as a sample of limited size
• According to some measurement of accuracy or error
• In short, finding signals among the noise
• Methods covered so far
• Supervised learning
• Data are sample of input-output pairs
• Find input-output mapping
• Regression, classification, etc.
• Unsupervised learning
• Data are sample of objects
• Find some common structure
• Clustering, etc.
• Text mining
• Today's topic: SVM, a still-popular algorithm and yet another supervised learning algorithm
• The remaining three class hours:
• classes[-3]: Semi-supervised learning + a touch of visualization
• classes[-2]: Big data technologies: Hadoop & Spark + tips for your exam
• classes[-1]: The long-awaited final exam

## SVM: Support vector machines

### Classification examples

1. Sheep vector machines
• Using the existing sheep distributions (the training set), determine whether the new sheep belongs with the white sheep or the black sheep.
2. Spam filtering
• Using word occurrences in existing email documents, determine whether a new email is spam or ham.
• Instance space: $x \in X$ ($|X| = n$ data points)
• Binary or real-valued feature vector $x$ of word occurrences
• $d$ features (words + other things, $d \approx$ 100,000+)
• Class: $y \in Y$
• $y$: Spam (+1), Ham (-1)

| viagra | learning | the | dating | nigeria | is_spam |
|--------|----------|-----|--------|---------|---------|
| 1      | 0        | 1   | 0      | 0       | +1      |
| 0      | 1        | 1   | 0      | 0       | -1      |
| 0      | 0        | 0   | 0      | 1       | +1      |

### Linear binary classification

• There may exist many solutions that separate the classes exactly
• Usually, we find the one that will give the smallest generalization error
• This is the problem of choosing the decision boundary, or hyperplane
• Input: Binary/real valued vectors $x$ and labels $y$
• Goal: Find real valued vector $w$
• Each feature has a weight $w_i$
• Prediction is based on the weighted sum: $f(x) = \sum w_i x_i = w \cdot x$
• If $f(x)$ is
• Positive: Predict +1 (i.e., is sheep, is spam)
• Negative: Predict -1 (i.e., is not sheep, is ham)
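As a concrete sketch, the prediction rule above is just a sign test on a dot product. The weights and bias below are made-up illustrative values, not learned ones:

```python
import numpy as np

# Hypothetical "learned" parameters: one weight per feature, plus a bias b.
w = np.array([2.0, -1.0, 0.5])
b = -0.5

def predict(x):
    """Predict +1 if the weighted sum w.x + b is positive, else -1."""
    return 1 if np.dot(w, x) + b > 0 else -1

print(predict(np.array([1.0, 0.0, 1.0])))  # w.x + b = 2.0, so +1
print(predict(np.array([0.0, 1.0, 0.0])))  # w.x + b = -1.5, so -1
```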

### SVM, the maximal margin classifier

• Idea:
• Distance from the separating hyperplane corresponds to the "confidence" of prediction
• In the image below, we are more sure about the class of A and B than of C
• SVM finds the decision boundary that maximizes this distance, or margin
• Margin $\gamma$: The perpendicular distance between the decision boundary and the closest of the data points (left figure below).
• Support vectors: Maximizing the margin leads to a particular subset of existing data points (right figure below).
• Why is maximizing $\gamma$ a good idea?
• Remember: Dot product $A \cdot B = |A||B|\cos\theta$
• Let:
• Line $L = w \cdot x + b = 0$
• Weight $w = [w_1, w_2]$
• Point $A = [x_1^{(A)}, x_2^{(A)}]$
• Point $M = [x_1^{(M)}, x_2^{(M)}]$
• Then the distance between A and L (assuming $|w| = 1$, and using that M lies on L, so $w \cdot M + b = 0$): \begin{align} d(A, L) & = |AH|\\ & = |(A-M) \cdot w|\\ & = |(x_1^{(A)} - x_1^{(M)}) w_1 + (x_2^{(A)} - x_2^{(M)}) w_2|\\ & = |x_1^{(A)} w_1 + x_2^{(A)} w_2 + b|\\ & = |w \cdot A + b| \end{align}
• Prediction = $sign(w \cdot x + b)$
• "Confidence" = $(w \cdot x + b)y$
• For i-th data point: $\gamma_i = (w \cdot x^{(i)} + b)y^{(i)}$
• Therefore, the objective function becomes:
• max $\gamma$ s.t., $y^{(i)}(w \cdot x^{(i)} + b) \geq \gamma$ ($\forall i$)
• Good according to 1) intuition 2) theory and 3) practice
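The distance derivation above can be checked numerically. This sketch uses a made-up unit-norm weight vector (the derivation assumes $|w| = 1$):

```python
import numpy as np

# Made-up line w.x + b = 0 with |w| = 1, and an arbitrary point A.
w = np.array([0.6, 0.8])   # |w| = sqrt(0.36 + 0.64) = 1
b = -1.0
A = np.array([2.0, 3.0])

# Distance from A to the line, computed two ways:
d_formula = abs(np.dot(w, A) + b)        # |w.A + b|, the derived formula
M = A - (np.dot(w, A) + b) * w           # foot of the perpendicular (H)
d_geometric = np.linalg.norm(A - M)      # |AH|, the geometric distance
print(d_formula, d_geometric)            # both equal
```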
• Normalized weights
• Problem: With this equation, scaling $w$ increases margin! (i.e., $w$ can be arbitrarily large)
• If $(w \cdot x + b)y = \gamma$, then $(2w \cdot x + 2b)y = 2\gamma$
• Solution: work with normalized $w$, and require support vectors to be on the margin
• $\gamma = \frac{(w \cdot x + b)y}{|w|}$
• $w \cdot x^{(i)} + b = \pm 1$
• Margin maximization == weight minimization?
• We know three things
• $x^{(1)} = x^{(2)} + 2 \gamma \frac{w}{|w|}$
• $w \cdot x^{(1)} + b = +1$
• $w \cdot x^{(2)} + b = -1$
• Therefore
• $w \cdot x^{(1)} + b = +1$
• $w \cdot (x^{(2)} + 2 \gamma \frac{w}{|w|}) + b = +1$
• $(w \cdot x^{(2)} + b) + 2 \gamma \frac{w \cdot w}{|w|} = +1$
• $-1 + 2 \gamma |w| = +1$ (using $w \cdot x^{(2)} + b = -1$ and $w \cdot w = |w|^2$)
• $\gamma = \frac{1}{|w|}$
• max $\gamma \Leftrightarrow$ max $\frac{1}{|w|} \Leftrightarrow$ min $|w| \Leftrightarrow$ min $\frac{1}{2} |w|^2$
• Which finally gives
• min $\frac{1}{2} |w|^2$ s.t., $y^{(i)}(w \cdot x^{(i)} + b) \geq 1$ ($\forall i$)
• This is called SVM with "hard" constraints
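Hard-margin behavior can be approximated in scikit-learn by a linear SVC with a very large `C` (the toy data and the value `C=1e6` below are illustrative choices, not prescribed by the lecture):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable toy clusters in 2-D.
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A huge C makes margin violations prohibitively expensive, mimicking
# the "hard" constraints y_i (w . x_i + b) >= 1.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
print(clf.support_vectors_)                    # the points lying on the margin
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))   # predicts -1 then +1
```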

### Soft margin SVMs

• Relax constraints
• Allowing the margin constraints to be violated
• In other words, allow some of the training data points to be misclassified
• min $\frac{1}{2} |w|^2 + C \sum_{i=1}^n \xi_i$ s.t., $y^{(i)}(w \cdot x^{(i)} + b) \geq 1 - \xi_i$ ($\forall i$)
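In scikit-learn, the constant $C$ in the objective above corresponds to the `C` parameter of `SVC`. This sketch on made-up overlapping Gaussian blobs shows how it trades margin width against violations:

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping made-up Gaussian clusters, so some slacks ξ_i must be nonzero.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(2.5, 1.0, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = len(clf.support_)
    print(f"C={C}: {n_support[C]} support vectors")

# Smaller C tolerates more margin violations, so more points end up
# inside the margin and become support vectors.
```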

### Kernel methods

Warning: NOT related to OS kernels or shells

• Life is not so easy: not all problems are linearly separable
• If so, choose a mapping to some (high dimensional) dot-product space, namely the feature space: $\Phi: X \to H$

• Mercer's condition
• If a symmetric function $K(x, y)$ satisfies $\sum_{i,j=1}^{M} a_i a_j K(x_i, x_j) \geq 0$
for all $M \in \mathbb{N}$ and all $x_i, a_i \in \mathbb{R}$,
then there exists a mapping function $\Phi$ that maps $x$ into the dot-product feature space such that $K(x, y) = <\Phi(x), \Phi(y)>$, and vice versa.
• Function $K$ is called the kernel.
• Types of kernels
• Linear kernels: $K(x, y) = <x, y>$
• Polynomial kernels: $K(x, y) = (<x, y> + 1)^d$ (e.g., $d = 2$)
• RBF kernels: $K(x, y) = \exp(-\frac{||x-y||^2}{2\sigma^2})$
• ...and more!
• Kernels on various objects, such as graphs, strings, texts, etc.
• Enable us to use dot-product algorithms
• Measure of similarity
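A quick illustration of why kernels matter: concentric circles are not linearly separable, but the RBF kernel separates them via its implicit feature map. This sketch uses scikit-learn's `make_circles` toy data:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no line in the original 2-D space separates them.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)

# The RBF kernel's implicit high-dimensional mapping makes the rings separable.
print("linear training accuracy:", linear.score(X, y))
print("rbf training accuracy:", rbf.score(X, y))
```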

## Programming SVMs

See the SVM documentation on the scikit-learn website.
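As a minimal end-to-end sketch: the tiny dataset below mirrors the word-occurrence table earlier (feature order [viagra, learning, the, dating, nigeria]); a real filter would of course use far more emails and features:

```python
import numpy as np
from sklearn.svm import SVC

# The three emails from the word-occurrence table above.
X = np.array([[1, 0, 1, 0, 0],   # spam
              [0, 1, 1, 0, 0],   # ham
              [0, 0, 0, 0, 1]])  # spam
y = np.array([1, -1, 1])         # +1 = spam, -1 = ham

clf = SVC(kernel="linear").fit(X, y)
# Classify a new email containing "viagra" and "dating".
print(clf.predict([[1, 0, 0, 1, 0]]))
```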

## References

Much of this content is courtesy of Jure Leskovec and Petra Kudova.