# Machine Learning: The Mathematics of Support Vector Machines - Part 1

Linearly Separable Case, Formulating the Objective Function

Views and opinions expressed are solely my own.

## Introduction

The purpose of this post is to discuss the mathematics of support vector machines (SVMs) in detail, in the case of linear separability.

## Background

SVMs are a tool for classification. The idea is that we want to find two lines (linear equations) so that a given set of points are linearly separable according to a binary classifier, coded as $$\pm 1$$, assuming such lines exist. These lines are given by the black lines given below.

library(ggplot2)
set.seed(45)

plus_1 <- data.frame(x = runif(100, min = -3, max = 3))
plus_1$y <- plus_1$x + runif(100, min = 1.1, max = 2)

minus_1 <- data.frame(x = runif(100, min = -3, max = 3))
minus_1$y <- minus_1$x + runif(100, min = -2, max = -1.1)

ggplot() +
geom_point(data = plus_1,
aes(x = x, y = y),
alpha = 0.1) +
geom_point(data = minus_1,
aes(x = x, y = y),
alpha = 0.1) +
geom_text(aes(x = 2, y = 3.5, label = "y = 1"),
color = "black") +
stat_function(fun = function(x){ x + 1.142233 },
color = "black",
linetype = "dashed") +
geom_text(aes(x = 2, y = 2.5, label = "y = 0"),
color = "red") +
stat_function(fun = function(x){ x + 0.0149345 },
color = "red") +
geom_text(aes(x = 2, y = 1.5, label = "y = -1"),
color = "black") +
stat_function(fun = function(x){ x - 1.112364 },
color = "black",
linetype = "dashed") +
theme_void() +
scale_x_continuous(limits = c(-3, 3)) +
scale_y_continuous(limits = c(-5, 5)) 

Points at or above the $$y = 1$$ line are classified as $$+1$$, and points at or below the $$y = -1$$ line are classified as $$-1$$. The line labeled “$$y = 0$$” is known as the decision boundary. The $$y = 1$$ and $$y = -1$$ lines are the lines touching the points which are closest to the decision boundary.

Assuming the lines above all have common parameter vector $$\boldsymbol\beta \in \mathbb{R}^p \setminus \{\mathbf{0}\}$$ (which we assume only includes the slope-related terms) and intercept term $$\beta_0 \in \mathbb{R}$$, we seek the decision boundary following the form of a linear equation: $$\boldsymbol\beta^{T}\mathbf{x} + \beta_0$$. At the decision boundary, we assume the classifier is $$0$$, hence we assume $$\boldsymbol\beta^{T}\mathbf{x} + \beta_0 = 0$$ at the decision boundary.

Let $$(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_N, y_N)$$ denote the training data in $$\mathbb{R}^{p} \times \{-1, 1\}$$, with each $$y_i \in \{-1, +1\}$$ for each $$i = 1, \dots, N$$. Let $$\omega_{+1}$$, $$\omega_{-1}$$ denote the subsets of $$\mathbb{R}^p$$ for which the corresponding classifiers are $$+1$$ and $$-1$$ respectively. Then we assume that \begin{align*} &\boldsymbol\beta^{T}\mathbf{x}_i + \beta_0 \geq 1\text{, }\quad \mathbf{x}_i \in \omega_{+1} \\ &\boldsymbol\beta^{T}\mathbf{x}_i + \beta_0 \leq -1\text{, }\quad \mathbf{x}_i \in \omega_{-1} \end{align*} Since $$y_i \in \{-1, +1\}$$, the above two conditions are equivalent to $\begin{equation*} y_i(\boldsymbol\beta^{T}\mathbf{x}_i + \beta_0) \geq 1\text{, } \quad\mathbf{x}_i \in \mathbb{R}^{p}\text{.} \end{equation*}$ However, it is worth noting there are infinitely many possible decision boundaries that satisfy these conditions. We have to choose some sort of criteria for optimality.

## An Intermediate Result

Consider the decision boundary $$\boldsymbol\beta^{T}\mathbf{x} + \beta_0 = 0$$ and a point $$\mathbf{z} = \begin{bmatrix} z_1 & z_2 & \cdots & z_{p} \end{bmatrix}^{T}\in \mathbb{R}^{p}$$. We aim to calculate the perpendicular distance from $$\mathbf{z}$$ to $$\mathbf{x}$$, given the constraint $$\boldsymbol\beta^{T}\mathbf{x} + \beta_0 = 0$$. Consider minimizing the squared Euclidean distance from $$\mathbf{x}$$ to $$\mathbf{z}$$ subject to $$\mu(\mathbf{x}) = \boldsymbol\beta^{T}\mathbf{x} + \beta_0 = 0$$, by the method of Lagrange multipliers. We work with the function \begin{align*} L(\mathbf{x}, \lambda) = \sum_{i=1}^{p}(x_i - z_i)^2 + \lambda\left(\beta_1x_1 + \cdots + \beta_px_p + \beta_0\right)\text{.} \end{align*} Then $\begin{equation*} \dfrac{\partial L}{\partial x_i} = 2(x_i - z_i) + \lambda\beta_i = 0 \implies (x_i -z_i)^2 = \dfrac{\lambda^2\beta_i^2}{4}\text{.} \end{equation*}$ Because $$\beta_1x_1 + \cdots + \beta_px_p + \beta_0 = 0$$, we have that $\begin{equation*} \sum_{i=1}^{p}\beta_ix_i + \beta_0 = \sum_{i=1}^{p}\left(z_i\beta_i-\dfrac{\lambda\beta_i^2}{2}\right) + \beta_0 = \sum_{i=1}^{p}z_i\beta_i-\dfrac{\lambda}{2}\sum_{j=1}^{p}\beta_j^2 + \beta_0 = 0\text{.} \end{equation*}$ This implies that $\begin{equation*} \lambda = \dfrac{2}{\sum_{j=1}^{p}\beta_j^2}\left(\beta_0 + \sum_{i=1}^{p}z_i\beta_i\right) = \dfrac{2}{\|\boldsymbol\beta\|^2_2}\mu(\mathbf{z}) \implies \dfrac{\lambda^2\beta_i^2}{4} = \dfrac{[\mu(\mathbf{z})]^2\beta_i^2}{\|\boldsymbol\beta\|_2^4}\text{.} \end{equation*}$ The squared Euclidean distance is then given by $\begin{equation*} \sum_{i=1}^{p}(x_i - z_i)^2 = \dfrac{[\mu(\mathbf{z})]^2}{\|\boldsymbol\beta\|_2^4}\sum_{i=1}^{p}\beta_i^2 = \dfrac{[\mu(\mathbf{z})]^2\|\boldsymbol\beta\|_2^2}{\|\boldsymbol\beta\|_2^4} = \dfrac{[\mu(\mathbf{z})]^2}{\|\boldsymbol\beta\|_2^2} \end{equation*}$ hence the Euclidean distance from $$\mathbf{x}$$ to $$\mathbf{z}$$ is given by $\begin{equation*} \sqrt{\sum_{i=1}^{p}(x_i - z_i)^2} = \dfrac{|\mu(\mathbf{z})|}{\|\boldsymbol\beta\|_2}\text{.} \end{equation*}$

## Optimality in SVMs

Now suppose $$\mathbf{x}_{+1}$$ and $$\mathbf{x}_{-1}$$ are the closest points to the decision boundary with classifications $$+1$$ and $$-1$$ respectively. We obtain that \begin{align*} &\boldsymbol\beta^{T}\mathbf{x}_{+1} + \beta_0 = 1 \\ &\boldsymbol\beta^{T}\mathbf{x}_{-1} + \beta_0 = -1\text{.} \end{align*} Hence, the perpendicular distances of these two points to any point $$\mathbf{x}$$ on the decision boundary is given by \begin{align*} &\dfrac{|\mu(\mathbf{x}_{+1})|}{\|\boldsymbol\beta\|_2} = \dfrac{|\boldsymbol\beta^{T}\mathbf{x}_{+1} + \beta_0|}{\|\boldsymbol\beta\|_2} = \dfrac{1}{\|\boldsymbol\beta\|_2} \\ &\dfrac{|\mu(\mathbf{x}_{-1})|}{\|\boldsymbol\beta\|_2} = \dfrac{|\boldsymbol\beta^{T}\mathbf{x}_{-1} + \beta_0|}{\|\boldsymbol\beta\|_2} = \dfrac{1}{\|\boldsymbol\beta\|_2} \text{.} \end{align*} The sum of these two distances, given by $$M = \dfrac{2}{\|\boldsymbol\beta\|_2}$$, is called the margin. Geometrically, one can think of the margin as the total perpendicular distances from the decision boundary to the closest points ($$\mathbf{x}_{+1}$$, $$\mathbf{x}_{-1}$$) of each of the states of the binary classifier. These points are known as the support vectors.

The optimization problem with regard to SVMs is \begin{align*} &\max \dfrac{2}{\|\boldsymbol\beta\|_2}\\ &\text{subject to }y_i(\boldsymbol\beta^{T}\mathbf{x}_i + \beta_0) \geq 1\text{, } i = 1, \dots, n\text{.} \end{align*} However, the objective function $$\dfrac{2}{\|\boldsymbol\beta\|_2}$$ is difficult to work with; instead, we minimize the reciprocal and square the Euclidean norm to avoid having to deal with square roots, leading to the problem \begin{align*} &\min \dfrac{\|\boldsymbol\beta\|_2^2}{2}\\ &\text{subject to }y_i(\boldsymbol\beta^{T}\mathbf{x}_i + \beta_0) \geq 1\text{, } i = 1, \dots, n\text{.} \end{align*} Note that this is a convex optimization problem.

## Conclusion and Next Steps

In this post, we formulated the SVM problem as a convex optimization problem. In a future post, I will discuss more on conditions guaranteeing a solution to the SVM problem (i.e., the Karush-Kuhn-Tucker conditions).

## References

Bishop, C. M. (2006). Pattern recognition and machine learning. New York, NY: Springer Science+Business Media.

Izenman, A. J. (2013). Modern Multivariate Statistical Techniques. New York, NY: Springer Science+Business Media.

Wilmott, P. (2020). Machine Learning: An Applied Mathematics Introduction. Panda Ohana Publishing.

##### Yeng Miller-Chang

I am a Senior Data Scientist - Global Knowledge Solutions with General Mills, Inc. Views and opinions expressed are my own.