The other general form of a line is ax + by + c = 0. This is the form we will use most often when working with multi-dimensional data.
From that equation we get y = (-a/b)*x - (c/b), which matches the slope-intercept form with slope m = -a/b and intercept -c/b.
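Here is a tiny Python sketch of that conversion; the coefficients are made-up values just for illustration.

```python
# Convert the general form a*x + b*y + c = 0 to slope-intercept form y = m*x + intercept.
# Illustrative values only: the line 2x + 4y - 8 = 0.
a, b, c = 2.0, 4.0, -8.0

m = -a / b          # slope
intercept = -c / b  # y-intercept

print(f"y = {m}*x + {intercept}")   # y = -0.5*x + 2.0
```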
Now we rewrite that equation slightly as w1*x1 + w2*x2 + w0 = 0. This is the equation for 2-dimensional data; see the image below for a clearer picture.
If we observe carefully, the equation is nothing but the dot product of two vectors W and X, so it can be written more compactly as W^T*X + W0 = 0, where W^T is the transpose of W taking the dot product with X. W0 is the bias term; it is zero when the hyperplane passes through the origin.
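To see that the expanded form and the dot-product form are the same thing, here is a minimal NumPy check with made-up numbers (the weights and the point are chosen just so the point lands on the hyperplane):

```python
import numpy as np

# Hypothetical 2-D example: the hyperplane w1*x1 + w2*x2 + w0 = 0
# written as a dot product W^T X + W0.
W = np.array([3.0, -2.0])   # weight vector (w1, w2)
W0 = 1.0                    # bias term
X = np.array([2.0, 3.5])    # a sample point (x1, x2)

expanded = W[0] * X[0] + W[1] * X[1] + W0   # w1*x1 + w2*x2 + w0
dot_form = np.dot(W, X) + W0                # W^T X + W0

print(expanded, dot_form)   # both print 0.0 -> the point lies on the hyperplane
```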
As I said earlier, the hard margin classifier is limited to linearly separable problems. That is why I took a linearly separable example with two classes: a positive class and a negative class. If you observe carefully, there are three equations. The negative-class margin has the equation W^T*X + b = -1, because whenever a data point falls in that region the value of W^T*X + b must be negative (at most -1). The decision boundary has the equation W^T*X + b = 0, because a point that falls on the decision boundary lies exactly midway between the positive and negative classes, so the value is zero. The positive-class margin has the equation W^T*X + b = +1, because whenever a data point falls in that region the value of W^T*X + b must be positive (at least +1).
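As a quick sketch of these three regions, the snippet below (with made-up values for W, b, and the points, not the ones from the figure) decides where a point falls based on the value of W^T*x + b:

```python
import numpy as np

# Toy values: decide which region a point falls in from the value of W^T x + b.
W = np.array([1.0, 1.0])
b = -3.0

def region(x):
    score = np.dot(W, x) + b
    if score >= 1:
        return "positive class (W^T x + b >= +1)"
    elif score <= -1:
        return "negative class (W^T x + b <= -1)"
    else:
        return "inside the margin / on the decision boundary"

print(region(np.array([3.0, 1.0])))   # score = +1 -> positive support vector
print(region(np.array([1.0, 1.0])))   # score = -1 -> negative support vector
print(region(np.array([1.5, 1.5])))   # score =  0 -> on the decision boundary
```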
Now, in the picture above, I took a data point that falls on the decision boundary but should be classified as the positive class. For it to be classified as positive, it must move some distance toward the positive side, and that distance is exactly what the margin measures. So we need to calculate the distance from the decision-boundary hyperplane to the nearest data points, which are the support vectors.
See the image below for a clear understanding.
So we have finally calculated the total size of the margin: it works out to 2/||W||. If we look carefully, the margin size is inversely proportional to the magnitude of the weight vector, so in order to increase the margin we need to minimize that magnitude.
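A small numeric check of that inverse relationship, using two made-up weight vectors:

```python
import numpy as np

# The distance between the two margin hyperplanes W^T x + b = +1 and
# W^T x + b = -1 is 2 / ||W||, so a smaller ||W|| means a wider margin.
W_small = np.array([0.5, 0.5])   # small magnitude
W_large = np.array([2.0, 2.0])   # large magnitude (illustrative values)

for W in (W_small, W_large):
    margin = 2.0 / np.linalg.norm(W)
    print(f"||W|| = {np.linalg.norm(W):.3f}  ->  margin = {margin:.3f}")
```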
For y = +1 -> W^T*X + b ≥ +1. The reason it should always be greater than or equal to +1 is that a data point labeled as the positive class is either a support vector, where W^T*X + b equals exactly +1, or a point farther away, where it is greater than +1.
For y = -1 -> W^T*X + b ≤ -1. The reason it should always be less than or equal to -1 is that a data point labeled as the negative class is either a support vector, where W^T*X + b equals exactly -1, or a point farther away, where it is less than -1.
We can combine these two rules into a single rule that satisfies both conditions: Y*(W^T*X + b) ≥ +1. This holds for every correctly classified point. If Y is +1, then W^T*X + b is at least +1 and the product is at least +1. If Y is -1, then W^T*X + b is at most -1, and multiplying two negative quantities again gives a result of at least +1. So this one constraint captures correct classification for both classes.
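Here is a small sketch that checks the combined constraint on a toy data set; W, b, and the points are made-up values chosen so that the check passes.

```python
import numpy as np

# Check the combined constraint y_i * (W^T x_i + b) >= 1 for a tiny toy set.
W = np.array([1.0, 1.0])
b = -3.0

X = np.array([[3.0, 1.0],    # labeled +1, sits exactly on the +1 margin
              [4.0, 2.0],    # labeled +1, well inside the positive region
              [1.0, 1.0],    # labeled -1, sits exactly on the -1 margin
              [0.0, 1.0]])   # labeled -1, well inside the negative region
y = np.array([+1, +1, -1, -1])

margins = y * (X @ W + b)      # y_i * (W^T x_i + b) for every point
print(margins)                 # [1. 3. 1. 2.]
print(np.all(margins >= 1))    # True -> every point is correctly classified
```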
So our main aim in the hard margin classifier is to minimize the magnitude of the weight vector, which maximizes the margin between the classes and helps prevent misclassification. In other words, we need to find the W and b that minimize ||W|| while still satisfying Y*(W^T*X + b) ≥ +1 for every data point.
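As a minimal sketch of what solving that optimization looks like in practice (not the exact constrained problem written above), scikit-learn's linear SVC with a very large C behaves approximately like a hard margin classifier on linearly separable data; the tiny data set below is made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (same made-up points as before).
X = np.array([[3.0, 1.0], [4.0, 2.0], [1.0, 1.0], [0.0, 1.0]])
y = np.array([+1, +1, -1, -1])

# A huge C allows (almost) no margin violations, approximating the hard margin setting.
clf = SVC(kernel="linear", C=1e10)
clf.fit(X, y)

W = clf.coef_[0]          # learned weight vector
b = clf.intercept_[0]     # learned bias
print("W =", W, "b =", b)
print("margin =", 2.0 / np.linalg.norm(W))
print("support vectors:\n", clf.support_vectors_)
```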
So that is the hard margin classifier and the math behind it. Its main problem is that it cannot handle non-linearly separable data: it forces every point to satisfy these strict constraints, which is only possible when the data is linearly separable.
In the next blog, I will post about the soft margin classifier. So stay tuned!