Example of TSVM
In our Python implementation, we are going to use a bidimensional dataset similar to one employed in the previous method; however, in this case, we impose 400 unlabeled samples out of a total of 500 points:
from sklearn.datasets import make_classification
nb_samples = 500
nb_unlabeled = 400
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, random_state=1000)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0
The corresponding plot is shown in the following graph:
Original labeled and unlabeled dataset
The procedure is similar to the one we used before. First of all, we need to initialize our variables:
import numpy as np
w = np.random.uniform(-0.1, 0.1, size=X.shape[1])
eta_labeled = np.random.uniform(0.0, 0.1, size=nb_samples - nb_unlabeled)
eta_unlabeled = np.random.uniform(0.0, 0.1, size=nb_unlabeled)
y_unlabeled = np.random.uniform(-1.0, 1.0, size=nb_unlabeled)
b = np.random.uniform(-0.1, 0.1, size=1)
C_labeled = 1.0
C_unlabeled = 10.0
theta0 = np.hstack((w, eta_labeled, eta_unlabeled, y_unlabeled, b))
In this case, we also need to define the y_unlabeled vector for variable-labels. The author also suggests using two C constants (C_labeled and C_unlabeled) in order to be able to weight the misclassification of labeled and unlabeled samples differently. We used a value of 1.0 for C_labeled and 10.0 for C_unlabled, because we want to penalize more the misclassification of unlabeled samples.
The objective function to optimize is as follows:
def svm_target(theta, Xd, Yd):
wt = theta[0:2].reshape((Xd.shape[1], 1))
s_eta_labeled = np.sum(theta[2:2 + nb_samples - nb_unlabeled])
s_eta_unlabeled = np.sum(theta[2 + nb_samples - nb_unlabeled:2 + nb_samples])
return (C_labeled * s_eta_labeled) + (C_unlabeled * s_eta_unlabeled) + (0.5 * np.dot(wt.T, wt))
While the labeled and unlabeled constraints are as follows:
def labeled_constraint(theta, Xd, Yd, idx):
wt = theta[0:2].reshape((Xd.shape[1], 1))
c = Yd[idx] * (np.dot(Xd[idx], wt) + theta[-1]) + \
theta[2:2 + nb_samples - nb_unlabeled][idx] - 1.0
return (c >= 0)[0]
def unlabeled_constraint(theta, Xd, idx):
wt = theta[0:2].reshape((Xd.shape[1], 1))
c = theta[2 + nb_samples:2 + nb_samples + nb_unlabeled][idx - nb_samples + nb_unlabeled] * \
(np.dot(Xd[idx], wt) + theta[-1]) + \
theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] - 1.0
return (c >= 0)[0]
We need also to impose the constraints on the slack variables and on the y(u):
def eta_labeled_constraint(theta, idx):
return theta[2:2 + nb_samples - nb_unlabeled][idx] >= 0
def eta_unlabeled_constraint(theta, idx):
return theta[2 + nb_samples - nb_unlabeled:2 + nb_samples][idx - nb_samples + nb_unlabeled] >= 0
def y_constraint(theta, idx):
return np.power(theta[2 + nb_samples:2 + nb_samples + nb_unlabeled][idx], 2) == 1.0
As in the previous example, we can create the constraint dictionary needed by SciPy:
svm_constraints = []
for i in range(nb_samples - nb_unlabeled):
svm_constraints.append({
'type': 'ineq',
'fun': labeled_constraint,
'args': (X, Y, i)
})
svm_constraints.append({
'type': 'ineq',
'fun': eta_labeled_constraint,
'args': (i,)
})
for i in range(nb_samples - nb_unlabeled, nb_samples):
svm_constraints.append({
'type': 'ineq',
'fun': unlabeled_constraint,
'args': (X, i)
})
svm_constraints.append({
'type': 'ineq',
'fun': eta_unlabeled_constraint,
'args': (i,)
})
for i in range(nb_unlabeled):
svm_constraints.append({
'type': 'eq',
'fun': y_constraint,
'args': (i,)
})
In this case, the last constraint is an equality, because we want to force y(u) to be equal either to -1 or 1. At this point, we minimize the objective function:
from scipy.optimize import minimize
result = minimize(fun=svm_target,
x0=theta0,
constraints=svm_constraints,
args=(X, Y),
method='SLSQP',
tol=0.0001,
options={'maxiter': 1000})
When the process is complete, we can compute the labels for the unlabeled samples and compare the plots:
theta_end = result['x']
w = theta_end[0:2]
b = theta_end[-1]
Xu= X[nb_samples - nb_unlabeled:nb_samples]
yu = -np.sign(np.dot(Xu, w) + b)
The plot comparison is shown in the following graph:
Original dataset (left). Final labeled dataset (right)
The misclassification (based on the density distribution) is slightly higher than S3VM, but it's possible to change the C values and the optimization method until the expected result has been reached. A good benchmark is provided by a supervised SVM, which can have better performances when the training set is huge enough (and when it represents the whole pdata correctly).
It's interesting to evaluate different combinations of the C parameters, starting from a standard supervised SVM. The dataset is smaller, with a high number of unlabeled samples:
nb_samples = 100
nb_unlabeled = 90
X, Y = make_classification(n_samples=nb_samples, n_features=2, n_redundant=0, random_state=100)
Y[Y==0] = -1
Y[nb_samples - nb_unlabeled:nb_samples] = 0
We use the standard SVM implementation provided by Scikit-Learn (the SVC() class) with a linear kernel and C=1.0:
from sklearn.svm import SVC
svc = SVC(kernel='linear', C=1.0)
svc.fit(X[Y!=0], Y[Y!=0])
Xu_svc= X[nb_samples - nb_unlabeled:nb_samples]
yu_svc = svc.predict(Xu_svc)
The SVM is trained with the labeled samples and the vector yu_svc contains the prediction for the unlabeled samples. The resulting plot (in comparison with the original dataset) is shown in the following graph:
Original dataset (left). Final labeled dataset (right) with C = 1.0
All the labeled samples are represented with bigger squares and circles. The result meets our expectations, but there's an area (X [-1, 0] - Y [-2, -1]), where the SVM decided to impose the circle class even if the unlabeled points are close to a square. This hypothesis can't be acceptable considering the clustering assumption; in fact, in a high-density region there are samples belonging to two classes. A similar (or even worse) result is obtained using an S3VM with CL=10 and CU=5:
Original dataset (left). Final labeled dataset (right) with CL = 10 and CU = 5
In this case, the classification accuracy is lower because the penalty for the unlabeled samples is lower than the one imposed on the labeled points. A supervised SVM has obviously better performances. Let's try with CL=10 and CU=50:
Original dataset (left). Final labeled dataset (right) with CL = 10 and CU = 50
Now, the penalty is quite a lot higher for the unlabeled samples and the result appears much more reasonable considering the clustering assumption. All the high-density regions are coherent and separated by low-density ones. These examples show how the value chosen for the parameters and the optimization method can dramatically change the result. My suggestion is to test several configurations (on sub-sampled datasets), before picking the final one. In Semi-Supervised Learning, Chapelle O., Schölkopf B., Zien A., (edited by), The MIT Press, there are further details about possible optimization strategies, with strengths and weaknesses.