Mastering Machine Learning Algorithms
上QQ阅读APP看书,第一时间看更新

Contrastive pessimistic likelihood estimation

As explained at the beginning of this chapter, in many real life problems, it's cheaper to retrieve unlabeled samples, rather than correctly labeled ones. For this reason, many researchers worked to find out the best strategies to carry out a semi-supervised classification that could outperform the supervised counterpart. The idea is to train a classifier with a few labeled samples and then improve its accuracy after adding weighted unlabeled samples. One of the best results is the Contrastive Pessimistic Likelihood Estimation (CPLE) algorithm, proposed by M. Loog (in Loog M., Contrastive Pessimistic Likelihood Estimation for Semi-Supervised Classification, arXiv:1503.00269).

Before explaining this algorithm, an introduction is necessary. If we have a labeled dataset (X, Y) containing N samples, it's possible to define the log-likelihood cost function of a generic estimator, as follows:

After training the model, it should be possible to determine p(yi|xi, θ), which is the probability of a label given a sample xi. However, some classifiers are not based on this approach (like SVM) and evaluate the right class, for example, by checking the sign of a parametrized function f(xiθ). As CPLE is a generic framework that can be used with any classification algorithm when the probabilities are not available, it's useful to implement a technique called Platt scaling, which allows transforming the decision function into a probability through a parametrized sigmoid. For a binary classifier, it can be expressed as follows:

α and β are parameters that must be learned in order to maximize the likelihood. Luckily Scikit-Learn provides the method predict_proba(), which returns the probabilities for all classes. Platt scaling is performed automatically or on demand; for example, the SCV classifier needs to have the parameter probability=True in order to compute the probability mapping. I always recommend checking the documentation before implementing a custom solution.

We can consider a full dataset, made up of labeled and unlabeled samples. For simplicity, we can reorganize the original dataset, so that the first N samples are labeled, while the next M are unlabeled:

As we don't know the labels for all xu samples, we can decide to use M k-dimensional (k is the number of classes) soft-labels qi that can be optimized during the training process:

The second condition in the previous formula is necessary to guarantee that each qi represents a discrete probability (all the elements must sum up to 1.0). The complete log-likelihood cost function can, therefore, be expressed as follows:

The first term represents the log-likelihood for the supervised part, while the second one is responsible for the unlabeled points. If we train a classifier with only the labeled samples, excluding the second addend, we get a parameter set θsup. CPLE defines a contrastive condition (as a log-likelihood too), by defining the improvement in the total cost function given by the semi-supervised approach, compared to the supervised solution:

This condition allows imposing that the semi-supervised solution must outperform the supervised one, in fact, maximizing it; we both increase the first term and reduce the second one, obtaining a proportional increase of CL (the term contrastive is very common in machine learning and it normally indicates a condition which is achieved as the difference between two opposite constraints). If CL doesn't increase, it probably means that the unlabeled samples have not been drawn from the marginal distribution p(x) extracted from pdata.

Moreover, in the previous expression, we have implicitly used soft-labels, but as they are initially randomly chosen and there's no ground truth to support their values, it's a good idea not to trust them by imposing a pessimistic condition (as another log-likelihood):

By imposing this constraint, we try to find the soft-labels that minimize the contrastive log-likelihood; that's why this is defined as a pessimistic approach. It can seem a contradiction; however, trusting soft-labels can be dangerous, because the semi-supervised log-likelihood could be increased even with a large percentage of misclassification. Our goal is to find the best parameter set that is able to guarantee the highest accuracy starting from the supervised baseline (which has been obtained using the labeled samples) and improving it, without forgetting the structural features provided by the labeled samples.

Therefore, our final goal can be expressed as follows: