This page provides additional resources for the paper "Influence of Graph Construction on Semi-supervised Learning", published at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases 2013 (ECML-PKDD 2013). The resources available here include the source code necessary to replicate our results and all data sets used in the experimental evaluation.

Influence of Graph Construction on Semi-supervised Learning

(pdf)

Celso André de Sousa, Gustavo E. A. P. A. Batista, and Solange O. Rezende

Abstract: A variety of graph-based semi-supervised learning (SSL) algorithms and graph construction methods have been proposed in the last few years. Despite their apparent empirical success, the field of SSL lacks a detailed study that empirically evaluates the influence of graph construction on SSL. In this paper, we provide such an experimental study. We combine a variety of graph construction methods with a variety of graph-based SSL algorithms and empirically compare them on a number of benchmark data sets widely used in the SSL literature. The empirical evaluation is subdivided into four parts: (1) best case analysis; (2) classifiers' stability evaluation; (3) influence of graph construction; and (4) influence of regularization parameters. The purpose of our experiments is to evaluate the trade-off between classification performance and stability of the SSL algorithms over a variety of graph construction methods and parameter values. The results show that the mutual k-nearest neighbors (mutKNN) graph may be the best choice for adjacency graph construction, while the RBF kernel may be the best choice for weight matrix generation. In addition, mutKNN tends to generate smoother error surfaces than other adjacency graph construction methods. However, mutKNN is unstable for relatively small values of k. Our results indicate that the classification performance of the graph-based SSL algorithms is heavily influenced by the parameter settings, and we found no evident, exploitable pattern to pass on to future practitioners. We discuss the consequences of this instability for research and practice.

Keywords: Semi-supervised learning, graph-based methods, experimental study

Contacts: {sousa, solange, gbatista} at icmc.usp.br
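
As a concrete illustration of the graph construction step studied in the paper (a mutual k-nearest-neighbor adjacency graph whose edges are weighted with an RBF kernel), here is a minimal Python/NumPy sketch. It is not the code we used in the experiments; the function name, the Euclidean metric, and the default values of k and sigma are illustrative assumptions.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def mutknn_rbf_graph(X, k=10, sigma=1.0):
        """Build a mutKNN adjacency graph and weight its edges with an RBF kernel.

        X     : (n_samples, n_features) data matrix
        k     : number of nearest neighbors (illustrative default)
        sigma : RBF bandwidth (illustrative default)
        """
        # Pairwise Euclidean distances between all points.
        D = pairwise_distances(X, metric="euclidean")
        n = D.shape[0]

        # Indices of the k nearest neighbors of each point (column 0 is the point itself).
        nn = np.argsort(D, axis=1)[:, 1:k + 1]

        # Directed kNN adjacency: A[i, j] is True if j is among the k nearest neighbors of i.
        A = np.zeros((n, n), dtype=bool)
        A[np.repeat(np.arange(n), k), nn.ravel()] = True

        # Mutual kNN keeps the edge (i, j) only when i and j appear in each other's
        # neighbor lists, which makes the graph sparser than a symmetric kNN graph
        # and possibly disconnected for small k.
        A = A & A.T

        # RBF (Gaussian) weights on the surviving edges; all other entries stay zero.
        W = np.where(A, np.exp(-(D ** 2) / (2.0 * sigma ** 2)), 0.0)
        return W

A symmetric kNN graph would take the union of the two neighbor relations instead of their intersection; the intersection is what makes mutKNN sparser and, as noted in the abstract, unstable for small k.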

Data sets
Here are the data sets used in our experimental evaluation, in ARFF format.
For the USPS, COIL_2, DIGIT-1, G-241N, and G-241C data sets, we ran Principal Component Analysis (PCA) with the Matlab Toolbox for Dimensionality Reduction library to reduce the dimensionality of the original data, which are available from Chapelle et al.'s book.
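
For readers who prefer to redo this preprocessing outside MATLAB, the following sketch shows an equivalent pipeline (loading an .arff file and projecting the features onto the leading principal components) with SciPy and scikit-learn. The file name, the position of the class attribute, and the number of components are placeholders rather than the values used in the paper.

    import numpy as np
    from scipy.io import arff
    from sklearn.decomposition import PCA

    # Load an .arff data set (the file name is a placeholder).
    data, meta = arff.loadarff("usps.arff")

    # Collect the numeric features, assuming the class attribute is the last column.
    feature_names = meta.names()[:-1]
    X = np.array([[row[name] for name in feature_names] for row in data], dtype=float)

    # Project onto the leading principal components; the target dimensionality
    # below is a placeholder, not the value used in the paper.
    X_reduced = PCA(n_components=50).fit_transform(X)
    print(X_reduced.shape)
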
Source code
Here is the source code used in our experiments.
For the Laplacian Support Vector Machine (LapSVM) algorithm, we used Melacci's code.
For the Locally Linear Embedding (LLE) method, we used Liu's code.
To run MATLAB code from inside our Java code, we used the matlabcontrol library.
Results
Here are all the results for the evaluation of classifiers' stability.
Here are all the results for the evaluation of graphs' stability.
Here are the results for the evaluation of regularization parameters.
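
To make the role of the regularization parameters concrete, here is a short NumPy sketch of one widely used graph-based SSL algorithm, the local and global consistency (LGC) label propagation of Zhou et al., whose closed-form solution depends on a single regularization parameter alpha. It is an illustrative sketch rather than the implementation behind these result files, and the default value of alpha is a placeholder.

    import numpy as np

    def lgc_propagate(W, y, alpha=0.99):
        """Local and global consistency (LGC) label propagation (Zhou et al., 2004).

        W     : (n, n) symmetric weight matrix of the graph (e.g., mutKNN + RBF)
        y     : (n,) labels in {0, ..., c-1}, with -1 marking unlabeled points
        alpha : regularization parameter in (0, 1); the default is a placeholder
        """
        n = W.shape[0]
        classes = np.unique(y[y >= 0])

        # One-hot label matrix; unlabeled rows stay all-zero.
        Y = np.zeros((n, classes.size))
        for idx, c in enumerate(classes):
            Y[y == c, idx] = 1.0

        # Symmetrically normalized weight matrix S = D^{-1/2} W D^{-1/2}.
        d = W.sum(axis=1)
        d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
        S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

        # Closed-form solution F = (I - alpha * S)^{-1} Y; predict the arg-max class.
        F = np.linalg.solve(np.eye(n) - alpha * S, Y)
        return classes[F.argmax(axis=1)]

In this formulation, values of alpha close to 1 trust the graph structure more than the initial labels, so predictions can change considerably with both alpha and the graph supplied to the algorithm, which is the kind of sensitivity these evaluations examine.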