DISTRIBUTED MULTI-LABEL LEARNING ON APACHE SPARK
BASIC INFORMATION
Ph.D. Student: Jorge Gonzalez Lopez
Advisors: Alberto Cano, Sebastián Ventura
Defended: April 2019
Keywords: multi-label learning, distributed systems, spark
DESCRIPTION
This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model.
Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented, ranging from local parallelization using threads to distributed computing using independent or shared memory spaces. The optimal approach is shown to perform hundreds of times faster than the baseline method.
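As an illustration of the simplest end of that spectrum (local thread-level parallelism), the sketch below runs per-label work concurrently following a binary-relevance decomposition. This is plain Python, not the thesis code; the toy dataset and the counting "model" are illustrative assumptions.

```python
# Sketch only: thread-based local parallelism over the labels of a
# multi-label problem, one independent task per label (binary relevance).
from concurrent.futures import ThreadPoolExecutor

def train_binary_model(label_index, data):
    # Placeholder for fitting one binary classifier for a single label;
    # here it just counts the positive instances for that label.
    positives = sum(1 for _, labels in data if label_index in labels)
    return {"label": label_index, "positives": positives}

# Toy dataset: (features, set-of-labels) pairs.
data = [([0.1, 0.2], {0, 2}), ([0.5, 0.1], {1}), ([0.3, 0.9], {0, 1})]
num_labels = 3

with ThreadPoolExecutor(max_workers=4) as pool:
    models = list(pool.map(lambda l: train_binary_model(l, data),
                           range(num_labels)))

print([m["positives"] for m in models])  # one positive count per label
```

The same per-label decomposition is what makes the distributed variants natural: each label's task can be shipped to a different executor instead of a local thread.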
Three distributed multi-label k-nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pairwise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of the exact method while reducing execution times in all scenarios.
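The exact baseline can be sketched in a few lines: compute the distance from a query to every training instance, keep the k closest, and vote label by label. This is a minimal pure-Python illustration, not the Spark implementation; the threshold-vote prediction rule and the toy data are assumptions made for the example.

```python
# Sketch only: brute-force multi-label k-NN (the exact pairwise-distance
# baseline). Distributed versions partition `train` across nodes.
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mlknn_predict(query, train, k, threshold=0.5):
    # Rank all training instances by distance and keep the k nearest.
    neighbours = sorted(train, key=lambda inst: euclidean(query, inst[0]))[:k]
    # Predict every label present in at least `threshold` of the neighbours.
    votes = Counter(l for _, labels in neighbours for l in labels)
    return {l for l, c in votes.items() if c / k >= threshold}

# Toy training set: (features, set-of-labels) pairs.
train = [([0.0, 0.0], {0}), ([0.1, 0.1], {0, 1}),
         ([1.0, 1.0], {1}), ([0.9, 0.8], {1})]
print(mlknn_predict([0.05, 0.05], train, k=3))
```

The tree-based and hashing variants in the thesis avoid the full pairwise scan by indexing the instances, trading a small amount of accuracy for much lower cost.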
The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation of a feature selection criterion to the multi-label setting is discussed, and two distributed feature selection methods for multi-label problems are proposed: one that selects the feature subset maximizing the Euclidean norm of the individual information measures, and one that selects the subset maximizing their geometric mean. The results indicate that each method excels in different scenarios, depending on the type of features and the number of labels.
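The contrast between the two aggregation criteria can be sketched as follows. Each feature has a vector of per-label information measures; the Euclidean norm rewards a feature that is highly informative for even a single label, while the geometric mean rewards balance across all labels. Plain Python; the mutual-information values below are hypothetical, not computed from data.

```python
# Sketch only: two ways to aggregate per-label information measures
# into a single feature score.
import math

def score_features(mi, method):
    # mi[f] is the vector of information values between feature f and
    # each label; collapse it into one score per feature.
    if method == "norm":       # Euclidean norm of the vector
        return {f: math.sqrt(sum(v * v for v in vals))
                for f, vals in mi.items()}
    if method == "geometric":  # geometric mean of the vector
        return {f: math.prod(vals) ** (1 / len(vals))
                for f, vals in mi.items()}
    raise ValueError(method)

# Hypothetical MI estimates for 3 features against 2 labels.
mi = {"f1": [0.9, 0.0], "f2": [0.4, 0.4], "f3": [0.2, 0.6]}

norm_scores = score_features(mi, "norm")
geo_scores = score_features(mi, "geometric")
by_norm = max(norm_scores, key=norm_scores.get)
by_geo = max(geo_scores, key=geo_scores.get)
print(by_norm, by_geo)
```

With these numbers the norm criterion picks f1 (strong for one label, useless for the other), while the geometric mean picks f2 (moderately informative for both), which is one way each method can excel in different scenarios.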
Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to larger data than state-of-the-art methods.
PUBLICATIONS ASSOCIATED WITH THIS THESIS
INTERNATIONAL JOURNALS
- J. Gonzalez-Lopez, S. Ventura and A. Cano, “Distributed nearest neighbor classification for large-scale multi-label data on Spark”, Future Generation Computer Systems, vol. 87, pp. 66-82, 2018.
- J. Gonzalez-Lopez, S. Ventura and A. Cano, “Distributed selection of continuous features in multi-label classification using mutual information”, IEEE Transactions on Neural Networks and Learning Systems, under review, 2019.
- J. Gonzalez-Lopez, S. Ventura and A. Cano, “Distributed multi-label feature selection using individual mutual information measures”, IEEE Transactions on Knowledge and Data Engineering, under review, 2019.
INTERNATIONAL CONFERENCES
- J. Gonzalez-Lopez, A. Cano and S. Ventura, “Large-Scale Multi-label Ensemble Learning on Spark”, IEEE Trustcom/BigDataSE/ICESS, Sydney, pp. 893-900, 2017.
- J. Gonzalez-Lopez, S. Ventura and A. Cano, “ARFF data source library for distributed single/multiple instance, single/multiple output learning on Apache Spark”, International Conference on Computational Science, 2019.