DEMOCRATIZATION OF ADVANCED MODELS FOR DATA SCIENCE
BASIC INFORMATION
Ph.D. Student: Rafael Barbudo Lunar
Advisors: José Raúl Romero, Aurora Ramírez
Defended on: April 2024
Keywords: automated machine learning, democratisation, evolutionary machine learning, interactive machine learning
THESIS PROPOSAL
Extracting useful and novel knowledge from raw data is a complex process that, apart from technical expertise, requires considerable background in the application domain. It is therefore desirable to bring the phases of this process closer to domain experts. In this vein, the field of Automated Machine Learning (AutoML) has emerged precisely to automate the machine learning process, which plays a key role in the extraction of valuable knowledge. Automating repetitive and time-consuming phases allows technical experts to shift their focus to the phases that require their know-how and intuition. Indeed, AutoML approaches have already shown that they can outperform technical experts in certain tasks, such as the design of neural network architectures.
The main objective of this thesis is to develop AutoML approaches that assist both data scientists and end users during the knowledge extraction process. To achieve this objective, we developed Evoflow, a model based on grammar-guided genetic programming (G3P) for automatic workflow composition (AWC). This model automatically generates a machine learning pipeline tailored to the characteristics of the input dataset, incorporating both the automatic selection of algorithms and the optimization of their hyperparameters (HPO).
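As a rough illustration of the grammar-guided idea described above, the following minimal Python sketch derives a random pipeline from a toy context-free grammar and then samples hyperparameters for each selected algorithm. It is not the Evoflow implementation: the grammar, the algorithm names, the hyperparameter domains, the depth limit, and the function names are all hypothetical choices made for this example. It only covers the random generation of a candidate; a G3P approach would additionally evaluate such candidates and evolve them with genetic operators.

```python
import random

# Hypothetical toy grammar (an assumption for this sketch, not Evoflow's grammar).
# Each non-terminal maps to a list of alternative productions.
GRAMMAR = {
    "<workflow>": [["<classifier>"], ["<preprocess>", "<workflow>"]],
    "<preprocess>": [["scaler"], ["pca"], ["feature_selection"]],
    "<classifier>": [["decision_tree"], ["knn"], ["svm"]],
}

# Hypothetical hyperparameter domains for each terminal (algorithm name).
HYPERPARAMETERS = {
    "scaler": {},
    "pca": {"n_components": [2, 5, 10]},
    "feature_selection": {"k": [5, 10, 20]},
    "decision_tree": {"max_depth": [3, 5, 10]},
    "knn": {"n_neighbors": [1, 3, 5, 7]},
    "svm": {"C": [0.1, 1.0, 10.0]},
}

MAX_DEPTH = 4  # assumed cap on recursive expansions, so derivations terminate


def derive(symbol, rng, depth=0):
    """Expand a grammar symbol into a flat list of terminal names."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: an algorithm name
    productions = GRAMMAR[symbol]
    if depth >= MAX_DEPTH:
        # Past the depth limit, keep only non-recursive productions.
        productions = [p for p in productions if symbol not in p]
    steps = []
    for child in rng.choice(productions):
        steps.extend(derive(child, rng, depth + 1))
    return steps


def random_pipeline(rng):
    """Return a list of (algorithm, hyperparameters) pairs valid under the grammar."""
    return [
        (name, {hp: rng.choice(values) for hp, values in HYPERPARAMETERS[name].items()})
        for name in derive("<workflow>", rng)
    ]


if __name__ == "__main__":
    for step in random_pipeline(random.Random(42)):
        print(step)
```

Because every candidate is derived from the grammar, it is valid by construction: zero or more preprocessing steps followed by exactly one classifier. The recursive `<workflow>` rule is what allows preprocessing sequences of varying length in this toy setting.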
One of the challenges faced by AWC proposals based on evolutionary computation is the extensive time required to evaluate candidate pipelines. Training the algorithms in each pipeline is resource- and time-intensive, and the whole computation often takes weeks to complete. Consequently, it is valuable to explore how the insights of an expert data scientist can help constrain the search space during algorithm execution.
The partial objectives are the following:
- Developing an AutoML tool that automates the generation of an optimised machine learning workflow.
- Reducing the gap between machine learning techniques and experts in the application domain, thus making the data science process more accessible to end users.
FUNDS
The development of this thesis was supported by:
- Spanish Ministry of Science and Competitiveness, project TIN2017-83445-P.
- Spanish Ministry of Education, Culture and Sports under the FPU program (FPU17/00799).
- Spanish Ministry of Science and Innovation, project PID2020-115832GB-I00.
PUBLICATIONS ASSOCIATED WITH THIS THESIS
INTERNATIONAL JOURNALS
- Barbudo, R.; Ventura, S.; Romero, J. R. “Eight years of AutoML: categorisation, review and trends”. Knowledge and Information Systems, pp. 1–78, 2023.
- Ramírez, A.; Barbudo, R.; Romero, J. R. “An experimental comparison of metaheuristic frameworks for multi-objective optimization”. Expert Systems, pp. e12672, 2022.
INTERNATIONAL CONFERENCES
- Barbudo, R.; Ventura, S.; Romero, J. R. “Grammar-Based Evolutionary Approach for Automatic Workflow Composition with Open Preprocessing Sequence”. Proceedings of the 13th International Conference on Soft Computing and Pattern Recognition (SoCPaR 2021), pp. 647–656, 2021.