Alvaro Espejo – Knowledge Discovery and Intelligent Systems – KDIS

NEW MODELS FOR DATA AUGMENTATION IN TIME SERIES PROBLEMS

BASIC INFORMATION

Ph.D. Student: Álvaro Espejo Muñoz
Advisors: José Luis Ávila, Sebastián Ventura
Started on: November 2021
Keywords: data augmentation, time series

THESIS PROPOSAL

Time series data is a collection of observations recorded sequentially over time, widely used in fields such as financial market analysis, health monitoring, energy demand forecasting and preventive maintenance. Collecting this type of data often demands substantial resources and time, depending on the specific requirements of each study. In some cases researchers must work with proprietary data or conduct exhaustive clinical trials. As a result, time series data is often scarce.

This scarcity presents a limitation for data-driven approaches, which often require vast amounts of data to build reliable and robust models. Data augmentation techniques, which artificially increase the quantity of data by adding slightly modified copies of already existing data or newly created synthetic data, have proven invaluable in various domains such as Image Processing, Natural Language Processing and Time Series Data Analysis.

Data augmentation in time series has traditionally been approached using techniques based on random transformations. The problem with these transformations is that there are many types of time series, each with its own specific properties, and not every transformation is applicable to every type. Generative methods, which synthesize new time series from information inherent in the available data, serve as an alternative to transformation-based data augmentation. Generative techniques include pattern decomposition methods, which break a time series down into distinct elements such as trend, seasonality and random noise and then modify them, as well as generative models that leverage the statistical distribution of the dataset to create new samples. Although a wide range of statistical models exists, neural network (NN) based generative models have recently gained traction. However, due to the time-dependent component of time series data, not all NN models are suitable for data augmentation. Notable among the NN-based generative models are Autoencoder (AE) networks and Generative Adversarial Networks (GANs).
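As a rough illustration of the two families contrasted above, the following sketch shows one random transformation (jittering) and one simplified decomposition-based augmentation. The function names, the default window size and the noise level are illustrative assumptions, not part of the thesis; the decomposition also ignores seasonality and uses only a moving-average trend, so it should be read as a minimal sketch rather than a recommended method.

```python
import random


def jitter(series, sigma=0.05, rng=None):
    """Random transformation: add zero-mean Gaussian noise to every point."""
    if rng is None:
        rng = random.Random()
    return [x + rng.gauss(0.0, sigma) for x in series]


def moving_average(series, window):
    """Centered moving average, used here as a crude trend estimate.

    The window is truncated at both ends of the series.
    """
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend


def decompose_and_resample(series, window=5, rng=None):
    """Decomposition-based augmentation (simplified assumption):
    keep the estimated trend and randomly permute the residuals."""
    if rng is None:
        rng = random.Random()
    trend = moving_average(series, window)
    residual = [x - t for x, t in zip(series, trend)]
    rng.shuffle(residual)
    return [t + r for t, r in zip(trend, residual)]


if __name__ == "__main__":
    # Toy series: linear drift plus a period-4 pattern (made-up data).
    original = [float(i % 4) + 0.1 * i for i in range(20)]
    rng = random.Random(0)
    print(jitter(original, sigma=0.05, rng=rng)[:5])
    print(decompose_and_resample(original, window=5, rng=rng)[:5])
```

Jittering perturbs every point in the same way regardless of the structure of the series, which is where the applicability problem arises. The decomposition variant preserves the trend, but because it permutes residuals freely it would destroy any seasonal pattern left in them, which illustrates why the choice of technique depends on the type of series.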

GANs are implemented as a system of at least two neural networks competing with each other in a zero-sum game. The goal of a GAN is to capture the distribution of the training dataset and to generate, usually from a stochastic vector, new samples that follow the same distribution. GANs have demonstrated impressive results in generating new data that aligns with the distribution of the training set, particularly in image processing. Yet their potential has been less explored in time series problems, where data augmentation poses its own challenges due to temporal dependencies and seasonality patterns.
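For reference, in the standard two-network formulation this zero-sum game is commonly written as the minimax objective below. The symbols follow the usual convention in the GAN literature rather than any notation defined in this proposal: G is the generator, D the discriminator, p_data the distribution of the training data and p_z the distribution of the stochastic input vector z.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator is trained to increase V by telling real samples from generated ones, while the generator is trained to decrease it by producing samples that the discriminator accepts as real.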

The primary objective of this thesis is to devise generative methods for producing synthetic time series samples, so that machine learning can be applied when labelled data is scarce or data sets are imbalanced. It is also proposed to develop a taxonomy and a model that obtains efficient results relative to the performance of state-of-the-art methods.

The partial objectives of this Ph.D. thesis are the following:

Comparative study of data augmentation in the time series domain using generative modeling techniques based on neural networks. A complete study of the current proposals should be carried out in order to delimit the types of techniques to be handled and the working environment.
Design and development of a novel generative model for time series.
Validation of the developed model both on benchmarks and in a real-world scenario.

FUNDS

This research is made possible through the financial contributions of the following entities:

Spanish Ministry of Science and Innovation and the European Regional Development Fund, under project PID2020-115832GB-I00