Master's Dissertation Defense: Optimizing Data Augmentation to Improve AI Model Performance
Speakers
Student: Henrique Matheus Ferreira da Silva
Useful information
Advisor:
Fabio André Machado Porto - Laboratório Nacional de Computação Científica - LNCC
Examining Committee:
Fabio André Machado Porto - Laboratório Nacional de Computação Científica - LNCC (chair)
Marisa Fabiana Nicolás - Laboratório Nacional de Computação Científica - LNCC
Eduardo Bezerra
Alternates:
Luiz M. R. Gadelha Jr. - Laboratório Nacional de Computação Científica - LNCC
Marcel de Moraes Pedroso - FIOCRUZ
Abstract: The accuracy of Machine Learning (ML) based classification algorithms depends heavily on the quality of the dataset on which the corresponding ML models are trained, as well as on how well that dataset represents the problem being analyzed. However, many research areas involve classification problems in which the distribution of examples may vary widely, and in which specific classes may be strongly underrepresented (as in patient-specific medicine) or training data may be scarce (as in plant species classification), leading to unbalanced datasets. In both scenarios, model performance may suffer. Data augmentation techniques try to mitigate this problem by expanding the available training data in order to improve model performance. In this work, we present two novel techniques for data augmentation over tabular data. First, we present a method called SAGAD (Synthetic Data Generator for Tabular Datasets), which is based on the concept of conditional entropy. SAGAD can balance minority classes while increasing the overall size of the training set. Next, we present an extension of SAGAD for iterative learning algorithms, called DABEL (Data Generation Based on Complexity per Classes), which iteratively produces new training samples based on class ambiguity. To validate our proposal, we simulated a small-data scenario using datasets well known in the literature and also evaluated our methods on real-world data. We evaluated SAGAD using four machine learning algorithms and DABEL using a neural network model. To measure our methods' performance, we developed a baseline use case in which models are trained on small data, and compared both SAGAD and DABEL against it. We also compared SAGAD against other data augmentation techniques. SAGAD is implemented and available via AugmenteR (S. Pereira; Ferreira da Silva; Porto, 2021), an R package on CRAN for data augmentation, which currently has more than 1610 downloads.
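The abstract names conditional entropy as the concept underlying SAGAD but does not say how it is used. As a minimal illustration of the quantity itself, and not of SAGAD or the AugmenteR package, the Python sketch below computes H(Y | X) in bits for one discrete feature column X and a class column Y. The function name `conditional_entropy` and the toy data are assumptions made for this example only.

```python
import math
from collections import Counter


def conditional_entropy(features, labels):
    """Return H(Y | X) in bits for two equal-length sequences of discrete values.

    Illustrative helper only; it is not taken from SAGAD or AugmenteR.
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n = len(labels)
    joint = Counter(zip(features, labels))  # counts of each (x, y) pair
    marginal = Counter(features)            # counts of each x value
    h = 0.0
    for (x, _y), count in joint.items():
        p_xy = count / n                    # joint probability p(x, y)
        p_y_given_x = count / marginal[x]   # conditional probability p(y | x)
        h -= p_xy * math.log2(p_y_given_x)
    return h


# Toy data (hypothetical): feature value "a" leaves the class uncertain,
# while feature value "b" determines it completely.
toy_features = ["a", "a", "b", "b"]
toy_labels = [0, 1, 0, 0]
toy_entropy = conditional_entropy(toy_features, toy_labels)
```

On the toy data, `toy_entropy` is 0.5 bits: half of the rows fall under "a", where the class is a coin flip (1 bit), and half under "b", where the class is certain (0 bits). A lower value means the feature tells us more about the class.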