Talks and presentations

Towards better chemical databases for atomistic machine learning

August 15, 2023

Talk, ACS Fall: Harnessing the Power of Data, San Francisco, California, United States

Machine learning (ML) has revolutionized the field of atomistic simulations. It is now possible to obtain high-quality predictions of chemical properties such as total energies, forces, or dipoles at low computational cost. Currently, the field is at a stage where atomistic simulations in the gas phase on the sub-microsecond time scale with ab-initio MP2 quality can be carried out routinely. Given that the computational effort to evaluate such a statistical model is independent of the quality of the input data, the most significant bottleneck for devising yet better ML models is the considerable amount of data required to train them. Although the community consensus is that more data naturally leads to better performance, it has been found that this working hypothesis is not necessarily correct for predicting chemical properties with models trained on commonly used databases such as QM9 or ANI-1. Consequently, there is a need to identify how to obtain suitable training data for ML models and, for established databases, how to add or remove information while retaining the best performance of the model.
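One common way to decide which structures to add to (or keep in) a training set is to select a maximally diverse subset in descriptor space. The sketch below is only an illustration of that general idea, not the method from the talk; it uses greedy farthest-point sampling on a toy NumPy descriptor matrix, and all names are hypothetical.

```python
import numpy as np

def farthest_point_sample(X, k, seed=0):
    """Greedily pick k maximally diverse rows of X (n_samples x n_features).

    Illustrative subset selection: each new point is the one farthest
    (Euclidean distance) from everything already selected.
    """
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(X.shape[0]))]   # random starting structure
    # distance of every point to its nearest already-selected point
    d = np.linalg.norm(X - X[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                  # farthest from current set
        selected.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return selected

# toy example: three tight clusters of 10 points each;
# k=3 picks one representative from each cluster
X = np.vstack([np.zeros((10, 2)),
               np.ones((10, 2)) * 5.0,
               np.ones((10, 2)) * 10.0])
idx = farthest_point_sample(X, 3)
```

In a real workflow the rows of `X` would be molecular descriptors (e.g. symmetry functions or SOAP vectors) rather than raw coordinates, and the selected indices would mark the configurations worth labeling with expensive ab-initio calculations.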

More data or better data? How the training data influences machine learned predictions in Chemistry

July 15, 2022

Talk, Barcelona MMSML Workshop: Methods in Molecular Simulations and Machine Learning, Barcelona, Spain

Nowadays, machine learning (ML) methods routinely achieve high accuracy in short times and are becoming another standard tool for computational and theoretical chemists. However, ML methods require large quantities of training data to achieve the desired results.
Generating data for chemical applications is not a trivial task and requires hours of computation with ab-initio methods. Nevertheless, the rule of thumb from computer science that large amounts of data will beat the best algorithms is still followed. Keeping in mind that data generation is not always feasible or practical, we reviewed the influence of common databases on the training of ML models. Our results indicate that common databases contain redundancies that reduce the prediction quality of an ML model. Therefore, there is a need to ‘clean’ and augment databases to ensure the best predictions and, at the same time, to obtain a procedure for creating databases with the minimum amount of computational effort.
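A minimal sketch of what ‘cleaning’ redundancies out of a database could look like, assuming structures are represented as descriptor vectors: drop any entry lying within a small distance of one already kept. This is an illustrative toy, not the procedure used in the work; the function name and tolerance are hypothetical.

```python
import numpy as np

def deduplicate(X, tol=1e-3):
    """Remove near-duplicate rows of a descriptor matrix X.

    Keeps the first occurrence of each group of rows that lie within
    `tol` (Euclidean distance) of one another -- a crude proxy for the
    redundant configurations found in large chemistry databases.
    """
    keep = []
    for i, x in enumerate(X):
        if all(np.linalg.norm(x - X[j]) > tol for j in keep):
            keep.append(i)
    return keep

X = np.array([[0.0, 0.0],
              [0.0, 0.0],     # exact duplicate of row 0
              [1.0, 0.0],
              [1.0, 1e-4]])   # near duplicate of the previous row
kept = deduplicate(X)         # rows 0 and 2 survive
```

The quadratic pairwise loop is fine for a demonstration; for databases with millions of entries one would use a spatial index or clustering instead.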