More data or better data? How the training data influences machine learned predictions in Chemistry

Date:

Nowadays, Machine Learning(ML) methods routinely achieve high accuracy in short times and are becoming another standard tool for computational/theoretical chemists. However, ML methods require large quantities of data to train them to achieve the desired/required results.
The generation of data for chemical applications is not a trivial task and requires hours of computation from ab-initio methods. Nevertheless, the rule of thumb from computer science that large amounts of data will beat the best algorithms is still followed. Keeping in mind that the generation of data is not always feasible or practical, we reviewed the influence of common databases on the training of ML models. Our results indicate that common databases present redundancies that reduce the quality of prediction of an ML model. Therefore, there is a need to ‘clean’ and augment databases to assure the best prediction and, at the same time, obtain a procedure for the creation of databases with the minimum amount of computational effort.

A strategy to address the problem of construction of new databases is through the use of the uncertainty in the prediction, which can be used as a guide for the exploration of chemical/conformational space. Therefore, it will be desired that the ML model would be able to estimate the uncertainty of the prediction. However, the usual approach implies using several models, which comes with a high computational price. New developments has provided simple methodologies for the quantification of uncertainty on Neural Networks. Using these methodologies, we quantify the uncertainty using the PhysNet architecture. Our results, indicate that the uncertainty obtained from the ML model can help us to identify cases in which additional information is required or when redundant information complicates the prediction of a specific property.

See the talk here