Loughborough University

Boosting predictability: towards rapid estimation of organic molecule solubility

preprint
posted on 2025-10-31, 14:20 authored by Arsalan Hashemi, Pekka Peljo, Tapio Ala-Nissila, Kari Laasonen
The water solubility of organic molecules is critical for optimizing the performance and stability of aqueous flow batteries, as well as for various other applications. Although solubility is relatively straightforward to measure in some cases, its theoretical prediction remains a considerable challenge. To this end, machine learning algorithms have become increasingly important tools in the past decade. High-quality data and effective descriptors are essential for constructing reliable data-driven estimation models. We systematically investigate the effectiveness of enhanced structure-based descriptors and an outlier detection procedure for improving the predictability of aqueous solubility. We train and evaluate random forest regression models using various descriptors to predict experimental solubility. Outliers are identified through an iterative maximum-error deletion procedure. We find that descriptors derived from hydration free energy and weighted fingerprints, along with other established features, are effective. Notably, solvation energy, octanol-water partition coefficient, atomic charge polarizability interactions, and the presence of a full-carbon aromatic ring are critical for solubility prediction. Furthermore, the effectiveness of the outlier detection protocol is validated both by the improved performance of the model and by a detailed analysis of the dataset. This study significantly improves the predictive capacity of supervised machine learning for molecular properties, enabling advancements in various technological applications.
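The abstract names an iterative maximum-error deletion procedure but does not give its details on this page. The following is only a minimal sketch of one plausible reading: fit a model, find the sample with the largest absolute error, delete it if that error exceeds a cutoff, and repeat. The function names, the `threshold` and `max_removals` parameters, the stopping rule, and the use of in-sample errors are all assumptions made for illustration. To keep the sketch dependency-free, a one-feature least-squares line stands in for the random forest regressor used in the preprint.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Model = Callable[[float], float]


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Model:
    """Least-squares line; a stand-in for the random forest (assumption)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def iterative_max_error_deletion(
    xs: Sequence[float],
    ys: Sequence[float],
    fit: Callable[[Sequence[float], Sequence[float]], Model] = fit_line,
    threshold: float = 1.0,   # hypothetical error cutoff
    max_removals: int = 10,   # hypothetical safety cap
) -> Tuple[List[int], List[int]]:
    """Return (kept indices, removed indices in order of deletion)."""
    keep = list(range(len(xs)))
    removed: List[int] = []
    while len(keep) > 2 and len(removed) < max_removals:
        # Refit on the samples that are still kept.
        model = fit([xs[i] for i in keep], [ys[i] for i in keep])
        # Absolute error of every kept sample under the refitted model.
        errors = {i: abs(ys[i] - model(xs[i])) for i in keep}
        worst = max(errors, key=errors.get)
        if errors[worst] <= threshold:
            break  # no remaining sample exceeds the cutoff
        keep.remove(worst)
        removed.append(worst)
    return keep, removed


if __name__ == "__main__":
    # Toy data: y = x, except for one corrupted measurement at index 4.
    x_demo = [0, 1, 2, 3, 4, 5]
    y_demo = [0, 1, 2, 3, 10, 5]
    print(iterative_max_error_deletion(x_demo, y_demo))
```

On the toy data, only the corrupted sample at index 4 is flagged. In practice one would more likely compute the errors from held-out or cross-validated predictions rather than in-sample fits, since a flexible model such as a random forest can memorize an outlier it was trained on.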

Funding

Academy of Finland - 312298

European Union - Next GenerationEU - 353298

Teknologiateollisuuden 100v Säätiö and Jane ja Aatos Erkon Säätiö

History

School

  • Science

Published in

ChemRxiv

Publisher

ChemRxiv

Publisher statement

The content is available under CC BY 4.0. Content is a preprint and has not been peer-reviewed.

Publication date

2025-01-16

Copyright date

2025

Notes

16 January 2025, Version 1 - Content is a preprint and has not undergone peer review at the time of posting.

eISSN

2573-2293

Language

  • en

Depositor

Prof Tapio Ala-Nissila. Deposit date: 29 October 2025
