Abstract:
The physicochemical and biological properties of organic compounds play a crucial role in
various fields such as drug manufacturing, development of analytical methods, and toxicity
assessment. Quantitative structure-property relationship (QSPR) and quantitative structure-activity
relationship (QSAR) models offer a promising way to estimate these properties rapidly and
cost-effectively.
In this study, we developed QSPR and QSAR models using relevant chemical descriptors and
chemometric techniques. The dataset included the structures of 472 organic molecules sourced
from the PubChem and Mol-Instincts databases. A total of 383 descriptors were calculated for each
structure, with 354 obtained using the MOE software and 29 via the pkCSM server.
After preprocessing the data, we selected 264 descriptors, of which 222 were used as
independent variables and 26 as responses. The dataset was split using the Kennard-Stone
algorithm.
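As an illustration of the splitting step, the following is a minimal sketch of the classical Kennard-Stone selection in plain Python. It assumes Euclidean distance on the descriptor vectors; the function name `kennard_stone`, the toy descriptor values, and the selection size are hypothetical and are not taken from the study.

```python
import math

def kennard_stone(points, n_select):
    """Return indices of n_select points chosen by the Kennard-Stone procedure."""
    n = len(points)
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and len(points)")
    # Pairwise Euclidean distances (assumption: the study's metric is not stated).
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Step 1: start with the two most distant points.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [i0, j0]
    remaining = set(range(n)) - {i0, j0}
    # Step 2: repeatedly add the candidate whose nearest selected point is farthest away.
    while len(selected) < n_select:
        nxt = max(sorted(remaining), key=lambda c: min(dist[c][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Toy two-descriptor data set (made-up values, for illustration only).
descriptors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (0.5, 0.5), (4.0, 4.5)]
train_idx = kennard_stone(descriptors, 4)
test_idx = [i for i in range(len(descriptors)) if i not in train_idx]
```

The selected indices would typically form the calibration (training) set and the remaining molecules the validation set, so that the training set spans the descriptor space as uniformly as possible.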
Initial modeling was performed using multiple linear regression (MLR) with simple descriptors
that are readily accessible and straightforward to use. Several variable selection
methods (forward, backward, stepwise, and genetic algorithm) were employed.
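To make the forward variant concrete, here is a minimal sketch of forward selection wrapped around an ordinary least-squares MLR fit, written in plain Python. The stopping rule (a minimum gain in R²), the helper names `fit_ols`, `r2`, and `forward_select`, and the toy data are assumptions made for illustration; the study's actual selection criterion is not stated here.

```python
def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Augmented matrix [A^T A | A^T y].
    M = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(k):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]

def r2(X, y, beta):
    """Coefficient of determination of the fitted linear model."""
    pred = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot

def forward_select(X, y, max_vars, min_gain=0.01):
    """Greedily add the descriptor that raises R^2 most; stop when the gain is small."""
    chosen, best = [], 0.0
    while len(chosen) < max_vars:
        scores = {}
        for j in range(len(X[0])):
            if j in chosen:
                continue
            sub = [[row[c] for c in chosen + [j]] for row in X]
            scores[j] = r2(sub, y, fit_ols(sub, y))
        if not scores:
            break
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best < min_gain:  # assumed stopping rule
            break
        chosen.append(j_best)
        best = scores[j_best]
    return chosen, best

# Toy data (made up): the response depends on descriptors 0 and 2 only.
X = [[1, 5, 2], [2, 3, 1], [3, 6, 4], [4, 2, 3], [5, 4, 6], [6, 1, 5]]
y = [1 + 2 * a + 3 * c for a, _, c in X]
chosen, best_r2 = forward_select(X, y, max_vars=3)
```

Backward elimination follows the same pattern in reverse, starting from all descriptors and removing the least useful one at each step, while stepwise selection alternates between the two moves.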
The chosen responses for the chemometric models were: logP(o/w), h_logP, logS, h_logS, mr,
h_mr, TPSA, Caco2 permeability, Intestinal absorption, BBB permeability, CNS permeability,
Oral Rat Chronic Toxicity (LOAEL), and Minnow toxicity.
MLR yielded models with R² values ranging from 0.252 to 0.987. Among the 14 studied
responses, 12 models achieved R² ≥ 0.6.
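For reference, and assuming the reported R² is the usual coefficient of determination (the abstract does not specify the definition or the data subset on which it was computed), it can be written for observed values $y_i$, predicted values $\hat{y}_i$, and mean observed value $\bar{y}$ as:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
               {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Under this definition, R² = 0.6 means that the model accounts for 60% of the variance of the response around its mean.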
To improve predictions, we also explored models based on artificial neural networks (ANN). The
ANN models substantially outperformed MLR, with R² values ranging from 0.394 to 0.999.
Among the 14 studied responses, 12 ANN models achieved R² ≥ 0.7, and 9 models achieved R² ≥
0.9.
In conclusion, the results indicate that MLR and ANN models can accurately model and predict
most of the studied properties of organic molecules. These approaches offer
promising prospects for rapid and cost-effective estimation of the physicochemical and biological
properties of organic compounds.