Machine-learning Prediction Of Infrared Spectra Of Interstellar Polycyclic Aromatic Hydrocarbons


Illustration of how the topological descriptors are built based on the presence of different molecular fragments. Example molecules are shown on top, with marks showing the positions of the specific fragments depicted below. The middle line shows the corresponding region of the generated counting fingerprints. The red and blue fragments are generated by the first iteration of the fingerprint generation, so they only contain information about their base atom. In the current case, the red items represent carbon atoms with 3 non-hydrogen neighbours and no hydrogens connected to them, while the blue items represent carbon atoms with 2 non-hydrogen bonds and, likewise, with no hydrogen neighbours. The green and yellow circles show fragments generated by the 2nd and 3rd iterations respectively. During training we use more than 9200 unique fragments generated by up to 11 iterations of the algorithm.

We design and train a neural network (NN) model to efficiently predict the infrared spectra of interstellar polycyclic aromatic hydrocarbons (PAHs) with a computational cost many orders of magnitude lower than what a first-principles calculation would demand.

The input to the NN is based on the Morgan fingerprints extracted from the skeletal formulas of the molecules and does not require precise geometrical information such as interatomic distances. The model shows excellent predictive skill for out-of-sample inputs, making it suitable for improving the mixture models currently used for understanding the chemical composition and evolution of the interstellar medium.
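The iterative fragment counting described in the figure caption can be sketched in a few lines of Python. This is a simplified, Morgan-like illustration and not the authors' pipeline: the function name `counting_fingerprint`, the atom and bond encoding, and the naphthalene example are assumptions made here. Production implementations such as ECFP also hash identifiers to integers and discard duplicate substructures, both of which this sketch skips.

```python
from collections import Counter

def counting_fingerprint(atoms, bonds, n_iterations=3):
    """Count Morgan-like fragments from a skeletal formula.

    atoms: list of (element, number_of_attached_hydrogens)
    bonds: list of (i, j) index pairs between non-hydrogen atoms
    Returns a Counter keyed by (iteration, fragment_identifier).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbours[i].append(j)
        neighbours[j].append(i)

    # Iteration 1: the identifier only describes the base atom
    # (element, number of non-hydrogen neighbours, number of hydrogens).
    ids = [(elem, len(neighbours[i]), n_h)
           for i, (elem, n_h) in enumerate(atoms)]
    counts = Counter((1, ident) for ident in ids)

    # Later iterations: combine each atom's identifier with the sorted
    # identifiers of its neighbours, so the fragment grows by one bond.
    for iteration in range(2, n_iterations + 1):
        ids = [(ids[i], tuple(sorted(ids[j] for j in neighbours[i])))
               for i in range(len(atoms))]
        counts.update((iteration, ident) for ident in ids)
    return counts

# Hypothetical example: naphthalene (C10H8). Atoms 4 and 9 are the fused
# carbons with three non-hydrogen neighbours and no hydrogens.
NAPHTHALENE_ATOMS = [("C", 0) if i in (4, 9) else ("C", 1) for i in range(10)]
NAPHTHALENE_BONDS = [(i, (i + 1) % 10) for i in range(10)] + [(4, 9)]

fingerprint = counting_fingerprint(NAPHTHALENE_ATOMS, NAPHTHALENE_BONDS)
print(fingerprint[(1, ("C", 3, 0))])  # fused carbons: 2
print(fingerprint[(1, ("C", 2, 1))])  # CH carbons: 8
```

Only the connectivity and hydrogen counts enter the descriptor, which is why no interatomic distances are needed.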

We also identify the constraints on its applicability caused by the limited diversity of the training data and estimate the prediction errors using an ensemble of NNs trained on subsets of the data. With help from other machine-learning methods like random forests, we dissect the role of different chemical features in this prediction. The power of these topological descriptors is demonstrated by the limited effect of including detailed geometrical information in the form of Coulomb matrix eigenvalues.
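The idea of estimating prediction errors from an ensemble trained on subsets of the data can be illustrated with a toy example. The sketch below is an assumption-laden stand-in, not the authors' setup: a one-dimensional least-squares line replaces each NN, the data are synthetic, and the names `fit_line`, `train_ensemble`, and `predict_with_error` are hypothetical. The spread of the ensemble's predictions serves as the error estimate.

```python
import random
import statistics

def fit_line(points):
    """Least-squares fit y = a*x + b; stands in for training one NN."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def train_ensemble(data, n_models=10, subset_fraction=0.8, seed=0):
    """Train each ensemble member on a different random subset of the data."""
    rng = random.Random(seed)
    k = max(2, int(subset_fraction * len(data)))
    return [fit_line(rng.sample(data, k)) for _ in range(n_models)]

def predict_with_error(ensemble, x):
    """Return the ensemble mean and its standard deviation at input x."""
    predictions = [a * x + b for a, b in ensemble]
    return statistics.mean(predictions), statistics.stdev(predictions)

# Synthetic training data covering only 0 <= x < 1 (assumed for illustration).
noise = random.Random(42)
data = [(i / 20, 2.0 * (i / 20) + 1.0 + noise.gauss(0.0, 0.1)) for i in range(20)]
ensemble = train_ensemble(data)

mean_in, err_in = predict_with_error(ensemble, 0.5)    # inside the training range
mean_out, err_out = predict_with_error(ensemble, 5.0)  # far outside it
print(f"x=0.5: {mean_in:.2f} +/- {err_in:.3f}")
print(f"x=5.0: {mean_out:.2f} +/- {err_out:.3f}")
```

In this toy case the members disagree more for inputs far from the training data, which is the kind of signal that can flag molecules unlike anything the model was trained on.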

Peter Kovacs, Xiaosi Zhu, Jesus Carrete, Georg K. H. Madsen, Zhao Wang

Comments: 8 figures
Subjects: Astrophysics of Galaxies (astro-ph.GA); Instrumentation and Methods for Astrophysics (astro-ph.IM); Solar and Stellar Astrophysics (astro-ph.SR)
Journal reference: 2020 ApJ 902 100
DOI: 10.3847/1538-4357/abb5b6
Cite as: arXiv:2010.09150 [astro-ph.GA] (or arXiv:2010.09150v1 [astro-ph.GA] for this version)
Submission history
From: Zhao Wang
[v1] Mon, 19 Oct 2020 00:27:12 UTC (607 KB)
https://arxiv.org/abs/2010.09150
