MegaSpec predicts GC-MS spectra from molecular structure, so that candidate molecules can be ranked by matching a measured spectrum against a database of millions of spectra, while a secondary model estimates confidence in the predictions.
Gas chromatography-mass spectrometry (GC-MS) is used to detect and ultimately identify small molecules. The spectral pattern and mass assignments of a single spectrum can suggest many candidate fragments and molecules, so it is key to narrow this list down to high-probability matches.
In GC-MS with electron ionization (EI), molecules dissociate into fragments with a reproducible fingerprint regardless of the instrument, suggesting that with a large library of molecules and fragments we could predict fragmentation spectra from structure alone.
Currently, scientists in analytical labs search commercial libraries for spectral matches to make a tentative identification. The list of hits can be reduced by considering the retention time of the compound. However, GC-MS EI spectra often lack an intact parent peak, and filtering hits by retention time does not distinguish between structurally similar compounds. Other MS techniques can generate a detectable parent ion to help confirm compound identity, but these methods are often inaccessible in routine labs or suffer from low sensitivity and low reliability. In every case, identification is also limited by the size of the GC-MS EI library available for the user to search.
Prior work has explored using neural networks to augment existing libraries by predicting the GC-MS EI spectra of untested compounds. One approach used simple multi-layer perceptrons trained on Morgan fingerprints (Wei et al., 2019). Another group trained graph neural networks on molecular subsets to predict GC-MS EI spectra (Zhu and Jonas, 2023).
We have expanded upon these approaches by fine-tuning a pretrained Molecular Bidirectional Auto-Regressive Transformer (MolBART) to estimate a given compound’s GC-MS EI spectrum. Our fine-tuning dataset includes a subset of a publicly available GC-MS EI library (~86,000 molecules).
We have evaluated the model in a library‑matching setting by ranking candidate molecules based on similarity between the query spectrum and a library of predicted spectra. Using this recommendation‑style approach, the correct compound ranked within the top 10 in over 80% of searches.
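The ranking step described above can be illustrated with a minimal sketch. It assumes spectra are stored as equal-length intensity vectors (one entry per m/z bin) and uses cosine similarity as the scoring function; the metric MegaSpec actually uses is not stated here, and the function names, compound names, and toy intensities below are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


def rank_candidates(query, library):
    """Return (name, score) pairs sorted from best to worst match.

    `library` maps a candidate name to its (predicted) spectrum vector.
    """
    scored = [(name, cosine_similarity(query, spectrum))
              for name, spectrum in library.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def in_top_k(ranked, true_name, k=10):
    """True if the correct compound appears among the first k ranked hits."""
    return true_name in [name for name, _ in ranked[:k]]


# Toy library of hypothetical predicted spectra (5 m/z bins each).
library = {
    "compound_A": [0.0, 10.0, 100.0, 5.0, 0.0],
    "compound_B": [100.0, 0.0, 5.0, 0.0, 20.0],
    "compound_C": [0.0, 20.0, 90.0, 0.0, 10.0],
}

# Hypothetical measured spectrum, closest to compound_A.
query = [0.0, 12.0, 100.0, 4.0, 0.0]

ranked = rank_candidates(query, library)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

A top-10 hit rate like the one reported above would then correspond to the fraction of query spectra for which `in_top_k(ranked, true_name, k=10)` is true.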
To support laboratory use, we have developed MegaSpec, a command‑line tool that predicts spectra from SMILES and returns ranked candidate matches, effectively expanding GC‑MS EI libraries with predicted data.
MegaSpec can be customized to include your company's data, and we can provide a graphical user interface to complement the command-line interface.
Wei, J. N.; Belanger, D.; Adams, R. P.; Sculley, D. Rapid Prediction of Electron–Ionization Mass Spectrometry Using Neural Networks. ACS Central Science 2019, 5 (4), 700-708. DOI: 10.1021/acscentsci.9b00085.
Zhu, R. L.; Jonas, E. Rapid Approximate Subset-Based Spectra Prediction for Electron Ionization–Mass Spectrometry. Analytical Chemistry 2023, 95 (5), 2653-2663. DOI: 10.1021/acs.analchem.2c02093.