Researchers developed a structure-informed deep generative model for de novo metabolite annotation in untargeted metabolomics
Date: 2026-05-06
Metabolite annotation, especially the discovery of unknown metabolites, remains a fundamental challenge in mass spectrometry-based untargeted metabolomics because reference mass spectra are limited. Library-based spectral matching is the gold standard for annotation but covers only known metabolites with available reference spectra. Unknown metabolites, including "known unknowns" (metabolites present in structural databases but lacking reference spectra) and "unknown unknowns" (metabolites with previously uncharacterized structures), therefore remain difficult to annotate. Recent advances in deep generative modeling have enabled the discovery of chemical structures beyond known chemical space. However, because high-quality MS2 spectra are scarce, the potential of in-silico prediction for metabolite annotation has yet to be fully realized.
In a study published in Nature Communications, researchers led by Prof. ZHU Zheng-Jiang at the Shanghai Institute of Organic Chemistry (SIOC) of the Chinese Academy of Sciences developed MetGenX, a structure-informed encoder-decoder neural network for efficient and controllable metabolite generation from MS2 spectra. The model supports the discovery of previously unknown metabolites and offers a new route to biological insight (https://www.nature.com/articles/s41467-026-72149-6).
Instead of directly encoding a query MS2 spectrum, MetGenX adopts a representation transformation strategy that first maps the query spectrum to structurally similar metabolites through spectral similarity search, and then uses these retrieved structures as templates to guide de novo generation. This approach transforms the problem from “spectrum-to-structure” to “structure-to-structure” generation, utilizing structure templates to bridge the gap between experimental data and de novo structure generation, thereby improving model performance.
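The retrieval step described above can be illustrated with a minimal sketch. It assumes spectra are lists of (m/z, intensity) peaks, uses a plain binned cosine score as the similarity measure, and searches a tiny made-up library; the function names, bin width, and toy peaks are all hypothetical and are not taken from MetGenX, whose actual similarity search may differ.

```python
import math
from collections import defaultdict

# Toy library of (SMILES, peaks) pairs. The peak lists are invented for
# illustration only and are NOT real reference spectra.
TOY_LIBRARY = [
    ("CCO", [(31.0, 100.0), (45.0, 50.0)]),
    ("CC(=O)O", [(43.0, 100.0), (60.0, 30.0)]),
    ("CCCO", [(31.0, 80.0), (59.0, 40.0)]),
]


def bin_spectrum(peaks, bin_width=1.0):
    """Sum the intensities of (m/z, intensity) peaks into fixed-width bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned


def cosine_similarity(a, b):
    """Cosine similarity between two binned spectra (dicts of bin -> intensity)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_templates(query_peaks, library, top_n=3, bin_width=1.0):
    """Return the top_n (score, SMILES) pairs most similar to the query spectrum."""
    query = bin_spectrum(query_peaks, bin_width)
    scored = [
        (cosine_similarity(query, bin_spectrum(peaks, bin_width)), smiles)
        for smiles, peaks in library
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

In this sketch, the structures returned by `retrieve_templates` would play the role of the templates that condition the generator, so the generative model itself only ever sees structures as input.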
Through this design, MetGenX allows a generative model to be pretrained on structures and subsequently fine-tuned on MS2 spectra within a unified architecture, so the training data can far exceed the scale of available reference MS2 spectra. The model was first pretrained on over two million biologically relevant chemical structures and then fine-tuned using the NIST20 MS2 spectral database to connect structures with spectra. This pretraining and fine-tuning framework overcomes the scarcity of MS2 spectral training data, achieving accurate and robust metabolite annotation while also enabling the discovery of previously uncharacterized metabolites.
In independent tests on the NIST20 spectral dataset, MetGenX achieved a top-1 accuracy of 55.9% and a top-3 accuracy of 76.1% on 1388 MS2 spectra. In real biological samples, it achieved a top-1 accuracy of 68.5% and a top-3 accuracy of 89.2% on 1681 MS2 spectra, outperforming other in-silico annotation tools in both accuracy and coverage. Notably, as a structure-informed generative model, MetGenX generalized effectively to MS2 spectra acquired in negative ionization mode without retraining: on 2319 MS2 spectra from real biological samples, it achieved a top-1 accuracy of 60.7% and a top-3 accuracy of 82.5%, demonstrating robust performance across ionization modes. To demonstrate real-world utility, the researchers applied MetGenX to untargeted metabolomics data from mouse liver tissue using a multi-step annotation workflow. This analysis identified two previously unreported metabolites that are absent from major metabolome databases, showing the model's potential to discover previously uncharacterized metabolites.
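For readers unfamiliar with the metric, top-k accuracy is the fraction of spectra whose true structure appears among the first k ranked candidates. The sketch below shows one way to compute it; it assumes each structure is represented by a comparable identifier string (for example a canonical SMILES or InChIKey), and the function name and toy data are hypothetical rather than taken from the study's evaluation code.

```python
def top_k_accuracy(ranked_candidates, true_structures, k):
    """Fraction of queries whose true structure is among the first k candidates.

    ranked_candidates: list of candidate-identifier lists, best candidate first.
    true_structures:   list of ground-truth identifiers, one per query.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(ranked_candidates) != len(true_structures):
        raise ValueError("one candidate list is required per true structure")
    if not true_structures:
        return 0.0
    hits = sum(
        1
        for candidates, truth in zip(ranked_candidates, true_structures)
        if truth in candidates[:k]
    )
    return hits / len(true_structures)


# Toy example with placeholder identifiers (not real annotation results).
toy_rankings = [
    ["A", "B", "C"],  # correct at rank 1
    ["B", "A", "C"],  # correct at rank 2
    ["C", "D", "A"],  # correct at rank 3
    ["D", "E", "F"],  # true structure not proposed
]
toy_truth = ["A", "A", "A", "X"]

top1 = top_k_accuracy(toy_rankings, toy_truth, k=1)  # 0.25
top3 = top_k_accuracy(toy_rankings, toy_truth, k=3)  # 0.75
```

Because top-3 counts every hit that top-1 counts, the top-3 value can never be lower than the top-1 value on the same set of spectra.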
This work was supported by the National Natural Science Foundation of China, the Chinese Academy of Sciences, the Shanghai Municipal Science and Technology Commission, and the Shanghai Academy of Natural Sciences, among other funding sources.

Figure 1. Structure-informed deep generative model MetGenX for metabolite annotation
Article Link: https://www.nature.com/articles/s41467-026-72149-6