Background: The Avogadro project has developed an advanced molecule editor and visualizer designed for cross-platform use in computational chemistry, molecular modeling, bioinformatics, materials science, and related areas. It offers flexible, high quality rendering, and a powerful plugin architecture. Typical uses include building molecular structures, formatting input files, and analyzing output of a wide variety of computational chemistry packages. By using the CML file format as its native document type, Avogadro seeks to enhance the semantic accessibility of chemical data types. Results: The work presented here details the Avogadro library, which is a framework providing a code library and application programming interface (API) with three-dimensional visualization capabilities; and has direct applications to research and education in the fields of chemistry, physics, materials science, and biology. The Avogadro application provides a rich graphical interface using dynamically loaded plugins through the library itself. The application and library can each be extended by implementing a plugin module in C++ or Python to explore different visualization techniques, build/manipulate molecular structures, and interact with other programs. We describe some example extensions, one which uses a genetic algorithm to find stable crystal structures, and one which interfaces with the PackMol program to create packed, solvated structures for molecular dynamics simulations. The 1.0 release series of Avogadro is the main focus of the results discussed here. Conclusions: Avogadro offers a semantic chemical builder and platform for visualization and analysis. For users, it offers an easy-to-use builder, integrated support for downloading from common databases such as PubChem and the Protein Data Bank, extracting chemical data from a wide variety of formats, including computational chemistry output, and native, semantic support for the CML file format. For developers, it can be easily extended via a powerful plugin mechanism to support new features in organic chemistry, inorganic complexes, drug design, materials, biomolecules, and simulations. Avogadro is freely available under an open-source license from http://avogadro.openmolecules.net.
A frequent problem in computational modeling is the interconversion of chemical structures between different formats. While standard interchange formats exist (for example, Chemical Markup Language) and de facto standards have arisen (for example, SMILES format), the need to interconvert formats is a continuing problem due to the multitude of different application areas for chemistry data, differences in the data stored by different formats (0D versus 3D, for example), and competition between software along with a lack of vendor-neutral formats.We discuss, for the first time, Open Babel, an open-source chemical toolbox that speaks the many languages of chemical data. Open Babel version 2.3 interconverts over 110 formats. The need to represent such a wide variety of chemical and molecular data requires a library that implements a wide range of cheminformatics algorithms, from partial charge assignment and aromaticity detection, to bond order perception and canonicalization. We detail the implementation of Open Babel, describe key advances in the 2.3 release, and outline a variety of uses both in terms of software products and scientific research, including applications far beyond simple format interconversion.Open Babel presents a solution to the proliferation of multiple chemical file formats. In addition, it provides a variety of useful utilities from conformer searching and 2D depiction, to filtering, batch conversion, and substructure and similarity searching. For developers, it can be used as a programming library to handle chemical data in areas such as organic chemistry, drug design, materials science, and computational chemistry. It is freely available under an open-source license from http://openbabel.org .
The in silico fragmenter MetFrag, launched in 2010, was one of the first approaches combining compound database searching and fragmentation prediction for small molecule identification from tandem mass spectrometry data. Since then many new approaches have evolved, as has MetFrag itself. This article details the latest developments to MetFrag and its use in small molecule identification since the original publication.MetFrag has gone through algorithmic and scoring refinements. New features include the retrieval of reference, data source and patent information via ChemSpider and PubChem web services, as well as InChIKey filtering to reduce candidate redundancy due to stereoisomerism. Candidates can be filtered or scored differently based on criteria like occurence of certain elements and/or substructures prior to fragmentation, or presence in so-called “suspect lists”. Retention time information can now be calculated either within MetFrag with a sufficient amount of user-provided retention times, or incorporated separately as “user-defined scores” to be included in candidate ranking. The changes to MetFrag were evaluated on the original dataset as well as a dataset of 473 merged high resolution tandem mass spectra (HR-MS/MS) and compared with another open source in silico fragmenter, CFM-ID. Using HR-MS/MS information only, MetFrag2.2 and CFM-ID had 30 and 43 Top 1 ranks, respectively, using PubChem as a database. Including reference and retention information in MetFrag2.2 improved this to 420 and 336 Top 1 ranks with ChemSpider and PubChem (89 and 71 %), respectively, and even up to 343 Top 1 ranks (PubChem) when combining with CFM-ID. The optimal parameters and weights were verified using three additional datasets of 824 merged HR-MS/MS spectra in total. Further examples are given to demonstrate flexibility of the enhanced features.In many cases additional information is available from the experimental context to add to small molecule identification, which is especially useful where the mass spectrum alone is not sufficient for candidate selection from a large number of candidates. The results achieved with MetFrag2.2 clearly show the benefit of considering this additional information. The new functions greatly enhance the chance of identification success and have been incorporated into a command line interface in a flexible way designed to be integrated into high throughput workflows. Feedback on the command line version of MetFrag2.2 available at http://c-ruttkies.github.io/MetFrag/ is welcome.
Modern medicine often clashes with traditional medicine such as Chinese herbal medicine because of the little understanding of the underlying mechanisms of action of the herbs. In an effort to promote integration of both sides and to accelerate the drug discovery from herbal medicines, an efficient systems pharmacology platform that represents ideal information convergence of pharmacochemistry, ADME properties, drug-likeness, drug targets, associated diseases and interaction networks, are urgently needed.The traditional Chinese medicine systems pharmacology database and analysis platform (TCMSP) was built based on the framework of systems pharmacology for herbal medicines. It consists of all the 499 Chinese herbs registered in the Chinese pharmacopoeia with 29,384 ingredients, 3,311 targets and 837 associated diseases. Twelve important ADME-related properties like human oral bioavailability, half-life, drug-likeness, Caco-2 permeability, blood-brain barrier and Lipinski’s rule of five are provided for drug screening and evaluation. TCMSP also provides drug targets and diseases of each active compound, which can automatically establish the compound-target and target-disease networks that let users view and analyze the drug action mechanisms. It is designed to fuel the development of herbal medicines and to promote integration of modern medicine and traditional medicine for drug discovery and development.The particular strengths of TCMSP are the composition of the large number of herbal entries, and the ability to identify drug-target networks and drug-disease networks, which will help revealing the mechanisms of action of Chinese herbs, uncovering the nature of TCM theory and developing new herb-oriented drugs. TCMSP is freely available at http://sm.nwsuaf.edu.cn/lsp/tcmsp.php .
Cheminformaticians are equipped with a very rich toolbox when carrying out molecular similarity calculations. A large number of molecular representations exist, and there are several methods (similarity and distance metrics) to quantify the similarity of molecular representations. In this work, eight well-known similarity/distance metrics are compared on a large dataset of molecular fingerprints with sum of ranking differences (SRD) and ANOVA analysis. The effects of molecular size, selection methods and data pretreatment methods on the outcome of the comparison are also assessed.A supplier database ( https://mcule.com/ ) was used as the source of compounds for the similarity calculations in this study. A large number of datasets, each consisting of one hundred compounds, were compiled, molecular fingerprints were generated and similarity values between a randomly chosen reference compound and the rest were calculated for each dataset. Similarity metrics were compared based on their ranking of the compounds within one experiment (one dataset) using sum of ranking differences (SRD), while the results of the entire set of experiments were summarized on box and whisker plots. Finally, the effects of various factors (data pretreatment, molecule size, selection method) were evaluated with analysis of variance (ANOVA).This study complements previous efforts to examine and rank various metrics for molecular similarity calculations. Here, however, an entirely general approach was taken to neglect any a priori knowledge on the compounds involved, as well as any bias introduced by examining only one or a few specific scenarios. The Tanimoto index, Dice index, Cosine coefficient and Soergel distance were identified to be the best (and in some sense equivalent) metrics for similarity calculations, i.e. these metrics could produce the rankings closest to the composite (average) ranking of the eight metrics. The similarity metrics derived from Euclidean and Manhattan distances are not recommended on their own, although their variability and diversity from other similarity metrics might be advantageous in certain cases (e.g. for data fusion). Conclusions are also drawn regarding the effects of molecule size, selection method and data pretreatment on the ranking behavior of the studied metrics. Graphical Abstract A visual summary of the comparison of similarity metrics with sum of ranking differences (SRD).
Since its public introduction in 2005 the IUPAC InChI chemical structure identifier standard has become the international, worldwide standard for defined chemical structures. This article will describe the extensive use and dissemination of the InChI and InChIKey structure representations by and for the world-wide chemistry community, the chemical information community, and major publishers and disseminators of chemical and related scientific offerings in manuscripts and databases.
Metabolomics experiments involve generating and comparing small molecule (metabolite) profiles from complex mixture samples to identify those metabolites that are modulated in altered states (e.g., disease, drug treatment, toxin exposure). One non-targeted metabolomics approach attempts to identify and interrogate all small molecules in a sample using GC or LC separation followed by MS or MSn detection. Analysis of the resulting large, multifaceted data sets to rapidly and accurately identify the metabolites is a challenging task that relies on the availability of chemical libraries of metabolite spectral signatures. A method for analyzing spectrometry data to identify and Qu antify I ndividual C omponents in a S ample, (QUICS), enables generation of chemical library entries from known standards and, importantly, from unknown metabolites present in experimental samples but without a corresponding library entry. This method accounts for all ions in a sample spectrum, performs library matches, and allows review of the data to quality check library entries. The QUICS method identifies ions related to any given metabolite by correlating ion data across the complete set of experimental samples, thus revealing subtle spectral trends that may not be evident when viewing individual samples and are likely to be indicative of the presence of one or more otherwise obscured metabolites.LC-MS/MS or GC-MS data from 33 liver samples were analyzed simultaneously which exploited the inherent biological diversity of the samples and the largely non-covariant chemical nature of the metabolites when viewed over multiple samples. Ions were partitioned by both retention time (RT) and covariance which grouped ions from a single common underlying metabolite. This approach benefitted from using mass, time and intensity data in aggregate over the entire sample set to reject outliers and noise thereby producing higher quality chemical identities. The aggregated data was matched to reference chemical libraries to aid in identifying the ion set as a known metabolite or as a new unknown biochemical to be added to the library.The QUICS methodology enabled rapid, in-depth evaluation of all possible metabolites (known and unknown) within a set of samples to identify the metabolites and, for those that did not have an entry in the reference library, to create a library entry to identify that metabolite in future studies.
We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches.We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case.We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models.We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
Chemical compounds and drugs are an important class of entities in biomedical research with great potential in a wide range of applications, including clinical medicine. Locating chemical named entities in the literature is a useful step in chemical text mining pipelines for identifying the chemical mentions, their properties, and their relationships as discussed in the literature.We introduce the tmChem system, a chemical named entity recognizer created by combining two independent machine learning models in an ensemble. We use the corpus released as part of the recent CHEMDNER task to develop and evaluate tmChem, achieving a micro-averaged f-measure of 0.8739 on the CEM subtask (mention-level evaluation) and 0.8745 f-measure on the CDI subtask (abstract-level evaluation). We also report a high-recall combination (0.9212 for CEM and 0.9224 for CDI). tmChem achieved the highest f-measure reported in the CHEMDNER task for the CEM subtask, and the high recall variant achieved the highest recall on both the CEM and CDI tasks.We report that tmChem is a state-of-the-art tool for chemical named entity recognition and that performance for chemical named entity recognition has now tied (or exceeded) the performance previously reported for genes and diseases. Future research should focus on tighter integration between the named entity recognition and normalization steps for improved performance.The source code and a trained model for both models of tmChem is available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmChem . The results of running tmChem (Model 2) on PubMed are available in PubTator: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/PubTator
This work introduces a method to tune a sequence-based generative model for molecular de novo design that through augmented episodic likelihood can learn to generate structures with certain specified desirable properties. We demonstrate how this model can execute a range of tasks such as generating analogues to a query structure and generating compounds predicted to be active against a biological target. As a proof of principle, the model is first trained to generate molecules that do not contain sulphur. As a second example, the model is trained to generate analogues to the drug Celecoxib, a technique that could be used for scaffold hopping or library expansion starting from a single molecule. Finally, when tuning the model towards generating compounds predicted to be active against the dopamine receptor type 2, the model generates structures of which more than 95% are predicted to be active, including experimentally confirmed actives that have not been included in either the generative model nor the activity prediction model. Graphical abstract .
Molecular docking methods are commonly used for predicting binding modes and energies of ligands to proteins. For accurate complex geometry and binding energy estimation, an appropriate method for calculating partial charges is essential. AutoDockTools software, the interface for preparing input files for one of the most widely used docking programs AutoDock 4, utilizes the Gasteiger partial charge calculation method for both protein and ligand charge calculation. However, it has already been shown that more accurate partial charge calculation - and as a consequence, more accurate docking- can be achieved by using quantum chemical methods. For docking calculations quantum chemical partial charge calculation as a routine was only used for ligands so far. The newly developed Mozyme function of MOPAC2009 allows fast partial charge calculation of proteins by quantum mechanical semi-empirical methods. Thus, in the current study, the effect of semi-empirical quantum-mechanical partial charge calculation on docking accuracy could be investigated.The docking accuracy of AutoDock 4 using the original AutoDock scoring function was investigated on a set of 53 protein ligand complexes using Gasteiger and PM6 partial charge calculation methods. This has enabled us to compare the effect of the partial charge calculation method on docking accuracy utilizing AutoDock 4 software. Our results showed that the docking accuracy in regard to complex geometry (docking result defined as accurate when the RMSD of the first rank docking result complex is within 2 Å of the experimentally determined X-ray structure) significantly increased when partial charges of the ligands and proteins were calculated with the semi-empirical PM6 method.Out of the 53 complexes analyzed in the course of our study, the geometry of 42 complexes were accurately calculated using PM6 partial charges, while the use of Gasteiger charges resulted in only 28 accurate geometries. The binding affinity estimation was not influenced by the partial charge calculation method - for more accurate binding affinity prediction development of a new scoring function for AutoDock is needed.Our results demonstrate that the accuracy of determination of complex geometry using AutoDock 4 for docking calculation greatly increases with the use of quantum chemical partial charge calculation on both the ligands and proteins.
Background: Channels and pores in biomacromolecules (proteins, nucleic acids and their complexes) play significant biological roles, e. g., in molecular recognition and enzyme substrate specificity. Results: We present an advanced software tool entitled MOLE 2.0, which has been designed to analyze molecular channels and pores. Benchmark tests against other available software tools showed that MOLE 2.0 is by comparison quicker, more robust and more versatile. As a new feature, MOLE 2.0 estimates physicochemical properties of the identified channels, i.e., hydropathy, hydrophobicity, polarity, charge, and mutability. We also assessed the variability in physicochemical properties of eighty X-ray structures of two members of the cytochrome P450 superfamily. Conclusion: Estimated physicochemical properties of the identified channels in the selected biomacromolecules corresponded well with the known functions of the respective channels. Thus, the predicted physicochemical properties may provide useful information about the potential functions of identified channels. The MOLE 2.0 software is available at http://mole.chemi.muni.cz.
In spite of its great promise, metabolomics has proven difficult to execute in an untargeted and generalizable manner. Liquid chromatography–mass spectrometry (LC–MS) has made it possible to gather data on thousands of cellular metabolites. However, matching metabolites to their spectral features continues to be a bottleneck, meaning that much of the collected information remains uninterpreted and that new metabolites are seldom discovered in untargeted studies. These challenges require new approaches that consider compounds beyond those available in curated biochemistry databases.Here we present Metabolic In silico Network Expansions (MINEs), an extension of known metabolite databases to include molecules that have not been observed, but are likely to occur based on known metabolites and common biochemical reactions. We utilize an algorithm called the Biochemical Network Integrated Computational Explorer (BNICE) and expert-curated reaction rules based on the Enzyme Commission classification system to propose the novel chemical structures and reactions that comprise MINE databases. Starting from the Kyoto Encyclopedia of Genes and Genomes (KEGG) COMPOUND database, the MINE contains over 571,000 compounds, of which 93% are not present in the PubChem database. However, these MINE compounds have on average higher structural similarity to natural products than compounds from KEGG or PubChem. MINE databases were able to propose annotations for 98.6% of a set of 667 MassBank spectra, 14% more than KEGG alone and equivalent to PubChem while returning far fewer candidates per spectra than PubChem (46 vs. 1715 median candidates). Application of MINEs to LC–MS accurate mass data enabled the identity of an unknown peak to be confidently predicted.MINE databases are freely accessible for non-commercial use via user-friendly web-tools at http://minedatabase.mcs.anl.gov and developer-friendly APIs. MINEs improve metabolomics peak identification as compared to general chemical databases whose results include irrelevant synthetic compounds. Furthermore, MINEs complement and expand on previous in silico generated compound databases that focus on human metabolism. We are actively developing the database; future versions of this resource will incorporate transformation rules for spontaneous chemical reactions and more advanced filtering and prioritization of candidate structures.
There is an abundance of information about drugs available on the Web. Data sources range from medicinal chemistry results, over the impact of drugs on gene expression, to the outcomes of drugs in clinical trials. These data are typically not connected together, which reduces the ease with which insights can be gained. Linking Open Drug Data (LODD) is a task force within the World Wide Web Consortium's (W3C) Health Care and Life Sciences Interest Group (HCLS IG). LODD has surveyed publicly available data about drugs, created Linked Data representations of the data sets, and identified interesting scientific and business questions that can be answered once the data sets are connected. The task force provides recommendations for the best practices of exposing data in a Linked Data representation. In this paper, we present past and ongoing work of LODD and discuss the growing importance of Linked Data as a foundation for pharmaceutical R&D data sharing.
Despite an abundance of online databases providing access to chemical data, there is increasing demand for high-quality, structure-curated, open data to meet the various needs of the environmental sciences and computational toxicology communities. The U.S. Environmental Protection Agency’s (EPA) web-based CompTox Chemistry Dashboard is addressing these needs by integrating diverse types of relevant domain data through a cheminformatics layer, built upon a database of curated substances linked to chemical structures. These data include physicochemical, environmental fate and transport, exposure, usage, in vivo toxicity, and in vitro bioassay data, surfaced through an integration hub with link-outs to additional EPA data and public domain online resources. Batch searching allows for direct chemical identifier (ID) mapping and downloading of multiple data streams in several different formats. This facilitates fast access to available structure, property, toxicity, and bioassay data for collections of chemicals (hundreds to thousands at a time). Advanced search capabilities are available to support, for example, non-targeted analysis and identification of chemicals using mass spectrometry. The contents of the chemistry database, presently containing ~ 760,000 substances, are available as public domain data for download. The chemistry content underpinning the Dashboard has been aggregated over the past 15 years by both manual and auto-curation techniques within EPA’s DSSTox project. DSSTox chemical content is subject to strict quality controls to enforce consistency among chemical substance-structure identifiers, as well as list curation review to ensure accurate linkages of DSSTox substances to chemical lists and associated data. The Dashboard, publicly launched in April 2016, has expanded considerably in content and user traffic over the past year. It is continuously evolving with the growth of DSSTox into high-interest or data-rich domains of interest to EPA, such as chemicals on the Toxic Substances Control Act listing, while providing the user community with a flexible and dynamic web-based platform for integration, processing, visualization and delivery of data and resources. The Dashboard provides support for a broad array of research and regulatory programs across the worldwide community of toxicologists and environmental scientists.
Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improve the access and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies using a carefully prepared collection of manually labeled text prepared by specially trained chemists as Gold Standard data. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing - CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition - CEM task). 27 teams (23 academic and 4 commercial, a total of 87 researchers) returned results for the CHEMDNER tasks: 26 teams for CEM and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact in future developments of chemical text mining applications and will form the basis to find related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
The Chemistry Development Kit (CDK) is a widely used open source cheminformatics toolkit, providing data structures to represent chemical concepts along with methods to manipulate such structures and perform computations on them. The library implements a wide variety of cheminformatics algorithms ranging from chemical structure canonicalization to molecular descriptor calculations and pharmacophore perception. It is used in drug discovery, metabolomics, and toxicology. Over the last 10 years, the code base has grown significantly, however, resulting in many complex interdependencies among components and poor performance of many algorithms. We report improvements to the CDK v2.0 since the v1.2 release series, specifically addressing the increased functional complexity and poor performance. We first summarize the addition of new functionality, such atom typing and molecular formula handling, and improvement to existing functionality that has led to significantly better performance for substructure searching, molecular fingerprints, and rendering of molecules. Second, we outline how the CDK has evolved with respect to quality control and the approaches we have adopted to ensure stability, including a code review mechanism.This paper highlights our continued efforts to provide a community driven, open source cheminformatics library, and shows that such collaborative projects can thrive over extended periods of time, resulting in a high-quality and performant library. By taking advantage of community support and contributions, we show that an open source cheminformatics project can act as a peer reviewed publishing platform for scientific computing software. Graphical abstract CDK 2.0 provides new features and improved performance
Complex diseases like cancer are regulated by large, interconnected networks with many pathways affecting cell proliferation, invasion, and drug resistance. However, current cancer therapy predominantly relies on the reductionist approach of one gene-one disease. Combinations of drugs may overcome drug resistance by limiting mutations and induction of escape pathways, but given the enormous number of possible drug combinations, strategies to reduce the search space and prioritize experiments are needed. In this review, we focus on the use of computational modeling, bioinformatics and high-throughput experimental methods for discovery of drug combinations. We highlight cutting-edge systems approaches, including large-scale modeling of cell signaling networks, network motif analysis, statistical association-based models, identifying correlations in gene signatures, functional genomics, and high-throughput combination screens. We also present a list of publicly available data and resources to aid in discovery of drug combinations. Integration of these systems approaches will enable faster discovery and translation of clinically relevant drug combinations.