Convolutional neural network based on SMILES representation of compounds for detecting chemical motif

Published on Dec 1, 2018 in BMC Bioinformatics (2.511)
DOI: 10.1186/s12859-018-2523-5
Maya Hirohara (Keio University), estimated H-index: 1
Yutaka Saito (AIST: National Institute of Advanced Industrial Science and Technology), estimated H-index: 7
+ 2 authors
Yasubumi Sakakibara (Keio University), estimated H-index: 29
Previous studies have suggested deep learning to be a highly effective approach for screening lead compounds for new drugs. Several deep learning models have been developed by addressing the use of various kinds of fingerprints and graph convolution architectures. However, these methods are either advantageous or disadvantageous depending on whether they (1) can distinguish structural differences including chirality of compounds, and (2) can automatically discover effective features. We developed another deep learning model for compound classification. In this method, we constructed a distributed representation of compounds based on the SMILES notation, which linearly represents a compound structure, and applied the SMILES-based representation to a convolutional neural network (CNN). The use of SMILES allows us to process all types of compounds while incorporating a broad range of structure information, and representation learning by CNN automatically acquires a low-dimensional representation of input features. In a benchmark experiment using the TOX 21 dataset, our method outperformed conventional fingerprint methods, and performed comparably against the winning model of the TOX 21 Challenge. Multivariate analysis confirmed that the chemical space consisting of the features learned by SMILES-based representation learning adequately expressed a richer feature space that enabled the accurate discrimination of compounds. Using motif detection with the learned filters, not only important known structures (motifs) such as protein-binding sites but also structures of unknown functional groups were detected. The source code of our SMILES-based convolutional neural network software in the deep learning framework Chainer is available at , and the dataset used for performance evaluation in this work is available at the same URL.
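The idea in the abstract, a SMILES string turned into a per-symbol feature matrix and then scanned by convolutional filters whose strongest response points at a substructure, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Chainer implementation: the vocabulary `VOCAB`, the plain one-hot encoding, the hand-set filter, and the helper names `one_hot_smiles`, `conv1d_valid`, and `best_match` are all hypothetical. In the paper the filters are learned during training rather than written by hand, and the per-symbol features are richer than a one-hot vector.

```python
from typing import List, Tuple

# Hypothetical toy vocabulary of single-character SMILES symbols (an assumption
# for illustration only; multi-character atoms such as "Cl" are not handled).
VOCAB = ["C", "N", "O", "c", "n", "(", ")", "=", "1", "@", "[", "]", "H"]
INDEX = {symbol: i for i, symbol in enumerate(VOCAB)}


def one_hot_smiles(smiles: str, max_len: int) -> List[List[float]]:
    """Encode a SMILES string as a max_len x len(VOCAB) one-hot matrix, zero-padded."""
    if len(smiles) > max_len:
        raise ValueError("SMILES string is longer than max_len")
    matrix = [[0.0] * len(VOCAB) for _ in range(max_len)]
    for pos, symbol in enumerate(smiles):
        matrix[pos][INDEX[symbol]] = 1.0  # KeyError for symbols outside the toy vocabulary
    return matrix


def conv1d_valid(x: List[List[float]], filt: List[List[float]]) -> List[float]:
    """Slide a (width x channels) filter along the sequence axis without padding."""
    width = len(filt)
    return [
        sum(x[start + k][c] * filt[k][c]
            for k in range(width) for c in range(len(VOCAB)))
        for start in range(len(x) - width + 1)
    ]


def best_match(smiles: str, filt: List[List[float]], max_len: int = 16) -> Tuple[str, float]:
    """Return the substring under the filter's strongest activation and its score."""
    activations = conv1d_valid(one_hot_smiles(smiles, max_len), filt)
    start = max(range(len(activations)), key=activations.__getitem__)
    return smiles[start:start + len(filt)], activations[start]


# Hand-set filter that responds to a carboxyl-like pattern (stand-in for a learned filter).
carboxyl_filter = one_hot_smiles("C(=O)O", max_len=6)
print(best_match("CCC(=O)O", carboxyl_filter))  # prints ('C(=O)O', 6.0)
```

Reading off the position of the maximum activation is what makes the filter interpretable as a motif detector in this sketch: the substring under the winning window is the candidate motif.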
References (18)
#1 Hanwen Du, H-Index: 3
#2 Yingchun Cai, H-Index: 5
Last: Weihua Li, H-Index: 30
8 authors in total
Environmental chemicals may affect endocrine systems through multiple mechanisms, one of which is via effects on aromatase (also known as CYP19A1), an enzyme critical for maintaining the normal balance of estrogens and androgens in the body. Therefore, rapid and efficient identification of aromatase-related endocrine disrupting chemicals (EDCs) is important for toxicology and environment risk assessment. In this study, on the basis of the Tox21 10K compound library, in silico classification mode...
10 Citations
#1 Ruili Huang (NIH: National Institutes of Health), H-Index: 1
#2 Menghang Xia (NIH: National Institutes of Health), H-Index: 1
Tens of thousands of chemicals with poorly understood biological properties are released into the environment each day. High-throughput screening (HTS) is potentially a more efficient and cost-effective alternative to traditional toxicity tests. Using HTS, one can profile chemicals for potential adverse effects and prioritize a manageable number for more in-depth testing. Importantly, it can provide clues to mechanism of toxicity. The Tox21 program has generated >50 million quantitative high-thr...
26 Citations
#1 Jack Lanchantin (UVA: University of Virginia), H-Index: 8
#2 Ritambhara Singh (UVA: University of Virginia), H-Index: 8
Last: Yanjun Qi (UVA: University of Virginia), H-Index: 27
4 authors in total
34 Citations
#1 Steven Kearnes (Stanford University), H-Index: 9
#2 Kevin McCloskey (Google), H-Index: 1
Last: Patrick Riley (Google), H-Index: 18
5 authors in total
Molecular “fingerprints” encoding structural information are the workhorse of cheminformatics and machine learning in drug discovery applications. However, fingerprint representations necessarily emphasize particular aspects of the molecular structure while ignoring others, rather than allowing the model to make data-driven decisions. We describe molecular graph convolutions, a machine learning architecture for learning from undirected graphs, specifically small molecules. Graph convolutions use...
215 Citations
#1 Haoyang Zeng (MIT: Massachusetts Institute of Technology), H-Index: 9
#2 Matthew D. Edwards (MIT: Massachusetts Institute of Technology), H-Index: 10
Last: David K. Gifford (MIT: Massachusetts Institute of Technology), H-Index: 60
4 authors in total
140 Citations
#1 David R. Kelley (Harvard University), H-Index: 21
#2 Jasper Snoek (Harvard University), H-Index: 19
Last: John L. Rinn (Harvard University), H-Index: 74
3 authors in total
The process of identifying genomic sites that show statistical relationships to phenotypes holds great promise for human health and disease (Hindorff et al. 2009). However, our current inability to efficiently interpret noncoding variants impedes progress toward using personal genomes in medicine. Coordinated efforts to survey the noncoding genome have shown that sequences marked by DNA accessibility and certain histone modifications are enriched for variants that are statistically related to ph...
281 Citations
#1 Andreas Mayr (Johannes Kepler University of Linz), H-Index: 11
#2 Günter Klambauer (Johannes Kepler University of Linz), H-Index: 15
Last: Sepp Hochreiter (Johannes Kepler University of Linz), H-Index: 30
4 authors in total
The Tox21 Data Challenge has been the largest effort of the scientific community to compare computational methods for toxicity prediction. This challenge comprised 12,000 environmental chemicals and drugs which were measured for 12 different toxic effects by specifically designed assays. We participated in this challenge to assess the performance of Deep Learning in computational toxicity prediction. Deep Learning has already revolutionized image processing, speech recognition, and language unde...
152 Citations
DeepSEA, a deep-learning algorithm trained on large-scale chromatin-profiling data, predicts chromatin effects from sequence alone, has single-nucleotide sensitivity and can predict effects of noncoding variants.
587 Citations
#1 Babak Alipanahi, H-Index: 17
#2 Andrew Delong, H-Index: 13
Last: Brendan J. Frey, H-Index: 59
4 authors in total
The binding specificities of RNA- and DNA-binding proteins are determined from experimental data using a ‘deep learning’ approach.
812 Citations
Neural networks were widely used for quantitative structure–activity relationships (QSAR) in the 1990s. Because of various practical issues (e.g., slow on large problems, difficult to train, prone to overfitting), they were superseded by more robust methods like support vector machine (SVM) and random forest (RF), which arose in the early 2000s. The last 10 years have witnessed a revival of neural networks in the machine learning community thanks to new methods for preventing overfitting, m...
258 Citations
Cited By (3)
#2 Thomas Brettin, H-Index: 43
Last: Rick Stevens, H-Index: 43
9 authors in total
By combining various cancer cell line (CCL) drug screening panels, the size of the data has grown significantly to begin understanding how advances in deep learning can advance drug response predictions. In this paper we train >35,000 neural network models, sweeping over common featurization techniques. We found the RNA-seq to be highly redundant and informative even with subsets larger than 128 features. We found the inclusion of single nucleotide polymorphisms (SNPs) coded as count matrices im...
#1 Thin Nguyen (Deakin University), H-Index: 12
#2 Hang Le (Nha Trang University)
Last: Svetha Venkatesh (Deakin University), H-Index: 45
5 authors in total
The development of new drugs is costly, time consuming, and often accompanied by safety issues. Drug repurposing can avoid the expensive and lengthy process of drug development by finding new uses for already approved drugs. In order to repurpose drugs effectively, it is useful to know which proteins are targeted by which drugs. Computational models that estimate the interaction strength of new drug-target pairs have the potential to expedite drug repurposing. Several models have been propose...
#2 Pierre Baldi (UCI: University of California, Irvine), H-Index: 86
In order to continuously represent molecules, we propose a generative model in the form of a VAE which is operating on the 2D-graph structure of molecules. A side predictor is employed to prune the latent space and help the decoder in generating meaningful adjacency tensor of molecules. Other than the potential applicability in drug design and property prediction, we show the superior performance of this technique in comparison to other similar methods based on the SMILES representation of the m...
3 authors in total
Compound toxicity prediction is a very challenging and critical task in the drug discovery and design field. Traditionally, cell or animal-based experiments are required to confirm the acute oral toxicity of chemical compounds. However, these methods are often restricted by availability of experimental facilities, long experimentation time, and high cost. In this paper, we propose a novel convolutional neural network regression model, named BESTox, to predict the acute oral toxicity (\(LD_{50}\)...
#1 Luis A. Miccio (Donostia International Physics Center), H-Index: 9
#2 Gustavo A. Schwartz (Donostia International Physics Center), H-Index: 19
In this work, convolutional-fully connected neural networks were designed and trained to predict the glass transition temperature of polymers based only on their chemical structure. This approach has been shown to successfully predict the Tg of unknown polymers with average relative errors as low as 6%. Several networks with different architectures or hyperparameters were successfully trained using a previously studied glass transition temperatures dataset for validation, and then the same met...
A proof-of-concept framework for identifying molecules of unknown elemental composition and structure using experimental rotational data and probabilistic deep learning is presented. Using a minimal set of input data determined experimentally, we describe four neural network architectures that yield information to assist in the identification of an unknown molecule. The first architecture translates spectroscopic parameters into Coulomb matrix eigenspectra, as a method of recovering chemical and...
#1 D. Cakmakci (Bilkent University)
#2 E. O. Karakaslar (Bilkent University)
Last: A. E. Cicek (Bilkent University)
8 authors in total
Complete resection of the tumor is important for survival in glioma patients. Even if gross total resection is achieved, left-over micro-scale tissue in the excision cavity risks recurrence. The High Resolution Magic Angle Spinning Nuclear Magnetic Resonance (HRMAS NMR) technique can distinguish healthy and malignant tissue efficiently using peak intensities of biomarker metabolites. The method is fast, sensitive and can work with small and unprocessed samples, which makes it a good fit for real-t...
#1 R.P. Sharma (Indian Institute of Chemical Technology), H-Index: 4
Clustering brings molecules having similar patterns together and is governed mainly by the structural features (SFs). The challenge is to cluster in such a way that the minimum number of groups with significant molecules having similar prevalent patterns comes together with minimal human intervention. Determining an automatic and reliable approach to cluster molecules is crucial for clinical assessment of medical conditions. Hypertension is one such health condition and anti-hyperte...
#1 Xiaoyan Li (MSU: Michigan State University)
#2 Alyssa R. Sanderson (MSU: Michigan State University)
Last: Rebecca H. Lahr (MSU: Michigan State University)
4 authors in total
A low-cost tap water fingerprinting technique was evaluated using the coffee-ring effect, a phenomenon by which tap water droplets leave distinguishable “fingerprint” residue patterns after water evaporates. Tap waters from communities across southern Michigan dried on aluminum and photographed with a cell phone camera and 30× loupe produced unique and reproducible images. A convolutional neural network (CNN) model was trained using the images from the Michigan tap waters, and despite the small ...
#1 Yejin Kim (University of Texas Health Science Center at Houston)
#2 Shuyu Zheng (UH: University of Helsinki), H-Index: 1
Last: Xiaoqian Jiang (University of Texas Health Science Center at Houston)
6 authors in total
Motivation: Exploring an exponentially increasing yet more promising space, high-throughput combinatorial drug screening has advantages in identifying cancer treatment options with higher efficacy without degradation in terms of safety. A key challenge is that the accumulated number of observations in in-vitro drug responses varies greatly among different cancer types, where some tissues (such as bone and prostate) are more understudied than others. Thus, we aim to develop a drug synergy prediction m...