A guide to machine learning for biologists
The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.
Acknowledgements
The authors thank members of the UCL Bioinformatics Group for valuable discussions and comments. This work was supported by the European Research Council Advanced Grant ProCovar (project ID 695558).