A guide to machine learning for biologists

The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed.

Acknowledgements

The authors thank members of the UCL Bioinformatics Group for valuable discussions and comments. This work was supported by the European Research Council Advanced Grant ProCovar (project ID 695558).