Medical informatics in drug discovery and development

Jun 24, 2021

Drug discovery and development is a huge domain. The discovery phase covers target and lead discovery, and the development phase ranges from ADME (absorption, distribution, metabolism, excretion) and toxicology studies to clinical trials. Alongside advances in wet-lab techniques, dry-lab methods and software have greatly accelerated the discovery and development process over the past decades. To get a broad understanding of informatics applications in this process, I would like to do a small literature review covering database establishment, drug discovery, and drug development.

Database

The establishment and maintenance of medical databases play a vital role in medical research and collaboration. Released in 2004, PubChem is one of the world’s largest freely accessible online chemical databases, containing compounds, substances and bioassays. The bioassay data are obtained from high-throughput screening (HTS), a method that uses robotics, data-processing software and devices to identify active compounds, antibodies, or genes. Previously, such data were mainly accessible to pharmaceutical companies. It is exciting that these data can now be used by all professionals and students.
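
To give a feel for what "freely accessible" means in practice, PubChem offers a REST interface called PUG REST. The sketch below only builds a request URL for a few compound properties. The helper name `pubchem_property_url` and the choice of aspirin are my own assumptions for illustration, and the actual network request is left commented out.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(name, properties):
    """Build a PUG REST URL that looks up a compound by name.

    Hypothetical helper for illustration; it only assembles the URL.
    """
    props = ",".join(properties)
    return f"{PUG_REST_BASE}/compound/name/{quote(name)}/property/{props}/JSON"


url = pubchem_property_url("aspirin", ["MolecularFormula", "MolecularWeight"])
print(url)

# To fetch the record (requires network access):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     data = json.load(resp)
```

Building the URL separately from the request keeps the example runnable offline and makes it easy to check what is being asked of the service before sending anything.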

Several high-throughput methods in genomics, developed around the year 2000, are known as next-generation sequencing (NGS). They are highly scalable and can sequence an entire genome in a single run. Databases of genomes from humans, animals and bacteria have been established. For example, the GDB Human Genome Database (1) collected human genomic data from the Human Genome Project, an international scientific research project that aimed to determine the entire human genome sequence. The Project ran from 1990 to 2003, and the Database was closed in 2008. More recently, ClinVar, launched in 2013 by the National Center for Biotechnology Information (NCBI), is a public archive of human genetic variants and their associated phenotypes. The data are freely accessible online and can also be accessed programmatically or downloaded for local use (2).

Besides biochemical data, the development of the electronic health record (EHR) has made clinical data storage efficient and reliable. Some of these databases are made openly accessible after de-identification or anonymization. One typical example is MIMIC-III, published online in 2016. MIMIC-III is a freely accessible critical care database comprising clinical data from patients admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts. It includes adult patients (aged 16 years or above) admitted to critical care units between 2001 and 2012 and neonates admitted between 2001 and 2008 (3). Such clinical databases have served as good resources for “unintended” findings through big data analysis and machine learning model predictions. For instance, Pfizer used EHR and medical claims data to build a machine learning model that identifies patients at risk for wild-type transthyretin amyloid cardiomyopathy, a disease that is often underdiagnosed (4).

Informatics in drug discovery

Overview
The drug discovery process can be broken down into a few stages. Table 1 summarizes, for each stage, the work that informatics can assist with, especially through artificial intelligence (AI) (5).

An example of lead discovery driven by deep neural networks (DNN)
Lead discovery has been difficult in antibiotic development, and the challenge is especially concerning given the emergence of antimicrobial resistance. Researchers at the Massachusetts Institute of Technology (MIT) identified halicin as a broad-spectrum antibiotic with low structural similarity to known antibiotics (6). The team first trained a DNN model to identify hits based on their ability to inhibit E. coli growth (80% growth inhibition was used as the cutoff). The training compounds were obtained from a US Food and Drug Administration (FDA)-approved drug library. An ensemble of models was subsequently used to identify potential antibacterial molecules from the Drug Repurposing Hub. Compounds with low prediction scores (the score represents the predicted probability that a compound inhibits E. coli) and compounds with high structural similarity to the training dataset were removed. Finally, halicin was identified as a lead compound with antibacterial activity. Follow-up assays confirmed that halicin has inhibitory activity against Mycobacterium tuberculosis and carbapenem-resistant Enterobacteriaceae. In murine models, halicin effectively treated Clostridioides difficile and pan-resistant Acinetobacter baumannii infections.
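
The filtering step described above (drop low-scoring compounds and compounds that look too much like the training set) can be sketched in a few lines. This is only a toy version under my own assumptions: fingerprints are represented as sets of "on" bits, similarity is the Tanimoto coefficient, and the thresholds, compound names and fingerprints are made up rather than taken from the study (6).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bits."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def select_candidates(candidates, training_fps, min_score=0.5, max_similarity=0.4):
    """Keep compounds with a high predicted score and low similarity to the training set.

    candidates: list of (name, predicted_score, fingerprint) tuples.
    The thresholds are illustrative assumptions, not the values from the study.
    """
    kept = []
    for name, score, fp in candidates:
        # Similarity to the closest training compound.
        nearest = max((tanimoto(fp, t) for t in training_fps), default=0.0)
        if score >= min_score and nearest <= max_similarity:
            kept.append((name, score, nearest))
    # Rank the survivors by predicted score, best first.
    return sorted(kept, key=lambda row: row[1], reverse=True)


training_fps = [{1, 2, 3, 4}, {2, 3, 5, 8}]
candidates = [
    ("cmpd_A", 0.91, {10, 11, 12, 13}),  # high score, novel structure
    ("cmpd_B", 0.88, {1, 2, 3, 4}),      # high score, but identical to a training compound
    ("cmpd_C", 0.12, {20, 21, 22}),      # novel, but low predicted score
]
selected = select_candidates(candidates, training_fps)
print(selected)
```

Only the compound that is both highly scored and structurally novel survives, which is the property that made halicin interesting in the first place.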

Transformation of molecular descriptors and fingerprints as model inputs
One thing that caught my attention when reading the halicin example (6) and a review article (5) is how a machine “learns” molecules. Several informatics tools have been developed to compute molecular descriptors and fingerprints automatically (7), and molecular property prediction has been extensively studied and applied. Two classes of models are typically used: neural networks applied to computed molecular fingerprints or expert-crafted descriptors, and Graph Convolutional Networks (GCN), which learn a molecular representation by operating directly on the graph structure of the molecule (8).
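
To make the graph-based idea a little more concrete, here is a toy sketch of a single graph convolution step on a molecule treated as a graph of atoms and bonds. Everything in it is an assumption for illustration: the molecule (the heavy atoms of ethanol), the one-hot atom features, the arbitrary weight matrix, and the choice of mean aggregation with self-loops. Real models such as the one in (8) are considerably more elaborate.

```python
def graph_conv_layer(adjacency, features, weights):
    """One graph-convolution step.

    Each atom's features are averaged with those of its bonded neighbours,
    then passed through a linear map followed by ReLU.
    """
    n = len(adjacency)
    out = []
    for i in range(n):
        # Neighbourhood of atom i, including the atom itself (self-loop).
        hood = [i] + [j for j in range(n) if adjacency[i][j]]
        # Mean of the neighbourhood's feature vectors.
        mean = [sum(features[j][k] for j in hood) / len(hood)
                for k in range(len(features[0]))]
        # Linear transform followed by ReLU.
        row = [max(0.0, sum(mean[k] * weights[k][m] for k in range(len(mean))))
               for m in range(len(weights[0]))]
        out.append(row)
    return out


# Heavy atoms of ethanol: C0 - C1 - O2
adjacency = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # one-hot [is_carbon, is_oxygen]
weights = [[1.0, -1.0], [0.5, 2.0]]              # arbitrary, untrained weights

hidden = graph_conv_layer(adjacency, features, weights)
# Sum-pool the atom vectors into one molecule-level vector.
molecule_vector = [sum(column) for column in zip(*hidden)]
print(hidden)
print(molecule_vector)
```

The point of the sketch is that the two carbons start with identical features but end up with different hidden vectors, because one of them is bonded to oxygen. That neighbourhood awareness is what a fixed fingerprint has to encode by hand.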

Target discovery — protein structure prediction and druggability
Three-dimensional (3D) protein structure prediction from amino acid sequence has been studied for more than 30 years. A well-known example is AlphaFold, a DNN model developed by Google DeepMind, which won the Critical Assessment of protein Structure Prediction (CASP) contest with unprecedented prediction accuracy. Besides structure prediction, protein dynamics has also been widely investigated with AI models, which is especially important for understanding the desired target. Moreover, druggability prediction can save a lot of resources in lab experiments. The SCREEN (Surface Cavity REcognition and EvaluatioN) web server, one of the earliest machine learning (ML) applications in this area, was built and trained on geometric, structural and physicochemical properties of drug-binding and non-drug-binding cavities on proteins. Such a classifier also improves our understanding of druggability, because it reveals that the size and shape of a protein’s surface cavities are the most critical attributes (5).

Prediction of substance properties: QSAR, physicochemical properties and ADME-T
Finally, the quantitative structure-activity relationship (QSAR), physicochemical properties and ADME-T (ADME plus toxicity) profile of a lead are important for drug efficacy and lead optimization.

In QSAR modeling, molecular descriptors are usually used as inputs. Among classical ML models, random forest is regarded as the gold standard; among DNN models, multitask learning is often preferred over a single-task architecture for its better performance.

Physicochemical properties such as solubility and permeability can also be predicted with GCN. Similarly, in ADME-T prediction, molecular descriptors are often used as model inputs. Capsule networks in particular have been implemented as DNN models for ADME-T prediction (5).

Informatics in drug development

In drug development, which covers the stages of clinical trials, medical informatics has been widely implemented to ease clinical trial design and conduct.

Research Electronic Data Capture (REDCap) is a browser-based, collaborative electronic data capture (EDC) system built by Vanderbilt University (9). EDC systems are commonly used for clinical trial data collection. Key features of REDCap include collaborative access to data across institutions, user authentication and role-based security, electronic case report forms (CRFs), and protocol document storage and sharing. Notably, a CRF contains all the information collected on a subject, including patient baselines, therapeutic outcomes and adverse events. Defining CRFs electronically with metadata ensures durability and scalability.
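
The "metadata-driven" idea is easier to see with a small example: the form is described by a data dictionary, and every entered record is checked against that description. The sketch below is my own simplification and is not REDCap's actual data dictionary format or API. The field names, types and ranges are hypothetical.

```python
# Hypothetical data dictionary describing three CRF fields.
DATA_DICTIONARY = [
    {"field": "subject_id", "type": "text", "required": True},
    {"field": "age_years", "type": "integer", "required": True, "min": 0, "max": 120},
    {"field": "adverse_event", "type": "choice", "required": False,
     "choices": ["none", "mild", "moderate", "severe"]},
]


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Check one CRF record against the data dictionary and return a list of error messages."""
    errors = []
    for meta in dictionary:
        name = meta["field"]
        value = record.get(name)
        if value is None or value == "":
            if meta["required"]:
                errors.append(f"{name}: missing required value")
            continue
        if meta["type"] == "integer":
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                errors.append(f"{name}: expected an integer")
            elif not (meta["min"] <= value <= meta["max"]):
                errors.append(f"{name}: {value} outside {meta['min']}-{meta['max']}")
        elif meta["type"] == "choice" and value not in meta["choices"]:
            errors.append(f"{name}: {value!r} is not an allowed choice")
    return errors


print(validate_record({"subject_id": "S001", "age_years": 54, "adverse_event": "mild"}))
print(validate_record({"subject_id": "S002", "age_years": 150}))
```

Because the rules live in the dictionary rather than in the code, adding a field or tightening a range means editing metadata only, which is where the durability and scalability come from.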

One merit of AI in clinical trials is that it facilitates subject recruitment. Recruitment is often the most time-consuming and expensive step of a trial. Moreover, time or resource constraints can easily lead to a biased sample of trial subjects. Natural language processing (NLP), an AI technique for processing written and spoken language, can be applied to screen physician notes and pathology reports, automatically identifying eligible people from these clinical documents (10). However ideal an AI application may look, challenges remain, such as the heterogeneity and unstructured format of free text. More work is needed to fully realize AI in clinical research informatics.
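
To illustrate both the idea and its difficulty, here is a deliberately naive, rule-based sketch of screening free-text notes against inclusion and exclusion terms. It is not how production clinical NLP works. The negation cues, the notes and the eligibility criteria are all made up for illustration.

```python
import re

# A few negation cues; real clinical text needs far more careful handling.
NEGATION = re.compile(r"\b(no|denies|without|negative for)\b", re.IGNORECASE)


def mentions(note, term):
    """True if `term` appears in some sentence without a negation cue before it."""
    for sentence in re.split(r"[.\n]+", note):
        match = re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE)
        if match and not NEGATION.search(sentence[:match.start()]):
            return True
    return False


def is_eligible(note, required_terms, excluded_terms):
    """Eligible if every required term is affirmed and no excluded term is affirmed."""
    return (all(mentions(note, t) for t in required_terms)
            and not any(mentions(note, t) for t in excluded_terms))


notes = {
    "pt_01": "History of heart failure. Denies renal failure.",
    "pt_02": "Heart failure diagnosed in 2019. Chronic renal failure on dialysis.",
    "pt_03": "No evidence of heart failure.",
}
eligible = [pid for pid, text in notes.items()
            if is_eligible(text, ["heart failure"], ["renal failure"])]
print(eligible)
```

Even this tiny example needs negation handling to avoid enrolling the wrong patient, and it would still fail on abbreviations, misspellings or a phrase like "family history of". That is the heterogeneity problem in miniature.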

Final thoughts

Most of my experience from lectures and lab meetings lies in healthcare informatics, so it is very interesting to also learn about informatics and AI applications in the biomedical and chemical disciplines. Some models that I had never heard of in healthcare informatics, such as Graph Convolutional Networks (GCN) and capsule networks, are commonly used in drug discovery. I am very excited to see more advances in AI applications in life sciences and healthcare services.

References

1. Letovsky SI, Cottingham RW, Porter CJ, Li PW. GDB: the Human Genome Database. Nucleic Acids Res. 1998;26(1):94–9.
2. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 2016;44(D1):D862–D8.
3. Johnson AE, Pollard TJ, Shen L, Lehman LW, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035.
4. Huda A, Castano A, Niyogi A, Schumacher J, Stewart M, Bruno M, et al. A machine learning model for identifying patients at risk for wild-type transthyretin amyloid cardiomyopathy. Nat Commun. 2021;12(1):2725.
5. Vatansever S, Schlessinger A, Wacker D, Kaniskan HU, Jin J, Zhou MM, et al. Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State-of-the-arts and future directions. Med Res Rev. 2021;41(3):1427–73.
6. Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, et al. A Deep Learning Approach to Antibiotic Discovery. Cell. 2020;180(4):688–702.e13.
7. Dong J, Cao DS, Miao HY, Liu S, Deng BC, Yun YH, et al. ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. J Cheminform. 2015;7:60.
8. Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, et al. Analyzing Learned Molecular Representations for Property Prediction. J Chem Inf Model. 2019;59(8):3370–88.
9. Harris PA, Taylor R, Thielke R, Payne J, Gonzalez N, Conde JG. Research electronic data capture (REDCap) — a metadata-driven methodology and workflow process for providing translational research informatics support. J Biomed Inform. 2009;42(2):377–81.
10. Woo M. An AI boost for clinical trials. Nature. 2019;573.