Russ Altman
Profile Url: russ-altman
Researcher at Stanford University
Journal of Biomedical Informatics, 2019-10-16
Social media has been identified as a promising potential source of information for pharmacovigilance. The adoption of social media data has been hindered by the massive and noisy nature of the data. Initial attempts to use social media data have relied on exact text matches to drugs of interest, and therefore suffer from the gap between formal drug lexicons and the informal nature of social media. The Reddit comment archive represents an ideal corpus for bridging this gap. We trained a word embedding model, RedMed, to facilitate the identification and retrieval of health entities from Reddit data. We compare the performance of our model trained on a consumer-generated corpus against publicly available models trained on expert-generated corpora. Our automated classification pipeline achieves an accuracy of 0.88 and a specificity of >0.9 across four different term classes. Of all drug mentions, an average of 79% (±0.5%) were exact matches to a generic or trademark drug name, 14% (±0.5%) were misspellings, 6.4% (±0.3%) were synonyms, and 0.13% (±0.05%) were pill marks. We find that our system captures an additional 20% of mentions; these would have been missed by approaches that rely solely on exact string matches. We provide a lexicon of misspellings and synonyms for 2,978 drugs and a word embedding model trained on a health-oriented subset of Reddit.
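The retrieval idea in the abstract above, finding informal variants of a drug name by their proximity in an embedding space, can be sketched as follows. This is a minimal illustration, not the RedMed pipeline: the three-dimensional vectors are invented toy values, and the names `TOY_VECTORS`, `cosine`, and `nearest_terms` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. A trained
# embedding model would supply learned vectors with many more dimensions.
TOY_VECTORS = {
    "ibuprofen": [0.90, 0.10, 0.05],  # generic drug name
    "ibuprofin": [0.88, 0.12, 0.07],  # misspelling
    "advil": [0.80, 0.25, 0.10],      # trademark name
    "pizza": [0.05, 0.10, 0.95],      # unrelated term
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_terms(query, vectors, top_k=2):
    """Return the top_k terms most similar to `query`, excluding itself."""
    q = vectors[query]
    scored = [
        (cosine(q, vec), term)
        for term, vec in vectors.items()
        if term != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [term for _, term in scored[:top_k]]


candidates = nearest_terms("ibuprofen", TOY_VECTORS)
print(candidates)
```

An exact string match on "ibuprofen" would miss both returned candidates, which is the gap the abstract describes. In practice the candidates would still need to be classified (misspelling, synonym, and so on) before entering a lexicon.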
Recent molecular dynamics (MD) simulations of the catalytic domain of the c-Src kinase revealed intermediate conformations with a potentially druggable allosteric pocket adjacent to the C-helix, bound by 8-anilino-1-naphthalene sulfonate. Towards confirming the existence of this pocket, we have developed a novel lead enrichment protocol using new target and lead enrichment software to identify sixteen allosteric lead ligands of the c-Src kinase. First, Markov State Model (MSM) analysis was used to identify the most statistically significant c-Src target conformations from all MD-simulated conformations. The most statistically relevant candidate MSM targets were then prioritized by assessing how well each reproduced binding poses of ligands specific to the ATP-competitive and allosteric pockets. The top-performing MSM targets, identified by receiver operating characteristic curve analysis, were then used to screen the ZINC library of 13 million "clean, drug-like ligands", all of which were prioritized based on their empirical scoring function, binding pose consistency across MSM targets, and strong hydrogen bonding and hydrophobic interactions with Src residues. The FragFEATURE knowledgebase of fragment-protein pocket interactions was then used to identify fragments specific to the ATP-competitive and allosteric pockets. This information was used to identify seven Type II and nine Type III lead ligands with binding poses supported by fragment predictions. Of these, Type II lead ligands, ZINC13037947 and ZINC09672647, and Type III lead ligands, ZINC12530852 and ZINC30012975, exhibited the most favorable fragment profiles and are recommended for further experimental testing for the existence of the allosteric pocket in Src.
Massively accumulated pharmacogenomics, chemogenomics, and side effect datasets offer an unprecedented opportunity for drug response prediction, drug target identification and drug side effect prediction. Existing computational approaches limit their scope to only one of these three tasks, inevitably overlooking the rich connection among them. Here, we propose DrugOrchestra, a deep multi-task learning framework that jointly predicts drug response, targets and side effects. DrugOrchestra leverages pre-trained molecular structure-based drug representation to bridge these three tasks. Instead of directly fine-tuning on an individual task, DrugOrchestra uses deep multi-task learning to obtain a phenotype-based drug representation by simultaneously fine-tuning on drug response, target and side effect prediction. By coupling these three tasks together, DrugOrchestra is able to make predictions for unseen drugs by only knowing their molecular structures. We constructed a heterogeneous drug discovery dataset of over 21k drugs by integrating 8 datasets across three tasks. Our method obtained significant improvements in comparison to methods that were trained on a single task or a single dataset. We further revealed the transferability across 8 datasets and 3 tasks, providing novel insights for understanding drug mechanisms.
Journal of the American Medical Informatics Association, 2020-05-02
Non-small cell lung cancer is a leading cause of cancer death worldwide, and histopathological evaluation plays the primary role in its diagnosis. However, the morphological patterns associated with the molecular subtypes have not been systematically studied. To bridge this gap, we developed a quantitative histopathology analytic framework to identify the gene expression subtypes of non-small cell lung cancer objectively. We processed whole-slide histopathology images of lung adenocarcinoma (n=427) and lung squamous cell carcinoma patients (n=457) in The Cancer Genome Atlas. To establish neural networks for quantitative image analyses, we first built convolutional neural network models that distinguished tumor regions from adjacent dense benign tissues (area under the receiver operating characteristic curve (AUC) > 0.935) and recapitulated expert pathologists' diagnoses (AUC > 0.88), with the results validated in an independent cohort (n=125; AUC > 0.85). We further demonstrated that quantitative histopathology morphology features identified the major transcriptomic subtypes of both adenocarcinoma and squamous cell carcinoma (P < 0.01). Our study is the first to classify the transcriptomic subtypes of non-small cell lung cancer using fully automated machine learning methods. Our approach does not rely on prior pathology knowledge and can discover novel clinically relevant histopathology patterns objectively. The developed procedure is generalizable to other tumor types or diseases.
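Since the results above are reported as AUC values, a short sketch of how that metric can be computed may help. The function below uses the rank-based (Mann-Whitney) definition: the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are made-up toy values, not data from the study, and the name `auroc` is hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: iterable of 0/1 ground-truth values
    scores: iterable of classifier scores, higher meaning more positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Toy example: three tumor tiles (1) and three benign tiles (0).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = auroc(toy_labels, toy_scores)
print(round(toy_auc, 3))
```

The quadratic pairwise loop is chosen for readability. For large evaluation sets a sort-based implementation gives the same value more efficiently.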
One in ten people are affected by rare diseases, and three out of ten children with rare diseases will not live past age five. However, the small market size of individual rare diseases, combined with the time and capital requirements of pharmaceutical R&D, has hindered the development of new drugs for these cases. A promising alternative is drug repurposing, whereby existing FDA-approved drugs might be used to treat diseases different from their original indications. In order to generate drug repurposing hypotheses in a systematic and comprehensive fashion, it is essential to integrate information from across the literature of pharmacology, genetics, and pathology. To this end, we leverage a newly developed knowledge graph, the Global Network of Biomedical Relationships (GNBR). GNBR is a large, heterogeneous knowledge graph comprising drug, disease, and gene (or protein) entities linked by a small set of semantic themes derived from the abstracts of biomedical literature. We apply a knowledge graph embedding method that explicitly models the uncertainty associated with literature-derived relationships and uses link prediction to generate drug repurposing hypotheses. This approach achieves high performance on a gold-standard test set of known drug indications (AUROC = 0.89) and is capable of generating novel repurposing hypotheses, which we independently validate using external literature sources and protein interaction networks. Finally, we demonstrate the ability of our model to produce explanations of its predictions.
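To make the link-prediction step concrete, here is a minimal sketch that ranks candidate diseases for a drug by embedding distance. It substitutes a simple translational scoring rule (head + relation should land near tail) for the uncertainty-aware embedding method the abstract describes, so it illustrates the general idea only. All entity names and vectors are invented placeholders, and `rank_tails` is a hypothetical helper.

```python
import math

# Placeholder 2-dimensional embeddings, invented for illustration.
ENTITY_VECS = {
    "drug_x": [0.2, 0.1],
    "disease_a": [0.7, 0.4],
    "disease_b": [0.1, 0.9],
    "disease_c": [0.6, 0.5],
}
RELATION_VECS = {
    "treats": [0.5, 0.3],
}


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_tails(head, relation, candidates):
    """Rank candidate tail entities; smaller distance means more plausible."""
    h = ENTITY_VECS[head]
    r = RELATION_VECS[relation]
    target = [a + b for a, b in zip(h, r)]  # translated head vector
    return sorted(candidates, key=lambda t: distance(target, ENTITY_VECS[t]))


ranked = rank_tails("drug_x", "treats", ["disease_a", "disease_b", "disease_c"])
print(ranked)
```

Candidates near the top of such a ranking that are not already known indications would be the repurposing hypotheses to check against external evidence.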
Objective: To determine whether clinicians will use machine learned clinical order recommender systems for electronic order entry for simulated inpatient cases, and whether such recommendations impact the clinical appropriateness of the orders being placed. Materials and Methods: 43 physicians used a clinical order entry interface for five simulated medical cases, with each physician-case randomized whether to have access to a previously-developed clinical order recommendation system. A panel of clinicians determined whether orders placed were clinically appropriate. The primary outcome was the difference in clinical appropriateness scores of orders for cases randomized to the recommender system. Secondary outcomes included usage metrics and physician opinions. Results: Clinical appropriateness scores for orders were comparable for cases randomized to the recommender system (mean difference -0.1 order per score, 95% CI: [-0.4, 0.2]). Physicians using the recommender placed more orders (mean 17.3 vs. 15.7 orders; incidence ratio 1.09, 95% CI: [1.01, 1.17]). Case times were comparable with the recommender system. Order suggestions generated from the recommender system were more likely to match physician needs than standard manual search options. Approximately 95% of participants agreed the system would be useful for their workflows. Discussion: Machine-learned clinical order options can meet physician needs better than standard manual search systems. This may increase the number of clinical orders placed per case, while still resulting in similar overall clinically appropriate choices. Conclusions: Clinicians can use and accept machine learned clinical order recommendations integrated into an electronic order entry interface. The clinical appropriateness of orders entered was comparable even when supported by automated recommendations.