Natural Language Processing
A significant amount of information is embedded within clinical notes and other text-based documents. The large volume of text-based data generated and stored every day makes manual analysis intractable. To make this information accessible, the Wright Center provides data extraction via text analyses and natural language processing, including clinical data extraction from clinical notes and reports.
Natural Language Processing (NLP) applies computational methods to both written and spoken language. NLP methods can automatically extract pertinent information from written text, which supports effective use of these data and helps identify patterns and trends that would otherwise be buried in unstructured text.
The informatics team at the Wright Center has experience in implementing NLP workflows and pipelines, including cohort identification, named entity recognition, topic analysis, and publication reporting statistics. Using NLP in research is project-specific, so we work closely with principal investigators and subject matter experts to develop NLP pipelines and applications to advance research at VCU. If you have a project that may benefit from NLP, please contact us for a consultation.
NLP tools developed by the Wright Center informatics team include:
TopEx is an NLP application developed to facilitate the exploration of topics and keywords in a set of texts through a user interface that requires no programming or NLP knowledge. Through use by VCU collaborators, and as a participating system for the 2021 BioCreative VII Challenge Track 4: COVID-19 text mining tool interactive demo, TopEx has been shown to be useful in analyzing reflective medical writing responses from students, tweets, abstracts, discharge summaries, official government communications, grant abstracts and more. User feedback indicates that TopEx is “intuitive and relatively easy to follow” and is useful for “identifying trends in the data” and for obtaining a “quick grasp to understand topics.” TopEx can be accessed online, or installed locally by following the instructions on its GitHub page. For advanced users, TopEx is also available as a Python library.
Chrono is a hybrid rule-based and machine learning application that identifies and normalizes temporal expressions in text. Chrono has been trained on both general domain and clinical texts and ranked first in the 2018 SemEval Task 6 temporal challenge.
PubReporter (under development) is an application designed to summarize MeSH terms associated with a set of publications for reporting purposes.
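To illustrate what identifying and normalizing temporal expressions means in practice, here is a minimal rule-based sketch in Python. It is not Chrono's implementation: the function name `normalize_dates`, the two patterns, and the assumption that numeric dates follow the U.S. month/day/year order are all hypothetical choices made for this example. A real system has to handle far more variety, such as relative expressions, durations, and ambiguous formats.

```python
import re
from datetime import date

# Month names mapped to month numbers (1-12).
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june",
         "july", "august", "september", "october", "november", "december"],
        start=1,
    )
}

# Rule 1: explicit dates such as "March 3, 2018" or "March 3 2018".
EXPLICIT = re.compile(
    r"\b(" + "|".join(MONTHS) + r")\s+(\d{1,2}),?\s+(\d{4})\b",
    re.IGNORECASE,
)

# Rule 2: numeric dates such as "03/10/2018" (assumed month/day/year).
NUMERIC = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def _to_iso(year, month, day):
    """Return an ISO 8601 date string, or None if the date is not valid."""
    try:
        return date(year, month, day).isoformat()
    except ValueError:
        return None


def normalize_dates(text):
    """Return (matched text, ISO date) pairs in order of appearance."""
    found = []
    for match in EXPLICIT.finditer(text):
        month = MONTHS[match.group(1).lower()]
        iso = _to_iso(int(match.group(3)), month, int(match.group(2)))
        if iso is not None:
            found.append((match.start(), match.group(0), iso))
    for match in NUMERIC.finditer(text):
        iso = _to_iso(int(match.group(3)), int(match.group(1)), int(match.group(2)))
        if iso is not None:
            found.append((match.start(), match.group(0), iso))
    found.sort()  # order by position in the text
    return [(matched, iso) for _, matched, iso in found]


note = "Patient admitted on March 3, 2018 and discharged 03/10/2018."
print(normalize_dates(note))
# [('March 3, 2018', '2018-03-03'), ('03/10/2018', '2018-03-10')]
```

Hand-written rules like these are precise but brittle, which is one reason a hybrid approach that adds machine learning can be attractive for expressions that rules alone cannot disambiguate.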
- NLP@VCU – The NLP lab, led by Bridget McInnes, Ph.D., in VCU’s Computer Science Department, is actively developing NLP tools and is part of cross-campus collaborations with the Wright Center and VCU School of Medicine.
- CLAMP – An application developed out of the University of Texas Health with a user interface for drag-and-drop NLP pipeline development. Wright Center team members have experience building pipelines with CLAMP. Academic use/research licenses are free upon request.
Publications
- Olex, A.L., French, E., Burdette, P., Sagiraju, S., Neumann, T., Gal, T.S., McInnes, B.T., Kenneth, C., 2021. TopEx: Topic Exploration of COVID-19 Corpora - Results from the BioCreative VII Challenge Track 4, in: Proceedings of the BioCreative VII Challenge Evaluation Workshop. Presented at the BioCreative VII Workshop, Virtual, pp. 238–242.
- Olex A, DiazGranados D, McInnes BT, and Goldberg S. Local Topic Mining for Reflective Medical Writing. Full Length Paper. AMIA Jt Summits Transl Sci Proc 2020, 459–468. PMCID: PMC7233034.
- Olex A, Maffey L, Morgan N et al. Chrono at SemEval-2018 Task 6: A System for Normalizing Temporal Expressions. Full Length Paper. Proceedings of the 12th International Workshop on Semantic Evaluation. New Orleans, Louisiana: Association for Computational Linguistics, 2018, 97–101. DOI: 10.18653/v1/S18-1012
- Olex A, Maffey L, McInnes B. NLP Whack-A-Mole: Challenges in Cross-Domain Temporal Expression Extraction. Long Paper. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1. Minneapolis, Minnesota: Association for Computational Linguistics, 2019, 3682–3692. DOI: 10.18653/v1/N19-1369
Posters and Oral Presentations
- Kim SJ, Olex AL, Ming HM, French ET, Gal TS. Linguistic Characteristics of COVID-19 Pandemic Control and Mitigation Communications in South Korea. Submitted Sept. 2021 to Scientific Sessions: 43rd Society of Behavioral Medicine Annual Meeting & Scientific Sessions.
- DiazGranados D, Olex AL, Garber A, Santen SA, McInnes BT, and Goldberg S. Utilizing Natural Language Processing to Automate the Identification of Acting Intern Challenges. Oral presentation by Olex, DiazGranados, and Goldberg as a team at the ChangeMedEd 2019 conference in Chicago, IL, Sept 18-21, 2019.
- Olex AL, Gal T, Afshar M, Dligach D, Karnik N, Oakes T, Sharma B, Xie M, McInnes BT, Solway J, Kho A, Cramer WC, and Moeller FG. Untapped Potential of Clinical Text for Opioid Surveillance. Poster presented by Amy Olex at the AMIA 2019 Annual Symposium, Washington DC.