Statistical text classification techniques use the frequency distribution of words to automatically classify a set of documents or text fragments into one of a discrete set of predefined categories (Mitchell, 1997). For example, a text classification application may classify MEDLINE abstracts into one of many possible MeSH categories or may classify websites by topic. Various statistical models have been applied to the problem of text classification, including regression models, Bayesian belief networks, nearest neighbor algorithms, neural networks, decision trees, and support vector machines. The basic element in all of these algorithms is the frequency distribution of the words in the text.

Applications of text classification to free-text patient medical records include retrieving records of interest to a specific research query (Aronis et al., 1999; Cooper et al., 1998), assigning ICD-9 admission diagnoses to chief complaints (Gundersen et al., 1996), and retrieving medical images with specific abnormalities (Hersh et al., 2001).

In the domain of biosurveillance, text classification techniques have been applied to triage chief complaints and chest radiograph reports. CoCo (Olszewski, 2003) is a naive Bayesian text classification application that classifies free-text triage chief complaints into syndromic categories, such as respiratory, gastrointestinal, or neurological, based on the frequency distribution of the words in the chief complaints. For example, the chief complaint "cough" would be assigned a higher probability of being respiratory than of being gastrointestinal or neurological, because chief complaints in the training corpus that contained the word "cough" were classified most frequently as respiratory. The IPS system (Aronis et al., 1999; Cooper et al., 1998) was used to create a query for retrieving chest radiograph reports describing mediastinal findings consistent with inhalational anthrax (Chapman et al., 2003).
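The naive Bayesian scoring described above can be sketched as follows. This is a minimal illustration with a hypothetical training corpus, not CoCo's actual implementation: each class receives a score combining its prior with smoothed per-class word likelihoods, and the highest-scoring syndrome wins.

```python
from collections import Counter, defaultdict
import math

# Hypothetical training corpus of (chief complaint, syndrome) pairs.
TRAIN = [
    ("cough and fever", "respiratory"),
    ("shortness of breath cough", "respiratory"),
    ("nausea and vomiting", "gastrointestinal"),
    ("abdominal pain diarrhea", "gastrointestinal"),
    ("headache and dizziness", "neurological"),
    ("numbness in left arm", "neurological"),
]

def train(corpus):
    """Count class priors and per-class word frequencies."""
    priors = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in corpus:
        priors[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Return the syndrome with the highest posterior score
    (log-space naive Bayes with Laplace smoothing)."""
    total = sum(priors.values())
    scores = {}
    for label in priors:
        n = sum(word_counts[label].values())
        score = math.log(priors[label] / total)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

priors, wc, vocab = train(TRAIN)
print(classify("cough", priors, wc, vocab))  # respiratory
```

Because "cough" appears only in complaints labeled respiratory, that class's smoothed likelihood dominates, mirroring the behavior described in the text.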
The IPS system uses likelihood ratios to identify words that discriminate between relevant and not relevant documents.
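A word-level likelihood ratio of this kind can be sketched as below. The report snippets and the smoothing constant are hypothetical, and this is a simplified illustration of the general idea, not the IPS system itself: words with a ratio well above 1 are more characteristic of relevant documents.

```python
from collections import Counter

# Hypothetical report snippets labeled relevant / not relevant to a query.
RELEVANT = ["widened mediastinum noted", "mediastinal widening and effusion"]
NOT_RELEVANT = ["clear lungs no infiltrate", "no acute findings lungs clear"]

def word_counts(docs):
    """Word frequencies and total word count over a document set."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return counts, sum(counts.values())

def likelihood_ratio(word, rel_docs, nonrel_docs, smooth=0.5):
    """P(word | relevant) / P(word | not relevant), with a small
    smoothing constant so unseen words do not yield division by zero."""
    rc, rt = word_counts(rel_docs)
    nc, nt = word_counts(nonrel_docs)
    p_rel = (rc[word] + smooth) / (rt + smooth)
    p_non = (nc[word] + smooth) / (nt + smooth)
    return p_rel / p_non

# "mediastinum" discriminates in favor of relevant reports (ratio > 1),
# while "lungs" points toward the not-relevant set (ratio < 1).
print(likelihood_ratio("mediastinum", RELEVANT, NOT_RELEVANT))
print(likelihood_ratio("lungs", RELEVANT, NOT_RELEVANT))
```

Ranking words by this ratio surfaces terms like "mediastinum" as strong positive indicators for a query about mediastinal findings.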
Statistical NLP techniques have also been applied to the problem of biomedical polysemy. Given a word or phrase with multiple meanings, the statistical distribution of the neighboring words in the document can help disambiguate the intended meaning or sense of the word. As an example, consider the word "discharge," which has two word senses: a procedure for being released from the hospital (Disch1) and a substance emitted from the body (Disch2). If we applied a statistical learning technique to text containing the word "discharge," we might learn that Disch1 occurs significantly more often with the neighboring words "prescription," "upon," "home," "today," and "instructions," and that Disch2 occurs more often with the words "purulent," "rashes," "swelling," and "wound."
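A simple version of this idea can be sketched as follows. The sense-tagged snippets are hypothetical; each sense gets a profile of co-occurring words, and a new occurrence of "discharge" is assigned the sense whose profile best overlaps its context.

```python
from collections import Counter

# Hypothetical sense-tagged snippets containing the ambiguous word.
SENSE_EXAMPLES = {
    "Disch1": ["discharge home today with prescription",
               "instructions given upon discharge"],
    "Disch2": ["purulent discharge from the wound",
               "swelling and discharge noted"],
}

def context_counts(snippets):
    """Frequencies of words co-occurring with 'discharge'."""
    counts = Counter()
    for s in snippets:
        for w in s.lower().split():
            if w != "discharge":
                counts[w] += 1
    return counts

PROFILES = {sense: context_counts(snips)
            for sense, snips in SENSE_EXAMPLES.items()}

def disambiguate(sentence):
    """Pick the sense whose context profile best matches the sentence."""
    words = [w for w in sentence.lower().split() if w != "discharge"]
    scores = {sense: sum(profile[w] for w in words)
              for sense, profile in PROFILES.items()}
    return max(scores, key=scores.get)

print(disambiguate("patient ready for discharge home"))  # Disch1
```

Here "home" appears only in Disch1 contexts, so the hospital-release sense wins; a context containing "purulent" or "wound" would instead select Disch2.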
Beyond text classification, statistical techniques can be used for complex NLP tasks. For instance, Taira and Soderland (1999) developed an NLP system for radiology reports that uses mainly statistical techniques to encode detailed information about radiology findings and diseases, including the finding itself, whether it was present or absent, and its anatomic location.
Because patient medical reports are complex, purely statistical techniques that rely only on words and their frequencies are less common than hybrid or purely symbolic techniques, which leverage knowledge about the structure or meaning of the words in the text to classify, extract, or encode information in clinical documents.