Natural language processing techniques are far from perfect. However, the question is not whether the techniques perform perfectly but whether the performance is good enough to contribute to disease and outbreak detection. For instance, a few errors in part-of-speech tagging or negation identification may not substantially decrease the ability of an NLP application to determine whether a patient has a fever. Evaluation studies of NLP in biosurveillance are still young, but we have learned a few things about how variables extracted from free-text medical records with NLP can contribute to outbreak detection.

First, we have learned that automated classification of free-text chief complaints, while not perfect, is sufficient to detect one-third to two-thirds of positive syndromic cases. Moreover, chief complaints for pediatric patients are accurate and timely at detecting respiratory and gastrointestinal outbreaks. Second, we have learned that ED reports can provide more detailed information about the state of a patient than chief complaints. For example, we can detect 40% more patients with fever from ED reports than from chief complaints (Chapman et al., 2004). Third, several researchers have shown that identification of radiological variables required for detection of many public health threats (including SARS and inhalational anthrax) from chest radiograph reports is feasible with NLP techniques.

This chapter has focused on applying NLP techniques to variable extraction from patient medical records, but other types of free-text documents contain information that may be useful for biosurveillance, including web queries, transcripts from call centers, and autopsy reports. Regardless of the type of free-text data, we suggest three questions to consider when deciding whether application of NLP techniques to textual data is feasible for disease and outbreak detection: (1) How complex is the text? The simple phrases in chief complaints are much easier to understand than the complex discourse contained in ED reports. Textual data that require coreference resolution, domain modeling for inference, and other more difficult techniques to identify values for the variables of interest will be more challenging to process and more prone to error. (2) What is the goal of the NLP technique? If the goal is to understand all temporal, anatomic, and diagnostic relations described in the text as well as a physician could, you may be in for a lifetime of hard but interesting work. Extraction of a single variable, such as fever, or encoding temporal, anatomic, and diagnostic relations for a finite set of findings, such as all respiratory findings, is more feasible. (3) Can the detection algorithms that will use the variables extracted with NLP handle noise? Detecting small outbreaks requires more accuracy in the input variables.

Figure 17.7 Time series plot of chief complaint syndromic classifications against ICD-9 discharge diagnoses for admissions of patients with respiratory illnesses including pneumonia, influenza, and bronchiolitis.

As an extreme example, some diseases such as inhalational anthrax require only a single case to be considered a threatening outbreak. If the NLP-based expert system did not correctly detect that case, then the detection system would have failed. However, in detecting an outbreak of a gastrointestinal illness, for example, if the NLP-based expert system only detected two-thirds of the true cases, there may still be enough positive patients to detect a moderate to large-sized outbreak. In addition, the consistent stream of false positive cases identified by the NLP-based expert system would comprise a noisy baseline that may not prevent the algorithm from detecting a significant increase in gastrointestinal cases but would require a larger increase to detect the outbreak. Consideration of these three questions can help determine the feasibility of using NLP for outbreak and disease surveillance.
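The trade-off between missed cases, false positives, and outbreak size can be made concrete with a back-of-the-envelope calculation. The following is a minimal sketch, not a method taken from the studies discussed in this chapter. It assumes that daily case counts are roughly Poisson distributed and that an alarm is raised when the observed count exceeds its baseline mean by `z` standard deviations. The function name, the alarm rule, and the example numbers are all hypothetical.

```python
import math


def min_detectable_excess(baseline_true, sensitivity, false_pos_per_day, z=3.0):
    """Smallest number of extra TRUE cases per day whose expected observed
    count reaches a z-sigma alarm threshold.

    Assumptions (illustrative only): daily counts are Poisson, so the
    standard deviation of the observed baseline is the square root of its
    mean; the NLP classifier finds a fixed fraction (sensitivity) of true
    cases and adds a steady stream of false positives.
    """
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    # What the detection algorithm actually sees on a normal day.
    observed_baseline = sensitivity * baseline_true + false_pos_per_day
    # Observed excess needed to cross the alarm threshold.
    observed_excess_needed = z * math.sqrt(observed_baseline)
    # Only a fraction of true outbreak cases are observed, so scale up.
    return math.ceil(observed_excess_needed / sensitivity)


# Hypothetical numbers: 20 true gastrointestinal cases on a normal day.
perfect = min_detectable_excess(20, sensitivity=1.0, false_pos_per_day=0)
noisy = min_detectable_excess(20, sensitivity=2 / 3, false_pos_per_day=10)
print(perfect, noisy)  # prints "14 22"
```

Under these assumed numbers, a classifier that finds two-thirds of true cases and adds ten false positives per day needs about 22 extra true cases per day to raise an alarm, compared with about 14 for a perfect classifier. The outbreak is still detectable, but only once it is larger. For a disease in which a single case matters, there is no such slack.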

NLP techniques can be applied to determine the values of predefined variables that may be useful in detecting outbreaks. The linguistic structure of the textual data being processed and the nature of the variables being used for surveillance determine the feasibility of applying NLP techniques to the problem. Characteristics such as linguistic variation, polysemy, negation, contextual information, finding validation, implication, and coreference must be accounted for to understand the information within patient medical reports as well as a physician does. However, because many of the variables helpful in biosurveillance do not require complete understanding of the text, NLP techniques may successfully extract variables useful for outbreak detection. In fact, evaluations of NLP techniques for feature detection, case detection, and epidemic detection have begun to demonstrate their utility in this new field. More research in NLP techniques and more evaluation studies of the effectiveness of NLP will not only increase our understanding of how to extract information from text but will also help us continue to learn what types of data provide the most timely and accurate information for detecting outbreaks.
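To illustrate how a single variable can be extracted without full understanding of the text, the sketch below flags fever using keyword matching and a simple negation check. It is a deliberately simplified heuristic, not one of the systems evaluated in this chapter. The term lists, the five-word negation window, and the function name `has_fever` are assumptions made for illustration.

```python
import re

# Assumed, deliberately small vocabularies (illustrative only).
FEVER_TERMS = {"fever", "fevers", "febrile", "pyrexia"}
NEGATION_TRIGGERS = ("no", "not", "denies", "without", "negative for")
WINDOW = 5  # number of preceding words searched for a negation trigger


def has_fever(text):
    """Return True if any sentence mentions fever without a preceding
    negation trigger inside the same sentence."""
    # Split crudely into sentences so negation does not cross boundaries.
    for sentence in re.split(r"[.;!?\n]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token not in FEVER_TERMS:
                continue
            preceding = " ".join(tokens[max(0, i - WINDOW):i])
            negated = any(
                re.search(rf"\b{re.escape(trigger)}\b", preceding)
                for trigger in NEGATION_TRIGGERS
            )
            if not negated:
                return True
    return False


print(has_fever("Patient presents with fever and cough"))  # True
print(has_fever("Denies fever or chills"))                 # False
print(has_fever("No cough. Fever for 3 days."))            # True
```

A heuristic like this will make mistakes. For example, "no cough, fever for 3 days" would be read as negated because the trigger and the finding share a sentence. Whether such errors matter depends on the three questions raised above, in particular on how much noise the downstream detection algorithm can tolerate.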
