Case Detection

The question being addressed when measuring the case detection ability of an NLP application for the domain of biosurveillance is: How well does the NLP application identify relevant patients from textual data? For our SARS detector, examples of case detection evaluations include how well the NLP application can determine whether a patient has a respiratory syndrome, whether a patient has a fever, whether a patient has radiological evidence of pneumonia, or whether a patient has SARS.

Figure 17.6 illustrates the evaluation process for a study on case detection. The reference standard is generated by expert diagnosis of the patients. The source of the expert diagnosis depends on the finding, syndrome, or disease being diagnosed, and may comprise review of textual patient reports or complete medical records, results of laboratory tests, autopsy results, and so on.

Figure 17.6 In an evaluation of case detection, the NLP application extracts relevant variable values from text, which may or may not be combined with variables from other sources (dashed box) to diagnose patients. The reference standard reviews test cases independently and generates a reference diagnosis. Performance metrics are calculated by comparing the diagnoses generated in part or in whole by the NLP application against those of the reference standard.
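The comparison step in Figure 17.6 reduces to tallying agreement between the NLP-derived diagnoses and the reference diagnoses. Below is a minimal sketch of that tally, assuming both sets of diagnoses are available as parallel lists of booleans with one entry per test case; the function and variable names are illustrative, not from any of the studies discussed here.

```python
def case_detection_metrics(nlp_diagnoses, reference_diagnoses):
    """Compare NLP-based diagnoses against a reference standard,
    returning (sensitivity, specificity)."""
    tp = fp = tn = fn = 0
    for predicted, actual in zip(nlp_diagnoses, reference_diagnoses):
        if predicted and actual:
            tp += 1      # true positive: both call the patient a case
        elif predicted and not actual:
            fp += 1      # false positive: NLP-based diagnosis only
        elif not predicted and actual:
            fn += 1      # false negative: reference standard only
        else:
            tn += 1      # true negative: both call the patient a non-case
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity

# Toy example with five test cases (invented data):
nlp = [True, False, True, False, False]
ref = [True, True, True, False, False]
print(case_detection_metrics(nlp, ref))  # (0.666..., 1.0)
```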

One of the first case detection studies involving an NLP-based system evaluated the ability of a computerized protocol to detect patients suspicious for tuberculosis (TB) using data stored in electronic medical records (Hripcsak et al., 1997; Knirsch et al., 1998). In a prospective study, the system correctly identified 30 of 43 patients with TB. The computerized system also identified four TB-positive patients not identified by clinicians. Aronsky et al. (2001) showed that a Bayesian network for diagnosing patients with pneumonia performed significantly better with information from the chest radiograph encoded by an NLP system than it did without that information (AUC 0.88 without NLP vs. 0.92 with NLP).

Several studies have evaluated how well automatically classified chief complaints can place patients into syndromic categories (Espino and Wagner, 2001; Ivanov et al., 2002; Beitel et al., 2004; Chapman et al., 2005b; Gesteland et al., 2004). The studies used either ICD-9 discharge diagnoses or physician judgment from medical record review as the reference standard for the syndromic categories. The majority of the studies have focused on the more prevalent syndromes (respiratory and gastrointestinal), but a few have evaluated classification into more rarely occurring syndromes, such as hemorrhagic and botulinic. Results suggest that, in spite of the limited nature of chief complaints, syndromic surveillance from free-text chief complaints can classify patients into most syndromic categories with sensitivities between 40% and 77%.
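To make the classification task concrete, the sketch below assigns a free-text chief complaint to syndromic categories by simple keyword lookup. This is purely illustrative: the systems evaluated in the cited studies used their own classification methods, and the keyword lists here are hypothetical.

```python
# Hypothetical keyword lists; real systems are far more extensive and
# must handle negation, misspellings, and abbreviations.
SYNDROME_KEYWORDS = {
    "respiratory": {"cough", "shortness of breath", "wheezing"},
    "gastrointestinal": {"vomiting", "diarrhea", "nausea"},
    "hemorrhagic": {"bleeding", "hematemesis"},
}

def classify_chief_complaint(text):
    """Return the syndromic categories whose keywords appear in the text."""
    text = text.lower()
    return {syndrome
            for syndrome, keywords in SYNDROME_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)}

print(classify_chief_complaint("Cough and vomiting x2 days"))
# {'respiratory', 'gastrointestinal'} (set order may vary)
```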

In the section on feature detection, we described a study that evaluated the ability of an NLP application to determine whether chief complaints and ED reports described fever (Chapman et al., 2004). The fever study also measured the case detection accuracy of fever diagnosis from chief complaints and ED reports. The NLP application for identifying fever in chief complaints performed with perfect sensitivity and specificity in the feature detection evaluation. However, when we quantified how well the fever variable extracted automatically from chief complaints identified patients who truly had a fever according to reference standard judgment from the ED report, the chief complaint fever detector performed with a sensitivity of only 61%. The specificity remained at 100%. On the one hand, whenever a chief complaint mentioned fever, the patient actually had a fever, so there were no false-positive diagnoses from chief complaints. On the other hand, even though the NLP technology made no mistakes in determining whether fever was described in a chief complaint, the chief complaints themselves did not always mention fever when the patient was febrile in the ED. As this study demonstrates, coupling evaluations of feature detection with evaluations of case detection can reveal the source of diagnostic errors, which could be the NLP technology, the input data itself, or a combination of the two.
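The distinction between the two error sources can be shown with a small numeric example. In the sketch below, invented records pair each chief complaint with two reference judgments: whether the text itself mentions fever (the feature-level reference) and whether the patient was truly febrile per ED report review (the case-level reference). A fever detector that is perfect at the feature level still misses the febrile patient whose chief complaint omits fever, so its case-level sensitivity falls below 100%.

```python
# Invented records: (chief complaint, text mentions fever, patient febrile).
records = [
    ("fever and cough", True,  True),
    ("cough x3 days",   False, True),   # febrile, but the text omits fever
    ("abdominal pain",  False, False),
    ("fever",           True,  True),
    ("ankle injury",    False, False),
]

def mentions_fever(chief_complaint):
    """Stand-in for the NLP fever detector: flags the word 'fever'."""
    return "fever" in chief_complaint.lower()

# Feature detection: does the NLP output match what the text says?
feature_correct = sum(mentions_fever(cc) == mentioned
                      for cc, mentioned, _ in records)
print(f"Feature-level accuracy: {feature_correct}/{len(records)}")  # 5/5

# Case detection: does the NLP output identify the truly febrile patients?
febrile = [cc for cc, _, is_febrile in records if is_febrile]
detected = sum(mentions_fever(cc) for cc in febrile)
print(f"Case-level sensitivity: {detected}/{len(febrile)}")  # 2/3
```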
