Evaluation of Feature Detection


Figure 17.5 In an evaluation of feature detection, the NLP application and the reference standard independently extract the relevant variable values from the same text. Performance metrics are calculated by comparing the NLP output against that of the reference standard.

Several studies have evaluated how well NLP applications can encode findings and diseases, such as atelectasis, pleural effusions, CHF, stroke, and pneumonia, from radiograph reports (Hripcsak et al., 1995; Friedman et al., 2004; Fiszman et al., 2000; Elkins et al., 2000). The reference standard for these studies was physician encodings of the variables, and the studies showed that the NLP applications performed similarly to physicians. One study (Chapman et al., 2004) evaluated how well the variable fever could be automatically identified in chief complaints and ED reports, compared to a reference standard of physician judgment from the ED report. The application identified fever from chief complaints with 100% sensitivity and 100% specificity, and from ED reports with 98% sensitivity and 89% specificity.
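The sensitivity and specificity figures above come from comparing each NLP output with the reference standard's value for the same report. The following is a minimal sketch of that comparison for a single binary variable such as fever. The function name and the labels are hypothetical, invented only for illustration, and are not data from any of the cited studies.

```python
def sensitivity_specificity(nlp_output, reference):
    """Compare binary NLP outputs against a reference standard.

    Both arguments are equal-length sequences of booleans, one entry
    per report: True means the variable (e.g., fever) is present.
    """
    if len(nlp_output) != len(reference):
        raise ValueError("nlp_output and reference must have the same length")

    pairs = list(zip(nlp_output, reference))
    tp = sum(1 for n, r in pairs if n and r)          # both say present
    fn = sum(1 for n, r in pairs if not n and r)      # NLP missed a true case
    tn = sum(1 for n, r in pairs if not n and not r)  # both say absent
    fp = sum(1 for n, r in pairs if n and not r)      # NLP false alarm

    # Sensitivity: fraction of reference-positive reports the NLP found.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of reference-negative reports the NLP rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels for ten reports (illustration only).
reference = [True, True, True, True, False, False, False, False, False, False]
nlp_output = [True, True, True, False, False, False, False, False, True, False]

sens, spec = sensitivity_specificity(nlp_output, reference)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

In this made-up example the application misses one of four reference-positive reports and raises one false alarm among six reference-negative reports.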

Other studies have evaluated how well NLP technology can classify chief complaints into syndromic categories (e.g., respiratory, gastrointestinal, neurological, rash). Olszewski (2003) evaluated CoCo, a naive Bayesian classifier (Mitchell, 1997) that classifies chief complaints into one of eight syndromic categories. Chapman et al. (2005a) evaluated a chief complaint classifier (MPLUS [Christensen et al., 2002]) that used syntactic and semantic information to classify the chief complaints into syndromic categories. The reference standard for both studies was a physician reading the chief complaints and classifying them into the same syndromic categories. Performance of the NLP applications was measured with the area under the receiver operating characteristic (ROC) curve (AUC). AUCs ranged from 0.80 to 0.97 for CoCo and from 0.95 to 1.0 for MPLUS, suggesting that NLP technology is quite good at classifying chief complaints into syndromes.
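To make the AUC metric concrete, the sketch below computes it for one syndromic category by the pairwise-ranking definition: the probability that a randomly chosen reference-positive chief complaint receives a higher classifier score than a randomly chosen reference-negative one, with ties counted as one half. The function name, scores, and labels are hypothetical. This is not the evaluation code used for CoCo or MPLUS, only an illustration of what the reported numbers measure.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: classifier scores for one syndromic category, one per complaint.
    labels: reference-standard labels (1 = complaint belongs to the category).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical scores for a "respiratory" category (illustration only).
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]

respiratory_auc = auc(scores, labels)
print(f"AUC = {respiratory_auc:.3f}")
```

An AUC of 1.0 means every reference-positive complaint is scored above every reference-negative one, and 0.5 corresponds to chance-level ranking. With eight categories, the calculation would be repeated once per category, which is presumably why the studies report a range of AUCs.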

Studies of feature detection do not make claims about whether the NLP technology can accurately diagnose patients with the target findings, syndromes, or diseases. The conclusions only relate to the application's ability to determine the correct values for the variables given the relevant input text. Once feature detection has been validated, the next step is to apply the technology to the problem of diagnosing the patients and evaluate the technology's accuracy at case detection.
