The first type of NLP evaluation should measure the application's ability to detect features from text. The question being addressed when quantifying the performance of feature detection for the domain of biosurveillance is: How well does the NLP application determine the values of the variables of interest from text? For our SARS detector, examples of feature detection evaluations include how well the NLP application can determine whether a patient has a respiratory-related chief complaint, whether an ED report describes fever in a patient, or whether a patient has radiological evidence of pneumonia in a radiograph report.
Figure 17.5 illustrates the evaluation process for feature detection. Studies of feature detection do not evaluate the truth of the feature in relation to the patient (that is, whether the patient actually had the finding of interest) but only evaluate how well the technique interpreted the text in relation to the feature. Therefore, the reference standard for an evaluation of feature detection is generated by experts who read the same text processed by the NLP application and assign values to the same variables. If the reference standard and the NLP application both believe the chest radiograph report describes the possibility of pneumonia, the NLP system is considered correct, even if the patient turned out not to have pneumonia.
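The comparison described above can be sketched in code. The following is a minimal, hypothetical example (the function name and data are illustrative, not from the source): each report receives a boolean value for a single feature from both the expert reference standard and the NLP application, and the two sets of judgments are scored against each other, with the expert reading treated as ground truth about the text.

```python
def feature_detection_metrics(reference, nlp_output):
    """Compare NLP-assigned feature values with expert-assigned values.

    reference, nlp_output: parallel lists of booleans, one per report,
    indicating whether the feature (e.g., "report describes possible
    pneumonia") was judged present in the text. The score reflects
    agreement on the text, not the patient's true condition.
    Returns (sensitivity, specificity, overall agreement).
    """
    pairs = list(zip(reference, nlp_output))
    tp = sum(1 for r, n in pairs if r and n)        # both say present
    tn = sum(1 for r, n in pairs if not r and not n)  # both say absent
    fp = sum(1 for r, n in pairs if not r and n)    # NLP over-calls
    fn = sum(1 for r, n in pairs if r and not n)    # NLP misses
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    agreement = (tp + tn) / len(pairs)
    return sensitivity, specificity, agreement

# Hypothetical expert vs. NLP readings for five radiograph reports
experts = [True, True, False, False, True]
nlp = [True, False, False, False, True]
sens, spec, agree = feature_detection_metrics(experts, nlp)
```

In this sketch, a report the NLP application labels "possible pneumonia" counts as a true positive whenever the expert reader made the same call on the same text, mirroring the evaluation logic of the section.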