Interpretive reports generated by computer-based psychological assessment systems need to have demonstrated validity even if the instruments on which the interpretations are based are supported by research literature. Computerized outputs are typically one step removed from the relationships between test indexes and validity data established for the original test; therefore, it is important to demonstrate that the inferences included in the computerized report are reliable and valid in the settings where they are used. Some computer interpretation programs now in use also provide comprehensive personality assessment by combining test findings into narrative descriptions and conclusions. Butcher, Perry, and Atlis (2000) recently reviewed the extensive validity research for computer-based interpretation systems. Highlights from their evaluation are summarized in the following sections.
In discussing computer-based assessment, it is useful to subdivide computerized reports into two broad categories: descriptive summaries and consultative reports. Descriptive summaries (e.g., for the 16 Personality Factor Test or 16PF) are usually on a scale-by-scale basis without integration of the results into a narrative. Consultative reports (e.g., those for the MMPI-2 and DTREE, a computer-based DSM-IV diagnostic program) provide detailed analysis of the test data and emulate as closely as possible the interpretive strategies of a trained human consultant.
The validity of computerized reports has been extensively studied in both personality testing and psychiatric screening (computer-based diagnostic interviewing). Research aimed at exploring the accuracy of narrative reports has been conducted for several computerized personality tests, such as the Rorschach Inkblot Test (e.g., Harris, Niedner, Feldman, Fink, & Johnson, 1981; Prince & Guastello, 1990), the 16PF (e.g., Guastello & Rieke, 1990; O'Dell, 1972), the Marital Satisfaction Questionnaire (Hoover & Snyder, 1991) and the Millon Clinical Multiaxial Inventory (MCMI; Moreland & Onstad, 1987; Rogers, Salekin, & Sewell, 1999). Moreland (1987) surveyed results from the most widely studied computer-based personality assessment instrument, the MMPI. Evaluation of diagnostic interview screening by computer (e.g., the DIS) has also been reported (First, 1994).
Moreland (1987) provided an overview of studies that investigated the accuracy of computer-generated MMPI narrative reports. Some studies compared computer-generated narrative interpretations with evaluations provided by human interpreters. One methodological limitation of this type of study is that the clinician's interpretation might not be valid and accurate (Moreland, 1987). For example, Labeck, Johnson, and Harris (1983) asked three clinicians (each with at least 12 years of clinical experience) to rate the quality and the accuracy of code-type interpretations generated by an automated MMPI program (the clinicians did not rate the fit of a narrative to a particular patient, however). Results indicated that the MMPI code-type, diagnostic, and overall profile interpretive statements were consistently rated by the expert judges as strong interpretations. The narratives provided by automated MMPI programs were judged to be substantially better than average when compared to the blind interpretations of similar profiles that were produced by the expert clinicians. The researchers, however, did not specify how they judged the quality of the blind interpretation and did not investigate the possibility that statements in the blind interpretation could have been so brief and general (especially when compared to a two-page narrative CBTI) that they could have artificially inflated the ratings of the CBTI reports. In spite of these limitations, this research design was considered useful in evaluating the overall congruence of computer-generated decision and interpretation rules.
Shores and Carstairs (1998) evaluated the effectiveness of the Minnesota Report in detecting faking. They found that the computer-based reports detected fake-bad profiles in 100% of the cases and detected fake-good profiles in 94% of the cases.
The primary way researchers have attempted to determine the accuracy of computer-based tests is through the use of raters (usually clinicians) who judge the accuracy of computer interpretations based on their knowledge of the client (Moreland, 1987). For example, a study by Butcher and colleagues (1998) explored the utility of computer-based MMPI-2 reports in Australia, France, Norway, and the United States. In all four countries, clinicians administered the MMPI-2 to their patients being seen for psychological evaluation or therapy; they used a booklet format in the language of each country. The tests were scored and interpreted by the Minnesota Report using the American norms for MMPI-2. The practitioner, familiar with the client, rated the information available in each narrative section as insufficient, some, adequate, more than adequate, or extensive. In each case, the clinicians also indicated the percentage of accurate descriptions of the patient and were asked to respond to open-ended questions regarding ways to improve the report. Relatively few raters found the reports inappropriate or inaccurate. In all four countries, the Validity Considerations, Symptomatic Patterns, and Interpersonal Relations sections of the Minnesota Report were found to be the most useful sections in providing detailed information about the patients, compared with the Diagnostic Considerations section. Over two thirds of the records were considered to be highly accurate, which indicated that clinicians judged 80-100% of the computer-generated narrative statements in them to be appropriate and relevant. Overall, in 87% of the reports, at least 60% of the computer-generated narrative statements were believed to be appropriate and relevant to understanding the client's clinical picture.
Although such field studies are valuable in examining the potential usefulness of computer-based reports for various applications, there are limitations to their generalizability. Moreland concluded that this type of study has limitations, in part because estimates of interrater reliability are usually not practical. Raters usually are not asked to provide descriptions of how their judgments were made, and the appropriateness of their judgments was not verified with information from the patients themselves and from other sources (e.g., physicians or family members). Moreland (1985) suggested that in assessing the validity of computer-generated narrative reports, raters should evaluate individual interpretive statements because global accuracy ratings may limit the usefulness of ratings in developing the CBTI system.
Eyde, Kowal, and Fishburne (1991) followed Moreland's recommendations in a study that investigated the comparative validity of the narrative outputs for several CBTI systems. They used case histories and self-report questionnaires as criteria against which narrative reports obtained from seven MMPI computer interpretation systems could be evaluated. Each of the clinicians rated six protocols. Some of the cases were assigned to all raters; they consisted of an African American patient and a Caucasian patient who were matched for a 7-2 (Psychasthenia-Depression) code-type and an African American soldier and a Caucasian soldier who had all clinical scales in the subclinical range (T < 70). The clinicians rated the relevance of each sentence presented in the narrative CBTI as well as the global accuracy of each report. Some CBTI systems studied showed a high degree of accuracy (the Minnesota Report was found to be the most accurate of the seven). However, the overall results indicated that the validity of the narrative outputs varied, with the highest accuracy ratings being associated with narrative lengths in the short-to-medium range. The longer reports tended to include less accurate statements. For different CBTI systems, results for both sentence-by-sentence and global ratings were consistent, but they differed for the clinical and subclinical normal profiles. The subclinical normal cases had a high percentage (Mdn = 50%) of unratable sentences, and the 7-2 profiles had a low percentage (Mdn = 14%) of sentences that could not be rated. One explanation for such differences may come from the fact that the clinical cases were inpatients for whom more detailed case histories were available. Because the length of time between the preparation of the case histories and the administrations of the MMPI varied from case to case, it was not possible to control for changes that a patient might have experienced over time or as a result of treatment.
One possible limitation of the published accuracy-rating studies is that it is usually not possible to control for a phenomenon referred to as the P. T. Barnum effect (e.g., Meehl, 1956) or Aunt Fanny effect (e.g., Tallent, 1958), which suggests that a narrative report may contain high base-rate descriptions that apply to virtually anybody. One factor to consider is that personality variables, such as extraversion, introversion, and neuroticism (Furnham, 1989), as well as the extent of private self-consciousness (Davies, 1997), also have been found to be connected to individuals' acceptance of Barnum feedback.
Research on the Barnum rating effect has shown that participants can usually detect the nature of the overly general feedback if asked the appropriate questions about it (Furnham & Schofield, 1987; Layne, 1979). However, this criticism might not be appropriate for clinical studies because this research has most often been demonstrated for situations involving acceptance of positive statements in self-ratings in normally functioning individuals. For example, research also has demonstrated that people typically are more accepting of favorable Barnum feedback than they are of unfavorable feedback (Dickson & Kelly, 1985; Furnham & Schofield, 1987; C. R. Snyder & Newburg, 1981), and people have been found to perceive favorable descriptions as more appropriate for themselves than for people in general (Baillargeon & Danis, 1984).
Dickson and Kelly (1985) suggested that test situations, such as the type of assessment instruments used, can be significant in eliciting acceptance of Barnum statements. However, Baillargeon and Danis (1984) found no interaction between the type of assessment device and the favorability of statements. Research has suggested that people are more likely to accept Barnum descriptions that are presented by persons of authority or expertise (Lees-Haley, Williams, & Brown, 1993). However, the relevance of this interpretation to studies of testing results has been debated.
Some researchers have made efforts to control for Barnum-type effects on narrative CBTIs by comparing the accuracy of ratings to a stereotypical client or an average subject and by using multireport-multirating intercorrelation matrices (Moreland, 1987) or by examining differences in perceived accuracy between bogus and real reports (Moreland & Onstad, 1987; O'Dell, 1972). Several studies have compared bogus with genuine reports and found them to be statistically different in judged accuracy. In one study, for example, Guastello, Guastello, and Craft (1989) asked college students to complete the Comprehensive Personality Profile Compatibility Questionnaire (CPPCQ). One group of students rated the real computerized test interpretation of the CPPCQ, and another group rated a bogus report. The difference between the accuracy ratings for the bogus and real profiles (57.9% and 74.5%, respectively) was statistically significant. In another study (Guastello & Rieke, 1990), undergraduate students enrolled in an industrial psychology class evaluated a real computer-generated Human Resources Development Report (HRDR) of the 16PF and a bogus report generated from the average 16PF profile of the entire class. Results indicated no statistically significant difference between the ratings for the real reports and the bogus reports (which had mean accuracy ratings of 71.3% and 71.1%, respectively). However, when the results were analyzed separately, four out of five sections of the real 16PF output had significantly higher accuracy ratings than did the bogus report. Contrary to these findings, Prince and Guastello (1990) found no statistically significant differences between bogus and real CBTI interpretations when they investigated a computerized version of the Exner Rorschach interpretation system.
Moreland and Onstad (1987) asked clinical psychologists to rate genuine MCMI computer-generated reports and randomly generated reports. The judges rated the accuracy of the reports based on their knowledge of the client as a whole as well as the global accuracy of each section of the report. Five out of seven sections of the report exceeded chance accuracy when considered one at a time. Axis I and Axis II sections demonstrated the highest incremental validity. There was no difference in accuracy between the real reports and the randomly selected reports for the Axis IV psychosocial stressors section. The overall pattern of findings indicated that computer reports based on the MCMI can exceed chance accuracy in diagnosing patients (Moreland & Onstad, 1987, 1989).
Overall, research concerning computer-generated narrative reports for personality assessment has typically found that the interpretive statements contained in them are comparable to clinician-generated statements. Research also points to the importance of controlling for the degree of generality of the reports' descriptions in order to reduce the confounding influence of the Barnum effect (Butcher et al., 2000).
Computer-based test batteries have also been used in making assessment decisions for cognitive evaluation and in neuropsychological evaluations. The 1960s marked the beginning of investigations into the applicability of computerized testing to this field (e.g., Knights & Watson, 1968). Because of the inclusion of complex visual stimuli and the requirement that participants perform motor response tasks, the development of computerized assessment of cognitive tasks has not proceeded as rapidly as that of paper-and-pencil personality measures. As a result, computerized neuropsychological test interpretation was slower to develop procedures that are equal in accuracy to those achieved by human clinicians (Adams & Heaton, 1985, p. 790; see also Golden, 1987). Garb and Schramke (1996) reviewed and performed a meta-analysis of studies involving computer analyses for neuropsychological assessment, concluding that they were promising but that they needed improvement. Specifically, they pointed out that programs needed to be created that included such information as patient history and clinician observation in addition to the psychometric and demographic data that are more typically used in the prediction process for cognitive measures.
Russell (1995) concluded that computerized testing procedures were capable of aiding in the detection and location of brain damage accurately but not as precisely as clinical judgment. For example, the Right Hemisphere Dysfunction Test (RHDT) and Visual Perception Test (VPT) were used in one study (Sips, Catsman-Berrevoets, van Dongen, van der Werff, & Brook, 1994) in which these computerized measures were created for the purpose of assessing right-hemisphere dysfunction in children and were intended to have the same validity as the Line Orientation Test (LOT) and Facial Recognition Test (FRT) had for adults. Fourteen children with acquired cerebral lesions were administered all four tests. Findings indicated that the computerized RHDT and VPT together were sensitive (at a level of 89%) to right-hemisphere lesions, had relatively low specificity (40%), had high predictive value (72%), and accurately located the lesion in 71% of cases. Fray, Robbins, and Sahakian (1996) reviewed findings regarding a computerized assessment program, the Cambridge Neuropsychological Test Automated Batteries (CANTAB). Although specificity and sensitivity were not reported, the reviewers concluded that CANTAB could detect the effects of progressive, neurodegenerative disorders, sometimes before other signs manifested themselves. They concluded that the CANTAB has been found successful in detecting early signs of Alzheimer's, Parkinson's, and Huntington's diseases.
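The accuracy indexes quoted above (sensitivity, specificity, predictive value) all derive from a 2 × 2 cross-tabulation of test result against actual lesion status. The following Python sketch shows the arithmetic. The counts are hypothetical and are not taken from Sips et al. (1994); they were chosen only because, for 14 cases, they yield values close to the reported figures. The sketch also assumes that "predictive value" refers to positive predictive value, and the names `ConfusionCounts`, `sensitivity`, `specificity`, and `positive_predictive_value` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a 2 x 2 table: test result crossed with true lesion status."""
    true_pos: int   # lesion present, test positive
    false_neg: int  # lesion present, test negative
    false_pos: int  # lesion absent, test positive
    true_neg: int   # lesion absent, test negative


def sensitivity(c: ConfusionCounts) -> float:
    """Proportion of true cases that the test flags."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Proportion of non-cases that the test correctly clears."""
    return c.true_neg / (c.true_neg + c.false_pos)


def positive_predictive_value(c: ConfusionCounts) -> float:
    """Proportion of positive test results that are true cases."""
    return c.true_pos / (c.true_pos + c.false_pos)


# Hypothetical counts for 14 cases (an assumption, not the study's data).
counts = ConfusionCounts(true_pos=8, false_neg=1, false_pos=3, true_neg=2)

print(f"sensitivity: {sensitivity(counts):.1%}")
print(f"specificity: {specificity(counts):.1%}")
print(f"positive predictive value: {positive_predictive_value(counts):.1%}")
```

With so few cases, each index moves by many percentage points when a single child is reclassified, which is one reason small-sample figures of this kind are best read as preliminary.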
Research on computer-assisted psychiatric screening has largely involved the development of logic-tree decision models to assist the clinician in arriving at clinical diagnoses (Erdman, Klein, & Greist, 1985; see also the chapter by Craig in this volume). Logic-tree systems are designed to establish the presence of symptoms specified in diagnostic criteria and to arrive at a particular diagnosis (First, 1994). For example, the DTREE is a recent program designed to guide the clinician through the diagnostic process (First, 1994) and provide the clinician with diagnostic consultation both during and after the assessment process. A narrative report is provided that includes likely diagnoses as well as an extensive narrative explaining the reasoning behind diagnostic decisions included. Research on the validity of logic-tree programs typically compares diagnostic decisions made by a computer and diagnostic decisions made by clinicians. In an initial evaluation, First et al. (1993) evaluated the use of DTREE in an inpatient setting by comparing case conclusions by expert clinicians with the results of DTREE output.
Psychiatric inpatients (N = 20) were evaluated by a consensus case conference and by their treating psychiatrist (five psychiatrists participated in the rating) who used DTREE software. Although the number of cases within each of the diagnostic categories was small, the results are informative. On the primary diagnosis, perfect agreement was reached between the DTREE and the consensus case conference in 75% of cases (N = 15). The agreement was likely to be inflated because some of the treating psychiatrists participated in both the DTREE evaluation and the consensus case conference. This preliminary analysis, however, suggested that DTREE might be useful in education and in evaluation of diagnostically challenging clients (First et al., 1993), although the amount of rigorous research on the system is limited.
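The general logic-tree idea described above (establish whether each criterion symptom is present, follow the corresponding branch, and report the reasoning along with the conclusion) can be sketched in a few lines of Python. This is a generic illustration only, not a reconstruction of DTREE or of any DSM decision tree: the criterion names, the conclusions, and the names `Question`, `Leaf`, and `traverse` are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    """Terminal node holding a conclusion."""
    conclusion: str


@dataclass
class Question:
    """Decision node: branch on whether a criterion symptom is present."""
    symptom: str
    if_present: "Node"
    if_absent: "Node"


Node = Union[Leaf, Question]


def traverse(node: Node, findings: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Walk the tree; return the conclusion and the trail of decisions taken."""
    trail: List[str] = []
    while isinstance(node, Question):
        # A symptom that was never assessed is treated as absent here,
        # which is a simplifying assumption of this sketch.
        present = findings.get(node.symptom, False)
        trail.append(f"{node.symptom}: {'present' if present else 'absent'}")
        node = node.if_present if present else node.if_absent
    return node.conclusion, trail


# Hypothetical two-criterion tree with placeholder labels.
tree = Question(
    symptom="criterion_a",
    if_present=Question(
        symptom="criterion_b",
        if_present=Leaf("candidate diagnosis X"),
        if_absent=Leaf("criteria for X not met"),
    ),
    if_absent=Leaf("criteria for X not met"),
)

conclusion, trail = traverse(tree, {"criterion_a": True, "criterion_b": True})
print(conclusion)
for step in trail:
    print(" ", step)
```

Returning the trail alongside the conclusion mirrors, in a very reduced form, the practice of reporting the reasoning behind a diagnostic decision and not only the decision itself.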
A second logic-tree program in use is a computerized version of the World Health Organization (WHO) Composite International Diagnostic Interview (CIDI-Auto). Peters and Andrews (1995) conducted an investigation of the validity of the CIDI-Auto in the DSM-III-R diagnoses of anxiety disorders, finding generally variable results ranging from low to high accuracy for the CIDI-Auto administered by computer. However, there was only modest overall agreement for the procedure. Ninety-eight patients were interviewed by the first clinician in a brief clinical intake interview prior to entering an anxiety disorders clinic. When the patients returned for a second session, a CIDI-Auto was administered and the client was interviewed by another clinician. The order in which CIDI-Auto was completed varied depending upon the availability of the computer and the second clinician. At the end of treatment, clinicians reached consensus about the diagnosis in each individual case (κ = .93). When such agreement could not be reached, diagnoses were not recorded as the LEAD standard against which CIDI-Auto results were evaluated. Peters and Andrews (1995) concluded that the over-diagnosis provided by the CIDI might have been caused by clinicians' using stricter diagnostic rules in the application of duration criteria for symptoms.
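Assuming the agreement coefficient of .93 reported above is a kappa-type statistic, it expresses agreement between raters after correcting for the agreement expected by chance. The sketch below computes Cohen's kappa for two raters in plain Python. The diagnostic labels and ratings are invented for illustration and do not come from Peters and Andrews (1995); the function name `cohens_kappa` is likewise an assumption of this example.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same cases."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must rate the same non-empty set of cases.")
    n = len(rater_a)

    # Observed agreement: share of cases given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected agreement: what the raters' marginal label frequencies
    # would produce if their ratings were independent.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented ratings for four cases (illustrative only).
clinician_1 = ["panic", "panic", "gad", "gad"]
clinician_2 = ["panic", "gad", "gad", "gad"]
print(cohens_kappa(clinician_1, clinician_2))
```

In this toy example the raters agree on three of four cases (75%), yet kappa is only .50 because half of that agreement would be expected by chance. The same correction is why a raw percentage of agreement and a kappa value are not directly comparable.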
In another study, 37 psychiatric inpatients completed a structured computerized interview assessing their psychiatric history (Carr, Ghosh, & Ancill, 1983). The computerized interview agreed with the case records and clinician interview on 90% of the information. Most patients (88%) considered computer interview to be no more demanding than a traditional interview, and about one third of them reported that the computer interview was easier. Some patients felt that their responses to the computer were more accurate than those provided to interviewers. The computer program in this study elicited about 9.5% more information than did traditional interviews.
Psychiatric screening research has more frequently involved evaluating computer-administered versions of the DIS (Blouin, Perez, & Blouin, 1988; Erdman et al., 1992; Greist et al., 1987; Mathisen, Evans, & Meyers, 1987; Wyndowe, 1987). Research has shown that in general, patients tend to hold favorable attitudes toward computerized DIS systems, although diagnostic validity and reliability are questioned when such programs are used alone (First, 1994).