Several authorities have raised questions about the equivalence of computer-based assessment methods and traditional psychological testing procedures. Hofer and Green (1985), for example, pointed out that several conditions related to computerized test administration could produce noncomparable results. Some people might be uncomfortable with computers and feel awkward dealing with them; this would make the task of taking tests on a computer different from standard testing procedures. Moreover, factors such as the type of equipment used and the nature of the test material (e.g., item content that deals with sensitive and personal information) might make respondents less willing (or more willing) to reveal their true feelings to a computer than to a human being. These situations might lead to atypical results for computerized assessment compared to a traditional format. Another possible disadvantage of computer assessment is that computer-generated interpretations may be excessively general in scope and not specific enough for practical use. Finally, there is a potential for computer-based results to be misused because they might be viewed as more scientific than they actually are, simply because they came out of a computer (Butcher, 1987). It is therefore important that the issues of measurement comparability and, of course, validity of the interpretation be addressed. The next section addresses the comparability of computer-administered tests and paper-and-pencil measures or other traditional methods of data collection.
Comparability of Psychiatric Screening by Computer and Clinical Interview
Several studies have reported on adaptations of psychiatric interviews for computer-based screening, and these adaptations are discussed in the chapter by Craig in this volume. Research has shown that clients in mental health settings report feeling comfortable with providing personal information through computer assessment (e.g., Hile & Adkins, 1997). Moreover, research has shown that computerized assessment programs were generally accurate in diagnosing the presence of behavioral problems. Ross, Swinson, Larkin, and Doumani (1994) used the Computerized Diagnostic Interview Schedule (C-DIS) and a clinician-administered Structured Clinical Interview for the Diagnostic and Statistical Manual of Mental Disorders-Third Edition-Revised (DSM-III-R; SCID) to evaluate 173 clients. They reported the congruence between the two instruments to be acceptable except for substance abuse disorders and antisocial personality disorder, for which the levels of agreement were poor. The C-DIS was able to rule out the possibility of comorbid disorders in the sample with approximately 90% accuracy.
Farrell, Camplair, and McCullough (1987) evaluated the capability of a computerized interview to identify the presence of target complaints in a clinical sample. Both a face-to-face, unstructured intake interview and the interview component of a computerized mental health information system, the Computerized Assessment System for Psychotherapy Evaluation and Research (CASPER), were administered to 103 adult clients seeking outpatient psychological treatment. Results showed relatively low agreement (mean r = .33) between target complaints as reported by clients on the computer and as identified by therapists in interviews. However, 9 of the 15 complaints identified in the computerized interview were found to be significantly associated with other self-report and therapist-generated measures of global functioning.
Comparability of Standard and Computer-Administered Questionnaires
The comparability of computer and standard administrations of questionnaires has been widely researched. Wilson, Genco, and Yager (1985) used a test-attitudes screening instrument as a representative of paper-and-pencil tests that are also administered by computer. Ninety-eight female college freshmen were administered the Test Attitude Battery (TAB) in both paper-and-pencil and computer-administered formats (with order of administration counterbalanced). The means and variances were found to be comparable for the paper-and-pencil and computerized versions.
Holden and Hickman (1987) investigated computerized and paper-and-pencil versions of the Jenkins Activity Survey, a measure that assesses behaviors related to the Type A personality. Sixty male undergraduate students were assigned to one of the two administration formats. The stability of scale scores was comparable for both formats, as were mean scores, variances, reliabilities, and construct validities. Merten and Ruch (1996) examined the comparability of the German versions of the Eysenck Personality Questionnaire (EPQ-R) and the Carroll Rating Scale for Depression (CRS) by having people complete half of each instrument with a paper-and-pencil administration and the other half with computer administration (with order counterbalanced). They compared the results from the two formats to one another as well as to data from another sample, consisting of individuals who were administered only the paper-and-pencil version of the EPQ-R. As in the initial study, means and standard deviations were comparable across computerized and more traditional formats.
In a somewhat more complex and comprehensive evaluation of computer-based testing, Jemelka, Wiegand, Walker, and Trupin (1992) administered several computer-based measures to 100 incarcerated felons. The measures included brief mental health and personal history interviews, the group form of the MMPI, the Revised Beta IQ Examination, the Suicide Probability Scale, the Buss-Durkee Hostility Inventory, the Monroe Dyscontrol Scale, and the Veteran's Alcohol Screening Test. From this initial sample, they developed algorithms from a CBTI system that were then used to assign to each participant ratings of potential for violence, substance abuse, suicide, and victimization. The algorithms were also used to describe and identify the presence of clinical diagnoses based on the DSM-III-R. Clinical interviewers then rated the felons on the same five dimensions. The researchers then tested a new sample of 109 participants with eight sections of the computer-based DIS and found the agreement between the CBTI ratings and the clinician ratings to be fair. In addition, there was high agreement between CBTI- and clinician-diagnosed DSM-III-R disorders, with an overall concordance rate of 82%.
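A concordance rate such as the 82% reported above is a raw proportion of cases on which two sources agree, and it does not correct for agreement expected by chance. The following minimal sketch contrasts raw agreement with Cohen's kappa, a widely used chance-corrected index. It is offered only as an illustration: the function names and the ten paired ratings are hypothetical, and nothing here reproduces the analysis of Jemelka et al. (1992).

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of cases on which the two sources give the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be paired and non-empty")
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Expected agreement if the two sources assigned categories independently.
    categories = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_e == 1.0:
        return 1.0  # both sources used a single identical category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 1 = diagnosis present, 0 = diagnosis absent.
computer = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
clinician = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

print(percent_agreement(computer, clinician))
print(cohens_kappa(computer, clinician))
```

In this made-up example the two sources agree on 8 of 10 cases, yet kappa is noticeably lower than the raw proportion because half of that agreement would be expected by chance alone given the base rates.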
Most of the research concerning the comparability of computer-based and standard personality assessment measures has been with the MMPI or the MMPI-2. Several studies reported possible differences between paper-and-pencil and computerized testing formats (e.g., Lambert, Andrews, Rylee, & Skinner, 1987; Schuldberg, 1988; Watson, Juba, Anderson, & Manifold, 1990). Most of the studies suggest that the differences between administration formats are few and generally of small magnitude, leading to between-forms correlations of .68 to .94 (Watson et al., 1990). Moreover, some researchers have reported very high (Sukigara, 1996) or near-perfect (i.e., 92% to 97%) agreement in scores between computer and booklet administrations (Pinsoneault, 1996). Honaker, Harrell, and Buffaloe (1988) investigated the equivalency of a computer-based MMPI administration with the booklet version among 80 community volunteers. They found no significant differences in means or standard deviations between various computer formats for validity, clinical, and 27 additional scales. However, as in a number of studies investigating the equivalency of computer and booklet forms of the MMPI, the statistical power of their analyses was insufficient to provide conclusive evidence regarding the equivalency of the paper-and-pencil and computerized administration formats (Honaker et al., 1988).
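The point about statistical power can be made concrete: a nonsignificant difference between formats is not, by itself, evidence of equivalence. One common way to frame the question is to ask whether a confidence interval for the mean difference falls entirely inside a prespecified equivalence margin. The sketch below uses a normal approximation for simplicity; the function name, the scores, and the margins are all hypothetical, and this is not the procedure used by Honaker et al. (1988).

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def equivalence_by_ci(scores_a, scores_b, margin, alpha=0.05):
    """Return (lower, upper, equivalent) for the mean difference a - b.

    Uses a (1 - 2*alpha) confidence interval under a normal approximation
    and declares equivalence only if the whole interval lies within
    (-margin, +margin). The margin must be chosen on substantive grounds.
    """
    diff = mean(scores_a) - mean(scores_b)
    se = sqrt(variance(scores_a) / len(scores_a)
              + variance(scores_b) / len(scores_b))
    z = NormalDist().inv_cdf(1 - alpha)
    lower, upper = diff - z * se, diff + z * se
    return lower, upper, (-margin < lower and upper < margin)


# Hypothetical T scores from two small groups with identical means.
booklet = [50, 52, 48, 51, 49, 50, 53, 47]
computer = [51, 50, 49, 52, 48, 50, 52, 48]

print(equivalence_by_ci(booklet, computer, margin=5.0))
print(equivalence_by_ci(booklet, computer, margin=1.0))
```

With only eight hypothetical cases per group, the interval is narrow enough to support equivalence within a wide margin but too wide to support it within a margin of one T-score point, even though the observed means are identical. Larger samples shrink the interval, which is why underpowered studies cannot settle the equivalence question.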
The question of whether computer-administered and paper-and-pencil forms are equivalent was largely laid to rest by a comprehensive meta-analysis (Finger & Ones, 1999). Their analysis included 14 studies conducted between 1974 and 1996, all of which included computerized and standard formats of the MMPI or MMPI-2. They reported that the differences in T score means and standard deviations between test formats across the studies were negligible. Correlations between forms were consistently near 1.00. Based on these findings, the authors concluded that computer-administered inventories are comparable to booklet-administered forms.
The equivalence of conventional computerized and computer-adapted test administrations was demonstrated in the study cited earlier by Roper et al. (1995). In this study, comparing conventional computerized to adaptive computerized administrations of the MMPI, there were no significant differences for either men or women. In terms of criterion-related validity, there were no significant differences between formats for the correlations between MMPI scores and criterion measures that included the Beck Depression Inventory, the Trait Anger and Trait Anxiety scales from the State-Trait Personality Inventory, and nine scales from the Symptoms Checklist—Revised.
Equivalence of Standard and Computer-Administered Neuropsychological Tests
Several investigators have studied computer-adapted versions of neuropsychological tests with somewhat mixed findings. Pellegrino, Hunt, Abate, and Farr (1987) compared a battery of 10 computerized tests of spatial abilities with their paper-and-pencil counterparts and found that computer-based measures of static spatial reasoning can supplement currently used paper-and-pencil procedures. Choca and Morris (1992) compared a computerized version of the Halstead Category Test to the standard version with a group of neurologically impaired persons and reported that the computer version was comparable to the original version.
However, some results have been found to be more mixed. French and Beaumont (1990) reported significantly lower scores on the computerized version than on the standard version of the Standard Progressive Matrices Test, indicating that these two measures cannot be used interchangeably. They concluded, however, that the poor resolution of available computer graphics might have accounted for the differences. With the advent of more sophisticated computer graphics, these problems are likely to be reduced in future studies. It should also be noted that more than a decade ago, French and Beaumont (1990) reported that research participants expressed a clear preference for the computer-based response format over the standard administration procedures for cognitive assessment instruments.
Equivalence of Computer-Based and Traditional Personnel Screening Methods
Several studies have compared computer assessment methods with traditional approaches in the field of personnel selection. Carretta (1989) examined the usefulness of the computerized Basic Attributes Test (BAT) for selecting and classifying United States Air Force pilots. A total of 478 Air Force officer candidates completed a paper-and-pencil qualifying test and the BAT, and they were also judged based on undergraduate pilot training performance. The results demonstrated that the computer-based battery of tests was adequately assessing abilities and skills related to flight training performance, although the results obtained were variable.
In summary, research on the equivalence of computerized and standard administration has produced variable results. Standard and computerized versions of paper-and-pencil personality measures appear to be the most equivalent, and those involving more complex stimuli or highly different response or administration formats appear less equivalent. It is important for test users to ensure that a particular computer-based adaptation of a psychological test is equivalent before its results can be considered comparable to those of the original test (Hofer & Green, 1985).