How old can norms be and still remain accurate? Evidence from the last two decades suggests that norms from measures of cognitive ability and behavioral adjustment are susceptible to becoming soft or stale (i.e., test consumers should use older norms with caution). Use of outdated normative samples introduces systematic error into the diagnostic process and may negatively influence decision-making, such as by denying services (e.g., for mentally handicapping conditions) to sizable numbers of children and adolescents who otherwise would have been identified as eligible to receive services. Sample recency is an ethical concern for all psychologists who test or conduct assessments. The American Psychological Association's (1992) Ethical Principles direct psychologists to avoid basing decisions or recommendations on results that stem from obsolete or outdated tests.
The problem of normative obsolescence has been most robustly demonstrated with intelligence tests. The Flynn effect (Herrnstein & Murray, 1994) describes a consistent pattern of population intelligence test score gains over time and across nations (Flynn, 1984, 1987, 1994, 1999). For intelligence tests, the rate of gain is about one third of an IQ point per year (3 points per decade), which has been a roughly uniform finding over time and for all ages (Flynn, 1999). The Flynn effect appears to occur as early as infancy (Bayley, 1993; S. K. Campbell, Siegel, Parr, & Ramey, 1986) and continues through the full range of adulthood (Tulsky & Ledbetter, 2000). The Flynn effect implies that older test norms may yield inflated scores relative to current normative expectations. For example, the Wechsler Intelligence Scale for Children—Revised (WISC-R; Wechsler, 1974) currently yields higher full scale IQs (FSIQs) than the WISC-III (Wechsler, 1991) by about 7 IQ points.
Systematic generational normative change may also occur in other areas of assessment. For example, parent and teacher reports on the Achenbach system of empirically based behavioral assessments show increased numbers of behavior problems and lower competence scores in the general population of children and adolescents from 1976 to 1989 (Achenbach & Howell, 1993). Just as the Flynn effect suggests a systematic increase in the intelligence of the general population over time, this effect may suggest a corresponding increase in behavioral maladjustment over time.
How often should tests be revised? There is no empirical basis for making a global recommendation, but it seems reasonable to conduct normative updates, restandardizations, or revisions at time intervals corresponding to the time expected to produce one standard error of measurement (SEM) of change. For example, given the Flynn effect and a WISC-III
FSIQ SEm of 3.20, one could expect about 10 to 11 years should elapse before the test's norms would soften to the magnitude of one SEM.
Was this article helpful?