• Race (White, African American, Asian/Pacific Islander, Native American, Other).

• Ethnicity (Hispanic origin, non-Hispanic origin).

• Geographic Region (Midwest, Northeast, South, West).

• Community Setting (Urban/Suburban, Rural).

• Classroom Placement (Full-Time Regular Classroom, Full-Time Self-Contained Classroom, Part-Time Special Education Resource, Other).

• Special Education Services (Learning Disability, Speech and Language Impairments, Serious Emotional Disturbance, Mental Retardation, Giftedness, English as a Second Language, Bilingual Education, and Regular Education).

• Parent Educational Attainment (Less Than High School Degree, High School Graduate or Equivalent, Some College or Technical School, Four or More Years of College).

The most challenging stratification variable is socioeconomic status (SES), particularly because it tends to be associated with cognitive test performance and is difficult to define operationally. Parent educational attainment is often used as an estimate of SES because it is readily available and objective, and because parent education correlates moderately with family income. Parent occupation and income are also sometimes combined as estimates of SES, although income information is generally difficult to obtain. Community estimates of SES add a further level of sampling rigor, because the community in which an individual lives may be a greater factor in the child's everyday life experience than his or her parents' educational attainment. Similarly, the number of people residing in the home and the number of parents (one or two) heading the family are both factors that can influence a family's socioeconomic condition. For example, a family of three that has an annual income of $40,000 may have more economic viability than a family of six that earns the same income. Also, a college-educated single parent may earn less income than two less educated cohabiting parents. The influences of SES on construct development clearly represent an area for further study, requiring more refined definition.

When test users intend to rank individuals relative to the special populations to which they belong, it may also be desirable to ensure that proportionate representation of those special populations is included in the normative sample (e.g., individuals who are mentally retarded, conduct disordered, or learning disabled). Millon, Davis, and Millon (1997) noted that tests normed on special populations may require the use of base rate scores rather than traditional standard scores, because assumptions of a normal distribution of scores often cannot be met within clinical populations.

A classic example of an inappropriate normative reference sample is found with the original Minnesota Multiphasic Personality Inventory (MMPI; Hathaway & McKinley, 1943), which was normed on 724 Minnesota white adults who were, for the most part, relatives or visitors of patients in the University of Minnesota Hospitals. Accordingly, the original MMPI reference group was primarily composed of Minnesota farmers! Fortunately, the MMPI-2 (Butcher, Dahlstrom, Graham, Tellegen, & Kaemmer, 1989) has remediated this normative shortcoming.

One of the principal objectives of sampling is to ensure that each individual in the target population has an equal and independent chance of being selected. Sampling methodologies include both probability and nonprobability approaches, which have different strengths and weaknesses in terms of accuracy, cost, and feasibility (Levy & Lemeshow, 1999).

Probability sampling is a random selection approach that permits the use of statistical theory to estimate the properties of sample estimators. Probability sampling is generally too expensive for norming educational and psychological tests, but it offers the advantage of permitting the determination of the degree of sampling error, such as is frequently reported with the results of most public opinion polls. Sampling error may be defined as the difference between a sample statistic and its corresponding population parameter. Sampling error is independent of measurement error and tends to have a systematic effect on test scores, whereas the effects of measurement error are by definition random. When sampling error in psychological test norms is not reported, the estimate of the true score will always be less accurate than the reported measurement error alone suggests.
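The definition of sampling error given above can be illustrated with a minimal sketch. The population here is simulated and entirely hypothetical (standard scores with a mean of 100 and a standard deviation of 15), and the function name `sampling_error_demo` is our own. The sketch draws one simple random sample, computes the sampling error directly (possible only because the simulated population mean is known), and also computes the standard error that statistical theory supplies from the sample alone.

```python
import random
import statistics


def sampling_error_demo(pop_size=20_000, n=400, seed=0):
    """Draw one simple random sample and compare it with the population."""
    rng = random.Random(seed)

    # Hypothetical population of standard scores (mean 100, SD 15).
    population = [rng.gauss(100, 15) for _ in range(pop_size)]
    population_mean = statistics.fmean(population)

    # Simple random sampling without replacement.
    sample = rng.sample(population, n)
    sample_mean = statistics.fmean(sample)

    # Estimated standard error of the mean, computed from the sample alone.
    # The finite population correction is ignored because n is a small
    # fraction of the population.
    standard_error = statistics.stdev(sample) / n ** 0.5

    return {
        "population_mean": population_mean,
        "sample_mean": sample_mean,
        # Sampling error: sample statistic minus population parameter.
        "sampling_error": sample_mean - population_mean,
        "standard_error": standard_error,
    }


result = sampling_error_demo()
```

In practice the population parameter is unknown, so only the standard error can be reported. It can be computed here because every case had a known chance of selection.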

A probability sampling approach sometimes employed in psychological test norming is known as multistage stratified random cluster sampling; this approach uses a multistage sampling strategy in which a large or dispersed population is divided into a large number of groups, with participants in the groups selected via random sampling. In two-stage cluster sampling, each group undergoes a second round of simple random sampling based on the expectation that each cluster closely resembles every other cluster. For example, a set of schools may constitute the first stage of sampling, with students randomly drawn from the schools in the second stage. Cluster sampling is more economical than random sampling, but incremental amounts of error may be introduced at each stage of the sample selection. Moreover, cluster sampling commonly results in high standard errors when cases from a cluster are homogeneous (Levy & Lemeshow, 1999). Sampling error can be estimated with the cluster sampling approach, so long as the selection process at the various stages involves random sampling.
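The two-stage schools-then-students example can be sketched as follows. This is an illustration under simplifying assumptions, not a description of any published norming procedure: both stages use simple random sampling, no stratification is applied, and the school and student identifiers are invented. The `design_effect` helper uses a common approximation (often attributed to Kish) that is not taken from the sources cited above; it shows how within-cluster homogeneity, expressed as an intraclass correlation, inflates sampling variance.

```python
import random


def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: randomly pick clusters. Stage 2: randomly pick members."""
    rng = random.Random(seed)
    # Stage 1: simple random sample of cluster names (e.g., schools).
    chosen = rng.sample(sorted(clusters), n_clusters)
    # Stage 2: simple random sample of members within each chosen cluster.
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }


def design_effect(n_per_cluster, icc):
    """Approximate variance inflation relative to simple random sampling.

    icc is the intraclass correlation: how similar cases within a
    cluster are to one another (0 = no more alike than strangers).
    """
    return 1 + (n_per_cluster - 1) * icc


# Hypothetical sampling frame: 50 schools of slightly different sizes.
schools = {f"school_{i}": [f"s{i}_{j}" for j in range(30 + i)] for i in range(50)}
sample = two_stage_cluster_sample(schools, n_clusters=10, n_per_cluster=20, seed=1)
```

With 20 students per school and an intraclass correlation of 0.10, `design_effect(20, 0.10)` is about 2.9, meaning the sampling variance is nearly three times that of a simple random sample of the same size. Note also that drawing a fixed number of students from schools of unequal size gives students unequal selection probabilities, which operational designs usually address by sampling schools with probability proportional to size or by weighting.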

In general, sampling error tends to be largest when nonprobability-sampling approaches, such as convenience sampling or quota sampling, are employed. Convenience samples involve the use of a self-selected sample that is easily accessible (e.g., volunteers). Quota samples involve the selection by a coordinator of a predetermined number of cases with specific characteristics. The probability of acquiring an unrepresentative sample is high when using nonprobability procedures. The weakness of all nonprobability-sampling methods is that statistical theory cannot be used to estimate sampling precision, and accordingly sampling accuracy can only be subjectively evaluated (e.g., Kalton, 1983).

How large should a normative sample be? The number of participants sampled at any given stratification level needs to be sufficiently large to provide acceptable sampling error, stable parameter estimates for the target populations, and sufficient power in statistical analyses. As rules of thumb, group-administered tests generally sample over 10,000 participants per age or grade level, whereas individually administered tests typically sample 100 to 200 participants per level (e.g., Robertson, 1992). In IRT, the minimum sample size is related to the choice of calibration model used. In an integrative review, Suen (1990) recommended that a minimum of 200 participants be examined for the one-parameter Rasch model, that at least 500 examinees be examined for the two-parameter model, and that at least 1,000 examinees be examined for the three-parameter model.

The minimum number of cases to be collected (or clusters to be sampled) also depends in part upon the sampling procedure used, and Levy and Lemeshow (1999) provide formulas for a variety of sampling procedures. Up to a point, the larger the sample, the greater the reliability of sampling accuracy. Cattell (1986) noted that eventually diminishing returns can be expected when sample sizes are increased beyond a reasonable level.

The smallest acceptable number of cases in a sampling plan may also be driven by the statistical analyses to be conducted. For example, Zieky (1993) recommended that a minimum of 500 examinees be distributed across the two groups compared in differential item functioning studies for group-administered tests. For individually administered tests, these types of analyses require substantial oversampling of minorities. With regard to exploratory factor analyses, Reise, Waller, and Comrey (2000) have reviewed the psychometric literature and concluded that most rules of thumb pertaining to minimum sample size are not useful. They suggest that when communalities are high and factors are well defined, sample sizes of 100 are often adequate, but when communalities are low, the number of factors is large, and the number of indicators per factor is small, even a sample size of 500 may be inadequate. As with statistical analyses in general, minimal acceptable sample sizes should be based on practical considerations such as desired alpha level, power, and effect size.

As we have discussed, sampling error cannot be formally estimated when probability sampling approaches are not used, and most educational and psychological tests do not employ probability sampling. Given this limitation, there are no objective standards for the sampling precision of test norms. Angoff (1984) recommended as a rule of thumb that the maximum tolerable sampling error should be no more than 14% of the standard error of measurement. He declined, however, to provide further guidance in this area: "Beyond the general consideration that norms should be as precise as their intended use demands and the cost permits, there is very little else that can be said regarding minimum standards for norms reliability" (p. 79).
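To give a sense of scale for Angoff's rule of thumb, the short sketch below computes the 14% threshold for a hypothetical test. It assumes the classical test theory formula for the standard error of measurement (the score standard deviation multiplied by the square root of one minus the reliability); the function names and the example values (a standard deviation of 15 and a reliability of .91) are ours and are chosen only for illustration.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def angoff_max_sampling_error(sd, reliability, fraction=0.14):
    """Largest tolerable sampling error under the 14%-of-SEM rule of thumb."""
    return fraction * standard_error_of_measurement(sd, reliability)


# Hypothetical test: standard-score metric (SD = 15), reliability = .91.
sem = standard_error_of_measurement(sd=15, reliability=0.91)
limit = angoff_max_sampling_error(sd=15, reliability=0.91)
```

For these values the standard error of measurement is about 4.5 standard-score points, so the rule of thumb would tolerate a sampling error of roughly 0.63 points.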

In the absence of formal estimates of sampling error, the accuracy of sampling strata may be most easily determined by comparing stratification breakdowns against those available for the target population. The more closely the sample matches population characteristics, the more representative is a test's normative sample. As best practice, we recommend that test developers provide tables showing the composition of the standardization sample within and across all stratification criteria (e.g., Percentages of the Normative Sample according to combined variables such as Age by Race by Parent Education). This level of stringency and detail ensures that important demographic variables are distributed proportionately across other stratifying variables according to population proportions. The practice of reporting sampling accuracy for single stratification variables "on the margins" (i.e., by one stratification variable at a time) tends to conceal lapses in sampling accuracy. For example, if sample proportions of low socioeconomic status are concentrated in minority groups (instead of being proportionately distributed across majority and minority groups), then the precision of the sample has been compromised through the neglect of minority groups with high socioeconomic status and majority groups with low socioeconomic status. The more the sample deviates from population proportions on multiple stratifications, the greater the effect of sampling error.
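The way marginal reporting can conceal lapses is easy to demonstrate with invented numbers. In the sketch below, the population proportions and sample counts are hypothetical and the group labels are placeholders. The sample matches the population exactly on each variable taken one at a time, yet low socioeconomic status is concentrated in the minority group, so every crossed cell is off by 15 percentage points.

```python
from collections import defaultdict


def proportions(counts):
    """Convert cell counts to cell proportions."""
    total = sum(counts.values())
    return {cell: n / total for cell, n in counts.items()}


def marginal(props, axis):
    """Collapse crossed cells onto one stratification variable."""
    out = defaultdict(float)
    for cell, p in props.items():
        out[cell[axis]] += p
    return dict(out)


def max_abs_gap(a, b):
    """Largest absolute difference between two sets of proportions."""
    return max(abs(a[k] - b[k]) for k in a)


# Hypothetical population proportions: (group, SES) cells.
population = {
    ("majority", "low"): 0.20, ("majority", "high"): 0.50,
    ("minority", "low"): 0.10, ("minority", "high"): 0.20,
}
# Hypothetical normative sample of 1,000 cases.
sample_counts = {
    ("majority", "low"): 50, ("majority", "high"): 650,
    ("minority", "low"): 250, ("minority", "high"): 50,
}
sample = proportions(sample_counts)

# Accuracy "on the margins": one stratification variable at a time.
margin_gap = max(
    max_abs_gap(marginal(sample, axis), marginal(population, axis))
    for axis in (0, 1)
)
# Accuracy within crossed cells (group by SES).
cell_gap = max_abs_gap(sample, population)
```

Here `margin_gap` is effectively zero while `cell_gap` is 0.15, which is why tables of crossed stratification variables are more informative than marginal tables alone.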

Manipulation of the sample composition to generate norms is often accomplished through sample weighting (i.e., application of participant weights to obtain a distribution of scores that is exactly proportioned to the target population representations). Weighting is more frequently used with group-administered educational tests than psychological tests because of the larger size of the normative samples. Educational tests typically involve the collection of thousands of cases, with weighting used to ensure proportionate representation. Weighting is less frequently used with psychological tests, and its use with these smaller samples may significantly affect systematic sampling error because fewer cases are collected and because weighting may thereby differentially affect proportions across different stratification criteria, improving one at the cost of another. Weighting is most likely to contribute to sampling error when a group has been inadequately represented with too few cases collected.
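A minimal sketch of sample weighting on a single stratification variable follows. The group labels, counts, and population proportions are hypothetical. Each case receives a weight equal to its group's population proportion divided by its group's sample proportion. The effective sample size shown is Kish's approximation, which we introduce here only as an illustration of the cost of weighting an underrepresented group; it is not drawn from the sources cited above.

```python
def poststratification_weights(sample_counts, population_props):
    """Per-case weight for each group: population share / sample share."""
    n = sum(sample_counts.values())
    return {
        group: population_props[group] / (count / n)
        for group, count in sample_counts.items()
    }


def effective_sample_size(sample_counts, weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    sum_w = sum(sample_counts[g] * weights[g] for g in sample_counts)
    sum_w2 = sum(sample_counts[g] * weights[g] ** 2 for g in sample_counts)
    return sum_w ** 2 / sum_w2


# Hypothetical sample of 200 in which group_b is badly underrepresented.
sample_counts = {"group_a": 180, "group_b": 20}
population_props = {"group_a": 0.70, "group_b": 0.30}

weights = poststratification_weights(sample_counts, population_props)
n_eff = effective_sample_size(sample_counts, weights)
```

In this example each of the 20 cases in the underrepresented group counts three times, and the 200 collected cases carry roughly the information of 138 equally weighted cases. Weighting on one variable also leaves the proportions on other stratification variables uncontrolled, consistent with the caution that improving one criterion may come at the cost of another.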
