The Misuse of Testing

Inertia, the property described by Newton’s First Law of Motion, is an object’s resistance to any change in its velocity. Accordingly, static objects tend to remain at rest, while moving objects are apt to remain in motion, unless acted on by an external force. Indeed, many of society’s most cherished traditions have endured chiefly because they’ve not been confronted by an inertia-disrupting external force.

In America’s public schools, for instance, we have been employing traditional standardized tests in much the same way for more than a full century. Significantly, the pandemic represents an external force of sufficient strength to stimulate a long overdue rethink and, perhaps, a redo of education’s standardized testing. Let’s consider why such a shift might be beneficial.

Standardized Testing’s Lineage

A “standardized test” is one that is administered, scored, and interpreted in a standard, prespecified manner. Prior to World War I, only a handful of instances can be found in which standardized tests were employed by many American educators. World War I triggered the substantial use of a specific standardized test that, in time, led to U.S. educators’ widespread employment of standardized testing. This test was the Army Alpha.

At the outset of the war, U.S. Army officials called on the American Psychological Association to appoint a committee of testing experts to develop an easily administered paper-and-pencil test capable of predicting the success of recruits sent to Army officer training programs. The result was the Army Alpha.

Designed to function as an abbreviated intelligence test, the Alpha consisted of true-false and multiple-choice items intended to measure a recruit’s verbal ability, numerical ability, direction-following ability, and knowledge of information. Test-takers’ performances were then compared with one another so that those recruits who earned the highest scores could be chosen for officer training. Given the Alpha’s dependence on comparative measurement, it was soon recognized that higher scores tended to be earned by recruits with stronger educations and more affluent families.

The Army Alpha was administered to a whopping 1.7 million men and was widely regarded as a stellar assessment success. Indeed, its comparative approach became the dominant test-development strategy for nearly all U.S. standardized exams, achievement tests as well as intelligence tests, during the next 100 years.

Three Purposes

In education, standardized tests provide the evidence necessary to achieve three distinct purposes—each of which, in certain situations, can contribute markedly to the enhancement of educational quality. These three assessment functions—often dependent on students’ scores on standardized tests—are instruction, evaluation, and selection.

First, students’ performances on certain tests can be used for instruction—primarily to help teachers identify test-measured skills or bodies of knowledge that have been mastered by students. It is assumed that teachers, once informed via a test’s results of their students’ current knowledge and skills, can then make more individualized and more effective instructional plans than would have been possible without the data.

Yet, in today’s real world of schooling, where standardized tests are typically administered in the spring, students’ performances on those tests are often not reported to their teachers until precious little teaching time remains in the school year. Even worse, many score reports don’t arrive until students have been advanced to higher grades, new courses, and different teachers. Nonetheless, commercial vendors of standardized achievement tests often tout their standardized tests as being a boon to instruction.

Second, the results of students’ performances on standardized tests also can be used for evaluation, that is, to help determine the effectiveness of instruction delivered by an individual teacher or by groups of educators. Many of our nationally standardized educational tests, such as the Iowa Tests of Basic Skills, are currently employed in comparing performances of students being taught in different schools or in comparing scores of students drawn from such different societal segments as socioeconomic or ethnic subgroups. In truth, when many people think of standardized achievement tests these days, they assume that students’ performances on such tests will usually play a prominent role in evaluating the success of schooling. Regrettably, such assumptions are made despite decades of high correlations between students’ standardized test scores and their socioeconomic status, correlations suggesting that standardized tests often measure what students bring to school rather than what they learn there. It is, of course, perilous to attribute students’ standardized test scores to instructional quality when those scores are so patently linked to students’ backgrounds.

Finally, standardized tests can be employed for selection, that is, to identify, at least in part, which students should take part in distinctive educational programs. Most commonly, based on students’ comparative test performances, selections are made so that the strongest or the weakest test-takers can be assigned to ability-aligned enrichment or remedial learning programs. Indeed, the original mission of the Army Alpha was most decidedly selection.

Historically, then, the three most common applications of standardized educational tests have been to inform instructional, evaluative, or selection decisions. Less formal assessments, of course, such as teacher-made classroom tests, also can be used to collect useful information regarding these three sorts of decisions. Yet, to do its job properly, a standardized test really should provide convincing evidence that its use augments the defensibility of instructional, evaluative, or selection decisions being made about the students.

Joint Standards

Almost all professional specializations have identified the profession-sanctioned precepts by which their specialists are supposed to carry out their work. In educational testing, such rules are collected in a document known as the Standards for Educational and Psychological Testing. Because of the importance of this collection of recommendations, many measurement specialists regard it as almost a rendition of the Holy Writ.

Revised periodically, these guidelines are referred to as the “Joint Standards” because they are prepared jointly by a coalition of the three U.S. professional associations most concerned with educational testing: the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME). Versions of these standards were first published in 1954. Overseeing each revision of the guidelines is a carefully chosen group of assessment experts drawn from the three associations. After an exhaustive review of proposed changes, an updated revision of the Joint Standards is published approximately every decade.

The current edition of the Joint Standards (2014) stresses the importance of an educational test being attentive to three overriding considerations: reliability, fairness, and validity. Although a standardized test’s reliability (that is, its consistency in measuring whatever it measures) and fairness (that is, the equitability of its treatment of test-takers) must always be present for a high-quality standardized test, assessment validity is regarded as more important. In fact, validity is seen by the Joint Standards as “the most fundamental consideration in developing and evaluating tests.”

A test’s validity evidence helps us determine the accuracy of the test’s score-based interpretations in relation to the intended purpose of that test. As the Joint Standards put it, “It is the interpretations of test scores for proposed uses that are evaluated, not the test itself.” Accordingly, some validity evidence must illuminate the accuracy with which test scores are interpreted, but other validity evidence must indicate whether those score-based interpretations will contribute to attainment of the test’s intended assessment purpose.

The most crucial assessment-quality judgments deal with validity. Yet, few educators report genuine conversance with how a validity argument is crafted—or what sorts of evidence can be used to support such arguments. Accordingly, the current analysis will be concluded by presenting examples of the sorts of evidence that could be included in a validity argument.

Many, if not most, of today’s standardized tests were created merely by mimicking the steps used in the Army Alpha’s original test development process. The essence of that process was to compare recruits’ intellectual abilities so that the most cognitively capable recruits could be distinguished from their less-able counterparts.

Typically, a standardized test’s validity evidence is presented in a technical manual accompanying the test in the form of a “validity argument.” The information provided in this argument permits potential users to judge, first, the likely accuracy of score-based interpretations and, second, how well those interpretations will support the test’s intended use.

Because technical manuals are sometimes crammed with off-putting statistical trappings, whenever possible a simplified version of a validity argument’s essentials should also be provided for nontechnical readers. Such plain-talk validity arguments can be genuinely helpful when deciding whether to adopt a given standardized test.

Validity Argument

Like most arguments, a validity argument attempts to persuade potential adopters of a standardized test under consideration that, were the test to be chosen, assessment validity will be present. The mission of such an argument is to provide an accurate and honest picture of testing’s most important attribute, namely, validity.

First, analysis and supporting evidence must be presented regarding the likely accuracy of the score-based interpretations permitted by a test. Next, a second sort of evidence must be presented concerning the degree to which test-takers’ scores will contribute to the test’s intended purpose. Ideally, these two groups of evidence will be clearly identified in a technical report as bearing on either interpretation accuracy or purpose support. Better still, as noted above, the arguments and evidence will be presented both at a highly technical level and at a more humanely technical level.

The number of educators who, during their lifetimes, have read even a single validity argument is surely quite small. But this should not be so. Well-formulated validity arguments can provide potential standardized-test users with key information needed for sound decisions. In the field of education, of course, a poorly chosen standardized test can, over the years, sometimes result in serious educational harm for thousands of students.

The validity arguments found in standardized-test technical manuals are often laced with so many statistical analyses that they are intimidating to even careful readers. The problem is that, even when the consequences of using a given standardized test are unarguably high, many of us give scant heed to the validity-related information associated with a particular standardized test. Perhaps we assume that because the test is “standardized,” this somehow attests to its appropriateness.

Yet, we see the misuse of many standardized tests today because they are unaccompanied by strong validity arguments that embody sufficiently supportive evidence. Even worse, we also encounter standardized tests whose validity arguments make it apparent that the test under consideration is clearly being used for the wrong purpose.

Those wishing to determine the usability of a standardized test therefore need to begin by considering the strength of the validity argument accompanying the test. The Joint Standards tell us that a sound validity argument must incorporate (1) evidence of interpretation accuracy and (2) evidence of purpose support. What might those two types of evidence look like?

Focusing first on evidence of interpretation accuracy, what’s needed is empirically and/or judgmentally based evidence that the score-based inferences about test-takers’ status will be accurate. That is, interpretations of the meaning of test-takers’ scores should be sufficiently congruent with test-takers’ actual abilities insofar as those abilities are displayed via a standardized test. What’s sought in this initial category of evidence are data bearing directly on whether accurate interpretations of test-takers’ performances are likely.

The second category of evidence to be presented in validity arguments deals with the degree to which a standardized test’s use will contribute to the test’s intended aim. As noted earlier, a standardized test is used for three distinct missions, namely, to enhance instruction, evaluation, and selection. Moreover, under each of those headings, numerous potent evidence options are usually possible. Some of those options will require collections of students’ test scores on other exams; others will require judgmental estimates of a test’s contribution to attaining its goal. The variety and strength of validity evidence depend on the creativity of those creating the validity argument.

Disrupting External Force

As noted, American educators have been employing standardized tests in essentially the same ways for more than a century. So, given this reality, why is there any need to be concerned now about the ways we have been using those tests—even if the pandemic may have caused some educators to reconsider many of their traditional practices?

The answer may be unnerving, but it can’t be overlooked. In the U.S., many standardized tests are being used in pursuit of the wrong measurement mission. Such testing is surely having a harmful impact on our students. The time has come, and indeed is long overdue, for American educators to review the appropriateness of the ways we build and employ our standardized tests. Our same-old, same-old uses of many of our standardized tests must be identified for what they are: flat-out wrong.

Because the modifications stemming from the pandemic have caused us to consider the virtues of a host of traditional in-school operations, now is a perfect time to dig into the viscera of standardized testing to discern what needs to change. Your task as a policymaker is abundantly clear. You need to learn enough about the kinds of arguments and evidence necessary for us to judge and use standardized tests in ways that help, not harm, children.

Clearly, to the extent that America’s educators have allowed and, in some instances, encouraged the use of today’s many mispurposed standardized tests, it is not the tests that are at fault. We are.

W. James Popham (wpopham@g.ucla.edu) is an Emeritus Professor in the UCLA Graduate School of Education and Information Studies and the author of more than a dozen education books. His most recent is Classroom Assessment: What Teachers Need to Know (9th Edition).
