Definition
Two fundamentally different questions can be asked about any assessment result: "How did this student perform compared to other students?" and "How did this student perform against a defined standard?" The first question produces a norm-referenced interpretation; the second produces a criterion-referenced one.
A norm-referenced assessment interprets a student's score relative to a norming group — typically a large, representative sample of students who took the same test. The score itself is less meaningful than the student's position in the distribution. A score of 72 means little until you know that it places the student at the 88th percentile. Classic examples include IQ tests, many college entrance exams, and nationally normed achievement batteries like the Iowa Assessments.
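To make the reference frame concrete, here is a minimal sketch in Python of how a percentile rank locates a raw score within a norming distribution. The norming sample is invented for illustration, not drawn from any real norming study.

```python
import random

# Minimal sketch: deriving a percentile rank from an invented norming sample.
random.seed(1)
norming_sample = [random.gauss(60, 10) for _ in range(5000)]  # hypothetical national sample

def percentile_rank(score, norms):
    """Percent of the norming group scoring below the given raw score."""
    below = sum(1 for s in norms if s < score)
    return 100 * below / len(norms)

raw_score = 72  # meaningless alone; interpretable only against the distribution
print(f"Raw score {raw_score} -> {percentile_rank(raw_score, norming_sample):.0f}th percentile")
```

The same raw score of 72 would yield a different percentile against a different norming group, which is exactly the point: the interpretation lives in the comparison, not in the number.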
A criterion-referenced assessment interprets a student's score against a predetermined set of learning criteria, regardless of how other students perform. The question is whether the student demonstrated mastery of specific skills or content. If every student in the class scores 95%, that is a success, not a sign that the test was too easy. Examples include driving license exams, bar exams, and classroom tests built around learning objectives.
The distinction is not about the test itself but about how scores are constructed and interpreted. Assessment design choices (item difficulty, score reporting, cut scores) follow from the purpose the assessment is meant to serve.
Historical Context
The intellectual roots of norm-referenced assessment trace to Francis Galton's work on statistical distributions in the 1880s. Galton introduced the concept of ranking individuals on a normal curve, laying groundwork for the psychometric tradition. His student Karl Pearson formalized correlation and the statistical tools used in test norming.
The modern era of norm-referenced testing began with the Army Alpha and Beta tests developed by Robert Yerkes and colleagues during World War I (1917–1919). Faced with the task of rapidly classifying 1.75 million recruits, the U.S. military needed instruments that sorted people efficiently. The Alpha test for literate recruits and the Beta test for illiterate or non-English-speaking recruits produced rank-orderings rather than mastery verdicts. This model shaped American educational testing for decades.
Lewis Terman's Stanford-Binet IQ test (1916) and later the development of the SAT by Carl Brigham in the 1920s extended the norm-referenced model into education. By mid-century, norm-referenced standardized tests dominated American schooling, particularly through instruments produced by publishers like Educational Testing Service (ETS) and the Iowa testing program.
The criterion-referenced alternative emerged explicitly in 1963 when psychologist Robert Glaser published "Instructional Technology and the Measurement of Learning Outcomes" in the journal American Psychologist. Glaser coined the term "criterion-referenced measure" and argued that educational measurement needed a framework grounded in specific behavioral objectives rather than comparative rankings. James Popham and T.R. Husek extended the theoretical framework in a 1969 paper in the Journal of Educational Measurement, which remains a foundational text.
The standards movement of the 1990s, culminating in the No Child Left Behind Act (2001) and later the Every Student Succeeds Act (2015), pushed American education strongly toward criterion-referenced state assessments tied to grade-level content standards, though norm-referenced instruments remained dominant in college admissions and gifted education screening.
Key Principles
Score Meaning Depends on the Reference Frame
A norm-referenced score answers a comparative question: where does this student stand relative to others? A criterion-referenced score answers a mastery question: what can this student do? These are different questions, and conflating them produces distorted conclusions. A student who scores at the 50th percentile on a norm-referenced reading test may or may not be a proficient reader — that depends entirely on what the norming group itself can do.
Norm-Referenced Tests Are Designed to Spread Students Apart
Test designers building norm-referenced instruments deliberately include items of varying difficulty and remove items that nearly everyone answers correctly or incorrectly. High discrimination between students is the design goal. A well-constructed norm-referenced test produces scores spread across the full range of the distribution. This design principle is appropriate for ranking purposes but actively counterproductive for measuring instructional outcomes: items that reflect what was taught tend to be answered correctly by most students after good instruction, which reduces variance and "hurts" a norm-referenced test psychometrically.
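The sketch below illustrates this screening logic under simplified assumptions: an invented response matrix, classical item difficulty (proportion correct), and a rule-of-thumb difficulty band. Operational test development uses more refined discrimination statistics, but the core move of dropping items that nearly everyone gets right or wrong is the same.

```python
# Sketch of norm-referenced item screening: keep items that spread students apart.
# The response matrix (1 = correct, 0 = incorrect) is invented for illustration.
responses = {
    "item1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # nearly everyone correct
    "item2": [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],  # mid-difficulty
    "item3": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],  # nearly everyone incorrect
}

for item, scores in responses.items():
    p = sum(scores) / len(scores)  # item difficulty (proportion correct)
    variance = p * (1 - p)         # score variance is maximized at p = 0.5
    keep = 0.2 <= p <= 0.8         # a common rule-of-thumb screening band
    print(f"{item}: p={p:.2f}, variance={variance:.2f}, keep={keep}")
```

Note what this implies for instruction: after effective teaching, a well-aligned item drifts toward p = 1.0 and gets screened out, even though it is exactly the item a teacher would want to keep.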
Criterion-Referenced Tests Define Mastery Before Testing Begins
The defining feature of criterion-referenced assessment is that the standard exists independently of student performance. The cut score for a driving license (e.g., 80% correct on the knowledge test) does not shift based on how other applicants perform on a given day. This requires deliberate specification of learning objectives, content domains, and performance standards before the test is administered. Robert Mager's work on behavioral objectives (1962) provided much of the practical framework for this design approach.
Both Types Have Legitimate Uses
Norm-referenced assessments serve selection, screening, and diagnostic comparisons across populations. They answer questions like: Is this school's reading performance above or below the national average? Which students are most likely to need intensive intervention? Criterion-referenced assessments serve instruction, certification, and accountability against standards. They answer: Has this student learned to multiply fractions? Is this graduate ready to practice law? Using a norm-referenced instrument to make criterion-referenced decisions, or vice versa, produces misleading conclusions.
Cut Scores on Criterion-Referenced Tests Involve Value Judgments
Setting the proficiency threshold on a criterion-referenced test is a policy decision, not a purely technical one. Methods like the Angoff method, the bookmark method, and the contrasting groups method are all defensible approaches, but they embed judgments about what "proficient" means. Robert Linn (2003) documented extensively how proficiency cut scores on state assessments varied dramatically across states, producing inconsistent conclusions about student achievement even when measuring similar content.
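As a concrete illustration, here is a minimal sketch of the unmodified Angoff procedure with invented panel data: each judge estimates, item by item, the probability that a minimally competent examinee answers correctly, and the averaged judge totals become the recommended cut score.

```python
# Sketch of the (unmodified) Angoff procedure with invented judge estimates.
# Each judge rates, per item, the probability that a minimally competent
# examinee answers correctly.
judge_estimates = [
    [0.9, 0.7, 0.6, 0.8, 0.5],  # judge 1, items 1-5
    [0.8, 0.6, 0.7, 0.9, 0.4],  # judge 2
    [0.9, 0.8, 0.5, 0.7, 0.6],  # judge 3
]

# Each judge's implied cut score is the sum of their item probabilities;
# the panel's recommended cut score is the mean across judges.
judge_cuts = [sum(items) for items in judge_estimates]
cut_score = sum(judge_cuts) / len(judge_cuts)
print(f"Recommended cut score: {cut_score:.1f} of {len(judge_estimates[0])} items")
```

The arithmetic is trivial; the value judgment lives entirely in the probability estimates, which is why panels with different conceptions of "minimally competent" produce different cut scores from identical items.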
Classroom Application
Using Criterion-Referenced Assessments for Instructional Planning
A fifth-grade math teacher designing a unit on fractions writes specific learning objectives: students will add fractions with unlike denominators, compare fractions using benchmark fractions, and solve word problems involving fraction addition. The unit test is built directly from those objectives, with clear mastery thresholds (e.g., 80% correct within each objective cluster).
After scoring, the teacher disaggregates results by objective rather than looking at total scores. Several students mastered adding unlike denominators but struggled with word problems; a smaller group showed gaps in benchmark comparisons. Re-teaching targets these specific gaps. Total scores would have obscured this instructional information entirely.
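A minimal sketch of that disaggregation, with a hypothetical item-to-objective map and one student's invented responses, might look like this:

```python
# Sketch: disaggregating unit-test results by learning objective.
# The item-objective map and responses are invented for illustration.
item_objective = {
    1: "add_unlike_denominators", 2: "add_unlike_denominators",
    3: "benchmark_comparison",    4: "benchmark_comparison",
    5: "word_problems",           6: "word_problems",
}
responses = {1: 1, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0}  # 1 = correct

MASTERY = 0.80  # threshold set before the test was administered
by_objective = {}
for item, correct in responses.items():
    by_objective.setdefault(item_objective[item], []).append(correct)

for obj, scores in by_objective.items():
    pct = sum(scores) / len(scores)
    print(f"{obj}: {pct:.0%} -> {'mastered' if pct >= MASTERY else 're-teach'}")
```

The total score here (50%) says only that the student struggled; the per-objective view says where.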
Recognizing Norm-Referenced Thinking in Everyday Grading
A high school biology teacher grades on a curve after a difficult exam — the highest score was 78, so the teacher adds 22 points to every student's score. This is norm-referenced practice embedded in a classroom context. The consequence: students who learned the content poorly may receive passing grades, while the teacher receives no reliable information about which concepts need re-teaching. A criterion-referenced alternative is to examine why scores were low (Was the instruction sufficient? Was the test aligned to instruction?) and address the underlying cause rather than adjusting scores.
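The arithmetic contrast between the two stances fits in a few lines (all scores invented):

```python
# Sketch: a flat curve (norm-referenced adjustment) vs. a fixed criterion.
raw_scores = {"Ana": 78, "Ben": 55, "Cam": 41}
PASSING = 70  # criterion set before the exam

curve = 100 - max(raw_scores.values())  # anchor the top score to 100 (+22 here)
for name, raw in raw_scores.items():
    curved = raw + curve
    print(f"{name}: raw {raw} ({'pass' if raw >= PASSING else 'fail'} on criterion), "
          f"curved {curved} ({'pass' if curved >= PASSING else 'fail'})")
```

Ben passes only because Ana's 78 happened to be the class maximum; his curved grade reports the class distribution, not his mastery.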
Combining Both Approaches for Screening and Instruction
A middle school literacy coordinator uses a nationally normed reading assessment (e.g., NWEA MAP) three times per year to identify students performing significantly below grade-level norms, a norm-referenced use. Students flagged receive a criterion-referenced diagnostic assessment (tied to specific decoding, fluency, and comprehension standards) to pinpoint instructional targets. The norm-referenced screen identifies who needs attention; the criterion-referenced diagnostic assessment identifies what instruction they need. Neither instrument alone would do both jobs well.
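A simplified sketch of that two-stage logic, with invented student data and thresholds (not NWEA MAP's actual reporting format), might look like this:

```python
# Sketch: norm-referenced screen, then criterion-referenced diagnostic.
# All student data and thresholds are invented for illustration.
screen_percentiles = {"Dee": 12, "Eli": 55, "Fay": 18, "Gus": 72}
SCREEN_CUTOFF = 25  # flag students below the 25th national percentile

# Diagnostic results for flagged students: per-standard proportion correct.
diagnostics = {
    "Dee": {"decoding": 0.55, "fluency": 0.70, "comprehension": 0.85},
    "Fay": {"decoding": 0.90, "fluency": 0.60, "comprehension": 0.65},
}
MASTERY = 0.80

flagged = [s for s, pct in screen_percentiles.items() if pct < SCREEN_CUTOFF]
for student in flagged:  # the screen answers "who needs attention"
    gaps = [std for std, p in diagnostics[student].items() if p < MASTERY]
    print(f"{student}: instructional targets -> {gaps}")  # "what instruction"
```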
Research Evidence
Robert Glaser and Anthony Nitko's foundational work established the psychometric case for criterion-referenced assessment in educational contexts. Nitko's 1980 monograph Distinguishing the Many Varieties of Criterion-Referenced Tests provided the first comprehensive taxonomy of criterion-referenced approaches, clarifying distinctions that had been blurred in the decade following Glaser's 1963 paper.
James Popham's research on the instructional sensitivity of assessments, sustained from the 1970s through the 2010s, demonstrated that most large-scale standardized tests, including many state accountability tests nominally labeled as criterion-referenced, contain items whose results reflect socioeconomic background more than instructional quality. His concept of "instructionally insensitive" tests (2007, Educational Researcher) challenged assumptions that standards-aligned tests automatically measure teaching effectiveness.
W. James Popham and Eva Baker (1970) conducted early empirical comparisons of norm- and criterion-referenced approaches, finding that teachers who received criterion-referenced performance data made more precise instructional adjustments than those receiving norm-referenced scores. This finding has been replicated in more recent work; Wiliam and Thompson (2007) in Ahead of the Curve reviewed the formative assessment literature and concluded that criterion-based feedback consistently outperforms comparative feedback for improving student learning.
Robert Linn's 2003 analysis in Educational Researcher, "Accountability: Responsibility and Reasonable Expectations," examined two decades of state assessment data and found that proficiency rate gains on state criterion-referenced tests frequently did not correlate with gains on NAEP, the independent national benchmark assessment, raising questions about whether state cut scores had been set at defensible levels. His work illustrated that criterion-referenced interpretation is only as meaningful as the quality of the criteria themselves.
Common Misconceptions
Misconception 1: Standardized tests are always norm-referenced. Many standardized tests are criterion-referenced. "Standardized" simply means administered and scored under consistent, uniform conditions. State tests tied to content standards (PARCC, SBAC, STAAR) are standardized and criterion-referenced. The SAT and ACT are standardized and norm-referenced. The term "standardized" describes the administration procedure, not the interpretive framework.
Misconception 2: Criterion-referenced assessments are easier to construct. Because criterion-referenced assessments require explicit, operationalized learning standards with defensible cut scores, they are often harder to build rigorously than norm-referenced instruments. A norm-referenced test can be assembled by selecting items that maximize score variance across a norming group. A criterion-referenced test requires upfront specification of exactly what students must be able to do, how performance will be sampled, and what threshold constitutes mastery — decisions that require both content expertise and deliberate validity work.
Misconception 3: Norm-referenced assessments have no place in classrooms. For some instructional decisions, norm-referenced comparisons are genuinely useful. A teacher wondering whether her class's writing development is on track relative to similar students nationally benefits from normed data. A school counselor identifying students who may need gifted services needs normative comparisons. The problem is not norm-referenced interpretation itself but using it for instructional decisions that require criterion-referenced information (i.e., what exactly does this student need to learn next?).
Connection to Active Learning
The choice between norm-referenced and criterion-referenced frameworks shapes how active learning functions in a classroom. Active learning methodologies (think-pair-share, Socratic seminar, project-based inquiry) are designed to build genuine competence in specific skills: analysis, argumentation, collaborative problem-solving. These outcomes are criterion-referenced by nature. A student has or has not developed the capacity to construct a reasoned argument from evidence. Norm-referenced ranking adds nothing to that question.
Standards-based grading operationalizes criterion-referenced principles at the reporting level, replacing percentage-based grades with mastery indicators tied directly to learning objectives. Teachers working in standards-based systems find that criterion-referenced assessments align naturally with formative cycles: assess against the standard, identify gaps, provide targeted practice, reassess. Norm-referenced grading disrupts this cycle because a student's grade depends partly on how classmates perform, not on their own mastery progress.
Summative assessment at the end of a unit or course serves a criterion-referenced purpose in most instructional contexts: did the student reach the learning goals? When summative grades are curved (a norm-referenced adjustment), they lose their diagnostic integrity and their usefulness as evidence of competence for future instructors or employers. Diagnostic assessment at the start of a learning sequence is almost always criterion-referenced: teachers need to know specifically what students already know and do not yet know, not how they rank relative to peers.
For active learning to function well, students need criterion-referenced feedback. Research on self-regulated learning (Zimmerman, 2002) shows that students adjust their effort and strategy based on gap information: "I have not yet mastered X" is actionable. "I am at the 43rd percentile" is not. Building assessment systems around defined criteria gives students the specific feedback that sustains productive struggle and genuine learning.
Sources
- Glaser, R. (1963). Instructional technology and the measurement of learning outcomes: Some questions. American Psychologist, 18(8), 519–521.
- Popham, W. J., & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6(1), 1–9.
- Linn, R. L. (2003). Accountability: Responsibility and reasonable expectations. Educational Researcher, 32(7), 3–13.
- Nitko, A. J. (1980). Distinguishing the many varieties of criterion-referenced tests (Research Report RR-80-9). Educational Testing Service.