A variety of factors and initiatives seem to be converging to create a “perfect storm” of reform in education. Dissatisfaction with American students’ performance on international assessments, concerns about U.S. global competitiveness, the state of the economy, and evidence that many high school graduates are not “college or career ready” are all taking their toll. And as a result, Race to the Top initiatives, the Department of Education’s ESEA reauthorization blueprint, and plans of many states and state consortia all call for significant changes in curricula, instructional delivery, and the ways we monitor student achievement.

Education reform movements are nothing new. And there are some who would contend that past efforts have had little effect. Twenty-first century skill advocates point out that our current system of education was created to address the workplace needs of an emerging industrial nation: to turn out people who were armed with some basic, low-level skills and ready to take their place on an assembly line, prepared to arrive on time, respect authority, and conform to established rules. They would assert that little about this system has changed since that time, despite the radical shift in the demands of the workplace.

New curricular emphases, programs, and instructional techniques have come and gone. But rather than gargantuan improvements in student achievement, we have seen only small gains in large-scale assessment results, and those gains have been far from adequate. If one accepts that interactions between students and teachers are the key to significant improvements in student achievement, then it becomes obvious where we should focus attention: on teaching and testing practices that have been shown to lead to real improvement.

Without shortchanging content, teachers in the future will be expected to better address a broad range of student skills, some cognitive (problem solving, critical thinking, communication) and some not (collaboration, self-direction). They will be expected to place greater emphasis on project-based learning and assessment leading to multiple, scorable student products and performances. Done well, these activities can lead to greater depth of knowledge of content.

The learning environment will not be limited to a school building or classroom, but instead will make greater use of out-of-school resources. Computers and other technology tools will be relied on extensively in all aspects of teaching and learning. Thus, teachers will have to be comfortable with changing learning environments and proficient with new, high-tech tools and systems (Partnership for 21st Century Skills, 2009).

Assessment Literacy

With all these changes, however, there is still something critical that teachers will need, something they have been lacking for some time – a far higher level of assessment literacy. They need a great deal more grounding in the use of assessment than the limited exposure to testing concepts they receive in pre-service training. I’m not talking about more definitions of such things as stanines or percentile ranks, but rather a far deeper understanding of the roles many kinds of assessment play in the processes of teaching and learning. Consider this response, which a teacher today might offer to answer a question about testing and grading practices.

I do formative assessment. I give quizzes or tests almost every day. I create the tests online from an item bank, and my students take the tests online, too. That way I get results back immediately and can use them to adjust my instruction. The scores are automatically recorded in my electronic grade book, and I can see right away how well each student did, as well as how the class did as a whole.

My district gives two interim assessments each year. These are developed by our curriculum coordinator working with teachers, also using the item bank. These are general assessments we use to monitor growth and to identify students who are likely to have trouble passing the state test at the end of the year. We also use the diagnostic information they give us.

On the surface, these comments may seem quite reasonable. But they may well depict poor practice. For example, there is a significant disconnect between the teacher’s concept of formative assessment and the dramatically effective process of formative assessment supported by research. The latter is an ongoing process that occurs during instruction and involves (1) letting students know the learning targets and criteria for success, (2) gathering rich evidence of student learning by a variety of means (e.g., observation, questioning, quizzes), (3) providing descriptive feedback on gaps in student learning, (4) the teacher and student using the feedback to adjust instruction and learning activities, (5) student self-assessment, and (6) activating other students as resources (Wiliam, 2007).

Returning to our teacher’s response: timing (immediacy of results) is only one attribute of effective formative assessment. A score on a multiple-choice quiz hardly constitutes rich evidence or descriptive feedback leading to appropriate changes in instruction. In fact, the assignment of scores to many kinds of student work before the completion of an instructional unit is one of many grading practices that destroy students’ motivation to learn and thus inhibit learning (Schafer, 1993).

The district testing the teacher describes also seems reasonable on the surface. Early warning and growth monitoring are legitimate uses of interim testing. With respect to the latter, I wonder if the test items were selected for the two tests in such a way that comparisons of performance on the two measures are appropriate. It is doubtful that the two tests were statistically equated. Were raw mean scores compared? If percentage of proficient students was reported, were the cut scores for proficiency arbitrarily set at 70 percent on both tests? In either case, what if the second test was just easier than the first—would a higher score on the second one really be an accurate reflection of growth?
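To make the last question concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the five students' scores are invented, and the 8-point "easiness offset" is simply assumed to be known, which in practice it would not be without a proper equating study (for example, one built on items common to both forms). A constant offset is a crude stand-in for equating, not a method any district actually uses here.

```python
from statistics import mean

# Hypothetical percent-correct scores for the same five students.
# These numbers are invented for illustration only.
fall_scores = [55, 62, 68, 71, 80]
spring_scores = [63, 70, 76, 79, 88]

# Assumption for this sketch: the spring form is easier, so a student of
# unchanged ability scores about 8 points higher on it than on the fall form.
SPRING_EASINESS_OFFSET = 8

# The same fixed cut score is applied to both forms.
CUT_SCORE = 70


def percent_proficient(scores, cut):
    """Return the percentage of scores at or above the cut score."""
    return 100 * sum(s >= cut for s in scores) / len(scores)


# Naive comparison: treat the two forms as if they were equally difficult.
raw_gain = mean(spring_scores) - mean(fall_scores)

# Crude adjustment: remove the assumed difficulty difference before comparing.
adjusted_spring = [s - SPRING_EASINESS_OFFSET for s in spring_scores]
adjusted_gain = mean(adjusted_spring) - mean(fall_scores)

print(f"Raw mean gain:      {raw_gain:.1f}")
print(f"Adjusted mean gain: {adjusted_gain:.1f}")
print(f"Proficient, fall:             {percent_proficient(fall_scores, CUT_SCORE):.0f}%")
print(f"Proficient, spring (raw):     {percent_proficient(spring_scores, CUT_SCORE):.0f}%")
print(f"Proficient, spring (adjusted): {percent_proficient(adjusted_spring, CUT_SCORE):.0f}%")
```

With these invented numbers, the raw comparison shows an 8-point mean gain and a jump from 40 to 80 percent proficient, while the adjusted comparison shows no change at all. The apparent growth comes entirely from the assumed difference in test difficulty.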

A school administrator once mentioned to me that the district was looking forward to implementing a data management system, so that results from multiple tests could be aggregated to help the teachers better understand their students’ capabilities. With respect to total tests or subtests, how would the content covered by different measures compare? Are the results reported on the same scales? If not, how can they be aggregated? Does it make sense to aggregate data gathered months apart? For monitoring growth with respect to a general area or specific standard, are the measures comparable, based on content and difficulty?

Whether they use self-developed tests or off-the-shelf instruments from publishers, these are the kinds of questions to which district educators need to know the answers. Those answers, known in fact by too few, determine the tests’ legitimate uses, as well as what legitimate conclusions can be drawn from the results.

There are several categories of assessments that are used in schools today, and several approaches that might be used within each. Very different from the process of formative assessment described earlier are summative assessments, which could include teacher-made classroom tests, interim assessments like the district tests the teacher described above, and high-stakes external tests, such as state accountability assessments. Summative assessments are “those assessments that are generally carried out at the end of an instructional unit or course of study for the purposes of giving grades or otherwise certifying student proficiency” (Shepard et al., 2005). Some of these tests might be general achievement tests covering the whole domain of mathematics at a grade, for example, or benchmark tests, perhaps covering material taught within the last two or three months.

There are tests made up of multiple-choice questions, tests made up of constructed-response questions, and tests made up of combinations of item types. (Generally, extended constructed-response questions are better for testing higher-order thinking skills or greater depth of knowledge.) There are fixed tests (the same tests taken by all students in a group) and computer-adaptive tests, which are tailored to each student’s ability level. General achievement measures, whether fixed or adaptive, are not designed to provide rich, diagnostic information. They can be used to monitor growth. Also, they are quite useful as a source of information to guide program improvements that will benefit the next group of students to pass through a tested grade – such as general areas of weakness within a discipline or identification of low-performing subgroups of students.


It is true that teachers of the future will need to deal with many changes in education – including new environments and new tools. But these changes won’t lessen the need for a much greater level of assessment literacy, here defined as the knowledge and skills teachers need to:

  • identify, select, or create assessments optimally designed for various purposes, such as: grading or certifying proficiency, diagnosing specific student needs (gaps in learning), and assessing higher order thinking; and
  • analyze, evaluate, and use the quantitative and qualitative evidence generated by external summative and interim assessments, classroom summative assessments, and instructionally embedded formative assessment practices to make appropriate decisions to improve programs and specific instruction to advance student learning.

Better equipped with assessment literacy, teachers will be in a much better position to weather the “perfect storm of reform” and maximize student learning.


Partnership for 21st Century Skills. (2009) 21st century learning environments. White paper from series on support systems, http://www.21stcenturyskills.org/route21/.

Schafer, W. (1993) Assessment literacy for teachers. Theory into Practice, 32(2), College of Education, The Ohio State University.

Shepard, L., Hammerness, K., Darling-Hammond, L., Rust, F. (2005) Assessment. In Darling-Hammond, L. and Bransford, J. (Eds.), Preparing Teachers for a Changing World, 275-326, San Francisco: Jossey-Bass.

Wiliam, D. (2007) Keeping learning on track: Classroom assessment and the regulation of learning. In Lester F. (Ed.), Second Handbook of Mathematics Teaching and Learning, 1053-1098, Greenwich, CT: Information Age Publishing.

Stuart Kahl is CEO of Measured Progress in Dover, NH, an educational testing company with contracts for state testing programs in over 20 states. A former elementary and secondary teacher, Dr. Kahl earned degrees from Johns Hopkins University and from the University of Colorado. Prior to cofounding his current company in 1983, he worked for the Education Commission of the States, the University of Colorado, Clark University, and RMC Research Corporation.