Learning from what doesn’t work in teacher evaluation


When rethinking their teacher evaluation systems under ESSA, education leaders will do well to look to the past for guidance on what not to do — and what to do better.


Reflecting on the many times his experiments and inventions went awry, Thomas Edison once said, “I have not failed. I just found 10,000 ways that won’t work. I am not discouraged, because every wrong attempt discarded is another step forward.”

Today’s education policy makers would do well to follow Edison’s lead and treat their missteps as learning opportunities. For example, consider the many efforts made in recent years to evaluate teachers by way of value-added models or other calculations based on their students’ scores on standardized achievement tests. Those models have been flawed, as we discuss below. However, their shortcomings have much to teach us about the challenges involved in assessing the work of classroom teachers.

In recent months, we’ve reviewed the arguments and decisions made in 15 lawsuits filed by teachers unions (identified by Education Week, 2015) that have sought to block states and districts from using such evaluation systems. Further, we’ve looked deeply into four of these lawsuits — from New York (Lederman v. King), New Mexico (State of New Mexico ex rel. the Hon. Mimi Stewart et al. v. Public Education Department), Texas (Houston Federation of Teachers v. Houston Independent School District), and Tennessee (Trout v. Knox County Board of Education) — combing through more than 15,000 pages of accompanying legal documents to identify the strongest objections to these measurement strategies.

The lessons to be learned from these cases are both important and timely. Under the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work. As a result, hundreds of state- and district-level teams (including teachers, administrators, union representatives, state department of education personnel, and others) are now trying to decide whether and how to redesign their existing teacher evaluation systems. If they’re not careful, they could easily repeat the four main blunders their predecessors made:

1) Ignoring inconsistent findings

Value-added or student-growth models (more generally referred to as VAMs) are meant to gauge individual teachers’ effectiveness by comparing their former students’ test scores with those of demographically similar students (based on factors such as past achievement, family income, race, English language fluency, and special education status) who’ve studied with other teachers.
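
To make the basic logic concrete, here is a minimal sketch in Python. It is an illustration under strong simplifying assumptions, not any state's or vendor's actual model: it predicts each student's current score from the prior score alone (real VAMs add many more controls), and it treats a teacher's "value added" as the average gap between actual and predicted scores. The function name `value_added` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def value_added(records):
    """Estimate a naive value-added score per teacher.

    records: iterable of (teacher, prior_score, current_score) tuples.
    Returns {teacher: mean residual}, where each residual is a student's
    current score minus the score predicted from the prior score alone.
    """
    records = list(records)
    priors = [p for _, p, _ in records]
    currents = [c for _, _, c in records]
    p_bar, c_bar = mean(priors), mean(currents)

    # Ordinary least squares with a single predictor (prior score).
    sxx = sum((p - p_bar) ** 2 for p in priors)
    if sxx == 0:
        raise ValueError("prior scores must vary to fit a line")
    sxy = sum((p - p_bar) * (c - c_bar) for p, c in zip(priors, currents))
    slope = sxy / sxx
    intercept = c_bar - slope * p_bar

    # Average each teacher's student-level residuals.
    residuals = defaultdict(list)
    for teacher, prior, current in records:
        residuals[teacher].append(current - (intercept + slope * prior))
    return {teacher: mean(vals) for teacher, vals in residuals.items()}


# Hypothetical data: two teachers whose students start at the same levels.
sample = [
    ("A", 40, 50), ("A", 50, 60), ("A", 60, 70),
    ("B", 40, 40), ("B", 50, 50), ("B", 60, 60),
]
scores = value_added(sample)
print(scores)
```

With this made-up data, teacher A's students land 5 points above the pooled prediction and teacher B's land 5 points below. Everything the model leaves out (here, everything except prior scores) ends up attributed to the teacher, which is the core of the objections discussed in the sections that follow.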

This may sound like an objective form of evaluation, giving school administrators a perfectly fair way to determine precisely how much or how little each teacher contributes to student achievement, relative to their peers in similar schools and classrooms. However, in all 15 lawsuits highlighted by Education Week, plaintiffs have argued that VAMs often fall far short of their intention to create fair, apples-to-apples comparisons of teachers’ effectiveness because their statistical models cannot account for the subtle ways in which groups of students differ from one year to the next. In no lawsuit was this clearer than in Lederman v. King.

In 2014, Sheri Lederman — a highly praised, National Board-certified 4th-grade teacher with 18 years of experience, working in Great Neck, N.Y. — received just one out of 20 possible points on her VAM score, despite having received 14 out of 20 points the year before. Her students’ test scores were not markedly different across those two years, and there was no discernible change in her teaching methods, yet the evaluation system had given her wildly different ratings.

Lederman brought suit in the New York State Supreme Court, prompting a well-publicized case that elicited affidavits from a number of testing experts (including the coauthor of this article), many of whom argued that this and other VAMs were unreliable (i.e., they lacked consistency over time). Lederman won her case, with the presiding judge ruling that the state’s teacher evaluation system, based primarily on teachers’ VAM scores, was “arbitrary and capricious” and “taken without sound basis in reason or regard to the facts” (Lederman v. King, 2016).

However, though this ruling was widely publicized — and though a number of researchers have found VAMs to be unreliable (e.g., Chiang et al., 2016; Yeh, 2013) — many districts continue to use VAMs as the basis for high-stakes accountability decisions, such as firing teachers or awarding them tenure or merit pay.

2) Overlooking evidence of bias

Research suggests that VAMs, whether used on their own or in combination with classroom observations (Gill et al., 2016; Steinberg & Garrett, 2016; Whitehurst, Chingos, & Lindquist, 2014), tend to be not just unreliable but also biased.

In Lederman’s case, for example, it appears that the evaluation system penalized her for working with students who tended to get high test scores. Because her students’ test scores were high to begin with, it was more or less impossible for Lederman to show significant improvement over time. (This “ceiling effect,” referring to the difficulty of raising scores that are already about as high as they can go, was also noted by the plaintiffs in Houston Federation of Teachers v. Houston Independent School District.) Similarly, researchers have found that many evaluation systems are also biased against teachers who work with low-scoring students (who are more likely to come from low-income families, to be non-White, and/or to be enrolled in English language or special education programs) (Kane, 2017; Newton et al., 2010; Rothstein, 2010). Teachers who teach disproportionate numbers of these low-scoring students are more likely to be evaluated not on their teaching but on factors outside their control.
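
The ceiling effect can be illustrated with a few lines of Python. This is a toy sketch, assuming a hypothetical test capped at 100 points and a hypothetical "true" learning gain of 10 points for every student; none of these numbers come from the lawsuits.

```python
MAX_SCORE = 100  # hypothetical test ceiling


def observed_gain(prior, true_gain, max_score=MAX_SCORE):
    """Gain the test can register when scores are capped at max_score."""
    return min(prior + true_gain, max_score) - prior


def mean_observed_gain(priors, true_gain):
    """Average registered gain for a class with the given prior scores."""
    return sum(observed_gain(p, true_gain) for p in priors) / len(priors)


# Two hypothetical classes whose students all truly learn the same amount.
high_scoring_class = [92, 95, 98]
mid_scoring_class = [55, 60, 65]

print(mean_observed_gain(high_scoring_class, true_gain=10))
print(mean_observed_gain(mid_scoring_class, true_gain=10))
```

Under these assumptions, the high-scoring class registers an average gain of only 5 points while the mid-scoring class registers the full 10, even though both classes learned the same amount. A growth measure built on such scores would rate the first teacher lower for reasons unrelated to teaching.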

Further, VAMs have also been found to show bias toward teachers of specific subject areas and grade levels, as argued by the plaintiffs in the Tennessee and New Mexico lawsuits. How, for example, can an evaluation system be described as fair when it consistently rates English teachers higher than math teachers, or rates 4th- and 8th-grade teachers higher than teachers of grades 5-7 (Holloway-Libell, 2015)? If a system is designed to measure the performance of individual teachers, then the results shouldn’t vary by the groups to which teachers belong.

3) Allowing people to game the system

Policy makers may have the power to design and introduce new teacher evaluation systems, but school leaders have the power to implement those systems, and they are unlikely to do so in ways that run counter to their professional expertise and interests. It should be no surprise, then, that local administrators will often find ways to inflate the ratings of teachers they want to protect, especially when they believe the given teacher evaluation system to be inaccurate, unfair, or prejudiced. Frequently, for example, principals will give teachers very high scores on their classroom observations to counterbalance their low VAM scores (Geiger & Amrein-Beardsley, 2017).

Conversely, administrators may be tempted to lower their ratings of teachers’ classroom practice to better align them with the VAM scores they’ve already received. So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their “subjective” classroom ratings were in sync with the VAM’s “objective” scores.

Whether administrators inflate or downgrade their ratings, such behavior distorts the validity (or “truthfulness”) of the entire teacher evaluation system. Yet, such gaming of the data appears to be common practice (Grossman et al., 2014; Hill, Kapitula, & Umland, 2011; Polikoff & Porter, 2014), in violation of the professional standards created by the major research organizations in education, psychology, and measurement (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), which call for VAM scores and observation ratings to be kept separate — one should never be adjusted to fit the other.

The important lesson here is that teacher evaluation systems, like any invention, need to be tested in the real world to see how they translate from theory to practice and to give their designers the chance to make improvements. In Houston and Tennessee, that never happened. Rather, evidence suggests that policy makers pushed their teacher evaluation systems into place without first trying them out on a limited scale.

4) Avoiding transparency

The creators of VAMs tend to incorporate unique statistical components into their models, which allows each developer to claim that their product is different from (and superior to) others on the market. Indeed, the competition for market share is fierce. Some states and school districts have been willing to spend millions of dollars for tools that provide “objective” data they can use to evaluate teachers — for example, Houston paid $680,000 per year for its VAM (Amrein-Beardsley, 2016). Given the considerable profits at stake, evaluators are extremely protective of their proprietary materials, taking great care to guard their trade secrets. And when asked for more information, they tend to be reluctant to provide it, arguing that their “methods [are] too complex for most of the teachers whose jobs depended on them to understand” (Carey, 2017; see also Gabriel & Lester, 2013).

However, the money spent on these teacher evaluation systems comes from public tax revenues. In some parts of the country (New York, for example), state law requires that such uses of public funds be transparent to increase the likelihood that those services are, in fact, being delivered as promised. Further, critics of VAMs argue that, whatever the state law, public officials have no business purchasing tools and services that put the rights of consumers at risk. Teachers want (and deserve) to know how they are being evaluated, especially when their salaries and jobs may be at stake. Not only should they have access to this information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.

Indeed, the presiding judge in the Houston lawsuit ruled, in 2017, that teachers did have legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well. Thus, if policy makers want to escape challenges to their teacher evaluation systems, they will need to open their VAMs to public scrutiny. The vendors of those data systems will likely object, but teachers and the larger tax-paying public have a right to know what they are paying for and to see whether those goods and services are as advertised.

Toward better teacher evaluation systems

Inventors, by necessity, must be optimists. They work at difficult problems with the assumption that some new angle or idea will provide a breakthrough. In that spirit, we conclude not by rehashing our critiques but by asking what steps might lead toward significant improvements in teacher evaluation. Given what we’ve learned from VAMs’ shortcomings, where should we go next?

Three approaches strike us as promising: using multiple measures to evaluate teachers, designing teacher evaluation systems that emphasize formative uses, and engaging teachers throughout the process of creating and refining these systems.

Multiple measures

Under existing VAM-based evaluation systems, teachers are often tempted to discount the ratings they receive, on the basis that their scores “jump around” from year to year (i.e., aren’t reliable), fail to account for what makes their teaching situation unique (i.e., aren’t valid), and/or are biased against them because of the students they happen to teach (whether high achievers, who have no room to improve, or students struggling with poverty or other challenges that make their improvement less predictable).

But if VAM scores are just one kind of indicator among many, then these methodological problems become less serious. For example, if teachers’ VAM scores point in the same direction as their ratings from principals’ observations, feedback from colleagues, and ratings from student surveys about their instruction, then it becomes difficult for them to argue that the evaluation isn’t credible. And if these indicators point in different directions, then it becomes difficult for administrators to argue that they have all the information they need — divergent ratings should prompt a closer look at what’s going on.
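
One simple way to act on divergent indicators is to flag them for review. The sketch below is only an assumption about how that might look: it supposes each measure has already been rescaled to a 0-1 range, and the function name `flag_divergent`, the 0.3 threshold, and the sample ratings are all hypothetical.

```python
def flag_divergent(ratings, threshold=0.3):
    """Return teachers whose measures disagree by more than threshold.

    ratings: {teacher: {measure_name: score on a 0-1 scale}}.
    A teacher is flagged when the gap between the highest and lowest
    measure exceeds threshold, signaling that a closer look is needed.
    """
    flagged = []
    for teacher, measures in ratings.items():
        values = list(measures.values())
        if not values:
            continue  # nothing to compare
        if max(values) - min(values) > threshold:
            flagged.append(teacher)
    return sorted(flagged)


# Hypothetical ratings, each rescaled to 0-1.
ratings = {
    "Teacher A": {"vam": 0.70, "observation": 0.75, "survey": 0.80},
    "Teacher B": {"vam": 0.05, "observation": 0.90, "survey": 0.85},
}
print(flag_divergent(ratings))
```

Here only Teacher B is flagged: a VAM score of 0.05 (comparable to 1 out of 20 points) sits far from strong observation and survey ratings. The flag does not say which measure is wrong; it only marks the case as one that deserves human attention before any high-stakes decision.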

The use of multiple measures has become common in many other fields. In the business world, for example, hiring managers tend to consider applicants’ resumes, recommendations, and ratings from interviews. Similarly, admissions directors at selective colleges consider far more than just, say, test scores; they also look at students’ grade point averages, class ranks, recommendation letters, writing samples, talents, and interests. So, too, should state and district policy makers create evaluation systems that rely on multiple measures of teachers’ effectiveness.

An emphasis on the formative

Second, we recommend that policy makers shift the focus of teacher evaluation from “who should I fire?” to “how can I help teachers improve?” In a recent study of teacher evaluation in a large urban school district, for example, researchers found that in the more effective schools, administrators tended to use evaluation data first and foremost to inform professional development, rather than to reward or discipline teachers (Reinhorn, Moore Johnson, & Simon, 2017). In our own research, too, we’ve found that many state departments of education now seem to be shifting their focus in this way, encouraging districts to look for opportunities to use their teacher evaluation systems to provide formative feedback rather than using them strictly for accountability purposes (Close, Amrein-Beardsley, & Collins, 2018).

Teacher ownership and engagement

Lastly, we recommend that policy makers look for ways to include teachers in the process of developing and improving their own evaluation systems. As we noted earlier, when educators distrust the given approach, they often respond by trying to “game the system,” distorting their professional judgments to protect themselves and their colleagues from consequences they see as unfair and illegitimate. It is far better, we think, to build evaluation systems that educators trust, that align with their beliefs about teaching and learning, and that they see no reason to game — or, for that matter, to challenge in court. The best way to encourage such buy-in among teachers is to include them in deciding how they should be evaluated and with what tools and indicators. Since passage of the Every Student Succeeds Act, many states and districts have, in fact, brought more stakeholders into the process of designing teacher evaluation systems, and we’re hopeful that this will lead to improvement.

At present, things may seem fairly chaotic in the world of teacher evaluation, given the many lawsuits that have been brought against the use of and overreliance on VAMs. We’re optimistic about the coming years, though. As Mary Wollstonecraft Shelley wrote in her introduction to Frankenstein, “Invention . . . does not consist of creating out of void, but out of chaos.” If policy makers learn from the mistakes of the recent past, then they should be able to design teacher evaluation systems that are consistent, valid, fair, and useful.



Citation: Close, K. & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan 100 (1), 15-19.


KEVIN CLOSE (kclose1@asu.edu) is a doctoral student in the Learning, Literacies, and Technologies Program at Mary Lou Fulton Teachers College at Arizona State University in Tempe.
AUDREY AMREIN-BEARDSLEY (audrey.beardsley@asu.edu) is a professor in the Educational Policy and Evaluation Program at Mary Lou Fulton Teachers College, Arizona State University. She is the author of Rethinking Value-Added Models in Education: Critical Perspectives on Tests and Assessment-Based Accountability (Routledge, 2014) and coeditor of Student Growth Measures in Policy and Practice: Intended and Unintended Consequences of High-Stakes Teacher Evaluations (Palgrave, 2016).


  • Joel Berg, Ph.D.

    Last sentence: “evaluation” not “education”.

    The best teaching is the improvement of attitudes, leading to eagerness and enjoyment in learning. Abilities will develop in all subject areas under this emphasis.

    A fine article.

  • Laura H. Chapman

    Then there is the fact that the VAM rituals are not applicable to teachers for whom there are not standardized state tests. The last estimate for that population was about 69%. The distortion of the whole of education is too rarely noted in all of the data-chasing on behalf of scores in reading and math, occasionally science, perhaps social studies. The truncated test-driven curriculum has been enabled by the attention given to VAM. I am grateful that the sham factors in VAM are being exposed. I work in Ohio where all teacher evaluations are a sham, especially EVAAS, our version of VAM, and the infamous SLO writing exercise for teachers of “untested subjects.”

What can work in teacher evaluation: Lessons from Boys in the Boat

Daniel James Brown’s 2013 bestseller, The Boys in the Boat, illustrates four powerful ideas that ought to inform our teacher evaluation systems.


You would think by now both educators and policy makers would have learned a lesson or two about what does not work in teacher evaluation. In recent years, both the federal government and the Gates Foundation have put results-oriented teacher evaluation at the front and center of the school reform movement. However, as Kevin Close and Audrey Amrein-Beardsley describe in their article, using students’ scores on standardized achievement tests to assess teaching effectiveness has proven to be quite problematic. Not only are the statistics behind value-added models unreliable and biased, but administrators are “tempted” to align their assessment of teachers’ classroom practice to the VAM scores they’ve already received.

While Close and Amrein-Beardsley draw upon the arguments and decisions made in 15 lawsuits that have raised questions about the legal legitimacy of VAM to judge teachers, their analyses comport neatly with a recent 600-page RAND report on Gates’ investments in the Intensive Partnerships (IP) for Effective Teaching. In short, RAND found “no evidence” that the use of test-based teacher evaluations led to improved student learning and retention of effective teachers at the IP sites, which included three school districts and four charter management organizations. Perhaps due to the high burdens placed on principals’ time and the incomplete (and sometimes inaccurate) data produced by the tests, the sites were not able to improve the effectiveness of their current teachers through their systems of coaching, mentoring, and professional development.

As I reflect on the analyses by Close and Amrein-Beardsley as well as RAND’s post-mortem of the IP, I cannot help but think back to another RAND report on teacher evaluation, this one published in 1984, revealing how the data were rarely used to inform school priorities as well as how “evaluation processes needed to yield descriptive information that illuminates sources of difficulty, as well as viable courses for change.”

I also have to consider recent studies from numerous scholars, such as John Papay, Matt Ronfeldt, and Alan Daly, who have shown how teacher collaboration and networking power up student learning. We now know much about how teachers, especially those teaching diverse, high-need students, improve their pedagogical practices and take instructional risks when they learn from other teaching colleagues they trust. Unfortunately, though, as found by yet another recent RAND investigation, fewer than one in three teachers in the U.S. have sufficient time to collaborate with their teaching colleagues, and 44% report that, in a typical month, they never observe another teacher’s classroom to get ideas for instruction or to offer feedback.

At the same time, Close and Amrein-Beardsley’s article also brings to my mind a decidedly non-academic publication, Daniel James Brown’s 2013 bestseller The Boys in the Boat, which recounts the stunning gold-medal performance by the University of Washington rowing crew at the 1936 Berlin Olympics. What does a poignant story about a rowing team have to do with rethinking teacher evaluation? Central to the narrative is Joe Rantz, one of the UW crew members, who overcame a hardscrabble Depression-era childhood to become a successful engineer. But the legendary come-from-behind victory in the Berlin games was not about his individual effort, or that of any of his eight teammates, but about how, over time, they became attuned to one another’s performance and well-being. In fact, Joe was at one time the “weak link in the crew” who often “struggled to master the technical side of the sport,” but he and his crewmates were “fiercely determined” to make sure none of them failed. And it wasn’t just a matter of teamwork. The Boys would not have won the gold medal without the right kind of technical and moral support from their coaches as well as the engineered precision of the boat, The Husky Clipper, which was designed to maximize the particular strengths of its crew members.

Brown’s story illustrates at least four powerful ideas that ought to inform the next generation of teacher evaluation systems:

  1. We should rethink the assumption that teacher evaluation must focus on assessing and developing the skills of individual teachers — it can be more effective to cultivate teams of teachers, bringing individuals together in ways that meet their students’ particular needs;
  2. We should redesign schools to ensure that the most accomplished practitioners — both teachers and administrators — have opportunities to share ideas and provide leadership;
  3. We should retool our teaching evaluation rubrics so that they rate teachers at the top of the scale only if they share their expertise with their colleagues; and
  4. We should repurpose the job of the school principal to encourage administrators to cultivate teacher leadership and reward them for doing so.

These recommendations are not pie-in-the-sky. They are being played out today in school systems both overseas and here in the U.S. For example, in Singapore, teacher evaluation is not a checklist but a narrative that begins with self-assessment and focuses on contributions to the holistic development of students. Master teachers are identified only when they demonstrate how they help their colleagues improve.

Closer to home, in the Pomona (Calif.) Unified School District (where my organization provides technical assistance), district administrators are shifting professional development to emphasize teacher voice and choice and offer opportunities for teachers to coteach, redesign learning environments for students, and take time to share standards-based lessons and resources with their colleagues. Early evidence has shown that this approach is improving student achievement among the district’s high-need students. And now the district and the union are beginning to reshape teacher evaluation to focus on both individual growth and teamwork that leads to more equitable student outcomes.

Over the last three decades, we have learned a great deal about teacher evaluation. Researchers have surfaced many lessons from a wide range of statistical analyses, legal decisions, and qualitative studies. But the future of teacher evaluation may be best informed by how nine boys in a boat developed and shared responsibility among themselves — with supportive coaches and a well-designed shell to match their strengths — and became the #1 crew in the world.


Daly, A., Moolenaar, N., Der-Martirosian, C., & Liou, Y. (2014). Accessing capital resources: Investigating the effects of teacher human and social capital on student achievement. Teachers College Record, 116 (7), 1-42.

Johnston, W.R. & Tsai, T. (2018). The prevalence of collaboration among American teachers: National findings from the American Teacher Panel. Santa Monica, CA: RAND Corporation.

Papay, J., Taylor, E.S., Tyler, J., & Laski, M. (2016, February). Learning job skills from colleagues at work: Evidence from a field experiment using teacher performance data. Cambridge, MA: National Bureau of Economic Research.

Ronfeldt, M., Farmer, S.O., McQueen, K., & Grissom, J. (2015). Teacher collaboration in instructional teams and student achievement. American Educational Research Journal, 52 (3), 475-514.

Stecher, B. et al. (2018). Improving teaching effectiveness: Final report of the Intensive Partnerships for Effective Teaching through 2015–2016. Santa Monica, CA: RAND Corporation.

Wise, A., Darling-Hammond, L., Tyson-Bernstein, H., & McLaughlin, M. (1984). Teacher evaluation: A study of effective practices. Washington DC: RAND Corporation.

BARNETT BERRY (bberry@teachingquality.org) is the founder of the Center for Teaching Quality, a national nonprofit dedicated to igniting change inside of public education in order to transform student learning.
