When rethinking their teacher evaluation systems under ESSA, education leaders will do well to look to the past for guidance on what not to do — and what to do better.
Reflecting on the many times his experiments and inventions went awry, Thomas Edison once said, “I have not failed. I just found 10,000 ways that won’t work. I am not discouraged, because every wrong attempt discarded is another step forward.”
Today’s education policy makers would do well to follow Edison’s lead and treat their missteps as learning opportunities. For example, consider the many efforts made in recent years to evaluate teachers by way of value-added models or other calculations based on their students’ scores on standardized achievement tests. Those models have been flawed, as we discuss below. However, their shortcomings have much to teach us about the challenges involved in assessing the work of classroom teachers.
In recent months, we’ve reviewed the arguments and decisions made in 15 lawsuits filed by teachers unions (identified by Education Week, 2015) that have sought to block states and districts from using such evaluation systems. Further, we’ve looked deeply into four of these lawsuits — from New York (Lederman v. King), New Mexico (State of New Mexico ex rel. the Hon. Mimi Stewart et al. v. Public Education Department), Texas (Houston Federation of Teachers v. Houston Independent School District), and Tennessee (Trout v. Knox County Board of Education) — combing through more than 15,000 pages of accompanying legal documents to identify the strongest objections to these measurement strategies.
The lessons to be learned from these cases are both important and timely. Under the Every Student Succeeds Act (ESSA), local education leaders once again have authority to decide for themselves how to assess teachers’ work. As a result, hundreds of state- and district-level teams (including teachers, administrators, union representatives, state department of education personnel, and others) are now trying to decide whether and how to redesign their existing teacher evaluation systems. If they’re not careful, they could easily repeat the four main blunders their predecessors made:
1) Ignoring inconsistent findings
Value-added or student-growth models (more generally referred to as VAMs) are meant to gauge individual teachers’ effectiveness by comparing their former students’ test scores with those of demographically similar students (based on factors such as past achievement, family income, race, English language fluency, and special education status) who’ve studied with other teachers.
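To make the basic logic concrete, here is a deliberately simplified sketch of a value-added calculation. It is not any vendor's or state's actual model: the student records are invented, prior achievement is the only control variable (real VAMs adjust for many more factors), and the names `fit_line` and `value_added` are hypothetical helpers introduced for illustration. The sketch predicts each student's current score from the prior-year score, then averages the gap between actual and predicted scores for each teacher.

```python
from statistics import mean

# Hypothetical records: (teacher, prior-year score, current-year score).
# All values are invented for illustration only.
students = [
    ("A", 60, 68), ("A", 70, 79), ("A", 80, 88),
    ("B", 60, 62), ("B", 70, 71), ("B", 80, 82),
]


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return y_bar - slope * x_bar, slope


def value_added(records):
    """Average gap between actual and predicted scores, per teacher."""
    intercept, slope = fit_line([r[1] for r in records], [r[2] for r in records])
    residuals = {}
    for teacher, prior, current in records:
        predicted = intercept + slope * prior
        residuals.setdefault(teacher, []).append(current - predicted)
    return {teacher: mean(gaps) for teacher, gaps in residuals.items()}


scores = value_added(students)
for teacher, score in sorted(scores.items()):
    print(teacher, round(score, 2))
```

In this toy data set, teacher A's students beat their predicted scores on average and teacher B's fall short, so A receives a positive score and B a negative one. Everything a model of this kind leaves out, such as differences between cohorts that prior scores do not capture, ends up attributed to the teacher.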
This may sound like an objective form of evaluation, giving school administrators a perfectly fair way to determine precisely how much or how little each teacher contributes to student achievement, relative to their peers in similar schools and classrooms. However, in all 15 lawsuits highlighted by Education Week, plaintiffs have argued that VAMs often fall far short of their intention to create fair, apples-to-apples comparisons of teachers’ effectiveness because their statistical models cannot account for the subtle ways in which groups of students differ from one year to the next. In no lawsuit was this clearer than in Lederman v. King.
In 2014, Sheri Lederman — a highly praised, National Board-certified 4th-grade teacher with 18 years of experience, working in Great Neck, N.Y. — received just one out of 20 possible points on her VAM score, despite having received 14 out of 20 points the year before. Her students’ test scores were not markedly different across those two years, and there was no discernible change in her teaching methods, yet the evaluation system had given her wildly different ratings.
Lederman brought suit in the New York State Supreme Court, prompting a well-publicized case that elicited affidavits from a number of testing experts (including the coauthor of this article), many of whom argued that this and other VAMs were unreliable (i.e., they lacked consistency over time). Lederman won her case, with the presiding judge ruling that the state’s teacher evaluation system, based primarily on teachers’ VAM scores, was “arbitrary and capricious” and “taken without sound basis in reason or regard to the facts” (Lederman v. King, 2016).
However, though this ruling was widely publicized — and though a number of researchers have found VAMs to be unreliable (e.g., Chiang et al., 2016; Yeh, 2013) — many districts continue to use VAMs as the basis for high-stakes accountability decisions, such as firing teachers or awarding them tenure or merit pay.
2) Overlooking evidence of bias
Research suggests that VAMs, whether used on their own or in combination with classroom observations (Gill et al., 2016; Steinberg & Garrett, 2016; Whitehurst, Chingos, & Lindquist, 2014), tend to be not just unreliable but also biased.
In Lederman’s case, for example, it appears that the evaluation system penalized her for working with students who tended to get high test scores. Because her students’ test scores were high to begin with, it was more or less impossible for Lederman to show significant improvement over time. (This “ceiling effect,” referring to the difficulty of raising scores that are already about as high as they can go, was also noted by the plaintiffs in Houston Federation of Teachers v. Houston Independent School District.) Similarly, researchers have found that many evaluation systems are also biased against teachers who work with low-scoring students (who are more likely to come from low-income families, to be non-White, and/or to be enrolled in English language or special education programs) (Kane, 2017; Newton et al., 2010; Rothstein, 2010). Teachers who teach disproportionate numbers of these low-scoring students are more likely to be evaluated not on how they teach but on factors outside their control.
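The ceiling effect can be illustrated with a few lines of arithmetic. The sketch below assumes a hypothetical test with a maximum score of 100 and assumes that every student learns the same amount (8 points' worth) in both classes; the class rosters and the helper `measured_gain` are invented for illustration and do not come from any of the lawsuits.

```python
MAX_SCORE = 100   # hypothetical test ceiling
TRUE_GROWTH = 8   # assumption: every student learns the same amount


def measured_gain(prior_scores, growth=TRUE_GROWTH, cap=MAX_SCORE):
    """Average observed gain when reported scores cannot exceed the cap."""
    gains = [min(prior + growth, cap) - prior for prior in prior_scores]
    return sum(gains) / len(gains)


# Invented prior-year scores for two classes.
mid_range_class = [60, 65, 70, 75]
high_scoring_class = [94, 96, 98, 100]

mid_gain = measured_gain(mid_range_class)
high_gain = measured_gain(high_scoring_class)

print("mid-range class gain:", mid_gain)
print("high-scoring class gain:", high_gain)
```

Although both classes learn the same amount by assumption, the high-scoring class shows a smaller measured gain because its students run out of room on the scale. A model that reads the smaller gain as weaker teaching penalizes the teacher for where the students started.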
Further, VAMs have also been found to show bias toward teachers of specific subject areas and grade levels, as argued by the plaintiffs in the Tennessee and New Mexico lawsuits. How, for example, can an evaluation system be described as fair when it consistently rates English teachers higher than math teachers, or rates 4th- and 8th-grade teachers higher than teachers of grades 5-7 (Holloway-Libell, 2015)? If a system is designed to measure the performance of individual teachers, then the results shouldn’t vary by the groups to which teachers belong.
3) Allowing people to game the system
Policy makers may have the power to design and introduce new teacher evaluation systems, but school leaders have the power to implement those systems, and they are unlikely to do so in ways that run counter to their professional expertise and interests. It should be no surprise, then, that local administrators will often find ways to inflate the ratings of teachers they want to protect, especially when they believe the given teacher evaluation system to be inaccurate, unfair, or prejudiced. Frequently, for example, principals will give teachers very high scores on their classroom observations to counterbalance their low VAM scores (Geiger & Amrein-Beardsley, 2017).
Conversely, administrators may be tempted to lower their ratings of teachers’ classroom practice to better align them with the VAM scores they’ve already received. So argued the plaintiffs in the Houston and Tennessee lawsuits, for example. In those systems, school leaders appear to have given precedence to VAM scores, adjusting their classroom observations to match them. In both cases, administrators admitted to doing so, explaining that they sensed pressure to ensure that their “subjective” classroom ratings were in sync with the VAM’s “objective” scores.
Whether administrators inflate or downgrade their ratings, such behavior distorts the validity (or “truthfulness”) of the entire teacher evaluation system. Yet, such gaming of the data appears to be common practice (Grossman et al., 2014; Hill, Kapitula, & Umland, 2011; Polikoff & Porter, 2014), in violation of the professional standards created by the major research organizations in education, psychology, and measurement (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999), which call for VAM scores and observation ratings to be kept separate — one should never be adjusted to fit the other.
The important lesson here is that teacher evaluation systems, like any invention, need to be tested in the real world to see how they translate from theory to practice and to give their designers the chance to make improvements. In Houston and Tennessee, that never happened. Rather, evidence suggests that policy makers pushed their teacher evaluation systems into place without first trying them out on a limited scale.
4) Avoiding transparency
The creators of VAMs tend to incorporate unique statistical components into their models, which allows each developer to claim that their product is different from (and superior to) others on the market. Indeed, the competition for market share is fierce. Some states and school districts have been willing to spend millions of dollars for tools that provide “objective” data they can use to evaluate teachers — for example, Houston paid $680,000 per year for its VAM (Amrein-Beardsley, 2016). Given the considerable profits at stake, evaluators are extremely protective of their proprietary materials, taking great care to guard their trade secrets. And when asked for more information, they tend to be reluctant to provide it, arguing that their “methods [are] too complex for most of the teachers whose jobs depended on them to understand” (Carey, 2017; see also Gabriel & Lester, 2013).
However, the money spent on these teacher evaluation systems comes from public tax revenues. In some parts of the country (New York, for example), state law requires that such uses of public funds be transparent to increase the likelihood that those services are, in fact, being delivered as promised. Further, critics of VAMs argue that, whatever the state law, public officials have no business purchasing tools and services that put the rights of consumers at risk. Teachers want (and deserve) to know how they are being evaluated, especially when their salaries and jobs may be at stake. Not only should they have access to this information for instructional purposes, but if they believe their evaluations to be unfair, they should be able to see all of the relevant data and calculations so that they can defend themselves.
Indeed, the presiding judge in the Houston lawsuit ruled, in 2017, that teachers did have legitimate claims to see how their scores were calculated. Concealing this information, the judge ruled, violated teachers’ due process protections under the 14th Amendment (which holds that no state — or in this case organization — shall deprive any person of life, liberty, or property, without due process). Given this precedent, it seems likely that teachers in other states and districts will demand transparency as well. Thus, if policy makers want to escape challenges to their teacher evaluation systems, they will need to open their VAMs to public scrutiny. The vendors of those data systems will likely object, but teachers and the larger tax-paying public have a right to know what they are paying for and to see whether those goods and services are as advertised.
Toward better teacher evaluation systems
Inventors, by necessity, must be optimists. They work at difficult problems with the assumption that some new angle or idea will provide a breakthrough. In that spirit, we conclude not by rehashing our critiques but by asking what steps might lead toward significant improvements in teacher evaluation. Given what we’ve learned from VAMs’ shortcomings, where should we go next?
Three approaches strike us as promising: using multiple measures to evaluate teachers, designing teacher evaluation systems that emphasize formative uses, and engaging teachers throughout the process of creating and refining these systems.
Multiple measures
Under existing VAM-based evaluation systems, teachers are often tempted to discount the ratings they receive, on the basis that their scores “jump around” from year to year (i.e., aren’t reliable), fail to account for what makes their teaching situation unique (i.e., aren’t valid), and/or are biased against them because of the students they happen to teach (whether high achievers, who have no room to improve, or students struggling with poverty or other challenges that make their improvement less predictable).
But if VAM scores are just one kind of indicator among many, then these methodological problems become less serious. For example, if teachers’ VAM scores point in the same direction as their ratings from principals’ observations, feedback from colleagues, and ratings from student surveys about their instruction, then it becomes difficult for them to argue that the evaluation isn’t credible. And if these indicators point in different directions, then it becomes difficult for administrators to argue that they have all the information they need — divergent ratings should prompt a closer look at what’s going on.
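One way to picture this check is as a simple comparison across indicators. The sketch below is only an illustration under stated assumptions: it supposes that each measure has already been converted to a shared 1-4 scale, the ratings are invented, and the one-point tolerance and the helper names `spread` and `needs_closer_look` are hypothetical choices rather than an established standard.

```python
def spread(ratings):
    """Gap between the highest and lowest rating a teacher received."""
    values = list(ratings.values())
    return max(values) - min(values)


def needs_closer_look(ratings, tolerance=1.0):
    """True when the measures disagree by more than the tolerance."""
    return spread(ratings) > tolerance


# Invented ratings, each assumed to be on a shared 1-4 scale.
consistent = {
    "vam": 3.0, "observation": 3.2,
    "peer_feedback": 3.1, "student_survey": 2.9,
}
divergent = {
    "vam": 1.2, "observation": 3.6,
    "peer_feedback": 3.4, "student_survey": 3.5,
}

flags = {
    "consistent": needs_closer_look(consistent),
    "divergent": needs_closer_look(divergent),
}
print(flags)
```

The point of the flag is not to decide which measure is right. When one indicator sits far from the others, as the VAM score does in the second invented profile, the disagreement itself is the signal that someone should look more closely before any decision is made.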
The use of multiple measures has become common in many other fields. In the business world, for example, hiring managers tend to consider applicants’ resumes, recommendations, and ratings from interviews. Similarly, admissions directors at selective colleges consider far more than just, say, test scores; they also look at students’ grade point averages, class ranks, recommendation letters, writing samples, talents, and interests. So, too, should state and district policy makers create evaluation systems that rely on multiple measures of teachers’ effectiveness.
An emphasis on the formative
Second, we recommend that policy makers shift the focus of teacher evaluation from “who should I fire?” to “how can I help teachers improve?” In a recent study of teacher evaluation in a large urban school district, for example, researchers found that in the more effective schools, administrators tended to use evaluation data first and foremost to inform professional development, rather than to reward or discipline teachers (Reinhorn, Moore Johnson, & Simon, 2017). In our own research, too, we’ve found that many state departments of education now seem to be shifting their focus in this way, encouraging districts to look for opportunities to use their teacher evaluation systems to provide formative feedback rather than using them strictly for accountability purposes (Close, Amrein-Beardsley, & Collins, 2018).
Teacher ownership and engagement
Lastly, we recommend that policy makers look for ways to include teachers in the process of developing and improving their own evaluation systems. As we noted earlier, when educators distrust the given approach, they often respond by trying to “game the system,” distorting their professional judgments to protect themselves and their colleagues from consequences they see as unfair and illegitimate. It is far better, we think, to build evaluation systems that educators trust, that align with their beliefs about teaching and learning, and that they see no reason to game — or, for that matter, to challenge in court. The best way to encourage such buy-in among teachers is to include them in deciding how they should be evaluated and with what tools and indicators. Since passage of the Every Student Succeeds Act, many states and districts have, in fact, brought more stakeholders into the process of designing teacher evaluation systems, and we’re hopeful that this will lead to improvement.
At present, things may seem fairly chaotic in the world of teacher evaluation, given the many lawsuits that have been brought against the use of and overreliance on VAMs. We’re optimistic about the coming years, though. As Mary Wollstonecraft Shelley wrote in her introduction to Frankenstein, “Invention . . . does not consist of creating out of void, but out of chaos.” If policy makers learn from the mistakes of the recent past, then they should be able to design teacher evaluation systems that are consistent, valid, fair, and useful.
American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC: American Educational Research Association.
Amrein-Beardsley, A. (2016, March 16). Alleged violation of protective order in Houston lawsuit, overruled. VAMboozled! http://vamboozled.com/alleged-violation-of-protective-order-in-houston-lawsuit-overruled
Carey, K. (2017, May 19). The little-known statistician who taught us to measure teachers. The New York Times.
Chiang, H., McCullough, M., Lipscomb, S., & Gill, B. (2016). Can student test scores provide useful measures of school principals’ performance? Washington, DC: U.S. Department of Education.
Close, K., Amrein-Beardsley, A., & Collins, C. (2018). State-level assessments and teacher evaluation systems after the passage of the Every Student Succeeds Act: Some steps in the right direction. Boulder, CO: National Education Policy Center.
Education Week. (2015, October 6). Teacher evaluation heads to the courts.
Gabriel, R. & Lester, J.N. (2013). Sentinels guarding the grail: Value-added measurement and the quest for education reform. Education Policy Analysis Archives, 21(9), 1-30.
Geiger, T. & Amrein-Beardsley, A. (2017). The artificial conflation of teacher-level “multiple measures” [Commentary]. Teachers College Record.
Gill, B., Shoji, M., Coen, T., & Place, K. (2016). The content, predictive power, and potential bias in five widely used teacher observation instruments. Washington, DC: U.S. Department of Education, Institute of Education Sciences.
Grossman, P., Cohen, J., Ronfeldt, M., & Brown, L. (2014). The test matters: The relationship between classroom observation scores and teacher value added on multiple types of assessment. Educational Researcher, 43(6), 293–303.
Hill, H.C., Kapitula, L., & Umland, K. (2011). A validity argument approach to evaluating teacher value-added scores. American Educational Research Journal, 48(3), 794-831.
Holloway-Libell, J. (2015). Evidence of grade and subject-level bias in value-added measures. Teachers College Record.
Houston Federation of Teachers Local 2415 et al. v. Houston Independent School District, 251 F. Supp. 3d 1168 (S.D. Tex., 2017).
Kane, M.T. (2017). Measurement error and bias in value-added models (ETS RR-17-25). Princeton, NJ: Educational Testing Service.
Lederman v. King, No. 26416, slip op. (N.Y. May 10, 2016). https://law.justia.com/cases/new-york/other-courts/2016/2016-ny-slip-op-26416.html
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Education Policy Analysis Archives, 18(23).
Polikoff, M.S. & Porter, A.C. (2014). Instructional alignment as a measure of teaching quality. Educational Evaluation and Policy Analysis, 36(4), 399-416.
Reinhorn, S.K., Moore Johnson, S., & Simon, N.S. (2017). Investing in development: Six high-performing, high-poverty schools implement Massachusetts’ teacher evaluation policy. Educational Evaluation and Policy Analysis, 39(3), 383–406.
Rothstein, J. (2010). Teacher quality in educational production: Tracking, decay, and student achievement. Quarterly Journal of Economics, 125(1), 175-214.
State of New Mexico ex rel. the Hon. Mimi Stewart et al. v. Public Education Department (First Judicial District Court). www.aft.org/sites/default/files/nm-complaint-teacherevals_1114.pdf
Steinberg, M.P. & Garrett, R. (2016). Classroom composition and measured teacher performance: What do teacher observation scores really measure? Educational Evaluation and Policy Analysis, 38(2), 293-317.
Trout v. Knox County Board of Education, 163 F.Supp.3d 492 (E.D. Tenn. 2016).
Whitehurst, G.J., Chingos, M.M., & Lindquist, K.M. (2014). Evaluating teachers with classroom observations: Lessons learned in four districts. Washington, DC: Brookings Institution.
Yeh, S.S. (2013). A re-analysis of the effects of teacher replacement using value-added modeling. Teachers College Record, 115(12), 1-35.
Citation: Close, K. & Amrein-Beardsley, A. (2018). Learning from what doesn’t work in teacher evaluation. Phi Delta Kappan, 100(1), 15-19.