Houston provides a cautionary lesson for districts making high-stakes decisions with flimsy tools.
By Audrey Amrein-Beardsley and Tray Geiger
In 2011, the Houston Independent School District terminated 221 teachers because they did not demonstrate, with surefire evidence, that they were contributing enough to student achievement. This happened shortly after a new superintendent came to town — a real maverick who pledged to hold teachers more accountable for their efforts, or a lack thereof, to improve student learning.
His strategy? Spend $680,000 per year to buy the district an intricate value-added model (VAM) — a statistical technique first used to estimate the effect of feed on cattle, as indicated by the quality of the beef ultimately produced and purchased per head — to measure the extent to which each Houston teacher contributed to improving student test scores. The VAM, he argued, would allow him to round up all of the system’s ineffective teachers, drive them out of town, and turn around the district’s historically poor performance. More specifically, he adopted the VAM to hold teachers more accountable for the growth their students demonstrated (or did not demonstrate) every year, with merit pay attached to incentivize teachers and threats of termination to disincentivize poor performance, all in the hope of increasing achievement and positioning the district as one of the best urban districts in the nation.
Five years and millions of dollars later, the Houston school board gave up on the VAM, discontinued the contract, and began looking at other ways to evaluate teachers’ performance. For his part, the superintendent managed to get out of Dodge, retiring in early 2016 with a satisfactory rating from the board, along with a $98,600 bonus (atop his $300,000 annual salary). However, the district finds itself embroiled in a lawsuit. It turns out that the superintendent’s VAM wasn’t just an ineffective way to measure teacher quality; it appears to have been downright unfair, giving teachers who were penalized or fired a legitimate reason to sue. (A lawsuit, brought by the local teachers union and a group of six Houston teachers, is now pending. In May 2017, the 5th Circuit Court ruled that the plaintiffs have a legitimate claim since the use of the VAM to determine sanctions and dismissals appeared to violate their Fourteenth Amendment due process protections. The court date is not yet known, but the case is set to go to trial.)
Given the cautionary experience in Houston, why in the world would other districts and states even consider adopting the same system?
The EVAAS: Why so widespread?
The VAM that Houston purchased is called the Educational Value-Added Assessment System® (EVAAS). Developed in the mid-1980s by William Sanders, an adjunct faculty member in agricultural statistics at the University of Tennessee, EVAAS gained broad appeal in the early 1990s when Tennessee stipulated that teachers, schools, and districts be held accountable for student learning based on the state’s educational goals. VAMs in general, and EVAAS in particular, then picked up more steam in response to No Child Left Behind and later the Race to the Top initiative. Today, the North Carolina-based business analytics software company SAS Institute Inc. customizes EVAAS for sale to individual states and school districts. (Arkansas, Georgia, Indiana, Texas, and Virginia are among the states that have given districts leeway to select whichever VAM they like.)
At the heart of EVAAS are advanced statistical models that estimate the value that individual teachers purportedly add to (or detract from) students’ test scores, relative to the value added by similar teachers (i.e. teachers working in the same subject area, at the same grade level, in the same district, with similar students, and so on). Based on these comparisons, EVAAS then calculates where teachers fall along a continuum, ranging from high value-added to low value-added. And these scores are then factored into teachers’ overall evaluations and subsequent rankings (e.g., highly effective, effective, ineffective, highly ineffective).
Nationally, EVAAS has become the biggest player in the market, partly because it is the oldest and best-known VAM around — when the Obama administration created the Race to the Top initiative (which pushed states to adopt value-added models to hold teachers accountable for student learning), EVAAS was well-positioned to beat out other software products.
Further, EVAAS’s marketing claims are powerful and attractive. For example, SAS Institute Inc. advertises EVAAS as the most comprehensive reporting package of value-added metrics available, one that yields precise, reliable and unbiased results that go far beyond what other simplistic [value-added] models found in the market today can provide. If implemented effectively, it will enable educators to recognize progress and growth over time, empowering them to make steady improvements in teaching and learning, which will open up a clear path to achieving the U.S. goal of leading the world in college completion by the year 2020 (SAS Institute, Inc., n.d.). (This last claim was casually added to EVAAS’s marketing claims to align with President Barack Obama’s goal to lead the world in college readiness and completion by 2020 [U.S. Department of Education, 2010]).
Keep in mind that the state- and district-level officials who went ahead and purchased EVAAS had little capacity to judge whether such promises were realistic or wildly overblown. Very few of them had the sort of advanced knowledge of statistics that might have allowed them to examine or challenge the research evidence behind the company’s marketing strategy.
It should be no surprise, then, that EVAAS has been unable to live up to its own hype. In Houston and many other parts of the country, it has proven itself to be about as useful as a set of udders on a bull.
See, for example, Figure 1, which shows that while EVAAS was in use for educational reform purposes in Houston (i.e., to increase student achievement), Houston students saw no improvements of the sort that had been promised in grades 3-8 in reading, grades 4 and 7 in writing, grades 5 and 8 in science, and grade 8 in social studies (Figure 1, blue trend lines). In those subject areas and grades, test scores declined overall from 2012 to 2015, as compared to other similar students throughout the state (black trend lines). Indicators include those derived via Texas’s (relatively new) State of Texas Assessments of Academic Readiness (STAAR) tests (Houston Independent School District, 2015a; see also Amrein-Beardsley et al., 2016). Recall that this is the growth (or lack thereof) that Houston posted at the same time it used the EVAAS for more high-stakes, consequential purposes than anywhere else in the nation.
Further, and as Figure 2 shows, student scores on high school end-of-course tests in Algebra I, Biology, English I and II, and U.S. History (blue trend lines), actually declined from 2012 to 2015 — again, as compared to similar students throughout the state (black trend lines) (Houston Independent School District, 2015b).
Given the student achievement data, it seems clear that the use of EVAAS did not result in the gains that Houston’s superintendent anticipated; there is no evidence of any such gains. Moreover, and as we describe below, we’ve found evidence suggesting that EVAAS did not even provide the valid, reliable, and unbiased teacher performance data that it promised (and for which the district paid a great deal of money).
What has happened — or didn’t happen — in Houston should, accordingly, be seen as a cautionary tale for education officials everywhere. EVAAS is still the most popular proprietary VAM in use throughout the U.S. (see, for example, Collins & Amrein-Beardsley, 2014), as it is mandated and used statewide in some parts of the country (e.g., North Carolina, Ohio, and Tennessee), with some states (e.g., Texas) and other large and small school districts either using or looking to adopt it for similar reform purposes.
Over the past few years, researchers have raised serious questions about the validity and usefulness of VAMs in general (AERA, 2015; ASA, 2014; NASSP, n.d.). In our own recent analysis, we looked specifically at Houston’s experience with EVAAS: We analyzed more than 1,700 Houston teachers’ value-added results, from 2012 to 2015, to investigate the software maker’s core claim about the quality of its product, the argument that it delivers precise, reliable, and unbiased results that go far beyond what other simplistic [value-added] models found in the market today can provide.
This is not to suggest that precision, reliability, and bias are the only issues worth exploring. For example, we are also concerned about the fairness of the model, since teachers of core subject areas have been disproportionately rewarded and punished based on calculations that may not be valid. And we are concerned about model transparency, since we have heard anecdotally and seen reported in recent research (e.g., Collins, 2014; Kappler Hewitt, 2015) that EVAAS is less accessible and more enigmatic than other VAMs in the market. For this study, however, we narrowed our investigation to the three issues that EVAAS itself highlights in its promotional materials: precision, reliability, and lack of bias.
Overall, we found that EVAAS performed no better than other VAMs (that have been used in other large school districts) in terms of validity and reliability. And when it comes to bias, EVAAS’s record seems to be worse than others: It yielded more biased value-added estimates than those produced by other VAMs. In short, EVAAS, at least in Houston, failed to live up to its own marketing claims. (Below, we provide a short overview of our findings. We expect a full, peer-reviewed account of our study, research methodology, and findings to be published within the next several months.)
Precision (or validity)
In Houston, did EVAAS yield valid and accurate inferences about the quality of teachers’ instruction, and how did they compare to teachers’ ratings from being observed in the classroom?
We found that the correlations between Houston teachers’ instructional practice scores (which were derived from supervisors’ observations of teachers during instruction) and EVAAS scores over three years were statistically significant, meaning that EVAAS produced teacher ratings that were at least somewhat consistent with the ratings given by trained classroom observers. However, and more important, those correlations were fairly weak, and they were no stronger than the correlations that have been found between other VAMs and other teacher observation systems. In short, we found no evidence to suggest that EVAAS ratings of teacher performance were more precise, or valid, than those produced by other VAMs on the market. Rather, EVAAS is on par with other VAMs, if not a bit below average.
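For readers who want to see how a correlation can be statistically significant and still weak, here is a minimal Python sketch. It is not the analysis used in the study; the function names (`pearson_r`, `t_statistic`) and the numbers plugged in (a correlation of 0.2 across 1,700 pairs) are hypothetical and chosen only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


def t_statistic(r, n):
    """t statistic for testing whether a correlation r from n pairs differs from zero."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and -1 < r < 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical values for illustration only (not results from the Houston study).
r = 0.2
n = 1700
print(f"shared variance: {r ** 2:.2f}")
print(f"t statistic: {t_statistic(r, n):.1f}")
```

With these made-up inputs, the t statistic comes out at roughly 8.4, far past the usual threshold for significance, while the two measures share only about 4% of their variance. A large sample can make even a faint relationship "significant," which is why the strength of the correlation matters more than its significance when judging validity.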
Reliability (or consistency)
In Houston, did teachers’ EVAAS scores remain at least somewhat consistent over time (as they should if the results are to be deemed appropriately stable)?
We found that the correlations between Houston teachers’ EVAAS estimates across years were statistically significant, showing a moderate relationship among teachers’ EVAAS estimates over time. But even so, while some Houston teachers received similar EVAAS scores from year to year, most received dissimilar scores, and these scores sometimes varied wildly. During the three years of our study, more than 65% of Houston teachers received scores that differed from one year to the next by two, three, or four EVAAS categories (a swing from highly ineffective to highly effective, from effective to ineffective, or anything in between). In short, EVAAS did not provide especially reliable teacher ratings. Its indicators of consistency, or (in)stability, were commensurate with what other VAMs tend to produce, which is to say that a single teacher might easily be ranked as effective one year, ineffective the next, highly effective the year after that, and so on.
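The year-to-year comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: it assumes ratings are stored as whole numbers from 1 (lowest category) to 5 (highest), and the function name `share_with_large_shift`, the teacher IDs, and the ratings are all hypothetical.

```python
def share_with_large_shift(ratings_a, ratings_b, min_shift=2):
    """Fraction of teachers rated in both years whose category moved by min_shift or more.

    ratings_a and ratings_b map a teacher ID to an ordinal category (assumed 1 to 5).
    """
    rated_both_years = ratings_a.keys() & ratings_b.keys()
    if not rated_both_years:
        raise ValueError("no teachers were rated in both years")
    shifted = sum(
        1
        for teacher in rated_both_years
        if abs(ratings_a[teacher] - ratings_b[teacher]) >= min_shift
    )
    return shifted / len(rated_both_years)


# Hypothetical ratings for illustration only (not data from Houston).
year_one = {"T1": 5, "T2": 3, "T3": 1, "T4": 4, "T5": 2}
year_two = {"T1": 2, "T2": 3, "T3": 4, "T4": 5, "T6": 1}

print(share_with_large_shift(year_one, year_two))
```

Only teachers rated in both years are counted, so "T5" and "T6" drop out of the comparison here. Of the four who remain, two moved by two or more categories, giving a share of 0.5.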
Bias (or the lack thereof)
In Houston, were EVAAS estimates unbiased, or could they have been biased against certain teachers such as those who choose, or are assigned, to work with particular student populations?
We found that teachers with the least experience — those with fewer than two years — and teachers on probationary contracts had significantly lower EVAAS scores than all other teachers. This makes sense, given the assumption that experienced teachers tend to be more effective than novice teachers.
However, what does not make sense is that mathematics teachers had significantly higher EVAAS scores compared to those of other subject-area teachers. Are we to believe that mathematics teachers in Houston happen to be better than teachers of other subject areas? Or, as seems more likely, are the EVAAS scores biased in favor of teachers who teach mathematics and biased against teachers who do not? This interpretation is consistent with recent findings on the use of EVAAS elsewhere in the country (Holloway-Libell, 2015).
But that’s just the tip of the iceberg: We found also that teachers tended to get significantly higher EVAAS scores when they taught in schools with the smallest populations of racial minority students, the lowest numbers of English language learners, the smallest percentages of lower-income students, and the lowest numbers of special education students.
Again, two interpretations are possible. First, Houston may just happen to have more highly skilled and effective teachers working in schools that enroll the fewest numbers of minority students, English learners, students who are eligible for free and reduced-price lunch, and students with special needs. (If true, that should raise serious questions about how the district allocates its best teachers.) Second, and more likely, EVAAS generated biased results, producing lower ratings for teachers who happen to work with these student populations (as others have found to be the pattern with VAMs in general; Collins & Amrein-Beardsley, 2014; Kane, 2017; Koedel, Mihaly, & Rockoff, 2015; Newton et al., 2010).
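A first step in checking for patterns like these is simply to compare average scores across groups of teachers. The Python sketch below shows one way to do that. It is an assumption-laden illustration rather than the study's method: the record layout, the field names (`subject`, `score`), the function `mean_score_by_group`, and the scores themselves are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_group(records, group_key, score_key="score"):
    """Average value-added score for each group (e.g., subject taught)."""
    scores_by_group = defaultdict(list)
    for record in records:
        scores_by_group[record[group_key]].append(record[score_key])
    return {group: mean(scores) for group, scores in scores_by_group.items()}


# Hypothetical teacher records for illustration only (not data from Houston).
teachers = [
    {"subject": "math", "score": 0.5},
    {"subject": "math", "score": 1.5},
    {"subject": "reading", "score": -1.0},
    {"subject": "reading", "score": 0.0},
]

averages = mean_score_by_group(teachers, "subject")
print(averages)
```

A gap between group averages does not, by itself, say which of the two interpretations is correct. It only flags a pattern that then has to be weighed against what is known about how teachers and students are assigned to schools and subjects.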
In sum, the results suggest that the EVAAS did not, at least in Houston and perhaps elsewhere, offer precise, reliable, and unbiased results that go far beyond what other simplistic [value-added] models found in the market today can provide, as the SAS Institute Inc. claims. Rather, evidence shows that EVAAS ratings were no more precise or reliable than those offered by other VAMs, and they were biased against certain teachers.
Although EVAAS data from Houston does not entirely negate the possible benefits of VAMs in general, it does call into serious question the purported benefits of using such measures for high-stakes decision making, such as teacher termination, merit pay, and denial of teacher tenure. State and district policy makers ought to hesitate before relying on what might be a set of largely exaggerated and uncorroborated claims.
Perhaps others should learn from what happened in Houston. Again, the district’s former superintendent adopted the VAM to hold teachers more accountable for what they were doing (or not doing) to add value to their students’ learning and achievement. Holding them accountable, so his logic went (as was also informed by the marketing ploys advanced by EVAAS proprietors), would help reform Houston students’ historically low academic performance. Clearly, this did not work — even worse, the attachment of high-stakes consequences to EVAAS output (also leading to the termination of 221 teachers in 2011) has landed the district in court.
In short, anyone who is considering buying the EVAAS system (or any VAM for that matter) would be well-advised to investigate the model’s advertised strengths and claims thoroughly before making a substantial investment in the system. They should ask, for example, is there any real evidence to suggest that the EVAAS will help us become proactive, [make] sound instructional choices and [use] resources more strategically to ensure that every student has the chance to succeed? Is there any legitimate reason to believe that it will allow us to shrink the gap between education rhetoric and education reform? (SAS Institute, Inc., n.d.).
And if state and district officials are unsure that they can judge the claims made by the companies that market VAMs, or are unsure they can make sense of recent evaluations of the use of VAMs, they should look for somebody who can help. There are plenty of us out here who’ve seen the data, know the research, and will be happy to explain why the VAM industry is all sizzle, no steak.
American Educational Research Association (AERA) Council. (2015). AERA statement on use of value-added models (VAM) for the evaluation of educators and educator preparation programs. Educational Researcher, 44 (8), 1-5. http://bit.ly/AERAonVAMS
American Statistical Association (ASA). (2014). ASA statement on using value-added models for educational assessment. www.amstat.org/asa/files/pdfs/POL-ASAVAM-Statement.pdf
Amrein-Beardsley, A., Collins, C., Holloway-Libell, J., & Paufler, N.A. (2016). Everything is bigger (and badder) in Texas: Houston’s teacher value-added system. [Commentary]. Teachers College Record. www.tcrecord.org/Content.asp?ContentId=18983
Collins, C. (2014). Houston, we have a problem: Teachers find no value in the SAS Education Value-Added Assessment System (EVAAS®). Education Policy Analysis Archives, 22, 2-139
Collins, C. & Amrein-Beardsley, A. (2014). Putting growth and value-added models on the map: A national overview. Teachers College Record, 116 (1). www.tcrecord.org/Content.asp?ContentId=17291
Holloway-Libell, J. (2015). Evidence of grade and subject-level bias in value-added measures. Teachers College Record, 117.
Houston Federation of Teachers (Plaintiff) v. Houston Independent School District (Defendant), Civil No. 4:14-CV-01189. (2015). United States District Court, Southern District of Texas, Houston Division.
Houston Independent School District. (2015a). State of Texas Assessments of Academic Readiness performance, grades 3-8, spring 2015. Houston, TX: Author. http://bit.ly/HISD_STAAR
Houston Independent School District. (2015b). State of Texas Assessments of Academic Readiness end-of-course results, spring 2015. Houston, TX: Author. www.houstonisd.org/Page/69852
Kane, M.T. (2017). Measurement error and bias in value-added models (ETS RR–17-25). Princeton, NJ: Educational Testing Service. doi:10.1002/ets2.12153. http://onlinelibrary.wiley.com/doi/10.1002/ets2.12153/full
Kappler Hewitt, K. (2015). Educator evaluation policy that incorporates EVAAS value-added measures: Undermined intentions and exacerbated inequities. Education Policy Analysis Archives, 23 (76), 1-49.
Koedel, C., Mihaly, K., & Rockoff, J.E. (2015). Value-added modeling: A review. Economics of Education Review, 47, 180-195.
National Association of Secondary School Principals (NASSP). (n.d.). Value-added measures in teacher evaluation: Position statement. Reston, VA: Author. www.nassp.org/who-weare/board-of-directors/position-statements/value-added-measures-in-teacher-evaluation?SSO=true
Newton, X., Darling-Hammond, L., Haertel, E., & Thomas, E. (2010). Value-added modeling of teacher effectiveness: An exploration of stability across models and contexts. Education Policy Analysis Archives, 18 (23). doi: 10.14507/epaa.v18n23.2010
SAS Institute Inc. (n.d.). SAS EVAAS for K-12: Assess and predict student performance with precision and reliability. Cary, NC: Author. www.sas.com/en_ph/industry/k-12-education/evaas.html
U.S. Department of Education. (2010). International education rankings suggest reform can lift U.S. https://blog.ed.gov/2010/12/international-education-rankings-suggest-reform-can-lift-u-s/
AUDREY AMREIN-BEARDSLEY (firstname.lastname@example.org) is a professor of educational policy and evaluation and TRAY GEIGER is a doctoral student, both at Mary Lou Fulton Teachers College, Arizona State University, Tempe, Ariz.
Originally published in October 2017 Phi Delta Kappan 99 (2), 53-59. © 2017 Phi Delta Kappa International. All rights reserved.