Testing international education assessments

News stories on international large-scale education assessments (ILSAs) tend to highlight the performance of the media outlet’s home country in comparison with the highest-scoring nations (in recent years, typically located in East Asia). Low (or declining) rankings can be so alarming that policy-makers leap to remedies—often ill-founded—on the basis of what they conclude is the “secret sauce” behind the top performers’ scores. As statisticians studying the methods and policy uses of ILSAs (1), we believe the obsession with rankings—and the inevitable attempts to mimic specific features of the top-performing systems—not only misleads but also diverts attention from more constructive uses of ILSA data. We highlight below the perils of drawing strong policy inferences from such highly aggregated data, illustrate benefits of conducting more nuanced analyses of ILSA data both within and across countries, and offer concrete suggestions for improving future ILSAs.

Despite our critiques, ILSAs’ high costs, and policy-makers’ often misguided inferences, we certainly do not believe that ILSAs—such as Programme for International Student Assessment (PISA), Trends in International Mathematics and Science Study (TIMSS), or Progress in International Reading Literacy Study (PIRLS)—should be abandoned. ILSAs provide a unique reference framework for understanding national results and patterns of relationships. As such, they can be invaluable for mobilizing a country’s political will to invest resources in education. ILSA data have also served as the foundation for hundreds of secondary analyses that have addressed important education policy questions. But to fulfill their promise, changes are needed in their interpretation, dissemination, and analysis and in the strategies used to design future assessments.

Unpacking East Asian Successes

In 2012, the seven jurisdictions with the highest mean mathematics PISA scores were Shanghai, Singapore, Hong Kong, Taiwan, South Korea, Macau, and Japan. Before concluding that adopting any feature of East Asian education—such as mastery textbooks, the “solution du jour” recently adopted by Great Britain (2)—would improve the test scores of students elsewhere, consider the following.

ILSA samples aren’t necessarily representative of a jurisdiction’s relevant age (or grade) group. PISA defines its target population as “15-year-olds enrolled in education institutions full-time.” Until recently, China had an internal passport system in which migrants from the countryside could not enroll in urban schools. In 2014, the Organization for Economic Cooperation and Development (OECD), which sponsors PISA, admitted that the 2012 Shanghai sample excluded 27% of 15-year-olds (in the United States, the comparable statistic was ∼11%) (3). Less-developed OECD countries, such as Mexico and Turkey, have similar problems because as many as 40% of their 15-year-olds have already dropped out. The full consequences of these exclusions for country-level means are not known, but it is likely that excluded students come from the lower end of the achievement distribution.

Results from single cities (such as Shanghai), city states (such as Singapore), or countries with national education systems (such as France) aren’t comparable with those from countries with decentralized systems (such as the United States, Canada, and Germany). For decentralized systems, country-level summaries are almost meaningless because they conceal substantial within-country heterogeneity in policies and practices. In 2015, for example, Massachusetts participated in PISA as a separate jurisdiction. Its mean reading score was statistically indistinguishable from those of the top performing East Asian nations; its mean mathematics score was more middling, but if the state were treated as a country, it would still have ranked 12th (4).

The number of credible school-based predictors of student test scores always exceeds the number of countries assessed. Even as ILSA coverage expands, most include 50 to 75 countries or jurisdictions. Education experts could suggest 50 to 75 credible predictors of country-level test scores: Mastery textbooks? Teacher training? Teaching methods? Peer effects? With as many plausible predictors as there are countries in any analysis, it’s impossible to conclude whether any particular feature of an education system—even one highly correlated with country-level mean test scores—definitively explains differences in student performance.
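The arithmetic behind this point can be shown with a small simulation: when there are as many candidate predictors as there are countries, a linear model will “explain” country-level means perfectly even when every predictor is pure noise. The sketch below uses entirely invented data and is only an illustration of the statistical trap, not an analysis of any real ILSA.

```python
import numpy as np

rng = np.random.default_rng(0)
n_countries = 60       # roughly the size of a recent PISA round
n_predictors = 60      # as many plausible policy features as countries

# Entirely random "policy features" -- no real relationship to scores.
X = rng.normal(size=(n_countries, n_predictors))
scores = rng.normal(500, 30, size=n_countries)  # invented country means

# Least-squares fit: with as many predictors as countries,
# the model reproduces the observed means essentially exactly.
beta, *_ = np.linalg.lstsq(X, scores, rcond=None)
fitted = X @ beta
r_squared = 1 - np.sum((scores - fitted) ** 2) / np.sum((scores - scores.mean()) ** 2)
print(round(r_squared, 6))  # 1.0 -- a perfect "explanation" from pure noise
```

The same mechanism operates, less visibly, whenever analysts sift through dozens of system features for the one most correlated with country means.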

Test scores are also affected by many factors outside of school, so it’s a mistake to treat test scores as unconditionally valid indicators of the quality of an education system. For example, in East Asia, as elsewhere, private tutoring is widespread. Korea is the most prominent example: approximately half of its 2012 PISA participants reported receiving private tutoring, often focused on test preparation. Overall, expenditures on private tutoring added 2.6% of gross domestic product to the government’s contribution of 3.5% toward education (5). One plausible interpretation of Korea’s and other East Asian jurisdictions’ ILSA results is that they reflect not the public education systems but rather this substantial private investment.

Student motivation to take low-stakes assessments varies across countries. ILSA scores have no consequences for individuals, so students must be motivated to do their best. It’s plausible that students from East Asian cultures score higher partially because they’re conditioned to perform their best on any test, even a low-stakes ILSA.

Rankings are derived from country-level means, and the corresponding confidence intervals are sufficiently wide that countries with substantially different ranks may be statistically indistinguishable. For example, in PISA 2015, Canada ranked 10th in mathematics, but its 95% confidence interval overlaps with Korea (ranked 7th) and Germany (ranked 16th) (6). Over time, rankings are also affected by which jurisdictions take part in a particular ILSA administration, so that a country’s ranking can shift even when its performance remains stable.
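The comparison underlying such statements is simple: two country means differ significantly only when the gap between them is large relative to both standard errors. A minimal sketch, using a normal-approximation z-test with illustrative means and standard errors (invented numbers, not the published PISA estimates):

```python
import math

def significantly_different(mean1, se1, mean2, se2):
    """Two-sided z-test on the difference of two independent country means."""
    z = abs(mean1 - mean2) / math.sqrt(se1 ** 2 + se2 ** 2)
    return z > 1.96  # 5% significance level

# Illustrative means and standard errors (not the published PISA figures):
# two countries several ranks apart whose means are indistinguishable.
print(significantly_different(516, 2.3, 524, 3.7))  # False
```

An 8-point gap looks decisive in a league table, yet with standard errors of a few points it can fall well short of statistical significance, which is why reporting ranks without uncertainty is misleading.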

Even if no single critique “explains away” the strong performance of students in East Asian countries, this constellation of concerns calls into question any simplistic conclusion based on rankings. Yet as long as ILSA results are primarily reported as league tables, a mix of nationalism, fears about global competitiveness, and human nature inevitably leads policy-makers in countries with poor or declining performance toward unitary “silver bullet” solutions based on highly aggregated data.

More Promising Nuanced Analyses

We see much more promise in analyses of ILSA data disaggregated to levels below the country, be it by geographic region, province, or state, or by school or student characteristics. The best of these analyses incorporate a second data source now common in ILSAs: background questionnaires filled out by students, parents, teachers, and/or principals.

Jerrim (7), for example, used 2012 PISA data from Australia to probe the East Asian success stories. The subgroup of second-generation immigrants from top-performing East Asian countries had scores comparable with those of the top-performing East Asian countries themselves, despite attending Australian schools that generally had much lower scores. Even acknowledging that immigrants are not a random sample of individuals from the originating country, this analysis suggests that family factors play a substantial role in ILSA test score differences and illustrates insights that come from within-country analyses.

Within-country analyses, of course, do not require ILSA data. The promise of ILSAs comes from comparing within-country analyses across countries. For example, Schmidt and McKnight (8) found that countries that have less curricular coverage of particular topics—for example, the United States does not emphasize physical sciences in curricula before high school—have poorer performance on those topical dimensions of TIMSS than countries whose curriculum covers more of that content (such as Korea). Secondary analyses such as these abound but rarely have the same impact as the initial league tables because these results typically appear long after the data release, sometimes after the next round of ILSA rankings has been released.

Another persistent analytic challenge is the construction of indicators of background characteristics that are equally reliable and valid across countries and cultures. Creating comparable composites—for example, of socioeconomic status (SES)—is even harder. PISA’s SES indicator, the International Socio-Economic Index of Occupational Status, faces multiple obstacles, including varying views of parental occupational status (for example, the relative status of engineers, doctors, and teachers differs across countries) and whether the items combined (for example, having a desk at home or a cellphone) really tap into a common underlying variable across countries (1).

Despite cross-cultural equivalence challenges, we find three types of cross-country comparisons of within-country analyses especially promising because—even when they do not yield causal inferences, and they usually do not—they can suggest interesting hypotheses worthy of further study.

Benchmark within-country relationships across culturally similar countries. Consider two illustrative comparisons of 2012 PISA mathematics scores: (i) Hong Kong and Taiwan have very similar means, but the strength of the relationship between students’ scores and SES is three times greater in Taiwan than in Hong Kong; and (ii) Canada’s mean is 37 points higher than that of the United States, but the relationship between scores and SES is noticeably weaker in Canada than in the United States (9). We believe that benchmarking analyses such as these are more likely than league tables to lead countries with less equitable systems (here, Taiwan and the United States) to experiment with strategies their neighbors have used to improve educational equity.
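Benchmarking of this kind amounts to fitting the same within-country regression of student scores on SES in each jurisdiction and comparing the slopes. The sketch below uses simulated student samples with invented score levels and gradients (it is not an estimate for any real country), but it shows the basic computation: two systems with nearly identical means can have very different score-SES relationships.

```python
import numpy as np

rng = np.random.default_rng(1)

def ses_slope(ses, scores):
    """OLS slope of test score on an SES index for one country's sample."""
    slope, _intercept = np.polyfit(ses, scores, 1)
    return slope

# Simulated samples of 5000 students per country (all numbers invented):
# same mean score, but country B has a much steeper score-SES gradient.
ses_a = rng.normal(0, 1, 5000)
ses_b = rng.normal(0, 1, 5000)
scores_a = 560 + 15 * ses_a + rng.normal(0, 80, 5000)  # weak gradient
scores_b = 560 + 45 * ses_b + rng.normal(0, 80, 5000)  # strong gradient

print(round(ses_slope(ses_a, scores_a)))  # near 15
print(round(ses_slope(ses_b, scores_b)))  # near 45
```

Reporting both the mean and this gradient, country by country, conveys far more about equity than a single league-table position.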

A focus on the distribution of achievement within countries is needed. Recognizing the limitations of comparing means, ILSA league tables are increasingly supplemented with the percentages of students in each jurisdiction with especially high or low scores. For example, only 9% of U.S. students scored in the top two categories on the 2012 PISA mathematics assessment. By comparison, five OECD countries and six participating jurisdictions had more than double that percentage (9). Although these percentages are imprecise (as we explain above, regarding country-level means), differences of this magnitude suggest that the United States might consider experimenting with different approaches to offer high-achieving students appropriately challenging learning opportunities.

Comparing within-country natural experiments across countries would also be valuable. Natural experiments provide a credible basis for causal inference when researchers can effectively argue that the assignment of individuals to “treatments” is approximately random. Natural experiments that use ILSA data go a step further by assessing—using a common outcome—whether similar treatment effects are found in multiple jurisdictions. Bedard and Dhuey (10), for example, examined the effect of age at kindergarten entry on student achievement. Younger students—in both fourth and eighth grades—scored significantly lower, on average, than their older peers, across a wide range of countries, demonstrating that, despite hopes that within-grade age differences would fade over time, they persist into adolescence. This suggests the need to study further the long-term effects of school entry policies on student outcomes and, perhaps, experiment with alternative policies (for example, allowing parents to wait another year before enrolling a child who would otherwise be especially young for his or her grade).

Five Suggestions for Improvement

ILSAs have improved considerably since the First International Mathematics Study of 1964: The tests are better; cross-cultural equivalence is a recognized, if not fully realized, priority; test administration and scoring methods have been upgraded; and sampling strategies and analytic approaches have been enhanced. But opportunities for improvement remain. We offer five concrete suggestions that we believe are most likely to yield payoffs commensurate with the increased costs associated with each.

ILSA sponsors—OECD for PISA and the Programme for the International Assessment of Adult Competencies (PIAAC), and the International Association for the Evaluation of Educational Achievement (IEA) for TIMSS and PIRLS—should develop communications materials and strategies that de-emphasize rankings. There is no doubt that country rankings (league tables) are a major driver of ILSA-related publicity. All ILSAs have governing boards of representatives from participating countries that play a major role in setting ILSA policies; these boards, if they had the political will, could press sponsors to disseminate results by using strategies that make it easier for the media to do a better job of reporting results and presenting balanced interpretations.

For example, when the U.S. National Center for Education Statistics released data from 2014 PIAAC—the U.S. ranked 13th—they color-coded the graphs to help readers compare the United States to three groups of countries: the seven with means that were significantly higher, the eight with means that were statistically indistinguishable (some higher; some lower), and the six with means that were significantly lower (11). Statistics Canada’s comparable PIAAC report presented graphs of country-level means accompanied by 95% confidence intervals and box plots displaying each country’s 5th, 25th, 75th, and 95th percentiles; they also displayed the results for Canada as a country and for each of its 10 provinces (12). Individual jurisdictions are particularly well positioned to present data this way because they can use their home country as the focal anchor.

National statistical agencies should facilitate linking their ILSA data to other data sources. Many secondary analyses of ILSA data would benefit from better measures of background characteristics, especially SES. In economically well-developed countries where census data are routinely collected (including most OECD countries), reliable indicators of school and community characteristics collected for other purposes could be linked to ILSA data. These indicators would be more accurate and fine-grained than anything currently available in ILSAs. Countries with national registration systems (such as Norway and Sweden) have even greater opportunities. Sweden’s National Agency for Education, for example, linked TIMSS scores to students’ school grades, finding a strong positive correlation (13). This lends greater credibility to the validity claims of these low-stakes assessments, at least in Sweden and perhaps in other Scandinavian countries.

Capitalizing on the move to digitally based assessments (DBAs) should be a priority. ILSAs are shifting to computerized administration (as are many national assessments). DBAs will improve administration and data processing and can enhance accuracy through adaptive testing, which tailors successive questions to the current estimate of a student’s proficiency level. Log files documenting participant interactions with the assessment can be used to study response patterns and their relationships to student effort and performance (1).

Piloting the addition of longitudinal components to current cross-sectional designs would also be helpful. Learning is about change, not status. To compare student performance and credibly identify its predictors both within and across education systems, longitudinal studies are essential (14). Realizing the benefits of longitudinal data, many countries track their students: some of these studies are based on ILSA samples (for example, Denmark and Switzerland have followed PISA participants), whereas others are homegrown (such as the Early Childhood Longitudinal Study in the United States and the National Educational Panel Study in Germany). Although country-specific tests can, in principle, be linked to ILSAs across countries for particular ages or grades, the linkages have been too weak to support cross-country analyses of within-country longitudinal data. Of course, there are some reasonable objections to this suggestion: (i) longitudinal tracking is expensive, (ii) attrition is a major concern, and (iii) it isn’t clear that the knowledge gain would be worth the cost. But the only way to address these concerns is to try some pilots and use the experience to better gauge costs against benefits. We believe the move to DBAs may make experimentation with longitudinal extensions feasible and worthwhile.

An alternative that has achieved some success is tracking a particular cohort by drawing (different) random samples of that cohort over time so as to construct “pseudo-panel” data. For example, the PISA cohort of 15-year-olds is contained within the PIAAC population of 20- to 25-year-olds 10 years later. A related approach is to use a difference-in-differences methodology to make putative causal inferences from such data. The core idea is to relate, at the country level, changes over time in explanatory factors to corresponding changes in test performance.
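The core calculation can be written in one line: subtract the comparison group’s change over time from the change in the group whose policy shifted. A minimal sketch with invented country-level means (the policy, groups, and all figures are hypothetical, chosen only to illustrate the arithmetic):

```python
def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Difference-in-differences estimate at the country level:
    the treated group's change minus the comparison group's change."""
    return (treated_after - treated_before) - (control_after - control_before)

# Suppose one group of countries changed a policy between ILSA rounds
# and a comparison group did not (all figures invented for illustration):
effect = diff_in_diff(treated_before=495, treated_after=510,
                      control_before=500, control_after=506)
print(effect)  # 9: the score gain beyond the common trend
```

The comparison group’s change (here, 6 points) stands in for the trend the treated countries would have followed anyway; the estimate is only as credible as that parallel-trends assumption.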

Findings from ILSA analyses should be used to stimulate randomized field trials (RFTs) that test the effects of specific interventions. This is less about improving ILSAs per se and more about appropriate steps after ILSA analyses suggest potential policy interventions. Of our five suggestions, we believe that this one is most likely to yield improvements in education policy. OECD and IEA have conducted RFTs investigating the effects of various features of ILSA design (such as modes of administration), but to our knowledge, none have suggested that countries use the results of ILSA analyses—that, at best, raise interesting questions—to sponsor RFTs in order to rigorously test the putative causal effects of policies identified through analyses of ILSA data.

Despite their many problems, ILSAs are—and should be—here to stay. They can serve as powerful instruments of change, although the nature and extent of their influence varies by country and over time (15). But to truly fulfill their promise, the ILSA community should begin planning trials and experiments with these and other strategies (1).