Recent Research: Quantitative Methods for Policy Research

Methodological Training for Education Research

IES-Sponsored Research Training

Now in its eighth consecutive year, the IES Summer Institute on Cluster-Randomized Trials (CRT) took place from July 7–17 in Evanston. Organized by Hedges and Spyros Konstantopoulos of Michigan State University, the institute seeks to provide researchers from around the country not only with the basic design and analysis skills used in CRT experiments but also with a rigorous methodological framework and perspective. The sessions encompass a broad range of topics in the design and execution process, from relevant statistical software to the more conceptual challenges, such as the framing of results.

From July 21–23, Hedges and Konstantopoulos conducted a specially designed workshop, the Research Design Workshop for Faculty from Minority-Serving Institutions. It welcomed 15 participants from Queens College, Tuskegee University, and the University of New Mexico, among others. The organizers worked with IES’ Katina Stapleton and Christina Chhin to create a workshop that would strengthen participants’ research methodology and better equip them to take part in the world of rigorous education research. Participants received instruction on research design fundamentals, including threats to validity, regression, and a brief overview of hierarchical linear modeling.

Improving the Design and Quality of Experiments

Validity of Interrupted Time Series

Although randomized controlled trials (RCTs) are the gold standard of experimental testing, they are not always feasible for ethical or practical reasons. One popular quasi-experimental alternative is the interrupted time series (ITS) design, in which the impact of an intervention is measured by comparing data from before and after the intervention takes place. While ITS designs are common in practice, hardly any previous studies address their validity. In the American Journal of Evaluation, Cook, Travis St. Clair of the University of Maryland, College Park, and Kelly Hallberg of the American Institutes for Research, the latter two both former IPR postdoctoral fellows, conduct both a randomized experiment and a comparative ITS (CITS) experiment on an educational topic and compare the effects. They find that CITS designs produce impact estimates that are extremely close to experimental benchmarks, both with and without matching of the comparison schools. The three co-authors conclude that with correct modeling, adding time points does provide an advantage without increasing bias in results.

Using the d-Statistic in Single Case Designs

One of the most common experimental design methodologies utilized in the fields of psychology, education, and human behavior is that of the single-case design (SCD). In a single-case design study, one subject is observed and is used as its own control group. SCD methodology is widely used for its flexibility and ability to highlight variation in individual responses to intervention effects. Although SCD studies generally include several individuals, the effects are measured on a per-individual basis. One of the major statistical issues then becomes the problem of expressing effects found in SCD studies in the same metric used in standard randomized experiments involving multiple groups of subjects—often called between-subject design (BSD) experiments. The primary method of comparing SCD results with those from BSD studies is the use of statistical measurements that describe the relative size of the effects. One of these, the d-statistic, is used to measure effect sizes. In a Journal of School Psychology article, IPR education researcher and statistician Larry Hedges and his colleagues explain how a modified d-statistic can be used to improve both analysis and meta-analysis of SCD data. The strengths of this new d-statistic include its formal statistical development, the existence of associated power analyses, and its ability to be easily calculated using macros in SPSS software. Hedges is Board of Trustees Professor of Statistics and Education and Social Policy. 
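As a point of reference, the conventional between-subject d divides the difference in group means by the pooled standard deviation. The sketch below shows only that standard BSD calculation, not the modified SCD statistic described in the article; the function name and the scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Standardized mean difference using the pooled standard deviation."""
    n_t, n_c = len(treatment), len(control)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical outcome scores for two small groups.
treated = [12.0, 14.0, 16.0, 18.0]
control = [10.0, 12.0, 14.0, 16.0]
d = cohens_d(treated, control)
print(f"d = {d:.3f}")
```

Because d is expressed in standard deviation units, results measured on different scales can be placed on a common metric, which is what makes the statistic useful for meta-analysis.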

Generalizing from Unrepresentative Experiments

In the Journal of the Royal Statistical Society, Applied Statistics, Hedges and the University of Chicago’s Colm O’Muircheartaigh point to the rarity of experiments with probability samples in the fields of education, medicine, and the social sciences. Given that social experiments typically aim to glean information to inform policy decisions, the authors enumerate why they believe probability sampling is likely to remain scant in these fields. For education research, they single out three main roadblocks: High costs and long timeframes mean using one experiment for policy decisions in many areas; populations of interest might not be known in advance; and the inference population might not be part of the population included in the experiment. The two researchers propose using propensity-score matching to mitigate bias and simulate a random sample. They apply it to evaluating a pair of cluster-randomized trials (CRTs), assessing if SimCalc, a mathematics software program, affected students’ test scores in a nonrandom sample of 91 Texas public schools. While the CRTs indicate that it did improve math test scores, could the state draw policy implications for all schools? It can, the authors conclude, if one uses propensity-score matching to enumerate the assumptions necessary to justify inferences about average treatment effects for generalizing policies to other student populations. The study was supported by an NSF grant.

Propensity-Score Stratified Sampling

Though randomized experiments have become increasingly common in education, random selection of units (random sampling) into the experiment has not; in fact, recent work suggests that just 3 percent of social experiments have used this dual randomization process. In the Journal of Research on Educational Effectiveness, Hedges and his co-authors—who include Columbia University’s Elizabeth Tipton, a former IPR graduate research assistant—address the problem of sample selection in experiments, particularly scale-up experiments, by providing a method for selecting the sample so that the population and sample are similar in composition. Their method requires that the study’s inference population and eligibility criteria are well defined before study recruitment begins. When these two populations differ, however, the authors provide a method for sample recruitment based on stratified selection and a propensity score. The researchers also illustrate the problem by discussing how to select districts for two scale-up experiments, Open Court Reading and Everyday Math programs, during their recruitment phases. 
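The general idea of stratified selection on a propensity score can be sketched as follows. This is a minimal illustration, not the authors’ procedure: it assumes the scores have already been estimated, uses equal-size strata, and draws units at random within each stratum. The function name, district labels, and scores are all hypothetical.

```python
import random


def stratified_sample(scores, n_strata, n_sample, seed=0):
    """Pick unit ids so that each propensity-score stratum is represented
    in proportion to its share of the population."""
    rng = random.Random(seed)
    ranked = sorted(scores, key=scores.get)  # unit ids, lowest score first
    size = len(ranked) / n_strata
    strata = [
        ranked[round(i * size):round((i + 1) * size)] for i in range(n_strata)
    ]
    chosen = []
    for stratum in strata:
        # Proportional allocation; rounding may shift the total slightly
        # when strata sizes do not divide evenly.
        quota = round(n_sample * len(stratum) / len(ranked))
        chosen.extend(rng.sample(stratum, min(quota, len(stratum))))
    return strata, chosen


# Hypothetical population of 100 districts with made-up scores.
score_rng = random.Random(1)
scores = {f"district_{i}": score_rng.random() for i in range(100)}
strata, chosen = stratified_sample(scores, n_strata=5, n_sample=20)
print(len(chosen), "districts selected")
```

Sampling from every stratum keeps the recruited sample from concentrating among the units most likely to volunteer, so its composition stays closer to that of the inference population.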

Protecting Privacy in State Longitudinal Datasets

The U.S. Department of Education’s Institute of Education Sciences (IES) has spent more than $600 million helping states develop longitudinal data systems to better understand and improve how American school systems perform. Yet concerns about protecting privacy and the Family Educational Rights and Privacy Act (FERPA) are creating data-access barriers for researchers. With funding from IES, the National Science Foundation, and the Spencer Foundation, and with the cooperation of a dozen states, Hedges and his research team are investigating methods to make large datasets available while protecting individuals’ privacy.

Design Parameters for Education Experiments

An IES-sponsored project led by Hedges and the University of Chicago’s Eric Hedberg with IPR project coordinator Zena Ellison seeks to establish design parameters for education experiments at state, local, school, and classroom levels. Many current education experiments use designs that involve the random assignment of entire pre-existing groups, such as classrooms and schools, to treatments, but these groups are not randomly composed. As a result, statistical clustering occurs (when individuals in the same group tend to be more alike than those in different ones). Experimental sensitivity depends on the amount of clustering in the design, which is difficult to know beforehand. This project seeks to provide empirical evidence on measures of clustering, such as intraclass correlations and related design parameters, making these available to education researchers. This evidence is publicly available in a database on IPR’s website. The project has already produced important results, including new methods of calculating standard errors for intraclass correlations and the software to compute them.
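To see why the amount of clustering matters for sensitivity, consider the standard design effect for equal-size clusters, 1 + (m - 1) × ICC, where m is the cluster size and ICC is the intraclass correlation. The short sketch below applies that textbook formula; the cluster counts and ICC values are hypothetical and are not taken from the project’s database.

```python
def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal-size clusters."""
    return 1 + (cluster_size - 1) * icc


def effective_sample_size(n_clusters, cluster_size, icc):
    """Number of independent observations the clustered sample is worth."""
    total = n_clusters * cluster_size
    return total / design_effect(cluster_size, icc)


# Hypothetical design: 40 schools with 25 students each.
for icc in (0.05, 0.20):
    deff = design_effect(25, icc)
    ess = effective_sample_size(40, 25, icc)
    print(f"ICC={icc:.2f}: design effect={deff:.2f}, effective n={ess:.0f}")
```

Even a modest intraclass correlation shrinks the effective sample size considerably, which is why empirical values of these parameters are needed when planning an experiment.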

Data Use, Quality, and Cost in Policy Research

Internal and External Validity in Policy Research

In the Journal of Policy Analysis and Management, IPR social psychologist and methodologist Thomas D. Cook contends that at every step of experimental design and implementation, there are assumptions and compromises that weaken the adage that “internal validity is the sine qua non for external validity.” For example, random sampling is often considered a panacea for all questions of external validity, but true random experiments are extremely rare in policy analysis, with nonprobability sampling quasi-experiments used the most. Furthermore, Cook argues that it is unreasonable to assume that a study’s essential characteristics, i.e., setting, sample population, treatments, and outcomes, are static across different time periods. Even causal relationships themselves are not very stable across past studies. Cook addresses common extrapolation practices used in public policy experiments today, identifying several major fallacies. He also criticizes the common statistical assumption that since internal validity measures are so strict, laxer standards for external validity are acceptable. He argues instead for developing stricter external validity measures. To address prevalent methodological issues in future public policy research, he posits that statisticians should shift their emphasis from current policy relevance and immediate use to a more rigorous treatment of external validity, causal representation, and extrapolation. Cook holds the Joan and Sarepta Harrison Chair in Ethics and Justice. 

Improving Reporting of Rape, Sexual Assault

A panel of the National Research Council published “Estimating the Incidence of Rape and Sexual Assault,” a report presenting its recommendations for reducing the underreporting of such crimes in the National Crime Victimization Survey of the U.S. Department of Justice’s Bureau of Justice Statistics. The national panel of experts included IPR statistician Bruce Spencer. Its major recommendation is the creation of a separate survey to measure these types of victimizations, one that uses updated and more precise definitions of ambiguous words such as “rape” to improve the reporting of these crimes. Furthermore, by shifting the survey focus from a criminal justice perspective, or a “point-in-time” event, to a longer-term public health perspective, the survey can more accurately track occurrence of these crimes and improve response accuracy. The panel also identified several key methodological issues to address, including improving sample design for low-incidence events and developing methods to increase respondents’ privacy by allowing them to report anonymously.

Improving the 2020 Census

In an IPR working paper with graduate research assistant Zachary Seeskin, Spencer considers how to measure the benefits from improving the accuracy of the 2020 Census, taking the examples of two high-profile census uses—apportionment and fund allocation. Apportionment of the 435 seats in the U.S. House of Representatives is based on census numbers, and distortions in census results mirror distortions in numbers of seats allocated to the states. Spencer and Seeskin expect that roughly $5 trillion in federal grant and direct assistance monies will be distributed at least partly on the basis of population and income data following the 2020 Census, and again distortions in census results cause distortions in the allocations of funds. After describing loss functions to quantify the distortions in these two uses, they then undertake empirical analyses to estimate the expected losses arising from alternative profiles of accuracy in state population numbers. 
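How census error can translate into a distortion in seats can be illustrated with the method of equal proportions, the rule used to apportion the U.S. House. In the sketch below, the three states, their populations, the 10-seat chamber, and the simple count of misallocated seats are all illustrative assumptions; the working paper’s actual loss functions are not reproduced here.

```python
import heapq
from math import sqrt


def apportion(populations, seats):
    """Method of equal proportions: every state starts with one seat, and
    each remaining seat goes to the state with the highest priority value
    population / sqrt(n * (n + 1)), where n is its current seat count."""
    alloc = {state: 1 for state in populations}
    heap = [(-pop / sqrt(2), state) for state, pop in populations.items()]
    heapq.heapify(heap)
    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        alloc[state] += 1
        n = alloc[state]
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))
    return alloc


def seats_misallocated(true_pop, census_pop, seats):
    """Hypothetical loss: seats that end up in the wrong state."""
    a, b = apportion(true_pop, seats), apportion(census_pop, seats)
    return sum(abs(a[s] - b[s]) for s in a) // 2


# Made-up populations; the census undercounts state C by roughly 2 percent.
true_pop = {"A": 1_000_000, "B": 600_000, "C": 260_000}
census_pop = {"A": 1_000_000, "B": 600_000, "C": 255_000}
print(apportion(true_pop, 10), apportion(census_pop, 10))
print("misallocated seats:", seats_misallocated(true_pop, census_pop, 10))
```

Because seats are indivisible, a small error in one state’s count can be enough to move a seat when that state sits near a priority threshold.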

Better Metrics for Earthquake Hazard Maps

In 2011, the 9.0-magnitude Tohoku earthquake and the resulting tsunami killed more than 15,000 people and caused nearly $300 billion in damages. The shaking from the earthquake was significantly larger than Japan’s national hazard map had predicted, devastating areas forecasted to be relatively safe. Such hazard-mapping failures prompted three Northwestern researchers (geophysicist and IPR associate Seth Stein, who is William Deering Professor of Earth and Planetary Sciences; Spencer; and doctoral student Edward Brooks) to search for better ways to construct, evaluate, and communicate the predictions of hazard maps. In two IPR working papers, the scholars point out several critical problems with current hazard maps and offer statistical solutions to improve mapping.

Currently, no widely accepted metric exists that can gauge how well one hazard map performs compared with another. In the first working paper, the researchers use 2,200 years of Italian earthquake data to highlight several different statistical models that could be used to compare how well maps work and to improve future maps. Since underestimating an earthquake’s impact can leave areas ill-prepared, the scholars developed asymmetric models that weigh underprediction heavily and can account for the number of affected people and properties. In a second working paper, the scholars offer further methodological guidance on when, and how, to revise hazard maps using Bayesian modeling, which updates probability estimates as new evidence accumulates.
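One way to picture an asymmetric metric of this kind is a squared-error score that multiplies the penalty whenever observed shaking exceeds the mapped value and weights each site by its exposure. The sketch below is an illustrative assumption, not the working paper’s metric; the function name, the weight of 3, and all of the values are made up.

```python
def asymmetric_misfit(predicted, observed, under_weight=3.0, exposure=None):
    """Exposure-weighted mean squared error in which sites where the map
    underpredicted shaking count `under_weight` times as much."""
    if exposure is None:
        exposure = [1.0] * len(predicted)
    total = 0.0
    for p, o, e in zip(predicted, observed, exposure):
        error = o - p  # positive when the map underpredicted
        weight = under_weight if error > 0 else 1.0
        total += e * weight * error ** 2
    return total / sum(exposure)


# Hypothetical shaking levels at three sites and two competing maps.
observed = [0.3, 0.5, 0.4]
map_low = [0.1, 0.3, 0.2]   # underpredicts every site by 0.2
map_high = [0.5, 0.7, 0.6]  # overpredicts every site by 0.2
print(asymmetric_misfit(map_low, observed))
print(asymmetric_misfit(map_high, observed))
```

Under a symmetric score the two maps would tie; the asymmetric version ranks the underpredicting map as worse, and passing population or property counts as `exposure` would shift the weight toward sites where more is at stake.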

Interdisciplinary Methodological Innovation 

Generalizability of Survey Experiments 

Along with IPR political scientist James Druckman, IPR sociologist Jeremy Freese is co-principal investigator of Time-sharing Experiments for the Social Sciences (TESS), a National Science Foundation (NSF)-funded online platform for survey experiments that aims to make them easier and cheaper for researchers to conduct. Researchers apply to TESS, which collects data for their experiments at no cost to them. In 2014, TESS held two special competitions, one for young investigators who typically find it difficult to field larger-scale studies and another for proposals of experiments that offer monetary rewards to participants.

For its data collection, TESS contracts with GfK Knowledge Networks, a survey panel that guarantees a sample representative of the U.S. population and uses random-digit-dialing and address-based survey techniques. A fast-growing and cheaper alternative for conducting survey experiments is a crowdsourcing platform such as Amazon’s Mechanical Turk (MTurk). MTurk offers convenience, speed, low prices, and a high volume of responses: researchers can pay participants as little as one cent per experiment. However, are MTurk results representative of the greater U.S. population? If not, then experiments conducted on the platform could be skewed. Given the reinvigorated debates about convenience samples’ external validity in experiments, Freese, Druckman, and Kevin Mullinix, an IPR graduate research assistant, conduct 20 experiments across the two platforms and compare the platforms’ results with one another. In experiments not moderated by age or education, MTurk and GfK participants responded similarly. But in those where age and education did matter, the responses differed greatly. In short, the choice of a survey panel such as GfK or a crowdsourcing platform such as MTurk can affect an experiment’s results. Still, they see the crowdsourcing model as offering exciting possibilities for researchers if one pays attention to the caveats. Their findings appeared in an IPR working paper. In another research article published in Sociological Science, Freese, with Northwestern doctoral students Jill Weinberg and David McElhattan, also compared the two populations, confirming the demographic differences while finding that the two yield comparable results. Their study suggests that crowd-sourced data might prove useful due to its “relative affordability and the surprising robustness and accuracy of its results.” Freese is Ethel and John Lindgren Professor of Sociology, and Druckman is Payson S. Wild Professor of Political Science.

False Precision and Strengthening Social Science

How can one translate high-quality statistical data and regression output into palatable information for the average reader? Should more precise data—with four or five digits trailing the decimal—be reported at the expense of potentially complicating results? Or should researchers cut their reported results to one or two trailing digits to provide more concise, readable conclusions? Some researchers have argued that the extra digits tend to falsely imply a level of accuracy that does not actually exist in the experiment. In Sociological Science, IPR sociologist Jeremy Freese argues that statistical precision, while oftentimes unwieldy, is actually necessary to preserve the statistical and essential meaning of results. In a simple confidence interval calculation, Freese shows that by reducing an estimate from five trailing digits to two, the resulting confidence interval becomes 10 times as wide—that is, 10 times less meaningful. Freese concludes that coefficients should be reported to at least three digits to preserve statistical precision and ensure the validity and reproducibility of current and future research. 

Regression Models for Categorical Data in Stata

For the third edition of Regression Models for Categorical Dependent Variables Using Stata (Stata Press, 2014), Freese and co-author J. Scott Long of Indiana University–Bloomington rewrote the book, making it 60 pages longer than the second edition even though 150 pages were deleted. They also revisited their original “suite” of authored commands and detailed new ones. Among the topics added are Stata’s factor variables and the “margins” command, which affect how models can be estimated and interpreted. Additionally, Long and Freese completely revamped their popular SPost commands, which make it easier to include powers or interactions of covariates in regression models and work seamlessly with models estimated with complex survey data. Starting with a detailed introduction to Stata and then moving into general treatment of models, the authors illustrate each of their chapters with concrete examples, including a new chapter on how to interpret regression models using predictions. They have also posted all of the examples, datasets, and their authored commands from the book on their website.