Coding resources for randomized evaluations

Summary

This page compiles links to resources on software, user-written commands for randomized evaluations, coding in teams, and writing reproducible code. User-written commands listed below include common checks for randomized evaluations and faster versions of frequently used commands in Stata and R.

Helpful user-written commands for RCTs

User-written programs and code can support checks of whether key steps in a randomized evaluation are running as planned. J-PAL and IPA (Innovations for Poverty Action) have written several commands in Stata and R that run helpful checks and comparisons.

  • Balance checks report whether variables are balanced across treatment and control groups.

    • orth_out – Stata command for exporting summary statistics or orthogonality (balance) tables. IPA wrote and maintains this command and provides a tutorial for this and related commands.

  • Back checks allow researchers to compare a mini-survey to a larger original survey in order to assess the consistency of survey answers and enumerators’ adherence to survey protocols.

    • bcstats – Stata program for analyzing back check data by comparing it to original survey data.

    • bcstatsR – R version of the Stata command bcstats.

  • High-frequency checks run on incoming research data while data collection is ongoing and can monitor a range of indicators and potential red flags, such as duplicate observations and outliers.

    • ipacheck – Stata package for running multiple high-frequency checks on research data.

  • Protecting personally identifiable information (PII) is an essential part of data collection and analysis.1 The commands below may be helpful for scanning for obvious personally identifiable information; however, a scan that reports no PII is no guarantee that some variables or combinations of variables do not convey personally identifiable information.

    • stata_PII_scan – Stata program to scan for personally identifiable information.

    • PII-Scan – R code to scan for obvious PII.

    • PII_detection – application and Python script to identify, remove, and/or recode PII from field experiment data sets.

  • Comparing two datasets can help check whether data are being recorded and stored correctly, for example by checking for discrepancies between the way two people entered the same paper-based data into electronic form (a best practice for paper surveys) or by checking whether data have changed between two versions of what should be the same dataset.

    • cfout – Stata user-written command to compare two datasets.

  • Commands and tips for larger datasets, including faster versions of common commands, may be useful for researchers working with large datasets in Stata.

    • gtools – Stata package of user-written commands for faster versions of collapse, reshape, xtile, tabstat, isid, egen, pctile, winsor, contract, levelsof, duplicates, and unique/distinct.

    • Efficiency suggestions for large datasets – Stata tips for extracting subsets of data and reducing run time for common operations, compiled on the NBER website.
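Two of the checks listed above, a balance comparison and a comparison of two entries of the same data, can be sketched in a few lines. The Python sketch below is illustrative only: it is not the implementation of orth_out or cfout, and the function names (balance_row, compare_entries), the field names, and the example data are all hypothetical. It assumes small in-memory data and uses only the standard library.

```python
import math
from statistics import mean, variance


def balance_row(values, treated):
    """Compare one baseline variable across treatment and control.

    values  -- list of numbers, one per unit
    treated -- list of booleans, True if the unit is in the treatment group
    Returns (control mean, treatment mean, difference, Welch t-statistic).
    """
    t = [v for v, d in zip(values, treated) if d]
    c = [v for v, d in zip(values, treated) if not d]
    diff = mean(t) - mean(c)
    # Unequal-variance (Welch) standard error of the difference in means.
    se = math.sqrt(variance(t) / len(t) + variance(c) / len(c))
    return mean(c), mean(t), diff, diff / se


def compare_entries(first, second, id_key):
    """List cell-level discrepancies between two entries of the same data.

    first, second -- lists of dicts, one dict per record
    id_key        -- name of the field that uniquely identifies a record
    Returns a list of (id, field, value_in_first, value_in_second).
    Only records and fields present in both entries are compared.
    """
    second_by_id = {row[id_key]: row for row in second}
    diffs = []
    for row in first:
        other = second_by_id.get(row[id_key])
        if other is None:
            continue  # record missing from the second entry; not compared here
        for field, value in row.items():
            if field != id_key and field in other and other[field] != value:
                diffs.append((row[id_key], field, value, other[field]))
    return diffs


# Hypothetical baseline variable and treatment indicator.
control_mean, treat_mean, diff, t_stat = balance_row(
    [1, 2, 2, 3, 3, 4], [True, True, False, True, False, False]
)
print(control_mean, treat_mean, diff)

# Hypothetical double entry of the same two paper surveys.
entry_a = [{"id": 1, "age": 30, "village": "A"}, {"id": 2, "age": 41, "village": "B"}]
entry_b = [{"id": 1, "age": 30, "village": "A"}, {"id": 2, "age": 14, "village": "B"}]
diffs = compare_entries(entry_a, entry_b, "id")
print(diffs)  # → [(2, 'age', 41, 14)]
```

In practice a balance table would cover many variables and often account for stratification or clustering, and a data comparison would also report records that appear in only one entry. The sketch leaves both out to stay short.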

Coding with teams working in the social sciences

Because randomized evaluations may involve lengthy coding projects and multiple research staff, it is essential to have clear internal guidelines for how to code. The following guidelines and tools may assist readers with coding questions specific to the social sciences.

Guidance for reproducible coding

Reproducibility is an important consideration in coding for randomized evaluations. The following resources offer guidelines and tools for reproducibility.

Writing randomization code

Users who are new to writing randomization code may find it useful to work through the guide below for writing randomization code in Stata or a user-written version of the guide in R.

  • Writing Randomization Code in Stata: A Guide uses data and an annotated Stata do-file to illustrate, step by step, how to conduct simple randomization using Stata.

  • This user-written guide to writing randomization code in R by Jorge Cimentada is a translation of the guide from Stata into R.

  • The randtreat command in Stata performs random treatment assignment with different numbers of treatments and uneven treatment fractions. It also provides methods to deal with “misfits” that arise in treatment assignment when observations cannot be neatly distributed.

  • The random number generator may change between versions of software. For reproducibility of randomization code, be sure to specify and set the version of the software used. For example, in Stata, this can be done with the command –version #–, where # is the Stata version number.
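The reproducibility points above apply in any language. As a minimal sketch, assuming Python rather than Stata or R, the example below sorts units by a unique ID, draws from a generator with an explicit seed, and records the interpreter version. The function name assign_treatment, the seed value, and the unit IDs are hypothetical. Sending the leftover unit to control when the number of units is odd is a simplifying assumption, not the misfit handling that randtreat provides.

```python
import random
import sys


def assign_treatment(unit_ids, seed):
    """Assign half of the units to treatment, reproducibly.

    Returns a dict mapping each unit ID to 1 (treatment) or 0 (control).
    """
    if len(set(unit_ids)) != len(unit_ids):
        raise ValueError("unit IDs must be unique")
    # Sort first so the result does not depend on the order of the input.
    ordered = sorted(unit_ids)
    # Use a private generator with an explicit seed.
    rng = random.Random(seed)
    rng.shuffle(ordered)
    # Integer division: with an odd number of units, the extra unit
    # (a "misfit") goes to control. This is an assumption of the sketch.
    n_treated = len(ordered) // 2
    return {uid: int(i < n_treated) for i, uid in enumerate(ordered)}


# Record the software version alongside the seed, since the random
# number generator may differ across versions.
print("Python", sys.version.split()[0])

ids = [104, 101, 107, 102, 106, 103, 105]
assignment = assign_treatment(ids, seed=20240501)
same_again = assign_treatment(list(reversed(ids)), seed=20240501)
print(assignment == same_again)  # → True
print(sum(assignment.values()))  # → 3
```

Sorting before shuffling matters because the same seed applied to data in a different order would otherwise give a different assignment.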

Acknowledgments

Thanks to Aileen Devlin, Laura Feeney, Louise Geraghty, Sarah Kopper, Sam Ayers, and Rose Burnam for their suggestions and advice. Chloe Lesieur copyedited this document. This work was made possible by support from Arnold Ventures. Any errors are our own.

1. Data security procedures for researchers provides more information on protecting PII.
Additional Resources
Self-guided learning in Stata
  1. Stata resources from UCLA | The UCLA Institute for Digital Research and Education

    These resources are organized by topic. The search functionality allows for looking up resources on specific commands.

  2. Short introduction to Stata | Professor Germán Rodriguez

    A tutorial for new users with an emphasis on data management and graphics.

  3. Stata 101 | IPA (direct download)

    A self-guided learning module for users with little or no knowledge of Stata.

  4. Stata 102 | IPA (direct download)

    A self-guided learning module for users with some Stata experience, but who are not especially comfortable with the program.

  5. Stata 103 | IPA (direct download)

    A self-guided learning module for users who are familiar with the basic Stata commands and are comfortable working with the program.

  6. Stata 104 | IPA (direct download)

    A self-guided learning module for advanced users, with a focus on data cleaning.

  7. Stata cheat sheets | Stata.com

    Data scientists Tim Essam and Laura Hughes have created "cheat sheets" on using Stata for data science tasks and analysis. These may be of interest to both novice and advanced Stata users.

Self-guided learning in R
  1. Base R Cheat Sheet | J-PAL (direct download)

    A cheat sheet by RStudio that provides an overview of basic R commands.

  2. R for Statistics 571 | J-PAL (direct download)

    Bret Larget teaches the basics of R and some statistical applications of R, including statistical tests and data visualization.

  3. Short Introduction to R | Professor Germán Rodriguez

    A short introduction to R for new users, with an emphasis on fitting linear and generalized linear models.

  4. R resources from UCLA | UCLA Institute for Digital Research and Education

    The UCLA Institute for Digital Research and Education provides a library of somewhat advanced resources and tutorials for R.

  5. Randomization Inference (RI) | The Comprehensive R Archive Network (direct download)

    An R package for performing randomization-based inference for experiments.

  6. R-bloggers

    A website compiling the blogs of data analysts who share the work they do with R, including examples of data analysis and visualization.

  7. Data Analysis for Social Scientists | MIT Economics MicroMasters program

    This course, part of J-PAL and MIT's MicroMasters Program in Data, Economics, and Development Policy, uses R to discuss methods for harnessing data to answer questions of economic and policy interest.

  8. Designing and Running Randomized Evaluations | MIT Economics MicroMasters program

    This course, part of J-PAL and MIT's MicroMasters Program in Data, Economics, and Development Policy, provides templates for calculating power, monitoring data collection, and managing data with R.

Power calculations
  1. Power Calculations in Stata: A Guide | J-PAL (direct download)

    This guide uses data and an annotated Stata do-file to illustrate how to conduct power calculations step by step. It provides examples of both parametric and non-parametric simulation methods of calculating statistical power.
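To illustrate the difference between a formula-based and a simulation-based power calculation, the sketch below computes power both ways for a simple two-arm comparison of means. It is written in Python rather than Stata and is not taken from the guide. The function names, effect size, standard deviation, and sample size are hypothetical. The simulation draws outcomes from a normal distribution, which is an assumption of this sketch; a non-parametric version would resample from existing data instead.

```python
import math
import random
from statistics import NormalDist, mean, variance


def analytic_power(effect, sd, n_per_arm, alpha=0.05):
    """Power of a two-sided z-test for a difference in means,
    with equal arm sizes and a common standard deviation."""
    se = sd * math.sqrt(2 / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect / se
    # Probability of rejecting in either tail when the true effect is `effect`.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


def simulated_power(effect, sd, n_per_arm, alpha=0.05, reps=2000, seed=1):
    """Share of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(0, sd) for _ in range(n_per_arm)]
        treat = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        # Standard error estimated from the simulated samples.
        se = math.sqrt(variance(control) / n_per_arm + variance(treat) / n_per_arm)
        if abs(mean(treat) - mean(control)) / se > z_crit:
            rejections += 1
    return rejections / reps


# Hypothetical design: effect of 0.5 standard deviations, 64 units per arm.
power_formula = analytic_power(effect=0.5, sd=1.0, n_per_arm=64)
power_simulated = simulated_power(effect=0.5, sd=1.0, n_per_arm=64)
print(round(power_formula, 2))  # → 0.81
print(power_simulated)
```

The two numbers should be close but not identical, because the simulated value carries sampling noise that shrinks as reps grows. Simulation becomes more useful than the formula when the design includes features such as clustering or stratification that a simple formula does not capture.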