Managing files, data, and documentation for randomized evaluations

Summary

Several challenges arise from the length and complexity of randomized evaluations, including the management of multiple data sources or multiple rounds of surveys, sensitive or personally identifiable information, and multiple people or institutions coordinating on code and analysis. Addressing these challenges requires consistent protocols and documentation.

This section aims to assist researchers in designing and using data flows, coding practices, and file management systems that will be consistent throughout the course of a study.

File management

Setting up a clear folder structure early on in a study will enable consistent management of data, documents, and other files. Researchers typically will negotiate a data flow with study partners—and in tandem, can begin creating a plan for storing and managing data and other files. Along the way, researchers and partners can work together to document code versions, randomization, task management, and other key steps.

Just as house keys, kitchen utensils, lone socks, or paper documents can become lost if they don’t have a designated “home,” files are often lost or folders cluttered if file structures are not comprehensive and intuitively designed from the start. Clear folder structures can be especially important for lengthy randomized evaluations since a longer time frame increases the likelihood that people will forget where files are stored.

A range of file types may be saved within project folders for randomized evaluations, such as: code, data, meeting notes, a project log, data use agreements (DUAs), consent forms, references, publication drafts, and IRB protocols.

Tips on folder structures

Create a deliberate folder structure before files exist. Using a folder structure template for initial setup ensures that there is deliberate organization of folders from the beginning of a project.

  • For staff working across multiple projects, using similar (but separate) folder structures for each project can reduce the effort required for folder setup and provide consistency. 
  • Template folder structures show how major project files might be organized:
    • See the Appendix for a template file map.
    • IPA template folder structures1 are available as an appendix to “IPA’s Best Practices for Data and Code Management” (Pollock et al., 2015, 13).
  • Templates can be used flexibly, with shortcuts or temporary file path changes, to accommodate different project stages. For example, a folder that will be intensely used for several weeks (e.g., DUAs) might be pulled temporarily to the main directory and re-nested when no longer active. Be sure to check that this interim move will not affect code, and make a plan for reorganizing temporarily moved folders back into the main folder structure. Folder shortcuts can help eliminate errors or confusion that can arise from moving folders.
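A folder template can be turned into a short setup script so that every project starts from the same structure. The sketch below is one possible approach, not a prescribed tool: the folder names are taken from the Appendix template, while the function name `create_project_folders` and the dictionary layout are assumptions made for illustration.

```python
from pathlib import Path

# Illustrative template: top-level folders mapped to first-level subfolders.
# Names follow the Appendix; adapt them to the project's own needs.
FOLDER_TEMPLATE = {
    "Admin": ["DUAs", "Funding and finance", "IRB", "Project management",
              "RA recruitment", "Trial registry"],
    "Call notes": [],
    "Code and data": ["Analysis", "Derived", "Drafts", "lib", "raw"],
    "Data collection process": [],
    "Papers and presentations": [],
    "Research design": [],
}


def create_project_folders(project_root, template=FOLDER_TEMPLATE):
    """Create every folder in the template under project_root; return the paths."""
    root = Path(project_root)
    created = []
    for top_level, subfolders in template.items():
        # An empty subfolder list still creates the top-level folder itself.
        for relative in [top_level] + [f"{top_level}/{sub}" for sub in subfolders]:
            folder = root / relative
            # exist_ok=True makes the script safe to re-run on an existing project.
            folder.mkdir(parents=True, exist_ok=True)
            created.append(folder)
    return created
```

Because the script never deletes or moves anything, it can be re-run later to restore folders that were temporarily relocated, without touching files already in place.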

Data management

Data flow strategies

A data flow strategy maps the ways in which data will be linked and accessed, including:

  • How identifying information will be gathered for the study sample;
  • Which identifiers will be used to link datasets (intervention data, treatment assignment, and/or administrative data);
  • Who will create de-identified study IDs, if applicable;
  • Which entity or team will perform the link;
  • Which individuals and organizations will have access to which data;
  • How data will be securely stored, used, and regularly backed up; 
  • What algorithm will be used to link data (e.g., probabilistic match, exact match); and
  • What software will be used to link data.

Researchers and partners may find it helpful to develop a data flow strategy in tandem with the design of a randomized evaluation. Ethical and legal restrictions often have important implications for how data are shared.2
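Two of the elements listed above, creating de-identified study IDs and linking datasets by exact match, can be sketched in a few lines. This is a minimal illustration under stated assumptions: records are held as lists of dictionaries, the identifier field is called `national_id` (a hypothetical name), and only that one field is stripped. Real datasets usually contain further identifying fields, and which party holds the crosswalk depends on the agreed data flow.

```python
import secrets


def assign_study_ids(identified_records, id_field="national_id"):
    """Return (crosswalk, deidentified) for a list of dict records.

    crosswalk maps each identifier to a random study ID and should be stored
    separately with restricted access. deidentified holds the remaining
    fields plus the study ID.
    """
    crosswalk = {}
    used_ids = set()
    deidentified = []
    for record in identified_records:
        identifier = record[id_field]
        if identifier not in crosswalk:
            study_id = secrets.token_hex(4)
            while study_id in used_ids:  # guard against rare collisions
                study_id = secrets.token_hex(4)
            used_ids.add(study_id)
            crosswalk[identifier] = study_id
        stripped = {k: v for k, v in record.items() if k != id_field}
        stripped["study_id"] = crosswalk[identifier]
        deidentified.append(stripped)
    return crosswalk, deidentified


def exact_match(admin_records, crosswalk, id_field="national_id"):
    """Link records by exact identifier match; return (linked, unmatched_count)."""
    linked = []
    unmatched = 0
    for record in admin_records:
        study_id = crosswalk.get(record[id_field])
        if study_id is None:
            unmatched += 1  # report the match rate rather than dropping silently
            continue
        stripped = {k: v for k, v in record.items() if k != id_field}
        stripped["study_id"] = study_id
        linked.append(stripped)
    return linked, unmatched
```

The study IDs are drawn at random rather than derived from the identifier, so they cannot be reversed without the crosswalk. A probabilistic match would replace the dictionary lookup in `exact_match` with a scoring rule and is not shown here.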

Data management plans

An implementation plan for data management will determine how the data flow strategy will function in practice. Having a thorough plan can help ensure consistent practices that comply with ethical and legal requirements and protect the research integrity of a study. MIT Libraries’ Data Management page defines data management plans and shares resources for creating them. These include DMPTool, which organizes information according to specific funder requirements, and project start and end checklists for data management. For more detail on data management plans, see “Additional Resources” below.

Documentation

Documenting randomization

Replicable and robust randomization protocols and code are major determinants of the credibility of findings. Choices about how to randomize may seem self-explanatory at the time, but may be difficult to recall at the analysis stage without careful documentation.

At minimum, document or save:

  • Explicit definitions of how data are sorted;
  • How and when a seed is set for randomization;
  • Randomization code;
  • Software version for randomization code, since randomization algorithms may change between versions;3
  • The original list of data to be randomized;
  • Checks that running the code will always give the same result; and
  • The original list of treatment assignments.
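The items in this list can be built into the randomization code itself. The sketch below is an illustration in Python rather than a recommended protocol: the seed value, the function names, and the 50/50 split are all assumptions. It sorts units explicitly, seeds a dedicated generator, and draws with `random()` only, since other methods in Python's `random` module may change between versions. Recording the software version remains prudent either way.

```python
import random
import sys

SEED = 20240131  # hypothetical seed: choose once, record it, and do not change it


def assign_treatment(unit_ids, seed=SEED):
    """Assign half of the units (rounded down) to treatment, reproducibly."""
    # 1. Explicit sort, so the order of the input file cannot change the result.
    ordered = sorted(unit_ids)
    if len(set(ordered)) != len(ordered):
        raise ValueError("unit IDs must be unique before randomizing")
    # 2. Seed a dedicated generator immediately before drawing.
    rng = random.Random(seed)
    # 3. One uniform draw per unit, taken in sorted order.
    draws = {unit: rng.random() for unit in ordered}
    # 4. Rank units by their draw; the lowest-ranked half is treated.
    ranked = sorted(ordered, key=lambda unit: draws[unit])
    treated = set(ranked[: len(ranked) // 2])
    return {unit: ("treatment" if unit in treated else "control") for unit in ordered}


def randomization_record(unit_ids, seed=SEED):
    """Bundle the items worth saving alongside the randomization code."""
    first = assign_treatment(unit_ids, seed)
    # Re-run on a reordered copy to check that the result does not depend on input order.
    second = assign_treatment(list(reversed(list(unit_ids))), seed)
    return {
        "software_version": sys.version,
        "seed": seed,
        "original_list": sorted(unit_ids),
        "assignments": first,
        "reproducible": first == second,
    }
```

Saving the returned record (for example, as a dated file in the “Randomization” folder) keeps the original list, the assignments, the seed, and the software version together in one place.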

Documenting code 

Randomized evaluations may involve lengthy coding projects with different staff members at different project stages, which makes it essential to have clear internal guidelines for writing and documenting code that the research team will realistically be able to maintain. Good practices around coding documentation are also key for supporting reproducibility (see the section on Coding Resources for more information).4

  • Naming and structuring practices within code. Code is “self-documenting” if variables, functions, macros, and files are named descriptively, clearly, and consistently and the code is structured to guide readers.5 To supplement descriptive names and structure, version control software or comments within code can document why decisions were made, for example, why a certain type of outlier is dropped or why the code uses one command rather than another (Pollock et al., 2015, 7). IPA’s Best Practices for Data and Code Management contains more tips on documenting decisions. Some teams format comments in a standard way for consistency and clarity; others choose an entirely self-documenting system that uses no comments at all. Any system for writing self-documenting code takes time and effort to learn, but its goal is to avoid errors arising from outdated comments and to reduce the time and effort required to keep comments up to date (Gentzkow and Shapiro, 2014, 28).
  • Version control. Plan in advance how the team will track decisions and versions, whether through agreed team conventions or through specialized version control software. Version control software such as Git can manage and document changes to code.6 Git and other version control software provide clear systems for making changes and documenting decisions when multiple people work on code for one project.
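The difference between terse and self-documenting code is easiest to see side by side. In the hypothetical example below, both functions apply the same rule; the variable names, the threshold of 30, and the stated reason are invented for illustration and do not come from any actual study.

```python
# Terse version: the reader must guess what "d", "r", and 30 mean.
def clean(d):
    return [r for r in d if 1 <= r["household_size"] <= 30]


# Self-documenting version: names carry the "what", one comment carries the "why".
MAX_PLAUSIBLE_HOUSEHOLD_SIZE = 30  # hypothetical threshold


def drop_implausible_household_sizes(survey_rows):
    """Keep rows whose household_size lies between 1 and the plausible maximum."""
    # Why (illustrative): values outside this range are treated as entry errors.
    # A real project would cite the memo or log entry behind the decision here.
    return [
        row for row in survey_rows
        if 1 <= row["household_size"] <= MAX_PLAUSIBLE_HOUSEHOLD_SIZE
    ]
```

In the second version, changing the threshold means editing one named constant, so there is no separate comment describing the rule that could fall out of date.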

Documenting tasks and key decisions

To keep research teams organized and to create a record of the project for future reference, maintain documentation of work allocations, tasks, and decisions.

  • Manage tasks with a task management system, and remember that email is not a task management system. Some free task management systems include Asana, Wrike, Flow, Trello, and Slack (Gentzkow and Shapiro, 2014, 33).
  • GitHub and other platforms built on version control software can also be used to track and manage coding tasks.
  • A project log (maintained as a text document, describing a project in a more narrative form) documents decisions about a project. A project log can help maintain a bird’s-eye view of the project and may help resolve confusion in the future about when and why decisions were made.      
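A project log needs no special software; a plain text file with dated entries is enough. As one possible convention (the entry layout and the function name `append_log_entry` are assumptions, not an established standard), a small helper can keep entries uniform:

```python
from datetime import date
from pathlib import Path


def append_log_entry(log_path, author, decision, reason, entry_date=None):
    """Append one dated decision to a plain-text project log and return the entry."""
    entry_date = entry_date or date.today()
    entry = (
        f"{entry_date.isoformat()} | {author}\n"
        f"Decision: {decision}\n"
        f"Reason: {reason}\n\n"
    )
    # Append mode leaves earlier entries untouched, so the log stays a running record.
    with Path(log_path).open("a", encoding="utf-8") as log_file:
        log_file.write(entry)
    return entry
```

Recording the reason next to each decision is what makes the log useful later, when the team needs to recall why as well as when a choice was made.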

Appendix - template folder structure

This template, based on folder structures for projects that J-PAL staff have supported, can help guide the initial setup of folders for a project. This structure is simplified to only a few levels; most projects will have additional, project-specific folder levels. Many folders have an _archive subfolder not shown here. Archive folders can store previous drafts and outdated reference materials that do not need to be available for frequent access. This template emphasizes legibility over strict adherence to folder naming styles. Researchers may wish to customize the use of spacing, capitalization, abbreviation, or number of folders based on their team and their system’s needs.

Top-level folders

  • Admin
  • Call notes
  • Code and data   
  • Data collection process
  • Papers and presentations
  • Research design

Subfolders

Admin

  • DUAs
  • Funding and finance
    • Budgets
    • Grant documents
    • Proposals
  • IRB
  • Project management
    • Data management plans
    • Team protocols
  • RA recruitment
  • Trial registry

Call notes

  • Calls and emails internal
  • Calls and emails with partner
  • Memos on research design

It can be helpful to keep a specific record of memos between the research team and partners about research design. For quick access, this subfolder could be housed near other calls and emails. After the design stage is complete, the subfolder might be archived under “Research design.” This is an example of how subfolders might be re-nested during different phases of a project.

Code and data

Code and data may be stored on a separate server or may be housed within this “Code and data” folder, with subfolders arranged so that the code (within “Analysis”) and raw data (within “raw”) are separated from outputs.

  • Analysis: Subfolders for different analyses can include further subfolders for code, logs, and output.
  • Derived
  • Drafts
  • lib
  • raw

Data collection and study fielding

  • Consent forms
  • Data checks: Check the quality and quantity of data against expectations, and track any unexpected results in real time. Outputs for tracking this (tables, summary documents) can be published here to share with the project team.
  • Data generation process or questionnaires
  • Field visits 
  • Quality control plan
  • Tracking and monitoring: Track that randomization was implemented as planned; track other indicators of proper implementation. 
  • Trainings for implementing staff
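The data checks described in this list can start from a few simple indicators computed each time new data arrive. The sketch below is illustrative only: the function name, the choice of checks (row count, duplicate IDs, missing required fields), and the field names passed in are assumptions, and a real quality control plan will usually include many more.

```python
from collections import Counter


def run_data_checks(rows, id_field, required_fields, expected_count=None):
    """Return simple quality indicators for a list of dict records."""
    # Quantity and uniqueness: how many rows arrived, and which IDs repeat?
    id_counts = Counter(row.get(id_field) for row in rows)
    duplicate_ids = sorted((i for i, n in id_counts.items() if n > 1), key=str)
    # Completeness: count empty or absent values in each required field.
    missing_by_field = {
        field: sum(1 for row in rows if row.get(field) in (None, ""))
        for field in required_fields
    }
    report = {
        "n_rows": len(rows),
        "duplicate_ids": duplicate_ids,
        "missing_by_field": missing_by_field,
    }
    if expected_count is not None:
        # A positive shortfall means fewer records than planned so far.
        report["shortfall"] = expected_count - len(rows)
    return report
```

The returned dictionary can be written to a dated summary file in the “Data checks” folder so that the project team can follow changes over time.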

Papers and presentations

  • Analysis manuscripts
  • Policy outreach
  • Presentations

Research design

  • Background research: As necessary, this can include background research on previous literature, similar interventions, legislation, and partner organizations.
  • Model notes
  • Power calculations
  • Pre-analysis plan
  • Randomization
  • Sampling
  • Theory of change

Acknowledgments

Thanks to Aileen Devlin, Louise Geraghty, Sam Ayers, and Rose Burnam for their advice and suggestions. Chloe Lesieur copyedited this document. This work was made possible by support from Arnold Ventures. Any errors are our own.

1. IPA’s examples of folder structures outline how project files could be organized into different folders by file type and project stage.
2. For discussion of strategies and tools for designing data flows, see the Data Flow Strategies section in J-PAL North America’s guide on Using administrative data for randomized evaluations. This document shares guidance, examples, and four different options for data flow and matching strategies. These options take into account ethical, legal, and resource restrictions, and describe how data flow strategies might depend on the sensitivity of the data in question and on relationships with implementing partners.
3. In order to be able to reproduce the random generation of IDs later, keep in mind that randomization algorithms sometimes change between software versions. One way to address this is to specify the software version clearly in the randomization code.
4. Innovations for Poverty Action (IPA) created a resource, “Reproducible Research: Best Practices for Data and Code Management,” which offers guidelines on reproducible coding in Stata. The BITSS Manual of Best Practices in Transparent Social Science Research includes suggestions about reproducibility related to pre-analysis plans, workflow, Stata-specific practices, and more.
5. Guidelines on self-documenting code and other best practices for coding in teams are included in a guide by researchers Matthew Gentzkow and Jesse Shapiro, “Code and Data for the Social Sciences: A Practitioner’s Guide.”
6. Some teams coordinate task management using GitHub. This tutorial on version control with Git can aid users who are relatively new to using Git. More version control software tools include Subversion and Mercurial.

Additional Resources

Data management plans
  1. Guidelines for Effective Data Management Plans and Data Management Plan Resources and Examples (ICPSR): These resources provide a framework for creating a plan and links to examples of data management plans in various scientific disciplines, with an overview of the elements of a data management plan. A helpful list of definitions clarifies the elements of a data management plan.

  2. Example Plans from Creating a Data Management Plan (University of Minnesota Libraries): These resources include a template for creating a Data Management Plan, examples of plans, and other references from a range of project types. These supplement the resources above by providing specific examples of entire data management plans.  

  3. Managing your data – Project Start & End Checklists (MIT Data Management Services): This checklist (PDF format) is a big-picture overview meant to help researchers set up and maintain robust data management practices for the full life of a project. This will be a helpful resource for readers who prefer an all-in-one list of what steps to take. 

Data management protocols for evaluations using administrative data and/or survey data
  1. J-PAL North America’s “Using administrative data for randomized evaluations.” This guide discusses data requests, data flow, matching, data security, ethics, compliance, data use agreements, timelines, and other considerations for data and file management for evaluations that use administrative data.

  2. IPA’s Research Protocols: These protocols contain “minimum must dos” for data quality, data security, research ethics, knowledge management, and transparency of randomized evaluations, primarily focusing on evaluations that use survey data.

Coding resources
  1. Coding resources from J-PAL (GitHub) and IPA (GitHub) share commands for back checks, scans for personally identifiable information (PII), and other steps that may be helpful for checking that data management plans are running as intended.

  2. Researchers Matthew Gentzkow and Jesse Shapiro created a guide, “Code and Data for the Social Sciences: A Practitioner’s Guide,” outlining best practices for coding with teams.

  3. Innovations for Poverty Action (IPA) created a resource, “Reproducible Research: Best Practices for Data and Code Management,” which offers specific guidelines for coding in Stata.

  4. See the section on Coding Resources for more suggestions of coding tools to help with data management and other topics.

Feeney, Laura, Jason Bauman, Julia Chabrier, Geeti Mehra, and Michelle Woodford. 2015. Updated 2018. “Using Administrative Data for Randomized Evaluations.” J-PAL North America. https://toolkit.povertyactionlab.org/resource/using-administrative-data-randomized-evaluations

Gentzkow, Matthew, and Jesse M. Shapiro. 2014. Code and Data for the Social Sciences: A Practitioner's Guide. University of Chicago mimeo, http://faculty.chicagobooth.edu/matthew.gentzkow/research/CodeAndData.pdf, last updated January 2014.

Pollock, Harrison Diamond, Erica Chuang, and Stephanie Wykstra. 2015. “Reproducible Research: Best Practices for Data and Code Management.” Innovations for Poverty Action (IPA). Accessed March 19, 2019. https://www.poverty-action.org/publication/ipas-best-practices-data-and-code-management