Conduct power calculations
This resource is intended for researchers who are designing and assessing the feasibility of a randomized evaluation with an implementing partner. We outline key principles, provide guidance on identifying inputs for calculations, and walk through a process for incorporating power calculations into study design. We assume some background in statistics and a basic understanding of the purpose of power calculations. We provide links to additional resources and sample code for performing power calculations at the end of the document.
1) Do power calculations early
The benefit of doing any power calculation early on—even if rough—can be large.
- If the study is not feasible, early calculations help you and your partner learn this quickly and avoid weeks or months spent designing a study that would ultimately not proceed.
- Don’t be discouraged by not having data for your study population—initial power calculations can use summary statistics obtained from public sources and existing literature.
- Prepare for an iterative process. If initial calculations suggest the study is feasible, you may iterate and refine calculations as part of the study design process.
2) The hardest part is choosing a reasonable minimum detectable effect (MDE).
There is no universal rule of thumb for determining a "good" minimum detectable effect (MDE): it depends on what is meaningful to the parties involved, weighed against the opportunity cost of doing the research.
- For researchers, this might be informed by the existing literature: what have previous studies of comparable interventions found? What would be the smallest effect size that would be interesting to be able to reject?
- For partners, this might be the smallest effect that would still make it worthwhile to run this program (from their own perspective, or from a funder’s or policymaker’s perspective), as opposed to dedicating resources elsewhere. This may mean the smallest effect that meets their cost-benefit assessment, the smallest effect that is clinically relevant, or some other benchmark.
3) Power calculations are a rough guide, not an exact science.
Power calculations are most useful to assess an order of magnitude. Some degree of refinement—such as using covariates to soak up residual variance or redoing calculations on a more complete dataset—can be valuable. But remember that the exact ex post value of inputs to power will necessarily vary from ex ante estimates; one can quickly hit diminishing returns with continued fine-tuning based on ex ante estimates.
4) The first stage has an outsized effect on power.
The strength of the first stage (taking into account factors like rates of take-up and compliance) is commonly under-appreciated in calculating power or required sample size. Overly optimistic assumptions for the first stage can lead to a severely underpowered second stage. For instance, to be powered to detect the same effect size with 25% take-up, we would need to offer treatment to 16 times more people and provide treatment to 4 times more people (assuming equal numbers of treatment and control) than if we had 100% take-up (McKenzie 2011).
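The arithmetic behind the take-up example can be checked with a short sketch. It assumes that the effect on people who actually take up treatment is held fixed and that the outcome variance is unchanged, so the intention-to-treat effect shrinks in proportion to the first stage and the required sample grows with its inverse square. The function name and the optional control_crossover parameter are illustrative and do not come from the source.

```python
def sample_size_inflation(take_up: float, control_crossover: float = 0.0) -> float:
    """Factor by which the offered sample must grow relative to 100% take-up.

    Assumption: the effect on those who take up treatment is fixed, so the
    intention-to-treat effect scales with the first stage
    (take_up - control_crossover) and required N scales with 1 / first_stage^2.
    """
    first_stage = take_up - control_crossover
    if not 0 < first_stage <= 1:
        raise ValueError("first stage must be in (0, 1]")
    return 1.0 / first_stage ** 2


if __name__ == "__main__":
    for take_up in (1.0, 0.75, 0.5, 0.25):
        offered = sample_size_inflation(take_up)  # multiple of people offered treatment
        treated = offered * take_up               # multiple of people actually treated
        print(f"take-up {take_up:.0%}: offer to {offered:.1f}x, treat {treated:.1f}x")
```

At 25% take-up this gives 16 times as many people offered treatment and 16 × 0.25 = 4 times as many people actually treated.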
Process of power calculations
Given those key principles, we now provide more details on each step of the process, including gathering the information needed, introducing the concept of statistical power to partners, running "back-of-envelope" calculations, deciding whether to proceed, refining calculations, and ultimately deciding whether to run a research study.
Gathering inputs to power calculations
The implementing partner may be a key source of input for power calculations. Some inputs (such as the maximum sample size, a policy- or program-relevant MDE, and a feasible unit of observation) can only be determined in discussion with the partner. Rough estimates of other inputs (such as the mean and variance of key outcomes, take-up rates, and intra-cluster correlation) can be found in previous research or publicly available data. If readily available, data or summary statistics from the partner’s current operations, or from the data source(s) that will be used in the final analysis, may be preferable.
- Run calculations with the best data that are readily accessible. At early stages, avoid getting bogged down seeking access to non-public data. After a proposed project passes a basic test of feasibility, data access arrangements can be sought, and power calculations updated with new data. Before then, consider using:
- Summary statistics from existing literature—experimental and non-experimental academic research or reports from government or non-profits can be useful to benchmark what effect size would be realistic.
- Published data, for example:
- Perform sensitivity analysis for key assumptions. For example, what would power look like if the sample size increased or decreased by an order of magnitude? What if take-up is half of what you expected? Or if intra-cluster correlation is substantially higher or lower? Test how power changes with changes to any critical assumptions. But beware diminishing returns: power calculations are most useful to assess an order of magnitude.
- You may not need to find data for each "ingredient" to run power calculations. From sensitivity analysis you will learn which inputs have the largest impact on power. Focus the team’s efforts on finding good estimates for the inputs that matter most.
- Consider both analytical and simulation methods. Simulated calculations are especially helpful for more complex study designs (McConnell and Vera-Hernandez 2015). With good data on the study population, they can also be used to calculate simulated confidence intervals around null effects, or in power calculations for a small sample where some of the parametric assumptions about probability distributions may not hold.
- Use existing code. J-PAL has created template Stata code and a training exercise for both analytical and simulation methods, available for download as a zip file here, and EGAP provides code for conducting simulations in R (Coppock 2013). The Stata blog has a helpful post on calculating power using Monte Carlo simulations (Huber 2019).
- Consider software that helps you visualize the relationship between sample size and minimum detectable effect. This can be helpful for communicating calculations back to partners.
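To illustrate how the analytical and simulation approaches mentioned above relate, here is a minimal sketch for the simplest case: individual-level randomization, two equal arms, a normally distributed outcome, and no covariates. All function names and input values are hypothetical. The analytical version uses a normal approximation, and the simulation repeatedly draws fake data and counts how often a two-sided test rejects zero.

```python
import random
from statistics import NormalDist, fmean


def analytical_power(effect, sd, n_per_arm, alpha=0.05):
    """Normal-approximation power for a difference in means (two equal arms)."""
    se = sd * (2 / n_per_arm) ** 0.5
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(effect / se - z_crit)


def _sample_var(xs):
    m = fmean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulated_power(effect, sd, n_per_arm, alpha=0.05, reps=1000, seed=0):
    """Share of simulated experiments in which the effect is significant."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(reps):
        control = [rng.gauss(0.0, sd) for _ in range(n_per_arm)]
        treated = [rng.gauss(effect, sd) for _ in range(n_per_arm)]
        diff = fmean(treated) - fmean(control)
        se = ((_sample_var(treated) + _sample_var(control)) / n_per_arm) ** 0.5
        if abs(diff / se) > z_crit:
            rejections += 1
    return rejections / reps


if __name__ == "__main__":
    # Sensitivity check: how does power move as the sample size changes?
    for n in (50, 100, 200):
        a = analytical_power(0.4, 1.0, n)
        s = simulated_power(0.4, 1.0, n, reps=500)
        print(f"n per arm = {n}: analytical {a:.2f}, simulated {s:.2f}")
```

In this simple design the two methods should roughly agree. The simulation becomes more useful when the design departs from these assumptions, since the data-generating step can be replaced with draws that mimic clustering, skewed outcomes, or resampling from real data.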
After running initial calculations, set aside time for a call or meeting with the research partner to discuss the calculation results and decide together whether it makes sense to proceed with the study.
- Walk through the numbers that go into the estimates. Clarify which inputs are your assumptions and which are based on programmatic information. Because implementing partners have in-depth knowledge of the program and context, they may be able to use this information to suggest creative ways to boost power.
- Ensure that you and your partner understand the costs and benefits of proceeding if the study would be underpowered. The risks of an underpowered evaluation go well beyond the risk of spending time and resources on a study that might not yield useful insights. Underpowered evaluations carry important risks for implementing partners. An underpowered study that finds no significant effect may be misinterpreted as the program not having any effect at all, potentially leading implementing organizations or funders to conclude (perhaps incorrectly) that the program is ineffective and should be discontinued.
- Make a decision together. At the end of the call or meeting, discuss next steps. You may jointly decide:
- To discontinue further discussions—perhaps the potential sample size is orders of magnitude too small to be powered to detect a meaningful effect, or the changes to study design that would be required to achieve sufficient power are operationally infeasible or too costly. This is a difficult decision, but it is ultimately better for all parties to have this conversation early than to invest time in designing a study that is very unlikely to proceed. This may be an opportunity to discuss whether there are other potential questions you and the partner could explore.
- To continue discussions—perhaps the study looks promising, but it is still not clear, based on initial calculations, whether it would be sufficiently powered. You may decide to iterate further on the study design and/or on the power calculations before determining whether to proceed.
- To move ahead with the study—you may jointly decide that, based on your assumptions and calculations, the study is likely to be sufficiently powered. If assumptions or the study design change significantly, you may still continue to refine the power calculations.
If the study looks promising, but it is still not clear, based on initial calculations, whether it will be sufficiently powered, research teams can iterate over the details of the study design with the research partner. During this stage, there are two key situations where refinements may be particularly helpful.
- Assessing the effects of design decisions on power. Significant changes in design from what was assumed in initial calculations—such as changing the number of treatment arms, changing intake processes (which might affect take-up), changing the unit of randomization, or deciding you need to detect effects on particular subgroups—should inform, and be informed by, estimates of statistical power.
- Finding better estimates of key inputs. If the study has passed a basic feasibility test, but there were first-order inputs for which you were unable to find satisfactory estimates for initial power calculations, it may be worth seeking additional data to refine power estimates. Consider requesting detailed operational data from the partner, or requesting non-public survey or administrative data from a third party.
There are diminishing marginal returns to refining power calculations. If initial calculations were satisfactory, refinements may be minimal or may not be necessary at all. However, if the following points were not considered in initial calculations, they should be considered before making final design decisions:
- Determine primary and secondary outcomes that the study should be powered to detect, and run calculations on all outcomes.
- If the study design includes multiple arms, run calculations for each pairwise comparison. If, for instance, the control group will be compared with two different treatments, ensure that the control group is large enough to be powered for the smaller MDE of the two comparisons and, if desired, that the study is powered to distinguish a meaningful difference between the two treatment groups.
- If randomization will be clustered, ensure that power calculations incorporate estimates of within- and between-cluster variance of the outcome variable.
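For the clustered case, a rough sketch of how intra-cluster correlation enters the calculation is shown below. It uses the standard design-effect approximation, 1 + (m - 1) × ICC for clusters of equal size m, together with a normal approximation rather than a t distribution. The function name and all input values are hypothetical.

```python
from statistics import NormalDist


def clustered_mde(n_clusters, cluster_size, icc, sd=1.0,
                  share_treated=0.5, alpha=0.05, power=0.80):
    """Approximate MDE when randomizing equal-sized clusters.

    Assumptions: equal cluster sizes, no covariates, normal approximation.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    design_effect = 1 + (cluster_size - 1) * icc  # variance inflation from clustering
    n_total = n_clusters * cluster_size
    var_of_estimate = (sd ** 2 * design_effect
                       / (share_treated * (1 - share_treated) * n_total))
    return multiplier * var_of_estimate ** 0.5


if __name__ == "__main__":
    # Sensitivity of the MDE (in standard deviations) to the assumed ICC.
    for icc in (0.0, 0.05, 0.1, 0.2):
        print(f"ICC = {icc:.2f}: MDE = {clustered_mde(100, 20, icc):.3f} sd")
```

With 100 clusters of 20 people, moving the ICC from 0 to 0.1 multiplies the MDE by about 1.7, which is why it is worth testing a range of plausible values.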
After refining power calculations, you may jointly decide that the study is not feasible and to discontinue discussions. Alternatively, if the research team is satisfied that the study would be adequately powered, and the research partner is satisfied that the chosen MDE is meaningful to them, you may jointly decide to take a leap and launch the study.
Ingredients to perform power calculations: sources and tips
Ingredients based on decisions or assumptions
- Primary and secondary outcomes: There may be many potentially interesting outcomes, but each outcome will require its own calculations. To facilitate initial calculations, agree upon a limited set of crucial outcomes on which to focus. When refining calculations, run power calculations on all outcomes of interest.
- Sample size: Ask questions to clarify the potential sample size:
- How many people currently receive services within a given period?
- If required to achieve a large enough sample size, would the partner be open to running a study over a longer period?
- Is there capacity to serve more people? Could there be a creative way to increase the sample while still being mindful of existing service constraints?
- Minimum detectable effect:
- Effects found in previous research
- Estimates provided by program's implementer or designer
- Are the effects found by previous studies likely to be positively or negatively biased?
- What effect size would be academically interesting?
- What effect size would be needed for the benefits of the program to outweigh the costs? Or for this program to be preferable to alternatives?
- What effect size would make funders or policymakers interested in scaling up the program?
- A partner's perceptions of the impact of their program may be different from a decision-relevant effect size.
- If your initial estimate of the MDE is in standard deviations, assess the practical relevance of the MDE in absolute terms as well.
- Allocation to treatment and control: Start by assuming equal allocation to each study arm. In some cases, the marginal cost of additional control group units is very low compared to treatment group units (e.g., when using administrative datasets to measure outcomes). In these cases, for a particular budget constraint, power may be maximized by increasing the ratio of control to treatment (McConnell and Vera-Hernandez 2015).
- Unit of observation and level of randomization:
- Studies that randomize at more granular levels (e.g. at the classroom level instead of school level) generally have greater statistical power for a given number of individuals.
- The unit of observation and level of randomization do not need to be the same. Depending on intra-cluster correlation (see below), studies randomizing at a higher level may be able to boost power somewhat by measuring outcomes at a more granular unit of observation (e.g., randomizing at the classroom level but observing student-level outcomes).
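The point above about allocation under a budget constraint can be made concrete with a small sketch. Under the simplifying assumptions of individual-level randomization, equal outcome variance in both arms, and constant per-unit costs, the MDE is proportional to sqrt(1/n_T + 1/n_C), and the MDE-minimizing ratio of control to treatment units is sqrt(cost_T / cost_C). The budget and cost figures below are made up for illustration, and sample sizes are treated as continuous.

```python
from statistics import NormalDist


def mde_two_arm(n_treat, n_control, sd=1.0, alpha=0.05, power=0.80):
    """Normal-approximation MDE for a difference in means."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return multiplier * sd * (1 / n_treat + 1 / n_control) ** 0.5


def allocate(budget, cost_treat, cost_control, ratio):
    """Spend the whole budget with `ratio` control units per treatment unit."""
    n_treat = budget / (cost_treat + ratio * cost_control)
    return n_treat, ratio * n_treat


if __name__ == "__main__":
    budget, cost_treat, cost_control = 100_000, 100, 4  # hypothetical figures
    optimal_ratio = (cost_treat / cost_control) ** 0.5
    for ratio in (1, 2, optimal_ratio, 10):
        n_t, n_c = allocate(budget, cost_treat, cost_control, ratio)
        print(f"control:treatment = {ratio:.0f}:1, "
              f"n_T = {n_t:.0f}, n_C = {n_c:.0f}, "
              f"MDE = {mde_two_arm(n_t, n_c):.3f} sd")
```

When control units are much cheaper than treatment units, shifting the budget toward a larger control group lowers the MDE relative to an equal split, up to the point given by the square-root rule.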
Ingredients requiring data
- Mean and variance of outcomes:
- Estimates from previous research, observational literature, or publicly available data
- Estimates from partner's operational records
- For binary variables, variance can be calculated from the mean. Using binary outcomes (e.g., hospital admission, homeless shelter re-entry, or college enrollment) may allow you to base initial calculations on reports that do not publish standard deviations. Requesting initial summary statistics from implementing partners can also be simpler if the request is only for means and not standard deviations.
If P(x = 1) = p, then var(x) = p * (1 - p)
- Take-up rates or compliance/attrition:
- Previous research
- Existing program data or pilot study
- Take-up and compliance assumptions are often overly optimistic, and power calculations can be very sensitive to this assumption. For example, to be powered to detect the same effect size with 25% take-up, we would need to offer treatment to 16 times more people and provide treatment to 4 times more people (assuming equal numbers of treatment and control) than if we had 100% take-up.
- If low take-up is a concern, consider designing the study so that treatment/control status is assigned only after participants agree to participate.
- Covariate controls:
- A description or data dictionary for the planned analysis dataset may list variables that could be used as covariates
- Correlations between outcomes and covariates in other datasets
- Including covariates in calculations can boost power. For initial calculations, assume no covariates. When refining study design, consider which covariates will be available in your dataset and which are likely to be highly correlated with the outcome (so could potentially soak up a lot of residual variation and thus improve power).
- Intra-cluster correlation:
- Partner's input
- Previous research
- Estimates from publicly available data
- Test sensitivity with a range of reasonable assumptions.
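As a rough illustration of two of the data-based ingredients above, the sketch below derives a standard deviation from a binary outcome's mean and shows how covariates that explain a share r_squared of outcome variance shrink the MDE. It assumes individual-level randomization and a normal approximation, and it applies the baseline variance to both arms. The baseline rate, sample size, and R-squared values are hypothetical, and treating a binary outcome's covariate adjustment through a linear R-squared is itself an approximation.

```python
from statistics import NormalDist


def binary_sd(p):
    """Standard deviation of a 0/1 outcome with mean p: sqrt(p * (1 - p))."""
    return (p * (1 - p)) ** 0.5


def mde(n_total, sd, r_squared=0.0, share_treated=0.5, alpha=0.05, power=0.80):
    """Approximate MDE in the outcome's own units.

    r_squared is the assumed share of outcome variance explained by covariates.
    """
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    residual_sd = sd * (1 - r_squared) ** 0.5
    return multiplier * residual_sd / (share_treated * (1 - share_treated) * n_total) ** 0.5


if __name__ == "__main__":
    baseline_rate = 0.20  # e.g., a published hospital admission rate (hypothetical)
    sd = binary_sd(baseline_rate)
    for r2 in (0.0, 0.2, 0.36):
        points = 100 * mde(2000, sd, r_squared=r2)
        print(f"R^2 = {r2:.2f}: MDE = {points:.1f} percentage points")
```

Because the result is already in percentage points, it can be compared directly with the baseline rate when judging whether the MDE is practically meaningful.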
Before beginning conversations with a partner about ingredients of power, implications of power calculations, or program changes necessary to achieve a certain level of power, take the time to explain the concept of statistical power. After establishing a common understanding around what power is and why it is important, talk through the ingredients in more detail. Talking points and resources to introduce non-technical partners to statistical power are presented at the end of this resource. The following points often come up in conversations about study design and are worth clarifying early:
- Establish whether sample size is a constraint. Typically, studies fall into one of two categories. Either:
- The potential sample is fixed. The research team uses power calculations to estimate the MDE that the study would be powered to detect, then decides whether that MDE is reasonable, or
- The potential sample is flexible (either because there is the possibility of adding to the sample or of reducing intervention arms). In this case, the research team can first decide on what a reasonable MDE would be, then use power calculations to estimate what sample size would be needed to adequately power the study.
- Beware inflated effect estimates, both in the literature and in partners’ suggested MDE. In discussing the MDE, it may be helpful to talk through potential reasons why partners’ perceptions of the program’s impact could be different from its measured impact.
- For instance, for a service-based program, partners may calculate impact based on program participants with whom they worked intensively, and exclude those who signed up for the program but never returned to receive services. In a randomized evaluation, outcomes for both of these groups would be included in estimating the treatment effect.
- Consider the strength of the identification strategy and publication bias when reviewing the effect sizes in published literature.
- Define "sample size" with the partner. Researchers think of sample size as the total number of units in the treatment and control groups. However, partners must understand how the required sample size will affect the scale of their operations—i.e., how many individuals or units they must recruit into the study or how many they must treat. For this reason, they may interpret "sample size" as the number allocated to the treatment group, or the number receiving treatment after accounting for take-up rates. When discussing estimates of potential sample size, recruitment rates, or attrition rates, clarify that you and the partner have a common understanding.
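The two situations described above (a fixed sample versus a flexible one) correspond to solving the same formula in opposite directions. The sketch below shows both, plus a small helper that restates a total sample size in the terms a partner may have in mind. It assumes individual-level randomization, a normal approximation, and an MDE defined as the intention-to-treat effect. All function names and numbers are hypothetical.

```python
import math
from statistics import NormalDist


def _multiplier(alpha, power):
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def mde_for_sample(n_total, sd, share_treated=0.5, alpha=0.05, power=0.80):
    """Fixed sample: what effect is the study powered to detect?"""
    return (_multiplier(alpha, power) * sd
            / math.sqrt(share_treated * (1 - share_treated) * n_total))


def sample_for_mde(mde, sd, share_treated=0.5, alpha=0.05, power=0.80):
    """Flexible sample: how many units are needed for a chosen MDE?"""
    n = (_multiplier(alpha, power) * sd / mde) ** 2 / (share_treated * (1 - share_treated))
    return math.ceil(n)


def partner_view(n_total, share_treated=0.5, take_up=1.0):
    """Restate 'sample size' as the counts a partner may care about."""
    assigned = round(n_total * share_treated)
    return {
        "total_sample": n_total,
        "assigned_to_treatment": assigned,
        "expected_to_receive_treatment": round(assigned * take_up),
    }


if __name__ == "__main__":
    print(f"MDE with 500 units: {mde_for_sample(500, sd=1.0):.2f} sd")
    n_needed = sample_for_mde(0.2, sd=1.0)
    print(f"Units needed for an MDE of 0.2 sd: {n_needed}")
    print(partner_view(n_needed, take_up=0.6))
```

Presenting all three counts side by side can help confirm that researchers and partners mean the same thing when they say "sample size".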
Talking points for non-technical conversations about power
Early discussions with partners about power can help set researchers up for a successful partnership later on—both because understanding the reasons for design decisions can help increase partners’ investment in the success of the study, and because a better understanding of power can enable partners to flag potential threats to design that may arise during implementation.
What is statistical power?
The power of an evaluation reflects how likely we are to detect a meaningful change in an outcome of interest brought about by a program, if such a change truly occurs. Most studies aim for power of 80 percent or higher. Power of 80 percent means that there is a 20 percent chance of failing to detect an impact of a particular size when, in fact, the intervention has one. The sample size needed to achieve sufficient power varies from case to case.
I trust you to do the math—why do I need to understand this?
Say, for instance, we are studying the impact of a job training program on participants’ income. We set the MDE at 10%, which means we are powered to detect a 10% (or more) increase in participants’ income due to the program. Imagine that the actual effect of the program is lower than 10%—for example, a 7% increase in income. This might still provide a substantial improvement in quality of life for participants, may more than pay for the cost of delivering the training, and may be exciting for policymakers and funders. But because our MDE is 10%, our study may not be able to distinguish this 7% increase from zero (i.e., we may not find a statistically significant result). Instead, we may conclude that the program had no detectable effect.
All else equal, with a larger sample, we are better able to detect true effects that are smaller. We need to agree on an acceptable effect size, and ensure that you are aware of what we will and will not be able to learn from the results.
Why is sufficient power important?
Budgetary, program, and timing constraints may create pressure to conduct an “underpowered” evaluation, but there are risks to doing so. An underpowered evaluation may consume substantial time and monetary resources while providing little useful information, or worse, tarnish the reputation of a (potentially effective) program. When a study with insufficient power does not find a statistically significant result, we say that we found no evidence of an effect, but this does not mean that we found evidence of no effect. However, funders, media, and the general public can easily conflate “finding no evidence of an effect” with a “finding of no effect.” As a result, inconclusive findings can damage the reputation of an organization or program nearly as much as conclusive findings of no effect.
Thanks to Maya Duru, Amy Finkelstein, Noreen Giga, Kenya Heard, Sarah Kopper, Rohit Naimpally, and Anja Sautmann for their thoughtful contributions. Caroline Garau copy-edited this document. This work was made possible by support from the Alfred P. Sloan Foundation and Arnold Ventures.
Rachel Glennerster’s lecture "Sampling and Sample Size" (video recording, https://www.povertyactionlab.org/research-resources/teaching) provides an introduction to the concept of statistical power.
The book Running Randomized Evaluations: A Practical Guide (Glennerster and Takavarasha 2013) includes a detailed chapter on statistical power and its ingredients. The companion website, runningres.com, includes data and sample exercises for power.
EGAP’s "10 things you need to know about statistical power" is an accessible guide that provides both information on what power calculations are and why they are important, and practical guidance on implementing them (Coppock 2013).
The section "Power calculations: how big a sample size do I need?" in the World Bank’s e-book Impact Evaluation in Practice (Gertler, Martinez, Premand, Rawlings and Vermeersch 2010) provides an introduction to the concept, and works through examples of power calculations for different study designs.
The chapter “Sample Size and Power Calculations” from Data Analysis Using Regression and Multilevel/Hierarchical Models provides an in-depth technical overview of considerations related to power (Gelman and Hill 2006).
The blog post “Did you do your power calculations in standard deviations? Do them again…” provides further information about MDE in terms of standard deviations and in absolute terms (Ozler 2016).
The blog post “What is success, anyhow?” discusses considerations related to decision-relevant effect sizes in more detail (Goldstein 2011).
The blog post “Power Calculations 101: Dealing with Incomplete Take-up” provides information on incomplete take-up and power, as well as a detailed description of the effect of the first stage on power (McKenzie 2011).
The Stata blog has a helpful post on calculating power using Monte Carlo simulations (Huber 2019).
J-PAL’s Six Rules of Thumb for Determining Sample Size and Statistical Power is a tool for policymakers and practitioners describing some of the factors that affect statistical power and sample size.
J-PAL’s "The Danger of Underpowered Evaluations" highlights reasons why an underpowered evaluation may consume substantial time and monetary resources while providing little useful information, or worse, tarnish the reputation of a (potentially effective) program.
J-PAL's Code: Power Calculations in Stata (download link)
The Institute for Fiscal Studies’ online guide "Going beyond simple sample size calculations: a practitioner’s guide" (McConnell and Vera-Hernandez, 2015) provides a technical guide to more complex study designs, with accompanying spreadsheets to implement calculations.
EGAP’s "10 things you need to know about statistical power" provides sample code for power simulations in R (Coppock 2013).
The DeclareDesign package walks through power simulations in R.
MDRC’s "Statistical power in evaluations that investigate effects on multiple outcomes: a guide for researchers" (Porter, 2016) provides guidance on incorporating adjustments for multiple-hypothesis testing into power calculations.
Glennerster, Rachel, and Kudzai Takavarasha. 2013. Running Randomized Evaluations: A Practical Guide. Princeton: Princeton University Press.
Goldstein, Markus. 2011. “What is success, anyhow?” Development Impact (blog). April 19, 2011. http://blogs.worldbank.org/impactevaluations/what-is-sucess-anyhow
McConnell, Brendon, and Marcos Vera-Hernandez. 2015. “Going beyond Simple Sample Size Calculations: A Practitioner’s Guide.” IFS Working Paper, September 2015. https://www.ifs.org.uk/publications/7844
McKenzie, David. 2011. “Power Calculations 101: Dealing with Incomplete Take-Up.” Development Impact (blog). May 23, 2011. http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up