Household survey data and selection criteria
Data used in this analysis are drawn from the WHO’s Household Energy Database16, a regularly updated compilation of nationally representative household survey data for WHO Member States from various sources, detailed in Supplementary Table 1. Surveys in the database were downloaded manually and collated using Microsoft Excel (version 16.50) and occasionally Stata/SE (version 15.1). The version of the database used for this analysis (30 January 2020) comprises 1353 surveys collected from a total of 170 countries (including high-income countries) between 1960 and 2018.
For this analysis we exclude surveys from before 1990, and only include data from surveys providing individual fuel breakdowns and with less than 15% of the population in total categorized as “missing”, “not cooking in the household”, or “mainly cooking with other fuels”. The model did not differentiate between surveys that reported only household-weighted fuel use estimates and those that reported only population-weighted estimates. Where surveys reported both household-weighted and population-weighted estimates, only population-weighted estimates were used, in order to best estimate the population reliant on different cooking fuels. Using these selection criteria, 1136 surveys, collected from 153 countries, were used for modeling. Supplementary Table 1 shows both the number of surveys in the database and the number used for modeling from each data source, and Supplementary Table 2 shows the number of survey data points excluded for failing to meet inclusion criteria.
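The inclusion rules above can be sketched as a simple filter. This is an illustration only, not the code used for the analysis: the `Survey` record, its field names, and the example values are hypothetical, and only the 1990 cut-off and the 15% threshold are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Survey:
    # Hypothetical record layout; field names are illustrative assumptions.
    country: str
    year: int
    has_fuel_breakdown: bool   # reports individual fuels, not only aggregates
    pct_missing: float         # % categorized as "missing"
    pct_not_cooking: float     # % "not cooking in the household"
    pct_other_fuels: float     # % "mainly cooking with other fuels"


def is_eligible(s: Survey, min_year: int = 1990, max_excluded_pct: float = 15.0) -> bool:
    """Apply the three inclusion criteria described in the text."""
    excluded_total = s.pct_missing + s.pct_not_cooking + s.pct_other_fuels
    return (
        s.year >= min_year
        and s.has_fuel_breakdown
        and excluded_total < max_excluded_pct
    )


# Made-up example surveys, one per failure mode plus one eligible survey.
surveys = [
    Survey("A", 1985, True, 1.0, 0.5, 2.0),    # conducted before 1990
    Survey("A", 2005, True, 1.0, 0.5, 2.0),    # eligible
    Survey("B", 2010, False, 0.0, 0.0, 0.0),   # no individual fuel breakdown
    Survey("C", 2015, True, 5.0, 4.0, 7.0),    # 16% in excluded categories
]
eligible = [s for s in surveys if is_eligible(s)]
print([(s.country, s.year) for s in eligible])
```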
Surveys included in the database are inconsistent in the questions posed to households about cooking (typical questions by survey source are included in Supplementary Table 1). Most survey questions focus on the main or primary type of cooking fuel or energy rather than the cooking device, and thus the database version included in this study does not contain comprehensive data on solid fuel stove type (e.g., forced draft, brand information). Almost all surveys assess only the primary, or main, cooking fuel or energy source, which constrains the analysis to the primary fuel and technology used for cooking, although it is well documented that households often “stove-stack”, i.e., use multiple stoves and/or fuels17,18,19. Most surveys report the percentage of respondents mainly using each fuel separately for urban and rural areas. The definitions of urban and rural may vary by country, and we adopt these reported values directly rather than applying any standard definition of urban and rural.
The WHO Household Energy Database contains data on the proportion of households mainly using a wide variety of cooking fuels, including alcohol fuels (e.g., ethanol), biogas, charcoal, coal, crop residues, dung, electricity, kerosene, liquefied petroleum gas (LPG), natural gas, solar energy, and wood. However, surveys are not always consistent in the fuel options they present to respondents. In particular, some surveys combine fuels into a single option (notably, natural gas and LPG are often combined into the category “gas”). As a result, the time series of survey data for certain individual fuels can be unstable or unreliable in some countries.
Where appropriate in terms of similarity of health impacts and relevance to policymakers, these issues can be remedied by combining affected fuels into a single category for modeling purposes. Here, we combine wood, crop residues, and dung into the category “biomass”, representing the combined use of unprocessed/raw biomass fuels, and we combine LPG, natural gas, and biogas into the category “gas” (refer to Fig. 1 for a visual representation of these categories). Although solar and ethanol are considered clean fuels, they have been included under the category “other fuels” because few data points are available for these fuels (105 total data points for solar energy ranging between 0 and 0.8%; seven total data points for ethanol ranging between 0 and 0.14%).
We therefore estimate the population mainly using six fuel types: 1. biomass, 2. charcoal, 3. coal, 4. kerosene, 5. gas, and 6. electricity. A final category, “other fuels”, represents the aggregate use of minor clean fuel types, e.g., solar and ethanol. Estimates for overall “polluting” and overall “clean” fuel use are then derived by aggregating estimates of relevant fuel types. “Other fuels” were not modeled individually but are included in the aggregate “clean” category.
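As a minimal sketch of this categorization, the mapping below groups survey-level fuels into the modeled categories and then into “clean” and “polluting” totals. The dictionary keys, function names, and example percentages are illustrative assumptions; the split assumes that gas, electricity, and “other fuels” count as clean and that biomass, charcoal, coal, and kerosene count as polluting.

```python
# Survey-level fuel -> modeled category, following the groupings in the text.
FUEL_TO_CATEGORY = {
    "wood": "biomass",
    "crop residues": "biomass",
    "dung": "biomass",
    "charcoal": "charcoal",
    "coal": "coal",
    "kerosene": "kerosene",
    "LPG": "gas",
    "natural gas": "gas",
    "biogas": "gas",
    "electricity": "electricity",
    "solar": "other fuels",
    "ethanol": "other fuels",
}

# Assumption: these categories make up the aggregate "clean" group.
CLEAN_CATEGORIES = {"gas", "electricity", "other fuels"}


def aggregate_to_categories(survey_pcts):
    """Sum survey percentages within each modeled category."""
    totals = {}
    for fuel, pct in survey_pcts.items():
        category = FUEL_TO_CATEGORY[fuel]
        totals[category] = totals.get(category, 0.0) + pct
    return totals


def clean_and_polluting(category_pcts):
    """Return (clean %, polluting %) from category-level percentages."""
    clean = sum(p for c, p in category_pcts.items() if c in CLEAN_CATEGORIES)
    polluting = sum(p for c, p in category_pcts.items() if c not in CLEAN_CATEGORIES)
    return clean, polluting


# Made-up survey percentages for illustration.
example = {
    "wood": 40.0, "dung": 5.0, "charcoal": 10.0, "kerosene": 5.0,
    "LPG": 25.0, "natural gas": 5.0, "electricity": 9.5, "solar": 0.5,
}
categories = aggregate_to_categories(example)
clean_pct, polluting_pct = clean_and_polluting(categories)
print(categories, clean_pct, polluting_pct)
```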
The global household energy model
Previous statistical models for estimating fuel use have focused on a single variable, i.e., solid fuel use or polluting fuel use9,20. Instead, we sought to model how a strongly related set of variables (the proportion of the population using each individual fuel type) changes over time, under the key constraint that as the use of one fuel increases the sum of the others must decrease, so that the total never exceeds 100%. No standard statistical procedure is available to achieve this while also properly quantifying the uncertainty associated with estimates for each fuel. This motivated the development of the bespoke Global Household Energy Model11 (GHEM), a state-of-the-art Bayesian hierarchical approach21 to jointly estimating the use of individual fuels for cooking.
Trends in the proportions using each fuel type are modeled together for both urban and rural areas of each country using smooth functions of time (thin-plate splines) as the only covariate. Estimates produced by the model are realistic in the sense that, for each country, urban, rural, and overall fuel use is linked by estimates of the survey sample urban proportion (including for years without surveys), also based on smooth functions of time.
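One simple way to picture the link between urban, rural, and overall fuel use is as a weighted average, with the urban proportion as the weight. The sketch below assumes this form for illustration; the exact formulation inside GHEM is given in the model publication and Supplementary Software, and the function name and example values here are hypothetical.

```python
def overall_share(urban_share, rural_share, urban_proportion):
    """Combine urban and rural fuel shares into an overall share.

    Assumes overall = u * urban + (1 - u) * rural, where u is the
    urban proportion (between 0 and 1).
    """
    if not 0.0 <= urban_proportion <= 1.0:
        raise ValueError("urban_proportion must lie between 0 and 1")
    return urban_proportion * urban_share + (1.0 - urban_proportion) * rural_share


# Made-up values: 80% of urban and 20% of rural residents mainly use gas,
# and 40% of the population is urban.
gas_overall = overall_share(urban_share=0.80, rural_share=0.20, urban_proportion=0.40)
print(round(gas_overall, 2))
```

Under this form the overall share always lies between the urban and rural shares, which is one sense in which the three sets of estimates stay mutually consistent.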
The model outputs Bayesian “posterior” probability distributions for fuel use in a given year and country, which can be used to answer questions like “What is the probability that the use of coal exceeds 10% in urban areas of Mongolia?”. For reporting purposes, summaries of these distributions can be taken to provide both point estimates (e.g., means or medians, the latter being what we present here and in the Supplementary Information/Data) and measures of uncertainty (e.g., 95% prediction intervals (PIs), meaning that there is a 95% probability that fuel use lies within the given range). Here, we use the term “uncertainty interval” to describe central 95% posterior credible/prediction intervals.
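The following sketch shows how such summaries might be computed from a set of posterior samples. The samples here are simulated from an arbitrary beta distribution purely as a stand-in for model output; the helper names and the linear-interpolation quantile rule are assumptions made for the example.

```python
import random
import statistics


def quantile(sorted_values, q):
    """Quantile by linear interpolation between adjacent order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def summarise(samples, threshold):
    """Posterior median, central 95% interval, and P(value > threshold)."""
    ordered = sorted(samples)
    return {
        "median": statistics.median(ordered),
        "lower": quantile(ordered, 0.025),
        "upper": quantile(ordered, 0.975),
        "prob_exceeds": sum(x > threshold for x in ordered) / len(ordered),
    }


# Stand-in for 4000 posterior samples of a fuel-use proportion (not real output).
rng = random.Random(1)
samples = [rng.betavariate(12, 88) for _ in range(4000)]

summary = summarise(samples, threshold=0.10)
print(summary)
```

The `prob_exceeds` entry is the fraction of samples above the threshold, which is how a question such as “does coal use exceed 10%?” can be answered directly from the posterior.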
GHEM is implemented using custom code (fully provided in Supplementary Software 1) in the R programming language (version 4.0.0) and the NIMBLE22 software package (version 0.10.1) for Bayesian statistical modeling with Markov chain Monte Carlo (MCMC). We also used the following R packages for our analysis: abind (1.4-5); coda (0.19-3); doParallel (1.0.15); ggfan (0.1.3); ggplot2 (3.3.0); grid (4.0.0); gridExtra (2.3); mgcv (1.8-33); openxlsx (4.1.5); RColorBrewer (1.1-2); readxl (1.3.1); reshape2 (1.4.4); rgdal (1.5-16); scales (1.1.0); and tidyverse (1.3.0). The version of GHEM used for this analysis differs from the previously published version11 in that no regional structures were assumed a priori. Non-informative prior distributions were assumed for all model parameters11. We ran four MCMC chains from distinct randomly generated sets of initial values, using different random number generator seeds for each chain. We ran each chain for 80,000 iterations, discarding the first 40,000 from each chain as “burn-in” and then thinning by a factor of 40 to reduce system memory usage. The result is a total of 4000 posterior samples for each model parameter, which are used to calculate posterior medians and central 95% posterior credible/prediction intervals.
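The sample count follows from the chain settings: (80,000 − 40,000) / 40 = 1000 retained draws per chain, and 4 × 1000 = 4000 in total. The short check below reproduces this arithmetic; the function name is hypothetical, and the exact iteration indices retained by the sampler may differ from those implied here, although the count is the same.

```python
def retained_samples(n_chains, n_iter, burn_in, thin):
    """Number of posterior samples kept after burn-in and thinning."""
    # Keep every `thin`-th iteration after discarding the burn-in period.
    per_chain = len(range(burn_in, n_iter, thin))
    return n_chains * per_chain


total = retained_samples(n_chains=4, n_iter=80_000, burn_in=40_000, thin=40)
print(total)
```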
The probability distributions assumed for input survey data do not allow for inputs where the sum of the percentage mainly using all mutually exclusive fuel categories exceeds 100% (110 surveys, with a median total excess of 0.01%), which can occur due to rounding at different stages of data collection. For these surveys, fuel use values were uniformly scaled (divided by the sum of mutually exclusive categories), to have a total of 100%. Countries classified as high-income according to the World Bank country classification23 (60 countries) are assumed to have fully transitioned to clean household energy and are reported as >95% access to clean fuels and technologies1. In addition, no estimates are provided for LMICs where no surveys were available or suitable for modeling post-1990 (Bulgaria, Cuba, Lebanon, and Libya). Modeled estimates for the use of overall clean, overall polluting and specific fuels are therefore provided for a total of 130 countries—128 LMICs plus two countries with no World Bank income classification (Cook Islands and Niue).
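The rescaling step for surveys whose categories sum to slightly more than 100% can be sketched as follows. The function name and example values are illustrative; the only rule taken from the text is that each value is divided by the sum of the mutually exclusive categories so that the total becomes 100%.

```python
def rescale_if_over(pcts, total=100.0):
    """Scale fuel percentages uniformly when their sum exceeds `total`.

    Surveys that sum to `total` or less are returned unchanged.
    """
    current_sum = sum(pcts.values())
    if current_sum <= total:
        return dict(pcts)
    return {fuel: value * total / current_sum for fuel, value in pcts.items()}


# Made-up survey with a small rounding excess (sums to 100.01%).
survey = {"biomass": 60.01, "gas": 30.0, "electricity": 10.0}
rescaled = rescale_if_over(survey)
print(rescaled, sum(rescaled.values()))
```

Because every value is multiplied by the same factor, the relative shares of the fuels are unchanged.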
Population data from the United Nations Population Division (2019 version) were used to derive the population-weighted regional and global aggregates. We present aggregate estimates for the eight SDG regions, as well as for the six WHO regions. LMICs without suitable survey data were excluded from all regional calculations, and high-income countries were excluded from regional calculations for specific fuels; this means our regional estimates for specific fuels (e.g., gas) refer only to LMICs in those regions. Values of 100% clean fuel use were used for high-income countries when calculating regional aggregates of clean and polluting fuel use.
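A population-weighted aggregate of this kind can be sketched as below. The country populations and shares are invented for illustration, and the function name is hypothetical; the example mirrors the two rules in the text, namely that high-income countries enter clean-fuel aggregates at 100% and are left out of aggregates for specific fuels.

```python
def population_weighted_share(countries):
    """Population-weighted mean share over (population, share) pairs."""
    total_population = sum(pop for pop, _ in countries)
    return sum(pop * share for pop, share in countries) / total_population


# Invented region: two LMICs and one high-income country (populations in millions).
lmic_clean = [(50.0, 0.40), (30.0, 0.70)]
high_income_clean = [(20.0, 1.00)]   # treated as 100% clean fuel use

# Clean fuel use: high-income countries are included at 100%.
regional_clean = population_weighted_share(lmic_clean + high_income_clean)

# Specific fuel (e.g., gas): LMICs only.
lmic_gas = [(50.0, 0.30), (30.0, 0.50)]
regional_gas = population_weighted_share(lmic_gas)

print(round(regional_clean, 3), round(regional_gas, 3))
```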
We also project observed trends in fuel use into the future using GHEM. These future projections were developed by extrapolating observed trends, representing a “business-as-usual” scenario assuming no new policies or interventions.
The degree of uncertainty associated with such projections depends on a number of factors which vary by country, including the number of surveys conducted near present day and how changeable the trends are estimated to be over the available data period (1990–2018)—for example, projections for a country where trends are linear may display less uncertainty than a country with sudden changes in fuel use (e.g., Indonesia). The model has been validated11 for making fuel use predictions up to 5 years beyond the end year of the data. Hence for years close to the end of the data period (e.g., 2019, 2020, 2021), point estimates and 95% prediction intervals can be interpreted as predictions of what may happen based on trends in the data. Further into the future, uncertainty tends to grow beyond practical levels but point estimates remain useful for policy purposes with a specific interpretation: what may happen if observed trends continue and no new policies or interventions are introduced.
Our estimates of the populations mainly using polluting fuels for cooking are used by the WHO to estimate the global burden of disease from household air pollution3. Future WHO burden of disease estimates are anticipated to be calculated based on estimated populations mainly using specific fuels and technologies for cooking.
Other institutions have also developed burden of disease estimates for household air pollution based on cooking fuels. Their results vary, but all convey the same message: millions of premature deaths annually and hundreds of millions of years of healthy life lost due to exposure to household air pollution24,25,26.
The authors alone are responsible for the views expressed in this article and they do not necessarily represent the views, decisions or policies of the institutions with which they are affiliated.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.