The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data.
Guide to Using Datasets
One of the primary goals of The DHS Program is to produce high-quality data and make it available for analysis in a coherent and consistent form. However, national surveys in developing countries are prone to incomplete or partial reporting of responses. Additionally, complex questionnaires inevitably allow scope for inconsistent responses to be recorded for different questions. For the analyst this results in a data file containing incomplete or inconsistent data, complicating the analysis considerably.
In order to avoid these problems, The DHS Program has adopted a policy of editing and imputation which results in a data file that accurately reflects the population studied and may be readily used for analysis. Primary data quality policies include:
During processing of the survey data and in the survey final reports, there are specific rules on how to deal with missing values and other special circumstances. Knowing these rules helps data users and analysts make sense of the survey data.
A "missing value" is defined as a variable that should have a response but does not, either because the question was not asked (due to interviewer error) or because the respondent did not want to answer. The general rule for survey data processing is that under no circumstances should an answer be made up. Instead, a missing value is assigned in the data file. In the final report, the way missing values are handled depends on whether the table shows 1) a percent distribution or 2) individual cell percentages of respondents that do not sum to 100 percent. For tables presenting a percent distribution that sums to 100 percent, missing values are shown when they account for at least 1 percent of cases in any row. When missing values account for less than 1 percent of the distribution in every row, the author decides whether to show them. For tables showing individual cell percentages of respondents, rows of missing values are not shown.
Other special responses and codes are “inconsistent,” “don’t know,” and “blank” (or “not applicable”). “Missing,” “inconsistent,” “don’t know,” and “blank” codes are excluded when calculating statistics such as means or medians; otherwise they are treated as real values.
In a survey, sampling weights are adjustment factors applied to each case in tabulations to adjust for differences in probability of selection and interview between cases in a sample, either due to design or happenstance.
In a nationally representative survey, the sample is often selected with unequal probability in order to increase the number of cases available for certain areas or subgroups for which statistics are needed. In this case, weights must be applied when tabulations are made so that each area or subgroup is represented in its proper proportion.
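The effect of applying weights in a tabulation can be illustrated with a minimal sketch. The function name, the flags, and the weight values below are hypothetical and are not taken from any DHS dataset; the sketch only shows how a weighted percentage differs from an unweighted one when some cases were oversampled.

```python
def weighted_proportion(has_characteristic, weights):
    """Return the weighted share of cases that have the characteristic."""
    if len(has_characteristic) != len(weights):
        raise ValueError("each case needs exactly one weight")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    # Only cases with the characteristic contribute to the numerator.
    weighted_yes = sum(w for flag, w in zip(has_characteristic, weights) if flag)
    return weighted_yes / total_weight


# Hypothetical sample: the first two cases come from an oversampled area
# (weight below 1), the last two from an undersampled area (weight above 1).
flags = [True, True, False, False]
weights = [0.5, 0.5, 1.5, 1.5]

unweighted = sum(flags) / len(flags)
weighted = weighted_proportion(flags, weights)
```

In this made-up example the unweighted share overstates the characteristic because the oversampled cases all have it; the weighted share scales those cases back down.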
There are four different types of median calculations in the survey statistics, and results vary according to the type of variable being analyzed.
(1) Medians for completed time periods: These are medians for variables such as intervals between events or ages calculated at different events. For example, current age, age at first union, and age at sterilization in the DHS belong to this category. Medians for this type of variable take into consideration that ages are given in completed years. A respondent who is currently 20 years old could be anywhere between 20 years and 1 day old and 20 years and 364 days old.
(2) Medians for continuous variables: These are medians for variables such as children’s weight at birth or any other type of measurement in a continuous scale.
(3) Medians for discrete variables: These types of medians apply to variables where the only possible values are discrete values. Examples include: number of children or number of prenatal care visits by the woman. A respondent can only have one, two, or any integer number of children. It is not possible to have 2.3 children.
(4) Medians using the current status data: These types of medians are calculated for variables where 100 percent, or close to 100 percent, of the population have that characteristic at the beginning of an event and the percentage diminishes as time passes. For example, 100 percent of children do not know how to walk at birth. As time progresses, some children begin to walk, and there is an age (in months) at which 50 percent or more of the children have learned to walk.
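To show why the type of variable matters, here is a minimal sketch of a median for completed time periods (type 1). It assumes that a reported age of k completed years stands for the interval from k up to, but not including, k + 1, and it interpolates linearly inside the interval where the cumulative weight reaches 50 percent. This is one common grouped-median convention, not necessarily the exact procedure used in DHS reports, and the function name and data are hypothetical.

```python
def median_completed_periods(values, weights=None):
    """Interpolated median for values reported in completed units (e.g. years).

    Assumption: a reported value k represents the interval [k, k + 1),
    with cases spread evenly across that interval.
    """
    if weights is None:
        weights = [1.0] * len(values)
    if len(values) != len(weights):
        raise ValueError("each value needs exactly one weight")

    # Total weight observed at each reported value.
    weight_by_value = {}
    for value, weight in zip(values, weights):
        weight_by_value[value] = weight_by_value.get(value, 0.0) + weight

    total = sum(weight_by_value.values())
    if total <= 0:
        raise ValueError("need at least one case with positive weight")

    half = total / 2
    cumulative = 0.0
    for value in sorted(weight_by_value):
        here = weight_by_value[value]
        if cumulative + here >= half:
            # Linear interpolation inside the interval [value, value + 1).
            return value + (half - cumulative) / here
        cumulative += here


# Hypothetical ages at first union, in completed years.
ages = [18, 19, 20, 20, 21]
interpolated = median_completed_periods(ages)
```

For these five made-up cases a plain median would return 20, while the interpolated median falls inside the year of age 20, reflecting that respondents aged 20 in completed years are, on average, older than exactly 20.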
Rates: The frequency of demographic events in a population in a specified time period. Rates tell how frequently an event is occurring – how common it is. Crude rates are rates computed for an entire population. Specific rates are rates computed for a specific subgroup, usually the population at risk of having the event occur (for example, General Fertility Rate: births per 1,000 women age 15-49 years). Thus, rates can be age-specific, race-specific, occupation-specific, and so on.
Ratio: The relation of one population subgroup to another subgroup in the same population; that is, one subgroup divided by another (for example, sex ratio: 102 males per 100 females).
Proportion: The relation of a population subgroup to the entire population; that is, a population subgroup divided by the entire population (for example, the proportion urban: 26.7 percent of the population).
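The three definitions above differ only in what goes in the numerator and the denominator, which a short sketch makes concrete. The function names and all counts are hypothetical; they are chosen so that the results match the style of the examples in the text (a sex ratio of 102 males per 100 females and a proportion urban of 26.7 percent).

```python
def rate_per_1000(events, population_at_risk):
    """Rate: events in a period per 1,000 people at risk of the event."""
    return 1000 * events / population_at_risk


def ratio_per_100(subgroup_a, subgroup_b):
    """Ratio: one subgroup divided by another, per 100 of the second."""
    return 100 * subgroup_a / subgroup_b


def proportion_percent(subgroup, total_population):
    """Proportion: a subgroup divided by the entire population, in percent."""
    return 100 * subgroup / total_population


# Hypothetical counts, for illustration only.
general_fertility_rate = rate_per_1000(events=120, population_at_risk=1000)
sex_ratio = ratio_per_100(subgroup_a=5100, subgroup_b=5000)
proportion_urban = proportion_percent(subgroup=267, total_population=1000)
```

Note that in the ratio the two subgroups do not overlap, while in the proportion the numerator is part of the denominator.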
All-women factors are used in the DHS and AIS surveys to adjust ever-married women samples in order to estimate statistics based on all women.
These factors are as follows:
The Wealth Index was introduced in the DHS and AIS surveys and is presented in the Final Reports as a background characteristic. Information on the wealth index is based on data collected in the Household Questionnaire on household assets. This questionnaire includes questions concerning the household's ownership of a number of consumer items such as a television and car; dwelling characteristics such as flooring material; type of drinking water source; toilet facilities; and other characteristics that are related to wealth status.
Each household asset for which information is collected is assigned a weight or factor score generated through principal components analysis. The resulting asset scores are standardized in relation to a standard normal distribution with a mean of zero and a standard deviation of one. These standardized scores are then used to create the break points that define wealth quintiles as: Lowest, Second, Middle, Fourth, and Highest.
Each household is assigned a standardized score for each asset, where the score differs depending on whether or not the household owned that asset (or, in the case of sleeping arrangements, the number of people per room). These scores are summed by household, and individuals are ranked according to the total score of the household in which they reside. The sample is then divided into population quintiles: five groups with the same number of individuals in each.
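The ranking and division into quintiles described above can be sketched as follows. The sketch starts from household total scores that are already computed, so the principal components step is not shown. It also assumes that all members of a household fall into the same quintile, decided by where the midpoint of the household's members lands in the ranked population; that tie-handling rule, the function name, and the scores are illustrative assumptions rather than the documented DHS procedure.

```python
QUINTILE_LABELS = ["Lowest", "Second", "Middle", "Fourth", "Highest"]


def assign_wealth_quintiles(households):
    """Map each household id to a quintile label.

    households: list of (household_id, total_score, number_of_members).
    Individuals are ranked by their household's total score, and the
    ranked population is split into five groups of roughly equal size.
    """
    total_members = sum(members for _, _, members in households)
    if total_members <= 0:
        raise ValueError("need at least one household member")

    quintile_by_household = {}
    members_ranked_so_far = 0
    for household_id, _score, members in sorted(households, key=lambda h: h[1]):
        # Assumption: place the household by the midpoint of its members'
        # positions in the ranked population (a share between 0 and 1).
        midpoint_share = (members_ranked_so_far + members / 2) / total_members
        index = min(int(midpoint_share * 5), 4)
        quintile_by_household[household_id] = QUINTILE_LABELS[index]
        members_ranked_so_far += members
    return quintile_by_household


# Hypothetical households: (id, total standardized score, members).
sample = [("h1", -1.2, 2), ("h2", -0.4, 2), ("h3", 0.0, 2),
          ("h4", 0.5, 2), ("h5", 1.3, 2)]
quintiles = assign_wealth_quintiles(sample)
```

Because the split is made on individuals rather than households, a few large households can fill a quintile that would otherwise hold many small ones.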
A single asset index is developed on the basis of data from the entire country sample and used in all the tabulations presented. Separate asset indices are not prepared for rural and urban population groups on the basis of rural or urban data, respectively.
Wealth quintiles are expressed in terms of quintiles of individuals in the population, rather than quintiles of individuals at risk for any one health or population indicator. This approach to defining wealth quintiles has the advantage of producing information directly relevant to the principal question of interest, for example, the health status or access to services for the poor in the population as a whole. This choice also facilitates comparisons across indicators for the same quintile, since the quintile denominators remain unchanged across indicators. However, some types of analysis may require data for quintiles of individuals at risk.
All health, nutrition and population indicators are calculated after applying the sampling weights so that the resulting numbers are generalizable to the total population. For each indicator in these tables, the total or population average presented is the weighted sum of the quintile values for that indicator, where the weight assigned to each quintile value is the proportion of the total number of individuals at risk in that quintile. The total values for indicators produced by this weighting scheme are representative of the total population, as they take into account the fact that the numbers of individuals at risk may vary across wealth quintiles. Similarly, each quintile value itself can be reproduced as a weighted average of the urban/rural rates (weighted by the proportions urban/rural) or the male/female rates (weighted by the proportions male/female). As a result of this weighting scheme, the population average for a given indicator presented in the tables will usually differ from a simple mean of the population subgroups.
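A small numerical sketch shows why the population average usually differs from a simple mean of the quintile values. The indicator values and at-risk counts are hypothetical, and the function name is introduced here only for illustration.

```python
def population_average(quintile_values, at_risk_counts):
    """Weighted sum of quintile values, where each quintile's weight is
    its share of the total number of individuals at risk."""
    if len(quintile_values) != len(at_risk_counts):
        raise ValueError("each quintile value needs an at-risk count")
    total_at_risk = sum(at_risk_counts)
    if total_at_risk <= 0:
        raise ValueError("at-risk counts must sum to a positive number")
    weighted_sum = sum(v * n for v, n in zip(quintile_values, at_risk_counts))
    return weighted_sum / total_at_risk


# Hypothetical indicator values (percent) for Lowest ... Highest,
# and the (weighted) number of individuals at risk in each quintile.
values = [10, 20, 30, 40, 50]
at_risk = [300, 250, 200, 150, 100]

weighted_total = population_average(values, at_risk)
simple_mean = sum(values) / len(values)
```

Here the lower quintiles contain more individuals at risk, so the weighted total is pulled toward their values and sits below the simple mean of the five quintile values.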