Each dataset can be distributed in several different file formats. Dataset users should select the format of data that best meets their analysis requirements. Recode data are available in formats that facilitate their use in the most common statistical software.
Recode datasets (including HIV and Other Biomarker Test Results) are available in five electronic formats:
- Hierarchical CSPro File
- Flat Files:
  - ASCII Data with Syntax Files
  - Stata Data File
  - SPSS Data File
  - SAS Data File
Most researchers use the flat file designed for the statistical software that they intend to use for analysis. All datasets are distributed in archived ZIP files that include the data file and its associated documentation. ASCII files distributed in flat format include SPSS, SAS and Stata data definitions (syntax files). Hierarchical files include a CSPro data dictionary describing the data file. All zipped datasets have meaningful file names. Learn more about dataset filenames.
Hierarchical File Format
Hierarchical files contain a varying number of records for each case and are designed for use with packages supporting complex data structures. This is the format in which DHS data are originally entered and saved using CSPro. Each data file is typically 4–35 megabytes in size. The hierarchical structure defined by CSPro has advantages and disadvantages. Among the advantages, the following can be highlighted:
- All the data are stored in just one ASCII file.
- Since all the data are stored in the same file, it is easy to maintain the integrity of the data in terms of data structure related to levels and records.
The major disadvantage is that this structure is easily handled by CSPro and other software supporting hierarchical data structures, or by a customized program written in a programming language, but is less easy to use with most statistical software.
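As a rough illustration of what such a customized program might do, the sketch below groups the records of a hierarchical file by case. The layout is hypothetical (a 3-character record type followed by a 4-character case ID); the real positions and widths come from the CSPro data dictionary shipped with each dataset.

```python
from collections import defaultdict

# Hypothetical layout, for illustration only: the actual positions are
# defined in the CSPro data dictionary distributed with the data file.
RECORD_TYPE = slice(0, 3)   # e.g. "H01" household, "B01" birth (made-up codes)
CASE_ID = slice(3, 7)       # 4-character case identifier
DATA_START = 7


def group_by_case(lines):
    """Group hierarchical records by case ID, preserving record order."""
    cases = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        cases[line[CASE_ID]].append((line[RECORD_TYPE], line[DATA_START:]))
    return dict(cases)


# Each case has a varying number of records.
sample = [
    "H010001urban",
    "W010001age25",
    "B010001child1",
    "B010001child2",
    "H010002rural",
    "W010002age31",
]
cases = group_by_case(sample)
for case_id, records in cases.items():
    print(case_id, [record_type for record_type, _ in records])
```

Here the first case carries four records and the second only two, which is the property that makes the format awkward for software expecting one row per case.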
Flat File Format
Flat files contain a single record, sometimes with more than 2,000 characters, for each case in the data file. These data files are approximately 10–60 megabytes in size.
In a flat file there is one record for each case. All variables in each case are placed one after the other on the same record. The multiple or repeating records of the file are placed one after the other on the record, with the maximum number of occurrences of each section being represented in every case. Each variable in a repeating section is placed immediately after the preceding variable of the same occurrence, such that all variables for occurrence 1 precede all variables for occurrence 2 of a section. The length of each record in the flat data file is fixed.
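The layout described above can be sketched as a small fixed-width parser. Everything about the layout here is an assumption made for illustration (a 4-character case ID followed by one repeating section of two 2-character variables with at most three occurrences); real column positions are given in the syntax files distributed with each dataset.

```python
# Hypothetical layout, for illustration only.
ID_WIDTH = 4
MAX_OCC = 3                              # maximum occurrences of the section
SECTION_VARS = [("B1", 2), ("B2", 2)]    # (variable name, width), made up
OCC_WIDTH = sum(width for _, width in SECTION_VARS)
RECORD_LENGTH = ID_WIDTH + MAX_OCC * OCC_WIDTH  # fixed for every case


def parse_flat_record(record):
    """Split one fixed-length flat record into named values."""
    if len(record) != RECORD_LENGTH:
        raise ValueError(f"expected {RECORD_LENGTH} characters, got {len(record)}")
    row = {"CASEID": record[:ID_WIDTH]}
    pos = ID_WIDTH
    # All variables of occurrence 1 come before all variables of occurrence 2.
    for occ in range(1, MAX_OCC + 1):
        for name, width in SECTION_VARS:
            value = record[pos:pos + width].strip()
            row[f"{name}_{occ:02d}"] = value or None  # blank = unused occurrence
            pos += width
    return row


# Two occurrences filled, the third padded with blanks to keep the length fixed.
record = "0001" + "0112" + "0207" + "    "
row = parse_flat_record(record)
print(row)
```

Because every case reserves space for the maximum number of occurrences, cases with fewer occurrences carry blank padding, which this sketch maps to `None`.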
Multiple occurring variables and sections of data represent the main disadvantage to flat files. Each occurrence of every such variable must have its own name because statistical packages do not generally support the use of arrays or subscripts. For example, the third occurrence of the variable named V304 would be named V304$03 in SPSS, or V304_03 in SAS and Stata.
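For readers working outside those packages, one way to cope with occurrence-suffixed names is to reshape a case into one row per occurrence. The sketch below assumes the `_NN` suffix convention mentioned above; the `CASEID` key, the sample values, and the rule that a `None` value marks an unused occurrence are assumptions made for this example.

```python
import re

# Matches names such as V304_03: a base variable name plus a 2-digit occurrence.
OCCURRENCE = re.compile(r"^(?P<base>[A-Za-z]+\d+)_(?P<occ>\d{2})$")


def wide_to_long(row, id_key="CASEID"):
    """Turn one wide case into a list of rows, one per occurrence."""
    by_occ = {}
    for name, value in row.items():
        match = OCCURRENCE.match(name)
        if match is None or value is None:
            continue  # not a repeating variable, or an unused occurrence
        occ = int(match.group("occ"))
        entry = by_occ.setdefault(occ, {id_key: row[id_key], "occurrence": occ})
        entry[match.group("base")] = value
    return [by_occ[occ] for occ in sorted(by_occ)]


wide = {
    "CASEID": "0001",
    "V304_01": 1, "V304_02": 0, "V304_03": None,
    "V305_01": 2, "V305_02": 1, "V305_03": None,
}
long_rows = wide_to_long(wide)
for long_row in long_rows:
    print(long_row)
```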
ASCII Flat Files are distributed with description (syntax) files for reading the data into Stata (.do and .dct), SPSS (.sps) and SAS (.sas), together with other documentation to support reading the data into other software.
Stata, SPSS and SAS Data Files
The data files for Stata, SPSS and SAS are binary files specific to those packages that can be read quickly by the relevant statistical package. We provide data files for these three popular statistical analysis packages: Stata (.dta), SPSS (.sav), and SAS (.sas7bdat). Each data file contains all the data and descriptive information required to define and use the data, including variable names, variable labels, value labels, missing values, etc. These software-specific data files are the preferred format for most users.
The main advantages of using the Stata, SPSS or SAS data file (rather than an ASCII data file in conjunction with a syntax file) include faster processing time, not having to modify the syntax file to refer to the correct path of the data file, and the ease of saving changes. The main disadvantage is that system files are platform dependent. For example, the data files that we provide generally cannot be used on a Macintosh computer. Although this has not been tested, the ASCII data and syntax files are more likely to be usable on different platforms, but will probably require some modification by the user.
These data files are designed for use in the following versions (or later) of the software packages:
Geographic data are provided in .DBF and .MDB format. Each data file averages around 100KB in size. Users may import either file type into their GIS software package.
The following information is included in the geographic data file:
- The administrative units in which the point is located
- Whether the point is located in an urban or rural area
- The coordinates of the center of the populated area surveyed, expressed both in decimal degrees and in degrees and minutes.
- The altitude at the cluster location in meters.
- Whether the point was collected with a GPS unit or approximated using a map or gazetteer.
- The geographic datum used in recording, which is WGS84 unless otherwise specified in the file.
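Since the coordinates appear in two notations, it can help to see how they relate. The following is a minimal sketch of the conversion between decimal degrees and degrees and minutes; the function names are invented for this example, and it does not describe how the geographic files themselves encode the values.

```python
def decimal_to_degrees_minutes(value):
    """Split signed decimal degrees into (sign, whole degrees, decimal minutes)."""
    sign = -1 if value < 0 else 1
    magnitude = abs(value)
    degrees = int(magnitude)
    minutes = (magnitude - degrees) * 60  # 1 degree = 60 minutes
    return sign, degrees, minutes


def degrees_minutes_to_decimal(sign, degrees, minutes):
    """Inverse conversion: recombine into signed decimal degrees."""
    return sign * (degrees + minutes / 60)


# A latitude of -1.5 (south of the equator) is 1 degree 30 minutes south.
sign, degrees, minutes = decimal_to_degrees_minutes(-1.5)
print(sign, degrees, minutes)
print(degrees_minutes_to_decimal(sign, degrees, minutes))
```

The sign is carried separately so that values between -1 and 0 degrees do not lose their hemisphere when the whole-degree part is zero.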
Linking GPS data to DHS/AIS data files
The geographic data file contains the cluster ID that corresponds to the cluster ID in each of the survey datasets (individual woman, child, births, etc.). Depending on the type of analysis, you may choose to aggregate the DHS data or simply attach new information to the DHS data files within your GIS.
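The two approaches mentioned above, attaching coordinates to survey records and aggregating survey data to the cluster level, can be sketched as follows. The field names (`cluster_id`, `lat`, `lon`, `age`) and all values are hypothetical, and the mean is unweighted for simplicity; this is an illustration of the join on cluster ID, not a recommended analysis procedure.

```python
from collections import defaultdict

# Hypothetical geographic records, one per cluster.
clusters = [
    {"cluster_id": 1, "lat": -1.25, "lon": 36.75},
    {"cluster_id": 2, "lat": -0.5, "lon": 37.0},
]
# Hypothetical survey records, many per cluster.
women = [
    {"cluster_id": 1, "age": 25},
    {"cluster_id": 1, "age": 35},
    {"cluster_id": 2, "age": 30},
    {"cluster_id": 3, "age": 41},  # cluster without a geographic record
]


def attach_coordinates(records, clusters):
    """Add lat/lon to each survey record whose cluster ID has a match."""
    by_id = {c["cluster_id"]: c for c in clusters}
    linked = []
    for rec in records:
        geo = by_id.get(rec["cluster_id"])
        if geo is None:
            continue  # unmatched clusters are dropped in this sketch
        linked.append({**rec, "lat": geo["lat"], "lon": geo["lon"]})
    return linked


def mean_by_cluster(records, field):
    """Aggregate a numeric field to one unweighted mean per cluster."""
    totals = defaultdict(lambda: [0.0, 0])
    for rec in records:
        totals[rec["cluster_id"]][0] += rec[field]
        totals[rec["cluster_id"]][1] += 1
    return {cid: total / count for cid, (total, count) in totals.items()}


linked = attach_coordinates(women, clusters)
means = mean_by_cluster(women, "age")
print(linked)
print(means)
```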
The DHS Program is authorized to distribute, at no cost, unrestricted survey data files for legitimate academic research. Registration is required for access to data.