International Development Research Centre (IDRC) Canada     
POPULATION AND HEALTH IN DEVELOPING COUNTRIES
ID: 42996
Added: 2003-08-28 12:22
Modified: 2004-11-03 21:27
Refreshed: 2010-03-14 05:14

Chapter 4. PROCESSING DSS DATA

Introduction

Compiling longitudinal population information poses unique data-management challenges. Projects must maintain changing individual-level information on the composition and household structure of a large, geographically defined population. Events that arise — births, deaths, migrations, etc. — must be linked to individuals and other entities at risk of these events. These events affect not only demographic rates, for instance, but also relationships within and between households. As event histories grow, records of new events must be logically consistent with those of events in the past. Seemingly obvious checks on data to meet minimal standards of integrity can result in hundreds of lines of code.

Relating critically needed auxiliary data to dynamic population registers poses further challenges. Morbidity and cause-of-death data must be entered, linked, and stored. Most DSS projects also maintain socioeconomic data such as on marriage, family relationships, and economic conditions, owing to the strong correlation between health and socioeconomic status. These must be logically consistent with other longitudinal data on the population at risk and relationships among individuals under surveillance. Moreover, projects are often launched to assess the impacts of health technologies, service strategies, or policies, and this necessitates data entry, management, and checking procedures for the internal consistency of service information, as well as procedures to link this information to demographic histories. Variance in exposure to interventions must be monitored at the individual level, in conjunction with precise registration of demographic events and individual risk. Maintaining a detailed record of demographic events, relationships, and exposure to risks or interventions requires complex data-management operations, with a carefully controlled field-operation infrastructure to oversee and support data collection and entry, and a comprehensive computer system for the data-management operation.

Data-management systems required for this operation typically encompass thousands of lines of computer code. A key contribution of the INDEPTH network has been technology-sharing to offset the complexity of developing a data system and creating a reference data model for storage of DSS data. This generic model for data storage facilitates cross-site comparative analyses of the type described in this volume, as it standardizes data rules and concepts across sites. Future work of the network will address the need for generic analytical and data-management software compatible with the reference data model.

This chapter outlines features of this reference data model that pertain to the INDEPTH DSSs. In the not-too-distant past, developing DSS software was difficult, time-consuming, and prone to conceptual and programmatic errors. Software generators and object-oriented tools for software development greatly simplify the task of developing a complex system, once common principles of software structure are instantiated in a common applications framework. The mechanisms of INDEPTH have marshalled these software innovations to meet the collective needs of member stations. The reference data model will facilitate exchange of information, swift formulation of site-specific data management software and common software for data analysis, and simplified technical assistance and capacity-building operations.

Background

The work of the INDEPTH Technical Working Group (TWG) has been informed by the achievements, limitations, and future needs of projects in Bangladesh, Burkina Faso, Ghana, Indonesia, Mali, Senegal, South Africa, Tanzania, and Uganda. One of the earlier systems, the Bangladesh DSS in Matlab District, was developed in the 1960s and has since been used for a wide range of studies of demographic dynamics, family planning, epidemiology, health-services research, and other issues (Rahman and D’Souza 1981; D’Souza 1984). Although the Bangladesh DSS has redeveloped its computer operations several times, its field operations have provided a model for a wide range of DSS applications in developing countries. The Bangladesh DSS precisely defined eligibility rules for members of a population under study; this, combined with a data system with rigorous logical-consistency checks, has provided high-quality data for many research papers. A number of software systems have been written, based on experiences with the Bangladesh DSS, including the Sample Registration System (Leon 1986a, b, 1987; Phillips et al. 1988; Mozumdar et al. 1990) and the Indramayu Child Survival Project of the University of Indonesia (Utomo et al. 1990). The DSS in Niakhar, Senegal, most recently described in Garenne (1997), has also influenced the technical design of a number of systems, including those of PRAPASS in Nouna, Burkina Faso (Sauerborn et al. 1996), and Agincourt, South Africa (Tollman et al. 1995). Garenne (1997) described the concept of entry–exit files (similar to the concept of “episodes” described here) as a means of modeling both intervals of residence at a location and intervals of relationships. Garenne also provided useful observations regarding the implementation of field and software systems for longitudinal population studies.

To develop its data model, TWG synthesized the experience of these disparate applications. The model specifies a demographic “core” common to field stations doing longitudinal research on populations (MacLeod et al. 1991; Phillips et al. 1991). Sites have developed software systems to manage this demographic core, maintain a consistent record of significant demographic events in the population of a fixed geographic region, generate registration books that the fieldworkers use, and compute basic demographic rates, such as birth, mortality, and total fertility. These core capabilities establish a computational framework to which projects add their site-specific data and consistency specifications. The concept of a core also entails some generic principles of data collection and management that apply to all INDEPTH sites.

The INDEPTH concept of a data core

All participating sites in INDEPTH collect and maintain a common core of data. Attempts to standardize data processing have led to the concept of a “core system” that provides many of the common software requirements of field research laboratories and can be extended and modified to tailor software to various specifications. This concept is based on the principle that certain characteristics of households, household members, relationships, and demographic events are common to all longitudinal studies of human populations, and software required to collect, enter, and manage data can therefore be generic to a family of applications. TWG has identified these features of a core system common to all DSS operations. In this framework, the core system maintains a consistent record of baseline and longitudinal data on all households, household members, and their relationships in a geographically defined population, including births, deaths, migrations, and marriages. The core system maintains information on events and observation dates to give each entity in the study corresponding “person-day” counts of risk for demographic events. Core computer operations structure data and maintain logical integrity on the following basic elements of a household unit:

  • All households have defined members at any given point in time (rules unambiguously exclude nonmembers);
  • All households have a single head at a given point in time, and members relate to one another and to the head in definable ways;
  • Members have names, dates of birth, and other characteristics that do not change over time;
  • Events can occur to members, such as death, birth, in- and out-migration, and marital-status change (attempts to enter event data on nonmembers are rejected at the point of data entry);
  • Events change household membership and relationships according to fixed rules; and
  • Episodes (such as pregnancies, conjugal relationships, or residencies) are associated with individuals at risk (that is, active members) and must follow simple logical rules.

Although these are seemingly trivial items, mundane relationships tend to become complex and unwieldy when arrayed as a logical system of longitudinal population data; and portraying even simple relationships requires rigorous standards to avoid error. For example, to be counted as a death in a resident population, a concerned household member must be resident in the study area at the time of death; a live birth to a woman 5 months after she gave birth to another child would be an inconsistent event. A central contribution of TWG has been to clarify such minimal system logic so that the system prevents errors resulting in violation of business rules and rendering data useless.
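
The two examples above can be written as small validation rules. The following Python sketch is illustrative only: the function names, the record layout (residencies as start and end date pairs), and the 210-day minimum birth interval are assumptions made for this example and are not taken from any INDEPTH system.

```python
from datetime import date

# Hypothetical threshold: the chapter only says that a live birth 5 months
# after another is inconsistent; 210 days is an assumed cut-off.
MIN_BIRTH_INTERVAL_DAYS = 210


def is_resident_on(residencies, day):
    """Return True if `day` falls inside any (start, end) residency interval.

    `end` is None while the residency is still open.
    """
    return any(start <= day and (end is None or day <= end)
               for start, end in residencies)


def check_death(residencies, death_date):
    """A death counts only if the person was resident on the date of death."""
    if not is_resident_on(residencies, death_date):
        return ["death recorded for a person not resident on that date"]
    return []


def check_birth(previous_births, new_birth,
                min_interval_days=MIN_BIRTH_INTERVAL_DAYS):
    """Flag a live birth that is too close to another birth to the same woman.

    Births on the same day are treated as a multiple birth and accepted.
    """
    errors = []
    for earlier in previous_births:
        gap = abs((new_birth - earlier).days)
        if 0 < gap < min_interval_days:
            errors.append(f"live birth only {gap} days from another birth")
    return errors
```

A data-entry routine could call such checks before saving a record and refuse to store any event for which the returned list of errors is not empty.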

All INDEPTH computer systems maintain standard DSS-processing operations:

  • Data entry — Software allows for entry, deletion, and editing of the baseline and longitudinal data. Baseline household information includes the household location, individuals within the household, relationships between individuals, and familial social groups. Longitudinal information includes basic information on pregnancies and their outcomes, deaths, migrations in and out of the study area, marriages, and any other measures the investigators specify.
  • Validation — Software checks for the logical consistency of data.

Most INDEPTH sites have also developed software for reporting outcomes and managing data:

  • Reports and output — Routine software calculates and displays demographic rates and life tables and can compute age-specific and overall rates.
  • Visitation register — Software prints the household-registration book, which is used by the fieldworkers to update and record information during household interviews.
  • Utilities — This option is primarily used by the system administrator. It includes capabilities for adding new user IDs, setting interview-round information, and generating reconciliation reports to help track down unreported pregnancy outcomes and unmatched internal migrants.
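
As a rough illustration of the reporting step, the sketch below counts person-days of residence inside an observation window and converts an event count into a rate per 1000 person-years. The function names, the tuple layout, the exclusive end dates, and the 365.25-day year are assumptions for this example, not a description of the INDEPTH software.

```python
from datetime import date


def person_days(episodes, window_start, window_end):
    """Sum the days of residence that fall inside [window_start, window_end).

    Each episode is a (start, end) pair; `end` is None for an open episode
    and is otherwise treated as exclusive.
    """
    total = 0
    for start, end in episodes:
        lo = max(start, window_start)
        hi = window_end if end is None else min(end, window_end)
        if hi > lo:
            total += (hi - lo).days
    return total


def crude_rate_per_1000(event_count, days_at_risk):
    """Events per 1000 person-years, assuming 365.25 days per year."""
    person_years = days_at_risk / 365.25
    return 1000 * event_count / person_years
```

Age-specific rates follow the same pattern, with the person-days and the events first split by age group.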

Tailoring the core system

Given the basic core model for data structure, each site has developed site-specific applications using building blocks of the core framework, which allow software developers to construct additional modules for project-specific data. At nine INDEPTH sites, standard tools of database-management packages have been used for an INDEPTH product known as the household-registration system (HRS) for the core specification.1 Other INDEPTH sites have developed project-specific core capabilities to maintain the logical integrity of birth, death, migration, and marriage data over time and in a format consistent with the reference data model. Each site modifies the core to accommodate new cross-sectional data, special longitudinal modules, or variable classes or labels investigators want to add to field registers, along with logic to maintain the integrity of new variables.

The tools of commercially available database packages greatly facilitate the process of core modification. Standard features of commercially available database systems include those for easily adding data to the core system. For example, the HRS is built from the form menu (data-entry screen) and database builders of the Microsoft FoxPro system. These builders encourage and facilitate an object-oriented software-development approach through easily understandable mouse and menu procedures. To make changes to the core, a programmer locates the database table, menu, or form object to be changed, then works with the small pieces of code, called code snippets, which are “attached” to the object. Some code snippets control the timing of the entry of data for a variable; others enforce rules of consistency. Some INDEPTH sites, such as Hlabisa, are developing similar capabilities, using systems in SQL Server and Access.

1 The HRS formed the basis for INDEPTH software systems in The Gambia, Ghana (Binka et al. 1995), Indonesia, Mali, Mozambique, Tanzania (three sites), and Uganda. Applications involve a wide range of INDEPTH studies, including family-planning research, malaria interventions, child and maternal health, and correlates of HIV transmission. The current INDEPTH data model improves on the original HRS and other INDEPTH systems by allowing investigators to track nonresident individuals; include more general relationships, rather than just marital relationships; and separate membership in social groups (such as the household or family) from the location.

The reference data model

As explained in Chapter 3, a DSS tracks the presence of individuals in a defined study area. These individuals can enter and leave the study area in a small set of well-defined ways (for example, entering through birth or in-migration and leaving through death or out-migration). The INDEPTH reference model uses events to record the ways individuals enter (or return to) and leave the study area over time. Thus, events bracket the residency of any individual in the study area. In general, they occur in pairs, with one event (such as presence in the study area) initiating a state and another event (such as migrating out or death) terminating that state. Use of episodes in the reference model makes this pairing of initiating and terminating events explicit. The concept of episodes is diagramed in the centre section of Figure 4.1.

When a DSS tracks episodes, the concept of the “time resolution” of this tracking is very important. Below a certain time threshold, movements into or out of a particular place are not recorded. If a person leaves the physical location in the morning to go to the market and returns in the afternoon, this is not reflected in the DSS. If this period of absence increases beyond a certain threshold (6 weeks, 3 months, or some other period), it turns into an episode to be recorded in the DSS. This threshold varies from project to project, but the project always makes it explicit. The time resolution for “in” episodes should be consistent with the time resolution for “out” episodes, that is, the time before a visit becomes residency or the time after which an absence becomes an out-migration.
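
The threshold rule can be sketched in a few lines of Python. The six-week value, the function names, and the returned labels are assumptions chosen for illustration; each project sets its own threshold.

```python
from datetime import date, timedelta

# Hypothetical project setting; the text gives 6 weeks and 3 months as examples.
RESIDENCY_THRESHOLD = timedelta(weeks=6)


def classify_absence(left_on, returned_on, as_of,
                     threshold=RESIDENCY_THRESHOLD):
    """Return 'out-migration' if the absence exceeds the threshold.

    `returned_on` is None if the person had not returned by `as_of`.
    """
    end = as_of if returned_on is None else returned_on
    return "out-migration" if end - left_on > threshold else "ignored"


def classify_visit(arrived_on, left_on, as_of,
                   threshold=RESIDENCY_THRESHOLD):
    """Apply the same threshold to stays, so 'in' and 'out' rules agree."""
    end = as_of if left_on is None else left_on
    return "residency" if end - arrived_on > threshold else "visit"
```

Using a single threshold value for both functions keeps the time resolution of "in" and "out" episodes consistent.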

DSSs are concerned not only with the physical location or residence of individuals but also with their membership in social groups (such as households) and their relationships with other individuals (such as marital unions or parenthood). Many DSSs also need to reconstruct genealogies and to record isolated events, such as pregnancy outcomes or births and deaths external to the study area.

To support field operations and routine cleaning of data, a DSS must also keep track of where, when, and by whom a particular event was recorded. In this respect, the reference model provides a number of fields to facilitate construction of a good-quality data set. Another challenge for demographic field operations is to correctly identify migrating individuals. To resolve this problem, the reference data model includes fields to designate the place a migrant is moving to or coming from.

The INDEPTH reference model meets these requirements through its use of the following entities and the relationships between them (see Figure 4.1):

  • Physical location — This entity records the physical locations where individuals can stay, either singly or in groups, such as a homestead, stand, or plot. At several INDEPTH sites, it is possible to pinpoint this location by using coordinates, such as latitudes and longitudes. This feature is easily linked to a GIS. External IDs, such as stand number or address, can be stored in addition to the unique location ID value. An individual is associated with a physical location at a given time through a “resident episode.”

Figure 4.1. Reference Demographic Surveillance Data Model.
Note: LMP, last menstrual period. Mandatory fields and entities are in bold.

  • Individual — This entity contains a record for every individual who has ever resided in the study area. Optionally, this entity may record individuals whose residence in the study area has not been recorded but is required to complete a genealogy or relationship record. Records are uniquely identified through an individual ID value. Genealogical linkages can be established by storing the IDs of the individual’s father and mother. This information (mother’s and father’s ID) can also be useful for identification purposes, especially where name and date of birth are not clearly defined, as is often the case in sub-Saharan Africa (SSA).
  • Social group — This entity stores information on a defined social group, such as a household. An individual is associated with one or more social groups, through one or more membership episodes.
  • Observation — The observation entity stores the information that a particular physical location has been observed at a given time. This entity can also store information on the person making the observation and optional information, such as the census round. The observation entity is linked to all the events recorded during the observation.
  • Events — The events entity may indicate a change in the state of an individual (for example, from resident to nonresident, in the case of an out-migration). Events that initiate and terminate a particular state of interest (for example, residency) are combined and recorded as an episode (for example, resident episode). These types of events are known as “paired events.” Events that do not record the start or end of a particular state are known as “point events.” The information common to all events (such as date of occurrence, type of event, and ID of the observation during which the event was recorded) is stored as part of the episode that this event initiates or ends (in the case of paired events) or as part of the point-event table (in the case of point events). Additional data associated with an event are stored in a separate entity. The following event types are noted in Figure 4.1:
    • Birth — This event type records all live births to residents (stillbirths are recorded as a pregnancy-outcome event). The event is linked to the resident episode it initiates — it also initiates social-group membership and relationship episodes.
    • Death — This event type records all deaths of residents. A death event will terminate all open episodes belonging to the individual. The death-event record is linked to the resident episode that the event terminates and contains additional data, such as the location and cause of death.
    • Relationship start — This event type records the start of a relationship of one individual to another. By convention, relationship events are linked to the female in cases of heterosexual relationships and to the younger individual in cases of same-sex relationships. In the case of caretaking relationships, the relationship events are linked to the person receiving care.
    • Relationship end — This event type records the end of a relationship between two individuals.
    • Membership start — This event type records the start of an individual’s membership in a social group.
    • Membership end — This event type records the end of an individual’s membership in a social group.
    • In-migration — An in-migration event initiates a new or changed physical location for an individual. It records the start of a new residence episode for an individual and can originate within or outside the study area. Additional data, such as origin, are usually stored in a separate entity linked to the episode via the episode ID.
    • Out-migration — An out-migration event terminates a residence episode at a physical location for an individual. The destination of an out-migration can be within or outside the study area. Additional data, such as destination, are usually stored in a separate entity linked to the episode via the episode ID.
    • Status observation — Any number of optional events can be defined to record status information observed for individuals, such as socioeconomic, nutritional, educational, or immunization status. Repeated status observations make no assumptions about the value of observed attributes during the observation interval, even if subsequent observations measure the same values.
  • Episodes — As Figure 4.1 shows, episodes can occur to residents, relationships, pregnancies, and memberships in social groups:
    • Resident episode — A resident episode records the stay of an individual at a physical location. A resident episode can be initiated only by a DSS entry, a birth, or an in-migration event. It can be terminated only by a DSS exit, a death, or an out-migration event.
    • Relationship episode — A relationship episode records a time-dependent relationship, such as a marital union, between two individuals. The episode is started by a relationship-start event and concluded by a relationship-end event, a death, or a DSS exit. The relationship episode records the IDs of the two individuals involved in the relationship, but the events initiating and terminating the episode are linked to only one of the individuals, as described above.
    • Pregnancy episode — Pregnancy is recorded as an episode, with certain attributes recorded on the first observation of the pregnancy and others recorded when the outcome of the pregnancy is known. One lesson we have learned is that if you want to do a good job in child registration, you have to register pregnancies first. However, if a pregnancy is not observed, but only the outcome, the start of the pregnancy episode is still recorded as the date of the last menstrual period before the pregnancy. In this case the start and last observation IDs will point to the same observation instance. If a pregnancy is terminated by the woman’s death or out-migration, the reason for termination is recorded as the terminating-event type, and the episode is concluded. In the normal course of events, the pregnancy outcome could be recorded in the terminating-event type as spontaneous abortion, induced abortion, normal delivery, assisted delivery, or caesarean section. The “birth location” field refers to the delivery environment (for example, the name of a hospital or clinic where the delivery took place).
    • Membership episode — A membership episode records the membership of an individual in a particular social group. A membership episode can be initiated only by a DSS entry, a birth, or a membership start event. It can be terminated only by a DSS exit, a death, or a membership end event.

In summary, Figure 4.1 illustrates the entities and relationships of the INDEPTH reference data model. Mandatory fields and entities are displayed in bold type, whereas optional fields and entities are displayed in normal (nonbold) type.
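
To make the pairing of events into episodes concrete, the following Python sketch models two of the entities described above. It is a simplified illustration, not the INDEPTH schema: the class names, field names, and event-type strings are hypothetical. Only the rules themselves come from the text (which events may start or end a resident episode, and that a death closes all open episodes).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Event types allowed to open or close a resident episode (names assumed).
RESIDENT_START_EVENTS = {"dss_entry", "birth", "in_migration"}
RESIDENT_END_EVENTS = {"dss_exit", "death", "out_migration"}


@dataclass
class Individual:
    individual_id: str
    mother_id: Optional[str] = None  # genealogical links are optional
    father_id: Optional[str] = None


@dataclass
class ResidentEpisode:
    individual_id: str
    location_id: str
    start_date: date
    start_event: str
    end_date: Optional[date] = None
    end_event: Optional[str] = None

    def __post_init__(self):
        if self.start_event not in RESIDENT_START_EVENTS:
            raise ValueError(f"invalid initiating event: {self.start_event}")
        if (self.end_date is None) != (self.end_event is None):
            raise ValueError("end date and end event must be given together")
        if self.end_event is not None:
            self._check_end(self.end_date, self.end_event)

    def _check_end(self, end_date, end_event):
        if end_event not in RESIDENT_END_EVENTS:
            raise ValueError(f"invalid terminating event: {end_event}")
        if end_date < self.start_date:
            raise ValueError("episode cannot end before it starts")

    def close(self, end_date, end_event):
        """Record the terminating event of an open episode."""
        if self.end_date is not None:
            raise ValueError("episode is already closed")
        self._check_end(end_date, end_event)
        self.end_date, self.end_event = end_date, end_event


def record_death(episodes, individual_id, death_date):
    """Close every open episode of the individual; return how many closed."""
    closed = 0
    for episode in episodes:
        if episode.individual_id == individual_id and episode.end_date is None:
            episode.close(death_date, "death")
            closed += 1
    return closed
```

Membership and relationship episodes could follow the same pattern with their own sets of permitted initiating and terminating events.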

The role of the reference data model in maintaining data integrity

As explained in Chapter 3, any DSS must maintain a large volume of data over an extended period. Unless specific measures are taken, the integrity of the data will suffer, along with the accuracy and reliability of the information in the system. INDEPTH has taken steps to foster common standards for data integrity, based on a well-defined relational model. Although not all systems have the same measures to protect data quality, the following have been proposed or used at one or more INDEPTH sites:

  • “Audit stamp” — The audit stamp is part of every record in the database. The audit stamp records the operator and the date and time of the last update to the record. In addition, a quality-check indicator may record whether the record has been verified (for example, through a double-entry process).
  • Standard values — Standard values should be used consistently throughout the database to indicate the status of a particular data value. The following standard values (and their meanings) are proposed:
    • “Never entered” — This is the default value for all data fields in a newly created record.
    • “To be confirmed” — This indicates a need to query the value as it appears on the input document and to take follow-up action.
    • “Not applicable” — Given the data in related fields or records, a value for this data field is not applicable.
    • “Out of range” — The value on the input document is out of range and could not be entered. Follow-up action yielded no better information or is not applicable.
    • “Unknown” — The value is not known. Follow-up action yielded no better information or is not applicable.

(The actual values used to indicate the standard values depend on the data type of the field and the natural value range for the data item. Care should be taken to exclude these values from quantitative analysis of the data.)

  • Date values — Date values are of particular importance in a DSS, and it is preferable to record the precision of date values in addition to the dates themselves. Each date or duration field should have an associated precision field for recording the precision of the date value (for example, minutes, hours, days, weeks, months, quarters, semesters, years, decades).
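
One way to represent these measures in code is sketched below. The class and field names are hypothetical, and using a Python enumeration as the marker for standard values is an assumption made for illustration; as noted above, the actual marker values depend on each field's data type.

```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum


class StandardValue(Enum):
    """Markers for the five proposed standard values."""
    NEVER_ENTERED = "never entered"
    TO_BE_CONFIRMED = "to be confirmed"
    NOT_APPLICABLE = "not applicable"
    OUT_OF_RANGE = "out of range"
    UNKNOWN = "unknown"


@dataclass
class AuditStamp:
    operator: str          # who last updated the record
    updated_at: datetime   # date and time of the last update
    verified: bool = False # e.g. set after double entry


@dataclass
class DateWithPrecision:
    value: date
    precision: str  # e.g. "days", "months", "years"


def usable(values):
    """Drop standard-value markers before quantitative analysis."""
    return [v for v in values if not isinstance(v, StandardValue)]
```

For instance, `usable([3.2, StandardValue.UNKNOWN, 2.9])` keeps only the two measured values, so a mean computed from the result is not distorted by placeholder codes.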

Extending the core

Although the INDEPTH reference data model covers aspects common to all INDEPTH DSSs, it makes no attempt to specify all site-specific needs. However, it is designed to accommodate new components to meet the needs of a wide spectrum of longitudinal studies, without losing its clear overall structure. Several ways of extending the core are presented in this section:

  • Adding fields to existing entities — The simplest core extension is to add a data field for a fixed-in-time attribute of objects, events, or episodes already implemented. Examples of this type of core modification are inauguration date of a physical location, membership in an ethnic group, an individual’s Rh factor, the weaning age of a registered or member child, or the presence of a supervisor during an observation.
  • Defining new types of social groups and relationships — Whenever the interaction of individuals can be formalized to permit specification of a start and an end point of this interaction, it can be expressed in terms of social-group membership (interaction with all other individuals being members of the same social group at the same time) or of a relationship to just one other individual. INDEPTH data systems have specified a wide variety of such relationships and episodes. For example, membership might be in a social group (such as a lineage), in a health-insurance scheme, or in any other type of group that suits the needs of a study. A relationship can also be of a patient to a health-care provider or of a tenant to a landowner. Membership is not always limited to social groups but sometimes involves a “membership” in a category of chronically ill individuals or “membership” in a nested cohort study (where fulfillment of some predefined criteria might be the start events; and of others, the end events).
  • Adding new types of episodes or events — As illustrated in Figure 4.1, the system records four minimal, “predefined” types of episodes, and these can be adapted to various purposes. New event types are sometimes specified to facilitate storage of supplemental information (applicable only under specific conditions) while keeping the corresponding episode record as parsimonious as possible.
  • Defining events and episodes for physical locations and social groups — Although events and episodes always refer to individuals, they sometimes relate individuals to other operations. An extended model can define additional events and episodes with reference to physical locations or social groups. Point events and status observations can be defined to record information collected or observations made about physical locations (such as housing type, water supply, number of rooms), social groups (such as ID of household chief, monthly household income, agricultural production), or other nonconstant attributes.

Social groups can be related to other social groups, or “first-level” social groups like households can be members of “second-level” social groups like clans or other types of networks. DSSs designed to track the interaction of households might define relationship and membership episodes for social groups, to store information about this topic.

Households are normally associated with only one homestead, even if the members of the household reside in more than one physical location. When social groups are used to record households, this association can be depicted by an episode that records the start and termination of occupation at a physical location. Households also normally have a head of household. This head may change with time, but the household will still retain its identity, and head of household can be recorded either as an updatable attribute (“Current head of household”) or as a member of the social group. If the temporal dimension is important, the extension can be specified as an episode linking the household to an acting head of household.
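
Where the temporal dimension matters, the head of household can be looked up from such episodes. The sketch below assumes a hypothetical tuple layout (individual ID, start date, end date) with exclusive end dates. It also checks the core rule that a household has a single head at any given time.

```python
from datetime import date


def head_on(head_episodes, day):
    """Return the head's individual ID on `day`, or None if none is recorded.

    Each episode is (individual_id, start, end); `end` is None while the
    headship is current and is otherwise treated as exclusive.
    """
    for individual_id, start, end in head_episodes:
        if start <= day and (end is None or day < end):
            return individual_id
    return None


def has_overlap(head_episodes):
    """Return True if two headship episodes cover the same day."""
    ordered = sorted(head_episodes, key=lambda episode: episode[1])
    for (_, _, end_a), (_, start_b, _) in zip(ordered, ordered[1:]):
        if end_a is None or start_b < end_a:
            return True
    return False
```

Storing headship as episodes preserves the history of earlier heads, which the updatable "Current head of household" attribute would overwrite.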

In summary, the reference data model provides a structure to accommodate great flexibility in the design of longitudinal studies, and for this reason, INDEPTH includes sites engaged in various study designs, with a wide range of data-management needs. Despite this diversity, the model has a core of logic and structure lending integrity to operations and providing a crucial foundation for technical collaboration among sites.

Conclusion

This chapter has described the data model that INDEPTH has developed as the guiding framework for processing data at member sites. It makes attributes common to most health and family-planning studies explicit. As well, it serves as a structural framework for the addition of project-specific data. Much work still needs to be done to develop this model and a common data-processing system for INDEPTH core operations. However, the common framework for data management has already facilitated data sharing within the network, and nearly one-half of all INDEPTH sites use a common software system for core operations. If this use of generic software is more broadly accepted, the INDEPTH data model could serve as the basis for sharing system development, capacity-building, and collaborative research.






