Supporting the Biosurveillance Processes

The technical topics that we have discussed in Part VI to this point (standards, architecture, server facility, and software) are foundational to the correct design of a biosurveillance system. They are necessary, but not sufficient, for long-term success. In this section, we discuss additional decisions that designers face. In particular, we discuss the different options available for supporting several core processes of biosurveillance (Figure 35.3): collection of surveillance data, persistent storage of data, the collection of additional data, and data linkage. These processes correspond to the key components of most biosurveillance systems illustrated by Figure 35.4.

4.1. Data Collection System

Data collection refers to the processes discussed in the previous chapter (extraction, transmission, transformation, and loading) with which a biosurveillance system obtains data for analysis. A data collection system should be flexible (able to accommodate the needs of different data providers) and reliable, and it should have minimal data transmission latencies. In this section, we discuss the two data collection processes that occur at the data provider facility: data extraction and transmission. For each process, we discuss the technologies commonly employed in biosurveillance.

4.1.1. Data Extraction Methods

Data extraction refers to the initial process of obtaining data from the data provider's systems before the data provider transmits the data to the recipient. Two methods have been used in biosurveillance to extract data from a data provider's systems: query-based methods and message filtering (Lober et al., 2004).

Figure 35.3 The biosurveillance process (Figure 1.1 from Chapter 1, reproduced here for convenient reference).

The simplest data extraction method is a query to a database (e.g., an SQL query that retrieves all emergency department (ED) registrations for the period midnight to midnight). Most organizations are capable of extracting data using queries, provided that (1) the data provider has access to its database, (2) the database supports queries (most modern databases do, but some organizations may operate older systems), and (3) the data provider has technical staff who can implement a script or other program that periodically queries the database. Because humans are fallible, this program should run automatically at a predefined time interval to ensure reliability. An example of query-based extraction is the extraction of over-the-counter medication sales information from a retailer that contributes data to the National Retail Data Monitor (NRDM). Most of the retailers run a data extraction program on their databases every 24 hours that retrieves the number of medications sold on the previous day at each of their stores and stores the data in a file that is transmitted later in the data collection process. A limitation of query-based extraction is that it may put a load on the data provider's database systems, depending on which data are queried and how the data system is organized. For this reason, most organizations run query extraction procedures during off-peak hours (e.g., midnight).
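As an illustration of query-based extraction, the following minimal sketch retrieves one day's ED registrations (midnight to midnight) and saves them to a file for later transmission. The table name ed_registration, its columns, the ISO-formatted timestamp text, and the use of SQLite are all assumptions made for the example; a data provider would substitute its own DBMS and schema and would run such a script from a scheduler during off-peak hours.

```python
import csv
import sqlite3
from datetime import date, timedelta


def extract_daily_registrations(conn, day, out_path):
    """Write all ED registrations for one calendar day to a CSV file.

    Assumes a hypothetical table ed_registration whose registered_at
    column holds ISO 8601 text such as "2024-03-01T08:15:00".
    Returns the number of rows extracted.
    """
    start = day.isoformat()                       # inclusive lower bound
    end = (day + timedelta(days=1)).isoformat()   # exclusive upper bound
    rows = conn.execute(
        "SELECT visit_id, registered_at, chief_complaint "
        "FROM ed_registration "
        "WHERE registered_at >= ? AND registered_at < ? "
        "ORDER BY registered_at",
        (start, end),
    ).fetchall()

    # Save the result to a file that a later step transmits.
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["visit_id", "registered_at", "chief_complaint"])
        writer.writerows(rows)
    return len(rows)
```

Bounding the query to a single day keeps the load on the provider's database predictable, which matters given the limitation noted above.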

An alternative method is message filtering. A message is a discrete object of communication presented in a standard format. For example, a message may contain information about a patient registering for care, a request for more information, a request to perform an action, or an acknowledgment that an action was performed. The term message filtering refers to any process that selects individual messages from a stream of messages. For example, a message filter might be configured by a hospital to select only admission messages from a stream containing all admission-discharge-transfer messages.

Prerequisites for using filtering as a data extraction method include (1) a messaging system at the data provider and (2) the ability to filter the message stream for the desired messages. The filtered messages can be transmitted immediately, queued for later transmission (e.g., when the receiving system is not functioning), or saved to a file for later transmission.
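The admission filter described above can be sketched as follows. The sketch assumes pipe-delimited HL7 version 2 messages in which the first segment is MSH, the ninth MSH field carries the message type and trigger event (e.g., ADT^A01 for an admission), and the component separator is the default caret. The function names are hypothetical, and a production integration engine would normally do this through configuration rather than custom code.

```python
def message_type(hl7_message):
    """Return (type, trigger) taken from field MSH-9 of an HL7 v2 message."""
    msh = hl7_message.split("\r", 1)[0]   # segments are separated by carriage returns
    fields = msh.split("|")
    if fields[0] != "MSH" or len(fields) < 9:
        raise ValueError("message does not start with a usable MSH segment")
    # MSH-1 is the field separator itself, so MSH-9 sits at index 8.
    parts = fields[8].split("^")
    trigger = parts[1] if len(parts) > 1 else ""
    return parts[0], trigger


def filter_admissions(stream, wanted=frozenset({"A01"})):
    """Yield only ADT messages whose trigger event is in `wanted`."""
    for msg in stream:
        mtype, trigger = message_type(msg)
        if mtype == "ADT" and trigger in wanted:
            yield msg
```

Because the filter is a generator, the selected messages can be transmitted immediately, queued, or written to a file, matching the three options listed above.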

4.1.2. Transmission Methods

A transmission method is a way of moving data between computers over a network. The two most important characteristics of a transmission method are security and guaranteed delivery. A method is secure if only the intended recipient can view the data. Secure transmission is important because most biosurveillance data are confidential. Guaranteed delivery is the assurance that the receiver receives the transmission in its entirety. Guaranteed delivery is important because incomplete data are more difficult to analyze. Transmission methods can be either file or message based.
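One way for a receiver to check that a transmission arrived in its entirety is to compare a digest of the received bytes with a digest computed by the sender. This technique is not described in this chapter and is shown only as an illustration; the function names are hypothetical, and SHA-256 is simply one reasonable choice of digest.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def received_in_full(payload: bytes, expected_digest: str) -> bool:
    """True if the received payload matches the digest the sender computed."""
    return digest(payload) == expected_digest
```

A truncated or altered file produces a different digest, so the receiver can request retransmission instead of analyzing incomplete data.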

Figure 35.4 Basic components of a biosurveillance system (reproduced from Chapter 1 for convenient reference). The dashed line represents look-back for additional information to any data source.

File-Based Transmission. File-based transmission methods are ideal for the results of query-based data extraction but can also be used when a data provider filters messages and saves them to a file. Examples of file-based transmission methods include FTP, SFTP, HTTP, HTTPS, and WebDAV.

FTP (File Transfer Protocol) is a commonly used method for transmitting files. FTP guarantees delivery but is not secure: it does not encrypt the data being transmitted. Moreover, FTP represents a security risk to the overall system because it sends passwords in clear text over the network. Anyone with access to the network can run a sniffer program to intercept passwords.

Secure FTP (SFTP) addresses the security problems of FTP by encrypting both the login credentials and the transmitted data. FTP and SFTP are currently the most popular file-based transmission methods used by healthcare organizations.

Hypertext Transfer Protocol (HTTP) is familiar to most people in the context of a Web browser downloading files from the Internet, but it also supports uploading of files to a Web server. Similar to FTP and SFTP, HTTP has a secure version, HTTPS. Both HTTP and HTTPS provide guaranteed delivery, but HTTP is not secure. Healthcare organizations rarely use HTTP or HTTPS for data transmission because FTP and SFTP are older protocols and the technologists in healthcare organizations are comfortable with their use.

WebDAV (Web-based Distributed Authoring and Versioning) is an extension to HTTP. WebDAV guarantees delivery and is secure when it is run over HTTPS. Although healthcare organizations do not currently use WebDAV, other organizations use it for data transmission.

HTTP and WebDAV are alternatives to FTP/SFTP for file-based data transmission. However, HTTP and WebDAV are not widely used by healthcare organizations or over-the-counter data providers. FTP/SFTP is the most widely used file-based data transmission method and will continue to be for the near future.

Message-Based Transmission. In general, message-based transmission is preferable when it is feasible. It does not put load on the data provider's database, and it has more built-in fault tolerance (it can store messages in a queue if the receiving system is temporarily unavailable).
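The fault tolerance mentioned above (holding messages in a queue while the receiving system is unavailable) can be sketched as a small store-and-forward sender. The class name and the convention that the send callable raises ConnectionError on failure are assumptions for the example; real messaging products also persist the queue to disk so that it survives a restart, which this in-memory sketch does not.

```python
from collections import deque


class StoreAndForwardSender:
    """Queue outgoing messages and deliver them in order when possible."""

    def __init__(self, send):
        self._send = send          # callable; assumed to raise ConnectionError on failure
        self._pending = deque()    # messages not yet delivered, oldest first

    def submit(self, message):
        """Accept a message and attempt delivery of everything queued."""
        self._pending.append(message)
        return self.flush()

    def flush(self):
        """Deliver queued messages in order, stopping at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                      # receiver unavailable; keep the rest queued
            self._pending.popleft()        # remove only after a successful send
            delivered += 1
        return delivered

    @property
    def pending(self):
        return len(self._pending)
```

A message is removed from the queue only after the send succeeds, so an outage at the receiver delays delivery without losing data.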

There are two types of message transmission: point-to-point messaging and message buses.

Point-to-point messaging refers to any transmission method for sending messages from one sender to one recipient. Sending a message from a hospital's ED registration system directly to a biosurveillance system would be an example of point-to-point messaging.

A message bus is a transmission method that employs specialized software called a message router. Unlike point-to-point messaging, a message bus can accept messages from multiple senders and send each message to one or more receivers over a network. For example, the healthcare industry makes extensive use of HL7 message buses, which are managed by HL7 message routers (also known as integration engines). The HL7 message router accepts messages from the ancillary systems of the hospital and then routes the messages to the appropriate receivers (other information systems in a hospital or an external biosurveillance organization). Message buses provide guaranteed delivery and security. An example of a biosurveillance system that uses message-based transmission is the RODS system, which takes advantage of the fact that healthcare systems often use HL7 message buses to route ED registration information internally (to their billing, laboratory, and other information systems). When a data provider has an existing message bus, it is an obvious choice for data transmission. All that is required is modification or configuration of the message router to direct messages to the biosurveillance system.
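The routing behavior described above can be sketched in a few lines. The class and method names are hypothetical and do not correspond to any particular integration engine; the point is only that adding a biosurveillance system as one more receiver is a configuration change that leaves the existing senders and receivers untouched.

```python
from collections import defaultdict


class MessageRouter:
    """Route each message to every receiver registered for its type."""

    def __init__(self):
        self._routes = defaultdict(list)   # message type -> receiver callables

    def subscribe(self, message_type, receiver):
        """Register a receiver (any callable) for one message type."""
        self._routes[message_type].append(receiver)

    def publish(self, message_type, message):
        """Deliver the message to all subscribed receivers; return how many."""
        receivers = self._routes.get(message_type, [])
        for receiver in receivers:
            receiver(message)
        return len(receivers)
```

For example, a hospital that already routes ED registration messages to billing could add the biosurveillance feed with a single additional subscribe call.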

The capabilities of the organizations sending and receiving the data will dictate the set of transmission methods that you must include in your biosurveillance system. For example, if a biosurveillance system is interacting with a data provider that already employs messaging systems, you should employ message filtering and message-based transmission methods. If a data provider offers you a Web service (discussed in the previous chapter) because they have a service-oriented architecture (SOA), you should connect to their Web service for data.

4.2. Data Storage System

The most important function of the data storage system is to protect the data from loss. However, the data storage system must also store data in such a way that they are logically organized and quickly accessible to other services.

Computer scientists are a bit religious about the correct use of the terms database and database management system (DBMS). A database is simply a defined structure that houses a collection of data. A DBMS, such as Oracle, MySQL, or Microsoft SQL Server, is a computer program (usually very large) that not only allows a user to define a structure (a database) but provides utilities for managing the structure (e.g., adding a record, deleting a record, backing up the data, optimizing retrieval of records from the database, checking integrity).

DBMSs are sufficiently mature that the only decision you face about the DBMS is whether to use Oracle, Microsoft SQL Server, or possibly an open-source DBMS to economize. The selection is usually dictated by either the requirements of other biosurveillance software (some systems can interact with only one DBMS) or the preference of your information department or the ASP that you plan to use.

The decision that you should focus on is sizing of the data storage system, including disk space and hardware. Although the individual data messages being collected by a biosurveillance system are usually small (approximately 4 KB), the overall storage required may be very large depending on how many facilities are sending data to the system, how many types of data they are sending (e.g., ED registrations and laboratory results), and whether you are going to rely on data warehousing to provide fast access to the data (see caching discussion below). The amount of data collected and stored will increase over time because the analysts or analytic programs typically require access to historical data. Improper planning will limit the overall capacity of the database. A DBMS on a single server for the collection of ED data from 200 hospitals is an example of a system that will have problems. A better approach is to connect the server to a storage area network (SAN) to allow for expected growth of the data storage system.
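A rough sizing estimate can be worked out from the factors listed above. In the sketch below, only the 4 KB message size and the 200 hospitals come from the text; the message volume per facility, the retention period, and the overhead factor (for indexes, warehousing copies, and similar) are hypothetical planning inputs that you would replace with your own figures.

```python
def estimate_storage_gb(facilities, messages_per_facility_per_day,
                        days_retained, message_kb=4.0, overhead_factor=2.0):
    """Estimate total storage in GB (1 GB = 1024 * 1024 KB).

    overhead_factor is an assumed multiplier for indexes, cached or
    warehoused copies, and other storage beyond the raw messages.
    """
    total_kb = (facilities * messages_per_facility_per_day * days_retained
                * message_kb * overhead_factor)
    return total_kb / (1024 ** 2)


# Hypothetical inputs: 200 hospitals, 150 ED registrations per hospital
# per day, five years of history retained.
five_year_gb = estimate_storage_gb(200, 150, 365 * 5)
```

Under these assumptions the estimate is roughly 418 GB for ED registrations alone. Each additional data type (e.g., laboratory results) adds its own term, which is why the estimate should be repeated for every data type you plan to collect.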

4.3. Look Back/Investigation Support

A key property of biosurveillance is its cyclic (iterative) nature in which analyses of currently available information lead to additional, more directed, data collection (e.g., as during an outbreak investigation).

The topic of how to provide information-system support to an outbreak investigation is an open area of research. Here we focus on one particular topic: how to obtain additional data from a hospital about a patient of interest. This functionality is not widely available, and unless some thought is put into how it will be accomplished, you may have to redesign your data collection approach at a later time.

As a concrete example of the functionality, let us assume that an epidemiologist receives an alert from the planned system about a high level of respiratory illness in the community and wishes to obtain additional information about one or more patients. We and others refer to this functionality as look-back when the additional data can be obtained by request to another computer system or individual (i.e., the information is available somewhere, and it is simply a matter of getting it).

The previous sections focused on options for transferring data unidirectionally from one organization to another—from a hospital to a health department, from a laboratory to a centralized data storage facility, or from a biosurveillance system to an external party (health department or other). Most of these data transfers view the world as involving only one-way communication.

Look-back requires two-way communication, which is best implemented by using message-based transmission.

Message-based data transmission has several properties that make it uniquely suited for this type of communication. Messages support asynchronous communication, which is required when human intervention may be necessary to fulfill a request (which may nearly always be the case for the near term when requesting additional data from healthcare organizations). Most message-based systems also associate response messages with request messages. In the event of a long delay in response, or multiple responses, this association provides a means of deciphering the history of communication between organizations.
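The association between request and response messages can be sketched with a correlation identifier that travels with every request and is echoed in each response. The class and field names are hypothetical; messaging products typically provide an equivalent mechanism, so this is an illustration of the idea rather than of any specific system.

```python
import uuid


class LookBackLog:
    """Track look-back requests and the responses that refer to them."""

    def __init__(self):
        self._requests = {}    # request id -> request details
        self._responses = {}   # request id -> list of response bodies

    def new_request(self, identifier, question):
        """Record a request and return the id to include in the message."""
        request_id = str(uuid.uuid4())
        self._requests[request_id] = {"identifier": identifier,
                                      "question": question}
        self._responses[request_id] = []
        return request_id

    def record_response(self, request_id, body):
        """Attach a response to its request; responses may arrive late or repeatedly."""
        if request_id not in self._requests:
            raise KeyError(f"unknown request id: {request_id}")
        self._responses[request_id].append(body)

    def history(self, request_id):
        """Return the request and all responses received for it so far."""
        return self._requests[request_id], list(self._responses[request_id])

    def unanswered(self):
        """Return ids of requests that have no response yet."""
        return [rid for rid, resp in self._responses.items() if not resp]
```

Because responses are stored under the request identifier, a reply that arrives days later, or in several parts, can still be matched to the question that prompted it.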

Look-back requires not only a two-way messaging capability (asking for more information and receiving it) but also the capability to include in the request message the identity of the individual for whom additional data are requested. This seemingly simple requirement is surprisingly difficult to satisfy owing to the current ethical balance between protecting a patient's right to confidentiality and the benefit to the public's health. At a minimum, a biosurveillance organization must send sufficient information back to a data provider to allow the data provider to uniquely identify the patient. We refer to information that is sufficient for the receiving organization to identify the individual as an identifier. A basic requirement for look-back is that the biosurveillance organization receives this identifier as part of its routine data collection process.

Table 35.1 lists a variety of identifiers that you could specify for your routine data collection and our impression of the administrative feasibility of obtaining them routinely for different types of biosurveillance data. Our opinion reflects the current balance between the need to protect confidentiality and the benefit to biosurveillance. Although many state health statutes allow a health department to collect any data needed for public health surveillance, in practice fully identifiable data are only collected for notifiable diseases (first row in Table 35.1). In current practice, all other data would require encryption of the identifier before transmission to a biosurveillance organization (Row 4), which would require that the data provider not only encrypt the identifier but also be capable

Table 35.1 Technical and Administrative Feasibility of Different Look-Back Identifiers for Different Types of Biosurveillance Data

| Identifier Sent Routinely (To Enable Later Look-Back) | Additional Work for Data Provider | Privacy Protection | Data Endorsed by Current Health Statutes |
| --- | --- | --- | --- |
| Unencrypted identifier (e.g., social security number)* | None | No | Usually only for reporting notifiable diseases to local and |
