




Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Proper data handling and management is crucial to the success and reproducibility of a statistical analysis. Selection of the appropriate tools and efficient ...
Typology: Summaries
1 / 8
This page cannot be seen from the preview
Don't miss anything!





Volume II: page 99
This chapter explores different tools and techniques for handling data for research purposes. This chapter assumes that a research problem statement has been formulated, research hypotheses have been stated, data collection planning has been conducted, and data have been collected from various sources (see Volume I for information and details on these phases of research). This chapter discusses how to combine and manage data streams, and how to use data management tools to produce analytical results that are error free and reproducible, once useful data have been obtained to accomplish the overall research goals and objectives.
Proper data handling and management is crucial to the success and reproducibility of a statistical analysis. Selection of the appropriate tools and efficient use of these tools can save the researcher numerous hours, and allow other researchers to leverage the products of their work. In addition, as the size of databases in transportation continue to grow, it is becoming increasingly important to invest resources into the management of these data.
There are a number of ancillary steps that need to be performed both before and after statistical analysis of data. For example, a database composed of different data streams needs to be matched and integrated into a single database for analysis. In addition, in some cases data must be transformed into the preferred electronic format for a variety of statistical packages. Sometimes, data obtained from “the field” must be cleaned and debugged for input and measurement errors, and reformatted.
The following sections discuss considerations for developing an overall data collection, handling, and management plan, and tools necessary for successful implementation of that plan.
The data collection, handling, and management plan plays an important role within a research project. The plan provides a roadmap documenting the flow of data through the sequential phases of collection, storage, cleaning, reduction, analysis, and finally to archiving. Further, the management plan documents the relationships between all of the software tools and programs necessary to guide the data through this research life cycle.
The data handling and management plan needs to be developed before a research project begins. The plan, however, can evolve as the researcher learns more about the data, and as new avenues of data exploration are revealed.
The data collection, handling, and management plan addresses three major areas of concern : Data Input, Storage, Retrieval , Preparation; Analysis Techniques and Tools ; and Analysis Mechanics. These concerns are not independent, and have synergistic impacts on the plan.
Provided below is a list of questions that must be considered when formulating a data collection, handling, and management plan. Although the questions are organized into three major categories, many questions raised will affect decisions made in two or more categories.
Are the data “clean?” The data input process oftentimes introduces typos, miscodes, and errors into the data. (These errors are distinctly different from random or measurement errors introduced in the measurement process). Different storage strategies support differing levels of data editing. [Do the data need to be edited? Do edits need to be tracked?]
Are the data static, or will updates be available through the lifetime of the analysis? Many data streams are updated periodically. For example, transportation safety data can be updated on an annual basis. Pavement data may be updated on a route-by-route basis as field observations are recorded. [Will new data constantly be added? Will new data be in the same format? Will new fields be added?]
Are the data obtained from a variety of sources or from a single source? As data are drawn from a greater number of sources, the need for transforming the data to a common format becomes more critical and challenging. Further, one must always consider whether the different sources use the same definitions for common variables. For example, does one source define “average delay” in the same manner as another? If not, can a simple transformation between definitions be established?
How much data will be managed and stored? Different storage strategies support varying record sizes and large numbers of records. [Will storage space be a problem? Will access time be a problem?]
Will all data be used in the analysis, or will subsets of the data be analyzed? To speed analysis, the researcher will sometimes want to work with a subset of fields rather than all database fields within a record at once. In other cases, only a subset of records will be analyzed. For example, a research may investigate traffic flow and speed relationships for workdays only. [Are sophisticated query and data sub-setting features needed?]
Do records in the database share common, duplicated information? Duplicative information wastes storage space, and in some cases creates database problems. In a safety analysis, for instance, geometric information may be duplicated if accidents occur at
The researcher may use one or more of the following solutions in the overall management and handling of data. The key is to recognize that there is not a one-size-fits-all solution, and so the best solution for the particular task at hand must be carefully selected.
Text files and spreadsheets are categorized as “small” databases.
In text files, often called “flat files,” all records related to a particular analysis are stored in consecutive lines in the file. Text files are the “least common denominator” of files and are generally used when transferring data from one statistical application to another.
From the research perspective, small-scale databases offer several advantages. First, they are simple to understand. All records in the database are the same, and all the necessary information is contained in each record (i.e., the researcher need not look up additional information in a separate table.) Second, they are simple to use. In general, the researcher will use all the information in the database, rather then selecting a portion of the data. Third, flat-file or spreadsheet databases are simple to update, edit, change, or append.
On the other hand, small-scale databases offer several disadvantages. First, they are not designed for “querying” or locating specific records (e.g., all the accidents from a specific site.) This becomes a greater problem as the number of records in the database increases, and queries become important. Second, duplicate information across records can increase storage requirements, and result in a propagation of errors through the analysis. For example, suppose that records in an accident database contained site- specific data. All accidents at a specific site would contain duplicate information. If the researcher found it necessary to update the site-data at this site, all records in the database would have to be updated simultaneously. Third, only one person (or application) can use the database at a time. When edits or changes are being made, the entire database is locked, and no one else can be provided access.
For many research projects, the small-scale solution (e.g., flat-file or spreadsheet) is appropriate. Other solutions should be examined when multiple users require access to the data, when the amount of data is large, or when the data is constantly being modified, queried, or appended.
Medium-scale data solutions include the desktop relational database management systems (RDBMS). These systems store data in the form of related tables. Relational databases are powerful because they require few assumptions about how data is related or how it will be extracted from the database. Consequently, the same database can be viewed in many different ways. An important feature of relational systems is that a single database can be distributed across several tables. This differs from flat-file databases, in which each database is contained in a single table. The internal organization of data can affect how quickly and flexibly the analyst can extract information.
Requests for information from a relational database are made in the form of a query, or stylized question. For example, the query
SELECT ALL WHERE NAME = “SMITH” AND AGE > 35
requests all records in which the NAME field is SMITH and the AGE field is greater than
The information contained in a database can be presented in a variety of formats. Most DBMSs include a report writer program that enables you to output data in the form of a report. Many DBMSs also include a graphics component that enables you to output information in the form of graphs and charts.
Medium-scale solutions tend to focus on the single-user. Performance of these systems tends to degrade rapidly as the number of users accessing the data at any given time increases.
Examples of medium-scale database management solutions include FoxPro, Paradox, DBase, and Microsoft Access.
Medium-scale solutions are appropriate for a large number of research applications. Other solutions should be examined when the size of the database exceeds around 500 megabytes (MB), when the data is constantly updated (e.g., real-time data collection), or when the number of concurrent users and/or analysts is large.
While large-scale databases offer many of the same features as medium-scale databases, they differ in several important aspects. First, they are generally designed to handle large numbers of records (e.g., millions and greater.) As such, the overhead of such a system is significantly greater than that of a medium-scale database, but this is offset by the speed at which the data can be queried. Second, large-scale databases include some level of transaction logging. This feature tracks changes to the database, and allows the system manager to “rollback” or “undo” modifications if they are found to be in error. Third, large-scale databases are designed to provide multiple users with concurrent access to the data. It is common to have several hundred users concurrently querying a large-scale database.
Large-scale databases tend to require a significant investment in the computer hardware, and generally need expertise and resources available to manage, upgrade, and maintain the system.
Large-scale databases include MS SQL Server, Oracle, Informix, Interbase, and Sybase.
Large-scale solutions are appropriate for special research applications. Traffic management systems for a city, county, or state, for example, can manage their huge databases with large-scale database management solutions.
A data warehouse differs from a regular database in several aspects: 1) A data warehouse is read-only; 2) A data warehouse contains data from disparate sources that
several programs be executed on these data in a specific sequence with the outputs of one program becoming the inputs to the next; and 3) reports be generated during the process.
Large-scale analyses do not simply “appear.” They are performed incrementally, with the researcher focusing on one small aspect of the problem, and when completed, turning her attention to the next aspect of the analysis. In many cases, the results of one analysis (or data transformation) are used as inputs to the next analysis.
Make allows the researcher to encapsulate the manipulations employed during each step of the analysis into a single set of computer instructions. This helps ensure that other researchers on other computers can repeat the overall analysis in the future.
Further, make helps the researcher operate more efficiently. For example, suppose that one step of a complex analysis takes several hours to complete, and suppose that the outputs of this analysis are used by subsequent programs for additional analysis. If the researcher simply used a “batch” file to store the necessary instructions, the long analysis step would have to be performed each time the researcher made a change to one of the smaller, subsequent programs. By using make , only the necessary programs would be executed, eliminating the need to constantly run the long analysis each time.
Perl is another of the Unix utilities that can save the researcher hours of time when handling statistical data, and performing statistical analysis. Like make , perl is available for most operating systems and will run on laptops, desktops, workstations, and servers.
Per l stands for “Practical Extraction and Reporting Language.” Perl was designed as a tool for manipulating large data files, and as a “glue” language between different software packages. Perl makes it easy to manipulate numbers and text, files and directories, computers and networks, and especially other programs. With perl , it is easy to run other programs, then scan their output files for specific results, and then send these specific results off to other programs for additional analysis. Perl code is easy to develop, modify, and maintain. Perl is portable, and the same perl program can run on a variety of computer platforms without changes. Perl programs are text files and can easily be shared with other researchers.
Example: The researcher spends a surprising amount of time transforming data from one format to another. For example, many data collection devices generate text files with the individual data elements separated by commas with each record containing a variable number of characters and fields.
Many statistical analysis software packages require that each input record contain the same number of characters and fields. In order for these software packages to analyze data from data collection devices, the output of the data collection device must be cleaned (e.g., all “funny” characters must be removed), and reformatted into an electronic format that can be loaded by the statistical software.
For small amounts of data, many researchers perform this cleaning and reformatting task by hand, or within a general-purpose software package like a spreadsheet. However, this approach is prone to errors, and can result in irreproducible results. It also becomes very cumbersome and time consuming when large data files, or large numbers of files are transformed.
Perl is the perfect tool for these types of jobs, and it renders the analysis repeatable, which is an important aspect of the scientific method.