
Advice for preparing a simple database for statistical analysis using Excel
1. Excel is probably the easiest and most user friendly package to use when preparing simple databases for
analysis. Unfortunately, as data points are easily changed, Excel is also the most easily corruptible data entry
package. To reduce mistakes due to improper sorting and accidental over-writing, consistently save backup
copies of the data.
The rule of thumb for choosing Excel rather than a database is, “Are you prepared to re-enter, from primary
sources, all the accumulated data if you accidentally corrupt it?” If you’re not, seek assistance with the design
of a database.
2. Any dataset provided must have an accompanying document describing the columns in the worksheet and the
coding of each field. This is often called a ‘data dictionary’.
3. The most common way to structure data is to have one person (subject or patient) per row and one variable per
column, with no blank rows or columns anywhere in the database.
4. Variable headings should be in a single row, the top row of the sheet, with each heading preferably limited to 8
characters or less. Avoid using blank spaces in the headings (an underscore can be used instead of a space),
and avoid the following characters…
i. ! @ # $ % ^ & * ( ) { } [ ] = - \ ; < > ?
ii. … either in the headings or in other fields within the database.
5. In columns of numbers, all characters such as (>, \ / .?) must be removed. As a guide, Excel will right justify
numbers and left justify characters. As > and < signs are not recognised by statistics packages, please think
carefully about what single numerical value should be used to replace these values, and preferably supply two
columns – one containing the original, mixed alpha-numeric data and one containing the numeric data.
6. Fields that contain missing data should be left completely blank (rather than NA, missing or ? or * or 999). Most
statistical packages can cope with importing a blank field (and will replace it with the missing value code unique
to the package).
7. Treatment groups and other groups (eg. Gender, Ethnicity, Marital Status) to which subjects belong should be
indexed by categorical factors (each factor has its own column in the spreadsheet). Avoid putting more than
one factor in a column (it complicates the importation of the data into a statistical package) and note that while
colour coding can be helpful when visually checking the data, it cannot, unfortunately, be interpreted by m ost
statistical packages – hence the need to include factors to index the groups of interest.
Also avoid labelling the categories with textual codes that can be entered in multiple ways (e.g. Female,
female, FEMALE) because Excel will read these different versions as distinct categories and they will be
imported into statistical packages as such. More preferable are either numeric codes or single letter text codes,
entered consistently
8. For ease of importing the data into a statistical analysis package, remove any superfluous data, summary tables
and charts from the sheet that contains the “analysis data set”.
9. Excel is quite finick y about date variables so it’s important to use a consistent date format (and save the sheet
consistently in a particular file-type). Cutting and pasting date fields from old to new versions of Excel (and vice
versa) can create problems when importing the sheet into another package.
10. Repeated Measures: When only 2 time points exist, it is sometimes advantageous to structure repeats across
the page (eg BPpre & BPpost). When there are more than 2 time points it is generally better to structure the
data down the page, with an additional column indicating time (i.e. a “stacked” data sheet) and repeating ID
codes for the subjects and their treatment groups. Statistical packages can easily switch the data structure from
“wide” to “stacked”.