Preparing Simple Databases for Statistical Analysis using Excel: Best Practices, Exams of Designs and Groups

Guidelines for preparing simple databases for statistical analysis using microsoft excel. Topics covered include data organization, formatting, and avoiding common mistakes to ensure data integrity and ease of importation into statistical packages.

Typology: Exams

2021/2022

Uploaded on 07/05/2022

paul.kc
paul.kc 🇦🇺

4.7

(68)

1K documents

1 / 4

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Advice for preparing a simple database for statistical analysis using Excel
1. Excel is probably the easiest and most user friendly package to use when preparing simple databases for
analysis. Unfortunately, as data points are easily changed, Excel is also the most easily corruptible data entry
package. To reduce mistakes due to improper sorting and accidental over-writing, consistently save backup
copies of the data.
The rule of thumb for choosing Excel rather than a database is, “Are you prepared to re-enter, from primary
sources, all the accumulated data if you accidentally corrupt it?” If you’re not, seek assistance with the design
of a database.
2. Any dataset provided must have an accompanying document describing the columns in the worksheet and the
coding of each field. This is often called a ‘data dictionary’.
3. The most common way to structure data is to have one person (subject or patient) per row and one variable per
column, with no blank rows or columns anywhere in the database.
4. Variable headings should be in a single row, the top row of the sheet, with each heading preferably limited to 8
characters or less. Avoid using blank spaces in the headings (an underscore can be used instead of a space),
and avoid the following characters
i. ! @ # $ % ^ & * ( ) { } [ ] = - \ ; < > ?
ii. … either in the headings or in other fields within the database.
5. In columns of numbers, all characters such as (>, \ / .?) must be removed. As a guide, Excel will right justify
numbers and left justify characters. As > and < signs are not recognised by statistics packages, please think
carefully about what single numerical value should be used to replace these values, and preferably supply two
columns one containing the original, mixed alpha-numeric data and one containing the numeric data.
6. Fields that contain missing data should be left completely blank (rather than NA, missing or ? or * or 999). Most
statistical packages can cope with importing a blank field (and will replace it with the missing value code unique
to the package).
7. Treatment groups and other groups (eg. Gender, Ethnicity, Marital Status) to which subjects belong should be
indexed by categorical factors (each factor has its own column in the spreadsheet). Avoid putting more than
one factor in a column (it complicates the importation of the data into a statistical package) and note that while
colour coding can be helpful when visually checking the data, it cannot, unfortunately, be interpreted by m ost
statistical packages hence the need to include factors to index the groups of interest.
Also avoid labelling the categories with textual codes that can be entered in multiple ways (e.g. Female,
female, FEMALE) because Excel will read these different versions as distinct categories and they will be
imported into statistical packages as such. More preferable are either numeric codes or single letter text codes,
entered consistently
8. For ease of importing the data into a statistical analysis package, remove any superfluous data, summary tables
and charts from the sheet that contains the “analysis data set”.
9. Excel is quite finick y about date variables so it’s important to use a consistent date format (and save the sheet
consistently in a particular file-type). Cutting and pasting date fields from old to new versions of Excel (and vice
versa) can create problems when importing the sheet into another package.
10. Repeated Measures: When only 2 time points exist, it is sometimes advantageous to structure repeats across
the page (eg BPpre & BPpost). When there are more than 2 time points it is generally better to structure the
data down the page, with an additional column indicating time (i.e. a “stacked” data sheet) and repeating ID
codes for the subjects and their treatment groups. Statistical packages can easily switch the data structure from
“wide” to “stacked”.
pf3
pf4

Partial preview of the text

Download Preparing Simple Databases for Statistical Analysis using Excel: Best Practices and more Exams Designs and Groups in PDF only on Docsity!

Advice for preparing a simple database for statistical analysis using Excel

1. Excel is probably the easiest and most user friendly package to use when preparing simple databases for

analysis. Unfortunately, as data points are easily changed, Excel is also the most easily corruptible data entry

package. To reduce mistakes due to improper sorting and accidental over-writing, consistently save backup

copies of the data.

The rule of thumb for choosing Excel rather than a database is, “Are you prepared to re-enter, from primary

sources, all the accumulated data if you accidentally corrupt it?” If you’re not, seek assistance with the design

of a database.

2. Any dataset provided must have an accompanying document describing the columns in the worksheet and the

coding of each field. This is often called a ‘data dictionary’.

3. The most common way to structure data is to have one person (subject or patient) per row and one variable per

column, with no blank rows or columns anywhere in the database.

4. Variable headings should be in a single row, the top row of the sheet, with each heading preferably limited to 8

characters or less. Avoid using blank spaces in the headings (an underscore can be used instead of a space),

and avoid the following characters…

i.! @ # $ % ^ & * ( ) { } [ ] = - \ ; < >?

ii. … either in the headings or in other fields within the database.

5. In columns of numbers, all characters such as (>, \ / .?) must be removed. As a guide, Excel will right justify

numbers and left justify characters. As > and < signs are not recognised by statistics packages, please think

carefully about what single numerical value should be used to replace these values, and preferably supply two

columns – one containing the original, mixed alpha-numeric data and one containing the numeric data.

6. Fields that contain missing data should be left completely blank (rather than NA, missing or? or * or 999). Most

statistical packages can cope with importing a blank field (and will replace it with the missing value code unique

to the package).

7. Treatment groups and other groups (eg. Gender, Ethnicity, Marital Status) to which subjects belong should be

indexed by categorical factors (each factor has its own column in the spreadsheet). Avoid putting more than

one factor in a column (it complicates the importation of the data into a statistical package) and note that while

colour coding can be helpful when visually checking the data, it cannot, unfortunately, be interpreted by most

statistical packages – hence the need to include factors to index the groups of interest.

Also avoid labelling the categories with textual codes that can be entered in multiple ways (e.g. Female,

female, FEMALE) because Excel will read these different versions as distinct categories and they will be

imported into statistical packages as such. More preferable are either numeric codes or single letter text codes,

entered consistently

8. For ease of importing the data into a statistical analysis package, remove any superfluous data, summary tables

and charts from the sheet that contains the “analysis data set”.

9. Excel is quite finicky about date variables so it’s important to use a consistent date format (and save the sheet

consistently in a particular file-type). Cutting and pasting date fields from old to new versions of Excel (and vice

versa) can create problems when importing the sheet into another package.

10. Repeated Measures: When only 2 time points exist, it is sometimes advantageous to structure repeats across

the page (eg BPpre & BPpost). When there are more than 2 time points it is generally better to structure the

data down the page, with an additional column indicating time (i.e. a “stacked” data sheet) and repeating ID

codes for the subjects and their treatment groups. Statistical packages can easily switch the data structure from

“wide” to “stacked”.

11. For studies (e.g. clinical trials) with a complex visit structure we encourage you to seek the advice of

professional data managers and database designers as well as a biostatistician. Generally these studies will

have three styles of linked “sheets” or datasets. A demographic dataset with one row or record per subject, a

visit dataset with multiple records per subject (one record for each visit, indexed by a visit or sample date), and

running logs with multiple records per subject (each record corresponding to a new incident and each record

with a start date and either a stop, or censor, date). The datasets are linked by subject ID and records are

uniquely identified by dates within ID’s. Statistical packages can easily merge and link data from different

sources but this facility relies on the sources being set up appropriately with variable names chosen carefully

and linking variables defined clearly.

12. For ease of importation into statistical packages, where possible (e.g. with small studies) the final dataset for

analysis should be contained in a single worksheet rather than spread over multiple short worksheets.

13. Although ideally researchers should provide a single, finalised dataset for analysis, in reality this may not occur

because omissions, errors, corrections or new data may arise. In such circumstances it is imperative that

updated versions of the dataset should be identified in the name of the file, the variables names should not

change between versions, and the general layout of the dataset should not be altered.

The way to structure repeated data down the page

  • 1 0009722 12-May-63 44 1 1 75 40 2. Diabetic UR DOB Age Male Time Bp Sodium Troponin
  • 1 0009722 12-May-63^44 1 2 50 40 1.
  • 1 0009722 12-May-63^44 1 3 70 70 12.
  • 1 0009722 12-May-63^44 1 4 80 50 3.
  • 1 0009722 12-May-63^44 1 5 80 60
  • 0 4444444 1-Jan-02^5 0 1 67 23 0.
  • 0 4444444 1-Jan-02^5 0 2 68 25 0.