Introduction to Bioinformatics Databases - Intorduction to Bioinformatics - Lecture Slides, Slides for Bioinformatics. Dhirubhai Ambani Institute of Information and Communication Technology
abhae
abhae9 January 2013

Introduction to Bioinformatics Databases - Intorduction to Bioinformatics - Lecture Slides, Slides for Bioinformatics. Dhirubhai Ambani Institute of Information and Communication Technology

PDF (681 KB)
68 pages
762Number of visits
Description
These are the important key points of lecture slides of Introduction to Bioonformatics are:Introduction to Bioinformatics Databases, Size of Genbank, Database Background, Codd Normal Forms, Basic Structure, Unique Identi...
20points
Download points needed to download
this document
Download the document
Preview3 pages / 68
This is only a preview
3 shown on 68 pages
Download the document
This is only a preview
3 shown on 68 pages
Download the document
This is only a preview
3 shown on 68 pages
Download the document
This is only a preview
3 shown on 68 pages
Download the document
Introduction to Bioinformatics Databases

Introduction to Bioinformatics Databases

Docsity.com

The Problem • Current size of Genbank (June 2011): 129,178,292,958 bp (1.3 x

1011, or 129 tera-base pairs) in 140,482,268 entries. This doesn’t include unprocessed genomic sequences, which would double the size. – This would be hard to deal with if it were all written on pieces of paper

stored in file cabinets. – In 1982, Genbank contained 690,338 bp in 606 entries, which fit on two

5 ¼ inch floppy disks (360 kB capacity), which Genbank mailed to you. • At its simplest, a database is a way to store information and retrieve

it efficiently. – However, nearly all databases add value to the information by processing it in

different ways.

• We want to introduce some ideas about how databases work, and what kinds of databases are available.

Docsity.com

Database Background • The general concepts of storing and retrieving data go back to

very beginnings of writing. • How to record data in a uniform fashion, and how to file it where

you can find it again later. Concepts like: – forms, – alphabetical order, – serial numbers, – filing cabinets with separate drawers and folders within drawers – Use of machines to accurately tabulate information (typewriter,

adding machine, etc.)

• Computers allowed even larger amounts of information to be stored – Computers are like being blind: they deal with

information one small chunk at a time, and they can’t see what’s coming next.

Sumerian land purchase records from about 2400 BCE

Docsity.com

Codd’s Normal Forms • In 1970, IBM researcher E. F. Codd published the seminal paper “A

relational model of data for large shared data banks”, which specified the basic design principles still used today for designing databases.

– “relational” databases, that is. There are other ways of doing databases, not as widely used.

• Codd’s fundamental insight was to put data into multiple tables connected together by unique keys. This is opposed to the typical spreadsheet idea of having all the data together in one big table.

• Making a database adhere to Codd’s principals is called “normalization”, and the principles themselves are called “normal forms”.

– At present there are 6-8 normal forms: defining them is part of database theoretical research. “relational algebra”

• Databases that conform to the normal forms are: – Easy and fast to search, and give the correct unique results – Easy to update: each piece of data is stored in a single location – Easy to extend to new types of data

Docsity.com

Basic Structure • Databases are composed of tables of data.

– Tables hold logically related sets of data. A table is essentially the same thing as a spreadsheet: a set of rows and columns.

• Each table has several records: – a record stores all the information for a given individual – Records are the rows of a data table

• Each record has several fields: – A field is an individual piece of data, a single attribute of the record. – Fields are the columns of a data table

• Each record (row) has a unique identifier, the primary key. – the primary key serves to identify the data stored in this record across all the

tables in the database. • Databases are manipulated with a language called SQL (Structured Query

Language). It’s a “baby English” type of language: uses real words, but rigid in terms of the order and placement.

• Various database software: Oracle, MS Access, MySQL, etc. Docsity.com

Tables in the GL database • Example: a pet grooming business called Grooming Lodge.

– thanks to L. Jennifer Bosch, who created this example • I am using database software called MySQL. It is running on one of our

departmental servers, using a “command line” interface. The computer shows me a prompt: mysql> and I type in a command: show tables; I then hit Enter, and the data prints onto my monitor.

• This database uses 3 tables: 1. Charges: a list of each bill sent to a customer 2. Clients: contact information for each customer 3. Pets: a list of each individual pet seen at Grooming Lodge.

mysql> show tables; +--------------+ | Tables_in_GL | +--------------+ | charges | | clients | | pets | +--------------+

Docsity.com

GL Data Table • Here is part of the “clients” table. Each record (row) is a client, and each client has

several attributes or fields (columns). • Note that each client has a unique identifier, the client_id. This is a very important

aspect of a good database table: each record represents one unique individual, with no duplicates and none left out.

– Names themselves are often not unique, so an identifier number is used. – The unique ID for each record in the table is the primary key.

mysql> select client_id,name_last,name_first, phone,last_visit,balance from clients; +-----------+------------+------------+--------------+------------+---------+ | client_id | name_last | name_first | phone | last_visit | balance | +-----------+------------+------------+--------------+------------+---------+ | 1 | Bosch | Linda | 123-234-3456 | 2011-03-18 | 134 | | 2 | Harford | Cornelius | 234-354-2987 | 2011-01-10 | 0 | | 3 | Perkins | Laura | 815-823-9000 | 2010-05-01 | 0 | | 4 | Gramme | Barbara | 898-555-2008 | 2011-02-05 | 23 | | 5 | Granillo | Richard | 323-543-3328 | 2011-03-01 | 23 | | 6 | Hambourger | Colleen | 959-456-2345 | 2010-10-15 | 0 | | 7 | Harrell | Kenneth | 324-888-5555 | 2010-03-08 | 0 |

+-----------+------------+------------+--------------+------------+---------+ Docsity.com

Another Table • On the “pets” table, each pet gets its own record and ID: pet_id is the primary key

on this table. • Pets are associated with their owner through the client_id, which is the same as

the client_id primary key in the “clients” table. • Note that all the data fro each pet and for each client is only entered once. All

connections between the tables go through the primary keys. • The “charges” table has these fields: client_id, pet_id, job_code, and fee.

mysql> select pet_id,client_id,name,species,age,sex,breed from pets; +--------+----------+---------+---------+------+------+-----------------+ | pet_id | client_id | name | species | age | sex | breed | +--------+----------+---------+---------+------+------+-----------------+ | 1 | 1 | Merlin | feline | 3 | Mn | DSH | | 2 | 1 | Azzy | feline | 4 | Fs | DSH | | 3 | 1 | Brach | feline | 4 | Mn | DSH | | 4 | 2 | Wheezy | canine | 7 | Mn | Yorkie mix | | 5 | 3 | Foghorn | avian | 12 | F | Amazon - Mexica | | 6 | 4 | Pepper | feline | 13 | Fs | DLH | | 7 | 5 | Arlyn | canine | 5 | Fs | Afghan hound | | 8 | 5 | Taylor | canine | 7 | Ms | Afghan hound | | 9 | 6 | Creeper | reptile | 2 | M | iguana | | 10 | 6 | Jasper | canine | 10 | Mn | Dobie/Greyhound | | 11 | 7 | Maxwell | canine | 2 | Mn | Dauschund | +--------+----------+---------+---------+------+------+-----------------+

Docsity.com

First Normal Form • As an example of how database tables are made, here are Codd’s original

concepts, the first normal form. He also came up with second and third normal forms, in later publications.

• 1NF is a way of putting data tables into a regular order, so they are easy to process.

• The first normal form (1NF):

– The order of rows and columns is irrelevant (so you can’t have a row that refers to the “previous” row, etc.).

• This allows random access to the data, rather than needing to read it from top to bottom.

– No duplicate rows (this way, the data are stored only a single time) – No repeating columns – Each cell (row-column intersection) must contain a single item of data.

Docsity.com

Grooming Lodge Data and 1NF • What would the data look like as a spreadsheet?

– One logical approach would be to list each client on a separate line along with their contact information. Since some people have more than one pet, this would lead to multiple data items in a single cell, or multiple identical columns. Hard to update accurately or search quickly.

– Another approach would be to list each pet on a separate line along with its owner. This would lead to multiple copies of owner data, which will be difficult to update.

– Another approach would be to list each owner a single time, with multiple pets referring to that line. This requires that the data be read in a specific order, top to bottom.

• By using 2 tables, one for clients and one for pets, all the data is in 1NF and is easy to search and update.

Docsity.com

+-----------+------------+------------+--------------+------------+------+ | client_id | name_last | name_first | phone | pet | species | +-----------+------------+------------+--------------+------------+------+ | 1 | Bosch | Linda | 123-234-3456 | Merlin | feline | | 1 | Bosch | Linda | 123-234-3456 | Azzy | feline | | 1 | Bosch | Linda | 123-234-3456 | Brach | feline | | 2 | Harford | Cornelius | 234-354-2987 | Wheezy | canine | | 3 | Perkins | Laura | 815-823-9000 | Foghorn | avian | | 5 | Granillo | Richard | 323-543-3328 | Arlyn | canine | | 5 | Granillo | Richard | 323-543-3328 | Taylor | canine | +-----------+------------+------------+--------------+------------+------+

Each pet gets its own line, resulting in multiple rows for a single client.

+-----------+------------+------------+--------------+------------------+ | client_id | name_last | name_first | phone | pets | +-----------+------------+------------+--------------+------------------+ | 1 | Bosch | Linda | 123-234-3456 | Merlin (feline), | | Azzy (feline), | | Brach (feline) | | 2 | Harford | Cornelius | 234-354-2987 | Wheezy (canine) | | 3 | Perkins | Laura | 815-823-9000 | Foghorn (avian) | | 5 | Granillo | Richard | 323-543-3328 | Arlyn (canine), | | | Taylor (canine) | +-----------+------------+------------+--------------+------------------+

Multiple data items in a single cell. Very difficult to search efficiently.

Docsity.com

+----+------------+------------+--------------+-----------------+---------------+---------------+ |id | name_last | name_first | phone | pet | pet | pet | +----+------------+------------+--------------+-----------------+---------------+---------------+ | 1 | Bosch | Linda | 123-234-3456 | Merlin (feline) | Azzy (feline) | Brach (feline)| | 2 | Harford | Cornelius | 234-354-2987 | Wheezy (canine) | | | | 3 | Perkins | Laura | 815-823-9000 | Foghorn (avian) | | | | 5 | Granillo | Richard | 323-543-3328 | Arlyn (canine) |Taylor(canine) | | +----+------------+------------+--------------+-----------------+---------------+---------------+

Multiple columns for pets. Frequent need to expand the table, lots of columns with no data.

+-----------+------------+------------+--------------+---------+---------+ | client_id | name_last | name_first | phone | pet | species | +-----------+------------+------------+--------------+---------+---------+ | 1 | Bosch | Linda | 123-234-3456 | Merlin | feline | | see 1 | | | | Azzy | feline | | see 1 | | | | Brach | feline | | 2 | Harford | Cornelius | 234-354-2987 | Wheezy | canine | | 3 | Perkins | Laura | 815-823-9000 | Foghorn | avian | | 5 | Granillo | Richard | 323-543-3328 | Arlyn | canine | | see 5 | | | | Taylor | canine | +-----------+------------+------------+-------------+---------+----------+ Rows refer to each other, requiring that the data be read from top to bottom.

Docsity.com

Searching • Retrieving information

is a matter of specifying what you want:

– which fields, – which tables, – which records you

want, based on the values of given fields (i.e. the conditions for selection).

mysql> SELECT name_last, name_first, phone,balance FROM clients WHERE balance>0; +-----------+------------+--------------+---------+ | name_last | name_first | phone | balance | +-----------+------------+--------------+---------+ | Bosch | Linda | 123-234-3456 | 134 | | Gramme | Barbara | 898-555-2008 | 23 | | Granillo | Richard | 323-543-3328 | 23 | +-----------+------------+--------------+---------+

“who owes money?”

mysql> SELECT name, pet_id FROM pets WHERE species='feline'; +--------+--------+ | name | pet_id | +--------+--------+ | Merlin | 1 | | Azzy | 2 | | Brach | 3 | | Pepper | 6 | +--------+--------+

“which pets are cats?”

mysql> SELECT name_last, name_first, pets.name FROM clients, pets WHERE clients.client_id=pets.client_id AND species='feline'; +-----------+------------+--------+ | name_last | name_first | name | +-----------+------------+--------+ | Bosch | Linda | Merlin | | Bosch | Linda | Azzy | | Bosch | Linda | Brach | | Gramme | Barbara | Pepper | +-----------+------------+--------+

“which clients have cats?” (this question uses data from two different tables, connected by the client_id).

Docsity.com

mysql> SELECT client_id,name_last,name_first,phone,last_visit,balance FROM clients; +-----------+------------+------------+--------------+------------+---------+ | client_id | name_last | name_first | phone | last_visit | balance | +-----------+------------+------------+--------------+------------+---------+ | 1 | Bosch | Linda | 123-234-3456 | 2002-03-18 | 134 | | 2 | Harford | Cornelius | 234-354-2987 | 2002-01-10 | 0 | | 3 | Perkins | Laura | 815-823-9000 | 2000-05-01 | 0 | | 4 | Gramme | Barbara | 898-555-2008 | 2002-02-05 | 23 | | 5 | Granillo | Richard | 323-543-3328 | 2002-03-01 | 23 | | 6 | Hambourger | Colleen | 959-456-2345 | 2001-10-15 | 0 | | 7 | Harrell | Kenneth | 324-888-5555 | 2001-03-08 | 0 |

+-----------+------------+------------+--------------+------------+---------+

mysql> SELECT pet_id,client_id,name,species,age,sex,breed FROM pets; +--------+----------+---------+---------+------+------+-----------------+ | pet_id | client_id | name | species | age | sex | breed | +--------+----------+---------+---------+------+------+-----------------+ | 1 | 1 | Merlin | feline | 3 | Mn | DSH | | 2 | 1 | Azzy | feline | 4 | Fs | DSH | | 3 | 1 | Brach | feline | 4 | Mn | DSH | | 4 | 2 | Wheezy | canine | 7 | Mn | Yorkie mix | | 5 | 3 | Foghorn | avian | 12 | F | Amazon - Mexica | | 6 | 4 | Pepper | feline | 13 | Fs | DLH | | 7 | 5 | Arlyn | canine | 5 | Fs | Afghan hound | | 8 | 5 | Taylor | canine | 7 | Ms | Afghan hound | | 9 | 6 | Creeper | reptile | 2 | M | iguana | | 10 | 6 | Jasper | canine | 10 | Mn | Dobie/Greyhound | | 11 | 7 | Maxwell | canine | 2 | Mn | Dauschund | +--------+----------+---------+---------+------+------+-----------------+

Docsity.com

Indexing • A database index is a data structure

that improves the speed of data retrieval, but at a cost of slower changes to the data and increased storage space.

• Any column (field) of any table can be indexed. The index is just a list of which records (rows) have given values for that field.

• You make an index that matches a frequent query. For instance, “which pets are cats? “ or “which pets belong to which client?”

• Note that you don’t need an index: you can search for anything in the database. But, the index makes the search go faster.

Indexed on “species” column feline : 1, 2, 3, 6 canine: 4, 7, 8, 10, 11 avian: 5 reptile: 9

Indexed on “client_id” column 1: 1, 2, 3 2: 4 3: 5 4: 6 5: 7, 8 6: 9, 10 7: 11

Docsity.com

Search Trees • From the computer’s point of view, the simplest way to

find a given record in a file is to start at the beginning of the file and read every record until it finds the right one. – This is obviously not very efficient for big files.

• It is much faster to keep the file sorted, and then search it in a binary fashion: as with finding a word in a dictionary.

• The basic concept is the binary tree. You find a given record by dividing the problem into halves repeatedly.

– To find “tree” in the dictionary, you open it halfway, then determine your word is in the second half, then open that part at the halfway point, etc.

– This is efficient: on average you look at logN records with a binary tree, and N/2 records with a linear search, where N is the number of records.

• Computers use a generalization of this concept: the B- tree, which uses multiple levels of index, and more than 2 child nodes for each parent node.

Docsity.com

Computational Complexity • Different programs use different amounts of time,

computer memory, number of calculations, etc., which greatly affect your ability to get results.

– Time, space, and number of computations are all interrelated.

• “Big O notation” is a way of quantifying this. The basic idea is how does the time, space, complexity scale with the number of objects being examined (n)?

– O(1) = constant time. E.g. determine whether a number is even or odd.

– O(log n) = logarithmic time. E.g. binary search – O(n) – linear time. E.g. finding something in an

unsorted list – O(n log n). E.g. fast sorting algorithms – O(n2). Quadratic time. Simple sorting “bubble sort” – -O(2n) = exponential time. Very bad scaling!

Travelling salesman problem, other NP-complete problems.

Travelling salesman problem: --how to plan a route between many cities minimizing the total distance travelled --trying all possibilities: the number increases exponentially --going to the closest remaining city each time doesn’t work.

Docsity.com

End of Database Theory • Lots more to database theory and practice: proper table design,

search queries, indexing. Writing SQL statements. • Mostly, users don’t need to worry about this. The user interface

takes your input and writes the appropriate SQL, retrieves the information you want, and formats it for you. Indexes have been created to speed up common query types.

• Of course this means you don’t have access to the raw data and you can’t make custom queries. And it usually means you can’t get all the information in the database downloaded onto your own computer to play with as you see fit.

• However, many online databases allow downloading of large amounts of information by a process called FTP, which we will discuss in a bit.

Docsity.com

Plain Text vs. Formatted Text • A basic bioinformatics problem is that your data analysis will probably use

several different applications. Also, you may want to do some direct manipulation or analysis of the data. If every application inputs and outputs different arrangements of the data, you will have a tough time getting them to work together.

• The common language for trading information between applications without having to deal with compatibility is the flat file written in plain text.

– As opposed to formatted text or a binary file or a file written in a proprietary format.

• Plain text is the lowest common denominator: just the letters, numbers, and punctuation marks that are on the keyboard.

• Formatted text contains codes that affect the appearance of the text, but which don’t appear on the screen.

– HTML is a good example of formatted text – Word processors (like MS Word) insert formatting codes that you can’t see.

Often the codes are proprietary. – All word processors these days are WYSIWYG: “what you see is what you get”.

What appears on the screen is what will print out, but there is more in the file than just the characters.

Docsity.com

HTML Formatting • The format codes are

enclosed in angle brackets, called tags: like <b>.

• The text to be formatted is between the opening tag and the closing tag (which is the same as the opening tag except that it has a slash: like </b>.

• Thus, all text between <b> and </b> is written in boldface.

• <h2> and <h4> are header tags: they start new sections of text.

• <p> is a paragraph tag.

Some Text Formatted by HTML This is a paragraph with Bold or italicized words.

<h2>Some Text</h2> <h4>Formatted by HTML</h4> <p>This is a paragraph with <b>Bold</b> or <i>italicized</i> words.</p>

In plain text this would appear as: Some text Formatted by HTML This is a paragraph with Bold or italicized words.

Docsity.com

Text Encoding • Computers store and process information as binary

bits: 0’s and 1’s. However, most information is actually processed in bytes, which are groups of 8 bits.

– A byte has 28 = 256 possible states. • Text files store each character as a single byte. • ASCII (American Standard Code for Information

Interchange) is a way of encoding the English alphabet, plus numbers and punctuation, plus some control characters (like TAB and EOT: “end of transmission”) and graphics elements (like ¬ ).

– ASCII is really a 7-bit code, with 128 possibilities. The eighth bit was used for error-checking: the code was originally developed for data transmission.

• A file written in ASCII (or a variant) is a text file. – Text files are readable with Notepad

• Files of compiled computer code, or in some proprietary format, are called binary files.

– Look like gibberish in Notepad.

Printout of an executable binary file. ^L ZY[å]Ê ÷Ãðÿu0øÑËÑËÑË ÑË© ´BÍ!r#ÇÖ¸) Å^W3Àë V´òô 2ä@Hð»)

Docsity.com

Docsity.com

ASCII Oddities • ASCII separates capital and small letters: why sorts

sometimes put all capital letter file names before the small letter names.

– Windows file names are not case sensitive, but Unix file names are. Thus MyFile.txt and myfile.txt are the same name in Windows, but different in Unix.

• The most troublesome control character is “newline”: at the end of each line of text.

– On a typewriter (or teletype) this requires both a “line feed” and a “carriage return”, which seems a bit wasteful.

– Unix just wants a line feed. – Macintosh just wants a carriage return – Windows wants both. – Sending files as “text” rather than “binary” or “default”

sometimes cleans up this problem. • Often written in octal (base 8) or hexadecimal

(base 16). Hexadecimal uses ABCDEF to represent the single “digits” 10-15.

Docsity.com

ASCII Extensions and Unicode • ASCII was developed for the US, and many countries use entirely

different alphabets and special symbols. – ASCII contains some “national symbols” that vary between countries.

• US national symbols: $ # @ % & = ? – 8 bit (256 character) ASCII: ISO-8859-1 (Latin 1) uses the

regular 128 ASCI characters as the first 128, then another set for the second 128. Used widely for internet applications

• Things like Ù ę þ ¿ • Unicode covers most of the world’s writing systems. It currently has

about 107,000 characters in 90 sets, plus rules for converting them, adding to them, etc.

– You can even encode ancient Egyptian hieroglyphics!

• Unicode can be in 1 byte (UTF-8), 2 byte (UTF-16), or 4 byte (UTF- 32) encoding.

– The first 256 Unicode characters are the same as the 8 bit (ISO-8895-1) ASCII

Docsity.com

Flat Files • A flat file is written in plain text, in a standard defined format • General text files written in plain text can be written by programs like

Notepad (for Windows) or TextEdit (Mac) or vi (Unix). They are often given a .txt extension on their file names.

– .txt helps Windows figure out what kind of file they are. Unix and Macs don’t use the extension for anything, but it helps the user remember what the file is.

– The word processing software just renders the characters in whatever font is uses as a default.

• For database tables, each record is on a separate line, and the fields are separated by delimiters, usually tabs or commas.

– Excel and other spreadsheet programs can easily open a flat file in this format, no matter which operating system is being used.

– Spreadsheet programs can also save the files in tab-delimited (or comma) format • DNA and protein sequences are written in FASTA format.

Docsity.com

comments (0)
no comments were posted
be the one to write the first!
This is only a preview
3 shown on 68 pages
Download the document