























Study with the several resources on Docsity
Earn points by helping other students or get them with a premium plan
Prepare for your exams
Study with the several resources on Docsity
Earn points to download
Earn points by helping other students or get them with a premium plan
Indexing, Index Key, Create Index, Disk Storage, Buffer Lookaside, Create Tablespace, Row Pointers, Embedded SQL Program, B-tree Index, Index Node Layout, Duplicate Keyvalues, Oracle Bitmap Index, Clustered Index, Non-clustered Indexes, Index-organized Tables, Hash Primary Key Index
Typology: Study notes
1 / 31
This page cannot be seen from the preview
Don't miss anything!
























Chapter 8. Indexing.
8.1 Overview. Usually, an index is like a card catalog in a library. Each card (entry) has:
(keyvalue, row-pointer) (keyvalue is for lookup, call row-pointer ROWID) (ROWID is enough to locate row on disk: one I/O)
Entries are placed in Alphabetical order by lookup key in "B-tree" (usually), explained below. Also might be hashed.
An index is a lot like memory resident structures you've seen for lookup: binary tree, 2-3-tree. But index is disk resident. Like the data itself, often won't all fit in memory at once.
X/OPEN Syntax, Fig. 8.1, pg. 467, is extremely basic (not much to standard).
CREATE [UNIQUE] INDEX indexname ON tablename (colname [ASC | DESC] {,colname [ASC | DESC].. .});
Index key created is a concatenation of column values. Each index entry looks like (keyvalue, recordpointer) for a row in customers table.
An index entry looks like a small row of a table of its own. If created concatenated index:
create index agnamcit on agents (aname, city);
and had (aname, city) for two rows eual to (SMITH, EATON) and (SMITHE, ATON), the system would be able to tell the difference. Which once comes earlier alphabetically?
After being created, index is sorted and placed on disk. Sort is by column value asc or desc, as in SORT BY description of Select statement.
NOTE: LATER CHANGES to a table are immediately reflected in the index, don't have to create a new index.
Ex 8.1.1. Create an index on the city column of the customers table.
create index citiesx on customers (city);
Note that a single city value can correspond to many rows of customers table. Therefore term index key is quite different from relational concept of primary key or candidate key. LOTS OF CONFUSION ABOUT THIS.
With no index if you give the query:
select * from customers where city = 'Boston' and discnt between 12 and 14;
Need to look at every row of customers table on disk, check if predicates hold (TABLE SCAN or TABLESPACE SCAN).
Probably not such a bad thing with only a few rows as in our CAP database example, Fig 2.2. (But bad in DB2 from standpoint of concurrency.)
And if we have a million rows, tremendous task to look at all rows. Like in Library. With index on city value, can winnow down to (maybe 10,000) customers in Boston, like searching a small fraction of the volumes.
User doesn't have to say anything about using an Index; just gives Select statement above.
Query Optimizer figures out an index exists to help, creates a Query plan (Access plan) to take advantage of it -- an Access plan is like a PROGRAM that extracts needed data to answer the Select.
Model 100 MIPS computer. Takes .00000001 seconds to acess memory and perform an instruction. Getting faster, like car for $9 to take you to the moon on a gallon of gas.
Disk on the other hand is a mechanical device — hasn't kept up.
(Draw picture). rotating platters, multiple surfaces. disk arm with head assemblies that can access any surface.
Disk arm moves in and out like old-style phonograph arm (top view).
When disk arm in one position, cylinder of access (picture?). On one surface is a circle called a track. Angular segment called a sector.
To access data, move disk arm in or out until reach right track (Seek time)
Wait for disk to rotate until right sector under head (rotational latency)
Bring head down on surface and transfer data from a DISK PAGE (2 KB or 4KB: data block, in ORACLE) (Transfer time). Rough idea of time:
Seek time: .008 seconds Rotational latency: .004 seconds (analyze) Transfer time: .0005 seconds (few million bytes/sec: ew K bytes)
Total is .0125 seconds = 1/80 second. Huge compared to memory access.
Typically a disk unit addresses ten Gigabytes and costs 1 thousand dollars (new figure) with disk arm attached; not just pack which is much cheaper.
512 bytes per sector, 200 sectors per track, so 100,000 bytes per track. 10 surfaces so 1,000,000 bytes per cylinder. 10,000 cylinders per disk pack.
Total is 10 GB (i.e., gigabytes)
Now takes .0125 seconds to bring in data from disk, .000'000'01 seconds to access (byte, longword) of data in memory. How to picture this?
Analogy of Voltaire's secretary. Copying letters at one word a second. Run into word can't spell. Send letter to St Petersberg, six weeks to get response (1780). Can't do work until response (work on other projects.)
From this see why read in what are called pages, 2K bytes on ORACLE, 4K bytes in DB2 UDB. Want to make sure Voltaire answers all our questions in one letter. Structure Indexes so take advantage of a lot of data together.
Buffer Lookaside
Similarly, idea of buffer lookaside. Read pages into memory buffer so can access them. Once to right place on disk, transfer time is cheap.
Everytime want a page from disk, hash on dkpgaddr, h(dkpgaddr) to entry in Hashlookaside table to see if that page is already in buffer. (Pg. 473)
If so, saved disk I/O. If not, drop some page from buffer to read in requested disk page. Try to fix it so popular pages remain in buffer.
Another point here is that we want to find something for CPU to do while waiting for I/O. Like having other tasks for Voltaire's secretary.
This is one of the advantages of multi-user timesharing. Can do CPU work for other users while waiting for this disk I/O.
If only have 0.5 ms of CPU work for one user, then wait 12.5 ms for another I/O. Can improve on by having at least 25 disks, trying to keep them all busy, switching to different users when disk access done, ready to run.
Why bother with all this? If memory is faster, why not just use it for da- tabase storage? Volatility, but solved. Cost no longer big factor:
Memory storage costs about a $4000 per gigabyte. Disk storage (with disk arms) costs about 100 dollars a Gigabyte.
So could buy enough memory so bring all data into buffers (no problem about fast access with millions of memory buffers; very efficient access.) Coming close to this; probably will do it soon enough. Right now, being limited by 4 GBytes of addressing on a 32 bit machine.
Operating systems files named in datafile clause. ORACLE is capable of creating them itself; then DBA loses ability to specify particular disks.
The idea here is that ORACLE (or any database system) CAN create its own "files" for a tablespace.
If SIZE keyword omitted, data files must already exist, ORACLE will use. If SIZE is defined,ORACLE will normally create file. REUSE means use existing files named even when SIZE is defined; then ORACLE will check size is right.
If AUTOEXTEND ON, system can extend size of datafile. The NEXT n [K|M] clause gives size of expansion when new extent is created. MAXSIZE limit.
If tablespace created offline, cannot immediately use for table creation. Can alter offline/online later for recovery purposes, reorganization, etc., without bringing down whole database.
SYSTEM tablespace, created with Create Database, never offline.
When table first created, given an initial disk space allocation. Called an initial extent. When load or insert runs out of room, additional allocations are provided, each called a "next extent" and numbered from 1.
Create Tablespace gives DEFAULT values for extent sizes and growth in STORAGE clause; Create Table can override these.
INITIAL n: size in bytes of initial extent: default 10240 NEXT n: size in bytes of next extent 1. (same) May grow. MAXEXTENTS n: maximum number of extents segment can get MINEXTENTS n: start at creation with this number of extents PCTINCREASE n: increase from one extent to next. default 50.
Minimum possible extent size is 2048 bytes, max is 4095 Megabytes, all extents are rounded to nearest block (page) size.
The MINIMUM EXTENT clause guarantees that Create Table won't be able to override with too small an extent: extent below this size can't happen.
Next, Create Table in ORACLE, Figure 8.5, pg. 478. ( Leave Up )
CREATE TABLE [schema.]tablename ({colname datatype [DEFAULT {constant|NULL}] [col_constr] { , col_constr…} | table_constr} {, colname datatype etc... .} [ORGANIZATION HEAP | ORGANIZATION INDEX (with clauses not covered)] [TABLESPACE tblspname] [STORAGE ([INITIAL n [K|M]] [NEXT n [K|M]] [MINEXTENTS n] [MAXEXTENTS n] [PCTINCREASE n] ) ] [PCTFREE n] [PCTUSED n] [other disk storage and update tx clauses not covered or deferred] [AS subquery]
ORGANIZATION HEAP is default: as insert new rows, normally placed left to right. Trace how new pages (data blocks) are allocated, extents. Note if space on old block from delete's, etc., might fill in those.
ORGANIZATION INDEX means place rows in A B-TREE INDEX (will see) in place of entries. ORDER BY primary key!
The STORAGE clause describes how initial and successive allocations of disk space occur to table (data segment). There can be a default storage clause with a tablespace, inherited by table unless overridden.
The PCTFREE n clause determines how much space on each page used for inserts before stop (leave space for varchar expand, Alter table new cols.)
PCTFREE n, n goes from 0 to 99, default 10.
The PCTUSED n clause specifies a condition where if page (block) gets empty enough, inserts will start again! Range n from 1 to 99, default 40.
Require PCTFREE + PCTUSED < 100, or invalid. Problem of hysteresis; don't always want to be switching from inserting to not inserting and back.
E.g., if PCTFREE 10 PCTUSED 90, then stop inserrts when >90% full, start when <90% full. If 20, 90, ill-defined between 80 and 90.
Data Storage Pages and Row Pointers: Fig. 8.
Class 9. Review Fig. 8.6 above (put on board, discuss)
Disk Pointer. RID (DB2), ROWID (ORACLE), TID (INGRES). A row in a table can be uniquely specified with the page number (P) and slot number (S). In INGRES, have TID (Tuple ID), given by:
TID = 512*P + S
Pages in INGRES are numbered successively within the table allocation, from zero to 2 23 - 1: all 1's in 23 bit positions. Slots are 0 to 511. all 1's in 9 bit positions: total container is 32 bits, unsigned int, 4 bytes.
So if rows are 500 bytes long (small variation), 2K byte pages in UNIX will contain only 4 rows, and TIDs go: 0, 1, 2, 3, 512, 513, 514, 515, 1024, 1025,
...
Note value of information hiding. Give row ptr in terms of Page and Slot number; if gave byte offset instead of slot number would have to change RID/ROWID/TID when reorganized page
Example 8.2.1, pg 481, (variant). INGRES DBA (or any user) can check space calculations are right (200 byte rows 10 to 2 KB disk page, make sure knows what's going on) by selecting TID from table for successive rows:
select tid from employees where tid <= 1024;
Note tid is a Virtual column associated with every table, so can select it. Will appear in order (tablespace scan goes to first pg, first row, then.. .)
A DB2 record pointer is called a RID, also 4 bytes, encoding page number within Tablespace and slot number, but DB2 RID structure is not public, and we can't Select a DB2 RID from a tableas a virtual column.
In ORACLE, row pointer is called ROWID. and is normally 6 bytes long!! Restricted ROWID display representation is made up of Block number (page) within OS file, Slot number in block, & file no. (why?):
BBBBBBBB.RRRR.FFFF (each hexadecimal, total of 8 bytes for ROWID)
(Block number, Row (Slot) number, File number)
The ROWID value for each row can be retrieved as a virtual column by an SQL Select statement on any table, as with the following query:
select cname, rowid from customers where city = 'Dallas';
which might return the following row information (if the ROWID retrieved is in restricted form):
CNAME CROWID
Basics 00000EF3.0000. Allied 00000EF3.0001.
The alternative “extended ROWID” form is displayed as a string of four components having the following layout, with letters in each component representing a base-64 encoding:
OOOOOOFFFBBBBBRRR
Here OOOOOO is the data object number, and represents the database segment (e.g., a table). The components FFF, BBBBB, and RRR represent the file number, block number, and row number (slot number).
Here is the “base-64” encoding, comparable to hexadecimal representation except that we have 64 digits. The digits are printable characters:
DIGITS CHARACTERS
0 to 25 A to Z (Capital letters) 26 to 51 a to z (lower case letters) 52 to 61 0 to 9 (decimal digits) 62 and 63 + and / respectively
For example, AAAAm5AABAAAEtMAAB represents object AAAAm5 = 3864 + 5, file AAb = 1, block AAEtM = 4642 + 44*64 + 13, and slot 1. The query:
select cname, rowid from customers where city = 'Dallas';
might return the row information (different values than restricted ROWID):
CNAME ROWID
Basics AAAAm5AABAAAEtMAAB Allied AAAAm5AABAAAEtMAAC
The B-tree is like the (2-3) tree in memory, except that nodes of the tree take up a full disk page and have a lot of fanout. (Picture on board.)
ORACLE Form of Create Index Figure 8.7, pg. 485 (leave on board):
CREATE [UNIQUE | BITMAP] INDEX [schema.]indexname ON tablename (colname [ASC | DESC] {,colname [ASC | DESC]…}) [TABLESPACE tblespace] [STORAGE... ] ( see pg. 478, par 4 ff, override Tablespace default) [PCTFREE n] [other disk storage and update tx clauses not covered or deferred] [NOSORT]
We will discuss the concept of a BITMAP Index later. Non-Bitmap for now.
Note ASC | DESC is not really used. ORACLE B-tree is usable in both directions (leaf sibling pointers left and right).
What Create Index does: reads through all rows on disk (assume N), pulls out (keyvalue, rowid) pairs for each row. Get following list put out on disk.
(keyval1, rowid1) (keyval2, rowid2)... (keyvalN, rowidN)
Now sort these on disk, so in order by kevalues. Explain NOSORT clause (telling ORACLE rows are in right order, so don't have to sort, but error returned if ORACLE notices this is false).
Now idea of binary search, Example 8.3.1, pg. 486 ff. Assume N = 7, searching array of structs: arr[K].keyval, arr[K].rowid (list of pairs named above). Ordered by keyval K = 0 to 6.
/* binsearch: return K so that arr[K].keyval == x, or -1 if no match; * / /* arr is assumed to be external, size of 7 is wired in * / int binsearch(int x) { i n t probe = 3, /* first probe position at subscript K = 3 * / diff = 2;
while (diff > 0) { /* loop until K to return * / if (probe <= 6 && x > arr[probe].keyval) probe = probe + diff; else
probe = probe - diff; diff = diff/2; } /* we have reached final K * / if (probe <= 6 && x == arr[probe].keyval) /* have we found it? * / return probe; else if (probe+1 <= 6 && x == arr[probe+1].keyval) /* maybe next * / return probe + 1; else return -1; /* otherwise, return failure * / }
Figure 8.8 Function binsearch, with 7 entries wired in
Consider the sequence of keyval values {1, 7, 7, 8, 9, 9, 10} at subscript po- sitions 0-6.
Work through if x = 1, binsearch will probe successive subscripts 3, 1, and 0 and will return 0. If x = 7, the successive subscript probes will be 3, 1, 0. (undershoot — see why?) return 1. If x = 5, return -1. More?
Given duplicate values in the array, binsearch will always return the smallest subscript K such that x == arr[K].keyval ( exercise).
The binsearch function generalizes easily: given N entries rather than 7 , choose m so that 2 m-1^ < N ≤ 2 m^ ; then initialize probe to 2 m-1^ -1 and diff t o 2 m-2^. Tests if probe <= 6 or probe +1 <= 6 become <= N-1.
Optimal number of comparisons (if all different). Log 2 N. But not optimal in terms of disk accesses.
Example 8.3.2. Assume 1 million entries. First probe is to entry 524, (2 19 -1). Second prove is 2 18 = 262,144 entries away. Look at list pg. 488.
But with 2K byte pages, assume keyvalue is 4 bytes (simple int), ROWID is 4 bytes (will be either 6 or 7 in ORACLE), then 8 bytes for whole entry; 2000/ is about 250 entries per page. (1M/250 = 4000 pgs.) Therefore successive probes are always to different pages until probe 14 (maybe).
That's 13 I/Os!! A year of letters to Voltaire. We can make it 9 weeks!
Search algorithm is to start at root, find separators surrounding, follow np down, until leaf. Then search for entry with keyval == x if exists.
Only 3 I/Os. Get the most out of all information on (index node) page to locate appropriate subordinate page. 250 way fanout. Say this is a bushy tree rather than sparse tree of binary search, and therefore flat. Fanout f:
depth = log (^) f (N) — e.g., log 2 (1,000,000) = 20, log (^) f (1,000,000) = 3 if f >
Actually, will have commonly accessed index pages resident in buffer. 1 + 16 in f = 250 case fits, but not 4000 leaf nodes: only ONE I/O.
Get same savings in binary search. First several levels have1 + 2 + 4 + 8 + nodes, but to get down to one I/O would have to have 2000 pages in buffer. There's a limit on buffer, more likely save 6-7 I/Os out of 13.
(Of course 2000 pages fits in only 4 MB of buffer, not much nowadays. But in a commercial database we may have HUNDREDS of indexes for DOZENS of tables! Thus buffer space must be shared and becomes more limited.)
Dynamic changes in the B-tree
OK, now saw how to create an index when know all in advance. Sort entries on leaf nodes and created directory, and directory to directory.
But a B-tree also grows in an even, balanced way. Consider a sequence of inserts, SEE Figure 8.12, pg. 491.
(2-3)-tree. All nodes below root contain 2 or 3 entries, split if go to 4. This is a 2-3 tree, B-tree is probably more like 100-200.
(Follow along in Text). Insert new entries at leaf level. Start with simple tree, root = leaf (don't show rowid values). Insert 7, 96, 41. Keep ordered. Insert 39, split. Etc. Correct bottom tree, first separator is 39.
Note separator key doesn't HAVE to be equal to some keyvalue below, so don't correct it if later delete entry with keyvalue 39.
Insert 88, 65, 55, 62 (double split).
Note: stays balanced because only way depth can increase is when root node splits. All leaf nodes stay same distance down from root.
All nodes are between half full (right after split) and full (just before split) in growing tree. Average is .707 full: SQRT(2)/2.
Now explain what happens when an entry is deleted from a B-tree (because the corresponding row is deleted).
In a (2-3)-tree if number of entries falls below 2 (to 1), either BORROW from sibling entry or MERGE with sibling entry. Can REDUCE depth.
This doesn't actually get implemented in most B-tree products: nodes might have very small number of entries. Note B-tree needs malloc/free for disk pages, malloc new page if node splits, return free page when empty.
In cases where lots of pages become ALMOST empty, DBA will simply reorga- nize index.
Properties of the B-tree, pg. 499 ff, Talk through. (Actually, B+tree.)
Index Node Layout and Free Space on a page
See Figure 8.13, pg. 494. Always know how much free space we have on a node. Usually use free space (not entry count) to decide when to split.
This is because we are assuming that keyval can be variable length, so entry is variable length; often ignored in research papers.
ORACLE makes an effort to be compatible with DB2. Have mentioned PCTFREE before, leaves free space on a page for future expansion.
DB2 adds 2 clauses, INCLUDE columname-list to carry extra info in keyval and CLUSTER to speed up range retrieval when rows must be accessed. See how this works below. Class 11.
Missed a class. Exam 1 is Wed, March 22. (After Vac.) Hw 2 due by Monday, March 20, solutions put up on Tuesday to allow you to study.
If you are late, your hw 2 doesn't count. NO EXCEPTIONS. TURN IN WHAT YOU HAVE READY!
Duplicate Keyvalues in an Index.
What happens if have duplicate keyvalues in an index? Theoretically OK, but differs from one product to another.
Look at Fig 8.12, pg 498, right-hand, and assume add we a bunch of new rows with duplicate keyvalues 88. Get e.g. Fig 8.16 following on pg 504.
np 88
62 65 88 88^ 88 96
np 88 np
(Work out slowly how get it). Seems kind of strange to call 88 a separator value, since 88 on both nodes below. Now consider Select statement:
select * from tbl where keyval = 88;
Have to find way down to list of 88s. Done in binsearch, remember? Leftmost one GE 88. Must follow np to left of sepkey = 88. (No 88 at that leaf level, but there could be one since dups allowed)
Then will go right on leaf level while keyval <= 88, following sibling pointers, access ROWIDs and read in rows. Point of this is that everything works fine with multiple duplicates: you should think about how.
Consider the process of deleting a row: must delete its index entries as well (or index entries point to non-existent row; could just keep slot empty, but then waste time later in retrievals).
Thus when insert row, must find way to leaf level of ALL indexes and insert entries; when delete row must do the same. Overhead.
Treat update of column value involved in index key as delete then insert.
Point is that when there are multiple duplicate values in index, find it HARD to delete specific entry matching a particular row. Look up entry with given value, search among ALL duplicates for right RID.
Ought to be easy: keep entries ordered by keyvalue||RID (unique). But most commercial database B-trees don't do that. (Bitmap index solves this.)
Said before, there exists an algorithm for efficient self-modifying B-tree when delete, opposite of split nodes, merge nodes. But not many products use it: only DB2 UDB seems to. (Picture B-tree shrinking from 2 levels to one in Figure 8.12! After put in 39, take out 41, e.g., return to root only.)
Lot of I/O: argues against index use except for table with no updates. But in fact most updates of tables occur to increment-decrement fields.
Example. Credit card rows. Indexed by location (state || city || staddress), credit-card number, socsecno, name (lname || fname || midinit), and possibly many others.
But not by balance. Why would we want to look up all customers who have a balance due of exactly 1007.51? Might want all customers who have attempted to overdraw their balance but this would use an (indexed) flag.
Anyway, most common change to record is balance. Can hold off most other updates (change address, name) until overnight processing, and reorganize indexes. Probably more efficient.
DB2 has nice compression for multiple duplicate keyvalues in an index. See Fig. 8.17 on pg 505 for data page layout with duplicate keyvalues.
Header Info Prx^ Keyval^ RID^ RID^...^ RID Prx^ Keyval^ RID^ RID^... RID