Notes on File Access - Program and Problem Solving III | CECS 277, Exams of Computer Science

Material Type: Exam; Professor: Pompei; Class: Prog+Problem Solving III; Subject: Computer Engr & Computer Sci; University: California State University - Long Beach; Term: Unknown 1989;

Typology: Exams

Pre 2010

Uploaded on 08/19/2009

FILE ACCESS
1. Two basic ways to search for a particular record:
a) Sequentially – read through file record by record until you get to the
desired record.
b) Direct access – go directly to the desired record.
i) Caveat – must know where the record is, e.g. using an index file
to locate the record.
2. Best time to use sequential access:
a) When every record needs to be processed.
b) If there are only a few records in total to be processed.
c) Searching ASCII files for a particular pattern.
d) Searching a file in which you want all records with a certain
secondary key value, where a large number of matches are expected.
3. Best time to use direct access:
a) When you need to only access individual records, i.e. single record
access (via a primary key).
b) When you want to access several records satisfying a condition (via a
secondary key).
4. Keys:
a) An expression obtained from one or more fields of a record which can
be used to identify that record.
b) Examples:
i) Last name (however, this may be a problem if the last name is
not unique).
ii) Social security number (unique – good candidate for a primary
key).
iii) Last name concatenated with first name (again, this may not
necessarily be unique).
iv) Last name concatenated with zip code (may not necessarily be
unique).
5. Primary key uniquely identifies a record (e.g. social security number).
6. Secondary key (or alternate key) identifies a group of records (e.g. zip code).

  1. Canonical form of a key forces format on the key (e.g. making the key all upper case letters). a) All conversions to the key should be made before writing it to the index file. Then, when searching the file, you will know what form your search key must be in.
  2. Sequential search.
     a) Example of a sequential search program – searches sequentially through a file for a record with a particular key.
     b) Evaluating performance of sequential search:
        i) If there are n records in the file, it will take an average of n/2 reads to find a single record (ignoring disk blocking, etc. and assuming that one read is required for each record).
        ii) We say that a sequential search is of order O(n) (big-oh notation), which means the time it takes to do the search is proportional to n (in this case, the constant of proportionality is ½).
  3. Direct access (using RRN, where RRN is the relative record number).
     a) Suppose we have a data file with fixed-length records. Then we can directly access any record in the file if we know its relative record number (RRN).
     b) Example: Suppose each record in the file contains 100 bytes, and the fields have varying lengths and are separated by the delimiter '|'.
     c) Record layout:

            Mahoney|Mike|….|    Foster|Sheila|….|    Volper|Dennis|….
            100 bytes           100 bytes            100 bytes
            RRN 0               RRN 1                RRN 2

     d) The byte offset (at the start of the record) for the record with:
        i) RRN 0 is 0 * 100 = 0
        ii) RRN 1 is 1 * 100 = 100
        iii) …
        iv) RRN n is n * 100
     e) General formula (assuming fixed-length records), where r is the RRN and n is the number of bytes per record:
            byte offset = r * n
     f) After calculating the byte offset we can access the record via the seekg or seekp library function.
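
The three ideas above (canonical keys, sequential search, and direct access by RRN) can be tried together in a short sketch. This is only an illustration under stated assumptions: it uses Python's `seek` in place of seekg/seekp, an in-memory `io.BytesIO` object in place of a disk file, and the helper names (`canonical`, `pack`, `unpack`, `sequential_search`, `read_by_rrn`) are made up for this example. The key is assumed to be the first field (last name).

```python
import io

RECORD_LEN = 100  # bytes per record, as in the example above


def canonical(key):
    """Force a key into canonical form (here: trimmed, all upper case)."""
    return key.strip().upper()


def pack(fields):
    """Join fields with '|' and pad with spaces to exactly RECORD_LEN bytes."""
    raw = ("|".join(fields) + "|").encode("ascii")
    if len(raw) > RECORD_LEN:
        raise ValueError("record does not fit in RECORD_LEN bytes")
    return raw.ljust(RECORD_LEN, b" ")


def unpack(raw):
    """Undo pack(): strip the padding and split on the delimiter."""
    return raw.decode("ascii").rstrip(" ").split("|")[:-1]


def sequential_search(f, key):
    """Read record by record from the start of the file.

    Returns (rrn, fields, reads); rrn and fields are None if nothing matches.
    """
    target = canonical(key)
    f.seek(0)
    rrn = 0
    while True:
        raw = f.read(RECORD_LEN)
        if len(raw) < RECORD_LEN:          # end of file: key not present
            return None, None, rrn
        fields = unpack(raw)
        if canonical(fields[0]) == target:
            return rrn, fields, rrn + 1    # rrn + 1 records were read
        rrn += 1


def read_by_rrn(f, rrn):
    """Direct access: byte offset = r * n, then a single read."""
    f.seek(rrn * RECORD_LEN)
    raw = f.read(RECORD_LEN)
    if len(raw) < RECORD_LEN:
        raise IndexError("no record with that RRN")
    return unpack(raw)


# Build a small data file in memory (stands in for a file on disk).
data_file = io.BytesIO()
for fields in (["Mahoney", "Mike"], ["Foster", "Sheila"], ["Volper", "Dennis"]):
    data_file.write(pack(fields))

print(sequential_search(data_file, "volper"))  # found at RRN 2 after 3 reads
print(read_by_rrn(data_file, 1))               # one seek, one read
```

Because both the stored key and the search key go through `canonical`, searching for "volper" finds the record stored as "Volper". The sequential search needs one read per record it passes over, while the direct read needs only one seek and one read.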

FILE MAINTENANCE

  1. Files can be organized or reorganized so as to improve performance. Three possible ways to accomplish this are: a) Compression techniques that allow you to make files smaller by encoding the basic information in the file. b) Reclamation of unused space in files caused by record deletions and updates. c) Reorganization of files by sorting them to support simple binary searching.
  2. Data compression can make files smaller. Smaller files: a) Use less storage, resulting in cost savings. b) Can be transmitted faster, decreasing access time or, alternatively, allowing the same access time with a lower and cheaper bandwidth. c) Can be processed faster sequentially.
  3. Data compression involves encoding the information in a file in such a way as to take up less space.
  4. Different compression techniques are:
     a) Using a different notation.
        i) Fixed-length fields are good candidates for compression.
        ii) The number of bits is decreased by finding a more compact notation.
            a) Technique is classified as redundancy reduction.
            b) Example would be using one byte to code the state abbreviation (through the use of setting bits to represent each of the 50 states) rather than using two bytes.
            c) Cost of compression:
               (1) The file is unreadable by humans.
               (2) Performance cost incurred to encode and decode the information.
               (3) Software must be written to perform the encoding and decoding.
     b) Suppressing repeating sequences.
        i) Also known as run-length encoding.
        ii) Encodes sequences of repeating values, rather than writing all of the values in the file.
     c) Assigning variable-length codes.
        i) Variable-length codes are assigned to values depending on how frequently the values occur.
        ii) Values that occur often (e.g. commonly used characters such as e and t) are given shorter codes, so they take up less space.
        iii) Huffman codes are an example of variable-length codes (fig. 6.2).
            a) The Huffman code determines the probabilities of each value occurring in the data set, and builds a binary tree in which the search path for each value represents the code for that value.
            b) More frequently occurring values are given shorter search paths in the tree.
            c) This tree is then turned into a table, much like a Morse code table, that can be used to encode and decode the data.
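
As a rough illustration of the variable-length code idea, here is a minimal Huffman sketch. It is not taken from the textbook figure. It assumes the input is a plain string, counts character frequencies rather than using given probabilities, and builds the tree with Python's `heapq` module by repeatedly merging the two least frequent entries. The function names are invented for this example.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {symbol: bit string} table built from the symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    # Heap entries are (weight, tie_breaker, tree). A tree is either a single
    # symbol or a (left, right) pair. The unique tie_breaker keeps Python from
    # ever comparing two trees directly.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:                      # only one distinct symbol
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)     # two least frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1

    codes = {}

    def walk(tree, path):
        """The path from the root (0 = left, 1 = right) is the code."""
        if isinstance(tree, tuple):
            walk(tree[0], path + "0")
            walk(tree[1], path + "1")
        else:
            codes[tree] = path

    walk(heap[0][2], "")
    return codes


def encode(text, codes):
    return "".join(codes[ch] for ch in text)


def decode(bits, codes):
    """Works because no code is a prefix of another code."""
    table = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in table:
            out.append(table[current])
            current = ""
    return "".join(out)


codes = huffman_codes("aaaabbc")
print(codes)                      # 'a' (most frequent) gets the shortest code
print(encode("aaaabbc", codes))   # 10 bits instead of 7 * 8
```

The `codes` dictionary plays the role of the lookup table mentioned above: encoding is a table lookup per symbol, and decoding reads bits until they match an entry.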

  1. Storage compaction is easy; we can use the file copy method or compact the file itself. a) File copy method. i) Read the file record by record and write it to another file, not writing those records with an asterisk in the first position. ii) This is the easiest method but uses more space. b) Compact the file. i) Shift records up in the file to replace space used by deleted records. ii) Need to keep two offset pointers. a) One pointer to show where to write to next. b) Another pointer to show where to read next. iii) This method is more difficult and is more time-consuming, but uses less space.
  2. Storage compaction is not good for volatile files since it's done in batch mode. You must stop any interactive processing while compaction is being done.
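
Both compaction methods can be sketched briefly. The code below is an illustration only: it assumes fixed-length records of 16 bytes with no header, an asterisk in the first byte of each deleted record, and file objects that support `seek`, `read`, `write` and `truncate` (an in-memory `io.BytesIO` is used for the demo). The function names are hypothetical.

```python
import io

RECORD_LEN = 16     # assumed fixed record length for this sketch
DELETED = b"*"      # marker in the first byte of a deleted record


def compact_copy(src, dst):
    """File copy method: write every live record to a second file."""
    src.seek(0)
    kept = 0
    while True:
        rec = src.read(RECORD_LEN)
        if len(rec) < RECORD_LEN:
            break
        if rec[:1] != DELETED:
            dst.write(rec)
            kept += 1
    return kept


def compact_in_place(f):
    """Shift live records up over deleted ones, using two offsets."""
    read_pos = 0    # where to read next
    write_pos = 0   # where to write next
    while True:
        f.seek(read_pos)
        rec = f.read(RECORD_LEN)
        if len(rec) < RECORD_LEN:
            break
        read_pos += RECORD_LEN
        if rec[:1] != DELETED:
            if write_pos != read_pos - RECORD_LEN:   # record has to move up
                f.seek(write_pos)
                f.write(rec)
            write_pos += RECORD_LEN
    f.truncate(write_pos)   # drop the space freed at the end of the file
    return write_pos // RECORD_LEN


def rec(text):
    return text.ljust(RECORD_LEN, b" ")


original = io.BytesIO(rec(b"Ames") + rec(b"*deleted") + rec(b"Mason") + rec(b"Brown"))
copy = io.BytesIO()
print(compact_copy(original, copy))      # number of records kept in the copy
print(compact_in_place(original))        # number of records left in the file
```

The copy method needs room for a second file, while the in-place method only needs the two offsets but rewrites the file it is reading.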

FIXED-LENGTH RECORD DELETION

  1. Some applications are too volatile and interactive for storage compaction to be useful. We want to reuse the space from deleted records as soon as possible.
  2. Dynamic storage reclamation. a) We need to mark deleted records (e.g. using an asterisk in the first byte) and find the deleted space later. b) Sequential search for deleted records is slow and impractical for large files. c) Faster method is to store deleted record information.
  3. Recall the following:
     a) Record information is stored in the following format:
            Header | RRN 0 | RRN 1 (deleted) | RRN 2 | RRN 3 (deleted)
     b) The byte offset is computed using the following formula:
            byte offset = header length + (RRN * record length)
     c) We use the seek command in conjunction with the byte offset to access a record.
  4. Since all records are of the same length, we can reuse any deleted record space to add records.
  5. The most efficient way to manage deleted records is to collect all of the available record slots (i.e. records flagged for deletion) into a linked list.
     a) The linked list is created by stringing together all the deleted records to form a linked list of deleted record spaces (fig. 6.4 and following page).
     b) The simplest way to maintain the linked list is to treat it as a stack.
        i) In a fixed-length record file, any one record slot is just as usable as any other record slot; they are interchangeable.
        ii) Newly available records are added to the linked list by pushing them onto the front of the list; record slots are removed from the linked list by popping them from the front of the list (LIFO - last in, first out).
  1. Pseudo code that returns the RRN of the first available slot in the file. If the avail list (i.e. the linked list) of deleted records is empty, the function returns the RRN of the next record to be appended at the end of the file. This code does not rewrite into the available space; it merely returns the location of where to write and pops the location from the linked list.

     FUNCTION pop_avail()
         if HEAD.FIRST_AVAIL == -1 then      /* avail list empty */
             return RRN of next record to be appended
         else                                /* pop avail list */
             set RET_VAL to HEAD.FIRST_AVAIL
             move file pointer to HEAD.FIRST_AVAIL position in file
             skip over '*' field             /* don't need the asterisk */
             /* Get RRN of next unused space to place in header */
             read link field from file into RRN
             set HEAD.FIRST_AVAIL to RRN
             return RET_VAL
     end FUNCTION
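
A runnable version of this idea might look like the sketch below. The file layout is an assumption made for the example, not something the notes specify: a 4-byte header holding FIRST_AVAIL as a signed integer, 32-byte records, and in each deleted record an asterisk followed by a 4-byte link. `delete_record` and `add_record` are hypothetical companions to `pop_avail`.

```python
import io
import struct

HEADER_LEN = 4      # assumed: header is one signed 32-bit FIRST_AVAIL field
RECORD_LEN = 32     # assumed fixed record length
NO_LINK = -1        # marks an empty avail list / end of list


def offset(rrn):
    """byte offset = header length + (RRN * record length)"""
    return HEADER_LEN + rrn * RECORD_LEN


def read_head(f):
    f.seek(0)
    return struct.unpack("<i", f.read(4))[0]


def write_head(f, rrn):
    f.seek(0)
    f.write(struct.pack("<i", rrn))


def create(f):
    """Start a new file with an empty avail list."""
    write_head(f, NO_LINK)


def delete_record(f, rrn):
    """Push the slot onto the front of the avail list."""
    head = read_head(f)                       # read before moving the pointer
    f.seek(offset(rrn))
    f.write(b"*" + struct.pack("<i", head))   # marker, then link to old head
    write_head(f, rrn)


def pop_avail(f):
    """Return the RRN to write to next, popping the avail list if possible."""
    head = read_head(f)
    if head == NO_LINK:                       # avail list empty: append
        end = f.seek(0, io.SEEK_END)
        return (end - HEADER_LEN) // RECORD_LEN
    f.seek(offset(head) + 1)                  # skip over the '*'
    link = struct.unpack("<i", f.read(4))[0]
    write_head(f, link)
    return head


def add_record(f, data):
    """Write data (bytes not starting with '*') into the next usable slot."""
    rrn = pop_avail(f)
    f.seek(offset(rrn))
    f.write(data.ljust(RECORD_LEN, b" ")[:RECORD_LEN])
    return rrn


f = io.BytesIO()
create(f)
for name in (b"Ames", b"Mason", b"Brown", b"Foster"):
    add_record(f, name)
delete_record(f, 1)
delete_record(f, 3)
print(add_record(f, b"Volper"))   # reuses RRN 3, the most recently deleted slot
```

Because the list is treated as a stack, the slot deleted last is the first one reused, and a deletion or an insertion touches only the header and one record.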

VARIABLE-LENGTH RECORD DELETION

  1. We can still use an avail (linked) list to store available space in the file for variable-length record deletion. a) Important difference from the fixed length case: RRNs cannot be used as links because we cannot compute byte offsets from RRNs. b) Solution: Store byte offsets themselves in avail (linked) list (fig. 6.6).
  2. We cannot access the records on the avail list as if it were a stack because the first record slot in the list may not be large enough. a) We need to search through the linked list for a record slot big enough. b) There are different methods that can be used to accomplish this: i) First fit – use the first space that is big enough. ii) Best fit – search all open spaces and use the one that is closest in size to the record you're adding. iii) Worst fit – find the largest unused space. Write the record there and hope you have enough space left over for another record at another time. c) If no available slots are large enough, then the new record should be appended to the end of the file.
  3. First fit – when we need to add a record to the file, we look through the list, starting at the beginning, until we either find a record slot that is big enough or reach the end of the list (fig. 6.7). a) The least possible amount of work is expended when we place newly available space on the list. b) We are not very particular about the closeness of fit as we look for a record slot to hold a new record. i) We accept the first available record slot that will do the job, regardless of how large it might be.
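
The three placement strategies differ only in which of the large-enough slots they pick. The sketch below shows that choice on an in-memory list of (byte offset, slot size) pairs kept in avail-list order. The list representation and the name `choose_slot` are assumptions for illustration; a real file would follow links stored in the deleted records.

```python
def choose_slot(avail, size, strategy="first"):
    """Return the position in avail of the slot to reuse, or None.

    avail    -- list of (byte_offset, slot_size) pairs in avail-list order
    size     -- number of bytes the new record needs
    strategy -- "first", "best" or "worst"
    """
    fits = [(i, slot_size) for i, (_, slot_size) in enumerate(avail)
            if slot_size >= size]
    if not fits:
        return None                     # caller appends to the end of the file
    if strategy == "first":
        return fits[0][0]               # first slot that is big enough
    if strategy == "best":
        return min(fits, key=lambda pair: pair[1])[0]   # tightest fit
    if strategy == "worst":
        return max(fits, key=lambda pair: pair[1])[0]   # largest slot
    raise ValueError("unknown strategy: " + strategy)


avail = [(0, 68), (200, 47), (500, 72), (900, 38)]
for strategy in ("first", "best", "worst"):
    print(strategy, avail[choose_slot(avail, 40, strategy)])
```

For a 40-byte record, first fit takes the 68-byte slot at the head of the list, best fit takes the 47-byte slot, and worst fit takes the 72-byte slot. First fit can stop at the first match, while the other two must examine the whole list unless it is kept sorted by size.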
  1. Pseudo code for getting a slot from the avail list for variable length record insertion using the first-fit strategy:

     FUNCTION: get_avail()
         find the first record on the avail list
         while (the record is not big enough AND not end of list)
             jump to the next available record
         if the record is big enough
             rearrange the linked list to remove the record
             return the byte offset of the record slot
         else    /* end of list reached before a big enough slot found */
             return byte offset of the end of the file
     end FUNCTION

  2. Pseudo code for placing deleted records on the avail list for variable length records using the first-fit strategy:

     FUNCTION: delete_record(RRN)
         read sequentially through the file until record RRN is found
         /* Note: Normally would use an index rather than read */
         /* sequentially through the file */
         set BYTE_POS to the byte offset of the record to be deleted
         place a deleted record marker ('*') in the first field
         /* add to front of list */
         place the value of HEAD.FIRST_AVAIL in the next field as a link
         set HEAD.FIRST_AVAIL to BYTE_POS
         /* Note: Extra space was not put on the avail list here, thus */
         /* internal fragmentation occurs */
     end FUNCTION

FRAGMENTATION

  1. Internal fragmentation is space that is lost within a record. a) You don't get internal fragmentation with variable length records as they are initially added. b) Internal fragmentation of variable length records occurs as you add and delete records (fig. 6.10).
  2. External fragmentation occurs in a file when there is unused space outside of or between individual records.
  3. To combat internal fragmentation that occurs when adding and deleting variable length records, a single, large variable-length record slot (representing a deleted record) can be broken into two or more smaller ones, using exactly as much space as is needed for a new record and leaving the remainder on the avail list (figures 6.11 and 6.12). a) Although this could decrease the amount of wasted space, eventually the remaining fragments are too small to be useful. b) When this happens, the space is lost to external fragmentation.
  4. There are a number of things that one can do to minimize external fragmentation. They include: a) Compacting the file in a batch mode when the level of fragmentation becomes excessive. b) Coalescing adjacent record slots on the avail list to make larger, more generally useful slots. c) Adopting a placement strategy to select slots for reuse in a way that minimizes fragmentation.
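
Coalescing is the easiest of these remedies to show in code. The sketch below works on an in-memory list of (byte offset, size) pairs, which is an assumption made for illustration; sizes are taken to cover the whole slot, including any length prefix. The function name `coalesce` is hypothetical.

```python
def coalesce(avail):
    """Merge avail slots that touch each other in the file.

    avail -- iterable of (byte_offset, size) pairs, in any order
    Returns a new list sorted by offset with adjacent slots combined.
    """
    merged = []
    for off, size in sorted(avail):
        if merged and merged[-1][0] + merged[-1][1] == off:
            # The previous slot ends exactly where this one starts.
            prev_off, prev_size = merged[-1]
            merged[-1] = (prev_off, prev_size + size)
        else:
            merged.append((off, size))
    return merged


print(coalesce([(300, 20), (100, 40), (140, 60), (500, 10)]))
```

Here the slots at offsets 100 and 140 are neighbors (100 + 40 = 140), so they become one 100-byte slot that can hold records neither piece could hold alone. The slots at 300 and 500 are left as they are.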
  1. In general, a binary search of a file with n records takes at most $\lfloor \log_2 n \rfloor + 1$ comparisons.
     a) Quick review of logarithms:
        i) A logarithm is an exponent.
        ii) $\log_2 n$ is the exponent on 2 which yields n.
        iii) Example: $2^3 = 8$ means $\log_2 8 = 3$.
     b) Example: Suppose you have a sorted file of 100 records, i.e. n = 100. Then $\lfloor \log_2 100 \rfloor + 1 = \lfloor 6.64 \rfloor + 1 = 6 + 1 = 7$ comparisons.
  2. On average, a binary search takes approximately $\log_2 n + \tfrac{1}{2}$ comparisons.
  3. A binary search is said to be $O(\log_2 n)$, whereas a sequential search requires at most n comparisons, and on average ½ n, which is to say that a sequential search is O(n).
  4. Binary search is decent.
     a) Better than sequential searching.
     b) Not the best because it requires a sorted file.
  5. A major drawback of binary searching is that we have to sort the file.
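
The comparison bound can be checked with a short experiment. The sketch below searches a sorted Python list, which stands in for a sorted file of fixed-length records read by RRN, and counts one comparison per record examined. The function names are made up for this example.

```python
import math


def binary_search(keys, target):
    """Return (position or None, number of records examined)."""
    low, high = 0, len(keys) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1                 # one record examined per step
        if keys[mid] == target:
            return mid, comparisons
        if keys[mid] < target:
            low = mid + 1                # discard the lower half
        else:
            high = mid - 1               # discard the upper half
    return None, comparisons


def max_comparisons(n):
    """Upper bound from the notes: floor(log2 n) + 1."""
    return math.floor(math.log2(n)) + 1


keys = list(range(0, 200, 2))            # 100 sorted keys: 0, 2, ..., 198
worst = max(binary_search(keys, t)[1] for t in range(-1, 201))
print(worst, max_comparisons(100))       # both are 7
```

Trying every possible target, present or absent, never takes more than 7 steps on 100 keys, while a sequential search of the same keys could take up to 100.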

SORTING A DISK FILE IN RAM

  1. Any internal sorting algorithm requires multiple passes over the list that is to be sorted, comparing and reorganizing the elements. a) Some of the items in the list are moved a long distance from their original positions in the list. b) If such an algorithm were applied directly to data stored on a disk, there would be a lot of jumping around, seeking, and rereading of data. i) This would be an extremely slow operation.
  2. If the entire contents of the file can be held in RAM, an attractive alternative is to read the entire file from the disk into memory, and then do the sorting there, using an internal sort. a) We still have to access the data on the disk, but this way we can access it sequentially, sector after sector, without having to incur the cost of a lot of seeking and the cost of multiple passes over the disk. b) The idea is to force your disk access into a sequential mode, performing the more complex, direct accesses in RAM.
  3. The basic steps of RAM sort are as follows:
     a) Read the records from the input file into the RECORDS array in RAM.
     b) Extract the keys, building the KEYNODES array in RAM. For example, if the key consisted of last name followed by first name, the last name/first name would be stored in KEYNODES along with a pointer to the full record, which would be stored in the RECORDS array.
     c) Build an INDEX array of subscripts for KEYNODES[] and RECORDS[]. The index array is initialized as INDEX[i] = i. For example, i = 1 represents the first record in the file.
     d) Use a sorting procedure such as Shell sort or bubble sort to do the actual sorting indirectly (i.e. only the keys in KEYNODES[] are compared and only the indices in the index array are swapped).
        i) For example, before the sort, index 1 would point to the 1st record in the file and index 5 would point to the 5th record in the file.
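
The indirect sort in step d) can be sketched as follows. This is an illustration only: records are plain strings already read into a list, the key is assumed to be the upper-cased last name, bubble sort is used because the notes name it, and subscripts start at 0 rather than 1 as in the notes. The names `ram_sort` and `key_of` are hypothetical.

```python
def ram_sort(records, key_of):
    """Return INDEX, a list of subscripts into records in sorted key order.

    Only keys are compared and only subscripts are swapped; the records
    themselves never move.
    """
    keynodes = [key_of(r) for r in records]      # KEYNODES[]
    index = list(range(len(records)))            # INDEX[i] = i
    for end in range(len(index) - 1, 0, -1):     # bubble sort on INDEX
        swapped = False
        for i in range(end):
            if keynodes[index[i]] > keynodes[index[i + 1]]:
                index[i], index[i + 1] = index[i + 1], index[i]
                swapped = True
        if not swapped:                          # already in order
            break
    return index


records = ["Mahoney|Mike", "Foster|Sheila", "Volper|Dennis", "Ames|John"]
index = ram_sort(records, lambda r: r.split("|")[0].upper())
print(index)
for i in index:                                  # write out in sorted order
    print(records[i])
```

After the sort, `index[0]` holds the subscript of the record with the smallest key, so writing `records[index[0]]`, `records[index[1]]`, and so on produces the sorted file.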

KEYSORTING

  1. Keysorting (figures 6.16 and 6.17). a) We don't need to read the entire file of records into RAM to perform the sort as was done previously in RAM sort. b) We do need to read the keys into RAM. c) We read only the key information into RAM, not the whole record. d) The key information is then sorted in RAM. e) Keysort is essentially the same as RAM sort except: i) Rather than read in all of the records into a RAM array, we simply read each record into a temporary buffer and then discard it; and ii) When we are writing the records out in sorted order, we have to read them in a second time, since they are not all stored in RAM.
  2. Neither RAM sort nor keysort is very good. a) RAM sort requires a lot of memory to store all of the records. b) Keysort requires memory to store the keys and seek time to write out the sorted records. i) Rearranging a file of n records requires n random seeks out to the original file, which can take much more time than does a sequential reading of the same number of records. ii) Recall that minimizing the number of seeks is the most important objective of this class.
  3. Keysorting naturally leads to the suggestion that we merely write the sorted list of keys off to secondary storage, setting aside the expensive matter of rearranging the file. a) This list of keys, coupled with RRN tags pointing back to the original records, is an example of an index. i) Instead of creating a new, sorted copy of the file to use for searching, we have created a second kind of file, an index file, that is to be used in conjunction with the original file. ii) If we are looking for a particular record, we do our binary search on the index file, then use the RRN stored in the index file record to find the corresponding record in the original file. b) Figure 6.19 in textbook illustrates the relationship between the index file and the data file.
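
A minimal sketch of keysort and of searching through the resulting index is given below. It assumes 32-byte fixed-length records with the key (last name) as the first '|'-delimited field, uses an in-memory `io.BytesIO` in place of disk files, and keeps the index as a Python list of (key, RRN) pairs rather than writing it to a separate index file. The function names are invented for this example.

```python
import bisect
import io

RECORD_LEN = 32     # assumed fixed record length


def build_index(f):
    """Read each record into a temporary buffer and keep only (key, RRN)."""
    f.seek(0)
    index = []
    rrn = 0
    while True:
        buf = f.read(RECORD_LEN)
        if len(buf) < RECORD_LEN:
            break
        key = buf.split(b"|", 1)[0].decode("ascii").upper()
        index.append((key, rrn))
        rrn += 1
    index.sort()                        # sort the keys in RAM
    return index


def write_sorted(src, dst, index):
    """Keysort's second pass: one seek back to the data file per record."""
    for _, rrn in index:
        src.seek(rrn * RECORD_LEN)
        dst.write(src.read(RECORD_LEN))


def find(f, index, key):
    """Binary search on the index, then one direct access by RRN."""
    keys = [k for k, _ in index]
    wanted = key.upper()
    i = bisect.bisect_left(keys, wanted)
    if i == len(keys) or keys[i] != wanted:
        return None
    f.seek(index[i][1] * RECORD_LEN)
    return f.read(RECORD_LEN).rstrip(b" ").decode("ascii")


def rec(text):
    return text.ljust(RECORD_LEN, b" ")


data = io.BytesIO(rec(b"Mahoney|Mike") + rec(b"Foster|Sheila") +
                  rec(b"Volper|Dennis") + rec(b"Ames|John"))
index = build_index(data)
print(index)
print(find(data, index, "foster"))
```

`write_sorted` shows the cost that makes keysort unattractive: n seeks into the original file. `find` shows the alternative: leave the data file alone and search the sorted index instead.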
  1. In future lectures, we will discuss: a) Various ways we can use simple indexes. b) Different ways of organizing the index to provide more flexible access and easier maintenance.