Advanced database systems, Lecture notes of Database Management Systems (DBMS)

Computer systems, both software and hardware, consist of modules, or components. Each component is designed to operate correctly, i.e., to obey to or meet certain consistency rules. When components that operate concurrently interact by messaging or by sharing accessed data (in memory or storage), a certain component's consistency may be violated by another component. The general area of concurrency control provides rules, methods, design methodologies, and theories to maintain the consistency of components operating concurrently while interacting, and thus the consistency and correctness of the whole system

Typology: Lecture notes

2022/2023

Available from 07/07/2023

chibu-du
chibu-du 🇰🇪

7 documents

1 / 21

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
ADVANCED DATABASE MANAGEMENT SYSTEM.
CONCURRENCY CONTROL
In information technology and computer science, especially in the fields of
computer programming, operating systems, multiprocessors, and databases,
concurrency control ensures that correct results for concurrent operations
are generated, while getting those results as quickly as possible.
Computer systems, both software and hardware, consist of modules, or
components. Each component is designed to operate correctly, i.e., to obey
to or meet certain consistency rules. When components that operate
concurrently interact by messaging or by sharing accessed data (in memory
or storage), a certain component's consistency may be violated by another
component. The general area of concurrency control provides rules,
methods, design methodologies, and theories to maintain the consistency of
components operating concurrently while interacting, and thus the
consistency and correctness of the whole system. Introducing concurrency
control into a system means applying operation constraints which typically
result in some performance reduction. Operation consistency and
correctness should be achieved with as good as possible efficiency, without
reducing performance below reasonable.
For example, a failure in concurrency control can result in data corruption
from Concurrency control in databases
Comments:
1. This section is applicable to all transactional systems, i.e., to all
systems that use database transactions (atomic transactions; e.g.,
transactional objects in Systems management and in networks of
smartphones which typically implement private, dedicated database
systems), not only general-purpose database management systems
(DBMSs).
2. DBMSs need to deal also with concurrency control issues not typical
just to database transactions but rather to operating systems in general.
These issues (e.g., see Concurrency control in operating systems below) are
out of the scope of this section.
Concurrency control in Database management systems (DBMS; e.g.,
Bernstein et al. 1987, Weikum and Vossen 2001), other transactional objects,
and related distributed applications (e.g., Grid computing and Cloud
computing) ensures that database transactions are performed concurrently
without violating the data integrity of the respective databases. Thus
concurrency control is an essential element for correctness in any system
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15

Partial preview of the text

Download Advanced database systems and more Lecture notes Database Management Systems (DBMS) in PDF only on Docsity!

ADVANCED DATABASE MANAGEMENT SYSTEM.

CONCURRENCY CONTROL

In information technology and computer science, especially in the fields of computer programming, operating systems, multiprocessors, and databases, concurrency control ensures that correct results for concurrent operations are generated, while getting those results as quickly as possible. Computer systems, both software and hardware, consist of modules, or components. Each component is designed to operate correctly, i.e., to obey to or meet certain consistency rules. When components that operate concurrently interact by messaging or by sharing accessed data (in memory or storage), a certain component's consistency may be violated by another component. The general area of concurrency control provides rules, methods, design methodologies, and theories to maintain the consistency of components operating concurrently while interacting, and thus the consistency and correctness of the whole system. Introducing concurrency control into a system means applying operation constraints which typically result in some performance reduction. Operation consistency and correctness should be achieved with as good as possible efficiency, without reducing performance below reasonable. For example, a failure in concurrency control can result in data corruption from Concurrency control in databases Comments:

  1. This section is applicable to all transactional systems, i.e., to all systems that use database transactions (atomic transactions; e.g., transactional objects in Systems management and in networks of smartphones which typically implement private, dedicated database systems), not only general-purpose database management systems (DBMSs).
  2. DBMSs need to deal also with concurrency control issues not typical just to database transactions but rather to operating systems in general. These issues (e.g., see Concurrency control in operating systems below) are out of the scope of this section. Concurrency control in Database management systems (DBMS; e.g., Bernstein et al. 1987, Weikum and Vossen 2001), other transactional objects, and related distributed applications (e.g., Grid computing and Cloud computing) ensures that database transactions are performed concurrently without violating the data integrity of the respective databases. Thus concurrency control is an essential element for correctness in any system

where two database transactions or more, executed with time overlap, can access the same data, e.g., virtually in any general-purpose database system. Consequently a vast body of related research has been accumulated since database systems have emerged in the early 1970s. A well established concurrency control theory for database systems is outlined in the references mentioned above: serializability theory , which allows to effectively design and analyze concurrency control methods and mechanisms. An alternative theory for concurrency control of atomic transactions over abstract data types is presented in (Lynch et al. 1993), and not utilized below. This theory is more refined, complex, with a wider scope, and has been less utilized in the Database literature than the classical theory above. Each theory has its pros and cons, emphasis and insight. To some extent they are complementary, and their merging may be useful. To ensure correctness, a DBMS usually guarantees that only serializable transaction schedules are generated, unless serializability is intentionally relaxed to increase performance, but only in cases where application correctness is not harmed. For maintaining correctness in cases of failed (aborted) transactions (which can always happen for many reasons) schedules also need to have the recoverability (from abort) property. A DBMS also guarantees that no effect of committed transactions is lost, and no effect of aborted (rolled back) transactions remains in the related database. Overall transaction characterization is usually summarized by the ACID rules below. As databases have become distributed, or needed to cooperate in distributed environments (e.g., Federated databases in the early 1990, and Cloud computing currently), the effective distribution of concurrency control mechanisms has received special attention. Database transaction and the ACID rules The concept of a database transaction (or atomic transaction) has evolved in order to enable both a well understood database system behavior in a faulty environment where crashes can happen any time, and recovery from a crash to a well understood database state. A database transaction is a unit of work, typically encapsulating a number of operations over a database (e.g., reading a database object, writing, acquiring lock, etc.), an abstraction supported in database and also other systems. Each transaction has well defined boundaries in terms of which program/code executions are included in that transaction (determined by the transaction's programmer via special transaction commands). Every database transaction obeys the following rules (by support in the database system; i.e., a database system is designed to guarantee them for the transactions it runs):

value. The transactions that have read the wrong value end with incorrect results.

  1. The dirty read problem : Transactions read a value written by a transaction that has been later aborted. This value disappears from the database upon abort, and should not have been read by any transaction ("dirty read"). The reading transactions end with incorrect results.
  2. The incorrect summary problem : While one transaction takes a summary over the values of all the instances of a repeated data-item, a second transaction updates some instances of that data-item. The resulting summary does not reflect a correct result for any (usually needed for correctness) precedence order between the two transactions (if one is executed before the other), but rather some random result, depending on the timing of the updates, and whether certain update results have been included in the summary or not. Most high-performance transactional systems need to run transactions concurrently to meet their performance requirements. Thus, without concurrency control such systems can neither provide correct results nor maintain their databases consistent. Concurrency control mechanisms Categories The main categories of concurrency control mechanisms are:
  • Optimistic - Delay the checking of whether a transaction meets the isolation and other integrity rules (e.g., serializability and recoverability) until its end, without blocking any of its (read, write) operations ("...and be optimistic about the rules being met..."), and then abort a transaction to prevent the violation, if the desired rules are to be violated upon its commit. An aborted transaction is immediately restarted and re-executed, which incurs an obvious overhead (versus executing it to the end only once). If not too many transactions are aborted, then being optimistic is usually a good strategy.
  • Pessimistic - Block an operation of a transaction, if it may cause violation of the rules, until the possibility of violation disappears. Blocking operations is typically involved with performance reduction.
  • Semi-optimistic - Block operations in some situations, if they may cause violation of some rules, and do not block in other situations while delaying rules checking (if needed) to transaction's end, as done with optimistic.

Different categories provide different performance, i.e., different average transaction completion rates (throughput), depending on transaction types mix, computing level of parallelism, and other factors. If selection and knowledge about trade-offs are available, then category and method should be chosen to provide the highest performance. The mutual blocking between two transactions (where each one blocks the other) or more results in a deadlock, where the transactions involved are stalled and cannot reach completion. Most non-optimistic mechanisms (with blocking) are prone to deadlocks which are resolved by an intentional abort of a stalled transaction (which releases the other transactions in that deadlock), and its immediate restart and re-execution. The likelihood of a deadlock is typically low. Both blocking, deadlocks, and aborts result in performance reduction, and hence the trade-offs between the categories. Methods Many methods for concurrency control exist. Most of them can be implemented within either main category above. The major methods,[1] which have each many variants, and in some cases may overlap or be combined, are:

  1. Locking (e.g., Two-phase locking - 2PL) - Controlling access to data by locks assigned to the data. Access of a transaction to a data item (database object) locked by another transaction may be blocked (depending on lock type and access operation type) until lock release.
  2. Serialization graph checking (also called Serializability, or Conflict, or Precedence graph checking) - Checking for cycles in the schedule's graph and breaking them by aborts.
  3. Timestamp ordering (TO) - Assigning timestamps to transactions, and controlling or checking access to data by timestamp order.
  4. Commitment ordering (or Commit ordering; CO) - Controlling or checking transactions' chronological order of commit events to be compatible with their respective precedence order. Other major concurrency control types that are utilized in conjunction with the methods above include:
  • Multiversion concurrency control (MVCC) - Increasing concurrency and performance by generating a new version of a database object each time the object is written, and allowing transactions' read operations of

sequential with no overlap in time, and thus completely isolated from each other: No concurrent access by any two transactions to the same data is possible). Serializability is considered the highest level of isolation among database transactions, and the major correctness criterion for concurrent transactions. In some cases compromised, relaxed forms of serializability are allowed for better performance (e.g., the popular Snapshot isolation mechanism) or to meet availability requirements in highly distributed systems (see Eventual consistency), but only if application's correctness is not violated by the relaxation (e.g., no relaxation is allowed for money transactions, since by relaxation money can disappear, or appear from nowhere). Almost all implemented concurrency control mechanisms achieve serializability by providing Conflict serializablity, a broad special case of serializability (i.e., it covers, enables most serializable schedules, and does not impose significant additional delay-causing constraints) which can be implemented efficiently. Recoverability Comment: While in the general area of systems the term "recoverability" may refer to the ability of a system to recover from failure or from an incorrect/forbidden state, within concurrency control of database systems this term has received a specific meaning. Concurrency control typically also ensures the Recoverability property of schedules for maintaining correctness in cases of aborted transactions (which can always happen for many reasons). Recoverability (from abort) means that no committed transaction in a schedule has read data written by an aborted transaction. Such data disappear from the database (upon the abort) and are parts of an incorrect database state. Reading such data violates the consistency rule of ACID. Unlike Serializability, Recoverability cannot be compromised, relaxed at any case, since any relaxation results in quick database integrity violation upon aborts. The major methods listed above provide serializability mechanisms. None of them in its general form automatically provides recoverability, and special considerations and mechanism enhancements are needed to support recoverability. A commonly utilized special case of recoverability is Strictness, which allows efficient database recovery from failure (but excludes optimistic implementations; e.g., Strict CO (SCO) cannot have an optimistic implementation, but has semi-optimistic ones). Comment: Note that the Recoverability property is needed even if no database failure occurs and no database recovery from failure is needed. It is

rather needed to correctly automatically handle transaction aborts, which may be unrelated to database failure and recovery from it. Distribution With the fast technological development of computing the difference between local and distributed computing over low latency networks or buses is blurring. Thus the quite effective utilization of local techniques in such distributed environments is common, e.g., in computer clusters and multi- core processors. However the local techniques have their limitations and use multi-processes (or threads) supported by multi-processors (or multi-cores) to scale. This often turns transactions into distributed ones, if they themselves need to span multi-processes. In these cases most local concurrency control techniques do not scale well. Distributed serializability and Commitment ordering As database systems have become distributed, or started to cooperate in distributed environments (e.g., Federated databases in the early 1990s, and nowadays Grid computing, Cloud computing, and networks with smartphones), some transactions have become distributed. A distributed transaction means that the transaction spans processes, and may span computers and geographical sites. This generates a need in effective distributed concurrency control mechanisms. Achieving the Serializability property of a distributed system's schedule (see Distributed serializability and Global serializability (Modular serializability)) effectively poses special challenges typically not met by most of the regular serializability mechanisms, originally designed to operate locally. This is especially due to a need in costly distribution of concurrency control information amid communication and computer latency. The only known general effective technique for distribution is Commitment ordering, which was disclosed publicly in 1991 (after being patented). Commitment ordering (Commit ordering, CO; Raz 1992) means that transactions' chronological order of commit events is kept compatible with their respective precedence order. CO does not require the distribution of concurrency control information and provides a general effective solution (reliable, high-performance, and scalable) for both distributed and global serializability, also in a heterogeneous environment with database systems (or other transactional objects) with different (any) concurrency control mechanisms.[1] CO is indifferent to which mechanism is utilized, since it does not interfere with any transaction operation scheduling (which most mechanisms control), and only determines the order of commit events. Thus, CO enables the efficient distribution of all other mechanisms, and also the distribution of a mix of

  1. About the references and Commitment ordering: (Bernstein et al.
  1. was published before the discovery of CO in 1990. The CO schedule property is called Dynamic atomicity in (Lynch et al. 1993, page 201). CO is described in (Weikum and Vossen 2001, pages 102, 700), but the description is partial and misses CO's essence. (Raz 1992) was the first refereed and accepted for publication article about CO algorithms (however, publications about an equivalent Dynamic atomicity property can be traced to 1988). Other CO articles followed. (Bernstein and Newcomer 2009)[1] note CO as one of the four major concurrency control methods, and CO's ability to provide interoperability among other methods. Distributed recoverability Unlike Serializability, Distributed recoverability and Distributed strictness can be achieved efficiently in a straightforward way, similarly to the way Distributed CO is achieved: In each database system they have to be applied locally, and employ a vote ordering strategy for the Two-phase commit protocol (2PC; Raz 1992, page 307). As has been mentioned above, Distributed SS2PL, including Distributed strictness (recoverability) and Distributed commitment ordering (serializability), automatically employs the needed vote ordering strategy, and is achieved (globally) when employed locally in each (local) database system (as has been known and utilized for many years; as a matter of fact locality is defined by the boundary of a 2PC participant (Raz 1992) ). CONCURRENCY CONTROL WITH LOCKING METHODS A lock guarantees exclusive use of a data item to a current transaction. In other words, transaction T2 does not have access to a data item that is currently being used by transaction T1. A transaction acquires a lock prior to data access; the lock is released (unlocked) when the transaction is completed so that another transaction can lock the data item for its exclusive use. This series of locking actions assumes that there is a likelihood of concurrent transactions attempting to manipulate the same data item at the same time. The use of locks based on the assumption that conflict between transactions is likely to occur is often referred to as pessimistic locking. Data consistency cannot be guaranteed during a transaction; the database might be in a temporary inconsistent state when several updates are

executed. Therefore, locks are required to prevent another transaction from reading inconsistent data. Most multiuser DBMSs automatically initiate and enforce locking procedures. All lock information is managed by a lock manager, which is responsible for assigning and policing the locks used by the transactions. Lock Granularity Lock granularity indicates the level of lock use. Locking can take place at the following levels: database, table, page, row, or even field (attribute). Database Level In a database-level lock, the entire database is locked, thus preventing the use of any tables in the database by transaction T2 while transaction Tl is being executed. This level of locking is good for batch processes, but it is unsuitable for multiuser DBMSs. You can imagine how s-l- o-w the data access would be if thousands of transactions had to wait for the previous transaction to be completed before the next one could reserve the entire database. Note that because of the database-level lock, transactions T1 and T2 cannot access the same database concurrently even when they use different tables. Database-level locking sequence Table Level In a table-level lock, the entire table is locked, preventing access to any row by transaction T2 while transaction T1 is using the table. If a transaction requires access to several tables, each table may be locked. However, two transactions can access the same database as long as they access different tables. Table-level locks, while less restrictive than database-level locks, cause traffic jams when many transactions are waiting to access the same table. Such a condition is especially irksome if the lock forces a delay when different transactions require access to different parts of the same table, that is, when the transactions would not interfere with each other. Consequently, table-level locks are not suitable for multiuser DBMSs. Note the effect of a table-level lock: transactions T1 and T2 cannot access the same table even when they try to use different rows; T2 must wait until T1 unlocks the table. Page Level In a page-level lock, the DBMS will lock an entire diskpage. A diskpage, orpage, is the equivalent of a diskblock, which can be described as a directly addressable section of a disk. A page has a fixed size, such as 4K, 8K, or 16K. For example, if you want to write only 73 bytes to a 4K page, the entire 4K page must be read from disk, updated in memory, and written back to disk. A table can span several pages, and a page can contain several rows

optimal concurrency conditions. For example, the DBMS will not allow two transactions to read the same database object even though neither transaction updates the database, nor therefore, no concurrency problems can occur. Remember that concurrency conflicts occur only when two transactions execute concurrently and one of them updates the database. Shared/Exclusive Locks The labels “shared” and “exclusive” indicate the nature of the lock. An exclusive lock exists when access is reserved specifically for the transaction that locked the object. The exclusive lock must be used when the potential for conflict exists. A shared lock exists when concurrent transactions are granted read access on the basis of a common lock. A shared lock produces no conflict as long as all the concurrent transactions are read-only. A shared lock is issued when a transaction wants to read data from the database and no exclusive lock is held on that data item. An exclusive lock is issued when a transaction wants to update (write) a data item and no locks are currently held on that data item by any other transaction. Using the shared/exclusive locking concept, a lock can have three states: unlocked, shared (read), and exclusive (write). As shown in Table 10.11, two transactions conflict only when at least one of them is a Write transaction. Because the two Read transactions can be safely executed at once, shared locks allow several Read transactions to read the same data item concurrently. For example, if transaction T1 has a shared lock on data item X and transaction T2 wants to read data item X, T2 may also obtain a shared lock on data item X. If transaction T2 updates data item X, an exclusive lock is required by T over data item X. The exclusive lock is granted if and only if no other locks are held on the data item. Therefore, if a shared or exclusive lock is already held on data item X by transaction T1, an exclusive lock cannot be granted to transaction T2 and T2 must wait to begin until T1 commits. This condition is known as the mutual exclusive rule: only one transaction at a time can own an exclusive lock on the same object. Although the use of shared locks renders data access more efficient, a shared/exclusive lock schema increases the lock manager’s overhead, for several reasons: The type of lock held must be known before a lock can be granted. Three lock operations exist: READ_LOCK (to check the type of lock), WRITE_LOCK (to issue the lock), and UNLOCK (to release the lock). The schema has been enhanced to allow a lock upgrade (from shared to exclusive) and a lock downgrade (from exclusive to shared).

Although locks prevent serious data inconsistencies, they can lead to two major problems: The resulting transaction schedule might not be serializable. The schedule might create deadlocks. A deadlock occurs when two transactions wait indefinitely for each other to unlock data. A database deadlock, which is equivalent to traffic gridlock in a big city, is caused when two or more transactions wait for each other to unlock data. Fortunately, both problems can be managed: serializability is guaranteed through a locking protocol known as two-phase locking, and deadlocks can be managed by using deadlock detection and prevention techniques. Those techniques are examined in the next two sections. Two-Phase Locking to Ensure Serializability Two-phase locking defines how transactions acquire and relinquish locks. Two-phase locking guarantees serializability, but it does not prevent deadlocks. The two phases are:

  1. A growing phase, in which a transaction acquires all required locks without unlocking any data. Once all locks have been acquired, the transaction is in its locked point.
  2. A shrinking phase, in which a transaction releases all locks and cannot obtain any new lock. The two-phase locking protocol is governed by the following rules: Two transactions cannot have conflicting locks. No unlock operation can precede a lock operation in the same transaction. No data are affected until all locks are obtained—that is, until the transaction is in its locked point. In this example, the transaction acquires all of the locks it needs until it reaches its locked point. (In this example, the transaction requires two locks.) When the locked point is reached, the data are modified to conform to the transaction’s requirements. Finally, the transaction is completed as it releases all of the locks it acquired in the first phase. Two-phase locking increases the transaction processing cost and might cause additional undesirable effects. One undesirable effect is the possibility of creating deadlocks. Deadlocks A deadlock occurs when two transactions wait indefinitely for each other to unlock data. For example, a deadlock occurs when two transactions, T1 and T2, exist in the following mode: T1 = access data items X and Y

Uniqueness ensures that no equal time stamp values can exist, and monotonicity1 ensures that time stamp values always increase. 1The term monotonicity is part of the standard concurrency control vocabulary. The authors’ first introduction to this term and its proper use was in an article written by W. H. Kohler, “A Survey of Techniques for Synchronization and Recovery in Decentralized Computer Systems,” Computer Surveys 3(2), June 1981, pp. 149−283. How a Deadlock Condition Is Created All database operations (Read and Write) within the same transaction must have the same time stamp. The DBMS executes conflicting operations in time stamp order, thereby ensuring serializability of the transactions. If two transactions conflict, one is stopped, rolled back, rescheduled, and assigned a new time stamp value. The disadvantage of the time stamping approach is that each value stored in the database requires two additional time stamp fields: one for the last time the field was read and one for the last update. Time stamping thus increases memory needs and the database’s processing overhead. Time stamping demands a lot of system resources because many transactions might have to be stopped, rescheduled, and restamped. Wait/Die and Wound/Wait Schemes You have learned that time stamping methods are used to manage concurrent transaction execution. In this section, you will learn about two schemes used to decide which transaction is rolled back and which continues executing: the wait/die scheme and the wound/wait scheme.2 An example illustrates the difference. Assume that you have two conflicting transactions: T1 and T2, each with a unique time stamp. Suppose T1 has a time stamp of 11548789 and T2 has a time stamp of 19562545. You can deduce from the time stamps that T1 is the older transaction (the lower time stamp value) and T2 is the newer transaction. Given that scenario, the four possible outcomes are: T1 (11548789) T2 (19562545)

  • T1 waits until T2 is completed and T2 releases its locks.
  • T1 preempts (rolls back) T2.
  • T2 is rescheduled using the same time stamp. T2 (19562545) T1 (11548789)
  • T2 dies (rolls back). • T2 is rescheduled using the same time stamp.
  • T2 waits until T1 is completed and T1 releases its locks.

Using the wait/die scheme: If the transaction requesting the lock is the older of the two transactions, it will wait until the other transaction is completed and the locks are released. If the transaction requesting the lock is the younger of the two transactions, it will die (roll back) and is rescheduled using the same time stamp. In short, in the wait/die scheme, the older transaction waits for the younger to complete and release its locks. In the wound/wait scheme: If the transaction requesting the lock is the older of the two transactions, it will preempt (wound) the younger transaction (by rolling it back). T1 preempts T2 when T1 rolls back T2. The younger, preempted transaction is rescheduled using the same time stamp. If the transaction requesting the lock is the younger of the two transactions, it will wait until the other transaction is completed and the locks are released. In short, in the wound/wait scheme, the older transaction rolls back the younger transaction and reschedules it. In both schemes, one of the transactions waits for the other transaction to finish and release the locks. However, in many cases, a transaction requests multiple locks. How long does a transaction have to wait for each lock request? Obviously, that scenario can cause some transactions to wait indefinitely, causing a deadlock. To prevent that type of deadlock, each lock request has an associated time-out value. If the lock is not granted before the time-out expires, the transaction is rolled back. CONCURRENCY CONTROL WITH OPTIMISTIC METHODS The optimistic approach is based on the assumption that the majority of the database operations do not conflict. The optimistic approach requires neither locking nor time stamping techniques. Instead, a transaction is executed without restrictions until it is committed. Using an optimistic approach, each transaction moves through two or three phases, referred to as read, validation, and write.3 During the read phase, the transaction reads the database, executes the needed computations, and makes the updates to a private copy of the database values. All update operations of the transaction are recorded in a temporary update file, which is not accessed by the remaining transactions. During the validation phase, the transaction is validated to ensure that the changes made will not affect the integrity and consistency of the database. If the validation test is positive, the transaction goes to the write phase. If the validation test is negative, the transaction is restarted and the changes are discarded. During the write phase, the changes are permanently applied to the database. The optimistic approach is

  1. Natural disasters. This category includes fires, earthquakes, floods, and power failures. Whatever the cause, a critical error can render the database in an inconsistent state. The following section introduces the various techniques used to recover the database from an inconsistent state to a consistent state

Data Warehouse

In computing, a data warehouse ( DW ) is a database used for reporting and analysis. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting. The typical data warehouse uses staging, integration, and access layers to house its key functions. The staging layer stores raw data, the integration layer integrates the data and moves it into hierarchal groups, and the access layer helps users retrieve data.[1] Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse. This definition of the data warehouse focuses on data storage. The main source of the data is cleaned, transformed, catalogued and made available for use by managers and other business professionals for data mining, online analytical processing, market research and decision support (Marakas & O'Brien 2009). However, the means to retrieve and analyze data, to extract, transform and load data, and to manage the data dictionary are also considered essential components of a data warehousing system. Many references to data warehousing use this broader context. Thus, an expanded definition for data warehousing includes business intelligence tools, tools to extract, transform and load data into the repository, and tools to manage and retrieve metadata. Benefits of a data warehouse A data warehouse maintains a copy of information from the source transaction systems. This architectural complexity provides the opportunity to:  Maintain data history, even if the source transaction systems do not.  Integrate data from multiple source systems, enabling a central view across the enterprise. This benefit is always valuable, but particularly so when the organization has grown by merger.

 Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad data.  Present the organization's information consistently.  Provide a single common data model for all data of interest regardless of the data's source.  Restructure the data so that it makes sense to the business users.  Restructure the data so that it delivers excellent query performance, even for complex analytic queries, without impacting the operational systems.  Add value to operational business applications, notably customer relationship management (CRM) systems. Data mart  A data mart is the access layer of the data warehouse environment that is used to get data out to the users. The data mart is a subset of the data warehouse which is usually oriented to a specific business line or team.

Data mining

Data mining (the analysis step of the knowledge discovery in databases process,[1]^ or KDD), a relatively young and interdisciplinary field of computer science[2][3]^ is the process of discovering new patterns from large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics and database systems.[2]^ The goal of data mining is to extract knowledge from a data set in a human-understandable structure[2] and involves database and data management, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of found structure, visualization and online updating.[2] The term is a buzzword, and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis and statistics) but also generalized to any kind of computer decision support system including artificial intelligence, machine learning and business intelligence. In the proper use of the word, the key term is discovery , commonly defined as "detecting something new". Even the popular book "Data mining: Practical machine learning tools and