Notes on basics in data management, Notes for Fundamentals of Computer Science

Top notes on basics in data management

Type: Notes

2022/2023

Uploaded on 06/11/2025

samu.23
BASICS IN DATA MANAGEMENT
2. INFORMATION SYSTEMS
• Business process: sequence of business activities aimed at producing a product or service, involving multiple resources: material, organizational, informational
• Information system: set of information managed by business processes
• Data assets: raw material with which information is produced
• Set of procedures: for information acquisition, processing, production
• Set of human resources: that oversee the procedures
• Set of tools and instruments: for storage and processing of information
An information system is the set of components of an organization designed to: acquire,
process, store, retrieve, share, transmit information.
A computer system is the technology supporting the information system: hardware,
software, databases and management systems, communication networks.
EXAMPLE OF INFO AND COMPUTER SYSTEM
A municipality equipped with an information and computer system for detecting the level of smog
- Information system: people (employees, mayor, aldermen), means and procedures for the production and management of pollution-related information
- Computer system: automatic management, by appropriate staff, of smog-level information using hardware machines, software programs and network equipment
Computer systems rely on information technology to process data and produce
information: procedures are automated and consist of programs running on a computer.
Customarily and for various reasons there are parts of the business information system
that are not “computerized”, and therefore information is produced without the aid of
information technology.
Data and information:
• An organization’s information is available in the form of a set of data stored on appropriate media
• Data must be properly “interpreted” to give rise to the actual information
• Data: a representation of the information useful for its storage and management
• Both the raw data and the rules for their interpretation are stored in the form of data, and thus constitute the information assets of the company
Complex information: elementary pieces of information are logically related to each other by associating appropriate semantics with the links; this results in complex information.
… interactions with databases
• Shopping in a supermarket
• Purchasing with credit cards
• Hotel reservations at an agency
• Payment of insurance
• Enrolling in a course
Require
• Use of “data collections”

• Stored persistently
• Related to a reality of interest
• Serving a given organization
Database
A database is the set of information associated with the collections of data that are:
- related to each other
- equipped with an appropriate description.
It is:
- a single, large data repository
- shared within the enterprise by all applications and users
- persistent, i.e., with a much longer life than management procedures
- a tool that allows you to always work on a consistent state of the data.
Database description
• The database must maintain its own description
• This refers to a catalog or dictionary containing a set of data, called metadata, that is used to describe the data itself
• All this is achieved through a software layer called the DBMS, which manages all data in an integrated manner, ensuring that operations are carried out efficiently and effectively
Definition of DBMS
It is a set of programs that allows users to:
• Define: specify types, structures, and constraints on data
• Manipulate: insert, delete, update, retrieve data
• Check: control access to data, ensuring protection from failures and from unwanted access to the database.
A DBMS, therefore, makes it easier for users to use their database. Before the advent of DBMSs, a data store consisted of a set of files, and all operations and information management logic were the responsibility of the applications that interacted with the store. With the introduction of DBMSs, applications that interact with the database are greatly simplified.
Managing a phone book
Registration of the countless names of friends with their addresses, phone numbers, cell phone numbers…
• Personal approach: all data access operations are managed with appropriate programs written in the preferred language, which must also ensure the “persistence” of the data in confidential archives
• Approach using a DBMS: a DBMS product (free, downloaded from the Web) is used that allows data definition and management through the classic SELECT, INSERT, DELETE, UPDATE operations.
A database system is the set formed by a database and a DBMS.
ANSI-SPARC three-level architecture
• One of the first general architecture proposals for database systems was introduced in 1971 by the Data Base Task Group (DBTG) formed by the Conference on Data Systems Languages (CODASYL). The Standards Planning and Requirements Committee (SPARC) of the American National Standards Institute (ANSI) proposed a similar model (since then called the ANSI-SPARC architecture) whose

• ATOMICITY: the so-called “all or nothing” property: a transaction is atomic if it is executed in its entirety or not at all
• CONSISTENCY: a transaction is a transformation of one consistent state of the database to another consistent state. A DBMS, in particular, must ensure that all constraints defined on the database are satisfied
• ISOLATION: transactions must be executed independently of each other. This means that partial effects of incomplete transactions should not be visible to other transactions
• DURABILITY: the effects of a transaction that is terminated by a “commit” must be permanently recorded in the database and never lost for any reason.
Features of DBMS:

  1. Redundancy and consistency check: a DB-based approach tends to eliminate unnecessary duplication (redundancy) of data. This reduces the risk of “inconsistency”: if a data item is available only once in the DB, a change in it is immediately available to all.
  2. Sharing: an organization’s data is shared by all users and applications in an organization.
  3. Data integrity: this refers to the consistency and validity of the values contained in the DB. This is achieved by having the DBMS handle “changes” in the information through appropriate constraint rules.
  4. Efficient management of operations: the DBMS has features for efficient execution of queries and updates.
  5. Concurrency control: the DBMS guarantees concurrent transactions on the data without mutual interference.
  6. Reliability: the DBMS has functionality for restoring the database even in case of failures or malfunctions during transaction execution.
  7. Security: security means preventing data access by unauthorized users. The DBMS allows data access policies to be defined by user or by user profile.
Main components of a DBMS:
  • Access manager: module of a DBMS that performs database access control. It ensures that only authorized users and applications have access to information in the database and that their operations are compatible with their privileges/roles.
  • Query manager: component that handles user requests in terms of DDL and DML operations. Specifically, it translates DDL and DML commands into an internal DBMS format, transforming, if possible, the user request into an equivalent but more efficient one.
  • Memory manager: component that is responsible for defining strategies for accessing information in mass memory and its transfer to central memory and vice versa.
  • File manager: module that deals with management of files on mass storage (data files, control files, log files) containing the database.
  • Integrity manager: module that is responsible for checking that integrity rules are satisfied within the database.
  • Concurrency control manager: module that aims to manage concurrent transaction access to shared database resources. Special lock mechanisms (locks on resources) are used to serialize concurrent transactions, preventing anomalies in accessing information.
  • Reliability manager: module that deals with saving operations on the database in log files and initiating procedures to restore the database itself after failures.
  1. EXAMPLES OF INFORMATION SYSTEMS
Operational support systems: manage information exchanges
• Within operational processes between different companies (business to business, B2B)
• Between processes in the same company
• Between user and company (business to customer, B2C)
Typical functions of operational support systems are:
• Management of business registries (customer, supplier, product registries)
• Processing of “business transactions” (procurement processes, inventory management, production and sales processes, order management)
• Planning and control of operations
• Management of administrative procedures
Thus, the information handled by these processes is:
• Biographical information: describes structural properties of objects and actors in the system
• Information on “transactions”: describes events or movements of resources of an economic, financial or material nature
• Information on operations planning: describes the production plan of a given organization.
Such information can be:
• Stable over time: such as customer, supplier and product master records
• Changeable over time: e.g., the current account balance
• Related to events: such as the description of business transactions and their component transactions, e.g., airline, hotel and car rental reservations
Management of orders and purchases
Suppose a multinational sportswear distribution company active in the Italian market is equipped with an information system for managing customer orders, purchases from suppliers, and storage of goods in the warehouse.
The steps for the production and management of information are:
• Sales office employees acquire daily customer orders (information acquisition) via telephone and e-mail
• Order information is entered via terminals within the enterprise system and, in particular, stored in a special central server (information storage)
• The archived information is then processed and analysed by purchasing department employees by means of special software programs (information processing)
• Based on the remaining quantities in the warehouse, the procurement manager can decide whether or not the order can be fulfilled, and if so, whether to issue an order for the purchase of goods from suppliers (information communication)

Suppose a bank, with offices and branches spread throughout the country, is equipped with a video surveillance system to monitor and control sensitive areas in order to detect dangerous events, such as possible robberies. The steps for the production and management of information are:
• A set of cameras, equipped with processors capable of providing them with “intelligence” and placed at various locations in the venues, continuously acquires information about the observed scene (information acquisition)
• These data are then sent via a dedicated communications network to a central processor or server that stores them in specialized digital archives for processing video information (information storage)
• Data are then processed in real time by special software programs, automatically (without any human intervention) or semi-automatically, and then analysed by special operators (information processing)
• If anomalies are detected, operators can check for the actual presence of suspicious occurrences and report alarm information in a timely manner to security managers or directly to law enforcement (information communication)
As can be seen, the information system described above provides decision support for the activities of security officers, who, thanks to the help of an almost completely computerized system, receive not only the information to be analysed but also the results of its processing, with continuous and immediate feedback on their actions. Specifically, depending on the level of alert automatically generated by a software program, the operator could look at the video related to a certain area and check for suspicious events (such as a robbery in progress) and, in case of danger, alert law enforcement agencies.
Note that in this case it is unthinkable that the IT system does not exist, as its absence would greatly impact the entire business organization.
Each site would have to be staffed with a certain number of security personnel to guard all sensitive areas; the communication of information to headquarters would be much more complicated and slow, and would have to follow rules yet to be defined; the time for law enforcement agencies to intervene would be lengthened, resulting in late intervention.
Information systems in public administration
In Italy, examples of central public administrations that have adopted completely innovative solutions include:
• the Ministry of Agriculture and Forestry
• the National Health Information System
• the Public Education Information System
• the Ministry of Justice, which is in the process of computerizing the criminal and civil areas for everything related to the trial process
Document management: a very important aspect to highlight in PA information systems is undoubtedly the presence of systems for document management. Document management within administrations presents problems and critical issues of an organizational, technological and archival nature that require a systematic redefinition of structures, responsibilities, appropriate IT architectures, document flow management models and preservation methods. The introduction of a document management system within a public administration must, in other words, be harmoniously and functionally included as part of an information system supporting the administration’s institutional activities. A document management system consists of the set of documents produced and acquired for the institutional purposes of an agency or administration, together with the system (i.e., the rules, procedures and resources) necessary for the formation, organization, maintenance, retrieval, use and preservation of documents.
To sum up, the macro-objectives to be achieved with a document management system are:
• The production/acquisition of reliable documents for legal and administrative purposes
• The intake and processing of documents
• The organization and maintenance of documentary production in an orderly manner consistent with the functions performed
• The transmission and preservation of authentic documents, i.e., intact and of certain and identified provenance
• The speed and efficiency of retrieval within the administrative work performed
Hospital information systems
A hospital information system (HIS) is an integrated information system designed to manage all aspects of a hospital, whether administrative and financial, clinical, medical or research. This terminology also refers to the document management systems required in the health care organization as well as the technological infrastructure for processing and transmitting the information processed. Usually, the system is divided into subsystems related to the various medical areas. At present, there are three different subsystems commonly found:
• The Hospital Information System proper
• The Radiological Information System (RIS)
• The Picture Archiving and Communication System (PACS)
From the perspective of information managed by the system, there are three main classes of data:
• Those related to patients
• Those related to the activities
• Those related to resources
A hospital information system, designed after careful analysis of the needs of the health care facility, is a tool that aids diagnostic decision-making and organizational activities.
All of this requires that an HIS, like all modern ISs, presents itself as an integrated system that allows different information to be stored, accessed and shared.
Suppose that an orthopaedic trauma hospital is equipped with an information system to manage patient admissions. The steps for the production and management of information in such a case are:
• A set of hospital receptionists acquire admission requests from patients via telephone (information acquisition)
• Patient information is then entered by the employees themselves via terminals within the hospital computer system and, in particular, stored on a special central server as an integral part of computerized medical records (information storage)
• The archived information is then collected and analysed by the head physician and head nurse of the relevant department (information processing)

• Analysis of air traffic information and possible problems impacts the decision-making activities of airport managers
In particular, depending on the possible problems that may be encountered (unavailability of some runways), managers might make the decision to cancel or delay one or more flights or to wait for the situation to evolve (runway restoration) and simply delay flights.

  1. INTRODUCTION TO R
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It has:
• An effective data handling and storage facility
• A suite of operators for calculations on arrays, in particular matrices
• A large, coherent, integrated collection of intermediate tools for data analysis
• Graphical facilities for data analysis and display either directly at the computer or on hardcopy
• A well developed, simple and effective programming language (called “S”) which includes conditionals, loops, user defined recursive functions and input and output facilities (indeed, most of the system supplied functions are themselves written in the S language)
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. R is very much a vehicle for newly developing methods of interactive data analysis. It has developed rapidly, and has been extended by a large collection of packages. However, most programs written in R are essentially ephemeral, written for a single piece of data analysis.
R was first created in the early 1990s by Ross Ihaka and Robert Gentleman as an implementation of S. Development soon shifted to a larger core group. It is distributed under the GNU General Public License.
It is open source (R vs. SAS). It has a large body of academic users. Most statistical tests will have some implementation in R. It has a full-blown programming language (R vs. SPSS). It is free.
There are also some drawbacks: development by people who are not computer language specialists can cause problems, many functions are in packages outside base R, and scaling to run intensive programs can require a bit of work.
Most R users also use an Integrated Development Environment (IDE) for coding in R, and the most popular one is RStudio.
  2. Some preliminary concepts about R
The fact that R is a language may deter some users who think that they cannot program. This should not be the case for two reasons. First, R is an interpreted language, not a compiled one: all commands typed on the keyboard are directly executed without requiring the user to build a complete program as in most computer languages (C, Fortran, Pascal, …). Second, R’s syntax is very simple and intuitive: in order to be executed, a function always needs to be written with parentheses, even if there is nothing within them (e.g., ls()). If one just types the name of a function without parentheses, R will display the content of the function.
When R is running, variables, data, functions, results, etc., are stored in the active memory of the computer in the form of objects which have a name. The user can act on these objects with operators (arithmetic, logical, comparison, …) and functions (which are themselves objects). The use of operators is relatively intuitive. An R function may be sketched as follows:
The arguments can be objects (data, formulae, expressions, …), some of which could be defined by default in the function. These default values may be modified by the user by specifying options. An R function may require no argument: either all arguments are defined by default (and their values can be modified with the options), or no argument has been defined in the function.
All the actions of R are done on objects stored in the active memory of the computer: no temporary files are used.
• Reading and writing files are used for input and output of data and results (graphics, …)
• The user executes the functions via some commands
• The results are displayed directly on the screen, stored in an object, or written on the disk (particularly for graphics)
• Since the results are themselves objects, they can be considered as data and analysed as such
• Data files can be read from the local disk or from a remote server through the internet
• The functions available to the user are stored in a library localized on the disk in a directory called R_HOME/library (R_HOME is the directory where R is installed)
• The directory contains packages of functions, which are themselves structured in directories
• The package named base is in a way the core of R and contains the basic functions of the language, particularly for reading and manipulating data
• Each package has a directory called R with a file named like the package (for instance, for the package base, this is the file R_HOME/library/base/R/base)
• This file contains all the functions of the package
• One of the simplest commands is to type the name of an object to display its content
• For instance, if an object n contains the value 10, typing n displays [1] 10
• The digit 1 within brackets indicates that the display starts at the first element of n
• This command is an implicit use of the function print, and the above example is similar to print(n) (in some situations, the function print must be used explicitly, such as within a function or a loop)
• The name of an object must start with a letter (A-Z and a-z) and can include letters, digits (0-9), dots (.), and underscores (_).
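The points above about parentheses, default arguments and implicit printing can be tried directly at the R prompt. The following is only a minimal sketch: the object names n and objs are chosen for illustration, and round() is used simply as a convenient base function that has a default argument.

```r
# Create an object named n that contains the value 10.
n <- 10
n          # typing the name displays the content: [1] 10
print(n)   # the explicit equivalent of the line above

# A function runs only when written with parentheses.
objs <- ls()      # runs ls() and stores the names of the objects in memory
is.function(ls)   # without parentheses, ls is just an object (a function)

# Inside a loop nothing is displayed implicitly, so print() is required.
for (i in 1:2) print(i * n)

# Arguments defined by default can be modified by naming them.
round(3.14159)              # digits takes its default value, 0
round(3.14159, digits = 2)  # the default is overridden
```

Typing ls on its own, without parentheses, would display the source of the function instead of running it.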

• We can avoid displaying all these details with the option max.level.
• To delete objects in memory, we use the function rm: rm(x) deletes the object x, rm(x, y) deletes both the objects x and y, and rm(list=ls()) deletes all the objects in memory. The same options mentioned for the function ls() can then be used to delete some objects selectively: rm(list=ls(pat="^m"))
• The on-line help of R gives very useful information on how to use the functions. Help is available directly for a given function; for instance, > ?lm will display, within R, the help page for the function lm() (linear model). The commands help(lm) and help("lm") have the same effect. The last one must be used to access help with non-conventional characters
• Calling help opens a page (this depends on the operating system) with general information on the first line, such as the name of the package containing the documented function(s) or operators. Then comes a title, followed by sections which give detailed information
• The help in HTML format (read, e.g., with a web browser) is called by typing > help.start(). Searching with keywords is also possible in R with the function help.search. The latter looks for a specific topic, given as a character string, in the help pages of all installed packages. For instance, help.search("tree") will display a list of the functions whose help pages mention “tree”
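As a small illustration of listing and deleting objects and of calling the help, the sketch below uses throwaway object names (x, y, m1, m2, remaining) that are assumptions made for the example. It spells out the full argument name pattern; the notes abbreviate it as pat, which R accepts through partial matching. The help calls are wrapped in interactive() because they open a viewer.

```r
# Create a few objects to work with.
x <- 1; y <- 2; m1 <- 3; m2 <- 4

ls(pattern = "^m")   # names of the objects whose name starts with "m"

rm(x)                           # delete a single object
rm(list = ls(pattern = "^m"))   # delete selectively, by name pattern
# rm(list = ls()) would delete every object in memory (not run here).

remaining <- ls()    # x, m1 and m2 are gone, y is still there

# Help pages open a viewer, so they are requested only interactively.
if (interactive()) {
  ?lm                   # same effect as help(lm) or help("lm")
  help("+")             # the quoted form is needed for non-conventional characters
  help.search("tree")   # keyword search in the help of installed packages
}
```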

  1. Data with R
We have seen that R works with objects which are, of course, characterized by their names and their content, but also by attributes which specify the kind of data represented by an object. In order to understand the usefulness of these attributes, consider a variable that takes the value 1, 2, or 3: such a variable could be an integer variable or the coding of a categorical variable. All objects have 2 intrinsic attributes: mode and length. The mode is the basic type of the elements of the object. There are 4 main modes: numeric, character, complex, and logical (FALSE or TRUE).

• Other modes exist but they do not represent data, for instance function or expression
• The length is the number of elements of the object
• To display the mode and the length of an object, one can use the functions mode and length, respectively
• Whatever the mode, missing data are represented by NA (not available)
• A very large numeric value can be specified with an exponential notation (for example, 2.1e23)
• A value of mode character is input with double quotes ". It is possible to include this latter character in the value if it follows a backslash \. The 2 characters together, \", will be treated in a specific way by some functions, such as cat for display on screen or write.table to write on the disk
• Alternatively, variables of mode character can be delimited with single quotes ('); in this case it is not necessary to escape double quotes with backslashes (but single quotes must be!)
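The two intrinsic attributes and the quoting rules described above can be checked with a few lines. This is a minimal sketch with made-up values; the factor f is included only to show how the same codes 1, 2, 3 can be marked as categorical through an additional attribute.

```r
# The same values 1, 2, 3 as a numeric variable and as a categorical one.
x <- c(1, 2, 3)
f <- factor(c(1, 2, 3))
mode(x); length(x)   # "numeric" and 3
mode(f); class(f)    # still "numeric", but the class attribute says "factor"

# The four main modes.
mode("text")   # "character"
mode(TRUE)     # "logical"
mode(1i)       # "complex"
mode(2.5)      # "numeric"
mode(ls)       # "function": a mode that does not represent data

# Missing data are NA whatever the mode.
v <- c(1, NA, 3)
is.na(v)

# Exponential notation for a very large value.
N <- 2.1e23

# Double quotes escaped with a backslash, or single-quoted delimiters.
s1 <- "Double quotes \" delimiting R's strings."
s2 <- 'Double quotes " delimiting R\'s strings.'
identical(s1, s2)   # both spellings give the same character value
cat(s1, "\n")       # cat shows the quote without the backslash
```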

• The variants of read.table (such as read.csv and read.delim) are useful since they have different default values.
• The function scan is more flexible than read.table. A difference is that it is possible to specify the mode of the variables: for example, one can read in the file data.dat 3 variables, the first of mode character and the next 2 of mode numeric. Another important distinction is that scan() can be used to create different objects: vectors, matrices, data frames, lists, … In the above example, mydata is a list of three vectors. By default, that is if what is omitted, scan() creates a numeric vector. If the data read do not correspond to the mode(s) expected (either by default, or specified by what), an error message is returned.
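The scan() behaviour described above can be sketched as follows. The contents of data.dat are not given in these notes, so the three lines written to a temporary file here are invented for illustration, as are the file nums.dat and the object names path, nums_path, nums and res.

```r
# A hypothetical data.dat: one character field followed by two numeric fields.
path <- file.path(tempdir(), "data.dat")
writeLines(c("A 1.50 1.2", "B 1.60 1.4", "C 1.70 1.6"), path)

# 'what' gives one template per variable: "" for character, 0 for numeric.
mydata <- scan(path, what = list("", 0, 0))
str(mydata)   # a list of three vectors

# Without 'what', scan() creates a numeric vector.
nums_path <- file.path(tempdir(), "nums.dat")
writeLines("1 2 3 4", nums_path)
nums <- scan(nums_path)

# If the data do not match the expected mode, scan() stops with an error.
res <- tryCatch(scan(path, quiet = TRUE), error = function(e) "error")
res
```

The templates in what only carry the mode of each variable; their actual values ("" and 0) are not used as data.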

• The function read.fwf can be used to read in a file some data in fixed width format. The options are the same as for read.table() except widths, which specifies the width of the fields (buffersize is the maximum number of lines read simultaneously). For example, if a file named data.txt has the data indicated on the right, one can read the data with a single call to read.fwf, giving the width of each field in widths.
• The function write.table writes an object to a file, typically a data frame, but this could be another kind of object (vector, matrix, …). Its arguments and options are listed in its help page (?write.table).
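A minimal sketch of reading fixed width data and writing it back out is given below. The figure showing the contents of data.txt is not available here, so the three lines used are an assumption (one character, then a 4-character and a 3-character number), and the names fwf_path, out_path and back are illustrative.

```r
# Hypothetical fixed width data: widths 1, 4 and 3, with no separators.
fwf_path <- file.path(tempdir(), "data.txt")
writeLines(c("A1.501.2", "A1.551.3", "B1.601.4"), fwf_path)

# widths gives the number of characters in each field.
mydata <- read.fwf(fwf_path, widths = c(1, 4, 3))
mydata   # columns are named V1, V2, V3 by default

# Write the data frame to a tab-separated file, without quotes or row names.
out_path <- file.path(tempdir(), "data_out.txt")
write.table(mydata, file = out_path, quote = FALSE, sep = "\t",
            row.names = FALSE)

# Read it back to check that the round trip preserved the values.
back <- read.table(out_path, header = TRUE, sep = "\t")
```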

It is useful in statistics to be able to generate random data, and R can do it for a large number of probability density functions. These functions are of the form rfunc(n, p1, p2, …), where func indicates the probability distribution, n the number of data generated and p1, p2, … are the values of the parameters of the distribution. The above table gives the details of each distribution and the possible default values (if no default value is indicated, the parameter must be specified by the user). Most of these functions have counterparts obtained by replacing the letter r with d, p or q to get, respectively, the probability density (dfunc(x, …)), the cumulative probability density (pfunc(x, …)) and the value of the quantile (qfunc(p, …), with 0 < p < 1).
It is possible to create an object by specifying its mode, length, type, etc. One can, for instance, create an “empty” object and then modify its elements successively, which is more efficient than putting all its elements together with c(). It can also be very convenient to create objects from others. For example, to fit a series of models, it is simple to put the formulae in a list and then to extract the elements successively to insert them in the function lm. The explicit construction of objects gives a better understanding of their structure and allows us to go further into some notions previously mentioned.
The function vector, which has 2 arguments, mode and length, creates a vector whose elements have a value depending on the mode specified as argument: 0 if numeric, FALSE if logical, or "" if character. The following functions have the length of the vector as their single argument: numeric(), logical() and character().
A matrix is a vector with an additional attribute (dim) which is itself a numeric vector with length 2 and defines the numbers of rows and columns of the matrix. It can be created with the function matrix. The option byrow indicates whether the values given by data must successively fill the columns (the default) or the rows (if TRUE). The option dimnames allows names to be given to the rows and columns.
We have seen that a data frame is created implicitly by the function read.table; it is also possible to create a data frame with the function data.frame. The vectors so included in the data frame must be of the same length, or if one of them is shorter, it is “recycled” a whole number of times.
R offers a remarkable variety of graphics. To get an idea, one can type demo(graphics) or demo(persp). It is not possible to detail here the possibilities of R in terms of graphics, particularly since each graphical function has a large number of options, making the production of graphics very flexible. The result of a graphical function cannot be assigned to an object but is sent to a graphical device, which is a graphical window or a file. There are 2 kinds of graphical functions: the high-level plotting functions, which create a new graph, and the low-level plotting functions, which add elements to an existing graph. The graphs are produced with respect to graphical parameters which are defined by default and can be modified with the function par.
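The random generators and the explicit constructors for vectors, matrices and data frames described above can be sketched as follows. The normal and uniform distributions are used only as familiar instances of the rfunc/dfunc/pfunc/qfunc pattern; the seed, the object names and the row and column labels are assumptions made for the example.

```r
# Random generation: rfunc(n, p1, p2, ...).
set.seed(1)                        # makes the pseudo-random draws reproducible
x <- rnorm(5, mean = 0, sd = 1)    # 5 draws from a normal distribution
u <- runif(3)                      # parameters left at their defaults (0 and 1)

# Counterparts obtained by replacing r with d, p or q.
dnorm(0)              # probability density at 0
pnorm(1.96)           # cumulative probability up to 1.96
q <- qnorm(0.975)     # quantile for p = 0.975

# "Empty" vectors: their elements depend on the mode.
v  <- vector(mode = "numeric", length = 3)   # 0 0 0
l  <- logical(2)                             # FALSE FALSE
ch <- character(2)                           # "" ""
v[2] <- 5                                    # elements are then modified in place

# Matrices: values fill the columns by default, the rows if byrow = TRUE.
m_col <- matrix(1:6, nrow = 2)
m_row <- matrix(1:6, nrow = 2, byrow = TRUE,
                dimnames = list(c("r1", "r2"), c("a", "b", "c")))
dim(m_row)   # the dim attribute: 2 rows, 3 columns

# Data frame: the shorter vector is recycled a whole number of times.
df <- data.frame(id = 1:4, grp = c("a", "b"))
df
```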