Comparison of Grid and Database Approaches for Efficient Petabyte Data Management

Distributed Database Management Systems and the Data Grid

Heinz Stockinger

CERN, European Organization for Nuclear Research, Geneva, Switzerland

Institute for Computer Science and Business Informatics, University of Vienna, Austria

[email protected]

tel +41 22 767 16 08

Abstract

Grid research and distributed database research both currently deal with data replication, but they approach the problem from different points of view. This paper outlines both approaches and looks for commonalities between the two worlds, with the aim of building an efficient Data Grid that manages data stored in object-oriented databases. Our target object-oriented database management system is Objectivity/DB, which is currently the database of choice in some existing High Energy Physics (HEP) experiments as well as in next-generation experiments at CERN. We describe the characteristics of Data Grids, especially within the High Energy Physics community, and define the needs for Data Grids. The Globus toolkit is the Grid middleware on which we base our discussion of Grid research.

1 Introduction

Grid computing grew out of high-performance computing, supercomputing and, later, cluster computing, where several processors or workstations are connected via a high-speed interconnect in order to run a common program. Originally, a cluster was meant to span a local area network, but the idea was later extended to the wide area. A Grid is intended to connect computing resources over a wide area network.

The Grid research field can be further divided into two large sub-domains: the Computational Grid and the Data Grid. A Computational Grid is a natural extension of the cluster computer, in which large computing tasks are executed on distributed computing resources, whereas a Data Grid deals with the efficient management, placement and replication of large amounts of data. Once the data are in place, however, computational tasks can be run on the Grid using those data. The need for Data Grids stems from the fact that scientific applications such as data analysis in High Energy Physics (HEP), climate modelling or earth observation are very data intensive, and a large community of researchers around the globe wants fast access to the data.

In the remainder of this paper we concentrate on the specific needs of High Energy Physics, which can be regarded as representative of other data-intensive research communities. In particular, we focus on the data-intensive Large Hadron Collider (LHC) experiments at CERN, the European Organization for Nuclear Research in Geneva, Switzerland. The DataGrid project [1] has recently been initiated at CERN in order to set up a Data Grid. One of its working groups deals explicitly with data management in a Data Grid [2]. The tasks to be solved include data access, migration and replication, as well as query estimation and optimisation, in a secure environment. In this paper we deal with the replication aspects that need to be solved in the DataGrid project. The Globus toolkit [3] is the middleware that we will use for the Grid infrastructure.

Scientific, data-intensive applications use large collections of files for storing data. In the HEP community, data generated by large detectors have to be stored persistently in mass storage systems such as disks and tapes in order to be available for physics analysis. In some HEP experiments, databases are used to store Terabytes and even Petabytes of persistent data. The use of databases is still unusual among Data Grid applications. Compare this with the climate modelling community: in that research domain, large collections of files are stored as so-called “flat files” without databases. This requires additional data management tasks, such as keeping a catalogue of available files, whereas in some HEP experiments the database management system (DBMS) takes care of this.
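
To illustrate the kind of bookkeeping that a flat-file approach leaves to the application, the following is a minimal sketch of a file catalogue that maps a logical file name to the physical locations of its copies. It is an assumption-laden illustration, not the catalogue of any particular project: the class name FileCatalogue, its methods, and the example file names and site names are all hypothetical.

```python
from collections import defaultdict


class FileCatalogue:
    """Hypothetical catalogue: logical file name -> set of physical locations."""

    def __init__(self):
        # Each logical name maps to the set of places where a copy is stored.
        self._locations = defaultdict(set)

    def register(self, logical_name, physical_location):
        """Record that a copy of the file exists at the given location."""
        self._locations[logical_name].add(physical_location)

    def unregister(self, logical_name, physical_location):
        """Forget one copy; drop the entry once no copies remain."""
        copies = self._locations.get(logical_name)
        if copies is None:
            return
        copies.discard(physical_location)
        if not copies:
            del self._locations[logical_name]

    def lookup(self, logical_name):
        """Return all known locations of a file, sorted for stable output."""
        return sorted(self._locations.get(logical_name, ()))


# Example usage with made-up names (assumptions, not taken from the paper).
catalogue = FileCatalogue()
catalogue.register("run42/events.dat", "site-a:/data/run42/events.dat")
catalogue.register("run42/events.dat", "site-b:/tape/run42/events.dat")
print(catalogue.lookup("run42/events.dat"))
```

With a DBMS, this mapping is maintained by the database itself; with flat files, something along these lines has to be kept consistent by the data management layer.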

Currently, some new experiments in HEP use object-oriented database management systems (ODBMS) for data management. This is true for

