I- A few common concepts and terms in the big data world:

i. Relational database management system (RDBMS)

A system that stores structured data in a predetermined schema (tables). It scales vertically
through large symmetric multiprocessing (SMP) servers, or horizontally through clustering
software. These databases are usually easy to create, access, and extend. The standard
language for relational database interoperability is the Structured Query Language (SQL).
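As a minimal sketch of what "predetermined schema" and "SQL" mean in practice, the example below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names, and rows are hypothetical and only serve as an illustration.

```python
import sqlite3

# In-memory relational database; "customers" and its columns are hypothetical.
conn = sqlite3.connect(":memory:")

# The schema (column names and types) is declared before any data is stored.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Ada", "London"), (2, "Linus", "Helsinki"), (3, "Grace", "London")],
)

# A standard SQL query: filter rows by city and sort the result by name.
london_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", ("London",)
    )
]
print(london_names)
conn.close()
```

The same SELECT statement would work, with little or no change, on most other relational databases, which is the interoperability that SQL provides.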

ii. Non-relational database

A database that does not store data in tables but makes it accessible through special
query APIs. Such systems are commonly called NoSQL (Not Only SQL), which is a family of
databases rather than a standard language. They typically do not impose a fixed schema,
follow the BASE consistency model (basically available, soft-state, eventually consistent),
and use sharding (horizontal partitioning) to scale horizontally. Examples are MongoDB and
CouchDB (they differ mainly in that MongoDB groups documents into collections, while
CouchDB stores documents directly in a database). NoSQL systems commonly use the JavaScript
Object Notation (JSON) data format (BSON, i.e., binary JSON, in MongoDB), and many work as
a key-value store (KVS), i.e., a collection of values whose data types are not declared in
advance (while an RDBMS stores data in tables and knows the data type of each column
exactly).
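The ideas of a schema-less key-value store and of sharding can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how MongoDB or CouchDB are implemented: the number of shards, the helper names `shard_for`, `put`, and `get`, and the sample documents are all hypothetical.

```python
import json
import zlib

NUM_SHARDS = 3  # hypothetical number of partitions (e.g., one per server)
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key):
    # A stable hash (CRC32) so the same key always maps to the same shard.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def put(key, document):
    # Documents are stored as JSON text; no schema is checked.
    shards[shard_for(key)][key] = json.dumps(document)


def get(key):
    raw = shards[shard_for(key)].get(key)
    return None if raw is None else json.loads(raw)


# Two documents with different fields can live in the same store.
put("user:1", {"name": "Ada", "languages": ["Python", "R"]})
put("user:2", {"name": "Linus", "city": "Helsinki"})

print(get("user:1"))
print(get("user:3"))  # a missing key simply returns None
```

Because each key is routed to exactly one shard, adding more shards spreads the data over more machines, which is what horizontal scaling means here.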

iii. Programming language

A formally constructed language designed to communicate instructions to a machine.
The main ones for data science applications are Java, C, C++, C#, R, and Matlab. Scala is
another language that is becoming extremely popular; unlike most of the others, it
combines object-oriented and functional programming.

iv. Hadoop

Open source software for storing and analyzing huge amounts of data on a distributed
system. Its primary storage system is the Hadoop Distributed File System (HDFS), which
splits data into blocks, replicates them, and distributes the copies across different nodes.
It is written in Java. It is a core technology in the big data revolution and stores data in
its native raw format. It can be used for several purposes (Dull, 2014), such as a simple
data staging or landing platform that complements the existing enterprise data warehouse
(EDW), acting as an enterprise data hub (EDH), or for managing data (even small datasets),
transforming it into a specific format in HDFS and sending it back to the EDW, thus
lowering costs while increasing processing power. Furthermore, it can integrate external
data sources and archive data (either on premises or in the cloud), reducing the burden on
a standard EDW.
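The way HDFS duplicates data across nodes can be illustrated with a toy sketch. The code below is not the Hadoop API: the function names, node names, tiny block size, and round-robin placement are all assumptions made for illustration (real HDFS uses much larger blocks and a rack-aware placement policy).

```python
def split_into_blocks(data, block_size):
    # Cut the file content into fixed-size blocks (the last one may be shorter).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replication):
    # Assign each block to `replication` distinct nodes, round-robin.
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


# Hypothetical file content and cluster.
blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_blocks(
    len(blocks), ["node-a", "node-b", "node-c", "node-d"], replication=3
)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

With three copies of every block on different nodes, the loss of any single node leaves at least two copies of each block available.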

v. MapReduce

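A programming model and software framework for processing huge amounts of data in parallel:
a map step turns each input record into key-value pairs, and a reduce step aggregates the
values that share the same key.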
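The classic illustration of this model is a word count. The sketch below runs the phases in a single Python process, so it only shows the data flow, not the parallel execution that a framework such as Hadoop provides; the function names and the sample documents are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input record.
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Reduce: combine all values of one key into a single result.
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

Since each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases can be spread over many machines.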

vi. Flume

A service that gathers, aggregates, and moves chunks of data from several sources to a
centralized system.

vii. Cassandra

An open source database system for storing and managing large amounts of data on a
distributed system. It is characterized by high performance and high availability, with no
single point of failure (i.e., a part of the system that, if it fails, stops the whole system).
It encourages data denormalization, which means grouping data or adding redundant
information in order to optimize database performance.
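Denormalization can be shown with plain Python dictionaries. This is only a sketch of the idea, not Cassandra code: the `users`, `orders`, and `orders_by_user` structures and their fields are hypothetical.

```python
# Normalized form: user details and orders are kept separately,
# so reading an order with the user's name needs two lookups.
users = {1: {"name": "Ada"}, 2: {"name": "Linus"}}
orders = [
    {"order_id": 100, "user_id": 1, "total": 30.0},
    {"order_id": 101, "user_id": 2, "total": 12.5},
    {"order_id": 102, "user_id": 1, "total": 7.5},
]

# Denormalized form: orders are grouped by user and the user's name is
# copied into every row (redundant information), so one lookup is enough.
orders_by_user = {}
for order in orders:
    row = {
        "order_id": order["order_id"],
        "user_name": users[order["user_id"]]["name"],
        "total": order["total"],
    }
    orders_by_user.setdefault(order["user_id"], []).append(row)

print(orders_by_user[1])
```

The price of the faster read is that the copied name has to be updated in several places if it ever changes.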

viii. Distributed system

Multiple computers (nodes) that communicate with each other over a network. A problem is
divided into many tasks, which are assigned to the individual nodes. The system is highly
scalable, since its capacity grows as further nodes are added.
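The "divide the problem into tasks" idea can be sketched on a single machine, with worker threads standing in for the nodes. This is an assumption made to keep the example runnable anywhere: a real distributed system would send the chunks over a network. The helper names and the chosen problem (a sum of squares) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    # The task each worker ("node") performs on its share of the data.
    return sum(x * x for x in chunk)


def split(items, parts):
    # Divide the input into at most `parts` chunks of nearly equal size.
    size = max(1, -(-len(items) // parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


numbers = list(range(1, 101))
chunks = split(numbers, parts=4)

# Each chunk is processed independently; the partial results are then combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(sum_of_squares, chunks))

total = sum(partial_results)
print(total)  # 338350
```

Adding workers (or nodes) only changes `parts` and `max_workers`; the task itself stays the same, which is why this kind of design scales well.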

ix. Google File System

Google's proprietary distributed file system for efficiently managing large datasets.

x. HBase

An open source non-relational (column-oriented) database built on top of HDFS. It is very
useful for real-time random read and write access to data, as well as for storing sparse data.
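Why a column-oriented layout suits sparse data can be sketched with nested dictionaries. This is not the HBase client API: `put_cell`, `get_cell`, and the `family:qualifier` column names are hypothetical and only mimic the general shape of the data model.

```python
from collections import defaultdict

# row key -> {"family:qualifier": value}; only cells that exist are stored.
table = defaultdict(dict)


def put_cell(row_key, column, value):
    table[row_key][column] = value


def get_cell(row_key, column):
    # Random read by row key and column; absent cells return None.
    return table.get(row_key, {}).get(column)


put_cell("row1", "info:name", "Ada")
put_cell("row1", "info:email", "ada@example.org")
put_cell("row2", "info:name", "Linus")  # row2 has no email cell at all

stored_cells = sum(len(columns) for columns in table.values())
print(stored_cells)
print(get_cell("row2", "info:email"))
```

A relational table with the same two columns would reserve a slot (a NULL) for the missing email, whereas here an absent cell takes no space.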
