S. Sarumathi, N. Shanthi
Comprehensive Analysis of Data Mining Tools
World Academy of Science, Engineering and Technology
International Journal of Computer and Information Engineering
Vol:9, No:3, 2015
International Scholarly and Scientific Research & Innovation 9(3) 2015 ISNI:0000000091950263
Open Science Index, Computer and Information Engineering Vol:9, No:3, 2015 publications.waset.org/10002609/pdf

Abstract

Rapid technological innovation has led to enormous volumes of data accumulating worldwide in domains such as pattern recognition, machine learning, spatial data mining, image analysis, fraud analysis, and the World Wide Web. This growth makes it essential to develop tools that provide data mining functionality. The main aim of this paper is to analyze various tools that are used to build analytical or descriptive models for handling large amounts of information efficiently and in a user-friendly way. The survey describes each tool's technical approach, graphical interface, and built-in algorithms, which make the tools useful for handling significant amounts of data.

Keywords: Classification, Clustering, Data Mining, Machine Learning, Visualization

I. INTRODUCTION

The use of data mining and knowledge discovery in research fields such as pattern recognition, information retrieval, medicine, image processing, spatial data extraction, business, and education has increased tremendously over time. Data mining seeks to devise, analyze, and implement induction processes that extract meaningful information and useful patterns from large volumes of unstructured data. It relies mainly on complex algorithms and mathematical analysis to derive the patterns and trends that exist in data. The main aim of data mining is to build effective predictive and descriptive models of enormous amounts of data. Many real-world data mining problems involve several conflicting performance measures or objectives that must be optimized simultaneously. A distinctive feature of data mining is that it deals with huge and complex datasets, whose volume ranges from gigabytes to terabytes. This requires data mining operations and algorithms to be robust, stable, and scalable, and to work across different research domains. Data mining tasks therefore play a crucial role in every aspect of information extraction, which in turn has led to the emergence of many data mining tools. From a pragmatic perspective, tools with a graphical interface tend to be more efficient, more user-friendly, and easier to operate, which is why researchers prefer them [1].

Mrs. S. Sarumathi, Associate Professor, is with the Department of Information Technology, K. S. Rangasamy College of Technology, Tamil Nadu, India (phone: 9443321692; e-mail: rishi_saru20@rediffmail.com). Dr. N. Shanthi, Professor and Dean, is with the Department of Computer Science Engineering, Nandha Engineering College, Tamil Nadu, India (e-mail: [email protected]).

Fig. 1 A Data Mining Framework

The relationships between the elements of the framework are drawn with data modeling notation that indicates the cardinality, 1 or m, of each relationship. For readers less familiar with data modeling notation, the relationships can be read as follows:

  • A business problem can be studied with more than one class of modeling approach, and a class of modeling approach is useful for multiple business problems.
  • More than one method can be used for any class of model, and any given method can be used for more than one class of model.
  • There is normally more than one way of implementing any given method.
  • Data mining tools may support more than one method, and every method is supported by more than one vendor's products.
  • For every given method, a particular product supports a particular implementation algorithm [2].

II. DIFFERENT DATA MINING TOOLS

A. DATABIONIC

The Databionic Emergent Self-Organizing Map (ESOM) tool [3] is a collection of programs for data mining tasks such as visualization, clustering, and classification. The training data is a set of points from a high-dimensional space known as the data space. A SOM consists of a set of prototype vectors in the data space plus a topology between these prototypes. A commonly used topology is a 2-dimensional grid in which every prototype (neuron) has four direct neighbors; the positions on the grid form the map space. Two distance functions are also needed, one for each space: Euclidean distance is normally used for the data space and city-block distance for the map space.

The purpose of SOM training is to adapt the grid of prototype vectors to the given data, producing a 2-dimensional projection that preserves the topology of the data space. The central operation in SOM training is the update of a prototype vector by a vector of the training data. The prototype vector of the neuron is drawn closer to the given vector in the data space, and the prototypes in the neighborhood of that neuron are drawn in the same direction with less emphasis. During training, both the emphasis and the size of the neighborhood are reduced. Online and batch training are the two common training algorithms. Both search for the closest prototype vector (the best match) for each data point. In online training the best match is updated immediately, whereas in batch training the best matches are first collected for all data points and the updates are then performed together.

Emergence is the capacity of a system to develop higher-level structures through the cooperation of many elementary processes. In self-organizing systems the structures evolve inside the system without external influence. Emergence is the appearance of high-level phenomena that cannot be derived from the elementary processes alone. An emergent structure provides an abstract description of a complex system made up of low-level individuals. The principles of self-organization are transferred to data analysis by letting multivariate data points organize themselves into homogeneous groups. The self-organizing map is a well-known tool for this task that incorporates these principles. The SOM iteratively adapts to the distance structures of a high-dimensional space and provides a low-dimensional projection that preserves the topology of the input space as far as possible. The map can be used for unsupervised clustering and supervised classification. The power of self-organization to let structure emerge in data is, however, frequently neglected. Much of the scientific literature misuses the SOM in this respect: the maps used by some authors are commonly very small, consisting of only some tens of neurons.
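The online update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the Databionic ESOM implementation: the function names (`online_update`, `train_som`), the Gaussian neighborhood kernel, the linear cooling schedule, and all default parameter values are choices made for this example only.

```python
import math
import random


def euclidean(a, b):
    """Distance between two vectors in the data space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(p, q):
    """Distance between two grid positions in the map space."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def best_match(grid, x):
    """Grid position of the prototype closest to x in the data space."""
    return min(grid, key=lambda pos: euclidean(grid[pos], x))


def online_update(grid, x, rate, radius):
    """One online step: pull the best match and its grid neighbours towards x."""
    bmu = best_match(grid, x)
    for pos, proto in grid.items():
        d = city_block(pos, bmu)
        if d <= radius:
            # Assumed Gaussian kernel: 1 at the best match, smaller for neighbours.
            h = math.exp(-(d * d) / (2.0 * radius * radius))
            for i in range(len(proto)):
                proto[i] += rate * h * (x[i] - proto[i])
    return bmu


def train_som(data, rows, cols, epochs=20, rate0=0.5, radius0=2.0, seed=0):
    """Online training with (assumed) linear cooling of rate and radius."""
    rng = random.Random(seed)
    # Initialise every prototype with a copy of a randomly chosen data vector.
    grid = {(r, c): list(rng.choice(data))
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        cooling = 1.0 - epoch / epochs
        rate = rate0 * cooling
        radius = max(radius0 * cooling, 0.5)  # always update at least the best match
        for x in rng.sample(data, len(data)):
            online_update(grid, x, rate, radius)
    return grid
```

A batch variant would first collect the best match for every data point and only then apply all updates together, which is the difference between the two training algorithms named in the text.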

1. Features
a) Training of ESOM with different initialization methods, distance functions, ESOM grid topologies, training algorithms, neighborhood kernels, and parameter cooling strategies
b) Visualization of the high-dimensional data space using the P-Matrix, SDH, and more
c) Animated visualization of the training process
d) Linking of the ESOM
e) Scales to more data; clustering and interactive, explorative data analysis linked to the training data, data descriptions, and data classifications
f) Creation of an ESOM classifier and automated application to new data
g) Creation of non-redundant U-Maps from toroid ESOM
h) Databionic ESOM Analyzer
i) U-Map of the Hexa dataset

B. ELKI

ELKI [4] is open-source (AGPLv3) data mining software written in Java. ELKI is aimed at research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. To achieve high performance and scalability, ELKI provides data index structures such as the R*-tree that can deliver major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain and welcomes contributions, in particular of new methods. ELKI aims to provide a large collection of highly parameterizable algorithms to allow easy and fair evaluation and benchmarking of algorithms. Data mining research produces many algorithms for similar tasks. A fair and useful comparison of these algorithms is difficult for several reasons:

  • Implementations of comparison partners are often not available.
  • When implementations by different authors are available, an evaluation in terms of efficiency is biased: it measures the authors' efforts at efficient programming rather than algorithmic merit.

On the other hand, efficient data management tools such as index structures can have a considerable impact on data mining tasks and are useful for a variety of algorithms. In ELKI, data mining algorithms and data management tasks are separated, which allows them to be evaluated independently. This separation makes ELKI unique among data mining frameworks such as Weka or RapidMiner and among frameworks for index structures such as GiST. At the same time, ELKI is open to arbitrary data types, distance or similarity measures, and file formats. The fundamental approach is the independence of file parsers or database connections, data mining algorithms, distances, and distance functions. The ELKI developers hope to serve the data mining and database research community with the development and publication of ELKI. The framework is open source for scientific use. When ELKI is used in scientific publications, the developers would appreciate credit in the form of a citation of the appropriate publication, that is, the publication related to the ELKI release that was used.

The design goals are extensibility, contributions, completeness, fairness, performance, and progress:

  • Extensibility: ELKI has a modular design and allows arbitrary combinations of data types, input formats, distance functions, index structures, algorithms, and evaluation methods.
  • Contributions: ELKI grows through contributions. Its modular design allows small contributions such as single distance functions and single algorithms, and students and external contributors participate in its development.
  • Completeness: for an exhaustive comparison of methods, ELKI aims to cover as much published and credited work as possible.
  • Fairness: it is easy to make an unfair comparison by implementing a competitor poorly. ELKI tries to implement every method well and publishes the source code to allow external improvements. Proposed improvements are added, such as index structures for faster range and kNN queries.
  • Performance: the modular architecture of ELKI allows optimized versions of algorithms and index structures for acceleration.
  • Progress: ELKI changes with each release.
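The separation of algorithms, distance functions, and index structures can be illustrated with a small sketch. It is written in Python for brevity and is not ELKI's Java API: `LinearScanIndex` and `knn_outlier_scores` are hypothetical names, and a plain linear scan stands in for a real index structure such as an R*-tree. The point is only that the outlier algorithm never touches the distance function or the storage directly, so either can be swapped without changing the algorithm.

```python
import heapq
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


class LinearScanIndex:
    """Stand-in for an index structure: answers kNN and range queries by scanning."""

    def __init__(self, points, distance):
        self.points = list(points)
        self.distance = distance  # any callable (a, b) -> float

    def knn(self, query, k):
        """The k smallest (distance, point index) pairs for the query."""
        scored = ((self.distance(query, p), i) for i, p in enumerate(self.points))
        return heapq.nsmallest(k, scored)

    def range_query(self, query, radius):
        """All (distance, point index) pairs within the given radius, sorted."""
        hits = []
        for i, p in enumerate(self.points):
            d = self.distance(query, p)
            if d <= radius:
                hits.append((d, i))
        return sorted(hits)


def knn_outlier_scores(index, k):
    """Score each point by the distance to its k-th nearest neighbour.

    Needs at least k + 1 points; the query point itself is excluded.
    """
    scores = []
    for i, p in enumerate(index.points):
        neighbours = [d for d, j in index.knn(p, k + 1) if j != i][:k]
        scores.append(neighbours[-1])
    return scores


points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
for dist in (euclidean, manhattan):
    scores = knn_outlier_scores(LinearScanIndex(points, dist), k=2)
    print(dist.__name__, scores.index(max(scores)))  # index of the strongest outlier
```

Because the algorithm only calls `knn`, replacing the linear scan with a faster index would change run time but not the scores, which is the kind of independent evaluation the text describes.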


... interface in Python. Scikit-learn is licensed under a permissive simplified BSD license, is distributed with many Linux distributions, and may be used commercially. The library is built on SciPy (Scientific Python), which must be installed before Scikit-learn can be used. This stack includes NumPy, the base n-dimensional array package; SciPy, the fundamental library for scientific computing; Matplotlib, for comprehensive 2D and 3D plotting; IPython, an enhanced interactive console; SymPy, for symbolic mathematics; and Pandas, for data structures and analysis. Extensions or modules for SciPy are conventionally named SciKits. This module provides learning algorithms and is therefore named Scikit-learn. The vision for the library is a level of robustness and support suitable for use in production systems. This means a strong focus on concerns such as code quality, performance, ease of use, documentation, and collaboration. Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, LAPACK, and LibSVM, together with the careful use of Cython.

1. Features
a) It offers both supervised and unsupervised learning algorithms
b) It is easy to use
c) It has high performance

F. Shogun

Shogun is a free and open-source toolbox [8] written in C++. It provides many algorithms and data structures for machine learning problems and is licensed under the terms of the GNU General Public License version 3. The focus of Shogun is on kernel machines such as support vector machines for classification and regression problems. Shogun also provides a full implementation of hidden Markov models. The core of Shogun is written in C++, and it offers interfaces for other languages such as R, Java, and C#. Shogun has been under active development since 1999. A vibrant user community uses Shogun as a base for research and education and contributes to the core package. Shogun supports support vector machines, linear discriminant analysis, kernel perceptrons, k-nearest neighbors, hidden Markov models, clustering algorithms such as k-means and GMM, dimensionality reduction and embedding methods such as Isomap, PCA, and Linear Local Tangent Space Alignment, kernel ridge regression, and support vector regression. Many different kernels are implemented, ranging from kernels for numerical data to kernels on special data. The kernels currently implemented for numeric data include the linear, polynomial, Gaussian, and sigmoid kernels. The kernels supported for special data include Spectrum, Weighted Degree, and Weighted Degree with Shifts. This latter group of kernels allows the processing of arbitrary sequences over fixed alphabets, such as DNA sequences or whole e-mail texts.
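To make the kernel vocabulary concrete, the sketch below writes three of the numeric kernels named above as plain Python functions and combines them into a weighted linear combination, which is the idea behind a combined kernel. It does not use Shogun's API: every function name, the `width` parameterization of the Gaussian kernel, and the example weights are assumptions made for this illustration. In multiple kernel learning the weights would be learned from data rather than fixed by hand as they are here.

```python
import math


def linear_kernel(x, y):
    """k(x, y) = <x, y>"""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(degree=2, c=1.0):
    """k(x, y) = (<x, y> + c) ** degree"""
    def k(x, y):
        return (linear_kernel(x, y) + c) ** degree
    return k


def gaussian_kernel(width=1.0):
    """k(x, y) = exp(-||x - y||^2 / width); this parameterization is an assumption."""
    def k(x, y):
        sq = sum((a - b) ** 2 for a, b in zip(x, y))
        return math.exp(-sq / width)
    return k


def combined_kernel(kernels, weights):
    """k(x, y) = sum_i w_i * k_i(x, y), a linear combination of kernels."""
    def k(x, y):
        return sum(w * kern(x, y) for w, kern in zip(weights, kernels))
    return k


def kernel_matrix(kernel, points):
    """Pre-computed kernel matrix K[i][j] = k(points[i], points[j])."""
    return [[kernel(p, q) for q in points] for p in points]


# Hand-picked weights, for illustration only.
mix = combined_kernel(
    [linear_kernel, polynomial_kernel(degree=2, c=1.0), gaussian_kernel(width=2.0)],
    [0.5, 0.25, 0.25],
)
K = kernel_matrix(mix, [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0)])
```

A matrix such as `K` is what a kernel machine would consume when it is given a pre-calculated kernel instead of a kernel function.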
1. Features
a) It supports pre-calculated kernels
b) It can use a combined kernel, that is, a kernel consisting of a linear combination of arbitrary kernels
c) It provides multiple kernel learning functionality
d) The coefficients, or weights, of the linear combination are learned

G. FITYK

FITYK [9] is a program for nonlinear fitting of analytical functions to data. Its shortest description is peak fitting software. Some people use it only to remove the baseline from data or to display data. It is used in Raman spectroscopy, crystallography, and other fields. The authors have a general understanding of experimental techniques other than powder diffraction and would like to make the program useful to as many groups as possible. FITYK offers a variety of nonlinear fitting methods, simple background subtraction and other manipulations of datasets, support for the analysis of series of datasets, easy placement of peaks and changing of peak parameters, and more. Flexibility is the main advantage of the program: the parameters of peaks can be arbitrarily bound to each other. For example, the width of a peak can be an independent variable, can be the same as the width of another peak, or can be given by a complex formula.

1. GUI vs. CLI
The program comes in two versions: a Graphical User Interface (GUI) version, which is more comfortable for most users, and a Command Line Interface (CLI) version named cfityk.
2. Features
a) Intuitive graphical and command line interfaces
b) Support for many data file formats, thanks to the xylib library
c) A set of built-in functions and support for user-defined functions
d) Equality constraints
e) Fitting of systematic errors of the x coordinate of points
f) Manual, graphical placement of peaks and automatic placement using a peak detection algorithm
g) Several optimization methods
h) Handling of series of datasets
i) Automation with macros and embedded Lua for more complex scripting
j) Accuracy of nonlinear regression verified with reference datasets from NIST
k) An add-on for powder diffraction data
l) Modular architecture
m) Open source license (GPL)

H. PyBrain

PyBrain [10] is a modular machine learning library for Python. Its goal is to offer flexible, easy-to-use yet still powerful algorithms for machine learning tasks, together with a variety of predefined environments in which to test and compare algorithms. PyBrain stands for Python-Based Reinforcement Learning, Artificial Intelligence and Neural Network Library. In fact, the developers came up with the name first and later reverse-engineered this rather descriptive "backronym".


1. How Is PyBrain Different?
While there are a few machine learning libraries available, PyBrain aims to be a very easy-to-use modular library that can be used by entry-level students but still offers the flexibility and algorithms needed for state-of-the-art research. The developers are continually working on faster algorithms, developing new environments, and improving usability.

2. What PyBrain Can Do?
PyBrain is a tool for real-life tasks. It contains algorithms for neural networks, for unsupervised learning, for reinforcement learning, and for evolution. Because most problems deal with continuous state and action spaces, function approximators must be used to cope with the large dimensionality. The library is built around neural networks at its core, and most of the training methods accept a neural network as the to-be-trained instance.
3. Features
a) Supervised Learning
b) Black-box Optimization / Evolutionary Methods
c) Reinforcement Learning
d) Tasks and Benchmarks
e) Compositionality

I. UIMA

UIMA [11] stands for Unstructured Information Management Architecture. Unstructured information management (UIM) applications are software systems that analyze large volumes of unstructured information in order to discover knowledge that is relevant to an end user. An example UIM application might ingest plain text and identify entities such as persons and organizations. UIMA allows applications to be decomposed into components. Each component implements interfaces defined by the framework and provides self-describing metadata through XML descriptor files. The framework manages these components and the data flow between them. Components are written in Java or C++, and the data that flows between components is designed for efficient mapping between these languages. UIMA also provides capabilities to wrap components as network services and can scale to very large volumes by replicating processing pipelines over a cluster of networked nodes.

Apache UIMA is an Apache-licensed open-source implementation of the UIMA specification. Its frameworks, components, and infrastructure are all licensed under the Apache license. The frameworks, which run the components, are available for both Java and C++. The Java framework supports both Java and non-Java components. The C++ framework supports annotators written in C or C++ and also supports Perl, Python, and Tcl annotators. UIMA-AS and UIMA-DUCC are scale-out frameworks and are add-ons to the Java framework. UIMA-AS provides a flexible scale-out capability based on the Java Messaging Service and ActiveMQ. UIMA-DUCC extends UIMA-AS by providing cluster management services to scale out UIMA pipelines over computing clusters. The frameworks support configuring and running pipelines of annotator components. These components do the actual work of analyzing the unstructured information. Users can write their own annotators or configure and use pre-existing ones. Some annotators are readily available, and others are in various repositories on the Internet. Additional infrastructure support components include a simple server that can receive REST requests and return annotation results.
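The idea of decomposing an application into annotator components that run as a pipeline over a shared document can be sketched as follows. This is a conceptual illustration in Python and does not use the Apache UIMA API: `Document`, `Annotation`, `RegexAnnotator`, and `run_pipeline` are hypothetical names, and the two regular expressions are toy patterns chosen for the example text only.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str   # e.g. "Person" or "Organization"
    begin: int  # character offsets into the document text
    end: int


@dataclass
class Document:
    """Shared analysis structure handed from one component to the next."""
    text: str
    annotations: list = field(default_factory=list)

    def covered_text(self, ann):
        return self.text[ann.begin:ann.end]


class RegexAnnotator:
    """A component that marks every match of a pattern with one annotation kind."""

    def __init__(self, kind, pattern):
        self.kind = kind
        self.pattern = re.compile(pattern)

    def process(self, doc):
        for m in self.pattern.finditer(doc.text):
            doc.annotations.append(Annotation(self.kind, m.start(), m.end()))


def run_pipeline(components, doc):
    """Run every component in order over the same document."""
    for component in components:
        component.process(doc)
    return doc


pipeline = [
    RegexAnnotator("Person", r"\b(?:Alice Smith|Bob Jones)\b"),          # toy name list
    RegexAnnotator("Organization", r"\b[A-Z][a-z]+ (?:Corp|Inc)\b"),     # toy pattern
]
doc = run_pipeline(
    pipeline,
    Document("Alice Smith works at Acme Corp. Bob Jones joined Initech Inc."),
)
for ann in doc.annotations:
    print(ann.kind, doc.covered_text(ann))
```

Each component only reads the text and appends annotations, so components can be reordered, replaced, or replicated without knowing about one another, which is the property that lets a framework manage the data flow between them.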

1. Features
a) Platform-independent data representations and interfaces for text and multi-modal analytics
b) Analysis of unstructured information

J. Natural Language Toolkit (NLTK)

NLTK is a leading platform for building Python programs that work with human language data. It offers easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and an active discussion forum. NLTK is suitable for researchers, linguists, and others. It is available for Linux, Windows, and Mac OS X. NLTK is a free, open-source, community-driven project. It has been called a tool for working in computational linguistics using Python and a library to play with natural language. It offers an introduction to programming for language processing and covers the basics of writing Python programs, among other topics [12].

NLTK was designed with four main goals:
a) Simplicity: to provide an intuitive framework with substantial building blocks, giving users practical knowledge of NLP without getting bogged down in the tedious housekeeping usually associated with processing annotated language data.
b) Consistency: to provide a uniform framework with consistent interfaces and data structures and easily guessable method names.
c) Extensibility: to provide a structure into which new software modules can be easily accommodated, including alternative implementations and competing approaches to the same task.
d) Modularity: to provide components that can be used independently without needing to understand the rest of the toolkit.

NLTK was developed at the University of Pennsylvania in 2001 in conjunction with a computational linguistics course. It was designed with three applications in mind: assignments, demonstrations, and projects.

1. Assignments
NLTK supports assignments of varying difficulty and scope. In the simplest assignments, students experiment with existing components to perform a variety of NLP tasks. As students become more familiar with the toolkit, they can modify existing components or build complete systems out of existing components.


1. Containers
An API wrapper provides a portable object-oriented interface for networking, file browsing, multithreading, and GUI development. Programs written with it can be compiled on POSIX or MS Windows platforms without changing the code [17].

2. Graph Tools
There are two different types of graph representations in dlib. On the one hand, there are graphs based on an object that encapsulates the whole graph, such as the graph and directed_graph objects. On the other hand, there are graphs represented as simple vectors of edges. In this case, vectors of sample_pair or ordered_sample_pair objects are used for undirected and directed graphs, respectively [18].

3. Machine Learning
A major design goal of this part of the library is to provide a highly modular and simple architecture for dealing with kernel algorithms. Each algorithm is parameterized to allow the user to supply either one of the predefined dlib kernels or a new user-defined kernel. In addition, the implementations of the algorithms are completely separated from the data on which they operate. This makes the dlib implementation generic enough to operate on any kind of data, be it column vectors, images, or some other form of structured data. All that is needed is an appropriate kernel [19].
4. Features
a) Documentation
b) High Quality Portable Code
c) Threading
d) Networking
e) Graphical User Interfaces
f) Numerical Algorithms
g) Machine Learning Algorithms
h) Graphical Model Inference Algorithms
i) Image Processing
j) Data Compression and Integrity Algorithms
k) Testing
l) General Utilities

L. Jubatus

Jubatus is the first open-source platform for online machine learning and a distributed computing framework for the data streams of Big Data. Jubatus offers many features, such as regression, classification, and data mining [20]. Computer science faces new challenges in Big Data applications such as nation-wide M2M sensor network analysis, real-time security monitoring of raw Internet traffic, and online advertising optimization for millions of customers. In these applications it is futile to apply the usual approach to data analysis for small datasets, which accumulates all data in databases, analyzes the stored data as a batch process, and visualizes the summarized results. Future data analytics platforms should expand in three directions at the same time: handling bigger data, performing in real time, and applying deep analytics. There has been no such analytics platform with a distributed scale-out architecture for massive data streams of continuously generated Big Data.

Jubatus uses a loose model sharing architecture for efficient training and sharing of machine learning models. It defines three fundamental operations, Update, Analyze, and Mix, in a similar way to the Map and Reduce operations in Hadoop. The key is to reduce the size of the model and the number of Mix operations while maintaining high accuracy, because mixing large models many times causes high network cost and high latency in a distributed environment. The development team includes researchers who combine the latest advances in online machine learning, randomized algorithms, and distributed computing to provide efficient machine learning features for Jubatus. At present it supports basic tasks including regression and outlier detection. A demo system for tweet classification on fast Twitter data streams is available.
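The interplay of the three operations can be illustrated with a toy example. The sketch below is not Jubatus code: it assumes a perceptron-style linear classifier as the local model and plain weight averaging as the Mix step, and the class and function names are hypothetical. Each server updates its own model from its own part of the stream, answers queries from that local model, and only occasionally exchanges model parameters.

```python
class OnlineLinearModel:
    """Perceptron-style binary classifier held by one server (assumed model)."""

    def __init__(self, dim):
        self.w = [0.0] * dim

    def analyze(self, x):
        """Classify x with the current local model; returns +1 or -1."""
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score >= 0 else -1

    def update(self, x, label):
        """Update the local model immediately from one labelled example."""
        if self.analyze(x) != label:
            self.w = [wi + label * xi for wi, xi in zip(self.w, x)]


def mix(models):
    """Average the local models and hand the result back to every server."""
    n = len(models)
    dim = len(models[0].w)
    mixed = [sum(m.w[i] for m in models) / n for i in range(dim)]
    for m in models:
        m.w = list(mixed)
    return mixed


# Two servers see different parts of the stream.
server_a, server_b = OnlineLinearModel(2), OnlineLinearModel(2)
for x, label in [((1.0, 0.0), 1), ((-1.0, 0.0), -1)]:
    server_a.update(x, label)
for x, label in [((0.0, 1.0), 1), ((0.0, -1.0), -1)]:
    server_b.update(x, label)

before = server_a.analyze((0.0, -1.0))  # server A has never seen this feature
mixed = mix([server_a, server_b])
after = server_a.analyze((0.0, -1.0))   # now uses what server B learned
print(before, mixed, after)
```

Only the weight vectors travel between servers during Mix, never the raw data, which is why the text stresses keeping the model small and the number of Mix operations low.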

1. Scalable
Jubatus supports scalable machine learning processing. It can handle many data items per second on commodity hardware clusters. It is designed for clusters of commodity, shared-nothing hardware.

2. Real-Time
It updates a model instantaneously upon receiving data and analyzes the data at the same time.

3. Deep Analysis
It supports various tasks for deep analysis, including graph analysis, clustering, and regression.

4. Difference from Hadoop and Mahout
  • There are several points among Hadoop or Mahout and Jubatus. They are scalable plus working on commodity hardware. Hadoop is not prepared with sophisticated machine learning algorithms as the majority of the algorithms do not fit its Map Reduce paradigm. While Apache Mahout is a Hadoop-based machine learning platform and online processing of data streams is out of the scope.
  • Jubatus procedures, data in an online manner plus obtain high throughput and low latency. To attain these features it makes use of unique loosely model synchronization for scale out plus quick model sharing in distributed environments.
  • They procedures all data in memory plus focus on operations for data analysis [21]. Jubatus is a scalable dispersed computing structure designed for online machine learning. The starting point of the name is the Latin word for that agile animal the cheetah. The foremost aim of Jubatus is to make possible speedy and profound analysis of stream-type big data. The processing consists of desires such as profound, fast, and large volume. Jubatus satisfies both profound analyses along with scalability. Profound analysis is the mechanical categorization of formless information planned for human beings like natural language processing plus mechanical multi-category classification by

International Journal of Computer and Information Engineering Vol:9, No:3, 2015

Open Science Index, Computer and Information Engineering Vol:9, No:3, 2015 publications.waset.org/10002609/pdf

maximum speed without lag of a data stream. These three requirements have a trade-off relationship and are inherently difficult to satisfy simultaneously. Jubatus satisfies both deep analysis and scalability. Here, deep analysis is the automatic classification of unstructured information intended for human beings, such as natural language. Furthermore, it replaces human labor for vaguely formulated processing work such as prediction. From a technical perspective, it involves challenges in the area of machine learning. Scalability covers two issues: (1) increases in processing requests and (2) increases in data size. Issue (1) is further divided into throughput and response: in general, batch processing focuses on throughput, whereas real-time processing focuses on response. Approaches to issue (2) either process the data without waiting, or divide and store it. The design separates the deep analysis functionality from the non-functional requirement of scalability. For deep analysis, the logic of online machine learning is designed as an engine, comparable to a CPU, that can be continually upgraded like removable analysis modules. For scalability, the design is a general infrastructure, comparable to a motherboard, that is scaled by installing analysis modules into the common framework. The ultimate aim of Jubatus is to provide everyone with scalable machine learning. The main strategy is to offer an easy-to-understand, easy-to-use online machine learning framework for large data, with hardware that scales out using inexpensive commodity servers to enable massively parallel distributed processing, and software whose use is not restricted to a few specialists such as data scientists and programmers.
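The separation between removable analysis engines and a general framework can be sketched as a small plug-in registry. This is only an illustration of the design idea under stated assumptions: the class names, the `train`/`analyze` interface, and the two toy engines are hypothetical and are not taken from the Jubatus code base.

```python
class RunningMeanEngine:
    """Hypothetical regression module: predicts the mean of the targets seen so far."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def train(self, features, target):
        self.count += 1
        self.total += target

    def analyze(self, features):
        return self.total / self.count if self.count else 0.0


class MajorityLabelEngine:
    """Hypothetical classification module: predicts the most frequent label so far."""

    def __init__(self):
        self.counts = {}

    def train(self, features, target):
        self.counts[target] = self.counts.get(target, 0) + 1

    def analyze(self, features):
        return max(self.counts, key=self.counts.get) if self.counts else None


class Framework:
    """Generic infrastructure (the 'motherboard'): it only routes calls to engines."""

    def __init__(self):
        self._engines = {}

    def install(self, name, engine):
        # Any object with train() and analyze() can be plugged in or replaced.
        self._engines[name] = engine

    def remove(self, name):
        del self._engines[name]

    def train(self, name, features, target):
        self._engines[name].train(features, target)

    def analyze(self, name, features):
        return self._engines[name].analyze(features)
```

Because the framework depends only on the two-method interface, an engine can be upgraded by calling `install` again under the same name without touching the infrastructure code.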

  1. Applicable Areas An important application of Jubatus is the classification of Twitter information. Information expressed in natural language and presented in a small number of characters was taken as the subject of the analysis. Jubatus rapidly classifies 2000 tweets per second into their matching business categories and provides the information to an analysis application. This function uses an online machine learning technique, the Jubatus classifier, which performs multi-valued classification.
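Online multi-valued classification of short texts can be sketched as follows. The example substitutes a plain multi-class perceptron over bag-of-words features for the Jubatus classifier, whose actual algorithms are not described here; the class name, the categories and the sample texts are all hypothetical.

```python
class OnlineTextClassifier:
    """Multi-class perceptron over bag-of-words features, trained one text at a time."""

    def __init__(self):
        self.weights = {}  # label -> {word: weight}

    @staticmethod
    def features(text):
        counts = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0.0) + 1.0
        return counts

    def _score(self, label, feats):
        w = self.weights[label]
        return sum(w.get(word, 0.0) * value for word, value in feats.items())

    def classify(self, text):
        if not self.weights:
            return None
        feats = self.features(text)
        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.weights), key=lambda label: self._score(label, feats))

    def train(self, text, label):
        self.weights.setdefault(label, {})
        predicted = self.classify(text)
        if predicted == label:
            return  # correct prediction: the model is left unchanged
        # Mistake: move weights toward the true label, away from the predicted one.
        good, bad = self.weights[label], self.weights[predicted]
        for word, value in self.features(text).items():
            good[word] = good.get(word, 0.0) + value
            bad[word] = bad.get(word, 0.0) - value


def train_demo():
    """Train on a tiny hypothetical stream of (text, business category) pairs."""
    clf = OnlineTextClassifier()
    stream = [
        ("new phone battery lasts all day", "electronics"),
        ("great pizza and pasta downtown", "food"),
        ("laptop screen and battery upgrade", "electronics"),
        ("best burger and pizza in town", "food"),
    ]
    for text, label in stream:
        clf.train(text, label)
    return clf
```

Each text is processed once and then discarded, so memory use depends on the vocabulary and the number of categories rather than on the length of the stream.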

  2. Architecture and Functions Jubatus is composed of a cluster of machine learning engines and a high-speed framework that supports them. In contrast to other machine learning engines, which typically handle small- to medium-scale data and require batch processing and individual development, Jubatus has a wide variety of engines installed in a high-speed framework, together with a mechanism, built to a common specification, for high-speed processing of large data that tolerates errors within a permitted range. The supported categories of machine learning include regression, classification and statistics. Jubatus is useful for applications that need rapid judgment. It is also used to discover and analyze new relationships among volumes of data from different domains.

  3. Distributed Processing Architecture Large data streams flow from left to right. Clients consist of a number of user processes together with proxy processes. The proxy processes relay the clients' requests to the server processes, which makes the servers transparent to the user processes. User processes are implemented using the Jubatus client application programming interface (API) and can be written in a general-purpose programming language or a scripting language. Communication between proxy processes and server processes is based on MessagePack remote procedure calls. Non-blocking input/output enables more efficient communication and synchronization control. The server processes perform training and prediction as well as learning-model synchronization, and performance increases linearly with the number of servers. ZooKeeper processes manage the cooperation between proxy and server processes, the load balancing among distributed servers, the selection of new leaders, and the monitoring of whether each server is alive or dead. In parallel processing across distributed servers, the main technique for satisfying both deep analysis and scalability is mix processing. Mix processing can be thought of as resembling a group study meeting, in which servers learn on their own and then synchronize with one another [22].
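The role of the proxy and of the coordination service can be sketched with an in-process simulation. The sketch makes several simplifying assumptions: ordinary method calls stand in for MessagePack remote procedure calls, a small `Coordinator` class stands in for ZooKeeper, and round-robin dispatch is assumed as the balancing policy. All names are hypothetical.

```python
import itertools


class Server:
    """Stand-in for a server process that counts the training requests it receives."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.seen = 0

    def handle(self, method, payload):
        if method == "train":
            self.seen += 1
            return {"server": self.name, "ok": True}
        if method == "status":
            return {"server": self.name, "seen": self.seen}
        raise ValueError(f"unknown method: {method}")


class Coordinator:
    """Stand-in for the coordination service: it knows which servers are alive."""

    def __init__(self, servers):
        self.servers = list(servers)

    def live_servers(self):
        return [s for s in self.servers if s.alive]


class Proxy:
    """Relays client requests so that clients never address a server directly."""

    def __init__(self, coordinator):
        self.coordinator = coordinator
        self._counter = itertools.count()

    def call(self, method, payload=None):
        live = self.coordinator.live_servers()
        if not live:
            raise RuntimeError("no live servers")
        # Assumed policy: rotate over whichever servers are currently alive.
        server = live[next(self._counter) % len(live)]
        return server.handle(method, payload)
```

A client only ever calls `proxy.call("train", datum)`; when a server is marked dead, subsequent requests are spread over the remaining servers without any change on the client side.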

  4. Features
  a) Multi-classification algorithms
  b) Regression algorithms
  c) Feature extraction method for natural language

M. SCaVis

SCaVis is an environment for scientific computation, data analysis and data visualization designed for students, scientists and engineers. The program integrates multiple open-source software packages into a coherent interface using the concept of dynamic scripting. SCaVis is used wherever analysis of large numerical data volumes, statistical analysis, data mining and mathematical computation are needed, for example in modeling, natural sciences, financial market analysis and engineering. The SCaVis program is fully multiplatform and works on any platform where Java is installed. As a Java application, SCaVis takes full advantage of multicore processors. SCaVis can be used with various scripting languages for the Java platform, such as BeanShell, Groovy, Jython and JRuby, which brings more power and simplicity to scientific computation. Programming can also be done in native Java. Finally, symbolic calculations can be done using the Matlab or Octave high-level interpreted language. SCaVis works on Linux, Windows, Mac and Android operating systems. The Android application is called AWork. The software thus represents a comprehensive analysis framework that can be used on hardware such as netbooks, laptops, desktops, Android tablets and production servers. SCaVis is a portable application that needs no installation: simply download and unpack the package and run it. One can run it from a hard drive, from a USB flash drive or


TABLE I
COMPARISON OF DATA MINING TOOLS

| Name of the Tools | Mode of Software | Applications | Categories | Languages |
| --- | --- | --- | --- | --- |
| DATABIONIC | Commercial | Visualization, Clustering and Classification | Information Analysis, Visualization | Java |
| ELKI | Free & Open Source | Outlier detection, Visualization and Clustering | Data Mining and Machine learning Software | Java |
| MALLET | Open Source | Statistical natural language processing, Document classification, Cluster analysis, Information extraction | Free Artificial Intelligence application, Software stubs, Natural language processing toolkits | Java |
| ML-FLEX | Free & Open Source | Machine learning analyses | Data Mining and Machine learning Software | Java / Other Programming |
| SHOGUN | Free & Open Source | Bioinformatics | Free Software Programmed in C++, Data Mining and Machine learning Software, Free Statistical Software | C++ |
| FITYK | Open Source | Chromatography, Spectroscopy, Powder diffraction | Regression and Curve fitting software, Data analysis software, Software that uses wxWidgets | C++ |
| PYBRAIN | Open Source | Supervised, unsupervised and reinforcement learning | Support Vector Machines, Neural Networks | Python |
| UIMA | Open Source | Text Mining, Information Extraction | Software architecture, Data Mining and Machine learning software | Java with C++ |
| NLTK | Free & Open Source | Artificial Intelligence, Information Retrieval, Machine learning | Natural language parsing, Python libraries, Data analysis | Python |
| DLIB | Open Source | Data Mining, Image processing, Numerical optimization | Computer vision software, Data Mining and Machine learning software | C++ |
| JUBATUS | Open Source | Classification, Regression, Anomaly Detection | Data Mining and Machine learning software, Computing stubs, Free software stubs | C++ |
| SCAVIS | Open Source & Commercial | Data analysis and Data visualization | Statistical software, Numerical programming languages, Infographics | Java, Jython |
| CMSR DATA MINER | Open Source | Predictive modeling, segmentation, data visualization, statistical data analysis, and rule-based model evaluation | Data Mining and Machine Learning Software | Java |
| VOWPAL WABBIT | Open Source | Classification, Regression | Data Mining and Machine Learning Software | C++ |

REFERENCES

[1] S. Sarumathi, N. Shanthi, S. Vidhya, M. Sharmila, “A Review: Comparative Study of Diverse Collection of Data Mining Tools”, International Journal of Computer, Information, Systems and Control Engineering, Vol:8, No:6, 2014.
[2] M. Ferguson, “Evaluating and Selecting Data Mining Tools”, InfoDB, Vol:11, No:
[3] Data Bionics Research Group, University of Marburg: Databionic ESOM Tools, Website (2006), http://databionic-esom.sourceforge.html/
[4] ELKI: Environment for Developing KDD-Applications Supported by Index-Structures (online). Available at: http://elki.dbs.ifi.lmu.de/
[5] Andrew Kachites McCallum, “MALLET: A Machine Learning for Language Toolkit”, 2002.
[6] ML-Flex, “Introduction to ML-Flex” (online). Available at: http://mlflex.sourceforge.net/tutorial/index.html
[7] Jason Brownlee, “A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library”, 2014.
[8] Soren Sonnenburg et al., “The SHOGUN Machine Learning Toolbox”, Journal of Machine Learning Research 11, 1799-1802, 2010.
[9] M. Wojdyr, “Fityk”, J. Appl. Cryst., 2010.
[10] Tom Schaul, Justin Bayer, Daan Wierstra, Sun Yi, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber, “PyBrain”, 2010.
[11] UIMA, “The Apache Software Foundation”, 2006.
[12] S. Bird, E. Steven, Edward Loper and Ewan Klein, “Natural Language Processing with Python”, O'Reilly.
[13] Steven Bird, Edward Loper, “NLTK: The Natural Language Toolkit”
[14] Davis E. King, “Dlib C++ Library”, 2015, Dlib (online). Available at: http://dlib.net/
[15] Davis E. King, “Containers”, 2013, Dlib (online). Available at: http://dlib.net/containers.html
[16] Davis E. King, “Image Processing”, 2015, Dlib (online). Available at: http://dlib.net/imaging.html
[17] Davis E. King, “API Wrappers”, 2015, Dlib (online). Available at: http://dlib.net/api.html
[18] Davis E. King, “Graph Tools”, 2013, Dlib (online). Available at: http://dlib.net/graph_tools.html
[19] Davis E. King, “Machine Learning”, 2015, Dlib (online). Available at: http://dlib.net/ml.html
[20] Jubatus (online). Available at: http://www.predictiveanalyticstoday.com/top-40-free-data-mining-software/
[21] Jubatus, PFN & NTT, 2011 (online). Available at: http://jubat.us/en/overview.html
[22] Satoshi Oda, Kota Uenishi, and Shingo Kinoshita, “Jubatus: Scalable Distributed Processing Framework for Realtime Analysis of Big Data”, NTT Technical Review.
[23] SCaVis, Scavis community, 2014.
[24] Cho Ok-Hyeong, “CMSR Data Miner”, 2014.
[25] CMSR Data Miner Data Mining & Predictive Modeling Software, Rosella Predictive Knowledge and Data Mining,
[26] Vowpal Wabbit: Fast Learning on Big Data, n13,

Mrs. S. Sarumathi received the B.E. degree in Electronics and Communication Engineering from Madras University, Madras, Tamil Nadu, India in 1994 and the M.E. degree in Computer Science and Engineering from K.S.Rangasamy College of Technology, Namakkal, Tamil Nadu, India in 2007. She is pursuing her Ph.D. in the area of Data Mining at Anna University, Chennai. She has about 17 years of teaching experience. At present she is working as an Associate Professor in the Information Technology department at K.S.Rangasamy College of Technology. She has published five papers in reputed international journals and two in national journals. She has also presented papers at three international conferences and four national conferences. She has received many cash awards for producing 100 percent results in university examinations. She is a life member of ISTE.

Dr. N. Shanthi received the B.E. degree in Computer Science and Engineering from Bharathiyar University, Coimbatore, Tamil Nadu, India in 1994 and the M.E. degree in Computer Science and Engineering from Government College of Technology, Coimbatore, Tamil Nadu, India in


  1. She has completed the Ph.D. degree at Periyar University, Salem, in offline handwritten Tamil character recognition. She worked as HOD in the department of Information Technology at K.S.Rangasamy College of Technology, Tamil Nadu, India from 1994 to 2013, and is currently working as Professor & Dean in the department of Computer Science and Engineering at Nandha Engineering College, Erode. She has published 29 papers in reputed international journals and 9 papers in national and international conferences. She has published 2 books. She is supervising 14 research scholars under Anna University, Chennai. She acts as a reviewer for 4 international journals. Her current research interests include Document Analysis, Optical Character Recognition, Pattern Recognition and Network Security. She is a life member of ISTE.
