






a survey about data mining tools







Abstract: Owing to rapid technological innovation, a tremendous amount of data is accumulating all over the world in every domain, such as Pattern Recognition, Machine Learning, Spatial Data Mining, Image Analysis, Fraud Analysis and the World Wide Web. This makes it essential to develop tools that support data mining functionality. The major aim of this paper is to analyze various tools used to build analytical or descriptive models that handle large amounts of information efficiently and in a user-friendly way. In this survey the tools are described in terms of their technical paradigms, graphical interfaces and built-in multipath algorithms, which make them useful for handling significant amounts of data.
Keywords: Classification, Clustering, Data Mining, Machine learning, Visualization
I. INTRODUCTION
The domain of data mining and knowledge discovery has grown tremendously over time in research fields such as Pattern Recognition, Information Retrieval, Medicine, Image Processing, Spatial Data Extraction, Business and Education. Data mining endeavors to originate, analyze, extract and implement a fundamental induction process that facilitates the mining of meaningful information and useful patterns from huge volumes of unstructured data. The data mining paradigm mainly uses complex algorithms and mathematical analysis to derive the patterns and trends that exist in data. The main aim of data mining techniques is to build effective predictive and descriptive models of an enormous amount of data. Several real-world data mining problems involve numerous conflicting measures of performance, or objectives, that must be optimized simultaneously. The most distinctive feature of data mining is that it deals with huge and complex datasets whose volume ranges from gigabytes to terabytes. This requires data mining operations and algorithms to be robust, stable and scalable, and able to work across different research domains. Hence the various data mining tasks play a crucial role in every aspect of information extraction, which in turn has led to the emergence of several data mining tools. From a pragmatic perspective, the graphical interfaces of these tools tend to be efficient, user-friendly and easy to operate, which is why researchers highly prefer them [1].
Mrs. S. Sarumathi, Associate Professor, is with the Department of Information Technology, K. S. Rangasamy College of Technology, Tamil Nadu, India (phone: 9443321692; e-mail: [email protected]). Dr. N. Shanthi, Professor and Dean, is with the Department of Computer Science Engineering, Nandha Engineering College, Tamil Nadu, India (e-mail: [email protected]).
Fig. 1 A Data Mining Framework
Looking at the relationships between the elements of the framework, several data modeling notations indicate the cardinality, 1 or m, of every relationship. Reading them requires at least a minimal familiarity with data modeling notations.
II. DIFFERENT DATA MINING TOOLS
A. DATABIONIC
The Databionic Emergent Self-Organizing Map (ESOM) tool [3] is a collection of programs for data mining tasks such as visualization, clustering and classification. Training data is a collection of points from a high-dimensional space known as the data space. A SOM consists of a collection of prototype vectors in the data space plus a topology among these prototypes. A commonly used topology is a 2-dimensional grid in which every prototype (neuron) has four direct neighbors; the locations on the grid form the map space. In addition, a distance function is needed for each of the two spaces. Euclidean
International Journal of Computer and Information Engineering Vol:9, No:3, 2015
Open Science Index, Computer and Information Engineering Vol:9, No:3, 2015 publications.waset.org/10002609/pdf
distance is normally used for the data space and the city-block distance for the map space. The purpose of SOM training is to adapt the grid of prototype vectors to the given data, generating a 2-dimensional projection that preserves the topology of the data space. The central operation in SOM training is the update of a prototype vector using a vector of the training data. The prototype vector of the neuron is drawn closer to the given vector in the data space. The prototypes in the neighborhood of the neuron are drawn in the same direction with less emphasis. During training, the emphasis and the size of the neighborhood are reduced. Online and batch training are the two common training algorithms; both search for the closest prototype vector (the best match) for each data point. In online training the best match is updated immediately, whereas in batch training the best matches are first collected for all data points and the updates are then performed together.
Emergence is the capacity of a system to develop higher-level structures through the cooperation of many elementary processes. In self-organizing systems the structures evolve inside the system without external influences. Emergence is the appearance of high-level phenomena that cannot be derived from the elementary processes. An emergent structure provides an abstract description of a complex system consisting of low-level individuals. Transferring the principles of self-organization to data analysis is achieved by letting multivariate data points organize themselves into homogeneous groups. The Self-Organizing Map is a well-known tool for this task that incorporates the above principles. The SOM iteratively adjusts to the distance structures in a high-dimensional space and provides a low-dimensional projection that preserves the topology of the input space as well as possible. The map can be used for unsupervised clustering and supervised classification. However, the power of self-organization that allows structure to emerge in data is frequently neglected. This is partly due to a misuse of SOMs that is widespread in the scientific literature: the maps used by some authors are commonly very small, consisting of only some tens of neurons.
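The online training procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the Databionic ESOM implementation: the Gaussian neighborhood kernel, the linear decay of the learning rate and radius, and the names `online_step` and `train_som` are assumptions made for the example. Only the Euclidean distance in the data space, the city-block distance in the map space, and the shrinking emphasis and neighborhood come from the description.

```python
import math
import random


def euclidean(a, b):
    """Distance between two vectors in the data space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(p, q):
    """Distance between two grid positions in the map space."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def online_step(protos, x, lr, radius):
    """Online update: move the best match and its grid neighbors toward x."""
    bmu = min(protos, key=lambda pos: euclidean(protos[pos], x))
    for pos, w in protos.items():
        d = city_block(pos, bmu)
        if d > radius:
            continue
        # Assumed Gaussian kernel: 1.0 at the best match, less for neighbors.
        h = math.exp(-(d * d) / (2.0 * radius * radius))
        for i in range(len(w)):
            w[i] += lr * h * (x[i] - w[i])
    return bmu


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols map; emphasis and neighborhood shrink per epoch."""
    rng = random.Random(seed)
    dim = len(data[0])
    protos = {(r, c): [rng.random() for _ in range(dim)]
              for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs          # assumed linear decay schedule
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.5)    # at 0.5 only the best match moves
        for x in data:
            online_step(protos, x, lr, radius)
    return protos


if __name__ == "__main__":
    points = [[0.1, 0.1], [0.12, 0.08], [0.9, 0.9], [0.88, 0.92]]
    som = train_som(points, rows=3, cols=3)
    for p in points:
        best = min(som, key=lambda pos: euclidean(som[pos], p))
        print(p, "->", best)
```

A batch variant would first collect the best match for every data point and only then apply all updates together, instead of calling `online_step` once per point.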
B. ELKI
ELKI [4] is open-source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised techniques in cluster analysis and outlier detection. ELKI provides data index structures such as the R*-tree that offer major performance gains, in order to achieve high performance and scalability. ELKI is designed to be easy to extend for researchers and students in this domain and welcomes contributions, in particular of new methods. ELKI aims to provide a large collection of highly parameterizable algorithms that allows simple and fair evaluation and benchmarking of algorithms. Data mining research leads to many algorithms for similar tasks. A fair and useful comparison of these algorithms is difficult for several reasons:
interface in Python. It is licensed under a permissive simplified BSD license, is distributed with many Linux distributions, and may be used commercially. The library is built on SciPy (Scientific Python), which must be installed before Scikit-learn can be used. This stack consists of NumPy, the base n-dimensional array package; SciPy, a fundamental library for scientific computing; Matplotlib, for comprehensive 2D and 3D plotting; IPython, an enhanced interactive console; SymPy, for symbolic mathematics; and Pandas, for data structures and analysis. Extensions or modules for SciPy are conventionally named SciKits. This module provides learning algorithms and is therefore named Scikit-learn. The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as code quality, performance, ease of use, documentation and collaboration. Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, LAPACK, LibSVM and the careful use of Cython.
c) It provides multiple kernel learning functionality
d) The coefficients, or weights, of the linear combination are learned
G. FITYK
FITYK [9] is a program for nonlinear fitting of analytical functions to data. In short, it is peak-fitting software. Some groups use it only to remove the baseline from data or simply to display data. It is used in Raman spectroscopy, crystallography and other fields. The authors have only a general understanding of experimental techniques other than powder diffraction and would like to make the program useful to as many groups as possible. FITYK provides a variety of nonlinear fitting methods, simple background subtraction and other manipulations of datasets, support for the analysis of series of datasets, easy placement of peaks and changing of peak parameters, and more. Flexibility is the major advantage of the program: the parameters of peaks can be arbitrarily bound to each other. For example, the width of a peak can be an independent variable, can be the same as the width of another peak, or can be given by a complex formula.
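The idea of binding peak parameters to each other can be illustrated with a small Python sketch. It does not use Fityk or its fitting methods: the Gaussian peak shape, the parameter names (`h1`, `c1`, `h2`, `c2`, `width`) and the simple coordinate search that stands in for a real nonlinear fitting routine are all assumptions chosen to keep the example self-contained. The point it shows is that both peaks read the same width parameter, so the fit has one fewer free variable.

```python
import math


def gaussian(x, height, center, width):
    """Gaussian peak; width is the standard deviation."""
    return height * math.exp(-((x - center) ** 2) / (2.0 * width ** 2))


def model(x, p):
    # Both peaks read the same "width" entry, so their widths are bound:
    # changing it during the fit changes both peaks at once.
    return (gaussian(x, p["h1"], p["c1"], p["width"])
            + gaussian(x, p["h2"], p["c2"], p["width"]))


def sse(xs, ys, p):
    """Sum of squared residuals between the data and the model."""
    return sum((y - model(x, p)) ** 2 for x, y in zip(xs, ys))


def fit(xs, ys, start, step=0.1, min_step=1e-6):
    """Coordinate search: nudge one parameter at a time, keep improvements."""
    p = dict(start)
    names = list(p)
    best = sse(xs, ys, p)
    while step > min_step:
        improved = False
        for name in names:
            for delta in (step, -step):
                trial = dict(p)
                trial[name] += delta
                if name == "width" and trial[name] <= 0:
                    continue  # a non-positive width is not a valid peak
                err = sse(xs, ys, trial)
                if err < best:
                    p, best, improved = trial, err, True
                    break
        if not improved:
            step /= 2.0  # refine the search once no move helps
    return p, best


if __name__ == "__main__":
    # Synthetic, noise-free data from two peaks that share one width.
    truth = {"h1": 1.0, "c1": 0.0, "h2": 0.5, "c2": 3.0, "width": 1.0}
    xs = [i * 0.1 for i in range(-30, 61)]
    ys = [model(x, truth) for x in xs]
    guess = {"h1": 0.7, "c1": 0.2, "h2": 0.8, "c2": 2.7, "width": 1.4}
    fitted, err = fit(xs, ys, guess)
    print(fitted, err)
```

Giving each peak its own width would only require replacing the shared `width` entry with two entries; a width defined by a formula would compute it inside `model` from other parameters.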
use pre-existing annotators. A few annotators are provided, and others are available in various repositories on the internet. Additional infrastructure support components include a simple server that receives REST requests and returns annotation results.
generated Big Data with a distributed scale-out architecture. It uses a loose model-sharing architecture for efficient training and sharing of machine learning models, built on three fundamental operations, Mix, Update and Analyze, in the same way that Hadoop is built on the Map and Reduce operations. The point is to minimize the size of the models and the number of Mix operations while maintaining high accuracy, because mixing large models many times causes high network cost and high latency in a distributed environment. The development team includes researchers who combine the latest advances in online machine learning, randomized algorithms and distributed computing to provide efficient machine learning features for Jubatus. At present it supports essential tasks including regression and outlier detection, and a demo system for classifying fast Twitter data streams is available.
maximum speed without lag of a data stream. These three requirements are in a trade-off relationship, and it is inherently difficult to satisfy all of them at once. Jubatus satisfies both deep analysis and scalability. Here, deep analysis is the automatic classification of unstructured information intended for human beings, such as natural language. It also replaces human labor for vaguely formulated processing work such as prediction. From a technical perspective it involves challenges in areas such as machine learning. Scalability covers two issues: (1) increases in processing requests and (2) increases in data size. Issue (1) is further divided into throughput and response. In general, batch processing focuses on throughput, whereas real-time processing focuses on response. Approaches to issue (2) either process the data without waiting or divide and store it. Jubatus separates the deep-analysis functionality from the non-functional requirement of scalability. For deep analysis, the logic of online machine learning is designed as an engine, comparable to a CPU, that can be continually upgraded as a removable analysis module. For scalability, a common infrastructure, comparable to a motherboard, is scaled by installing analysis modules into the general framework. The final aim of Jubatus is to offer scalable machine learning to everyone. The major strategy is to provide an easy-to-understand online machine learning framework for large data: hardware that scales out across inexpensive commodity servers to enable massively parallel distributed processing, and software that is not restricted to a few specialists, data scientists and programmers.
Applicable Areas
An important application of Jubatus is the classification of Twitter data. Information expressed in natural language and limited to a small number of characters was taken as the subject of analysis. Jubatus rapidly classifies 2000 tweets per second into their corresponding business categories and supplies the information to an analysis application. This function uses an online machine learning technique, the Jubatus classifier, which performs multi-class classification.
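To make the notion of online multi-class classification of short texts concrete, the following Python sketch trains a linear classifier one message at a time. It is not the Jubatus classifier or its API: the class `OnlineClassifier`, the bag-of-words features, the margin-based perceptron update and the example categories are assumptions chosen for illustration.

```python
class OnlineClassifier:
    """Multi-class linear classifier trained one example at a time."""

    def __init__(self, margin=1.0):
        self.margin = margin
        self.weights = {}  # label -> {word: weight}

    @staticmethod
    def features(text):
        """Bag-of-words counts; short texts give small, sparse dicts."""
        counts = {}
        for word in text.lower().split():
            counts[word] = counts.get(word, 0.0) + 1.0
        return counts

    def score(self, label, feats):
        w = self.weights.get(label, {})
        return sum(w.get(word, 0.0) * v for word, v in feats.items())

    def classify(self, text):
        """Return the best-scoring label, or None before any training."""
        if not self.weights:
            return None
        feats = self.features(text)
        return max(sorted(self.weights), key=lambda lb: self.score(lb, feats))

    def train(self, text, label):
        """Perceptron-style update applied immediately to this one example."""
        feats = self.features(text)
        own = self.weights.setdefault(label, {})
        rivals = sorted(lb for lb in self.weights if lb != label)
        rival = (max(rivals, key=lambda lb: self.score(lb, feats))
                 if rivals else None)
        rival_score = self.score(rival, feats) if rival is not None else 0.0
        # Update only when the correct label does not win by the margin.
        if self.score(label, feats) - rival_score < self.margin:
            for word, v in feats.items():
                own[word] = own.get(word, 0.0) + v
                if rival is not None:
                    other = self.weights[rival]
                    other[word] = other.get(word, 0.0) - v


if __name__ == "__main__":
    clf = OnlineClassifier()
    stream = [
        ("stock prices fall", "finance"),
        ("team wins final match", "sports"),
        ("bank raises interest rates", "finance"),
        ("striker scores in match", "sports"),
    ]
    for text, label in stream:
        clf.train(text, label)  # the model is usable after every message
    print(clf.classify("interest rates rise today"))
    print(clf.classify("final match tonight"))
```

Because each message is processed once and then discarded, the memory cost depends on the size of the model rather than on the length of the stream, which is what makes this style of learning suitable for high-rate data.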
Architecture and Functions
Jubatus is composed of a cluster of machine learning engines and a high-speed framework that supports them. Other machine learning engines typically handle small- to medium-scale data and require batch processing and individual development. In contrast, Jubatus has a large variety of engines installed in a high-speed framework, together with a mechanism with a common specification for high-speed processing of large data that tolerates errors within a permitted range. The supported categories of machine learning include regression, classification and statistics. Jubatus is useful for applications that need rapid decisions. It is also commonly used to discover and analyze relationships between data from different domains.
Distributed Processing Architecture
Large data streams flow from left to right. Clients consist of a number of user processes together with proxy processes. The proxy processes relay client requests to the server processes, which makes the servers transparent to the user processes. User processes are implemented with the Jubatus client application programming interface (API) and can be written in a general-purpose programming language or in a scripting language. Communication between proxy processes and server processes is based on MessagePack remote procedure calls. Non-blocking input/output enables more efficient communication and synchronization control. The server processes perform training and prediction as well as learning-model synchronization, and performance increases linearly with the number of servers. ZooKeeper processes manage the coordination between proxy and server processes: load balancing across the distributed servers, the selection of new leaders, and the monitoring of server state (alive or dead). In parallel processing across distributed servers, the major technique for satisfying both deep analysis and scalability is mix processing. Mix processing can be likened to a group study meeting: each server studies on its own and then synchronizes with the others [22].
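The interplay of the Update, Analyze and Mix operations can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not Jubatus code: the `Server` class, the least-squares gradient step used for Update, and the choice of plain averaging as the Mix rule are all hypothetical, and no RPC or ZooKeeper coordination is modeled.

```python
class Server:
    """One server holding a local linear model (hypothetical stand-in)."""

    def __init__(self, name):
        self.name = name
        self.model = {}  # feature name -> weight

    def analyze(self, features):
        """Analyze: predict with the current local model."""
        return sum(self.model.get(k, 0.0) * v for k, v in features.items())

    def update(self, features, target, lr=0.1):
        """Update: one online gradient step of least-squares regression."""
        err = target - self.analyze(features)
        for k, v in features.items():
            self.model[k] = self.model.get(k, 0.0) + lr * err * v


def mix(servers):
    """Mix: average the local models and share the result (assumed rule)."""
    keys = set().union(*(s.model for s in servers))
    mixed = {k: sum(s.model.get(k, 0.0) for s in servers) / len(servers)
             for k in keys}
    for s in servers:
        s.model = dict(mixed)  # every server continues from the shared model
    return mixed


if __name__ == "__main__":
    servers = [Server("s1"), Server("s2")]
    # Each server sees only part of the stream (target = 2 * x).
    stream = [({"x": 1.0}, 2.0), ({"x": 2.0}, 4.0)] * 20
    for i, (feats, target) in enumerate(stream):
        servers[i % 2].update(feats, target)
        if (i + 1) % 10 == 0:
            mix(servers)  # infrequent mixing keeps network cost low
    print(servers[0].analyze({"x": 3.0}))
```

Between two Mix calls the servers learn independently, so the Mix interval trades synchronization cost against how far the local models are allowed to drift apart.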
Features
a) Multi-classification algorithms
b) Regression algorithms
c) Feature extraction method for natural language
M. SCaVis
SCaVis is an environment for scientific computation, data visualization and data analysis designed for students, scientists and engineers. The program integrates multiple open-source software packages into a coherent interface using the concept of dynamic scripting. SCaVis is used wherever analysis of large numerical data volumes, statistical analysis, data mining and mathematical computation are needed, for example in modeling, natural science, analysis of financial markets and engineering. The SCaVis program is fully multiplatform and works on any platform where Java is installed. As a Java application, SCaVis takes full advantage of multicore processors. SCaVis can be used with various scripting languages for the Java platform, such as BeanShell, Groovy, Jython and JRuby. This brings power and simplicity to scientific computation. Programming can also be done in native Java. Finally, symbolic calculations can be performed using a Matlab/Octave-like high-level interpreted language. SCaVis works on the Linux, Windows, Mac and Android operating systems. The Android application is known as AWork. The software therefore represents an analysis framework that can be used on hardware such as netbooks, desktops, laptops, Android tablets and production servers. SCaVis is a portable application and needs no installation: simply download and unpack the package and run it. It can be run from a hard drive, from a USB flash drive or
TABLE I
COMPARISON OF DATA MINING TOOLS
Name of the Tool | Mode of Software | Applications | Categories | Languages
DATABIONIC | Commercial | Visualization, Clustering and Classification | Information Analysis, Visualization | Java
ELKI | Free & Open Source | Outlier detection, Visualization and Clustering | Data Mining and Machine learning Software | Java
MALLET | Open Source | Statistical natural language processing, Document classification, Cluster analysis, Information extraction | Free Artificial Intelligence application, Software stubs, Natural language processing toolkits | Java
ML-FLEX | Free & Open Source | Machine learning analyses | Data Mining and Machine learning Software | Java / Other Programming
SHOGUN | Free & Open Source | Bioinformatics | Free Software Programmed in C++, Data Mining and Machine learning Software, Free Statistical Software | C++
FITYK | Open Source | Chromatography, Spectroscopy, Powder diffraction | Regression and Curve fitting software, Data analysis software, Software that uses wxWidgets | C++
PYBRAIN | Open Source | Supervised, unsupervised and reinforcement learning | Support Vector Machines, Neural Networks | Python
UIMA | Open Source | Text Mining, Information Extraction | Software architecture, Data Mining and Machine learning software | Java with C++
NLTK | Free & Open Source | Artificial Intelligence, Information Retrieval, Machine learning | Natural language parsing, Python libraries, Data analysis | Python
DLIB | Open Source | Data Mining, Image processing, Numerical optimization | Computer vision software, Data Mining and Machine learning software | C++
JUBATUS | Open Source | Classification, Regression, Anomaly Detection | Data Mining and Machine learning software, Computing stubs, Free software stubs | C++
SCAVIS | Open Source & Commercial | Data analysis and Data visualization | Statistical software, Numerical programming languages, Infographics | Java, Jython
CMSR DATA MINER | Open Source | Predictive modeling, segmentation, data visualization, statistical data analysis, and rule-based model evaluation | Data Mining and Machine Learning Software | Java
VOWPAL WABBIT | Open Source | Classification, Regression | Data Mining and Machine Learning Software | C++
[1] S. Sarumathi, N. Shanthi, S. Vidhya, M. Sharmila, “A Review: Comparative Study of Diverse Collection of Data Mining Tools”, International Journal of Computer, Information, Systems and Control Engineering, Vol:8, No:6, 2014.
[2] M. Ferguson, “Evaluating and Selecting Data Mining Tools”, InfoDB, Vol:11, No:
[3] Data Bionics Research Group, University of Marburg, “Databionic ESOM Tools”, Website (2006), http://databionic-esom.sourceforge.html/
[4] ELKI: Environment for Developing KDD-Applications Supported by Index-Structures (online). Available at: http://elki.dbs.ifi.lmu.de/
[5] Andrew Kachites McCallum, “MALLET: A Machine Learning for Language Toolkit”, 2002.
[6] ML-Flex, “Introduction to ML-Flex” (online). Available at: http://mlflex.sourceforge.net/tutorial/index.html
[7] Jason Brownlee, “A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library”, 2014.
[8] Soren Sonnenburg et al., “The SHOGUN Machine Learning Toolbox”, Journal of Machine Learning Research 11, 1799-1802, 2010.
[9] M. Wojdyr, “Fityk”, J. Appl. Cryst., 2010.
[10] Tom Schaul, Justin Bayer, Daan Wierstra, Sun Yi, Martin Felder, Frank Sehnke, Thomas Rückstieß, Jürgen Schmidhuber, “PyBrain”, 2010.
[11] UIMA, “The Apache Software Foundation”, 2006.
[12] S. Bird, E. Steven, Edward Loper and Ewan Klein, “Natural Language Processing with Python”, O’Reilly.
[13] Steven Bird, Edward Loper, “NLTK: The Natural Language Toolkit”
[14] Davis E. King, “Dlib C++ Library”, 2015, Dlib (online). Available at: http://dlib.net/
[15] Davis E. King, “Containers”, 2013, Dlib (online). Available at: http://dlib.net/containers.html
[16] Davis E. King, “Image Processing”, 2015, Dlib (online). Available at: http://dlib.net/imaging.html
[17] Davis E. King, “API Wrappers”, 2015, Dlib (online). Available at: http://dlib.net/api.html
[18] Davis E. King, “Graph Tools”, 2013, Dlib (online). Available at: http://dlib.net/graph tools.html
[19] Davis E. King, “Machine Learning”, 2015, Dlib (online). Available at: http://dlib.net/ml.html
[20] Jubatus (online). Available at: http://www.predictiveanalyticstoday.com/top-40-free-data-mining-software/
[21] Jubatus, PFN & NTT, 2011 (online). Available at: http://jubat.us/en/overview.html
[22] Satoshi Oda, Kota Uenishi, and Shingo Kinoshita, “Jubatus: Scalable Distributed Processing Framework for Realtime Analysis of Big Data”, NTT Technical Review.
[23] SCaVis, Scavis community, 2014.
[24] Cho Ok-Hyeong, “CMSR Data Miner”, 2014.
[25] CMSR Data Miner Data Mining & Predictive Modeling Software, Rosella Predictive Knowledge and Data Mining,
[26] Vowpal Wabbit: Fast Learning on Big Data, n13,
Mrs. S. Sarumathi received the B.E. degree in Electronics and Communication Engineering from Madras University, Madras, Tamil Nadu, India in 1994 and the M.E. degree in Computer Science and Engineering from K.S.Rangasamy College of Technology, Namakkal, Tamil Nadu, India in 2007. She is pursuing her Ph.D. in the area of Data Mining at Anna University, Chennai. She has about 17 years of teaching experience. At present she is working as Associate Professor in the Information Technology department at K.S.Rangasamy College of Technology. She has published papers in five reputed international journals and two national journals, and has presented papers at three international conferences and four national conferences. She has received many cash awards for producing cent percent results in university examinations. She is a life member of ISTE.
Dr. N. Shanthi received the B.E. degree in Computer Science and Engineering from Bharathiyar University, Coimbatore, Tamil Nadu, India in 1994 and the M.E. degree in Computer Science and Engineering from Government College of Technology, Coimbatore, Tamil Nadu, India in