



A research study by Yi Liang, Xiangyun Cai, and Zilun Xiong on data mining, focusing on a modified neural network combined with a structure-optimized genetic algorithm. It discusses the challenges of manual data processing and the need for automated techniques, and gives an overview of data mining methods and techniques. The results indicate higher classification efficiency and lower computation time than previous methods.




Yi Liang Zhejiang University, China Xiangyun Cai South China University of Technology, China Zilun Xiong Lanzhou University, China
Abstract: Data mining has become a widely researched field of late, especially as technological advances have sharply increased the volume of data that must be stored and processed for query-based applications and decision-making processes. Data mining refers to the extraction of useful information from a pool of data. Various algorithms have been proposed for the mining process, among which neural-network-based mining algorithms are predominant. The proposed work uses a back-propagation neural network (BPNN) integrated with a genetic optimization algorithm for minimization and for arriving at an optimal search value. A detailed review of the types of mining and of existing techniques is presented in this paper. The proposed algorithm has been tested on the Iris data set, and the results obtained indicate comparatively higher classification efficiency and reduced computation time. The results have been compared against conventional techniques, namely SVM-classifier-based mining and a neural network in its standalone architecture.
Keywords: data mining, neural network, genetic optimization, knowledge discovery in databases.
With recent advances in technology and state-of-the-art gadgets, the quantity of data and the need to process it in real time have grown steadily, to the point where manually processing and searching a large pool of data is a tedious and time-consuming task. The amount of data is growing very fast, and manual analysis, even if possible, cannot keep pace. No human can be expected to analyse millions of records, each having hundreds of fields, which opens avenues for developing automated techniques for data analysis and processing, popularly known as data mining (DM) and knowledge discovery in databases (KDD).
Figure 1. A general data mining system
Figure 1 illustrates a general data mining system consisting of the knowledge base and the database. Data mining is an iterative process whose first step is data cleaning. As in the pre-processing of images, noisy, redundant and irrelevant data are removed from the data set. This is followed by data integration, where data from multiple sources are combined into a common source. The data most relevant to the query under study are selected in the data selection stage and then transformed into formats suitable for the mining process. The data sets are then subjected to intelligent machine learning techniques or algorithms to extract the patterns, and the result, known as knowledge, is presented in a visually interpretable manner. Data mining (DM), often referred to as knowledge discovery in databases (KDD), is the process of nontrivial extraction of implicit, previously unknown and potentially useful information from a large volume of data [1]. The extracted information is also referred to as knowledge, in the form of rules, constraints and regularities. Rule mining is one of the critical tasks, as rules provide a concise statement of potentially important information, and most research contributions to date have used neural-network-based techniques for deriving the mining rules. Researchers have used many techniques for rule mining, such as statistical, AI, decision tree, database and cognitive techniques. Several major kinds of data mining methods, including generalization, characterization, classification, clustering, association, evolution, pattern matching, data visualization, and meta-rule guided mining,
have been reviewed in [13]. Techniques for mining knowledge in different kinds of databases, including relational, transaction, object-oriented, spatial, and active databases, as well as global information systems, have been surveyed, and the merits and demerits of each technique have been presented. The review in [13] classifies mining techniques into association-based, clustering-based, prediction-based, classification-based, decision tree and pattern-based techniques, and discusses each technique and the algorithms that implement it in detail. Association is one of the best-known data mining techniques. In association, patterns are discovered based on a relationship between items in the same transaction [18], so the association technique is also known as the relation technique. Association rule mining is used in market basket analysis to identify sets of products that customers frequently purchase together. Frequent item sets play an essential role in many data mining tasks that try to find interesting patterns in databases, such as association rules, correlations, sequences, classifiers and clusters, of which the mining of association rules is one of the most popular problems [3] [4]. The original motivation for searching for association rules came from the need to analyze so-called supermarket transaction data, that is, to examine customer behavior in terms of the purchased products. Association rule mining normally proceeds by generating frequent item sets. The concepts behind association rules are provided at the beginning, followed by an overview of some of the previous research work in this area. On the other hand, clustering is a data mining technique that forms meaningful clusters of objects.
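As a rough illustration of the frequent item set idea behind association rule mining, the following Python sketch computes support and confidence over a handful of market-basket transactions. The transactions, the function names and the thresholds are hypothetical and chosen only for illustration; they are not taken from the paper or from the works it cites.

```python
from itertools import combinations

# Hypothetical toy transactions (illustrative only, not from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support, size):
    """All item sets of the given size with support >= min_support."""
    items = sorted(set().union(*transactions))
    return [
        set(candidate)
        for candidate in combinations(items, size)
        if support(candidate, transactions) >= min_support
    ]


pairs = frequent_itemsets(transactions, min_support=0.5, size=2)
rule_conf = confidence({"bread"}, {"milk"}, transactions)
```

This brute-force enumeration is only practical for tiny examples; practical miners prune candidates level by level.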
The clustering technique defines the classes and places objects in each class, whereas in classification techniques objects are assigned to predefined classes. To make the concept clearer, consider an example. In a library, a large number of books on various topics are available. The challenge is how to arrange those books so that readers can find several books on a particular topic without difficulty. Using the clustering technique, books that have some kind of similarity are kept in one cluster, which is labelled with a meaningful name. A grouping of data objects based only on the information found in the data that describes the objects and their relationships has been used [11-13] in the cluster-based algorithm presented by Indira Priya et al., where the objective is to group objects so that those within a group are similar to one another and different from the objects in other groups. The greater the similarity within a group [6] and the greater the difference between groups, the more distinct the clustering. Cluster analysis splits the space into regions characteristic of the clusters found in the data. The main benefit of a clustered solution is automatic recovery from failure. The difficulties of clustering are complexity and the inability to recover from database corruption. Prediction is a data mining technique that determines the relationship between dependent and independent variables. The prediction analysis technique can be used in
sales to predict profit: sales is the independent variable and profit could be the dependent variable. Then, based on past sales and profit data, a regression curve is fitted and used for profit prediction. From the literature, it has been observed that no approach or tool can guarantee accurate prediction in an organization. Different algorithms and prediction techniques have been analyzed in the literature. Although least median squares regression is known to produce better results than the linear regression technique for a given set of attributes, the comparison found that linear regression [9] takes less time than least median squares regression. Sequential pattern analysis [8] is a data mining technique that seeks to discover or identify related patterns, regular events or trends in transaction data over a business period. In sales, with past transaction data, it is easy to identify a set of items that customers buy together in a year. Businesses can then use this information to recommend that customers buy these items, with better deals based on their past purchasing frequency. The important heuristics employed include optimally sized data structure representations of the sequence database, early pruning of candidate sequences, mechanisms to reduce support counting, and maintaining a narrow search space. Rule mining using neural networks (NNs) is a challenging job, as there is no straightforward way to translate NN weights into rules. However, NNs have the potential to be used in rule mining, since they have been found to be a powerful tool for modeling data efficiently, and modeling data is an essential part of rule mining. As one branch of DM methods, rule mining aims to apply DM algorithms to data stored in databases.
The core challenge of rule mining research is to turn information expressed in terms of stored data into knowledge expressed in terms of generalized statements about the characteristics of the data, which are known as rules. These rules are used to draw conclusions about the whole universe of the data set. The feed-forward NN with unsupervised learning is a good mining tool for discovering data clusters in databases. Discrimination is also an important issue in this regard: similar patterns should be placed in the same group so that like patterns can be discriminated in future. In the NN area, Kohonen self-organizing networks [8], associative memory networks and counter-propagation networks all have good potential for this purpose. Data mining is the term used to describe the process of extracting value from a database. A data warehouse is a location where information is stored. The type of data stored depends largely on the type of industry and the company. Classification-based techniques [7] [14] [16] are used to classify each item in a set of data into one of a predefined set of classes or groups. They utilize decision trees, neural networks and statistics. In classification, software is developed that can learn how to classify the data items into groups. Internally, the classifier could be a Bayesian classifier, the conventional support vector machine classifier, or a back-propagation classifier, whose methodology and implementation details have been discussed in [15]. The classifier could also be rule based, incorporating a genetic search technique for the optimal solution search process, which forms
these interconnected neurons all the parallel processing in the human body is done, and the human body is the best example of parallel processing. A neuron is a special biological cell that passes information to other neurons by means of electrical and chemical changes. It is composed of a cell body, or soma, and two types of outreaching tree-like branches: the axon and the dendrites. The cell body has a nucleus that contains information about hereditary traits and plasma that holds the molecular equipment for producing the material needed by the neuron. The process of receiving and sending signals follows a particular pattern: a neuron receives signals from other neurons through its dendrites, sends signals as spikes of electrical activity through a long, thin strand known as an axon, and the axon splits and passes these signals through synapses to other neurons.
Figure 3: A multi layered artificial neural network
The feed-forward neural network architecture is commonly used for supervised learning. Feed-forward neural networks contain a set of layered nodes and weighted connections between nodes in adjacent layers. Feed-forward networks are often trained using a back-propagation learning scheme. Back-propagation learning works by modifying weight values starting at the output layer and then moving backward through the hidden layers of the network.
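The backward flow of weight updates described above can be sketched in plain Python for a network with one hidden layer. This is a minimal sketch under stated assumptions, not the authors' implementation: sigmoid activations, a squared-error loss, the 2-3-1 layer sizes, the learning rate and the logical-OR training data are all chosen purely for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, n_out, rng):
    """Small random weights; the last entry of each row is a bias."""
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
              for _ in range(n_out)]
    return hidden, output


def forward(net, x):
    """Return hidden activations and network outputs for input list x."""
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in hidden]
    y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in output]
    return h, y


def train_step(net, x, target, lr):
    """One back-propagation update for a single (x, target) pair."""
    hidden, output = net
    h, y = forward(net, x)
    # Output-layer deltas are computed first: error times sigmoid derivative.
    delta_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    # Hidden-layer deltas follow by propagating the output deltas backward.
    delta_hid = [
        hj * (1.0 - hj) * sum(delta_out[k] * output[k][j]
                              for k in range(len(output)))
        for j, hj in enumerate(h)
    ]
    for k, row in enumerate(output):
        for j, v in enumerate(h + [1.0]):
            row[j] -= lr * delta_out[k] * v
    for j, row in enumerate(hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] -= lr * delta_hid[j] * v


def mean_squared_error(net, data):
    total = 0.0
    for x, target in data:
        _, y = forward(net, x)
        total += sum((yk - tk) ** 2 for yk, tk in zip(y, target))
    return total / len(data)


# Illustrative data: logical OR (an assumption, not the Iris data).
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
net = init_network(2, 3, 1, random.Random(0))
mse_before = mean_squared_error(net, data)
for _ in range(2000):
    for x, target in data:
        train_step(net, x, target, lr=0.5)
mse_after = mean_squared_error(net, data)
```

The hidden-layer deltas are computed before any weight is changed, so both layers are updated from the same forward pass.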
Input: Data set
Target: Classified set
1. Generate an m x n map with seed neurons.
2. Initialize the initial weight W(0).
3. Select an instance n.
4. Find the winning neuron.
5. Determine the error.
6. Adapt the weight vectors of the k neurons using genetic optimization after encoding.
7. Repeat the steps until convergence.
8. Add or delete connections between neurons according to the measured distances.
B. GA Optimization
The GA is based on the biological principle of natural selection. The architecture of systems that implement GAs is able to adapt to a wide range of problems. A GA functions by generating a large set of possible solutions to a given problem. It then evaluates each of those solutions, and
decides on a fitness level for each solution. These solutions then breed new solutions. The parent solutions that are more fit are more likely to reproduce, while those that are less fit are less likely to do so. In essence, solutions are evolved over time, and the search moves toward the point of the solution. The GA can be understood by thinking in terms of real-life natural selection. The three elements that constitute a GA are the encoding, the operators and the fitness function. The individuals in genetic space are chromosomes, and their basic constituents are genes. The position of a gene in an individual is called its locus. A set of individuals forms a population. The fitness represents how well an individual is adapted to its environment. The elementary operation of a genetic algorithm consists of three operators: selection, crossover and mutation. Selection is also called copying or reproduction. By calculating the fitness fi of each individual, we select high-quality individuals with high fitness, copy them to the new population and eliminate individuals with low fitness. Commonly used selection strategies include roulette wheel selection, expectation value selection, paired competition selection and retaining the highest-quality individuals. Crossover places the selected individuals in a mating pool and randomly pairs them to form the parent generation. Then, according to the crossover probability and the specified crossover method, it exchanges part of the genes of each pair to form the child generation and so generate new individuals. Commonly used crossover methods are one-point crossover, multi-point crossover and average crossover. According to the specified mutation rate, mutation replaces genes at some loci with their opposite values to generate new individuals.
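The three operators described above can be sketched in Python on binary chromosomes. This is a minimal sketch, not the authors' implementation: the count-the-ones fitness function (`one_max`), the population size, the rates and all function names are assumptions made for illustration. Roulette wheel selection, one-point crossover, bit-flip mutation and retention of the best individual follow the description in the text.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running >= pick:
            return individual
    return population[-1]


def one_point_crossover(a, b, point):
    """Swap the gene tails of two parents after the given locus."""
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, rate, rng):
    """Flip each binary gene to its opposite value with probability rate."""
    return [1 - g if rng.random() < rate else g for g in chromosome]


def evolve(population, fitness, generations, crossover_prob, mutation_rate, rng):
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]
        # Retain the highest-quality individual unchanged (elitism).
        best = max(population, key=fitness)
        next_pop = [best[:]]
        while len(next_pop) < len(population):
            p1 = roulette_select(population, fitnesses, rng)
            p2 = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_prob:
                point = rng.randint(1, len(p1) - 1)
                c1, c2 = one_point_crossover(p1, p2, point)
            else:
                c1, c2 = p1[:], p2[:]
            next_pop.append(mutate(c1, mutation_rate, rng))
            if len(next_pop) < len(population):
                next_pop.append(mutate(c2, mutation_rate, rng))
        population = next_pop
    return population


def one_max(chromosome):
    """Hypothetical fitness: the number of genes set to 1."""
    return sum(chromosome)


rng = random.Random(42)
initial_population = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
final_population = evolve(initial_population, one_max, generations=30,
                          crossover_prob=0.8, mutation_rate=0.02, rng=rng)
```

Because the best individual is copied over unchanged in every generation, the best fitness in the population never decreases.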
III. RESULTS AND DISCUSSION
The Fisher Iris database is used in the experiments; it is chosen by numerous researchers for verifying the performance of proposed classification and clustering algorithms. It is one of the best-known databases in the pattern recognition literature. The set, prepared by Ronald Fisher, contains three classes of fifty instances each, where each class refers to a type of Iris plant, giving 150 samples in total. Four attributes are used to predict the Iris class: sepal length (T1), sepal width (T2), petal length (T3) and petal width (T4), all in centimetres. Among the three classes, class 1 is linearly separable from the other two, while classes 2 and 3 are not linearly separable from one another. To ease data extraction, we reformulate the data with three outputs, where class 1 is represented by {1, 0, 0}, class 2 by {0, 1, 0}, and class 3 by {0, 0, 1}. Class 1 is Setosa, class 2 is Virginica, and class 3 is Versicolor.
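The three-output encoding of the class labels can be sketched as follows. The function names are hypothetical, and decoding a network output by taking the largest component is an assumption made for illustration rather than a rule stated in the paper.

```python
def one_hot(class_number, n_classes=3):
    """Target vector for a 1-based class number, e.g. 2 -> [0, 1, 0]."""
    if not 1 <= class_number <= n_classes:
        raise ValueError("class_number must be between 1 and n_classes")
    return [1 if i == class_number - 1 else 0 for i in range(n_classes)]


def decode(outputs):
    """Map an output vector back to a 1-based class (largest output wins)."""
    return max(range(len(outputs)), key=lambda i: outputs[i]) + 1


targets = [one_hot(c) for c in (1, 2, 3)]
predicted_class = decode([0.1, 0.7, 0.2])
```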
Figure 4. Extracted features for the dataset
Features measured are the average, standard deviation, variance, maximum value and minimum value, as shown in Figure 4. In the first experiment all 250 available data patterns were used for training and testing; in the second experiment 75 randomly selected patterns from each class were used for training and the remaining 175 for testing. The first experiment produced no misclassifications; the experiment was carried out by varying from 1 to 50 in steps of 5.
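The summary features listed above (average, standard deviation, variance, maximum and minimum) can be computed per attribute with Python's standard `statistics` module. The sample values below are illustrative only, and the use of the sample (n - 1) estimators is an assumption, since the paper does not say which variant was used.

```python
import statistics


def summarize(values):
    """Summary features of one attribute column."""
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),          # sample standard deviation
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "max": max(values),
        "min": min(values),
    }


# Illustrative sepal-length values in centimetres (not the full data set).
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
summary = summarize(sepal_length)
```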
Figure 5. Convergence against number of epochs
The convergence of the mean square error against the number of iterations, or epochs, is depicted in Figure 5. The mean square error for the given data set converges to a minimum value in 108 iterations and remains nearly saturated thereafter. Figure 6 illustrates the computation time taken by each of the three techniques when implemented on an Intel 3, 3.5 GHz processor.
Figure 6. Comparison of computation times
It can be seen that the proposed work provides a drastic reduction in computation time. The accuracy results of the proposed work are shown in Table 1.
Table 1: Performance evaluation of proposed work
This paper has presented research on data mining based on a neural network optimized through a genetic algorithm. Neural networks are known for their high robustness, self-organizing adaptivity, parallel processing capabilities, and distributed storage with a high degree of fault tolerance. The combination of data mining and neural networks can greatly improve the efficiency of data mining, and it has been widely used. We have presented a neural-network-based data mining scheme for mining classification rules from given databases. This work is an attempt to apply the approach to data mining by extracting symbolic rules. The expectation from mining the Iris data set is to discover patterns by examining the petal and sepal sizes of the Iris plant, and to understand how a prediction of the plant's class is made from those patterns. Using these patterns and this classification, unknown data can be predicted more precisely in future. The type of relationship mined from the Iris data set is a classification model, which classifies the type of Iris plant by examining the sizes of the petal and sepal. Sepal width has a positive relationship with sepal length, and petal width has a positive relationship with petal length. An important feature of the rule extraction algorithm is its recursive nature. A set of experiments was conducted to test the approach using a well-defined set of data mining problems. The results indicate that, using the approach, high-quality rules can be discovered from the given data sets. The extracted rules are concise, comprehensible, order insensitive, and do not involve any weight values. The accuracy of the rules from the pruned network is as high as the accuracy of the fully connected networks. Experiments showed that this method significantly reduces the number of rules without sacrificing classification accuracy.
REFERENCES
S.No. Algorithm Accuracy