XML Duplicate Detection over Bayesian Network Using Decision Tree Induction Algorithm

DOI : 10.17577/IJERTV3IS052127


S. Krishnakumar

IV Semester, M.E., Dept. of CSE, K. Ramakrishnan College of Technology, Trichy

Abstract: Duplicate detection is a nontrivial task because duplicates are not exactly equal, only more or less similar, and the structure of XML data representing the same real-world object also varies from one document to another. Duplicates in XML data are first handled by a decision tree induction algorithm, which removes exactly matching XML data and incomplete data, thereby reducing the number of records to classify. From the output of the decision tree induction algorithm, a Bayesian network is created to obtain the probability of two XML objects being duplicates. To improve the efficiency of the evaluation, a novel pruning strategy is used. Both efficiency and effectiveness are high because duplicate detection is done at two levels: first by the decision tree induction algorithm and then by the Bayesian network, whose evaluation is reduced by pruning.

Keywords: Decision tree induction algorithm, Bayesian network, network pruning, data cleaning

  1. INTRODUCTION

    XML is increasingly popular, especially for data published on the Web and data exchanged between organizations. XML data is semi-structured and is organized hierarchically.

    Fuzzy duplicate detection is of critical practical relevance in many applications, including data cleaning, data integration, and personal information management. The most prominent application area for duplicate detection is customer relationship management (CRM), where multiple entries of the same customer can result in multiple mailings to the same person, incorrect aggregation of sales to a certain customer, etc. Other application areas include bioinformatics, catalog integration, and in general any domain where independently collected data is integrated. The problem has been considered under various names, e.g., record linkage, merge/purge, reference reconciliation, or entity resolution.

    In this paper, we refer to the problem as fuzzy duplicate detection, or duplicate detection for short. The word "fuzzy" implies that the result of matching two XML data items is not strictly true or false; the result lies between true and false, and is therefore a value between 0 and 1. If the result is above a threshold value, it is taken as true; below the threshold, it is taken as false. Duplicate detection has been studied extensively for relational data stored in a single table. In that setting, the detection strategy typically consists in comparing pairs of tuples (each tuple representing an object) by computing a similarity score based on their attribute values.

    This does not suit duplicate detection in XML data, because the content of two XML items may differ while still representing the same object. When XML data is taken from different sources, its content and structure may vary even though it represents the same entity. The content varies due to errors, different semantics, and misspellings. An example of XML data with content and structural differences is shown below.

    (A) <person><name>John Smith</name><age>25</age></person>

    (B) <person><name>Smith, John</name><details><age>25</age></details></person>

    Fig. 1.1. Two XML elements representing the same person.

    Fig. 1.1 shows that the content difference occurs in the name element, which is spelled in document (A) as "John Smith", while in document (B) it is "Smith, John". To find out that "John Smith" and "Smith, John" represent the same name, a simple string comparison is not sufficient. A string similarity function is required to discover that these two strings represent the same information. Such similarity functions take two strings as input and return a value indicating how similar (or how different) they are. Some well-known similarity functions are Edit Distance, Jaro-Winkler, Monge-Elkan, and SoftTF.IDF, among others. When such functions are used, we usually consider that the input strings match if their similarity value is above a certain threshold. A structural difference also exists in the above example: in document (A) the age element is embedded in the person element, but in document (B) it is embedded in the details element. The structural difference can be identified from the DTD (Document Type Definition).
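    For illustration, below is a minimal Python sketch of threshold-based string matching using a normalized edit distance, one of the similarity functions named above. The function names and the threshold value are ours, not the paper's.

```python
# Hypothetical sketch: normalized edit-distance similarity with a match threshold.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalize the distance into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

THRESHOLD = 0.8  # illustrative cut-off
print(similarity("JohnSmith", "John Smith") >= THRESHOLD)  # True: near match
print(similarity("John Smith", "Smith, John"))  # low score: plain edit distance
# cannot see token reordering, which motivates token-aware measures
# such as Monge-Elkan.
```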

    The method we present in this paper uses decision tree induction to remove XML records that have missing values and XML records that are exact duplicates. This classifier reduces the number of records that are stored, and thus the number of comparisons made in the Bayesian network. Since string similarity functions take a long time to compare strings, a technique called network pruning is used to avoid unnecessary comparisons.

    Structure. This paper is organized as follows: Section 2 presents related work. Section 3 summarizes the overall architecture of the system. Section 4 describes the decision tree induction classifier. Section 5 describes the Bayesian network formation. Accelerating the evaluation of the Bayesian network by pruning is presented in Section 6. Finally, in Section 7 we conclude and present suggestions for future work.

  2. RELATED WORK

    Only more recently has research been performed with the specific goal of discovering duplicate object representations in XML databases [1], [2], [3], [4], [5]. These works differ from previous approaches since they were specifically designed to exploit the distinctive characteristics of XML object representations: their structure, textual content, and the semantics implicit in the XML labels. We briefly describe the main features of these methods here. Comparison of XML elements based on parents, children, and structure is discussed in [1], and heuristics are proposed to choose among them. The approach consists of three modules, namely candidate definition, duplicate definition, and duplicate detection; the first two provide the definitions necessary for duplicate detection (i.e., the set of object representations to compare and the duplicate classifier to use), while the third component includes the actual algorithm. Its demerit is that the heuristic selection is not automatic. In [2], the similarity between two nodes is calculated from the elements' content, node names, and node paths, solving the problem of defining which parts of two XML documents contain the same information. It focuses only on effectiveness, not on efficiency.

    In [3], not only the duplicate status of children is considered, but also the probability of descendants being duplicates. The Bayesian network of [3] is able to accurately determine the probability of two XML objects being duplicates; its demerit is a high run time. It is effective but less efficient. Our proposed method increases the efficiency of the duplicate detection done in the Bayesian network by using decision tree induction classification and network pruning.

    The problem of integrating XML data through correlations realized as join operations is addressed in [4], where lower and upper bounds for the tree edit distance between two trees are calculated. It focuses only on efficiency and not on effectiveness. In [5], tree edit distance is shown to be computationally infeasible for unordered data, so a new distance for XML data, called the structure-aware XML distance, is proposed based on the concept of overlays. An overlay between two XML trees U and V is a mapping between their nodes such that a node u belonging to U is mapped to a single node v belonging to V if, and only if, they have the same path from the root. This measure is then used to perform a pairwise comparison between all candidates. If the distance measure determines that two XML candidates are closer than a given threshold, the pair is classified as a duplicate.

  3. OVERALL ARCHITECTURE

    Fig. 3.1. Overall architecture of duplicate detection.

    Fig. 3.1 shows the overall architecture of duplicate detection. The XML data is processed by decision tree induction to remove exact duplicates and XML data with missing values. This reduces the amount of work done by the Bayesian network and network pruning stages by reducing the number of XML items to compare.

    The classified XML data is used to construct the Bayesian model, from which the four types of conditional probabilities are calculated. A string similarity function is used to evaluate the conditional probabilities. String similarity computation is costly: it takes more time and reduces the efficiency of the system. To improve efficiency, network pruning is used. It reduces the number of comparisons through strategies such as sorting, and by updating the conditional probability of the network whenever a string similarity value is computed. If the running probability falls below the threshold value, the comparison is stopped and the XML data is declared non-duplicate. Stopping the comparison midway increases the efficiency of the system.

    The duplicate XML data is removed and only non-duplicates are stored. This improves memory utilization and makes the search process faster when retrieving data, thus optimizing memory use. The duplicate detection of XML data is done both effectively and efficiently. Recall and precision are used to evaluate the effectiveness of our system, and time consumption is measured to evaluate its efficiency: the less time the algorithm takes to execute, the more efficient it is.

  4. DECISION TREE INDUCTION

    A decision tree is formed to match new XML data against the existing XML data. Comparing new data with every stored XML record takes a long time and reduces the efficiency of duplicate detection. To reduce this time, a decision tree is induced from the existing XML data, and the new XML data is compared against the tree from the root down to the leaf whose path it matches.

      1. Decision Tree Induction Construction

        A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more branches (e.g., Sunny, Overcast, and Rainy). A leaf node (e.g., Play) represents a classification or decision. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.

        4.1.1 Entropy

        A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). Entropy is used to calculate the homogeneity of a sample: if the sample is completely homogeneous the entropy is zero, and if the sample is equally divided it has an entropy of one. If the entropy is greater than zero then the attribute has to be split; if it is zero then the attribute becomes a leaf.

        $\mathrm{ENTROPY}(T) = -\sum_{j=1}^{m} p_j \log_2 p_j$

        where $p_j$ is the proportion of instances in $T$ that belong to class $j$. The entropy of $T$ after splitting on an attribute $X$ is the weighted sum of the entropies of the resulting subsets $T_v$:

        $\mathrm{ENTROPY}(T, X) = \sum_{v \in \mathrm{values}(X)} \frac{|T_v|}{|T|}\,\mathrm{ENTROPY}(T_v)$
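        As a quick check of the formula, the following small Python sketch (ours, for illustration) computes the entropy of a list of class labels:

```python
# Hypothetical sketch of ENTROPY(T) over a list of class labels.
import math
from collections import Counter

def entropy(labels):
    """ENTROPY(T) = -sum_j p_j * log2(p_j), with p_j the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 9 + ["no"] * 5))  # ~0.940: mixed sample, must be split
print(entropy(["yes", "yes"]))            # 0.0 (prints -0.0): homogeneous, a leaf
print(entropy(["yes", "no"]))             # 1.0: equally divided
```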

        Fig. 4.1. Example of a decision tree.

      2. Decision Tree Induction Algorithm

    Input: A dataset of XML data.

    Step 1: Calculate the entropy of the target.

    Step 2: The dataset is then split on the different attributes. The entropy for each branch is calculated and added proportionally to get the total entropy for the split. The resulting entropy is subtracted from the entropy before the split: the result is the information gain.

    Step 3: Choose the attribute with the largest information gain (Section 4.2.1) as the decision node.

    Step 4: Create branches from the decision node according to the splitting criteria (Section 4.2.2).

    Step 5: A branch with an entropy of 0 is a leaf node; it is not split further.

    Step 6: A branch with entropy greater than 0 needs further splitting.

    Step 7: The algorithm runs recursively on the non-leaf branches until all data is classified.

        1. Information Gain

          Information gain is calculated to select which attribute from the set of attributes becomes a node; in other words, it is used to choose the splitting attribute. Other methods to find the splitting attribute are the Gain Ratio, the Gini Index, and Minimum Description Length. In this paper we use information gain as the splitting criterion.

          $\mathrm{GAIN}(T, X) = \mathrm{ENTROPY}(T) - \mathrm{ENTROPY}(T, X)$

          where T is the target attribute and X is one of the attributes of the XML data. The information gain is based on the decrease in entropy after a dataset is split on an attribute. Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches); the attribute with the highest information gain becomes a node.
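          For illustration, here is a sketch of the gain computation over a toy record set; the entropy helper is repeated from the sketch above so the block runs on its own, and the field names are invented:

```python
# Hypothetical sketch of GAIN(T, X) over toy records.
import math
from collections import Counter

def entropy(labels):
    """Repeated from the entropy sketch above so this block is standalone."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target="dup"):
    """GAIN(T, X): entropy before the split minus the weighted entropy after."""
    total = entropy([r[target] for r in rows])
    n = len(rows)
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Invented toy data: does a matching 'name' predict duplicate status?
rows = [{"name": "same", "dup": "yes"}, {"name": "same", "dup": "yes"},
        {"name": "diff", "dup": "no"}, {"name": "diff", "dup": "yes"}]
print(information_gain(rows, "name"))  # ~0.311 > 0: 'name' is informative
```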

        2. Splitting Criteria

    The splitting criteria determine how branches are made out of a node, depending on the values of the node's attribute. If the values are discrete, the node is split according to those values. If the values are continuous, a split point is found, and branches are made for values greater than the split point and values less than the split point.

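    Putting the seven steps together with the information gain and splitting rules of Sections 4.2.1 and 4.2.2, the following is a compact recursive sketch of the induction loop. It is our illustration, not the paper's code: it assumes the entropy() and information_gain() helpers from the earlier sketches are in scope and handles discrete-valued attributes only.

```python
# Hypothetical ID3-style sketch of Steps 1-7 (discrete attributes only).
def build_tree(rows, attrs, target="dup"):
    """rows: list of dicts; attrs: candidate attributes; target: class key."""
    labels = [r[target] for r in rows]
    # Steps 1 and 5: a branch with entropy 0 (or no attributes left) is a leaf.
    if entropy(labels) == 0 or not attrs:
        return max(set(labels), key=labels.count)  # majority class label
    # Steps 2-3: the attribute with the largest information gain is the node.
    best = max(attrs, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    # Steps 4, 6, 7: branch on each value and recurse on the non-leaf branches.
    return (best, {v: build_tree([r for r in rows if r[best] == v],
                                 remaining, target)
                   for v in {r[best] for r in rows}})
```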

  5. BAYESIAN NETWORK MODEL

    Bayesian networks provide a graphical formalism to explicitly represent the dependencies among the variables of a domain, thus providing a concise specification of a joint probability distribution. This representation is based on a directed acyclic graph where a set of random variables makes up the nodes of the network and a set of directed links connects pairs of nodes. In this graph, an edge from one node to another means that the first has a direct influence on the second. This influence is quantified through a conditional probability distribution function correlating the states of each node with the states of its parents.

      1. Bayesian Network Construction Algorithm

        Input: Two XML trees u and u1, where u contains (t, v, c) and u1 contains (t1, v1, c1); t is the root, v is an (attribute, value) pair, and c is a sub-tree.

        Step 1: If t has a value node, create a node with label V and place it as the left child of t.

        Step 2: Place the attribute's value as a child of node V.

        Step 3: If t has children, create a node with label c and repeat Step 1.

        Step 4: If nodes are of the same type, create a node ac with as many children as there are nodes of that type in the left tree, and repeat Step 1.

        Step 5: The Bayesian network is constructed.
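        The steps above are terse, so the following loose structural sketch shows one reading of the comparison network they describe: each pair of XML nodes yields a network node, with a value-comparison child when both nodes carry values and one child per paired sub-tree. All names are ours, and Step 4's grouping of same-typed siblings is simplified to positional pairing.

```python
# Hypothetical structural sketch of the network-construction steps.
from dataclasses import dataclass, field

@dataclass
class BNNode:
    label: str                               # e.g. "person", "V(name)"
    children: list = field(default_factory=list)

def build_network(u, u1):
    """u, u1: (tag, value, subtrees) triples, as in the input description."""
    tag, value, subtrees = u
    tag1, value1, subtrees1 = u1
    node = BNNode(label=tag)
    if value is not None and value1 is not None:
        node.children.append(BNNode(label=f"V({tag})"))   # Steps 1-2
    for c, c1 in zip(subtrees, subtrees1):                # Steps 3-4, simplified
        node.children.append(build_network(c, c1))
    return node
```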

      2. Computing the Probabilities

        1. Prior Probabilities

          Prior probabilities can be defined based on a similarity function sim(·) between the values, normalized to fit between 0 and 1. However, it is sometimes not possible, or not efficient, to measure the similarity between two attribute values. In this case, we define the probability as a small constant ka, named the default probability, representing the possibility of any two values being duplicates before we observe them. Other available similarity functions include Edit Distance, Jaro-Winkler, Monge-Elkan, and SoftTF.IDF; choosing an efficient similarity function enhances both the effectiveness and the efficiency of the duplicate detection.
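          A minimal sketch of this rule follows; the default value used for ka is illustrative, since the paper does not fix one.

```python
# Hypothetical sketch of the prior: sim(a, b) when available, else ka.
def prior_probability(a, b, sim=None, ka=0.1):
    """sim: optional similarity function into [0, 1]; ka: default probability
    (the value 0.1 is illustrative, not taken from the paper)."""
    return sim(a, b) if sim is not None else ka
```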

        2. Conditional Probability

          Conditional probability 1 (CP1): the probability of the values of two nodes being duplicates, given the duplicate status of each individual pair of values: 1) if all attribute values are duplicates, we consider the XML node values as duplicates; 2) if none of the attribute values are duplicates, we consider the XML node values as non-duplicates; 3) if some of the attribute values are duplicates, we determine that the probability of the XML nodes being duplicates equals a given value, wa.

          Conditional probability 2 (CP2): the probability of the children nodes being duplicates, given the duplicate status of each individual pair of children. The more child nodes in both trees are duplicates, the higher the probability that the parent nodes are duplicates.

          Conditional probability 3 (CP3): The probability of two nodes being duplicates given that their values and their children are duplicates.

          Conditional probability 4 (CP4): the probability of a set of nodes of the same type being duplicates, given the duplicate status of each pair of individual nodes in the set.
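          CP1, for instance, translates directly into code. The sketch below follows the three-case rule stated above; the value of wa is left as a parameter because the paper only says it is a given value.

```python
# Hypothetical sketch of CP1's three cases.
def cp1(pair_is_duplicate, wa=0.8):
    """pair_is_duplicate: booleans, one per attribute-value pair.
    Returns P(node values duplicate | individual pair outcomes)."""
    if all(pair_is_duplicate):
        return 1.0   # case 1: every value pair matches
    if not any(pair_is_duplicate):
        return 0.0   # case 2: no value pair matches
    return wa        # case 3: partial match -> the given value wa
```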

        3. Final Probability

    Once all prior and conditional probabilities are defined, the Bayesian network can be used to compute the probability of two XML trees being duplicates, i.e., P(t), where t is the tag of the root node of both trees. The final probability is calculated as the product of the probabilities of the value node and the child nodes of the root. If the final probability exceeds the threshold value, the XML data is a duplicate; otherwise it is not.
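    As a sketch of this decision rule (the probabilities and threshold below are illustrative):

```python
# Hypothetical sketch of the final duplicate decision.
def final_probability(value_prob, child_probs):
    """P(root): product of the value-node and child-node probabilities."""
    p = value_prob
    for cp in child_probs:
        p *= cp
    return p

THRESHOLD = 0.5  # illustrative
print(final_probability(0.9, [0.8, 0.95]) >= THRESHOLD)  # True -> duplicate
```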

  6. NETWORK PRUNING

    To compute the final probability one needs to analyze the whole network and calculate the probabilities for every node. This process, which has a complexity of O(n*m), where n and m are the number of nodes in each XML tree being compared, can be time consuming, especially if we are dealing with a large network. However, when performing duplicate detection, we are usually interested only in objects whose duplicate probability is above a given threshold. This allows us to optimize the network evaluation process. In this section, we propose a novel strategy to reduce the time spent on the BN evaluation.

      1. Network Pruning Algorithm

        Input: A node N from the Bayesian network and a user-defined threshold value T.

        Step 1: Get the parent nodes of N, sort them, and assign a duplicate probability score of 1.

        Step 2: If a parent node is a value node, its probability score is its similarity value.

        Step 3: Multiply the probability score of the value by the assigned children's probability (1) and assign this value to the current score.

        Step 4: If the current score is less than T, stop the network evaluation.

        Step 5: Otherwise, if the current score is greater than T, compute recursively with the new threshold value.

        Step 6: The output is the current score.

        Note: new threshold value = T / current score.
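        The property the algorithm exploits is that every factor lies in [0, 1], so the running score can only shrink: once it drops below the threshold, no later comparison can recover it. Below is a minimal sketch of this early-stopping idea, simplified (by us) to a flat list of value pairs:

```python
# Hypothetical sketch of threshold-based pruning with early stopping.
def pruned_score(pairs, sim, threshold):
    """pairs: value pairs, pre-sorted by a heuristic from Section 6.2;
    sim: similarity function into [0, 1]. Returns (score, was_pruned)."""
    score = 1.0                  # Step 1: optimistic initial score
    for a, b in pairs:
        score *= sim(a, b)       # Steps 2-3: fold in one similarity value
        if score < threshold:    # Step 4: the score can only shrink -> stop
            return score, True
    return score, False          # survivor: a candidate duplicate pair
```

        Because each factor is at most 1, pruning never changes which pairs clear the threshold; it only skips comparisons whose outcome cannot matter.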

      2. Sorting Nodes

    By choosing the appropriate order in which to evaluate the nodes, we can ensure that the algorithm makes the minimal number of calculations before deciding whether a pair of objects should be discarded. We propose three such heuristics: sorting by depth, by average string size, and by distinctiveness. Each of these heuristics corresponds to a different way of ordering the nodes in Step 1 of the network pruning algorithm, as explained in the following.

    The most important information is usually stored in nodes that are placed closer to the root, while nodes with less distinctive power are stored at deeper levels. Therefore, by ordering the nodes according to the depth of the branch to which they belong, we cause the more distinctive nodes to be evaluated first. The second heuristic evaluates first the nodes whose values have a smaller average string length. The idea is simply to perform the cheaper comparisons first, expecting non-duplicate nodes to be discarded before the longer strings have to be compared. We should note that, since shorter strings are more likely to be similar than longer strings, this node ordering could delay reaching the cut-off threshold. However, we expect the cheaper string comparisons to compensate for the increase in the number of compared strings.

    The final heuristic consists in sorting the nodes according to their distinctiveness. We define the distinctiveness of a node n as log(A/dn), where A is the total number of objects containing node n and dn is the number of distinct values of node n. This measure follows the idea that nodes with a high number of distinct values are less likely to be similar and thus should be evaluated first. We note that the overhead introduced by the precomputation of these heuristics is negligible.
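    As an illustration of the distinctiveness heuristic (the statistics below are invented): under our reading, a node whose values are almost all distinct has A/dn close to 1 and hence a score near 0, so sorting in ascending order evaluates the most distinctive nodes first.

```python
# Hypothetical sketch of the distinctiveness ordering.
import math

def distinctiveness(A: int, d_n: int) -> float:
    """log(A / d_n): A = objects containing node n, d_n = distinct values of n."""
    return math.log(A / d_n)

# Invented per-node statistics: label -> (A, d_n).
stats = {"name": (1000, 950), "gender": (1000, 2)}
order = sorted(stats, key=lambda n: distinctiveness(*stats[n]))
print(order)  # ['name', 'gender']: the highly distinctive 'name' comes first
```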

  7. PERFORMANCE EVALUATION

    The performance evaluation is done on different sets of XML data. The efficiency of our modified XMLDup is measured, compared with the previous method DogmatiX, and shown in graphical form as follows.

    Fig. 7.1. Comparison of DogmatiX and modified XMLDup.

    Fig. 7.1 compares the performance of DogmatiX and the modified XMLDup. Our proposed system, the modified XMLDup, outperforms the previous method DogmatiX in terms of efficiency; that is, our proposed method runs faster than DogmatiX.

    Our proposed method runs faster than DogmatiX because the decision tree induction algorithm eliminates exactly matching XML data and data with missing information, thus avoiding unnecessary construction of the Bayesian network, which is then further reduced by network pruning. Even though the decision tree induction algorithm adds some overhead, it still increases the overall performance of the duplicate detection.

  8. CONCLUSION

A decision tree is induced from the existing XML data, and new XML data is matched against the paths of the tree; the nodes along a path are matched with the values of the XML data. If the values match exactly, the new tuple is not stored, which optimizes memory. The classified XML data is then used to construct the Bayesian network, from which the prior, conditional, and final probabilities are calculated. Because evaluating the probabilities of every node reduces efficiency, network pruning is used to increase it, and the sorting of nodes done in the pruning process reduces the computation time, further increasing efficiency.

Future work includes implementing XML duplicate detection with other machine learning algorithms to further increase the efficiency and effectiveness of the duplicate detection.

REFERENCES

[1]. M. Weis and F. Naumann, DogmatiX Tracks Down Duplicates in XML, Proc. ACM SIGMOD Conf. Management of Data, pp. 431-442, 2005.

[2]. A.M. Kade and C.A. Heuser, Matching XML Documents in Highly Dynamic Applications, Proc. ACM Symp. Document Eng. (DocEng), pp. 191-198, 2008.

[3]. L. Leitao, P. Calado, and M. Weis, Structure-Based Inference of XML Similarity for Fuzzy Duplicate Detection, Proc. 16th ACM Intl Conf. Information and Knowledge Management, pp. 293-302, 2007.

[4]. S. Guha, H.V. Jagadish, N. Koudas, D. Srivastava, and T. Yu, Approximate XML Joins, Proc. ACM SIGMOD Conf. Management of Data, 2002.

[5]. D. Milano, M. Scannapieco, and T. Catarci, Structure Aware XML Object Identification, Proc. VLDB Workshop Clean Databases (CleanDB), 2006.

[6]. P. Calado, M. Herschel, and L. Leitao, An Overview of XML Duplicate Detection Algorithms, Soft Computing in XML Data Management, Studies in Fuzziness and Soft Computing, vol. 255, 2010.

[7]. S. Puhlmann, M. Weis, and F. Naumann, XML Duplicate Detection Using Sorted Neighborhoods, Proc. Conf. Extending Database Technology (EDBT), pp. 773-791, 2006.

[8]. L. Leitao and P. Calado, Duplicate Detection through Structure Optimization, Proc. 20th ACM Intl Conf. Information and Knowl- edge Management, pp. 443-452, 2011.

[9]. J.C.P. Carvalho and A.S. da Silva, Finding Similar Identities among Objects from Multiple Web Sources, Proc. CIKM Workshop Web Information and Data Management (WIDM), pp. 90-93, 2003.

[10]. L. Chen, L. Zhang, F. Jing, K.-F. Deng, and W.-Y. Ma, Ranking Web Objects from Multiple Communities, Proc. 15th ACM Intl Conf. Information and Knowledge Management, pp. 377-386, 2006.
