- Open Access
- Total Downloads : 291
- Authors : Suchitra Reyya, Sateesh Gudla, M. Prasanna Shivani
- Paper ID : IJERTV2IS3650
- Volume & Issue : Volume 02, Issue 03 (March 2013)
- Published (First Online): 26-03-2013
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
An Efficient Extension Of Conditional Functional Dependencies Using Nested Relational Database
Suchitra Reyya1, Sateesh Gudla2, M.Prasanna Shivani3
1Assistant Professor, Department of CSE,LIET,,Vizianagaram.
2Associate Professor, Department of CSE, LIET,Vizianagaram.
3Department of IT, LIET, Vizianagaram.
Abstract
This paper propose an efficient data cleaning by using extended Conditional Functional Dependencies (eCFDs), which is an extension of Conditional Functional Dependencies(CFDs). eCFDs intend to solve the multi-valued inconsistencies to trounce drawbacks of CFDs which use pattern tableau to hold individual tuples in a table for cleaning relational data by supporting only single valued attributes. SQL techniques are used to create patterns of semantically related values for detecting single tuple CFD violations. By introducing a query and an algorithm we provide better competence for eliminating data redundancy in multi-valued attributes using nested relational database. We experimentally analyze the efficiency and performance of these eCFD based techniques in improving data quality and displays the tentative results graphically.
Keywords: CFDs, eCFDs, Nested relational database.
-
Introduction
Data warehouse consists of huge amount of data integrated from different database sets where, the chances of data being redundant and inconsistent are highly possible. If these warehouses are not frequently cleaned, the inconsistent data can lead to mis conceptions there by making the cleaning process more complex [7]. Data redundancy implies that different tuples may represent the same data which causes anomalies and data corruption. Thus data cleaning plays a prominent role in preprocessing. One way of achieving this is by using functional dependencies. In FDs data redundancy is reduced by enforcing integrity constraints at schema level. But the extension of FDs
called CFDs were developed which aim at capturing violations for individual tuples by using the concept of pattern tables [1].When these pattern tables are implied on the entire database the tuples which violate the pattern tables are detected and hence can be easily corrected. The downside with CFDs is that they hold for only single value attributes but not for multivalued attributes [2].
This paper deals with eCFDs for improving data quality which are extended from CFDs to trounce the problem of multi-valued attributes by using nested relational database [5][6]. eCFDs are capable of capturing inconsistencies using nesting.
This paper is organized into different sections where section 1 gives a brief introduction to data cleaning concept, section 2 thrash out related work, section 3 pioneer how to purge redundancy, section 4 explains methodology to reduce redundancy by introducing algorithm, section 5 displays experimental results and section 6 holds the conclusion.
-
Related Work
Relations that contain redundant information may potentially endure from update anomalies.Functional dependencies are used to detect the redundancies by enforcing integrity constraints at schema level [1]. The constraints hold for all the tuples in the table. The main idea behind FD is that the value of a particular attribute uniquely determine the value of some other attribute [8] A relation A B if for every valid instance of A, that value of A uniquely determines the value of B.
Definition (Conditional Functional Dependencies): Refinement of FDs termed as CFDs capture inconsistencies that traditional FDs cannot detect. They are not expressible as traditional FDs because CFDs do not hold not entire relation, instead it holds on tuples matching the conditions specified in the pattern table [1].
A CFD on R is a pair (R : A B, Tp), where (i) A, B are sets of attributes from relation (R), (ii) R : A B is a customary FD, rooted in ; and (iii) Tp is a tableau with all attributes in A and B , where for each X in A or B and each tuple t Tp, t[X] is either a constant a , or an unnamed variable – .
Given an instance I of a relation schema R and a set of CFDs on R, it is to find all the tuples that violate some CFD in . The query below is an SQL technique for detecting violations of a CFD [2].
Select distinct t.A
from R t, Tp tp
where t[A1] = tp[A1] AND . . . AND t[An] = tp[An]
group by t.A having count(distinct B)> 1
This query groups tuples with the same value on attribute A and it counts the number of distinct values in B. The query returns inconsistent tuples in the database if there is more than one value for B.
-
Eliminating Redundancy using Nested Relational Database
This paper proposes eCFDs which hold for multi value data. This is achieved using nested database [6]. The overall design of the system is shown below.
DB
Verification of eCFDs for multi value attributes using T
Checking for fixed patterns using Tp
Checking for fixed patterns using Tp
Update violated tuples with correct data
Figure 1: A model for achieving eCFds
The model architecture describes the step by step flow of process carried to implement eCFDs. In the first step the data in the database is compared with table Tm to check for multi value inconsistencies. In the second step if the inconsistencies are found then it compares each tuple with matching fields from the database and table Tm with pattern table Tp. In the third step finally the incorrect tuple is updated with matching value from the multi tableau and saved in database. Defnition (eCFD): Let R be some relational schema given as (R:XY, Tp,Tm) where (1) X,Y,Tp,Tm attr( R ); (2) XY is an embedded FD; (3) Tp is a pattern tableau consisting of finite number of pattern tuples; (4) Tm is a multi tableau holding multiple values of Y for each X attribute.
Example 1.1: Let us consider a schema cust(AC ,PN
,NM, STR, CT, ZIP). It specifies a customer relationship in terms of the customers phone (area code (AC), phone number (PN)), name (NM), and address (street (STR), city (CT), zip code (ZIP)). An instance D0 of cust is shown below.
Table 1: Customer Database (CD)
AC
PN
NM
STR
CT
ZIP
718
1111111
Mik
TreeAve
Albany
12238
518
2222222
Joe
Elm Str
Colonie
12205
100
2222222
Jim
Oak Str
Troy
12181
212
3333333
Ben
5th Ave
NYC
10001
646
4444444
Ian
High St
NYC
10011
0891
2552705
Vani
Eenadu
VSP
530013
08922
2342345
Rani
Boypalm
VSP
530012
345
6784563
Pete
Troy
Ohio
23124
0890
1233211
Raja
Eenadu
VSP
530045
In the above table we can observe that the city NYC and VSP has ifferent area codes (AC).
1: CT {NYC} with AC {212, 646, 347, 917}.
2: CT {VIZAG} with AC {0891, 08922}.
Hence in 1 if the city is NYC then the area codes must be one of the following values. Similarly in 2 if the city is VIZAG the area code must be one of the above given values. Here CT is associated with disjunction of options rather than single value. But in the above example tuple t3 contains CT as NYC but AC is 100 which is not one of the area codes associated with NYC and tuple t9 has CT as VSP but AC is 0890. This problem is not overcome in CFDs as it does not hold multi-valued attributes. This multi-valued attribute problem can be solved with by less complexity by taking two tables Tm & Tp.
TABLE Tm: Tm is a multi tableau holding all the tuples consisting of multi- value of Y for a single attribute X. The Tm table for the given customer (cust) instance is shown below.
Table 2: multi tableau-Tm
CT
AC
NYC
{212,646,347,917}
VSP
{0891,08922}
Albany
{718}
Now to correct the area code in tuple t4 appropriate area code must be selected from multiple values of AC in table Tm. So we use the pattern table Tp which consists of fields, street(STR) and area code(AC) to select the area code that match the fields in the tuple. The pattern tableau for the cust relation is shown
from cust C, table1 Tm where (Tm.CT < > '@'
AND Tm.AC[i.in] < > '@') AND (C.CT = Tm.CT
AND C.AC < > Tm.AC[i..in]);
Once the violations are displayed the area code must be corrected with the appropriate one from the multiple area codes.
The below query is used to update the area code by considering customer relation table (C ), multi tableau (Tm ) and pattern tableau ( Tp) where the area code of Tm matches with that of Tp and the street in customer relation matches with that of Tp.
update
cust set
C.AC = (select Tp.AC
from table2 Tp, table1 Tm, cust C
below.
STR
AC
5th Ave
212
8th Ave
212
High.St
646
Eenadu
0891
Boypalem
08922
STR
AC
5th Ave
212
8th Ave
212
High.St
646
Eenadu
0891
Boypalem
08922
Table 3: Pattern Tableau-Tp
where Tp.AC = Tm.AC[i.in]
AND C.STR = Tp.STR);
The following Algorithm can be used to explain the methodology to reduce redundancy using nesting.
Algorithm : Eliminating inconsistent tuples in multi-
valued attribute using nested relational database
Input: Inconsistent multi-valued database (CD)
Output: Clean database(CCD)
Method: Solving multi-valued inconsistencies using Nested Relational database
-
Begin
From cust relation we observe that in the tuple t4, street is 8th AVE and city is NYC which uniquely determines area code as 212. Thus now the violation showed in tuple t4 can be eliminated by correcting it with area code as 212 using these two tables.
-
-
Methodology to Reduce Redundancy
We introduce a query for solving multi value inconsistencies in a simplified way with less complexity.
This query shows the violated tuples by considering customer relation table(C) and multi tableau(Tm) where the citys area code (AC) does not match with the disjunction of options present in the multi tableau Tm.
select ALL
-
Consider a table CD with different fields.
-
Take a new table multi tableau(Tm) having multiple values for attribute X ie. X [i..in].
-
For each tuple in table CD having attr X, compare it with X [i…in] of table Tm.
-
If comparision successful then goto step
15
-
Else
-
Display tuples not satisfying step 4.
-
End for
-
Consider a pattern tableau Tp having prefixed constants and unnamed variables.
-
For updating the inconsistent values shown in step 7, for each tuple in table CD compare Tm.X [i.in] with Tp.X and CD.Y with Tp.Y respectively.
-
If comparision successful then
-
Update attr X with tp.X
-
End if
14. End for
-
Original database (CD) is updated as Clean Database (CCD)
-
End
Now after applying the above algorithm on TABLE 1, the consistencies in the table will be replaced by the accurate data, thereby making the database clean and error free. The below is the clean database (TABLE 4) obtained as a result of the algorithm. The AC in tuple t4 is replaced with 212 and similarly tuple t9 with erroneous AC is replaced with appropriate area code (AC) as 0891.
Table 4: Cleaned Customer Database (CCD)
AC |
PN |
NM |
STR |
CT |
ZIP |
718 |
1111111 |
Mik |
TreeAve |
Albany |
12238 |
518 |
2222222 |
Joe |
Elm Str |
Colonie |
12205 |
212 |
2222222 |
Jim |
Oak Str |
Troy |
12181 |
212 |
3333333 |
Ben |
5th Ave |
NYC |
10001 |
646 |
4444444 |
Ian |
High St |
NYC |
10011 |
0891 |
2552705 |
Vani |
Eenadu |
VSP |
530013 |
08922 |
2342345 |
Rani |
Boypalm |
VSP |
530012 |
345 |
6784563 |
Pete |
Troy |
Ohio |
23124 |
0891 |
1233211 |
Raja |
Eenadu |
VSP |
530045 |
-
Experimental Analysis
In this section, we present our findings about the performance of our models for reducing redundancies over multi-value databases.
Setup: For the experiments, we used postgre SQL on Windows XP with 1.86 GHz Power PC dual CPU with 3GB of RAM.
Efficiency of eCFDs: The efficiency of the eCFDs is graphically displayed by considering the multi-value database. When CFDs are posted on this database all tuples having multiple values for each attribute are changed to only one value and obtains duplicate data. But in eCFDs each attribute can hold multiple data thereby preserving the data.
Figure 2: Graph displaying the efficiency of eCFDs
Size of the Relation: In this experiment we have analyzed how the redundancy is reduced by considering the space occupied by the pattern table (TP) and multi tableau (Tm) in CFDs and eCFDs respectively. It is observed Tp can hold only one value for each fixed pattern where as Tm can efficiently hold multiple values for each attribute in the pattern thus reducing tuple size
Table 6: Comparison of size of tuples in Tp (CFDs) & Tm( eCFDs)
No. of multiple values
Tp
Tm
Pattern 1
1
2
Pattern 2
3
Pattern 3
1
4
No.of multiple values
No.of multiple values
5
4
Table 5: Comparison of efficiency in CFDs & eCFDs
No. of reduced tuples CFDs eCFDs
Tuple variation1 7 10
Tuple variation2 14 20
Tuple variation3 27 30
Tuple variation4 40 50
Table 5: Comparison of efficiency in CFDs & eCFDs
No. of reduced tuples CFDs eCFDs
Tuple variation1 7 10
Tuple variation2 14 20
Tuple variation3 27 30
Tuple variation4 40 50
50
Number of tuples
Number of tuples
40
30
20
10 CFD
eCFD
0
3
2 T
1 T
0
Pattern1 Pattern2 Pattern3
No.of Patterns
Number of reduced tuples
Figure 3: Graph displaying size of relation
-
CONCLUSION
We presented that nesting based eCFDs proved to be better than CFDs in eliminating redundancy in nested relational database. eCFDs acquire no extra complexity in inconsistency detection. We have developed SQL based technique and algorithm for eCFD violation detection. Our experimental results for efficiency and size of the relation proved eCFDs to be more effective compared to CFDs.
-
REFERENCES
-
Philip Bohannon, Wenfei Fan, Floris Geerts and Xibei Jia3 Anastasios Kementsietsidis3: Conditional Functional Dependencies for Data Cleaning
-
Wenfei Fan, Floris Geerts , Laks V.S. Lakshmanan and Ming Xiong : Discovering Conditional Functional Dependencies, IEEE conference on data engineering
-
Sangeeta Viswanadham, V. Valli Kumari: Eliminating Data Redundancy using Nesting based wMVD, IEEE Conference Publicat ions, Elect ronics Computer Technology (ICECT), 2012 4th Internat ional Conference, April (2012).
-
Sangeeta Viswanadham, V. Valli Kumari: Eliminating Data Redundancy, Cube (2012)
-
Suchitra Reyya, Sangeeta Viswanadham: Eliminating Data Redundancy through Weak Multivalued Dependencies using Nesting
-
Suchitra Reyya, M. Sundara Babu, Ratnam Dodda: An Efficient Storage of Spatial Database using Nested Relational Database , International Journal of Engineering Research & Technology (IJERT) Vol. 1 Issue 7, September 2012
-
Erhard Rahm, Hong Hai Do: Data Cleaning Problems and Current Approaches, IEEE Techn. Bulletetin on Data Engineering, Dec. (2000)
-
Korth, H., Roth, M.: Query languages for Nested Relational Databases, Nested Relations and Complex Objects in Databases, Springer, (1991).