A Distributed Mining Of Association Rules On Horizontally Partitioned Data By Preservingprivacy

DOI : 10.17577/IJERTV1IS8118

Download Full-Text PDF Cite this Publication

Text Only Version

A Distributed Mining Of Association Rules On Horizontally Partitioned Data By Preservingprivacy

Krishna Sowjanya.K#1, R.Srinivas*2

Computer Sceince and Engineering, Jawaharlal Nehru Technological University Kakinada.

1Krishna Sowjanya.K, Student, M. Tech., Sri SaiAdityaInistitute of Science &Technology, Kakinada, India

*R.Srinivas, M.Tech, (Ph.D)

HOD_CSE.Dept, Sri SaiAdityaInistitute of Science &Technology, Kakinada, India

Abstract:

Data mining on large databases has been a major concern in research community. It extracts important knowledge from huge amount of data. Sometimes these data may be split among various parties. Privacy issues may prevent parties from sharing data or information about data directly. This paper addresses mining of association rules by creating privacy to the data of individual databases. In this scenario we consider two parties having confidential data and running a mining algorithm on the union of their databases without revealing any unnecessary information. Here we need to protect both privileged information and enable its use for research or other purposes.

Index Term

Data Mining, FDM , Support, Confidence, Security, Privacy.

Introduction

Data mining technology identifies patterns from large quantities of data. Most data mining tools operate by running algorithm on data which was gathered into a centralized site.

rules over such data by reducing information shared on each site.

Deriving association rules without revealing individual information of sites can be simply done by computing global support and confidence of an Association rule AB=>C knowing only local support of AB &ABC and size of each database. It doesnt involve in sharing any individual data.

Thus, it can be done by extending Apriori algorithm to distributed case using following lemma.

  1. If a rule has support >S% globally, it must have support>S% on at least one of individual sites.

    This algorithm works as follows:

    1. Request each site to send all rules with support at least S.

    2. For each rule given by site, request other sites to send count of their transaction that support the rule and total count of all transactions at the site.

    3. From this rules with support s can be found by computing following lemmas.

_

However, privacy concerns may not allow us to build centralized warehouse because data is distributed among several sites where data is not allowed to transfer from one site to another. We assume homogenous databases in all sizes i.e; all sites have same schema but each site has information on different entities. This paper addresses mining of association

Support AB=>C =

Support AB =

=1

_()

=1

_

=1

_()

=1

Confidence AB=>C=

SupportAB=>C

Support AB

Finding global support/ confidence values:

The above procedure protects individual data, but it reveals about the rule it support which may be sometimes sensitive data. For example, Industry collaborations/ Trade groups may want to identify best practices to help members. But, some practices are trade secrets. How do we provide results to all while preserving secrets? This paper addresses a solution for not revealing sensitive data. We consider more than two parties here.

Overview:

Private association rule mining follows the method of passing the values to local data mining sites instead of centralized warehouse. There are two phases in this method.

  1. Finding candidate item sets.

  2. Finding which item set has global support/ confidence values.

Finding candidate item sets

In this phase, each locally supporteditemsets is tested to see if it is supported globally. In the figure,the itemset ABC is known to be supported at one or more sites,and each computes their local support. The first site chooses arandom value R, and adds to R the amount by which its supportfor ABC exceeds the minimum support threshold. This value ispassed to site 2, which adds the amount by which its supportexceeds the threshold. This is passed to site three, which again addsits excess support. The resulting value (18) is tested using asecure comparison to see if it exceeds the Random value (17).If so, itemset ABC is supported globally.

Diagrammatically represented as follows:

R=17

Site 1

In this phase we use commutative encryption. Here each site encrypts its frequent item set and are passed to a common party to eliminate duplicates and were decrypted. This item set is passed to each party and each party decrypts each item set which finally gives us common item set. Diagrammatically shown as follows:

18 R?

ABC:Yes!

Site 2

ABC: 6 DBSize=200

13+20-5%*300

ABC: 5

DBSize

13

R+count- 5%*DBsize

=17+5-5%*100

17

Site 3

ABC: 20

DBSize = 300

17+6-5%*200

E3(E1(C))

E2(E3(D)) Site 1 E (C)

1

C

E3(C) E3(D) C,D

Fig. 2. Determining if itemset support exceeds 5% threshold

Background and Related work:

Privacy preserving has been used in several fields. It mainly addresses two issues

  1. Privacy by data distortion

  2. Building Decision tree.

    In data distortion, it does not reveal information

    E2(E3(E1(C)))

    Site 2 site 3

    D C

    Fig1: Determining global candidate itemsets

    and thus data is safe for mining. Here distorted data and information on distribution of data can be used to generate an approximation to original distribution rather than original values.

    So, we can mine on distorted data directly. Recently, data distortion is applied to Boolean association rules and also used for modifying data values and reducing reconstruction effort by simply mining on distorted data.

    In building decision trees, aim is to build ID3 decision tree where training set is distributed between two parties. The main idea is to find an attribute having maximum information gain and minimum conditional entropy.

    Association rules mining

    The association rule mining problem is defined as follows. Let I = {i1,i2, . in}be set of items. Let DB is a set of transactions where each transaction T is an item set such that T cI. given an item set X c I, a transaction T contains X if and only if X cT. An association rule is of the form X=>Y has support S in the transaction database DB if S% of transaction in DB contain X U Y. The association rule in DB with confidence C if C% of transactions contain X and also Y. So, mining association rules is finding all rules whose support and confidence are higher than user specified support and confidence.

    We consider association rules in this paper as follows:

    Distributed mining

    =1

    Let us assume that a transaction DB is horizontally partitioned among n sites namely (S1,S2, . Sn) where DB = (DB1 U DB2 U DBn) and DBi is Si( 1 i n). An item set X has local support count X.supiof the transaction contain X. the global support count is given by X.sup= X. supi . Now, an item

    1. Generating candidate sets: Using classic Apriori candidate generation algorithm candidate sets are generated CGi(k) based on GLi(k-1), itemsets supported by Si at i iteration.

    2. Local Pruning:For each item set XCGi(k), compare withDBi in site Si to compute support X.supi. If X is locally large Si it is included in LLi(k) set.

    3. Exchanging support counts: LLi(k) is sent to each site and local support is computed for items and are unionedUiLLi(k).

    4. Distributing results: Each site sends its local support in UiLLi(k). From this we can compute L(k).

Where,

L(k) set of large item sets consists of k- item sets that are globally supported.

LLi(k) locally large item sets consists k- item sets supported locally at site Si.

GLi(k) – globally large k-item sets locally supported at site Si.

Secure Association Rule Mining

We will now construct distributed association rule mining algorithm that provides privacy of site data. Here we consider two or more parties. Assuming no collision, let us consider i 3 be number of sites. Each site has a database DBihaving support S% and confidence C%. Our goal is to discover all association rules satisfying thresholds. Along with this, no site should learn contents of other site i.e; rules or support and confidence values. For this we follow the methodology as follows.

Methodology:

Our procedure follows general FDM

set is globally supported if algorithm with special protocols. First we

=1

X.sup S x ( ||) . So, global union locally supported item set without

confidence rule is given for X=>Y is

{XUY}.sup/X.sup.

For distributed association rule mining, a fast distributed mining (FDM) of association rules were given and it follows the procedure as follows.

revealing originator of particular item set. Secondly, a method for securely testing if support count exceeds the threshold

  1. Locally large item sets union :FDM algorithm reveals large item sets supported by each site. For not revealing it we exchange

    locally large item sets in a way that source of each item is unknown. For this we assume commutative encryption algorithm with no collision probability.

    The main idea is each site encrypts locally supported item sets along with enough fake item sets to hide the actual number supported. Each site then encrypts item sets from other sites, then these item sets are merged in phase 2 and 3 as discussed below and duplicates are deleted.

    For union of locally large item sets, the protocol is given as follows.

    Protocol1Findingsecureunionoflargeitemsets ofsizek

    Require:N 3sitesnumbered1..N 1,setF ofnon-item sets.

    //Encryptionofalltherulesbyallsites

    foreachsiteido

    generateLLi(k)as in steps1and2 oftheFDMalgorithm

    LLei(k)=

    for eachXLLi(k) do

    LLei(k)=LLei(k){Ei(X)}

    endfor

    for j=|LLei(k)|+1to|CG(k)| do LLei(k)=LLei(k){Ei(randomselecti onfromF)}

    endfor endfor

    //Encryptionbyallsites

    for Roundj=0toN1 do ifRoundj=0then

    Each siteisends

    permutedLLei(k)tosite(i+1)modN

    else

    Eachsite iencryptsall itemsin

    LLe(ijmodN)(k) withEi,permutes,andsendsittosite(i+1)m odN

    Site N-1e broadcastaRuleSet(k) to sites 0.. N-2

    n

    d

    iIn protocol F represents the data

    that canf be used as fake itemsets. |LLei(k)|

    endfor

    //Mergeodd/evenitemsets EachsiteisendsLLei+1modN tositeimod2

    represents the set of encrypted k item sets at site i. Ei is the encryption and Di is the decryption by site i.

    Clearly, protocol 1 finds the union without revealing which item set belongs to which site. However, it reveals number of

    Site 0 sets RuleSet1

    [(1)/2]

    =

    =1

    LLe(2j-1)(k)

    itemsets having common support between

    sites i.e; reveals information which is less

    =1

    Site 1 sets RuleSet2 = [(1)/2] LLe(2j)(k)

    //Mergeallitemsets Site1sendspermutedRuleSet1tosite0 Site0setsRuleSet=RuleSet0 URuleSet1

    //Decryption

    for i=0toN1 do SiteidecryptsitemsinRuleSetusingDi SiteisendspermutedRuleSettositei+1m odN

    endfor

    Site N-1 decrypts items in RuleSet using DN-

    1

    RuleSet(k)= RuleSet-F

    harmful.

    The phases of above protocol are discussed as follows:

    Phase0: No communication occurs here. Each site runs the algorithm on its own input.

    Phase1: Firstly, each site computes LLei-1(k). The size of this set is size of global candidate set CG(k), known to each site. So, a site can produce set using a uniform random number generator.

    Phase2: In this phase, each site gets fully encrypted sets of itemsets from other even sites. If there are k item sets known to e common among [N/2] odd sites, generate

    k random numbers and put them into LLei(k). Repeat for each [N/2]-1 subset. Then fill each LLei(k) with randomly choosen values until it reaches size |CGi(k)|. The same is done for site1.

    Phase3: Iftherearekitemsincommonbetweeneven andoddsites,site0selectskrandomitemsfromR uleSet0 andinsertsthemintoRuleSet1.RuleSet1 isthenfilledwith randomlygeneratedvalues.Sincetheencryptio nguarantees thatthevaluesarecomputationallyindistinguish ablefroma uniformdistribution,andthesetsizes|RuleSet0

    |,|RuleSet1|,

    and|RuleSet0RuleSet1|(andthus|RuleSet|

    )areidentical in the simulation and real execution. Hence proves security in this phase.

    Phase4: Each site sees only the

    andencryptsitwithEN1.TheseareplacedinR uleSet.RuleSetisthen filledwithitemschosenfromF,alsoencryptedw ithEN1. SincethechoiceofitemsfromF israndominboththerealandsimulatedexecutio n,andtherealitemsexactlymatch intherealandsimulation,theRuleSetsiteN1r eceivesinPhase4iscomputationally indistinguishablefromthereal execution.

    Thus above protocol is privacy preserving with above assumptions.

  2. Support threshold testing without revealing support count: The above algorithm gives complete set of locally large item sets LL(k). From these we have to find out item set swhich are globally supported. For this, we have to know if X.sup> S% * |DB| for each item set X LL(k). to compare against a sum of local values we should follow the following lemmas:

    =1

    encrypted items after decryption by the preceding site. Some of these may be

    X.sup S * |DB| = S * (

    ||)

    identical to items seen in phase2, but since

    .sup S * (

    ||)

    all items must be in the union this reveals

    =1 i

    =1

    nothing.Thesimulatorforsitei isbuiltasfollows:takethevaluesgeneratedinPh ase2

    =1

    (. ||)) 0

    stepN1i,andplacethemintheRuleSet.Then insertrandomvaluesinRuleSetuptotheproper size(calculated asinthesimulator forPhase3).Thevalueswehavenotseen beforearecomputationallyindistinguishablefr omdatafroma uniform

    distribution,andthesimulator includes thevalueswe haveseen(andknewwouldbethere),sothesimul atedview iscomputationallyindistinguishable fromtherealvalues.

    The simulator for site N-1 is different, since it learns RuleSet(k).TosimulatewhatitseesinPhase4,s iteN

    1takeseachiteminRuleSet(k),thefinalresult,

    So checking for support is same as

    =1

    checking if (. ||))

    0. But it should be done without revealing X.supior |DBi|. That can be done by following the below algorithm.

    Algorithm for securely finding global support counts Input:N3sitesnumbered0..N1,m2|D B|

    ruleset=

    atsite0:

    for eachrcandidateset do

    choose random integer xrfrom a uniform distribution over 0.. m-1;

    t =r.supis|DBi|+xr (modm); ruleset=ruleset{(r,t)};

    endfor sendrulesettosite1; for i=1toN2 do

    for each(r,t)ruleset do

    .

    .

    C =>

    = .

    =1 C

    =1

    = .

    t¯=r.supis|DBi|+t (modm);

    = . C * = .

    ruleset=ruleset{(r,t)}{(r,t¯)} ;

    endfor

    =1

    =1

    sendrulesettositei+1;

    endfor

    atsiteN-1:

    foreach(r,t)ruleset do t¯=r.supis|DBi|+t (modm); securelcomputeif(t¯xr) (modm)<m/2withthe

    site0;{Site0knowsxr}

    if (t¯xr) (modm)<m/2then multi-castrasagloballylargeitemset.

    endif endfor

    In this algorithm for each item set x.

    each site chooses a random number xr and adds it to support . || and sends to next site. So the actual value was hidden by adding this random number to the original value. Similarly second site also

    adds its own random value and sends to other sites and so on. It can be written as

    = . . ) 0

    =1

    Since each site knows XY.supiand X.supi, we can easily use algorithm2 to securely calculate the confidence of a rule.

    Conclusion and Further work

    The main contribution of this paper is proposing a general framework to privacy preserving mining of association rules. We have given procedures to mine distributed association rules on horizontally partitioned data and also that privacy concerns will increase for mining of data. We also shown that our algorithms works with less communication complexity. It is possible to mine globally valid results from distributed data without revealing information that having private data of individual sources.

    =1

    (. ||)) + xr (mod

    In the future research, a common

    framework with more formal and reliablefor

    m).since |DB|m/2 we may also get negative values. Since no values are known to other sites, this method of considering random value is secure.

  3. Finding confidence of a rule securely:To find if the confidence of a rule X => Y is higher than the given confidence threshold C, we have to check if

. C. Algortihm2 only reveals if an

.

item set is supported, it does not reveal the

privacy preservation will enable next generation data mining technology tomake substantial advances in alleviating privacy concerns.

References

[1] R.Agrawaland R.Srikant,Fastalgorithms forminingassociation rules,inProceedingsofthe20th InternationalConferenceonVery LargeDataBases. Santiago, Chile: VLDB,Sept. 12-151994, pp.487499.Available: http://www.vldb.org/dblp/db/conf/vldb/

support count. The following equations

[2]

vldb94-487.

D.W.-L.Cheung,V.Ng,A.W.-

show how to securely compute if confidence exceeds a threshold using algorithm2. The support of {X U Y}.supiis denoted as XY.supi.

C.Fu,andY.Fu,Efficientmining ofassociationrulesindistributeddatabases,IEEETr ans.KnowledgeDataEng.,vol.8,no.6,pp.911 922,Dec.1996.

[3] Privacy-preserving Distributed Mining ofAssociation Rules on Horizontally Partitioned DataMurat antarcoglu and Chris Clifton, Senior Member, IEEE

[3] R.AgrawalandR.Srikant,Privacy- preservingdatamining,in

[4]

Proceedingsofthe2000ACMSIGMODConference onManagement ofData.

Dallas,TX:ACM,May14-192000,pp.439

450.[Online].Available:http://doi.acm.org/10.114 5/342009.335438

D.AgrawalandC.C.Aggarwal,Onthedesignandqu antification of privacy preserving data mining algorithms, in

ProceedingsoftheTwentiethACMSIGACT- SIGMOD-SIGART Symposiumon Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247255.

[5] D.W.-L.Cheung,J.Han,V.Ng,A.W.- C.Fu,andY.Fu,Afast distributed algorithm for miningassociation rules, inProceedingsofthe 1996InternationalConferenceonParallelandDistrib utedInformationSystems(PDIS96).MiamiBeach,Fl orida,USA:IEEE,Dec.1996, pp.3142.

Leave a Reply