Effect of Privacy Protection on Discrimination

DOI: 10.17577/IJERTV3IS110850


Asmita Kashid
Post Graduate Student, Department of Computer Engineering, MIT, Pune, India

Ruhi Patankar
Assistant Professor, Department of Computer Engineering, MIT, Pune, India

Vrushali Kulkarni
Head of Department, Department of Computer Engineering, MIT, Pune, India

Abstract: Data is a very important asset in today's world. However, raw data by itself is not useful to a specific individual or organization. Data mining is a technique that finds the knowledge hidden in raw data. However, there are two problematic aspects related to data mining: potential privacy violation and potential discrimination. According to civil and social rights laws, discrimination means treating people unfairly or unequally only because they are members of a particular category or minority, without considering their individual merit. Data mining may lead to discriminatory decisions if it extracts classification or association rules from a historical dataset that is biased against a particular community. Discrimination Prevention Data Mining deals with discovering, preventing and measuring discrimination. Privacy means the right of a person to decide how her/his sensitive information is used. Privacy violation occurs if a person's sensitive information is disclosed to an unauthorized entity as a result of data mining tasks. Privacy Preserving Data Mining provides methods and tools for publishing useful information while preserving data privacy. Recently, it has been recognized that these two fields depend on each other, and it is important to bridge the gap between the individual lines of research in the two areas. In this paper, we try to identify the effect of privacy protection on discrimination. An architecture of the proposed work is specified, along with some ideas for future research.

Keywords: discrimination discovery; discrimination prevention; data anonymization techniques; privacy preserving techniques

  1. INTRODUCTION

    Data is a very important asset in today's world. However, raw data by itself is not useful to a specific individual or organization; it is necessary to find the knowledge hidden in it, and data mining is the technique used for this task. There are, however, two problematic aspects related to data mining: potential privacy violation and potential discrimination. Both can be thought of as side effects of data mining tasks. Discrimination means treating people unequally just because they belong to a minority, without considering their individual merit. It is not the mere existence of a statistical imbalance in the data, but a property of a decision that may lead to such an imbalance. Data mining may lead to discriminatory decisions if it extracts classification or association rules from a historical dataset that is biased against a particular community. Discrimination Prevention Data Mining (DPDM) deals with discovering, preventing and measuring discrimination. Privacy means the right of a person to decide how her/his sensitive information (e.g. salary) is used. Privacy violation occurs when values of published sensitive attributes can be linked to specific individuals; it is an intentional or unintentional intrusion into personal data, and it can arise as a side effect of data mining tasks. Privacy Preserving Data Mining (PPDM) deals with developing techniques to modify the original data so that private data remain private even after the data mining process. It protects the identities of the people under consideration and covers privacy attacks, privacy models and anonymization techniques.

    Recently, it has been recognized that these two fields depend on each other, and it is important to bridge the gap between the individual lines of research in the two areas. PPDM and DPDM can be combined for several reasons. 1) The two areas are interdependent: hiding a discriminatory attribute for privacy protection affects the discrimination caused. For example, in employee hiring it is interesting to investigate what happens when the employer knows the race of a job-seeking candidate and what happens when the race is unknown to the employer. 2) Both areas face common challenges, e.g. a trade-off between the privacy achieved and the loss of data utility, and likewise between discrimination removal and the loss of data utility. 3) They share common methodological problems, e.g. privacy attacks and discrimination threats may both occur after the data are released. 4) Since privacy preservation is a well-explored area, many privacy preservation methods can be reused for discrimination prevention.

    The main aim of this paper is to analyze the effect of privacy protection on discrimination. The rest of the paper is organized as follows: Section II surveys the literature related to the two fields under the heading Related Work. Section III defines basic terminology in data mining, DPDM and PPDM. Section IV presents how privacy protection affects discrimination. Section V gives a brief overview of our proposed work. Section VI presents conclusions and future work.

  2. RELATED WORK

    Research in DPDM started in 2008 [1]. A method for discrimination discovery is explained in [2]. There are three different approaches to discrimination prevention [3]: preprocessing, inprocessing and postprocessing. The preprocessing approach consists of methods that remove discrimination from the original dataset: processing is applied to the dataset itself, which is cleaned to remove discrimination. The inprocessing approach modifies the standard data mining algorithms to remove discrimination: the datasets are left untouched, and the mining algorithms themselves are adapted to incorporate discrimination removal. In the postprocessing approach, neither the dataset nor the standard data mining algorithms are changed; instead, the changes are made to the final results of the data mining algorithms.

        Research in DPDM deals with developing different discrimination prevention methods using any of the above three approaches. Methods for discrimination prevention using the preprocessing approach are presented in [3] [4]. Discrimination prevention using decision tree techniques is shown in [5]; this work uses both the inprocessing and postprocessing approaches. Discrimination prevention using the Naïve Bayes model is presented in [6]; it also uses both the inprocessing and postprocessing approaches. Different metrics to measure the amount of discrimination are given in [7].

        Research in PPDM started back in 2000, as shown in [8]. Different algorithms and techniques have been developed to preserve users' privacy. Generalization and suppression techniques are specified in [9], and a survey of different privacy preserving techniques is given in [10]. The data anonymization technique called generalization replaces QI attribute values with a generalized version of them, using the generalization taxonomy tree of the QI attributes. Suppression consists in suppressing some values of the QI attributes for some (or all) records. Anatomy [11] does not modify the quasi-identifier or the sensitive attribute, but deassociates the relationship between the two. Precisely, the method releases the data on the QID and the data on the sensitive attribute in two separate tables: a quasi-identifier table (QIT) contains the QID attributes, a sensitive table (ST) contains the sensitive attributes, and both QIT and ST have one common attribute, GroupID (a minimal sketch of this idea is given below). Slicing [12] divides the data set vertically and horizontally and deassociates the relation between tuples in different columns.
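        As a minimal illustration of the anatomy idea just described, the following Python sketch splits a toy record set into a quasi-identifier table (QIT) and a sensitive table (ST) that share only a GroupID. The attribute names, group size and grouping rule are our own illustrative assumptions, not the actual procedure of [11].

```python
# Anatomy sketch: publish QI values and sensitive values in two tables that
# share only a GroupID, so the exact QI-to-sensitive link is broken.
# Attribute names, group size and the grouping rule are illustrative.
records = [
    {"Gender": "Male",   "Age": 35, "Disease": "Flu"},
    {"Gender": "Male",   "Age": 38, "Disease": "HIV"},
    {"Gender": "Female", "Age": 30, "Disease": "Flu"},
    {"Gender": "Female", "Age": 31, "Disease": "Cancer"},
]

group_size = 2  # records per anonymized group (real anatomy groups to enforce l-diversity)
qit, st = [], []
for i, rec in enumerate(records):
    gid = i // group_size
    qit.append({"GroupID": gid, "Gender": rec["Gender"], "Age": rec["Age"]})
    st.append({"GroupID": gid, "Disease": rec["Disease"]})

print("QIT:", qit)  # quasi-identifier table
print("ST :", st)   # sensitive table
```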

        Some research also exists on identifying the relation between PPDM and DPDM. The effect of data anonymization techniques (e.g. generalization and suppression) on anti-discrimination is given in [13]. A method to make data discrimination free using a privacy preserving model (e.g. t-closeness) is depicted in [14]. The impact of knowledge publishing on anti-discrimination is shown in [15] [16]. Methods for discrimination discovery using privacy attack strategies are presented in [17].

  3. BASIC TERMINOLOGIES

        1. Basic Definitions in Data Mining

          Some basic definitions [3] of data mining are mentioned below. These definitions are used as background knowledge for measuring and discovering discrimination:

          • A data set is a collection of data objects (records) and their attributes.

          • An item is an attribute along with its value, e.g., Race = black.

          • An item set X is a collection of one or more items, e.g., {Foreign worker = Yes; City = NYC}.

          • A classification rule is an expression X → C, where C is a class item (a yes/no decision) and X is an item set containing no class item.

          • The support of an item set, sup(X), is the fraction of records that contain the item set X. We say that a rule X → C is completely supported by a record if both X and C appear in the record.

          • Confidence of a classification rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if sup(X) > 0, then

            conf(X → C) = sup(X, C) / sup(X)        (1)

            Support and confidence range over [0, 1]. (A small computational sketch of these measures is given after this list.)

          • A frequent classification rule is a classification rule whose support and confidence are greater than specified lower bounds. Support is a measure of statistical significance, whereas confidence is a measure of the strength of the rule.
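          To make the support and confidence definitions above concrete, here is a small Python sketch that computes sup(X) and conf(X → C) over a data set represented as a list of attribute-value records; the record layout and attribute names are illustrative assumptions.

```python
# Support and confidence of a classification rule X -> C over a data set
# stored as a list of {attribute: value} records.
def supports(record, itemset):
    """True if the record contains every item (attribute, value) of the itemset."""
    return all(record.get(attr) == val for attr, val in itemset.items())

def sup(dataset, itemset):
    """sup(X): fraction of records that contain the item set X."""
    return sum(supports(r, itemset) for r in dataset) / len(dataset)

def conf(dataset, X, C):
    """conf(X -> C) = sup(X, C) / sup(X), defined when sup(X) > 0."""
    sup_x = sup(dataset, X)
    return sup(dataset, {**X, **C}) / sup_x if sup_x > 0 else 0.0

dataset = [
    {"Foreign worker": "Yes", "City": "NYC", "Credit_approved": "No"},
    {"Foreign worker": "Yes", "City": "NYC", "Credit_approved": "Yes"},
    {"Foreign worker": "No",  "City": "NYC", "Credit_approved": "Yes"},
]
X = {"Foreign worker": "Yes", "City": "NYC"}
C = {"Credit_approved": "No"}
print(sup(dataset, X))      # 0.666...
print(conf(dataset, X, C))  # 0.5
```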

        2. Basic Definitions in DPDM

          Some definitions related to rule-based discovery and prevention of discrimination [2] [3] are mentioned below. They have significance throughout the discrimination discovery and prevention process.

          • A data item is said to be potentially discriminatory (PD) if it is designated as discriminatory according to laws and regulations.

          • A classification rule X → C is potentially discriminatory (PD) when X = A, B, with A ⊆ DIs a nonempty discriminatory item set and B a nondiscriminatory item set.

          • Let A, B → C be a PD classification rule extracted from DB with conf(B → C) > 0. The extended lift (elift) of the rule is

            elift(A, B → C) = conf(A, B → C) / conf(B → C)        (2)

          • Let A, B → C be a PD classification rule extracted from DB with conf(¬A, B → C) > 0. The selection lift (slift) of the rule is

            slift(A, B → C) = conf(A, B → C) / conf(¬A, B → C)        (3)

            The slift is the ratio of the proportions of benefit denial, e.g. credit denial, between the protected and unprotected groups, e.g. women and men respectively, in the given context, e.g. those who live in NYC. (A computational sketch of both measures is given after this list.)

          • Let f be one of the measures elift or slift, let α be a fixed threshold, and let A be a PD itemset. A PD classification rule c = A, B → C is α-protective w.r.t. f if f(c) < α. Otherwise, c is α-discriminatory.

          • Let DB be a data table, DA a set of PD attributes associated with it, and f one of the measures elift or slift. DB is said to satisfy α-protection, or to be α-protective, w.r.t. DA and f if each PD frequent classification rule c = A, B → C extracted from DB is α-protective, where A is a PD itemset and B is a PND itemset.
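          The following Python sketch computes elift and slift as in Eqs. (2) and (3) for a PD rule (A, B) → C. The record layout, the explicit not_A argument for the unprotected group, and the helper names are our own illustrative assumptions.

```python
# elift and slift of a PD classification rule (A, B) -> C, following Eqs. (2)-(3).
def conf(dataset, X, C):
    """conf(X -> C): fraction of records covered by X that also satisfy C."""
    covered = [r for r in dataset if all(r.get(a) == v for a, v in X.items())]
    if not covered:
        return 0.0
    return sum(all(r.get(a) == v for a, v in C.items()) for r in covered) / len(covered)

def elift(dataset, A, B, C):
    """elift(A,B -> C) = conf(A,B -> C) / conf(B -> C); assumes conf(B -> C) > 0."""
    return conf(dataset, {**A, **B}, C) / conf(dataset, B, C)

def slift(dataset, A, not_A, B, C):
    """slift(A,B -> C) = conf(A,B -> C) / conf(not-A,B -> C), where not_A describes
    the unprotected group (e.g. Gender = Male when A is Gender = Female)."""
    return conf(dataset, {**A, **B}, C) / conf(dataset, {**not_A, **B}, C)
```

          With A = {Gender: Female}, not_A = {Gender: Male}, an empty context B and C = {Credit_approved: No}, slift reproduces the ratio used in Section IV.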

        3. Basic Definitions in PPDM [10]

    • Explicit identifier is a set of attributes that explicitly/uniquely identifies record owners.

    • Quasi_Identifier is a set of attributes that could potentially identify record owners.

    • Sensitive attributes contain sensitive person specific information such as disease, salary or disability status.

    • Non-Sensitive attributes contain all the attributes that do not belong to the other three categories.

    • Data Anonymization is an approach of PPDP that hides the identity and/or sensitive data of record owners while assuming that the sensitive data must be retained for data analysis, i.e. the sensitive data are hidden in such a way that they can still be recovered for analysis purposes.
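    As a minimal sketch of the two anonymization operations mentioned in Section II (generalization and suppression), the snippet below generalizes an age to a range and suppresses a gender value. The toy taxonomy, the chosen attributes and the '*' marker are our own assumptions, not taken from [9].

```python
# Minimal sketch of generalization and suppression on a single record.
# The taxonomy, the chosen attributes and the '*' suppression marker are illustrative.
taxonomies = {
    "Age": lambda a: "30-39" if 30 <= a <= 39 else ">=40",  # toy taxonomy tree
}

def anonymize(record, generalize=("Age",), suppress=("Gender",)):
    out = dict(record)
    for attr in generalize:
        out[attr] = taxonomies[attr](out[attr])  # replace value by its generalization
    for attr in suppress:
        out[attr] = "*"                          # suppress the value entirely
    return out

print(anonymize({"Gender": "Female", "Age": 31, "Job": "Dancer"}))
# -> {'Gender': '*', 'Age': '30-39', 'Job': 'Dancer'}
```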

    TABLE 1. PRIVATE DATA TABLE WITH BIASED DECISION RECORDS

    ID | Gender | Job      | Age | Credit_approved
    ---+--------+----------+-----+----------------
    1  | Male   | Engineer | 35  | Yes
    2  | Male   | Engineer | 38  | Yes
    3  | Male   | Lawyer   | 38  | No
    4  | Female | Writer   | 30  | No
    5  | Male   | Writer   | 30  | Yes
    6  | Female | Dancer   | 31  | No
    7  | Female | Dancer   | 32  | Yes

  4. IMPACT ANALYSIS OF PRIVACY PROTECTION ON DISCRIMINATION

    In this section, we will see, using an example, how privacy protection can affect discrimination. Consider TABLE 1, which represents raw customer credit data, where each record contains a customer's specific information [13].

    Gender, Job and Age can be taken as quasi-identifier attributes. The class attribute Credit_approved has two values, Yes and No, indicating whether a particular customer has received credit or not. Suppose Gender is taken as the discriminatory attribute.

    Suppose α = 1.2 and slift is taken as the discrimination measure. A frequent PD classification rule {Gender = Female} → Credit_approved = No is extracted from the table. From TABLE 1,

    slift({Gender = Female} → Credit_approved = No) = conf({Gender = Female} → Credit_approved = No) / conf({Gender = Male} → Credit_approved = No) = (2/3) / (1/4) ≈ 2.67        (4)

    The rule is α-discriminatory, as slift > α.

    If we apply the data anonymization technique called slicing [12] on TABLE 1, then TABLE 1 will be transformed to TABLE 2 as below:

    TABLE 2. TRANSFORMATION OF TABLE 1 AFTER APPLYING SLICING

    ID | Gender | Job      | Age | Credit_approved
    ---+--------+----------+-----+----------------
    1  | Male   | Engineer | 35  | Yes
    2  | Male   | Engineer | 38  | Yes
    3  | Male   | Lawyer   | 38  | No
    4  | Male   | Writer   | 30  | No
    5  | Female | Writer   | 30  | Yes
    6  | Female | Dancer   | 31  | No
    7  | Female | Dancer   | 32  | Yes

    The slift of the same rule now becomes

    slift({Gender = Female} → Credit_approved = No) = (1/3) / (1/2) ≈ 0.67        (5)

    The rule has become α-protective, as slift < α, after applying the slicing technique on the table. This shows that data anonymization methods can achieve α-protection.
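    The slift values in Eqs. (4) and (5) can be checked directly from the two tables. The following Python sketch is ours (only the table contents come from TABLES 1 and 2):

```python
# Recomputes the slift of {Gender = Female} -> Credit_approved = No on
# TABLE 1 (before slicing) and TABLE 2 (after slicing).
def conf(rows, X, C):
    covered = [r for r in rows if all(r[a] == v for a, v in X.items())]
    return sum(all(r[a] == v for a, v in C.items()) for r in covered) / len(covered)

def slift(rows, A, not_A, C):
    return conf(rows, A, C) / conf(rows, not_A, C)

COLS = ("Gender", "Job", "Age", "Credit_approved")
table1 = [dict(zip(COLS, row)) for row in [
    ("Male", "Engineer", 35, "Yes"), ("Male", "Engineer", 38, "Yes"),
    ("Male", "Lawyer", 38, "No"),    ("Female", "Writer", 30, "No"),
    ("Male", "Writer", 30, "Yes"),   ("Female", "Dancer", 31, "No"),
    ("Female", "Dancer", 32, "Yes")]]
table2 = [dict(zip(COLS, row)) for row in [
    ("Male", "Engineer", 35, "Yes"), ("Male", "Engineer", 38, "Yes"),
    ("Male", "Lawyer", 38, "No"),    ("Male", "Writer", 30, "No"),
    ("Female", "Writer", 30, "Yes"), ("Female", "Dancer", 31, "No"),
    ("Female", "Dancer", 32, "Yes")]]

A, not_A, C = {"Gender": "Female"}, {"Gender": "Male"}, {"Credit_approved": "No"}
print(round(slift(table1, A, not_A, C), 2))  # 2.67 > 1.2 -> alpha-discriminatory
print(round(slift(table2, A, not_A, C), 2))  # 0.67 < 1.2 -> alpha-protective
```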

  5. PROPOSED WORK

    In our proposed work, we use the same concepts as discussed in Section IV. The problem statement is to analyze the effect of different privacy preserving (data anonymization) techniques on discrimination prevention. We aim to compare the percentage of discrimination removal (α-protection achieved) by different data anonymization techniques. Many data anonymization techniques are used in PPDP, such as generalization [9], suppression [9], permutation [10], slicing [12], bucketization [11], anatomy [11], etc. As we have seen in Section IV, privacy protection and anti-discrimination are dependent on each other, so it is important to find the impact of privacy protection on anti-discrimination. Although the full-domain generalization technique has been used to make data both privacy protected and discrimination free, there is still scope to test the impact of other data anonymization techniques on discrimination. We therefore plan a comparative study of different data anonymization techniques. The scope of our proposed work is limited to direct discrimination and to the preprocessing approach. The proposed architecture is depicted in Fig. 1.

    The inputs to our proposed system will be the discrimination threshold, the discriminatory dataset, the discriminatory attribute, the sensitive attribute, the quasi-identifier attributes, and the data anonymization method.

    Fig. 1. Architecture of the Proposed Work

    After taking the input dataset from the user, the first step is to discover discrimination in the dataset [2]. The chosen data anonymization method is then applied to the input dataset, transforming it into a new dataset. Discrimination is discovered again in the transformed dataset, and finally the percentage of discrimination removal is calculated [3]. The same process is repeated for each of the input data anonymization methods, so that the percentage of discrimination removed by different methods can be compared. The main aim is the impact analysis, which can be carried out with any number of methods, depending on time constraints. Though any number of discriminatory datasets can be input to the system, for testing purposes we will use two datasets: the Adult dataset [18] and the German Credit dataset [19].
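    A high-level sketch of this evaluation loop follows. The discovery step is a deliberately simplified stand-in (a single slift-based rule check) for the full rule-mining method of [2], and all function names are our own placeholders rather than an existing implementation.

```python
# Skeleton of the proposed comparison: for each anonymization method, measure
# the percentage of alpha-discriminatory rules it removes. The discovery step
# below checks a single slift-based rule and stands in for the method of [2].
def conf(dataset, X, C):
    covered = [r for r in dataset if all(r.get(a) == v for a, v in X.items())]
    return (sum(all(r.get(a) == v for a, v in C.items()) for r in covered) / len(covered)
            if covered else 0.0)

def discover_discrimination(dataset, pd_item, unprotected_item, cls_item, alpha):
    """Return the alpha-discriminatory rules found (here at most one rule)."""
    slift = conf(dataset, pd_item, cls_item) / max(conf(dataset, unprotected_item, cls_item), 1e-9)
    return [(pd_item, cls_item, slift)] if slift >= alpha else []

def compare_methods(dataset, methods, pd_item, unprotected_item, cls_item, alpha):
    """methods: {name: anonymization function taking and returning a dataset}."""
    before = discover_discrimination(dataset, pd_item, unprotected_item, cls_item, alpha)
    results = {}
    for name, anonymize in methods.items():
        transformed = anonymize(dataset)
        after = discover_discrimination(transformed, pd_item, unprotected_item, cls_item, alpha)
        # percentage of discrimination removal, in the spirit of [3]
        results[name] = 100.0 * (len(before) - len(after)) / max(len(before), 1)
    return results
```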

  6. CONCLUSIONS AND FUTURE WORK

Privacy preservation and anti-discrimination are dependent on each other. Different data anonymization techniques can have different impacts on discrimination: some techniques may increase discrimination, some may decrease it, and some may have no effect at all. Hence it is useful to find the relation between them. Knowledge of this relationship can help in protecting the original data against both privacy and discrimination risks. It is also observed that we cannot protect the original data against privacy attacks without taking the anti-discrimination requirement into account. Our proposed system can work as a tool for analyzing the effect of privacy preserving techniques on discrimination, and it will provide a proper methodology for doing so. The proposed tool can be extended to other data anonymization techniques in the privacy literature. Our system will also give an idea of which data anonymization techniques are best suited for discrimination removal. This will be a promising step towards making data both privacy protected and discrimination free. Our proposed system is scalable, in the sense that new research in the privacy preserving area can easily be combined with discrimination research.

REFERENCES

  1. D. Pedreschi, S. Ruggieri, and F. Turini, "Discrimination-Aware Data Mining," Proc. 14th ACM Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 560-568, 2008.

  2. S. Ruggieri, D. Pedreschi, and F. Turini, "Data Mining for Discrimination Discovery," ACM Trans. Knowledge Discovery from Data, vol. 4, no. 2, article 9, 2010.

  3. S. Hajian and J. Domingo-Ferrer, "A Methodology for Direct and Indirect Discrimination Prevention in Data Mining," IEEE Trans. Knowledge and Data Engineering, pp. 1445-1459, July 2013.

  4. F. Kamiran and T. Calders, "Data Preprocessing Techniques for Classification without Discrimination," Springer, 2011.

  5. F. Kamiran, T. Calders, and M. Pechenizkiy, "Discrimination Aware Decision Tree Learning," Proc. IEEE Int'l Conf. Data Mining (ICDM '10), pp. 869-874, 2010.

  6. T. Calders and S. Verwer, "Three Naive Bayes Approaches for Discrimination-Free Classification," Data Mining and Knowledge Discovery, vol. 21, no. 2, pp. 277-292, 2010.

  7. D. Pedreschi, S. Ruggieri, and F. Turini, "Measuring Discrimination in Socially-Sensitive Decision Records," Proc. Ninth SIAM Data Mining Conf. (SDM '09), pp. 581-592, 2009.

  8. R. Agrawal and R. Srikant, "Privacy-Preserving Data Mining," Proc. ACM SIGMOD, pp. 439-450, 2000.

  9. L. Sweeney, "Achieving k-Anonymity Privacy Protection Using Generalization and Suppression," 2002.

  10. B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, "Privacy-Preserving Data Publishing: A Survey of Recent Developments," ACM Comput. Surv., vol. 42, no. 4, article 14, 2010.

  11. X. Xiao and Y. Tao, "Anatomy: Simple and Effective Privacy Preservation," Proc. VLDB, pp. 139-150, 2006.

  12. T. Li, N. Li, J. Zhang, and I. Molloy, "Slicing: A New Approach for Privacy Preserving Data Publishing," IEEE Trans. Knowledge and Data Engineering, vol. 24, no. 3, 2012.

  13. S. Hajian and J. Domingo-Ferrer, "A Study on the Impact of Data Anonymization on Anti-Discrimination," Proc. IEEE 12th Int'l Conf. Data Mining Workshops, pp. 352-359, 2012.

  14. S. Ruggieri, "Data Anonymity Meets Non-Discrimination," Proc. IEEE 13th Int'l Conf. Data Mining Workshops (ICDMW), pp. 875-882, 2013.

  15. S. Hajian, A. Monreale, D. Pedreschi, J. Domingo-Ferrer, and F. Giannotti, "Injecting Discrimination and Privacy Awareness into Pattern Discovery," Proc. IEEE 12th Int'l Conf. Data Mining Workshops, pp. 360-369, 2012.

  16. S. Hajian, A. Monreale, D. Pedreschi, J. Domingo-Ferrer, and F. Giannotti, "Fair Pattern Discovery," Proc. 29th Annual ACM Symposium on Applied Computing, pp. 113-120, 2014.

  17. S. Ruggieri, S. Hajian, F. Kamiran, and X. Zhang, "Anti-Discrimination Analysis Using Attack Strategies," 25th European Conf. Machine Learning and 18th Conf. Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD 2014), to appear.

  18. R. Kohavi and B. Becker, "UCI Repository of Machine Learning Databases," http://archive.ics.uci.edu/ml/datasets/Adult, 1996.

  19. D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz, "UCI Repository of Machine Learning Databases," http://archive.ics.uci.edu/ml, 1998.
