Clustering Proficient Students using K-Means Algorithm

Apoorva A

doi:10.17577/IJERTCONV4IS27007

NCRIT - 2016 (Volume 4 - Issue 27)


Open Access
Authors : Apoorva A
Paper ID : IJERTCONV4IS27007
Volume & Issue : NCRIT – 2016 (Volume 4 – Issue 27)
Published (First Online): 24-04-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

Clustering Proficient Students using K-Means Algorithm

Apoorva A

Dept. of MCA

Global Institute Of Management Sciences Bangalore, India

Abstract: Educational Data Mining is a field in which a combination of techniques such as data mining, machine learning and statistics is applied to educational data to extract valuable information. The objective of this paper is to cluster proficient students among the students of an educational institution so that their placement chance can be predicted. Clustering is performed using the k-means algorithm based on KSA (knowledge, communication skill and attitude). To assess the performance of the algorithm, a student data set from an institution in Bangalore was collected for the study as synthetic data. A model is proposed to arrive at the result. The accuracy of the results obtained from the algorithm was found to be promising.

Keywords: Educational data mining, proficient student, k-means algorithm

INTRODUCTION

Educational Data Mining (EDM) is the application of Data Mining (DM) techniques to educational data; its objective is to examine such data in order to resolve educational research issues.

An institution has many students. For a student to get placed, he or she should have a good KSA score, where KSA stands for knowledge, communication skills and attitude. This is one of the most important criteria when selecting a student for placement. It is also well known that better placements result in good admissions. Not all students will have a high KSA score. Hence it is necessary to identify which students possess good KSA and which do not. Thus there is a need for clustering, to eliminate students who are not competent to be placed.

PROBLEM STATEMENT

Institutions normally have hundreds of students. Predicting the placement chance for every student is tedious and time consuming, and it is unnecessary for students who are academically weak. Hence there is a need to cluster the proficient students with a good KSA score, whose placement chance can then be predicted.

RELATED WORKS

A performance appraisal system is a formal interaction between an employee and the supervisor or management, conducted periodically to identify the employee's areas of strength and weakness. The objective is to remain consistent in the strengths and work on the weak areas to improve the performance of the individual and thus achieve optimum process quality [8] (Chein and Chen, 2006 [9]; Pal and Pal, 2013 [10]; Khan, 2005 [11]; Baradwaj and Pal, 2011 [12]; Bray, 2007 [13]; S. K. Yadav et al., 2011 [14]). K-means is one of the most widely used clustering algorithms and has been applied to a variety of problems. The K-means approach is a kind of multivariate statistical analysis that partitions samples into K clusters. The method is especially suitable when the number of observations is large or the data file is enormous (Wu, 2000 [1]). K-means is widely used in segmenting markets (Kim et al., 2006 [2]; Shin & Sohn, 2004 [3]; Jang et al., 2002 [4]; Hruschka & Natter, 1999 [5]; Leon Bottou et al., 1995 [6]; Vance Faber, 1994 [7]).

METHODOLOGY

Concept and research framework

The methodology, along with its computational processes for determining the proficient students, is outlined below:

Step 1: Data collection

Knowledge, represented by the marks a student scored in selected subjects over a period of three years (June 2011 to April 2014), was collected from an institution in Bangalore.

Step 2: Data preprocessing

Preprocessing was done using the chi-square test for goodness of fit to remove attributes that do not contribute to the result.

Step 3: K-means clustering technique

This step clusters the proficient students among all the students of the institution using the K-means clustering algorithm.

Step 4: Evaluate the result
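The clustering in Step 3 can be illustrated with a minimal pure-Python sketch of the standard (Lloyd-style) k-means loop. The paper does not state which implementation, initialisation or feature scaling was used, so the function names, the random initialisation, the `seed` parameter and the sample `(total marks, Com)` pairs below are all assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, max_iter=100, seed=0):
    """Plain Lloyd-style k-means. Returns (centroids, labels)."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct points chosen at random (assumption).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Hypothetical (total marks, Com) pairs, illustrative values only.
students = [(624, 7), (1910, 9), (1416, 7), (1482, 8), (650, 6), (1850, 9)]
centroids, labels = k_means(students, k=2)
```

In this sketch the total marks dominate the distance because the features are not scaled; a real run would probably normalise the attributes first.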

DATA DESCRIPTION

Table I: Database description

Variables	Description	Possible Values
Stu_id	ID of the student	{Int}
Name	Name of the student	{Text}
sub	Subject name	{Text}
M1, M2, M3, M4	Marks scored in each subject	{1, 2, 3, 4, 5 … 100}
T	Total marks	{1% – 100%}
Com	(Communication skills + Attitude) score out of 10	{1, 2, 3, 4, 5 … 10}
Min	Minimum marks for passing a subject	32
Max	Maximum marks for a subject	100

Stu_id: ID of the student. It can take any integer value.

Name: Name of the student.

sub: Name of the subject. It can take only text values (letters A-Z).

M1, M2, M3, M4: Marks scored by a student in the various subjects. Each can take only numeric values from 0 to 100.

T: Total marks scored by each student, expressed as a percentage, i.e., 1% to 100%.

Com: Communication and attitude score out of 10.

Min: Minimum marks for passing a subject.

Max: Maximum marks for a subject.
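The fields described above can be pictured as a small record type. This is only a sketch: the class name `StudentRecord`, its methods and the choice of a dictionary for the per-subject marks are assumptions, not something the paper specifies. The sample values are three of the marks listed for student 2 in the input table.

```python
from dataclasses import dataclass
from typing import Dict

PASS_MARK = 32   # "Min" in Table I
MAX_MARK = 100   # "Max" in Table I


@dataclass
class StudentRecord:
    stu_id: int
    name: str
    marks: Dict[str, int]   # subject name -> marks scored (M1, M2, ...)
    com: int                # communication + attitude score out of 10

    def total(self) -> int:
        """T as a raw sum of marks over all recorded subjects."""
        return sum(self.marks.values())

    def total_percent(self) -> float:
        """T expressed as a percentage of the maximum attainable marks."""
        return 100.0 * self.total() / (MAX_MARK * len(self.marks))

    def is_valid(self) -> bool:
        """Check the value ranges listed in the data description."""
        return (all(0 <= m <= MAX_MARK for m in self.marks.values())
                and 1 <= self.com <= 10)


# Subset of the marks shown for student 2 in the input table.
guru = StudentRecord(2, "guru", {"Ca": 98, "Bi": 98, "Java": 97}, com=9)
```

Keeping the raw total and the percentage as separate methods reflects that the paper reports T both as a sum of marks and as a percentage.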

EXPERIMENTAL EVALUATION

Step 1: Data collection

Table II : Input Table

stu_id			1	2	3	4
Name			vikas	guru	sayed	deepak
Sub	Min	Max	M1	M2	M3	M4
Ca	32	100	20	98	45	92
Bi	32	100	23	98	69	83
Java	32	100	24	97	67	74
Se	32	100	25	96	89	92
Cf	32	100	26	95	88	88
Db	32	100	28	90	56	81
..
T			624	1910	1416	1482
Com			7	9	7	8

This is an extract of the student database with the fields or variables listed above. The marks students scored in selected subjects over a period of three years (June 2011 to April 2014) were collected from an institution in Bangalore.

Step 2: Data preprocessing

Preprocessing is done using the following statistical technique.

Chi-square test: applied to remove variables that do not contribute to the result. From Table II above, Name, Min and Max were removed.
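For reference, the chi-square goodness-of-fit statistic itself is simple to compute. The sketch below shows only the statistic (Pearson's form); the paper does not describe how the test was applied to decide which attributes to drop, so the observed counts, the uniform expectation and the function name are hypothetical.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts of 30 students falling into three mark bands,
# compared against a uniform expectation of 10 students per band.
observed = [5, 10, 15]
expected = [10, 10, 10]
stat = chi_square_statistic(observed, expected)  # (25 + 0 + 25) / 10

# Critical value of the chi-square distribution for 2 degrees of freedom
# at the 5% significance level.
CRITICAL_05_DF2 = 5.991
fits_expected = stat < CRITICAL_05_DF2
```

With these made-up counts the statistic stays below the critical value, so the uniform expectation would not be rejected at the 5% level.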

Table III : Preprocessed table

The steps of the k-means procedure are explained below:

Step 1: Clustering using the k-means algorithm.

Step 1.1: The preprocessed table is the input for k-means.

Step 1.2: Cluster the proficient student segment [PCS] and determine the number of clusters. The value of k is incremented at each step, and the results are shown below.

The PCS is initially partitioned with k=2. Applying k-means clustering with k=2 gives the following result.

Table IV: Partial view of clusters of students, for k=2

Cluster1	Cluster2
1	2
10	3
11	4
12	5
13	6
16	7
18	8
21	9
22	14
24	15
26	17
27	19
29	20
30	23
	25
	28

The above table shows the grouping of students into two groups.

Table V : Difference between clusters for k=2

Cluster	Cluster 1	Cluster 2
Cluster 1	0	0.229
Cluster 2	0.229	0

For k = 2, the distances between the groups are tabulated above; 0.229 is the minimum value.
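A distance table of this kind can be produced by computing the Euclidean distance between every pair of cluster centroids. The sketch below assumes the attributes were min-max scaled to [0, 1] first, which would be consistent with the reported distances lying between 0 and 1 but is not stated in the paper; the rows, the cluster assignment and all function names are hypothetical.

```python
import math


def min_max_normalise(rows):
    """Scale every column of a list of numeric rows to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    return [[(v - lo) / s for v, lo, s in zip(r, lows, spans)] for r in rows]


def centroid(rows):
    """Column-wise mean of a non-empty list of rows."""
    return [sum(c) / len(rows) for c in zip(*rows)]


def distance_matrix(centroids):
    """Symmetric matrix of Euclidean distances between centroids."""
    return [[math.dist(a, b) for b in centroids] for a in centroids]


# Hypothetical (T, Com) rows and a hypothetical 2-cluster assignment.
rows = [(624, 7), (1910, 9), (1416, 7), (1482, 8)]
labels = [0, 1, 1, 1]
scaled = min_max_normalise(rows)
cents = [centroid([r for r, l in zip(scaled, labels) if l == k]) for k in (0, 1)]
matrix = distance_matrix(cents)
```

The diagonal of `matrix` is zero and the matrix is symmetric, matching the layout of the distance tables in this section.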

Applying k-means clustering with k=3 gives the following results.

Preprocessed student data used as input (Name, Min and Max removed):

stu_id	1	2	3	4
Sub	M1	M2	M3	M4
ca	20	98	45	92
bi	23	98	69	83
java	24	97	67	74
se	25	96	89	92
cf	26	95	88	88
db	28	90	56	81
..
T	627	1910	1416	1482
Com	7	9	7	8

Table VI : Partial view of three clusters, for k=3

Cluster 1	Cluster 2	Cluster 3
1	9	2
10	24	3
11		4
12		5
13		6
16		7
18		8
21		14
22		15
26		17
27		19
29		20
30		23
		25
		28

The above table shows a partial view of the three clusters.

Table VII : Differences between clusters

Cluster	Cluster 1	Cluster 2	Cluster 3
Cluster 1	0	0.116	0.165
Cluster 2	0.116	0	0.154
Cluster 3	0.165	0.154	0

For k = 3, the distances between the groups are tabulated above; 0.116 (approximately 0.12) is the minimum value.

For k=4, we have the following results.

Table VIII : Partial view of four clusters, for k=4

Cluster 1	Cluster 2	Cluster 3	Cluster 4
1	9	3	2
	10	4
	11	5
	12	6
	13	7
	16	8
	18	14
	21	15
	22	17
	24	19
	26	20
	27	23
	29	25
	30	28

Table IX : Comparison of distance between the clusters

Cluster	Cluster 1	Cluster 2	Cluster 3	Cluster 4
Cluster 1	0	0.104	0.187	0.342
Cluster 2	0.104	0	0.083	0.238
Cluster 3	0.187	0.083	0	0.154
Cluster 4	0.342	0.238	0.154	0

The comparison table above gives the distance between each pair of clusters. For example, the distance between Cluster 1 and Cluster 2 is 0.104537 (0.104 in the table), given in row 1, column 3 when the label column is counted. The other values are calculated similarly. The table results from applying k-means while incrementing the value of k by 1 at every step.

Table X : Cluster distance table

Number of clusters (k)	Shorter cluster distance
2	0.2293
3	0.1658
4	0.3428
5	0.3133

The first value, 0.2293, in the shorter cluster distance column is the distance between clusters 1 and 2 for k=2; similarly, the second value, 0.1658, is the distance between clusters 1 and 3 for k=3. The other values in the table can be interpreted similarly.

From the above table it can be observed that the values in the shorter cluster distance column start increasing sharply, i.e., from 0.1658 to 0.3428, after the second entry (k=3). Hence it can be concluded that the maximum number of clusters that can be formed is 3. So we choose k=3, and select the 3rd cluster because its centroid is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
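The selection rule described here, picking the k just before the reported distance jumps the most, can be checked against the values in Table X with a few lines of Python. The function name and the "largest increase" formulation are one possible reading of the text, not the author's stated procedure.

```python
# Values copied from Table X (number of clusters -> reported distance).
table_x = {2: 0.2293, 3: 0.1658, 4: 0.3428, 5: 0.3133}


def k_before_largest_jump(distances):
    """Return the k after which the reported distance increases the most."""
    ks = sorted(distances)
    # Change in distance when moving from each k to the next larger k.
    jumps = {k: distances[nxt] - distances[k] for k, nxt in zip(ks, ks[1:])}
    return max(jumps, key=jumps.get)


chosen_k = k_before_largest_jump(table_x)
```

On the Table X values the largest increase is the step from k=3 to k=4, so this reading agrees with the choice of k=3.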

Step 2: Choosing the cluster

When k=3, the 3rd cluster is chosen as the best cluster, as its centroid value is nearest to the maximum marks of the subjects, i.e., 2000 (20 subjects).
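The choice of the best cluster amounts to picking the centroid whose total marks lie closest to the maximum of 2000. A minimal sketch follows; the centroid totals are hypothetical, since the paper does not list the actual centroid values, and the function name is an assumption.

```python
MAX_TOTAL = 2000  # 20 subjects x 100 marks, as stated in the text


def best_cluster(centroid_totals, target=MAX_TOTAL):
    """Return the cluster label whose centroid total is closest to target."""
    return min(centroid_totals, key=lambda c: abs(target - centroid_totals[c]))


# Hypothetical centroid totals for k = 3; the paper does not list them.
centroid_totals = {1: 700.0, 2: 1200.0, 3: 1650.0}
chosen = best_cluster(centroid_totals)
```

With these illustrative totals the function returns cluster 3, mirroring the selection made in the text.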

Step 3: Identifying the elements of the cluster.

Table XI : Elements of Cluster 3

Cluster 3
2
3
4
5
6
7
8
14
15
17
19
20
23
25
28

The table above represents the elements of the best cluster identified.

RESULTS

From the above deduction, cluster 3 is found to be the best cluster, containing the proficient students listed below:

Table XII : Elements of Cluster 3

Cluster 3
2
3
4
5
6
7
8
14
15
17
19
20
23
25
28

CONCLUSION

The main objective was to identify proficient students by clustering using the k-means algorithm. Cluster 3, containing the proficient students, emerged as the best cluster among the students of the institution. The algorithm achieved an accuracy of 89%. Thus the problem stated above was solved successfully.

REFERENCES

Kim, S. Y., Jung, T. S., Suh, E. H., & Hwang, H. S. (2006). Customer segmentation and strategy development based on customer lifetime value: A case study. Expert Systems with Applications, 31(1), 101-107.
Shin, H. W., & Sohn, S. Y. (2004). Product differentiation and market segmentation as alternative marketing strategies. Expert Systems with Applications, 27(1), 27-33.
Jang, S. C., Morrison, A. M. T., & O'Leary, J. T. (2002). Benefit segmentation of Japanese pleasure travelers to the USA and Canada: Selecting target markets based on the profitability and the risk of individual market segment. Tourism Management, 23(4), 367-378.
Hruschka, H., & Natter, M. (1999). Comparing performance of feed forward neural nets and k-means of cluster-based market segmentation. European Journal of Operational Research, 114(3), 346-353.
Leon Bottou, Yoshua Bengio, Convergence Properties of the K-Means Algorithms, Advances in Neural Information Processing Systems 7, 1995.
Vance Faber, Clustering and the Continuous k-Means Algorithm, Los Alamos Science, 1994.
Rakesh Agrawal, Tomasz Imielinski, Arun Swami, Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD Record, 1993.
Archer-North and Associates, Performance Appraisal, http://www.performance-appraisal.com, 2006, Accessed Dec, 2012.
Chein, C., Chen, L., "Data mining to improve personnel selection and enhance human capital: A case study in high technology industry", Expert Systems with Applications, In Press (2006).
K. Pal and S. Pal, Analysis and Mining of Educational Data for Predicting the Performance of Students, (IJECCE) International Journal of Electronics Communication and Computer Engineering, Vol. 4, Issue 5, pp. 1560-1565, ISSN: 2278-4209, 2013.
Z. N. Khan, Scholastic achievement of higher secondary students in science stream, Journal of Social Sciences, Vol. 1, No. 2, pp. 84-87, 2005.
B.K. Bharadwaj and S. Pal, Mining Educational Data to Analyze Students' Performance, International Journal of Advance Computer Science and Applications (IJACSA), Vol. 2, No. 6, pp. 63-69, 2011.
M. Bray, The shadow education system: private tutoring and its implications for planners, (2nd ed.), UNESCO, Paris, France, 2007.
S. K. Yadav, B.K. Bharadwaj and S. Pal, Data Mining Applications: A comparative study for Predicting Students' Performance, International Journal of Innovative Technology and Creative Engineering (IJITCE), Vol. 1, No. 12, pp. 13-19, 2011.
