- Open Access
- Authors : Zhao Jin, Shuxia Lu, Mi Zhou
- Paper ID : IJERTV5IS120168
- Volume & Issue : Volume 05, Issue 12 (December 2016)
- DOI : http://dx.doi.org/10.17577/IJERTV5IS120168
- Published (First Online): 16-12-2016
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
A Map Reduce – based SVM Ensemble with Stochastic Gradient Descent
Zhao Jin
Key Lab. of Machine Learning and Computational Intelligence
College of Mathematics and Information Science, Hebei University
Baoding City, Hebei Province, 071002, China
Shuxia Lu *
Key Lab. of Machine Learning and Computational Intelligence
College of Mathematics and Information Science, Hebei University
Baoding City, Hebei Province, 071002, China
Mi Zhou
Key Lab. of Machine Learning and Computational Intelligence
College of Mathematics and Information Science, Hebei University
Baoding City, Hebei Province, 071002, China
Abstract: The Stochastic Gradient Descent (SGD) algorithm is a simple and effective algorithm for training SVMs. It is particularly fast for linear classification and can also be adapted to non-linear classification with a Mercer kernel. Its running time scales linearly with the number of iterations and does not depend on the size of the training set. In order to improve the convergence rate and classification accuracy on large data sets, this paper proposes a MapReduce-based SVM ensemble algorithm with SGD. We use the Hadoop Distributed File System to store the large training set and the MapReduce parallel computing model to train several SVMs that form an ensemble. The results show that our method achieves a faster convergence rate than Pegasos, a standard SGD algorithm.
Keywords: Stochastic Gradient Descent; Support Vector Machine; MapReduce; Voting Mechanism; Ensemble
I. INTRODUCTION
The Support Vector Machine (SVM) [1] is a machine learning algorithm for classification with good generalization ability. It is based on Statistical Learning Theory; the VC dimension and structural risk minimization are its theoretical foundations.
Stochastic gradient descent is an effective approach for training SVMs in which the objective is the primal form rather than the dual form. It proceeds by iteratively choosing a labeled example at random from the training set and updating the model weights through a gradient step on the corresponding instantaneous objective function.
T. Zhang [2] proved that a constant learning rate in SGD numerically achieves good accuracy, giving a running time on the order of $1/\epsilon^{2}$ for the linear kernel. The NORMA algorithm [3] suggests a learning rate proportional to $1/\sqrt{t}$. Shai Shalev-Shwartz et al. [4] presented the Pegasos algorithm, an effective stochastic sub-gradient descent algorithm for solving SVMs, which adopts a learning rate of $1/(\lambda t)$. Zhuang Wang et al. [5] presented a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training that have constant space and constant time complexity per update. Krzysztof Sopyla et al. [6] presented the SGD-BB algorithm, which shows lower sensitivity to the choice of initial parameters.
According to machine learning theory, more training examples are not always better for recognizing a pattern. In fact, the SGD algorithm does not use all training examples but only a subset chosen at random from the training set during the training process. Therefore, when the training set is too large to load into a PC's memory, we can choose a subset of the training set to reach the optimal solution. There are several methods for dealing with this big data problem. Nasullah Khalid Alham et al. [7] presented MRESVM, a MapReduce-based distributed SVM ensemble algorithm for scalable image annotation. Ferhat et al. [8] proposed a different MapReduce-based distributed SVM algorithm for binary classification. There are two differences between [7] and [8]. First, in [7] the subsets used to train each single SVM are selected by bootstrapping, whereas in [8] the training set is simply separated into different blocks that are treated as subsets. Second, [7] is an ensemble algorithm with only one MapReduce pass, whereas [8] is an iterative algorithm with many MapReduce passes. Both [7] and [8] use Sequential Minimal Optimization (SMO) [9] to train the SVMs.
In this paper, we propose a MapReduce-based SVM ensemble using the SGD algorithm, called MR-SGD. In MR-SGD, we use the MapReduce programming model and the Hadoop Distributed File System to train several SVMs in parallel. Each SVM of MR-SGD is trained with the SGD algorithm on a subset of the training set. Our proposed algorithm is compared with Pegasos [4]. The results show that MR-SGD achieves a faster convergence rate in almost all cases.
This paper is organized as follows: Section II discusses the basic stochastic gradient descent algorithm for SVMs, for both the linear kernel and the non-linear kernel. In Section III, our proposed algorithm and the details of its implementation are described. Section IV presents the experimental results on several datasets, and the last section draws conclusions.
II. STOCHASTIC GRADIENT DESCENT (SGD) FOR SVMS

A. SGD for linear kernel
Consider a binary classification problem with data set $S = \{(x_i, y_i), i = 1, \ldots, m\}$, where each instance $x_i \in \mathbb{R}^n$ is an $n$-dimensional input vector and $y_i \in \{-1, +1\}$ is the corresponding label of that instance. Training an SVM using $S$ is formulated as solving the following optimization problem:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + \frac{1}{m}\sum_{(x, y)\in S} l(W; (x, y)) \qquad (1)$$

where $\lambda > 0$ is a regularization parameter used to control model complexity and

$$l(W; (x, y)) = \max\{0,\, 1 - y\langle W, x\rangle\} \qquad (2)$$

is the hinge loss function, where $\langle W, x\rangle$ denotes the standard inner product between the vectors $W$ and $x$.

The classifier is expressed as

$$f(x) = W^{T} x \qquad (3)$$

where $W$ is a vector of weights associated with each input. We study the case where the bias is set to zero.

SGD operates as follows. Initially, we set $W_1$ to the zero vector. On iteration $t$ of the process, we first choose a training example $(x_{i_t}, y_{i_t})$ by picking an index $i_t \in \{1, \ldots, m\}$ uniformly at random. We then replace the objective in Eq. (1) with an approximation based on the training example $(x_{i_t}, y_{i_t})$:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + l(W; (x_{i_t}, y_{i_t})) \qquad (4)$$

We consider the sub-gradient of the above approximate objective, given by

$$\nabla_t = \lambda W_t - \delta_t\, y_{i_t} x_{i_t} \qquad (5)$$

where

$$\delta_t = \begin{cases} 1, & \text{if } y_{i_t}\langle W_t, x_{i_t}\rangle < 1 \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$

Then we update the weight vector $W$ using the formula

$$W_{t+1} = W_t - \eta_t \nabla_t \qquad (7)$$

where $\eta_t > 0$ is a learning rate. We can rewrite the update formula as

$$W_{t+1} = (1 - \eta_t\lambda)\, W_t + \eta_t \delta_t\, y_{i_t} x_{i_t} \qquad (8)$$

After a predetermined number $T$ of iterations, we obtain the weight vector $W_{T+1}$ of the SVM classifier. The pseudo-code of linear-kernel SGD is given in Algorithm 1.

Algorithm 1: linear kernel SGD
1. Input: $S$, $\lambda$, $T$
2. Initialize: $W_1 = 0$
3. For $t = 1, \ldots, T$
4.   Choose $i_t \in \{1, \ldots, |S|\}$ uniformly at random
5.   $\eta_t = 1/(\lambda t)$
6.   $W_{t+1} = \left(1 - \frac{1}{t}\right) W_t$
7.   If $y_{i_t}\langle W_t, x_{i_t}\rangle < 1$
8.     $W_{t+1} = W_{t+1} + \eta_t y_{i_t} x_{i_t}$
9. Output: $W_{T+1}$
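As a concrete illustration of Algorithm 1, the following is a minimal NumPy sketch of the linear-kernel SGD update of Eq. (8). The function name `linear_sgd` and the array-based interface are our own illustrative choices, not code from the paper.

```python
import numpy as np

def linear_sgd(X, y, lam, T, seed=None):
    """Linear-kernel SGD for SVM (Algorithm 1, Pegasos-style update).

    X : (m, n) array of training instances; y : (m,) array of labels in {-1, +1};
    lam : regularization parameter lambda; T : number of iterations.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)                          # W_1 = 0
    for t in range(1, T + 1):
        i = int(rng.integers(m))             # choose i_t uniformly at random
        eta = 1.0 / (lam * t)                # learning rate eta_t = 1 / (lambda * t)
        margin = y[i] * np.dot(w, X[i])      # y_{i_t} <W_t, x_{i_t}>, before the update
        w *= 1.0 - 1.0 / t                   # shrink step: W_{t+1} = (1 - 1/t) W_t
        if margin < 1:                       # hinge loss is active
            w += eta * y[i] * X[i]           # gradient step on the chosen example
    return w                                 # W_{T+1}
```

As in the paper, no bias term is learned.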
B. SGD for non-linear kernel

The Representer Theorem [10] implies that the optimal solution of Eq. (1) can be expressed as a linear combination of the training instances. Hence SGD for SVM can be used to solve non-linear problems with Mercer kernels rather than with direct access to the feature vectors.

In this section we consider predictors that are linear functions of some implicit mapping $\phi(x)$ of the instances. The minimization problem is:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + \frac{1}{m}\sum_{(x, y)\in S} l(W; (x, y)) \qquad (9)$$

where the loss function is:

$$l(W; (x, y)) = \max\{0,\, 1 - y\langle W, \phi(x)\rangle\} \qquad (10)$$

We then replace the objective in Eq. (9) with an approximation based on the training example $(x_{i_t}, y_{i_t})$, just as before:

$$\min_{W} P(W) = \frac{\lambda}{2}\|W\|^{2} + l(W; (x_{i_t}, y_{i_t})) \qquad (11)$$

Rather than explicitly calculating $W$ by using $\phi(x)$, we save the training examples to implicitly represent $W$ and use the kernel function $K(x, x') = \langle\phi(x), \phi(x')\rangle$ when calculating the prediction $W^{T}\phi(x)$.

The process is the same as in the linear-kernel case. Set $W_1$ to the zero vector initially. According to Eq. (8), with $\eta_t = 1/(\lambda t)$ we can get:

$$W_2 = \left(1 - \tfrac{1}{1}\right)W_1 + \tfrac{1}{\lambda}\delta_1 y_1 x_1 = \tfrac{1}{\lambda}\delta_1 y_1 x_1$$
$$W_3 = \left(1 - \tfrac{1}{2}\right)W_2 + \tfrac{1}{2\lambda}\delta_2 y_2 x_2 = \tfrac{1}{2\lambda}\left(\delta_1 y_1 x_1 + \delta_2 y_2 x_2\right)$$
$$W_4 = \left(1 - \tfrac{1}{3}\right)W_3 + \tfrac{1}{3\lambda}\delta_3 y_3 x_3 = \tfrac{1}{3\lambda}\left(\delta_1 y_1 x_1 + \delta_2 y_2 x_2 + \delta_3 y_3 x_3\right)$$
$$\cdots$$

By induction we obtain the representation of $W_{t+1}$:

$$W_{t+1} = \frac{1}{\lambda t}\sum_{j=1}^{t}\delta_j\, y_j x_j \qquad (12)$$

Note that the instance $x_j$, $j \in \{1, \ldots, t\}$, is the one chosen uniformly at random at the $j$th iteration. Then we obtain the predictor with a Mercer kernel function:

$$f(x) = W_{t+1}^{T}\phi(x) = \frac{1}{\lambda t}\sum_{j=1}^{t}\delta_j\, y_j K(x_j, x) \qquad (13)$$

It is obvious from Eq. (12) that a training example can be ignored if its corresponding $\delta$ is zero, because it makes no contribution to the predictor. Let $W$ be a list in which each element records two pieces of information: a training example whose $\delta$ is not zero and the number of times it has been selected so far (denoted $\alpha_j$ below).

The pseudo-code of non-linear-kernel SGD is given in Algorithm 2.

Algorithm 2: non-linear kernel SGD
1. Input: $S$, $\lambda$, $T$
2. Initialize: $W_1$ = empty list
3. For $t = 1, \ldots, T$
4.   Choose $i_t \in \{1, \ldots, |S|\}$ uniformly at random
5.   If $y_{i_t}\,\frac{1}{\lambda t}\sum_{j=1}^{\mathrm{length}(W_t)} \alpha_j\, y_j K(x_j, x_{i_t}) < 1$
6.     If $x_{i_t}$ is in $W_t$
7.       Increment the corresponding $\alpha_j$ by 1
8.     Else
9.       Put $x_{i_t}$ into $W_t$
10.      Set the corresponding $\alpha_j$ to 1
11. Output: $W_{T+1}$
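A rough Python sketch of Algorithm 2 follows; it keeps the selected support examples and their counts in a dictionary, mirroring the list $W$ described above. The names `kernel_sgd` and `gaussian_kernel` and the callable kernel argument are our own assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    # K(x, x') = exp(-gamma * ||x - x'||^2), the kernel used in the experiments
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_sgd(X, y, lam, T, kernel, seed=None):
    """Non-linear kernel SGD for SVM (Algorithm 2).

    Returns a dict mapping training index j -> alpha_j, the number of times
    example j was selected while its hinge loss was active (cf. Eq. (12)).
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    alpha = {}                                        # W_1: empty support list
    for t in range(1, T + 1):
        i = int(rng.integers(m))                      # choose i_t uniformly at random
        # decision value of Eq. (13) at x_{i_t}, using the current support set
        s = sum(a * y[j] * kernel(X[j], X[i]) for j, a in alpha.items()) / (lam * t)
        if y[i] * s < 1:                              # hinge loss is active
            alpha[i] = alpha.get(i, 0) + 1            # add x_{i_t} or increment its count
    return alpha

# Example usage with the Gaussian kernel of Section IV (gamma is dataset-dependent):
# alpha = kernel_sgd(X, y, lam=2e-4, T=10000, kernel=lambda a, b: gaussian_kernel(a, b, 0.2))
```

The prediction for a new point $x$ then follows Eq. (13): sum $\alpha_j y_j K(x_j, x)$ over the stored examples, divide by $\lambda T$, and take the sign.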
III. A MAPREDUCE-BASED SVM ENSEMBLE WITH STOCHASTIC GRADIENT DESCENT

Since the run-time of the SGD algorithm does not depend directly on the size of the training set, it is especially suited for learning from large datasets. However, traditional algorithms run into difficulty when the training-set file is too big to load into the memory of a single PC.

We propose an algorithm that is a MapReduce-based SVM ensemble using SGD (MR-SGD). In MR-SGD, we utilize the Hadoop Distributed File System and the MapReduce programming model to handle the big data problem and to achieve a faster convergence rate than a single SVM.

A. MapReduce Model

Here we introduce Apache Hadoop [11], which has two core components: the Hadoop Distributed File System (HDFS) [12] and the MapReduce parallel computation programming model [13].

HDFS is used to store large data. It splits a data file into multiple chunks, and each chunk is stored on a different data node. MapReduce is a parallel processing programming model that can handle the large data sets stored in HDFS. The basic function of the MapReduce model is to iterate over the input, compute key/value pairs from each part of the input, group all intermediate values by key, then iterate over the resulting groups and finally reduce each group; these steps are processed in parallel. A Map task is an initial transformation step in which individual input records are processed in parallel. The system shuffles and sorts the map outputs and transfers them to the Reduce tasks. A Reduce task is a summarization step in which all associated records are processed together by a single entity.

B. MR-SGD

The SGD algorithm has two properties:
1) Its run-time scales linearly with the number of iterations and does not depend on the size of the training set;
2) It proceeds by iteratively choosing labeled examples at random from the training set. If the number of iterations is less than the training set size, the instances actually chosen to train the classifier form only a subset of the training set.

Given these properties of SGD, we can use HDFS to store the training set and the MapReduce programming model to select multiple subsets of the training set and train an SVM on each subset in parallel on a cluster of computers. Finally, we aggregate all the trained SVMs using a voting mechanism to predict a test instance. Fig. 1 shows the process of MR-SGD.
Fig. 1 Workflow of MR-SGD: the training set stored in HDFS is sampled into subset 1, subset 2, ..., subset k; each subset trains one SVM (SVM 1, ..., SVM k); the k SVMs are combined into a single predictor.
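The final aggregation step in Fig. 1 is a majority vote over the k per-subset SVMs. A minimal sketch of that vote is given below; the function name `ensemble_predict` and the callable-per-SVM interface are illustrative assumptions, not the paper's Hadoop code.

```python
def ensemble_predict(x, svms):
    """Majority vote of an odd number k of trained SVMs on one test instance.

    `svms` is a list of decision functions, one per subset-trained SVM, each
    returning +1 or -1 for the instance x. Because k is odd, the vote sum is
    never zero, so its sign decides the predicted label.
    """
    votes = [clf(x) for clf in svms]
    return 1 if sum(votes) > 0 else -1
```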
The implementation details of MR-SGD are described below (a plain-Python sketch of the map and reduce sides follows this list).

1) First, we select k subsets from the training set, where k should be odd to support the voting mechanism. According to machine learning theory, more training examples are not always better; the subset size required to train an effective SVM ensemble differs from one training set to another.

Assume we have m training examples and want to randomly choose about d examples for each subset. The Map task deals with each instance in the training set: it loops k times to decide whether the instance should be placed in the corresponding subset. In the ith loop (i = 1, ..., k), it generates a random integer from 1 to m; if the random integer is less than d, the instance is put into the ith subset to be used for training an SVM in a Reduce task.

2) Then the Reduce tasks use the subsets chosen in the Map tasks to train SVMs with the SGD algorithm described in Section II. There are k SVMs, as set in the first step.

3) Finally, we aggregate the k SVMs using the voting mechanism to decide the label of each test sample.
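The sampling and training flow above can be sketched in plain Python as follows. This only simulates the map and reduce roles on one machine (the real MR-SGD runs on Hadoop), and it reuses the `linear_sgd` sketch from Section II; all names here are illustrative.

```python
import random
import numpy as np

def map_task(instance, k, d, m):
    """Map side of MR-SGD: emit (subset_id, instance) pairs.

    For each of the k subsets, draw a random integer in [1, m]; the instance
    joins subset i when the draw is less than d, so each subset ends up with
    roughly d examples on average.
    """
    for subset_id in range(k):
        if random.randint(1, m) < d:
            yield subset_id, instance

def reduce_task(instances, lam, T):
    """Reduce side: train one SVM with SGD on the instances of one subset.

    `instances` is a list of (feature_vector, label) pairs collected for a subset.
    """
    X = np.array([x for x, _ in instances])
    y = np.array([label for _, label in instances])
    return linear_sgd(X, y, lam, T)          # kernel_sgd would be used analogously
```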
IV. EXPERIMENTS
In this section we present experimental results that demonstrate the merits of our algorithm. The baseline SGD algorithm is Pegasos [4]. To evaluate the classification accuracy and convergence rate of Pegasos and MR-SGD, several benchmark datasets are used to illustrate both the linear-kernel and the non-linear-kernel situations. The MR-SGD experiments were carried out on a cluster of 6 machines, each with four E5-2609 2.50 GHz processors and 4 GB of RAM, running CentOS 6.4 and Apache Hadoop 2.4.1.

The experiments were performed on four datasets downloaded from the LIBSVM web page [15]. The Usps and Mnist datasets were used for the task of classifying digits 0, 1, 2, 3, 4 versus the rest of the classes. The original Letter dataset labels represent the 26 letters of the alphabet; we set the first 13 letters as the positive class and the rest as the negative class. We use both the linear kernel and the Gaussian kernel $K(x, y) = e^{-\gamma\|x - y\|^{2}}$ in our experiments. The parameters include the regularization parameter $\lambda_1$ for the linear kernel, $\lambda_2$ for the Gaussian kernel, and the parameter $\gamma$ controlling the width of the Gaussian kernel. The dataset characteristics and the parameters are shown in Table I and Table II, respectively.

TABLE I. DATASETS
Dataset    #Training    #Testing    #Features
Usps       7,291        2,007       256
Mnist      60,000       10,000      780
Letter     15,000       5,000       16
Adult      32,561       16,281      123
TABLE II. PARAMETERS
Parameter    Usps       Mnist      Letter     Adult
λ1           8E-3       2E-4       6E-4       6E-3
λ2           1.35E-5    1.67E-5    1.55E-6    3.07E-5
γ            0.1        0.2        40.0       0.05
For the experiments on the Mnist dataset, the size of the subset used to train each SVM was set to 30,000. On the Usps, Letter and Adult datasets, we used all examples in the original training set, because these training sets are not large and it is better to use all of the training examples to complete the training process. The results are shown in the following figures.
Fig. 2 Testing accuracy on Usps dataset with linear kernel
Fig. 3 Testing accuracy on Mnist dataset with linear kernel
Fig. 4 Testing accuracy on Letter dataset with linear kernel
Fig. 5 Testing accuracy on Adult dataset with linear kernel
Figures 2 to 5 show how both Pegasos and MR-SGD converge as the number of iterations grows; MR-SGD with the linear kernel has a faster convergence rate than Pegasos on all four datasets, and its classification accuracy is slightly higher than that of Pegasos.
Fig. 6 Testing accuracy on Usps dataset with Gaussian kernel
Fig. 7 Testing accuracy on Mnist dataset with Gaussian kernel
Fig. 8 Testing accuracy on Letter dataset with Gaussian kernel
Fig. 9 Testing accuracy on Adult dataset with Gaussian kernel
Figures 6 to 9 show that MR-SGD with the non-linear kernel also has a faster convergence rate than Pegasos on all four datasets, and again its classification accuracy is slightly higher. Comparing Figures 2 to 9, the classification accuracy of MR-SGD with the Gaussian kernel is about 11 percent higher than that of MR-SGD with the linear kernel on the Usps and Mnist datasets and about 24 percent higher on the Letter dataset; on the Adult dataset, the two kernels give similar accuracy.
V. CONCLUSIONS
We proposed a MapReduce-based SVM ensemble with the SGD algorithm, which we call MR-SGD. The goal of the proposed algorithm is to improve the convergence rate and classification accuracy of the basic SGD algorithm for SVM. The experiments show that MR-SGD achieves a faster convergence rate and higher classification accuracy in almost all of the binary classification cases. One direction for future work is to find a better aggregation strategy for the SVM ensemble; another is to explore different sampling methods to further improve the performance of MR-SGD.
ACKNOWLEDGMENT
This research is supported by the Natural Science Foundation of Hebei Province, No. F2015201185.
REFERENCES
[1] C. Cortes and V. Vapnik, "Support Vector Networks," Machine Learning, vol. 20, 1995, pp. 273-297.
[2] T. Zhang, "Solving Large Scale Linear Prediction Problems Using Stochastic Gradient Descent Algorithms," in Proceedings of the International Conference on Machine Learning, 2004, pp. 919-926.
[3] J. Kivinen, A. J. Smola, and R. C. Williamson, "Online Learning with Kernels," IEEE Transactions on Signal Processing, vol. 52, pp. 2165-2176.
[4] S. Shalev-Shwartz, Y. Singer, N. Srebro, and A. Cotter, "Pegasos: Primal Estimated Sub-gradient Solver for SVM," Mathematical Programming, vol. 127, 2011, pp. 3-30.
[5] Z. Wang, K. Crammer, and S. Vucetic, "Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training," Journal of Machine Learning Research, vol. 13, 2012, pp. 3103-3131.
[6] K. Sopyla and P. Drozda, "Stochastic Gradient Descent with Barzilai-Borwein update step for SVM," Information Sciences, vol. 316, 2015, pp. 218-233.
[7] N. K. Alham, M. Li, Y. Liu, and M. Qi, "A MapReduce-based distributed SVM ensemble for scalable image classification and annotation," Computers and Mathematics with Applications, vol. 66, 2013, pp. 1920-1934.
[8] F. O. Catak and M. E. Balaban, "A MapReduce-based distributed SVM algorithm for binary classification," Turkish Journal of Electrical Engineering & Computer Sciences, 2013, pp. 863-873.
[9] J. C. Platt, "Fast Training of Support Vector Machines using Sequential Minimal Optimization," Advances in Kernel Methods, 1999, pp. 185-208.
[10] G. Kimeldorf and G. Wahba, "Some results on Tchebycheffian spline functions," Journal of Mathematical Analysis and Applications, vol. 33, 1971, pp. 82-95.
[11] Apache Hadoop. http://hadoop.apache.org/, 2016.
[12] P. S. Honnutagi, "The Hadoop Distributed File System," International Journal of Computer Science & Information Technologies, vol. 5, 2014, pp. 6238-6243.
[13] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, 2008, pp. 107-113.
[14] S. Sonnenburg, V. Franc, E. Y. Tov, and M. Sebag, "PASCAL large scale learning challenge," 2008.
[15] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2016.