- Open Access
- Authors : Jaswinder Kaur, Dr. Sudhir Kumar Sharma
- Paper ID : IJERTCONV5IS10035
- Volume & Issue : ICCCS – 2017 (Volume 5 – Issue 10)
- Published (First Online): 24-04-2018
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Modified Cascade-2 Algorithm with Adaptive Slope Sigmoidal Function for Function Approximation
Jaswinder Kaur1, Research Scholar, Ansal University, Gurgaon, India
Dr. Sudhir Kumar Sharma2, Professor, KIIT College of Engineering, Gurgaon, India
Abstract: The Cascade-2 algorithm is a variant of the well-known cascade-correlation algorithm, a widely investigated constructive training algorithm for designing cascade feedforward neural networks. This paper proposes a modified Cascade-2 algorithm with adaptive slope sigmoidal function (MC2AASF). The algorithm emphasizes both architectural adaptation and functional adaptation during learning. It is a constructive approach to designing a cascade architecture. To achieve functional adaptation, the slope of the sigmoidal function is adapted during training. A simpler variant, in which the slope parameter of the sigmoidal function used at the hidden-layer nodes is fixed to unity, is derived from MC2AASF. The two variants are compared on five function approximation tasks. Simulation results show that the adaptive slope sigmoidal function presents several advantages over the standard fixed-shape sigmoidal function, resulting in increased flexibility, smoother learning, better generalization performance and better convergence.
Keywords: Adaptive slope sigmoidal function; Cascade-Correlation algorithm; Cascade-2 algorithm; Constructive neural networks; Dynamic node creation.
1. INTRODUCTION
Artificial neural networks have been successfully applied to problems in data processing, robotics, numerical control, decision making, function approximation, classification and regression analysis.
A feedforward neural network (FFNN) is a layered neural network in which the neurons are organized in layers; the neurons in one layer receive their input from the previous layer and feed their output to the next layer. Among the various types of neural networks, FFNNs are the most widely used.
The generalization capability and convergence time of supervised learning in FFNNs depend on various factors, such as the choice of network architecture (number of hidden nodes and network topology), the choice of training algorithm and the choice of activation function of each node. This suggests the need for an algorithm that can find an appropriate network size automatically and that also learns the weights during training.
Constructive neural networks (CoNN) start from a minimal architecture and add hidden nodes incrementally, one at a time.
Many CoNN proposals for regression problems are given in [1]-[4]. Kwok and Yeung [1] survey the major CoNN algorithms for regression problems. Their taxonomy, based on the concept of a state-space search, groups the algorithms into six categories. Among these, the most popular for regression problems is the cascade-correlation algorithm (CCA) proposed by Fahlman and Lebiere [5], followed by the dynamic node creation (DNC) algorithm proposed by Ash [6]. The latter algorithm automatically constructs a single-hidden-layer FFNN, whereas the former constructs a cascade architecture during training.
In each phase, CCA adds one hidden node at a time in a separate hidden layer; the hidden node is connected to all inputs as well as to the previously trained hidden nodes. After the training of the input weights of the current hidden node is completed, it is connected to the output nodes with its input weights frozen, and all inputs of the output nodes are trained again. This algorithm has inspired many new variants and has also been used in reinforcement learning methods. Several variants of CCA and similar algorithms have been proposed in the literature over time. These algorithms differ from each other in various aspects: the connectivity pattern of the current hidden node (i.e., cascade architecture or single-hidden-layer FFNN), the activation function used at the hidden-layer nodes, the objective function used for candidate node training, the optimization method used for training the individual hidden node, the stopping criterion for candidate node training and the halting criterion for node addition. Lastly, they can also be classified on the basis of how the connection weights are frozen and retrained.
The Cascade-2 algorithm [7] was also first proposed by Fahlman, who proposed the idea of CCA. Cascade-2 differs from CCA in that it trains the current hidden node to directly minimize the residual error rather than to maximize its covariance with the residual error. Besides this, the hidden node has adjustable output connections to all of the output nodes; everything else is common to both algorithms. Several authors have demonstrated that CCA is effective for classification tasks but not very successful on regression problems. This is because its correlation term tends to drive the hidden node activations to their extreme values, thereby making it hard for the network to produce a smoothly varying output [8]-[10].
The logistic activation function is widely used at hidden nodes in FFNNs due to its nonlinear capability. In general, the slope parameter of the sigmoidal function is fixed to unity prior to training and cannot be adapted to suit different problems during training. A greater nonlinear mapping capability can be achieved if the slope parameter of the sigmoidal function is adapted from the training data. In the past, many researchers have used an adaptive slope sigmoidal function (ASSF) for fixed-size FFNNs and reported better generalization performance and faster learning with fewer hidden nodes [11]-[15]. The ASSF is defined as follows:

g(x, \lambda) = \frac{1}{1 + e^{-\lambda x}}, \quad x \in \mathbb{R} \qquad (1)

where \mathbb{R} is the set of real numbers and \lambda is called the slope parameter. For each hidden node, the value of the slope parameter is updated during the learning process. If the slope parameter is unity, then this activation function is equivalent to the standard log-sigmoidal function.
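As a minimal illustration (not part of the original paper), the ASSF of (1) and the two partial derivatives used later in the weight update rules can be sketched in Python as follows; the function and variable names are our own.

```python
import numpy as np

def assf(x, lam):
    """Adaptive slope sigmoidal function g(x, lambda) = 1 / (1 + exp(-lambda * x))."""
    return 1.0 / (1.0 + np.exp(-lam * x))

def d_assf_dx(x, lam):
    """Derivative of g with respect to its net input x: lambda * g * (1 - g)."""
    g = assf(x, lam)
    return lam * g * (1.0 - g)

def d_assf_dlam(x, lam):
    """Derivative of g with respect to the slope parameter lambda: x * g * (1 - g)."""
    g = assf(x, lam)
    return x * g * (1.0 - g)

# With lam = 1 the ASSF reduces to the standard log-sigmoid.
print(assf(0.5, 1.0), assf(0.5, 4.0))
```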
In this paper we propose a modified Cascade-2 algorithm with adaptive slope sigmoidal function (MC2AASF).
The paper is organized as follows. Section 2 presents the proposed MC2AASF. Section 3 presents the experimental design used to compare the efficiency of the two variants. Section 4 presents and discusses the results. Section 5 presents the conclusions.
2. THE PROPOSED ALGORITHM
This section presents the MC2AASF for designing a cascade architecture. It differs from the Cascade-2 algorithm in the following four aspects:
- MC2AASF starts the network with one hidden node. Input and output nodes are not directly connected.
- The MC2AASF algorithm uses only one objective function (the squared error criterion) to train the input and output connection weights of each hidden node simultaneously, in a single stage. One practical advantage of MC2AASF is that we do not need to switch between two different optimizations.
- MC2AASF freezes both the input and output connection weights of each trained hidden node.
- The stochastic gradient descent method is used for training the individual hidden node, so we are not restricted to batch mode.

The proposed MC2AASF algorithm focuses on both architectural and functional adaptation during learning. The number of input and output nodes is decided according to the characteristics of a given problem. We formulate MC2AASF for regression problems. Without loss of generality, we consider a minimal architecture with N_i nodes in the input layer and one node in each hidden layer and in the output layer. The output node has a linear activation function, while the hidden-layer nodes use the ASSF defined in (1). One hidden node at a time is added to the current network and trained, and its weights (input and output) do not change after its training is completed. The currently added hidden node is connected to all the input nodes as well as to the previously trained hidden nodes, is connected to the output node, and forms a separate hidden layer in the cascade architecture, just as in the Cascade-2 algorithm. During the training of the current hidden node, the input and output connection weights, the slope parameter and the bias of the output node are trained using the gradient descent method in sequential mode, minimizing the squared error objective function [16].

Let iw_{ni} represent the weight between the n-th hidden node and the i-th input, and ow_k the weight between the k-th hidden node and the output node. The connection weight hiw_{nj} represents the weight between the n-th hidden node and the j-th previously trained hidden node. The connection weights iw_{n0} and ow_0 act as the biases for the n-th hidden node and the output node, respectively. The biases of the hidden nodes and the output node are represented using the 0-th auxiliary input x_0 and the 0-th auxiliary hidden node O_0, respectively; the values of x_0 and O_0 are set to unity. The training pairs are represented by (x_p, f_p), p = 1, 2, ..., P, where P is the number of training exemplars. The index p is always assumed to be present implicitly [16].

If x_i is the i-th component of the input, then the total input to the n-th hidden node is

net_n = \sum_{i=0}^{N_i} iw_{ni} x_i + \sum_{j=1}^{n-1} hiw_{nj} O_j \qquad (2)

The output of the n-th hidden node is

O_n = g(net_n, \lambda_n) = \frac{1}{1 + e^{-\lambda_n net_n}}, \quad n \ge 1 \qquad (3)

A cascade network having n hidden nodes implements the function

f_n(x) = \sum_{k=0}^{n} ow_k O_k = f_{n-1}(x) + F_n(x) \qquad (4)

where f_{n-1}(x) is the function implemented by the cascade architecture that had (n-1) hidden nodes and where

F_n(x) = ow_n O_n + ow_0 \qquad (5)
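To make the cascade mapping of (2)-(5) concrete, the following Python sketch (our own illustration, with assumed variable names) computes the output of a cascade network with n hidden nodes for a single input pattern.

```python
import numpy as np

def cascade_forward(x, iw, hiw, ow, lam):
    """Forward pass of a cascade network.

    x   : input vector of length Ni (without the bias component)
    iw  : list of arrays, iw[n] are the weights from [1, x] to hidden node n+1
    hiw : list of arrays, hiw[n] are the weights from previously trained hidden nodes
    ow  : output weights [ow_0, ow_1, ..., ow_n] (ow_0 is the output bias, O_0 = 1)
    lam : slope parameters, lam[n] for hidden node n+1
    Returns the network output f_n(x) as in (4).
    """
    x_aug = np.concatenate(([1.0], x))      # x_0 = 1 acts as the hidden-node bias input
    O = [1.0]                               # O_0 = 1 acts as the output-node bias
    for n in range(len(iw)):
        net = iw[n] @ x_aug + (hiw[n] @ np.array(O[1:]) if n > 0 else 0.0)   # eq. (2)
        O.append(1.0 / (1.0 + np.exp(-lam[n] * net)))                        # eq. (3), ASSF
    return float(np.dot(ow, O))             # eq. (4): sum_k ow_k * O_k
```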
We can specify the objective function for training the current n-th hidden node by (6), which is the squared error function on a per-example basis:

S = \frac{1}{2}\big(f - f_n(x)\big)^2 = \frac{1}{2}\big(f - f_{n-1}(x) - F_n(x)\big)^2 = \frac{1}{2}\big(e_{n-1} - F_n(x)\big)^2 \qquad (6)

where e_{n-1} is the residual error left by the previously added hidden nodes (i.e., it is the desired output for the current n-th hidden node).

The cascade network is trained by applying the gradient descent method to the minimization of the objective function defined in (6) on a per-pattern basis. If w is any trainable parameter of the network, then its weight increment with a momentum term is defined as follows:

\Delta w_p = -\eta_w \frac{\partial S_p}{\partial w} + \alpha_w \Delta w_{p-1} \qquad (8)

where \alpha_w \in (0, 1) is a constant known as the momentum parameter and \eta_w \in (0, 1) is a constant known as the learning rate. Let e = e_{n-1} - F_n(x) be the residual error; then the weight increment, with the pattern index dropped, is defined as follows:

\Delta w = \eta_w e \frac{\partial F_n}{\partial w} + \alpha_w \Delta w \qquad (9)

We can easily derive the following results:

\frac{\partial F_n}{\partial ow_k} = O_k, \quad k = 0, n \qquad (10)

\frac{\partial F_n}{\partial iw_{ni}} = ow_n \frac{\partial O_n}{\partial net_n} x_i, \quad i = 0, 1, \ldots, N_i \qquad (11)

\frac{\partial F_n}{\partial hiw_{nj}} = ow_n \frac{\partial O_n}{\partial net_n} O_j, \quad j = 1, \ldots, n-1 \qquad (12)

\frac{\partial O_n}{\partial net_n} = \lambda_n O_n (1 - O_n), \quad n \ge 1 \qquad (13)

\frac{\partial F_n}{\partial \lambda_n} = ow_n \frac{\partial O_n}{\partial \lambda_n} = ow_n \, net_n O_n (1 - O_n), \quad n \ge 1 \qquad (14)

We can now write the weight update rules, for n = 1, 2, ..., N_h, where N_h is the maximum number of hidden nodes added to the cascade network architecture, and p = 1, 2, ..., P:

\Delta ow_k = \eta_w e O_k + \alpha_w \Delta ow_k, \quad k = 0, n \qquad (15)

\Delta iw_{ni} = \eta_w e \, ow_n \lambda_n O_n (1 - O_n) x_i + \alpha_w \Delta iw_{ni}, \quad i = 0, 1, \ldots, N_i \qquad (16)

\Delta hiw_{nj} = \eta_w e \, ow_n \lambda_n O_n (1 - O_n) O_j + \alpha_w \Delta hiw_{nj}, \quad j = 1, 2, \ldots, n-1 \qquad (17)

\Delta \lambda_n = \eta_\lambda e \, ow_n net_n O_n (1 - O_n) + \alpha_\lambda \Delta \lambda_n \qquad (18)

where \eta_\lambda and \alpha_\lambda are the learning rate and momentum constant for the slope parameter.

The proposed algorithm is empirically compared against the variant in which the slope parameter is kept constant (equal to unity), i.e., it is not updated at all (MC2A).
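As an illustrative sketch (not the authors' code), the update rules (15)-(18) could be applied to the current hidden node with stochastic gradient descent as follows; the helper assumes made-up names and that the residual errors of the frozen part of the network and the outputs of the previously trained hidden nodes have already been computed per pattern.

```python
import numpy as np

def train_candidate_node(X, residuals, prev_O, eta_w=0.1, alpha_w=0.8,
                         eta_lam=0.1, alpha_lam=0.8, epochs=300):
    """Train the current (n-th) hidden node with SGD, following (15)-(18).

    X         : (P, Ni) training inputs
    residuals : (P,) residual errors e_{n-1} left by the already frozen nodes
    prev_O    : (P, n-1) outputs of the previously trained hidden nodes per pattern
    Returns the learned input weights, hidden-to-hidden weights, output weights and slope.
    """
    P, Ni = X.shape
    rng = np.random.default_rng(0)
    iw = rng.uniform(-1, 1, Ni + 1)          # input weights incl. bias (x_0 = 1)
    hiw = rng.uniform(-1, 1, prev_O.shape[1])  # weights from previous hidden nodes
    ow = rng.uniform(-1, 1, 2)               # [ow_0 (output bias), ow_n]
    lam = 1.0                                # slope parameter starts at unity
    d_iw, d_hiw = np.zeros_like(iw), np.zeros_like(hiw)
    d_ow, d_lam = np.zeros_like(ow), 0.0     # momentum accumulators

    for _ in range(epochs):
        for p in rng.permutation(P):
            x_aug = np.concatenate(([1.0], X[p]))
            net = iw @ x_aug + hiw @ prev_O[p]                  # eq. (2), new node only
            O_n = 1.0 / (1.0 + np.exp(-lam * net))              # eq. (3)
            F_n = ow[1] * O_n + ow[0]                           # eq. (5)
            e = residuals[p] - F_n                              # residual error from (6)
            dO_dnet = lam * O_n * (1.0 - O_n)                   # eq. (13)

            d_ow = eta_w * e * np.array([1.0, O_n]) + alpha_w * d_ow            # eq. (15)
            d_iw = eta_w * e * ow[1] * dO_dnet * x_aug + alpha_w * d_iw         # eq. (16)
            d_hiw = eta_w * e * ow[1] * dO_dnet * prev_O[p] + alpha_w * d_hiw   # eq. (17)
            d_lam = eta_lam * e * ow[1] * net * O_n * (1.0 - O_n) + alpha_lam * d_lam  # eq. (18)

            ow += d_ow; iw += d_iw; hiw += d_hiw
            # Keep the slope inside [0.1, 10]; our interpretation of the range reported in Section 3.
            lam = float(np.clip(lam + d_lam, 0.1, 10.0))
    return iw, hiw, ow, lam
```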
3. EXPERIMENTAL DESIGN
The following five two-dimensional regression functions are used to compare the learning behavior of MC2AASF and MC2A. These functions have been studied in [2], [4]:
(a) Simple interaction function (SIF)

y = 10.391\big((x_1 - 0.4)(x_2 - 0.6) + 0.36\big) \qquad (19)

(b) Radial function (RF)

y = 24.234\big[(x_1 - 0.5)^2 + (x_2 - 0.5)^2\big]\big(0.75 - (x_1 - 0.5)^2 - (x_2 - 0.5)^2\big) \qquad (20)

(c) Harmonic function (HF)

y = 42.659\big(0.1 + (x_1 - 0.5)\big(0.05 + (x_1 - 0.5)^4 - 10 (x_1 - 0.5)^2 (x_2 - 0.5)^2 + 5 (x_2 - 0.5)^4\big)\big) \qquad (21)

(d) Additive function (AF)

y = 1.3356\big(1.5 (1 - x_1) + e^{2 x_1 - 1} \sin\big(3\pi (x_1 - 0.6)^2\big) + e^{3 (x_2 - 0.5)} \sin\big(4\pi (x_2 - 0.9)^2\big)\big) \qquad (22)

(e) Complicated interaction function (CIF)

y = 1.9\big(1.35 + e^{x_1} \sin\big(13 (x_1 - 0.6)^2\big) \, e^{-x_2} \sin(7 x_2)\big) \qquad (23)
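For reference, a small Python sketch (ours, not from the paper) of the five benchmark functions as reconstructed in (19)-(23):

```python
import numpy as np

def sif(x1, x2):   # (19) simple interaction function
    return 10.391 * ((x1 - 0.4) * (x2 - 0.6) + 0.36)

def rf(x1, x2):    # (20) radial function
    r2 = (x1 - 0.5) ** 2 + (x2 - 0.5) ** 2
    return 24.234 * r2 * (0.75 - r2)

def hf(x1, x2):    # (21) harmonic function
    return 42.659 * (0.1 + (x1 - 0.5) * (0.05 + (x1 - 0.5) ** 4
                     - 10 * (x1 - 0.5) ** 2 * (x2 - 0.5) ** 2 + 5 * (x2 - 0.5) ** 4))

def af(x1, x2):    # (22) additive function
    return 1.3356 * (1.5 * (1 - x1)
                     + np.exp(2 * x1 - 1) * np.sin(3 * np.pi * (x1 - 0.6) ** 2)
                     + np.exp(3 * (x2 - 0.5)) * np.sin(4 * np.pi * (x2 - 0.9) ** 2))

def cif(x1, x2):   # (23) complicated interaction function
    return 1.9 * (1.35 + np.exp(x1) * np.sin(13 * (x1 - 0.6) ** 2)
                  * np.exp(-x2) * np.sin(7 * x2))
```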
For each function, 1,450 uniformly distributed random points were generated in the two-dimensional space 0 <= x_1, x_2 <= 1. The data were normalized to the interval [-1, +1] and then partitioned into a training set (TRS), a validation set (VS) and a testing set (TS). The first 225 exemplars were used for the TRS, the following 225 exemplars for the VS and the final 1,000 exemplars for the TS.

Thirty independent runs were performed for each regression function. For each trial, the initial weight sets were generated at random in the interval [-1, +1].

After a series of experiments, we set the values of the parameters as constants for all regression functions. Hidden nodes were added up to a maximum of 15. Each individual hidden node was trained up to a maximum of 300 epochs. The learning rate \eta_w and momentum constant \alpha_w for the weights were 0.1 and 0.8, respectively. The learning rate \eta_\lambda and momentum constant \alpha_\lambda for the slope parameter were 0.1 and 0.8, respectively. We started the slope parameter at a value of unity and updated it so that it reached its optimal value during training. Each trained hidden node acquired a different optimal value in the range [\lambda_{min}, \lambda_{max}] = [0.1, 10] in our simulations.

The final performance of the selected network (the network configuration at which the validation MSE was minimum) was measured on the TS.
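A minimal data-preparation sketch (our own, assuming the function definitions above and plain min-max scaling, which the paper does not spell out) for the 225/225/1,000 split:

```python
import numpy as np

def make_dataset(func, n_points=1450, seed=0):
    """Sample points in [0, 1]^2, evaluate func, scale everything to [-1, 1] and split."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=(n_points, 2))
    y = func(x[:, 0], x[:, 1])
    data = np.column_stack([x, y])
    # Min-max normalization of inputs and target to [-1, +1] (one plausible reading).
    lo, hi = data.min(axis=0), data.max(axis=0)
    data = 2.0 * (data - lo) / (hi - lo) - 1.0
    train, valid, test = data[:225], data[225:450], data[450:]
    return train, valid, test

train, valid, test = make_dataset(sif)   # e.g. the SIF benchmark
```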
4. RESULTS AND DISCUSSIONS
The results of the 300 experiments conducted are presented in this section. All executed experiments were considered when drawing the summary. For brevity, the summary data are presented in Table 1. The following measures are used to compare the two variants of the algorithm.
- The minimum of the MSE (MINMSE) on the test set achieved in all the experiments for a regression function is in the third column.
- The maximum of the MSE (MAXMSE) on the test set achieved in all the experiments for a regression function is in the fourth column.
- The mean of the MSE (MMSE) on the test set achieved in all the experiments for a regression function is in the fifth column.
- The standard deviation of the MSE (STDMSE) on the test set achieved in all the experiments for a regression function is in the sixth column.
- The minimum number of hidden nodes (MINHN) found by the algorithm in all the experiments for a regression function is in the seventh column.
- The mean number of hidden nodes (MHN) found by the algorithm in all the experiments for a regression function is in the eighth column.
- The standard deviation of the number of hidden nodes (STDHN) found by the algorithm in all the experiments for a regression function is in the ninth column.
- The ratio of the mean MSE (RMMSE) of the MC2A to that of the MC2AASF in all the experiments is in the tenth column.

TABLE 1. SUMMARY RESULTS OF THE TWO VARIANTS OF THE PROPOSED ALGORITHM.

Function | Algorithm | MINMSE (x10^-2) | MAXMSE (x10^-2) | MMSE (x10^-2) | STDMSE (x10^-2) | MINHN | MHN    | STDHN | RMMSE
SIF      | MC2A      | 0.1855          | 0.7594          | 0.2694        | 0.1029          | 5     | 12.667 | 2.783 | 1.141
SIF      | MC2AASF   | 0.1147          | 0.3799          | 0.2361        | 0.0696          | 9     | 13.433 | 1.995 |
RF       | MC2A      | 0.8914          | 4.0576          | 1.8828        | 0.7022          | 6     | 13.600 | 2.078 | 1.413
RF       | MC2AASF   | 0.7789          | 3.0431          | 1.3326        | 0.5474          | 7     | 13.000 | 2.304 |
HF       | MC2A      | 5.1611          | 6.7188          | 6.2061        | 0.4368          | 1     | 5.333  | 4.880 | 2.200
HF       | MC2AASF   | 1.7654          | 4.4761          | 2.8211        | 0.7366          | 6     | 11.867 | 2.849 |
AF       | MC2A      | 1.5372          | 8.6172          | 3.0695        | 1.9604          | 8     | 13.200 | 2.265 | 1.079
AF       | MC2AASF   | 1.3102          | 5.0254          | 2.8445        | 0.9277          | 6     | 13.200 | 2.578 |
CIF      | MC2A      | 2.8277          | 3.9131          | 3.1953        | 0.2713          | 3     | 11.433 | 3.936 | 1.309
CIF      | MC2AASF   | 1.6035          | 3.3129          | 2.4401        | 0.5368          | 8     | 13.233 | 1.888 |
The MC2AASF gives a lower MSE than the MC2A variant for all the regression functions. In order to assess the significance of the difference in generalization performance, we performed a t-test. The null hypothesis is rejected at the 95% confidence level for all the regression functions, from which it is inferred that there is a significant difference between the MSE achieved by the two variants of the algorithm. The RMMSE is greater than one for all tasks. All this shows that the generalization performance and convergence capability of MC2AASF are better than those of MC2A.
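The paper does not state which form of t-test was used; as an illustration only, a two-sample comparison of the 30 per-run test MSEs of the two variants on one benchmark could be carried out with SciPy as follows (the data files are hypothetical placeholders).

```python
import numpy as np
from scipy import stats

# Test MSEs from the 30 independent runs of each variant on one benchmark (placeholder files).
mse_mc2a = np.loadtxt("mse_mc2a_sif.txt")        # hypothetical results file
mse_mc2aasf = np.loadtxt("mse_mc2aasf_sif.txt")  # hypothetical results file

# Two-sample t-test; reject the null hypothesis of equal means when p < 0.05.
t_stat, p_value = stats.ttest_ind(mse_mc2a, mse_mc2aasf, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```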
A common feature of both variants is the freezing of previously trained nodes, for the sake of computational efficiency and to avoid the moving-target problem. Since the local error is already computed as a necessary part of the weight update equations, updating the slope parameter does not impose any significant computational burden on the MC2AASF variant.
5. CONCLUSION
In this paper, we proposed a modified Cascade-2 algorithm with adaptive slope sigmoidal function. The algorithm is a constructive approach to building a cascade architecture, thus obviating the need to guess the network architecture a priori. Functional adaptation is achieved through the adaptive slope parameter of the sigmoidal function, which prevents the nonlinear nodes from saturating and increases their learning capability. The algorithm determines not only the optimum number of hidden nodes in the cascade architecture but also the optimum slope parameter for each of them. From the results obtained, we may conclude that MC2AASF gives better generalization performance and smoother learning than the MC2A variant. The proposed constructive training algorithm can be used, for example, for forecasting the electric energy demand of a smart city.
REFERENCES
[1] T. Y. Kwok and D. Y. Yeung, "Constructive algorithms for structure learning in feedforward neural networks for regression problems," IEEE Transactions on Neural Networks, vol. 8, no. 3, pp. 630-645, May 1997.
[2] T. Y. Kwok and D. Y. Yeung, "Objective functions for training new hidden units in constructive neural networks," IEEE Transactions on Neural Networks, vol. 8, no. 5, pp. 1131-1148, 1997.
[3] J. J. T. Lahnajarvi, M. I. Lehtokangas, and J. P. P. Saarinen, "Evaluation of constructive neural networks with cascaded architectures," Neurocomputing, vol. 48, pp. 573-607, 2002.
[4] L. Ma and K. Khorasani, "New training strategies for constructive neural networks with application to regression problems," Neural Networks, vol. 17, pp. 589-609, 2004.
[5] S. E. Fahlman and C. Lebiere, "The cascade-correlation learning architecture," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 524-532.
[6] T. Ash, "Dynamic node creation in backpropagation networks," Connection Science, vol. 1, no. 4, pp. 365-375, 1989.
[7] L. Prechelt, "Investigation of the CasCor family of learning algorithms," Neural Networks, vol. 10, no. 5, pp. 885-896, 1997.
[8] S. E. Fahlman and J. A. Boyan, "The Cascade 2 learning architecture," Technical Report CMU-CS-94-100, Carnegie Mellon University, 1994.
[9] M. C. Nechyba and Y. Xu, "Neural network approach to control system identification with variable activation functions," in Proc. IEEE International Symposium on Intelligent Control, Columbus, Ohio, USA, 16-18 August 1994.
[10] J. N. Hwang, S. Shien, and S. R. Lay, "The cascade-correlation learning: a projection pursuit learning perspective," IEEE Transactions on Neural Networks, vol. 7, no. 2, March 1996.
[11] T. Yamada and T. Yabuta, "Remarks on a neural network controller which uses an auto-tuning method for nonlinear functions," in Proc. IJCNN, 1992, vol. 2, pp. 775-780.
[12] Z. Hu and H. Shao, "The study of neural network adaptive control systems," Control and Decision, vol. 7, pp. 361-366, 1992.
[13] C. T. Chen and W. D. Chang, "A feedforward neural network with function shape autotuning," Neural Networks, vol. 9, no. 4, pp. 627-641, 1996.
[14] S. Xu and M. Zhang, "A novel adaptive activation function," in Proc. Int. Joint Conf. Neural Networks, 2001, vol. 4, pp. 2779-2782.
[15] P. Chandra and Y. Singh, "An activation function adapting training algorithm for sigmoidal feedforward networks," Neurocomputing, vol. 61, pp. 429-437, 2004.
[16] S. K. Sharma and P. Chandra, "An adaptive slope sigmoidal function cascading neural networks algorithm," in Proc. 3rd International Conference on Emerging Trends in Engineering and Technology (ICETET), IEEE, 2010.