Abstract
The success of sigmoidal feedforward networks (SFNs) in solving complex learning tasks can be attributed to their universal approximation property. These networks are trained with non-linear iterative optimization methods (first-order or second-order). Since the convergence rate of SFN training is affected by the initial choice of weights, we propose two new weight initialization routines (Routine-1 and Routine-2) that use characteristics of the input and output data and properties of the activation function. Routine-1 exploits the linear dependence of the weight-update step size on the derivative of the activation function: it initializes weights and biases so that the activation function operates in the region near zero input, where the derivative is maximal, thereby increasing the weight-update step size and hence the convergence speed. The same principle underlies Routine-2, which initializes weights and biases so that each hidden node is activated at a distinct point in the significant range of the activation function (where the significant range is the non-saturated region); each node therefore evolves independently of the others and acts as a distinct feature identifier. Initializing weights in the significant range also reduces the chance of hidden nodes getting stuck in a saturated state. Networks initialized with the proposed routines converge faster and have a higher probability of reaching deeper minima. The efficiency of the proposed routines is evaluated by comparing them with the conventional random weight initialization routine and 11 weight initialization routines from the literature (4 well-established routines and 7 recently proposed routines) on several benchmark problems. The proposed routines are also tested on larger network sizes and larger datasets such as MNIST.
The results show that the proposed routines perform better than the conventional random weight initialization routine and the 11 established weight initialization routines.
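The idea behind Routine-1 can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the paper's exact formula: it assumes tanh hidden units, inputs scaled to [−1, 1], and a hypothetical weight bound `active_range / n_in` chosen so that the net input of each hidden node stays inside the active range of tanh, where the derivative is large.

```python
import numpy as np

def init_near_zero(n_in, n_hidden, active_range=2.1783, rng=None):
    """Illustrative initialization: keep net inputs near zero, where
    tanh'(x) = 1 - tanh(x)**2 is largest (maximum 1 at x = 0).

    Assumes inputs are scaled to [-1, 1]; the bound active_range / n_in
    is a hypothetical choice, not the routine's exact formula.
    """
    rng = np.random.default_rng(rng)
    bound = active_range / n_in          # keeps |sum_i w_i x_i| within the active range
    W = rng.uniform(-bound, bound, size=(n_hidden, n_in))
    b = np.zeros(n_hidden)               # zero bias centres each node at tanh(0) = 0
    return W, b

W, b = init_near_zero(n_in=4, n_hidden=8, rng=0)
x = np.ones(4)                           # worst-case input on the [-1, 1] scale
net = W @ x + b
assert np.all(np.abs(net) <= 2.1783)     # every node starts in the active range
```

With this bound, even the worst-case input cannot push a hidden node into the saturated region at initialization, so every node starts with a usefully large derivative.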
Notes
A linear activation function is used at the output layer in this paper, though a sigmoidal activation function can also be used.
If we augment the input with x0 = 1, the relation can also be written as \(n_{j}^{(h)}={\sum }_{i=0}^{I} w_{ji}x_{i}\), where wj0 ≡ 𝜃j. In this paper, the threshold 𝜃j is shown explicitly.
Error refers to the deviation of the actual output of the SFN from the desired output.
(−λ, λ) is the active/useful range of the activation function, in which the value of its derivative is greater than 5% of the maximum value. For the hyperbolic tangent function, λ = 2.1783 (rounded to four decimal places).
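The value of λ quoted in this note can be checked directly: since tanh'(x) = 1 − tanh(x)², with maximum 1 at x = 0, the boundary of the active range solves 1 − tanh(λ)² = 0.05, i.e. λ = atanh(√0.95).

```python
import math

# tanh'(x) = 1 - tanh(x)**2 has maximum 1 at x = 0.
# The active-range boundary solves 1 - tanh(lam)**2 = 0.05,
# i.e. tanh(lam) = sqrt(0.95), so lam = atanh(sqrt(0.95)).
lam = math.atanh(math.sqrt(0.95))
print(round(lam, 4))  # 2.1783

# At x = lam the derivative is exactly 5% of its maximum.
assert abs((1 - math.tanh(lam) ** 2) - 0.05) < 1e-12
```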
The initialization of γ for the logistic activation function follows the same considerations as for the hyperbolic tangent activation function, as stated after relation (29).
The Wine Quality dataset can be used for both regression and classification. In this work, it is used as a regression problem.
For each of the 12 training problems and each of the 3 weight initialization routines, 30 networks are trained. Thus we obtain 12 matrices of size 3 × 30 for MSEtrain and 12 matrices of size 3 × 30 for MSEtest. Due to the volume of data obtained, the results for MSEtrain and MSEtest are not reported in this work.
A t-test table is obtained for each of the 14 training problems. An entry of 1 in the i-th row and j-th column indicates that method i is statistically better than method j, whereas an entry of 0 indicates that method i is statistically similar to method j. Due to the volume of data, we report only a summarized result (all the tables obtained for each function are superimposed).
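The superposition described in this note amounts to an entrywise sum of the 0/1 tables, so each summary entry counts how many problems yielded a significant win. A hypothetical example with two methods and three problems:

```python
import numpy as np

# Hypothetical 0/1 significance tables, one per training problem:
# entry (i, j) = 1 means method i is statistically better than method j.
tables = [
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 1], [0, 0]]),
    np.array([[0, 0], [1, 0]]),
]

# "Superimposing" the tables: entry (i, j) of the summary counts
# on how many problems method i was significantly better than method j.
summary = np.sum(tables, axis=0)
print(summary)  # [[0 2] [1 0]]: method 1 beat method 2 on two problems
```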
For each of the 14 training problems and each of the 12 WIRs, 30 networks are trained. Thus we obtain 14 matrices of size 12 × 30 for MSEtrain and 14 matrices of size 12 × 30 for MSEtest. Due to the volume of data obtained, the results for MSEtrain and MSEtest are not reported in this work.
Acknowledgements
This publication is an outcome of R&D work undertaken in a project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, implemented by Digital India Corporation. The authors would like to thank the editor of the journal for providing feedback.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Mittal, A., Singh, A.P. & Chandra, P. Weight and bias initialization routines for Sigmoidal Feedforward Network. Appl Intell 51, 2651–2671 (2021). https://doi.org/10.1007/s10489-020-01960-5