
Weight and bias initialization routines for Sigmoidal Feedforward Network


Abstract

The success of Sigmoidal Feedforward Networks (SFNs) in solving complex learning tasks can be attributed to their Universal Approximation Property. These networks are trained using non-linear iterative optimization methods (of first or second order) to solve a learning task. The convergence rate of SFN training is affected by the initial choice of weights; therefore, in this paper we propose two new weight initialization routines (Routine-1 and Routine-2) that use characteristics of the input and output data and properties of the activation function. Routine-1 exploits the linear dependence of the weight update step size on the derivative of the activation function: it initializes the weights and biases so that the activation function operates in the region near zero (input), where the derivative is maximum, thereby increasing the weight update step size and hence the convergence speed. The same principle is used to derive Routine-2, which initializes the weights and biases so that each node activates a distinct point in the significant range of the activation function (the significant range being the non-saturated region), such that the nodes evolve independently of each other and act as distinct feature identifiers. Initializing weights in the significant range reduces the chance of hidden nodes getting stuck in a saturated state. Networks initialized with the proposed routines converge faster and have a higher probability of reaching deeper minima. The efficiency of the proposed routines is evaluated by comparing them with the conventional random weight initialization routine and 11 weight initialization routines proposed in the literature (4 well-established routines and 7 recently proposed routines) on several benchmark problems. The proposed routines are also tested on larger network sizes and larger datasets such as MNIST. The results show that the proposed routines perform better than the conventional random weight initialization routine and the 11 established weight initialization routines.
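The exact initialization formulas for Routine-1 and Routine-2 are derived in the full text. Purely as an illustration of the principle described above, the sketch below initializes a tanh hidden layer so that pre-activations fall inside the non-saturated range (−λ, λ), with λ ≈ 2.1783 as in Note 4, and spreads the biases over distinct points of that range. The scaling heuristic and the function name `init_hidden_layer` are assumptions made for this example, not the paper's routines.

```python
import numpy as np

def init_hidden_layer(X, n_hidden, lam=2.1783, rng=None):
    """Illustrative initializer (not the paper's Routine-1/Routine-2).

    Keeps the pre-activations of a tanh hidden layer inside the
    non-saturated range (-lam, lam) and places each bias at a distinct
    point of that range.

    X        : (N, I) array of training inputs, used only for their scale.
    n_hidden : number of hidden nodes.
    lam      : half-width of the significant range of tanh (see Note 4).
    """
    rng = np.random.default_rng() if rng is None else rng
    n_inputs = X.shape[1]

    # Heuristic scale: a typical pre-activation, summed over n_inputs terms
    # of size mean|x|, should not exceed lam, so the derivative stays large.
    typical_input = np.mean(np.abs(X)) * np.sqrt(n_inputs) + 1e-12
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_inputs)) * lam / typical_input

    # Spread biases over distinct points of (-lam, lam) so that each node
    # starts in a different part of the active region.
    b = np.linspace(-lam, lam, n_hidden)
    return W, b
```

Called as `W, b = init_hidden_layer(X_train, n_hidden=10)`, each of the 10 hidden nodes starts centred on a different point of the active region, so the nodes can evolve as distinct feature identifiers.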


Notes

  1. A linear activation function is used at the output layer in this paper, though a sigmoidal activation function can also be used.

  2. If we augment the input with \(x_{0}=1\), then the relation can also be written as \(n_{j}^{(h)}={\sum }_{i=0}^{I} w_{ji}x_{i}\), where \(w_{j0}=\theta_{j}\). In this paper, we show the threshold \(\theta_{j}\) explicitly.

  3. Error refers to the deviation of the actual output of the SFN from the desired output.

  4. (−λ, λ) is the active/useful range of the activation function, in which the value of its derivative is greater than 5% of the maximum value. The value of λ for the hyperbolic tangent function equals 2.1783 (approximated up to four decimal places); a short check is given after these notes.

  5. The consideration for initializing γ for the logistic activation function is the same as that for the hyperbolic tangent activation function, as stated after relation (29).

  6. https://www.kaggle.com/pablomonleon/montreal-bike-lanes

  7. https://www.kaggle.com/pablomonleon/montreal-bike-lanes

  8. The Wine Quality dataset can be used for both regression and classification. In this work, it is used as a regression problem.

  9. For each of the 12 training problems and for the 3 weight initialization routines, 30 networks are trained. Thus, for each problem we obtain a 3 × 30 matrix of MSEtrain values and a 3 × 30 matrix of MSEtest values. Due to the volume of data obtained, the individual MSEtrain and MSEtest results are not reported in this work.

  10. Fourteen t-test tables are obtained, one for each training problem. An entry of 1 in the i-th row and j-th column indicates that method i is statistically better than method j, whereas an entry of 0 indicates that method i is statistically similar to method j. Due to the volume of data, we report a summarized result (the tables obtained for all the functions are superimposed).

  11. For each of the 14 training problems and for each of the 12 WIRs, 30 networks are trained. Thus, we obtain 14 matrices of size 12 × 30 for MSEtrain and 14 matrices of size 12 × 30 for MSEtest. Due to the volume of data obtained, the individual MSEtrain and MSEtest results are not reported in this work.
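For reference, the value of λ quoted in Note 4 follows directly from the 5% condition stated there: the derivative of the hyperbolic tangent is \(1-\tanh^{2}(x)\), with maximum value 1 at \(x=0\), so λ solves \(1-\tanh^{2}(\lambda)=0.05\), i.e. \(\lambda=\operatorname{artanh}(\sqrt{0.95})\). A short check in Python:

```python
import math

# Note 4: the derivative of tanh, 1 - tanh(x)^2, has maximum 1 at x = 0 and
# falls to 5% of that maximum where 1 - tanh(lam)^2 = 0.05.
lam = math.atanh(math.sqrt(0.95))
print(round(lam, 4))  # 2.1783
```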


Acknowledgements

This publication is an outcome of the R&D work undertaken in a project under the Visvesvaraya PhD Scheme of the Ministry of Electronics & Information Technology, Government of India, being implemented by Digital India Corporation. The authors would like to thank the editor of the journal for providing feedback.

Author information

Corresponding author

Correspondence to Apeksha Mittal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article

Cite this article

Mittal, A., Singh, A.P. & Chandra, P. Weight and bias initialization routines for Sigmoidal Feedforward Network. Appl Intell 51, 2651–2671 (2021). https://doi.org/10.1007/s10489-020-01960-5

