Abstract
Symbolic regression (SR) is a powerful method for building predictive models from data without assuming any model structure. Traditionally, genetic programming (GP) was used as the SR engine. However, for these purely evolutionary methods it was quite hard even to accommodate the function to the range of the data, and the training was consequently inefficient and slow. Recently, several SR algorithms have emerged that employ multiple linear regression, which allows them to create models with relatively small error right from the beginning of the search. Such algorithms are claimed to be orders of magnitude faster than SR algorithms based on classic GP. However, a systematic comparison of these algorithms on a common set of problems is still missing, and there is no basis on which to decide which algorithm to use. In this paper we conceptually and experimentally compare several representatives of such algorithms: GPTIPS, FFX, and EFS. We also include GSGP-Red, an enhanced version of geometric semantic genetic programming, an important algorithm in the field of SR. The algorithms are applied as off-the-shelf, ready-to-use techniques, mostly with their default settings. The methods are compared on several synthetic SR benchmark problems as well as real-world ones ranging from civil engineering to aerodynamics and acoustics. Their performance is also related to that of three conventional machine learning algorithms: multiple regression, random forests, and support vector regression. The results suggest that across all the problems the algorithms have comparable performance. We provide basic recommendations to the user regarding the choice of algorithm.
Notes
By “vanilla GP” we mean the original system presented by Koza in [14] or derived systems that rely solely on tree manipulation to evolve the final model. However, we do not consider all tree-based GP systems “vanilla”. An example of such a system is GPTIPS (mentioned later in the Introduction and discussed further in Sect. 2.2.1), which is tree-based and uses tree manipulation as the main driver of structural changes, but has other features beyond tree manipulation alone.
However, we cannot call GSGP-Red a technology because the code [9] does not actually output the model, only the metrics needed for evaluation of the algorithm. Nevertheless, we included the algorithm because it otherwise fulfills the criteria and is an important algorithm in the field of SR.
By an internal constant we mean a constant other than a coefficient of a top-level linear combination. Example: in \(3x^2 + 6\sin (1.3x)\), the “3” and “6” are not internal constants, because these are tuned by the top-level multiple regression, while the “1.3” is an internal constant (part of the nonlinear basis function).
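The distinction can be illustrated with a minimal sketch (not taken from any of the compared implementations): given fixed nonlinear basis functions, the top-level coefficients are recovered by ordinary linear least squares, while the internal constant stays fixed inside the basis function.

```python
import numpy as np

# For the model 3*x**2 + 6*sin(1.3*x): the internal constant 1.3 is
# part of the nonlinear basis function and is NOT tuned here, while
# the top-level coefficients (3 and 6) are found by multiple linear
# regression over the basis functions.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 3 * x**2 + 6 * np.sin(1.3 * x)

# Design matrix of basis functions; 1.3 is baked into the second column.
Phi = np.column_stack([x**2, np.sin(1.3 * x)])
coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(coeffs)  # approximately [3. 6.]
```

Tuning “1.3” itself would require a nonlinear optimization step, which is exactly what distinguishes internal constants from top-level coefficients.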
When we tried to run PGE, it crashed several times for reasons that were not obvious, which prevents the user from running experiments systematically. An example of a bug we found is on line 27 of the source file expand.py (see https://github.com/verdverm/pypge/blob/a6a031fb/pypge/expand.py#L27): a typo in a variable name prevents the use of the square root function node. We did not track down the causes of the other crashes.
For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.ensemble.RandomForestRegressor.html.
For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.grid_search.GridSearchCV.html.
For details about the implementation and parameters see http://scikit-learn.org/0.17/modules/generated/sklearn.svm.SVR.html.
The only exception is EFS: we changed the round variable to false (which was originally hard-coded to true) according to the issue we opened on the algorithm’s GitHub repository, see https://github.com/flexgp/efs/issues/1.
FFX has a built-in 50 s timeout for fitting the elastic net. If the elastic net fails to fit within this time, a constant model is returned for that particular fit. Note, however, that FFX fits the elastic net multiple times within each of its multiple runs (see Sect. 2.2.3), and there is no support for a timeout over this combined run that would return the results obtained so far.
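The per-fit fallback pattern can be sketched as follows; this is a hypothetical illustration of the behavior described above, not FFX's actual code, and `fit_with_timeout` and `slow_fit` are names we introduce here.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutTimeout
import time

def fit_with_timeout(fit_fn, X, y, timeout_s=50.0):
    # If a single fit exceeds the time budget, fall back to a constant
    # model (the mean of y). Note there is no analogous timeout over
    # the combined multi-run procedure.
    with ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(fit_fn, X, y).result(timeout=timeout_s)
        except FutTimeout:
            c = sum(y) / len(y)
            return lambda X_new: [c] * len(X_new)

def slow_fit(X, y):
    time.sleep(0.2)  # stand-in for an expensive elastic-net fit
    return lambda X_new: [0.0] * len(X_new)

model = fit_with_timeout(slow_fit, [[1.0], [2.0]], [4.0, 6.0], timeout_s=0.05)
print(model([[3.0]]))  # falls back to the constant model: [5.0]
```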
The number of nodes is used as a simple common measure of complexity across all the algorithms, for reporting purposes only. The individual algorithms use their own measures of complexity to find the best model.
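For concreteness, a minimal sketch of such a node count, assuming the model is available as a Python-parseable expression string (the paper does not specify its counting code): every operator, function call, variable, and constant counts as one node.

```python
import ast

def count_nodes(expr_str):
    """Count the nodes of an expression tree: each operator, function
    call, variable, and constant contributes one node. Function names
    are counted once, as part of the call node."""
    tree = ast.parse(expr_str, mode="eval")
    func_names = {id(n.func) for n in ast.walk(tree) if isinstance(n, ast.Call)}
    return sum(
        1 for n in ast.walk(tree)
        if isinstance(n, (ast.BinOp, ast.UnaryOp, ast.Call, ast.Constant))
        or (isinstance(n, ast.Name) and id(n) not in func_names)
    )

print(count_nodes("x + 1"))                # 3
print(count_nodes("3*x**2 + 6*sin(1.3*x)"))  # 12
```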
References
I. Arnaldo, K. Krawiec, U.M. O’Reilly, Multiple regression genetic programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 879–886. https://doi.org/10.1145/2576768.2598291
I. Arnaldo, U.M. O’Reilly, K. Veeramachaneni, Building predictive models via feature synthesis, in Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation, GECCO ’15 (ACM, New York, 2015), pp. 983–990. https://doi.org/10.1145/2739480.2754693
K. Bache, M. Lichman, UCI machine learning repository (2013). http://archive.ics.uci.edu/ml. Accessed 30 Jan 2016
V.V. De Melo, Kaizen programming, in Proceedings of the 2014 Annual Conference on Genetic and Evolutionary Computation, GECCO ’14 (ACM, New York, 2014), pp. 895–902. https://doi.org/10.1145/2576768.2598264
EFS commit 6d991fa. http://github.com/flexgp/efs/tree/6d991fa. Accessed 12 Oct 2015
FFX 1.3.4. http://pypi.python.org/pypi/ffx/1.3.4. Accessed 27 Aug 2015
J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)
A. Garg, A. Garg, K. Tai, A multi-gene genetic programming model for estimating stress-dependent soil water retention curves. Comput. Geosci. 18(1), 45–56 (2013). https://doi.org/10.1007/s10596-013-9381-z
GSGP-Red commit 0e5f4d5. https://github.com/laic-ufmg/GSGP-Red/tree/0e5f4d5. Accessed 6 Dec 2018
M. Hinchliffe, H. Hiden, B. McKay, M. Willis, M. Tham, G. Barton, Modelling chemical process systems using a multi-gene genetic programming algorithm, in Late Breaking Paper, GP’96. Stanford, USA (1996), pp. 56–65
J.H. Holland, Adaptation in Natural and Artificial Systems (MIT Press, Cambridge, 1992)
M. Keijzer, Scaled symbolic regression. Genet. Program. Evolvable Mach. 5(3), 259–269 (2004). https://doi.org/10.1023/B:GENP.0000030195.77571.f9
M.F. Korns, Accuracy in symbolic regression, in Genetic Programming Theory and Practice IX (Springer, New York, 2011), pp. 129–151. https://doi.org/10.1007/978-1-4614-1770-5_8
J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992)
Evolved Analytics LLC, DataModeler [Software] (2016). http://www.evolved-analytics.com/. Accessed 14 Dec 2019
S. Luke, L. Panait, Lexicographic parsimony pressure, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’02 (Morgan Kaufmann Publishers Inc., San Francisco, 2002), pp. 829–836. http://dl.acm.org/citation.cfm?id=646205.682619
J.F.B.S. Martins, L.O.V.B. Oliveira, L.F. Miranda, F. Casadei, G.L. Pappa, Solving the exponential growth of symbolic regression trees in geometric semantic genetic programming, in Proceedings of the Genetic and Evolutionary Computation Conference, GECCO ’18 (ACM, New York, 2018), pp. 1151–1158. https://doi.org/10.1145/3205455.3205593
T. McConaghy, FFX: fast, scalable, deterministic symbolic regression technology, in Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, New York, 2011), pp. 235–260. https://doi.org/10.1007/978-1-4614-1770-5_13
J. McDermott, D.R. White, S. Luke, L. Manzoni, M. Castelli, L. Vanneschi, W. Jaskowski, K. Krawiec, R. Harper, K. De Jong, U.M. O’Reilly, Genetic programming needs better benchmarks, in Proceedings of the 14th Annual Conference on Genetic and Evolutionary Computation, GECCO ’12 (ACM, New York, 2012), pp. 791–798. https://doi.org/10.1145/2330163.2330273
A. Moraglio, K. Krawiec, C. Johnson, Geometric semantic genetic programming, in Parallel Problem Solving from Nature - PPSN XII, Lecture Notes in Computer Science, vol. 7491, ed. C. Coello, V. Cutello, K. Deb, S. Forrest, G. Nicosia, M. Pavone (Springer, Berlin, 2012), pp. 21–31. https://doi.org/10.1007/978-3-642-32937-1_3
J. Ni, R.H. Drieberg, P.I. Rockett, The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2013). https://doi.org/10.1109/TEVC.2012.2195319
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
scikit-learn 0.17.1. https://pypi.python.org/pypi/scikit-learn/0.17.1. Accessed 21 Jun 2016
M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data. Science 324(5923), 81–85 (2009)
M. Schmidt, H. Lipson, Eureqa (Version 0.98 beta) [Software] (2014). www.nutonian.com
D.P. Searson, GPTIPS 2 (2015). http://sites.google.com/site/gptips4matlab. Accessed 9 Jun 2015
D.P. Searson, GPTIPS 2: An Open-Source Software Platform for Symbolic Data Mining (Springer International Publishing, Cham, 2015), pp. 551–573. https://doi.org/10.1007/978-3-319-20883-1_22
D.P. Searson, D.E. Leahy, M.J. Willis, GPTIPS: an open source genetic programming toolbox for multigene symbolic regression, in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1 (2010), pp. 77–80
G.F. Smits, M. Kotanchek, Pareto-front exploitation in symbolic regression, in Genetic Programming Theory and Practice II (Springer US, Boston, 2005), pp. 283–299. https://doi.org/10.1007/0-387-23254-0_17
A. Tsanas, A. Xifara, Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools. Energy Build. 49, 560–567 (2012). https://doi.org/10.1016/j.enbuild.2012.03.003
E. Vladislavleva, G. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009). https://doi.org/10.1109/TEVC.2008.926486
E. Vladislavleva, G. Smits, M. Kotanchek, Better Solutions Faster: Soft Evolution of Robust Regression Models In Pareto Genetic Programming (Springer US, Boston, 2008), pp. 13–32. https://doi.org/10.1007/978-0-387-76308-8_2
T. Worm, pypge. https://github.com/verdverm/pypge. Accessed 13 Dec 2019
T. Worm, K. Chiu, Prioritized grammar enumeration: symbolic regression by dynamic programming, in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, GECCO ’13 (ACM, New York, 2013), pp. 1021–1028. https://doi.org/10.1145/2463372.2463486
I.C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks. Cem. Concr. Res. 28(12), 1797–1808 (1998). https://doi.org/10.1016/S0008-8846(98)00165-3
H. Zou, T. Hastie, Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 67(2), 301–320 (2005). https://doi.org/10.1111/j.1467-9868.2005.00503.x
Acknowledgements
Jan Žegklitz was supported by the Czech Science Foundation project No. 15-22731S. Petr Pošík was supported by the Grant Agency of the Czech Technical University in Prague, Grant No. SGS14/194/OHK3/3T/13.
Cite this article
Žegklitz, J., Pošík, P. Benchmarking state-of-the-art symbolic regression algorithms. Genet Program Evolvable Mach 22, 5–33 (2021). https://doi.org/10.1007/s10710-020-09387-0