Symbolic regression - an overview

Zelinka Ivan
Tomas Bata University Zlin, Faculty of Technology, Institute of Information Technologies
Mostní 5139, 760 01 Zlín, Czech Republic, Phone: +420 577 603 102, E-Mail:

Keywords: analytic programming, genetic programming, grammatical evolution, evolutionary algorithms, symbolic regression


The term "symbolic regression" (SR) denotes a process during which measured data are fitted by a suitable mathematical formula such as $ x^2 + C$, $ \sin(x)+1/e^x$, etc. This process is quite well known among mathematicians and is used whenever data from an unknown process are obtained. For a long time SR was a domain of humans only, but for the last few decades it has also been a domain of computers. The idea of solving various problems by SR by means of evolutionary algorithms (EAs) comes from John Koza, who used a genetic algorithm (GA) in so-called genetic programming (GP) [1], [2]. Genetic programming is essentially symbolic regression carried out by evolutionary algorithms instead of by humans. Its ability to solve really hard problems has been proved many times, and GP has reached such a level of performance today that it is able to synthesise, for example, extremely sophisticated electronic circuits [3]. It is clear that the importance of symbolic regression will grow with the increasing complexity of the problems solved in science and industry.
The main attribute of symbolic regression is that it is executed by evolutionary algorithms, and the whole evolutionary process works with a so-called functional set and terminal set. The functional set contains all user-defined or used functions, while the terminal set contains all constants and variables.
This contribution discusses three different methods for symbolic regression: genetic programming, grammatical evolution (GE) and so-called analytic programming (AP), a novel tool for symbolic regression based on principles different from those of GP and GE. More attention is devoted to AP because of its novelty.

Symbolic regression and evolutionary algorithms - main ideas

Symbolic regression is in fact based on the existence of so-called evolutionary algorithms. This class of algorithms is based on the Darwinian theory of evolution, and one of its main attributes is that not just one solution is calculated, but a whole class of possible solutions at once. This class of possible and acceptable solutions is called a "population". Members of this population are called "individuals"; mathematically speaking, each individual represents a possible solution, i.e. a solution that can be realised in a real-world application. The main aim of evolutionary algorithms is to find the best solution of all during the evolutionary process. Evolutionary algorithms differ among themselves in many respects, for example in individual representation (binary, decimal) or offspring creation (standard crossover, arithmetic operations, vector operations, etc.). They also differ in the philosophical background on which they were developed, and they are usually named according to it.
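The evolutionary loop just described (a population of individuals, fitness evaluation, selection, offspring creation) can be sketched in a few lines. The following is a minimal, purely illustrative genetic algorithm on bit strings maximising the number of ones (the OneMax toy problem); the problem, parameters and operators are assumptions chosen for clarity, not taken from the paper:

```python
import random

random.seed(1)

def fitness(ind):
    # OneMax: an individual is "better" the more ones it carries
    return sum(ind)

def crossover(a, b):
    # one-point crossover: prefix of one parent, suffix of the other
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(ind, rate=0.05):
    # flip each gene with small probability
    return [1 - g if random.random() < rate else g for g in ind]

# population = a class of candidate solutions, each individual a bit vector
pop = [[random.randint(0, 1) for _ in range(20)] for _ in range(30)]
for _ in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                      # truncation selection (elitist)
    pop = parents + [mutate(crossover(random.choice(parents),
                                      random.choice(parents)))
                     for _ in range(20)]
best = max(pop, key=fitness)
print(fitness(best))
```

The same loop structure underlies all the EAs mentioned below; only the representation and the operators change.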
Symbolic regression is based on evolutionary algorithms, and its main aim is to "synthesise", in an evolutionary way, such a "program" (mathematical formula, computer program, logical expression, etc.) as will solve a user-defined problem as well as possible. While the domain of EAs is of a numerical nature (real, complex, integer, discrete), the domain of symbolic regression is of a functional nature, i.e. it consists of a function set such as {sin(), cos(), gamma(), MyFunction(), ...} and a so-called terminal set {t, x, p, ...}. From a mix of both sets the final program is then synthesised, and its structure can be quite complicated.
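As a minimal illustration of the two sets, the sketch below draws random programs from a function set (with arities) and a terminal set. The concrete sets and the growth rule are illustrative assumptions, not the paper's:

```python
import random

random.seed(7)

# Function set: name -> arity; terminal set: symbols allowed at the leaves.
FUNCTIONS = {"+": 2, "*": 2, "sin": 1, "cos": 1}
TERMINALS = ["x", "1.0"]

def random_tree(depth):
    """Grow a random expression tree; at depth 0 only terminals may appear."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    f = random.choice(list(FUNCTIONS))
    return [f] + [random_tree(depth - 1) for _ in range(FUNCTIONS[f])]

def to_infix(tree):
    # render the nested-list tree as an ordinary formula string
    if isinstance(tree, str):
        return tree
    f, *args = tree
    if f in ("+", "*"):
        return "(" + f.join(to_infix(a) for a in args) + ")"
    return f + "(" + to_infix(args[0]) + ")"

expr = random_tree(3)
print(to_infix(expr))
```

Every synthesised program, however complicated its structure, is built from exactly these two sets.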
Nowadays there are three methods allowing this: genetic programming, grammatical evolution and analytic programming.

Genetic Programming

Genetic programming [1], [2] is the oldest method of automatic program creation by means of a genetic algorithm; it was developed by the American computer scientist J. R. Koza. The method is based on the computer language Lisp, which is able to manipulate symbolic expressions. During the existence of GP, numerous problems have been solved by it, such as data fitting, synthesis of logical expressions, robot trajectory optimisation, synthesis of a program for artificial ant movement, system identification, etc. The main principle of GP is that programs are represented in chromosomes as syntactic trees. Following GA principles, the trees are cut by the crossover operation and the cut parts are exchanged between them.
In this way new individuals (offspring) are created; they in fact represent new programs, which are evaluated by a fitness function, as is common in EAs. During the creation of a new individual-program, other operators such as mutation are also applied.
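Subtree crossover on tree-shaped programs can be sketched as follows. The nested-list representation, the two parent programs and the path-based tree handling are illustrative assumptions, not Koza's original Lisp implementation:

```python
import random

random.seed(3)

# Programs as nested lists (Lisp-style s-expressions); leaves are terminals.
parent1 = ["+", ["*", "x", "x"], "1"]          # x*x + 1
parent2 = ["sin", ["+", "x", "c"]]             # sin(x + c)

def subtrees(tree, path=()):
    """Enumerate (path, subtree) pairs; a path indexes into the nested lists."""
    yield path, tree
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from subtrees(child, path + (i,))

def replace(tree, path, new):
    # rebuild the tree with the subtree at `path` swapped for `new`
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace(copy[path[0]], path[1:], new)
    return copy

def crossover(a, b):
    """Cut a random subtree from each parent and exchange the cut parts."""
    pa, sa = random.choice(list(subtrees(a)))
    pb, sb = random.choice(list(subtrees(b)))
    return replace(a, pa, sb), replace(b, pb, sa)

child1, child2 = crossover(parent1, parent2)
print(child1, child2)
```

The offspring are again syntactically valid trees, so they can be evaluated by the fitness function directly.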
Genetic programming also employs newer techniques such as automatically defined functions (ADFs), which can be further specialised as automatically defined iterations (ADIs), automatically defined loops (ADLs), automatically defined recursions (ADRs), etc. The term "automatic" means that some functions created during evolution are automatically included in the basic set of functions.

Grammatical Evolution

Grammatical evolution is the second method for symbolic regression; it basically stems from GP. According to its author, C. Ryan, GE is symbolic regression that can be done in an arbitrary programming language such as Lisp, C++, Java, XML, Perl, Fortran, etc. In contrast with GP, grammatical evolution is basically its extension: it does not use a direct symbolic representation in Lisp, but uses the so-called Backus-Naur form (BNF). Based on BNF, GE is able to perform symbolic regression in the above-mentioned computer languages.
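The BNF-based genotype-phenotype mapping of GE can be sketched as follows: integer codons select productions of a grammar via the modulo rule, rewriting the leftmost non-terminal until only terminals remain. The tiny grammar and the genotype below are illustrative assumptions only:

```python
# A tiny BNF grammar; each non-terminal maps to its list of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>":   [["+"], ["*"]],
    "<var>":  [["x"], ["1"]],
}

def map_genotype(codons, start="<expr>", max_steps=100):
    """Left-to-right derivation: each codon (mod #choices) selects a rule."""
    seq, i, steps = [start], 0, 0
    while any(s in GRAMMAR for s in seq) and steps < max_steps:
        steps += 1
        j = next(k for k, s in enumerate(seq) if s in GRAMMAR)
        rules = GRAMMAR[seq[j]]
        choice = rules[codons[i % len(codons)] % len(rules)]  # modulo rule
        seq = seq[:j] + choice + seq[j + 1:]
        i += 1
    return "".join(seq)

# Example genotype (arbitrary integers, purely illustrative):
print(map_genotype([0, 1, 3, 2, 5, 4, 7]))  # -> 1+x
```

Because the grammar, not the representation, fixes the output language, swapping the BNF is all that is needed to target a different programming language.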
During the existence of GE, comparative simulations were carried out on problems solved by Koza in GP.

Main Principles of Analytic Programming

Analytic programming (developed by I. Zelinka) was inspired by two existing methods: by Hilbert spaces and by GP. The principles and general philosophy of analytic programming (AP) stem from these two methods. From GP, AP takes the idea of the evolutionary creation of symbolic solutions, while from Hilbert spaces it adopts the idea of functional spaces and of building the resulting function by means of a searching process. In Hilbert spaces this process is usually carried out by numerical methods such as the Ritz or Galerkin methods [4]. The core of AP is based on a set of functions, operators and so-called terminals, which are usually constants or independent variables, just as in GP and GE. The main aim of AP is to synthesise a suitable program (a mathematical formula; for examples see the AP results (1)-(4)) that fits the measured data as well as possible. For this reason the idea of discrete set handling was adopted into AP.
Discrete set handling (DSH) was proposed in [5], [6]. In analytic programming, DSH is used to create integer indices, which serve in the evolutionary process as an alternate individual handled in the EA by the method of integer handling [5], [6]. An individual is created in the population automatically as an integer individual whose parameters' range cannot exceed the cardinality of the used set of functions and terminals.
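DSH can be sketched as follows: the integer individual is mapped, element by element, through indices into the combined set of functions and terminals, yielding a program in prefix form. The concrete set and individual below are assumptions for illustration:

```python
# Combined set of functions (with arities) and terminals, indexed 0..N-1.
SET = [("+", 2), ("sin", 1), ("x", 0), ("2", 0), ("*", 2)]

def synthesize(indices):
    """Consume the integer individual left to right, building a prefix expression."""
    pos = 0
    def build():
        nonlocal pos
        # each integer parameter is an index into the set (modulo its cardinality)
        name, arity = SET[indices[pos] % len(SET)]
        pos += 1
        if arity == 0:
            return name
        return name + "(" + ", ".join(build() for _ in range(arity)) + ")"
    return build()

# Integer individual as evolved by any EA (the values here are arbitrary):
print(synthesize([0, 1, 2, 3]))  # -> +(sin(x), 2)
```

Because the EA only ever manipulates integers, any evolutionary algorithm with integer handling can drive the synthesis, which is the key difference from GP and GE.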

$\displaystyle \frac{1}{\sqrt{1+\Big(\sqrt{1-\csc\big|\cot|t|\big|^{2}}\,\sin\big|\cot|t|\big|\Big)^{2t}}}$     (1)

$\displaystyle \frac{1}{\sqrt{1+\big(\sqrt{1-\frac{1}{t^2}}t\big)^{2t}}}$     (2)

$\displaystyle 1.x^2-4.44089\cdot 10^{-16}x^3-2.x^4+3.33067\cdot 10^{-16}x^5+1.x^6$     (3)

$\displaystyle \frac{1}{4.469-(1.988\cdot10^{-1})x+2.18x^t}\Big((1.882\cdot 10^{-2})(3.112-1.x)(-1.315+x)(1.363+x)(2.521+x)(3.195+x)(3.057+x+(4.843\cdot 10^{-3})x^2)(-4.685-1.634x+x^2)\Big)$     (4)

Today AP exists in three versions. All three versions need for program synthesis the same sets of functions, terminals, etc. as Koza uses in GP [1], [2]. The first, basic version, called $ AP_{basic}$, uses constants from the terminal set to synthesise programs in the same way as Koza does [1]. The second one, $ AP_{meta}$, is a modification in the sense of constant estimation. In comparison with $ AP_{basic}$, in $ AP_{meta}$ only one unspecified constant "K" is generated instead of randomly generated constants. After synthesis, the occurrences of "K" are indexed, so that $ K_1$, $ K_2$, ..., $ K_n$ appear in the formula, and all $ K_n$ are then estimated by a different (or the same) evolutionary method. Because one EA works under another EA (i.e. $ EA_{master}$ $ \blacktriangleright$ program $ \blacktriangleright$ K indexing $ \blacktriangleright$ $ EA_{slave}$ $ \blacktriangleright$ estimation of $ K_n$), this version is called AP with metaevolution, $ AP_{meta}$. The last modification, $ AP_{nf}$, is based on $ AP_{meta}$; instead of another EA, the constants $ K_n$ are estimated by a suitable non-linear fitting method.
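The master-slave structure of $ AP_{meta}$/$ AP_{nf}$ can be sketched as follows: suppose the master EA has synthesised a program containing indexed constants $ K_1$, $ K_2$, and a second-level search now estimates their values against the measured data. A crude (1+1)-style random search stands in here for the slave EA (or for the non-linear fitter of $ AP_{nf}$); the program and the data are purely illustrative assumptions:

```python
import random

random.seed(11)

# Program synthesised by the master EA, with placeholder constants K1, K2:
def program(x, K):
    return K[0] * x + K[1]

# Measured data; here generated from 2x + 3 for illustration:
data = [(x, 2.0 * x + 3.0) for x in range(-5, 6)]

def error(K):
    # sum of squared residuals between program output and data
    return sum((program(x, K) - y) ** 2 for x, y in data)

# Slave search estimating K1, K2 by accepting only improving perturbations:
K = [random.uniform(-10, 10), random.uniform(-10, 10)]
for _ in range(5000):
    cand = [k + random.gauss(0, 0.5) for k in K]
    if error(cand) < error(K):
        K = cand
print([round(k, 2) for k in K])
```

The master EA then uses the best error found by the slave as the fitness of the synthesised program.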
To verify that AP is viable, simulations were carried out; the experiments were repeated many times for each case, see [7], [8] and [9]. All simulations, especially the last, comparative one, have shown that AP is able to solve the same problems as GP or GE at the same level of quality.


The methods of symbolic regression described here are of three kinds. The first one, genetic programming, is the oldest and can be used only with genetic algorithms and the Lisp programming language. The second one, grammatical evolution, is an "unfolding" of genetic programming, so that instead of Lisp different programming languages can be used. However, grammatical evolution still uses a binary representation of individuals and crossover operations like those of genetic algorithms. The third method, called analytic programming, is independent of the programming language and can be used with any evolutionary algorithm, no matter how new offspring are calculated.
It can be stated that all three algorithms can be used for symbolic regression tasks, and their hierarchy is such that on the lowest level is genetic programming (due to its being usable only with Lisp and a genetic algorithm), on a higher level is grammatical evolution (due to the possibility of using different computer languages) and on the highest level is analytic programming (due to its ability to be used not only with different computer languages but also with any evolutionary algorithm, such as differential evolution (DE), simulated annealing (SA), ...).
For complete information about all three methods, it is recommended to see the literature at the end or to visit the homepages of the three methods.


Koza J.R. Genetic Programming, MIT Press, ISBN 0-262-11189-6, 1998

Koza J. R., Bennett F. H., Andre D., Keane M., Genetic Programming III, Morgan Kaufmann, ISBN 1-55860-543-6, 1999

Koza J. R., Keane M. A., Streeter M. J., Evolving Inventions, Scientific American, February 2003, p. 40-47, ISSN 0036-8733

Rektorys K., Variational Methods in Engineering Problems and Problems of Mathematical Physics, Czech Ed., 1999, ISBN 80-200-0714-8. An English edition is also available.

Lampinen Jouni, Zelinka Ivan. New Ideas in Optimization - Mechanical Engineering Design Optimization by Differential Evolution. Volume 1. London : McGraw-Hill, 1999. 20 p. ISBN 007-709506-5.

Zelinka, Ivan, Artificial Intelligence in The Problems of Global Optimization (Czech Ed.) BEN, 2002, 190 p. ISBN 80-7300-069-5

Zelinka I.: Analytic Programming by Means of SOMA Algorithm. In: Proc. 8th International Conference on Soft Computing Mendel'02, Brno, Czech Republic, 2002, p. 93-101, ISBN 80-214-2135-5

Zelinka I.: Analytic Programming by Means of SOMA Algorithm. In: Proc. ICICIS'02, First International Conference on Intelligent Computing and Information Systems, Cairo, Egypt, 2002, ISBN 977-237-172-3

Zelinka I., Oplatkova Z.: Analytic Programming - Comparative Study. In: Proc. CIRAS'03, The Second International Conference on Computational Intelligence, Robotics, and Autonomous Systems, Singapore, 2003