Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x

Authors:
Timothée Ewart (ORCID: 0000-0002-3436-1766)
Francesco Cremonesi
Felix Schürmann
Fabien Delalondre

Affiliation (all authors): Blue Brain Project, École Polytechnique Fédérale de Lausanne, Genève, Switzerland

ACM Transactions on Mathematical Software, Volume 46, Issue 3, Article No. 28, pp. 1–22. https://doi.org/10.1145/3408893

Published: 15 September 2020

Abstract

The evaluation of small-degree polynomials is critical for the computation of elementary functions. It has been extensively studied and is well documented. In this article, we evaluate existing methods for polynomial evaluation on superscalar architectures. In addition, we complement this work with a factorization method, which is surprisingly neglected in the literature. This work focuses on out-of-order Intel processors, among others, whose computational units are available. Moreover, we apply our work to the elementary function e^x, which requires, in the current implementation, the evaluation of a polynomial of degree 10 to achieve satisfactory precision and performance. Our results show that the factorization scheme is the fastest in benchmarks, and that latency and throughput are intrinsically dependent on each other on superscalar architectures.
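
As a rough illustration of why the choice of evaluation scheme matters on a superscalar core, the sketch below compares two classical schemes from the literature, Horner's rule and Estrin's scheme, on a degree-10 polynomial. It is not the article's implementation: the coefficients are plain Taylor coefficients 1/k! chosen here as an assumption (the abstract does not list the coefficients actually used), the function names are hypothetical, and the factorization scheme is not shown because the abstract does not describe it.

```python
import math

# Assumption: Taylor coefficients 1/k! for k = 0..10 stand in for the
# article's coefficients, which are not given in this text.
COEFFS = [1.0 / math.factorial(k) for k in range(11)]


def horner(coeffs, x):
    """Evaluate sum(coeffs[k] * x**k) with Horner's rule.

    Every multiply-add needs the previous result, so the dependency
    chain is strictly serial.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Evaluate the same polynomial with Estrin's scheme.

    Adjacent terms are combined as (t[2i] + t[2i+1] * power). The pairs
    within one pass do not depend on each other, which is what gives
    hardware with several arithmetic units room to overlap them.
    """
    terms = list(coeffs)
    power = x
    while len(terms) > 1:
        if len(terms) % 2:
            terms.append(0.0)  # pad so every term has a partner
        terms = [terms[i] + terms[i + 1] * power
                 for i in range(0, len(terms), 2)]
        power *= power  # x, x^2, x^4, x^8, ...
    return terms[0]


x = 0.25
print(horner(COEFFS, x), estrin(COEFFS, x), math.exp(x))
```

Both functions compute the same polynomial; they differ only in the order of operations. Horner's rule uses the fewest operations but forms one long dependency chain, whereas Estrin's scheme spends extra multiplications to shorten that chain. Pure Python will not show the timing difference; the sketch only makes the two dependency structures concrete.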

Index Terms

1. Mathematics of computing
  1. Mathematical software
    1. Mathematical software performance

Published in
ACM Transactions on Mathematical Software, Volume 46, Issue 3
September 2020
267 pages
ISSN: 0098-3500
EISSN: 1557-7295
DOI: 10.1145/3410509
Editors: Zhaojun Bai (University of California at Davis, USA), Wolfgang Bangerth (Colorado State University, USA)
Copyright © 2020 Owner/Author
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 15 September 2020
- Accepted: 1 June 2020
- Revised: 1 September 2019
- Received: 1 July 2018

Author Tags
Polynomial evaluation
compute units
elementary function
superscalar architecture
Qualifiers
- research-article
- Research
- Refereed