Abstract
The evaluation of small-degree polynomials is critical for the computation of elementary functions. It has been extensively studied and is well documented. In this article, we evaluate existing methods for polynomial evaluation on superscalar architectures. In addition, we complement this work with a factorization method, which is surprisingly neglected in the literature. This work focuses on out-of-order processors, Intel's among others, and on the computational units they provide. We also apply our work to the elementary function e^x, which in the current implementation requires evaluating a polynomial of degree 10 to reach satisfactory precision and performance. Our results show that the factorization scheme is the fastest in benchmarks, and that latency and throughput are intrinsically dependent on each other on superscalar architectures.
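To make the latency/throughput trade-off concrete, here is a minimal sketch of two classic evaluation schemes applied to a degree-10 polynomial. The abstract does not name the methods it compares, so the choice of Horner's rule and Estrin's scheme is an assumption, and the Taylor coefficients 1/k! are only a stand-in for whatever coefficients the actual implementation uses. The factorization scheme is not sketched because the abstract does not describe it.

```python
import math

# Assumption: Taylor coefficients 1/k! of e^x, used purely for illustration.
COEFFS = [1.0 / math.factorial(k) for k in range(11)]  # degree 10


def horner(coeffs, x):
    """Horner's rule: a single serial chain of multiply-adds.

    Every step depends on the previous one, so the dependency chain
    is as long as the polynomial degree.
    """
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc


def estrin(coeffs, x):
    """Estrin's scheme: combine adjacent coefficients pairwise.

    The terms within one level are independent of each other, which is
    what lets a superscalar core issue them in parallel.
    """
    level = list(coeffs)
    power = x
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)  # pad to an even number of terms
        level = [level[i] + level[i + 1] * power
                 for i in range(0, len(level), 2)]
        power *= power  # x, x^2, x^4, x^8, ...
    return level[0]


if __name__ == "__main__":
    x = 0.5
    print(horner(COEFFS, x), estrin(COEFFS, x), math.exp(x))
```

Both functions compute the same polynomial; they differ only in the order of operations. Horner's rule uses the fewest operations but is fully serial, whereas Estrin's scheme spends a few extra multiplications to shorten the dependency chain.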
Index Terms
- Polynomial Evaluation on Superscalar Architecture, Applied to the Elementary Function e^x