
Floating-point arithmetic

Published online by Cambridge University Press:  11 May 2023

Sylvie Boldo
Affiliation: Université Paris Saclay, CNRS, ENS Paris Saclay, Inria, LMF, 91190 Gif-sur-Yvette, France. E-mail: sylvie.boldo@inria.fr

Claude-Pierre Jeannerod
Affiliation: Inria, ENS de Lyon, LIP, 69364 Lyon, France. E-mail: claude-pierre.jeannerod@inria.fr

Guillaume Melquiond
Affiliation: Université Paris Saclay, CNRS, ENS Paris Saclay, Inria, LMF, 91190 Gif-sur-Yvette, France. E-mail: guillaume.melquiond@inria.fr

Jean-Michel Muller
Affiliation: CNRS, ENS de Lyon, LIP, 69364 Lyon, France. E-mail: jean-michel.muller@cnrs.fr

Abstract


Floating-point numbers have an intuitive meaning when it comes to physics-based numerical computations, and they have thus become the most common way of approximating real numbers in computers. The IEEE-754 Standard has played a large part in making floating-point arithmetic ubiquitous today, by specifying its semantics in a strict yet useful way as early as 1985. In particular, floating-point operations should be performed as if their results were first computed with an infinite precision and then rounded to the target format. A consequence is that floating-point arithmetic satisfies the ‘standard model’ that is often used for analysing the accuracy of floating-point algorithms. But that is only scratching the surface, and floating-point arithmetic offers much more.
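To make the ‘standard model’ mentioned above concrete (the formula below is the usual textbook statement rather than a quotation from the article; it assumes rounding to nearest with unit roundoff u and no underflow or overflow), a correctly rounded operation op ∈ {+, −, ×, /} on floating-point numbers x and y satisfies

\[
\mathrm{fl}(x \,\mathrm{op}\, y) \;=\; (x \,\mathrm{op}\, y)\,(1+\delta), \qquad |\delta| \le u .
\]

Rounding-error analyses of floating-point algorithms typically proceed by composing such relations, one per operation.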

In this survey we recall the history of floating-point arithmetic as well as its specification mandated by the IEEE-754 Standard. We also recall what properties it entails and what every programmer should know when designing a floating-point algorithm. We present various basic building blocks that can be implemented with floating-point arithmetic. In particular, one can actually compute the rounding error caused by some floating-point operations, which paves the way to designing more accurate algorithms. More generally, properties of floating-point arithmetic make it possible to extend the accuracy of computations beyond the working precision.
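To illustrate how a rounding error can itself be computed (a minimal sketch in C, not code from the article; it assumes IEEE-754 binary64 arithmetic with rounding to nearest and no overflow, and the helper name two_sum is ours), the classical TwoSum algorithm, usually attributed to Møller and Knuth, returns both the rounded sum s = fl(a + b) and the error term e such that a + b = s + e exactly:

#include <stdio.h>

/* TwoSum: compute s = fl(a + b) together with the exact rounding error e,
 * so that a + b == s + e holds exactly (rounding to nearest, no overflow). */
static void two_sum(double a, double b, double *s, double *e)
{
    *s = a + b;
    double b_virtual = *s - a;          /* the part of b that made it into s */
    double a_virtual = *s - b_virtual;  /* the part of a that made it into s */
    *e = (a - a_virtual) + (b - b_virtual);
}

int main(void)
{
    double s, e;
    two_sum(1.0, 1e-16, &s, &e);
    /* Here s is 1.0 (the tiny addend is rounded away) and e recovers it exactly. */
    printf("s = %.17g  e = %.17g\n", s, e);
    return 0;
}

Chaining such error terms is the principle behind compensated summation and double-word (‘double-double’) arithmetic, which extend the accuracy of computations beyond the working precision.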

Type
Research Article
Creative Commons
This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution, and reproduction in any medium, provided the original work is properly cited.
Copyright
© The Author(s), 2023. Published by Cambridge University Press
