197 changes: 75 additions & 122 deletions README.md
@@ -1,162 +1,115 @@
# float_string_generation_benchmark
# Integer-to-String Benchmark

The goal of this project is to narrowly study the problem of generating
*shortest* number strings (e.g., `1.2122E4`) out of a decimal representation
(`significant * 10**power`). The implicit assumption is that you started
out from a binary floating-point number (`float` or `double` in C/C++/Java/C#)
and you mapped it to a decimal form.
A benchmark suite for integer-to-string conversion algorithms,
containing our `Champagne-Lemire` AVX-512 algorithm.

The project might have applications to more conventional problems such as converting
integer values to decimal representation.
## Overview

In principle, converting `significant * 10**power` into a decimal string is not difficult.
You can compute `significant % 10` and get the least significant digit, and so forth.
This is the typical right-to-left approach: you write the least significant digit,
and the next least significant digit, and so forth.
You need to determine which way you go to the shortest string (do you write `0.1` or `1E-1`),
but that's not too difficult.
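
The right-to-left approach described above can be sketched as follows. This is only an illustration for the integer part (the `significant`); the function name `to_decimal_right_to_left` is hypothetical and does not come from the repository, and the sketch writes into a scratch buffer rather than directly at the final position.

```c++
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not from the repository): produces the decimal digits
// of `value` by writing them right to left into a scratch buffer.
std::string to_decimal_right_to_left(uint64_t value) {
    char scratch[20];  // a 64-bit unsigned integer has at most 20 decimal digits
    std::size_t pos = sizeof(scratch);
    do {
        // The remainder modulo 10 is the current least significant digit.
        scratch[--pos] = static_cast<char>('0' + value % 10);
        value /= 10;
    } while (value != 0);
    // The digits occupy scratch[pos .. 19]; copy them out in reading order.
    return std::string(scratch + pos, sizeof(scratch) - pos);
}
```

Each digit costs one division and one store here, which is the baseline that the other techniques try to improve on.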

So why is this interesting? It is interesting because we noticed that a highly
optimized function in a [Ryu](https://github.com/ulfjack/ryu) float-to-string
implementation was much slower than the implementation in the
[Dragonbox](https://github.com/jk-jeon/dragonbox) float-to-string implementation.

What is challenging?
This project benchmarks various algorithms for converting 64-bit unsigned
integers to their decimal string representation. The focus is on high-throughput
batch conversion scenarios.

## Requirements

- AVX-512 capable machine (e.g., Zen 4 or better).
- A recent GCC or LLVM on a Linux system or the equivalent
- An x86 64-bit CPU
- AVX-512 support (for the `Champagne-Lemire` algorithm)
- A recent GCC or Clang on a Linux system
- CMake

## Knowing where to write

One challenge is that you want to write the characters at the right place
from the start. This is not trivial because you don't know initially how many digits you
need to write. If you consider the right-to-left approach, it requires you to start
writing *somewhere* implying that you know how many digits you have. Thankfully, there
are fast algorithms to count digits:

- Daniel Lemire, "Counting the digits of 64-bit integers," in Daniel Lemire's blog, January 7, 2025, https://lemire.me/blog/2025/01/07/counting-the-digits-of-64-bit-integers/.

Of course, you could write to a buffer and then copy over but that's likely more expensive than
counting the number of digits, at least in some cases.
## Building

The Dragonbox float-to-string implementation avoids this problem. The way it does it is that
it writes from left to right. So it writes the most significant digit first! It relies
on branching and assumes that the number of digits might be somewhat predictable, which could be true
in practice (or not).

## Storing the characters

Even if you have the digits (e.g., the integer 1) and you know where they should be written, you
still need to do something like:

```c++
buf[index] = '0' + value;
```

```bash
cmake -B build
cmake --build build
```

And, once you figured out where the dot goes, you need to do
With a specific compiler, e.g., clang:

```c++
buf[index] = '.';
```

```bash
CC=clang CXX=clang++ cmake -B buildclang
cmake --build buildclang
```

These things add up. So one trick is to compute hundreds instead of tens, using a lookup table
of precomputed strings from `00`, `01`, `02`, up to `99`.
For the dot, you can also avoid having a separate store by precomputing the strings
`0.`, `1.`, `2.`,... or something equivalent.

Similarly, for the exponent, you could precompute strings. E.g., you could certainly precompute `e+` and `e-` and
not have two stores.

Obviously, going from tens to hundreds to tens of thousands could speed things up further, although we might need
a slightly larger table (40kB?). We'd like to avoid massive tables if there are more clever approaches.

There might be room for fancier strategies. See

- Daniel Lemire, "Converting integers to fix-digit representations quickly," in Daniel Lemire's blog, November 18, 2021, https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/.

## Computing the digit values

This is where the fun mathematics comes in. I give you an integer, how do you compute
quickly the remainder and the quotient? See the following paper.
## Usage

- Daniel Lemire, Colin Bartlett, Owen Kaser, [Integer Division by Constants: Optimal Bounds](https://arxiv.org/abs/2012.12369), Heliyon 7 (6), 2021
- Takahashi, D. (2023). Multiple Integer Divisions with an Invariant Dividend and Monotonically Increasing or Decreasing Divisors. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023. ICCSA 2023. Lecture Notes in Computer Science, vol 13957. Springer, Cham. https://doi.org/10.1007/978-3-031-36808-0_26
Run the benchmark:

Though the math is a bit tricky, we can often brute force a check for the solution.
```bash
./build/benchmark
```

Currently, we can *almost* bring it down to one multiplication per digit (where a digit could be a value in [0,99] in this context).
Quick validation mode:

## Overall challenge
```bash
./build/benchmark -q
```

How low can you go? By a rough estimation, the Ryu string generation algorithm might use 200 instructions
per float whereas Dragonbox can go under 100 instructions per float. That's excellent, but still
about 5 instructions per character produced.
With input data files:

## AVX-512 solution
```bash
./build/benchmark -f data/citm_catalog_integers.txt
./build/benchmark -f data/twitterjson_integers.txt
```

We have a sketch of an AVX-512 solution. It is a sketch because it might be incorrect.
Performance counters may require privileged execution (`sudo`).

The results *might* be good with a caveat: we can rather significantly reduce the number of instructions,
but loading the additional constants takes time. So we inline the parsing function. By doing so,
we hope that constants are loaded once.
## Algorithms Benchmarked

This simulates the use case where you need to write a lot of floating-point numbers at once. That's
a realistic and useful case.
- **Champagne-Lemire (homogeneous)**: AVX-512 algorithm; variant optimized for uniform digit counts
- **Champagne-Lemire (heterogeneous)**: AVX-512 algorithm; variant for varying digit counts
- **Champagne-Lemire (auto)**: Dynamically selects between homogeneous/heterogeneous variants
- **std::to_chars**: Standard library implementation
- **Abseil**: Google's `FastIntToBuffer` routine
- **jeaiii**: James Edward Anhalt III's algorithm
- **AppNexus**: From the AppNexus Common Framework library
- **yy_itoa**: Yao Yuan's implementation
- **Mathisen**: SSE4.1-based algorithm inspired by Mathisen's arithmetic approach
- **Muła**: Wojciech Muła's SSE-based algorithm
- **Hopman**: The `hopman_fast` algorithm (extended to 64-bit)
- **naive_onepass**: Classic algorithm with one-pass optimization

To write just one floating-point number, the AVX-512 solution might not be faster than conventional strategies.
## Key Techniques

## Usage
### Digit Counting

Currently, the benchmark is very approximate. The implementations are untested. This is at the demo stage.
We compare `champagne_lemire` which is something like the function from Ryu, `fast+champagne_lemire` which
is a slightly faster alternative and dragonbox (a very fast alternative). Some of the code is assuredly
wrong or makes assumptions not satisfied by the benchmark.
Fast algorithms to determine the number of decimal digits in a 64-bit integer:

```
cmake -B build
cmake --build build
./build/benchmark
```
- Daniel Lemire, "Counting the digits of 64-bit integers," January 2025, https://lemire.me/blog/2025/01/07/counting-the-digits-of-64-bit-integers/
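
As a rough illustration of digit counting, the sketch below estimates the digit count from the bit length and corrects it with one table comparison. It is one common approach and not necessarily the method from the blog post or the one used in this repository; the name `count_digits` is hypothetical, and `__builtin_clzll` assumes GCC or Clang.

```c++
#include <cstdint>

// Hypothetical helper (not from the repository): number of decimal digits of
// a 64-bit unsigned integer, with zero counted as one digit.
int count_digits(uint64_t x) {
    static const uint64_t powers_of_ten[20] = {
        1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL, 1000000ULL,
        10000000ULL, 100000000ULL, 1000000000ULL, 10000000000ULL,
        100000000000ULL, 1000000000000ULL, 10000000000000ULL,
        100000000000000ULL, 1000000000000000ULL, 10000000000000000ULL,
        100000000000000000ULL, 1000000000000000000ULL,
        10000000000000000000ULL};
    // Setting the lowest bit avoids clz(0), which is undefined, and does not
    // change the digit count of any nonzero value.
    const uint64_t y = x | 1;
    const int bits = 64 - __builtin_clzll(y);  // GCC/Clang builtin (assumption)
    // 1233 / 4096 approximates log10(2); for bits <= 64 this equals
    // floor(bits * log10(2)).
    const int t = (bits * 1233) >> 12;
    // The answer is t or t + 1; a single comparison decides which.
    return t + (y >= powers_of_ten[t] ? 1 : 0);
}
```

Knowing the digit count up front lets a right-to-left writer start at the correct offset in the output buffer.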

To get performance counters, you might need to run the benchmark program in privileged mode (sudo).
### Lookup Tables

You can also feed in data files.
```
./build/benchmark data/canada.txt
./build/benchmark data/mesh.txt
```
Precomputed strings for digit pairs (`00` through `99`) reduce store operations by processing two digits at a time.
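
A minimal sketch of the digit-pair table idea follows. The names `digit_pairs` and `to_decimal_pairs` are hypothetical and the code is not taken from the benchmarked implementations; it only shows how a 200-byte table lets each step emit two characters with one small copy.

```c++
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// The strings "00", "01", ..., "99" stored back to back (200 bytes plus NUL).
static const char digit_pairs[201] =
    "00010203040506070809" "10111213141516171819"
    "20212223242526272829" "30313233343536373839"
    "40414243444546474849" "50515253545556575859"
    "60616263646566676869" "70717273747576777879"
    "80818283848586878889" "90919293949596979899";

// Hypothetical helper (not from the repository): right-to-left conversion
// that consumes two decimal digits per iteration.
std::string to_decimal_pairs(uint64_t value) {
    char scratch[20];  // at most 20 digits for a 64-bit unsigned integer
    std::size_t pos = sizeof(scratch);
    while (value >= 100) {
        const unsigned pair = static_cast<unsigned>(value % 100);
        value /= 100;
        pos -= 2;
        std::memcpy(scratch + pos, digit_pairs + 2 * pair, 2);
    }
    if (value >= 10) {
        // Two leading digits remain.
        pos -= 2;
        std::memcpy(scratch + pos, digit_pairs + 2 * value, 2);
    } else {
        // A single leading digit remains.
        scratch[--pos] = static_cast<char>('0' + value);
    }
    return std::string(scratch + pos, sizeof(scratch) - pos);
}
```

Compared with a digit-at-a-time loop, this halves the number of divisions and stores for long inputs.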

Consider also testing with LLVM/clang.
### Division by Constants

```
CXX=clang++ cmake -B buildclang
cmake --build buildclang
./buildclang/benchmark
```
Efficient computation of quotient and remainder using multiplication:

We definitely need more tests and better benchmarks, including benchmarks on realistic data.
- Daniel Lemire et al., "Integer Division by Constants: Optimal Bounds," Heliyon 7(6), 2021, https://arxiv.org/abs/2012.12369
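
To make the multiplication trick concrete, here is a small sketch for dividing a 32-bit unsigned integer by 100 with one multiplication and one shift. The constant 1374389535 is ceil(2^37 / 100); the names `DivMod100`, `divmod100`, and `agrees_with_division` are hypothetical, and the benchmarked implementations may use different constants and widths.

```c++
#include <cstdint>

struct DivMod100 {
    uint32_t quotient;
    uint32_t remainder;
};

// Hypothetical helper (not from the repository): quotient and remainder of a
// 32-bit unsigned integer by 100 without a division instruction.
DivMod100 divmod100(uint32_t x) {
    // floor(x * ceil(2^37 / 100) / 2^37) equals x / 100 for every 32-bit x.
    const uint32_t q = static_cast<uint32_t>(
        (static_cast<uint64_t>(x) * 1374389535ULL) >> 37);
    // The remainder follows from the quotient with one multiply and subtract.
    return {q, x - 100 * q};
}

// Brute-force check of the trick over the inclusive range [first, last].
bool agrees_with_division(uint32_t first, uint32_t last) {
    for (uint64_t x = first; x <= last; ++x) {
        const DivMod100 r = divmod100(static_cast<uint32_t>(x));
        if (r.quotient != x / 100 || r.remainder != x % 100) {
            return false;
        }
    }
    return true;
}
```

Calling `agrees_with_division(0, UINT32_MAX)` checks every 32-bit input exhaustively, which is feasible at this width.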

Further, the system architecture is assuredly a factor.
### Fixed-Digit Representations

## Upcoming tasks
- Daniel Lemire, "Converting integers to fix-digit representations quickly," November 2021, https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/

- [ ] test, verify and correct the AVX-512 function (it is almost certainly incorrect)
- [ ] optimize the AVX-512 function for the case where we have short strings (with branching), the `mesh` data file is a good test case
- [ ] optionally, make sure that it builds under Visual Studio
- [ ] [investigate whether generating the constants](http://www.0x80.pl/notesen/2023-01-19-avx512-consts.html) might be faster
- [ ] build a fast SIMD function for the case where n < 100000000
## Datasets

## Further thoughts
See `data/DATASETS.md` for descriptions of the included integer datasets:

We solve the string generation from a DIY structure (mantissa + exponent); it is an interesting exercise in itself, but is this applicable? Could we plug our function into a float-to-string function and get decent results?
- `citm_catalog_integers.txt` - Event catalog IDs (mostly 9-digit, homogeneous)
- `twitterjson_integers.txt` - Twitter API integers (heterogeneous distribution)
- `cit_patents_citing_integers.txt.gz` - US patent numbers (7-digit, homogeneous)
- `stackoverflow_unix_timestamps_integers.txt.gz` - Unix timestamps (10-digit, homogeneous)

It seems that a more interesting approach would be to do bulk processing. I am given a whole lot of floating-point values (maybe from an array) and
I need to write them out.
## Benchmark Metrics

## References
The benchmark reports the following metrics:

- Cassio Neri, Lorenz Schneider, Euclidean affine functions and their application to calendar algorithms, Software: Practice and Experience 53 (4), 2023.
- Daniel Lemire, Owen Kaser, Nathan Kurz, Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience, 49(6), 2019.
| Metric | Description |
| -------- | ------------- |
| `ns/n` | Nanoseconds per number (integer) |
| `GHz` | CPU frequency during benchmark |
| `c/n` | CPU cycles per number |
| `i/n` | Instructions per number |
| `B/n` | Branches per number |
| `BM/n` | Branch misses per number |
| `i/d` | Instructions per output digit |
| `i/c` | Instructions per cycle (IPC) |