fastfloat · jaja360 · Feb 2, 2026 · Feb 1, 2026 · Feb 1, 2026 · Feb 2, 2026
diff --git a/README.md b/README.md
@@ -1,162 +1,115 @@
-# float_string_generation_benchmark
+# Integer-to-String Benchmark
 
-The goal of this project is to narrowly study the problem of generating
-*shortest* number strings (e.g., `1.2122E4`) out of a decimal representation
-(`significant * 10**power`). The implicit assumption is that you started
-out from a binary floating point numbers (`float` or `double` in C/C++/Java/C#)
-and you mapped it to a decimal form.
+A benchmark suite for integer-to-string conversion algorithms,
+including our `Champagne-Lemire` AVX-512 algorithm.
 
-The project might have applications to more conventional problems such as converting
-integer values to decimal representation.
+## Overview
 
-In principle, converting `significant * 10**power` into a decimal string is not difficult.
-You can compute `significant % 10` and get the last significant digits, and so forth.
-This is the typical right-to-left approach: you write the least significant digit,
-and the next least significant digit, and so forth.
-You need to determine which way you go to the shortest string (do you write `0.1` or `1E-1`),
-but that's not too difficult.
-
-So why is this interesting? It is interesting because we noticed that a highly
-optimized function in a [Ryu](https://github.com/ulfjack/ryu) float-to-string
-implementation was much slower then the implementation in the
-[Dragonbox](https://github.com/jk-jeon/dragonbox) float-to-string implementation.
-
-What is challenging? 
+This project benchmarks various algorithms for converting 64-bit unsigned
+integers to their decimal string representation. The focus is on high-throughput
+batch conversion scenarios.
 
 ## Requirements
 
-- AVX-512 capable machine (e.g., Zen 4 or better).
-- A recent GCC or LLVM on a Linux system or the equivalent
+- An x86 64-bit CPU
+- AVX-512 support (for the `Champagne-Lemire` algorithm)
+- A recent GCC or Clang on a Linux system
 - CMake
 
-## Knowing where to write
-
-One challenge is that you want to write the characters at the right place
-from the start. This is not trivial because you don't known initially how many digits you 
-need to write. If you consider the right-to-left approach, it requires you to start 
-writing *somewhere* implying that you know how many digits you have. Thankfully, there
-are fast algorithms to count digits:
-
-- Daniel Lemire, "Counting the digits of 64-bit integers," in Daniel Lemire's blog, January 7, 2025, https://lemire.me/blog/2025/01/07/counting-the-digits-of-64-bit-integers/.
-
-Of course, you could write to a buffer and then copy over but that's likely more expensive than
-counting the number of digits, at least in some cases.
+## Building
 
-The Dragonbox float-to-string implementation avoids this problem. The way it does it is that
-it writes from left to right. So it writes the most significant digit first !!! It relies
-on branching and assumes that the number of digits might be somewhat predictible, which could be true
-in practice (or not).
-
-## Storing the characters
-
-Even if you have the digits (e.g., the integer 1) and you know where they should be written, you
-still need to do something like:
-
-```c++
-buf[index] = '0' + value
+```bash
+cmake -B build
+cmake --build build
 ```
 
-And, once you figured out where the dot goes, you need to do
+To build with a specific compiler (e.g., Clang):
 
-```c++
-buf[index] = '.'
+```bash
+CC=clang CXX=clang++ cmake -B buildclang
+cmake --build buildclang
 ```
 
-These things add up. So one trick is to compute hundreds instead of tens. And you use a lookup table.
-So you have precompted strings from `00`, `01`, `02`, up `99`.
-For the dot, you can also avoid having separate store by precomputing the strings
-`0.`, `1.`, `2.`,... or something equivalent.
-
-Similarly, for the exponent, you could precompute strings. E.g., you could certainly precompute `e+` and `e-` and
-not have two stores.
-
-Obviously, going from tens to hundreds to tens of housands could speed things further, although we might need
-a slightly larger table (40kB?). We'd like to avoid massive tables if there are more clever approaches.
-
-There might be room for fancier strategies. See
-
--  Daniel Lemire, "Converting integers to fix-digit representations quickly," in Daniel Lemire's blog, November 18, 2021, https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/.
-
-## Computing the digit values
-
-This is where the fun mathematics comes in. I give you an integer, how do you compute
-quickly the remainder and the quotient? See the following paper.
+## Usage
 
-- Daniel Lemire, Colin Bartlett, Owen Kaser, [Integer Division by Constants: Optimal Bounds](https://arxiv.org/abs/2012.12369),  Heliyon 7 (6), 2021
-- Takahashi, D. (2023). Multiple Integer Divisions with an Invariant Dividend and Monotonically Increasing or Decreasing Divisors. In: Gervasi, O., et al. Computational Science and Its Applications – ICCSA 2023. ICCSA 2023. Lecture Notes in Computer Science, vol 13957. Springer, Cham. https://doi.org/10.1007/978-3-031-36808-0_26
+Run the benchmark:
 
-Though the math is a bit tricky, we can often brute force a check for the solution.
+```bash
+./build/benchmark
+```
 
-Currently, we can *almost* bring it down to one multiplication per digit (where a digit could be a value in [0,99] in this context).
+Quick validation mode:
 
-## Overall challenge
+```bash
+./build/benchmark -q
+```
 
-How low can you go? By a rough estimation, the Ryu string generation algorithm might use 200 instructions
-per float whereas Dragonbox can go under 100 instructions per float. That's excellent, but still 
-about 5 instructions per character produced.
+With input data files:
 
-## AVX-512 solution
+```bash
+./build/benchmark -f data/citm_catalog_integers.txt
+./build/benchmark -f data/twitterjson_integers.txt
+```
 
-We have a sketch of an AVX-512 solution. It is sketch because it might be incorrect.
+Reading hardware performance counters may require running the benchmark with elevated privileges (e.g., with `sudo`).
 
-The results *might* be good with a caveat: we can rather significantly reduce the number of instructions,
-but loading the additional constants take time. So we inline the parsing function. By doing so, 
-we hope that constants are loaded once.
+## Algorithms Benchmarked
 
-This simulate the use case where you need to write a lot of floating-point numbers at once. That's
-a realistic and useful case.
+- **Champagne-Lemire (homogeneous)**: AVX-512 variant optimized for inputs with uniform digit counts
+- **Champagne-Lemire (heterogeneous)**: AVX-512 variant for inputs with varying digit counts
+- **Champagne-Lemire (auto)**: Selects between the homogeneous and heterogeneous variants at run time
+- **std::to_chars**: Standard library implementation
+- **Abseil**: Google's `FastIntToBuffer` routine
+- **jeaiii**: James Edward Anhalt III's algorithm
+- **AppNexus**: From the AppNexus Common Framework library
+- **yy_itoa**: Yao Yuan's implementation
+- **Mathisen**: SSE4.1-based algorithm inspired by Mathisen's arithmetic approach
+- **Muła**: Wojciech Muła's SSE-based algorithm
+- **Hopman**: The `hopman_fast` algorithm (extended to 64-bit)
+- **naive_onepass**: Classic algorithm with one-pass optimization
 
-To write just one floating-point number, the AVX-512 solution might not be faster than conventional strategies.
+## Key Techniques
 
-## Usage
+### Digit Counting
 
-Currently, the benchmark is very approximative. The implementations are untested. This is at the demo stage.
-We compare `champagne_lemire` which is something like the function from Ryu, `fast+champagne_lemire` which
-is a slightly faster alternative and dragonbox (a very fast alternative). Some of the code is assuredly
-wrong or makes assumptions not satisfied by the benchmark.
+Fast algorithms to determine the number of decimal digits in a 64-bit integer:
 
-```
-cmake -B build
-cmake --build build
-./build/benchmark
-```
+- Daniel Lemire, "Counting the digits of 64-bit integers," January 2025, https://lemire.me/blog/2025/01/07/counting-the-digits-of-64-bit-integers/
 
-To get performance counters, you might need to run the benchmark program in privileged mode (sudo).
+### Lookup Tables
 
-You can also feed in data files.
-```
-./build/benchmark data/canada.txt
-./build/benchmark data/mesh.txt 
-```
+Precomputed strings for digit pairs (`00` through `99`) reduce store operations by processing two digits at a time.
 
-Consider also testing with LLVM/clang.
+### Division by Constants
 
-```
-CXX=clang++ cmake -B buildclang
-cmake --build buildclang
-./buildclang/benchmark
-```
+Efficient computation of quotient and remainder using multiplication:
 
-We definitively need more tests and better benchmarks including benchmarks on realistic data.
+- Daniel Lemire et al., "Integer Division by Constants: Optimal Bounds," Heliyon 7(6), 2021, https://arxiv.org/abs/2012.12369
 
-Further, the system archictecture is assuredly a factor.
+### Fixed-Digit Representations
 
-## Upcoming tasks
+- Daniel Lemire, "Converting integers to fix-digit representations quickly," November 2021, https://lemire.me/blog/2021/11/18/converting-integers-to-fix-digit-representations-quickly/
 
-- [ ] test, verify and correct the AVX-512 function (it is almost certainly incorrect)
-- [ ] optimize the AVX-512 function for the case where we have short strings (with branching), the `mesh` data file is a good test case
-- [ ] optionally, make sure that it builds under Visual Studio
-- [ ] [investigate whether generating the constants](http://www.0x80.pl/notesen/2023-01-19-avx512-consts.html) might be faster
-- [ ] build a fast SIMD function for the case where  n < 100000000
+## Datasets
 
-## Further thoughts
+See `data/DATASETS.md` for descriptions of the included integer datasets:
 
-We solve the string generation from a DIY structure (mantissa + exponent), it is an interesting exercise in itself, but is this applicable? Could we plug our function instead a float-to-string function and get decent results? 
+- `citm_catalog_integers.txt` - Event catalog IDs (mostly 9-digit, homogeneous)
+- `twitterjson_integers.txt` - Twitter API integers (heterogeneous distribution)
+- `cit_patents_citing_integers.txt.gz` - US patent numbers (7-digit, homogeneous)
+- `stackoverflow_unix_timestamps_integers.txt.gz` - Unix timestamps (10-digit, homogeneous)
 
-It seems that a more interesting approach would be to do bulk processing. I am given a whole lot of floating-point values (maybe from an array) and
-I need to write them out.
+## Benchmark Metrics
 
-## References
+The benchmark reports the following metrics:
 
-- Cassio Neri, Lorenz Schneider, Euclidean affine functions and their application to calendar algorithms, Software: Practice and Experience 53 (4), 2023.
-- Daniel Lemire, Owen Kaser, Nathan Kurz, Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience, 49(6), 2019.
+| Metric   | Description                      |
+| -------- | -------------------------------- |
+| `ns/n`   | Nanoseconds per number (integer) |
+| `GHz`    | CPU frequency during benchmark   |
+| `c/n`    | CPU cycles per number            |
+| `i/n`    | Instructions per number          |
+| `B/n`    | Branches per number              |
+| `BM/n`   | Branch misses per number         |
+| `i/d`    | Instructions per output digit    |
+| `i/c`    | Instructions per cycle (IPC)     |
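
Note on the "Key Techniques" section of the new README: the digit-counting and lookup-table ideas it lists can be combined roughly as in the scalar sketch below. This is only an illustration under stated assumptions, not the repository's code and not the `Champagne-Lemire` AVX-512 algorithm. The names `PairTable`, `count_digits`, `u64_to_chars` and `to_decimal` are hypothetical, and the digit counter is a plain loop rather than the faster method from the cited blog post.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical table holding the two characters of every value in [0, 99]:
// "00", "01", ..., "99", stored back to back (200 bytes).
struct PairTable {
  char data[200];
  PairTable() {
    for (int i = 0; i < 100; ++i) {
      data[2 * i] = static_cast<char>('0' + i / 10);
      data[2 * i + 1] = static_cast<char>('0' + i % 10);
    }
  }
};

// Hypothetical helper: number of decimal digits in v (1 when v == 0).
// A simple loop for clarity; faster branch-free variants exist.
inline int count_digits(std::uint64_t v) {
  int n = 1;
  while (v >= 10) {
    v /= 10;
    ++n;
  }
  return n;
}

// Hypothetical helper: writes v in decimal to buf without a terminator and
// returns the number of characters written. buf must hold at least 20 chars.
inline std::size_t u64_to_chars(std::uint64_t v, char *buf) {
  static const PairTable table;
  const int n = count_digits(v);
  char *p = buf + n;  // one past the last digit; we fill right to left
  while (v >= 100) {
    // Division by the constant 100 is typically compiled to a
    // multiplication and a shift rather than a hardware divide.
    const unsigned r = static_cast<unsigned>(v % 100);
    v /= 100;
    p -= 2;
    std::memcpy(p, table.data + 2 * r, 2);  // one store for two digits
  }
  if (v >= 10) {
    p -= 2;
    std::memcpy(p, table.data + 2 * v, 2);
  } else {
    *--p = static_cast<char>('0' + v);
  }
  return static_cast<std::size_t>(n);
}

// Convenience wrapper returning a std::string.
inline std::string to_decimal(std::uint64_t v) {
  char buf[20];  // 2^64 - 1 has 20 decimal digits
  const std::size_t len = u64_to_chars(v, buf);
  return std::string(buf, len);
}
```

Counting the digits first lets the right-to-left loop write directly into its final position, so no temporary buffer or copy is needed. The pair table means each loop iteration emits two characters with a single two-byte store.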