apache · phongn · Jun 3, 2026
diff --git a/doc/developer-guide/cache-architecture/ram-cache.en.rst b/doc/developer-guide/cache-architecture/ram-cache.en.rst
@@ -107,34 +107,99 @@ with the CLOCK rate of the Cached and History Lists.
 Cached List
 ===========
 
-The *Cached List* contains objects actually in memory. The basic operation is
-LRU with new entries inserted into a FIFO queue and hits causing objects to be
-reinserted. The interesting bit comes when an object is being considered for
-insertion. A check is first made against the Object Hash to see if the object
-is in the Cached List or History. Hits result in updating the ``hit`` field and
-reinsertion of the object. History hits result in the ``hit`` field being
-updated and a comparison to see if this object should be kept in memory. The
-comparison is against the least recently used members of the Cache List, and
-is based on a weighted frequency::
-
-   CACHE_VALUE = hits / (size + overhead)
-
-A new object must be enough bytes worth of currently cached objects to cover
-itself. Each time an object is considered for replacement the CLOCK moves
-forward. If the History object has a greater value then it is inserted into the
-Cached List and the replaced objects are removed from memory and their list
-entries are inserted into the History List. If the History object has a lesser
-value it is reinserted into the History List. Objects considered for replacement
-(at least one) but not replaced have their ``hits`` field set to ``0`` and are
-reinserted into the Cached List. This is the CLOCK operation on the Cached List.
+The *Cached List* (``_lru[0]`` in ``RamCacheCLFUS.cc``) holds the objects
+actually resident in memory. New entries are inserted into a FIFO queue and a
+hit reinserts the object at the tail. The interesting work happens when an
+object is considered for insertion (a *Put*, after a read from secondary
+storage). A check is first made against the object hash to see if the object is
+already in the Cached List or the History List.
+
+Each object is ranked by a weighted frequency, its *value*::
+
+   CACHE_VALUE = (hits + 1) / (size + ENTRY_OVERHEAD)
+
+Smaller and more frequently used objects rank higher, which is what is meant by
+least frequently used *by size*. The value of a candidate is compared against
+``_average_value``, an exponential moving average of the value of the objects
+passed over for replacement -- in effect a floating admission bar.
+
+.. note::
+
+   ``CACHE_VALUE`` must be evaluated in floating point. In integer arithmetic,
+   ``(hits + 1) / (size + ENTRY_OVERHEAD)`` truncates to ``0`` whenever
+   ``hits + 1`` is smaller than the denominator, which holds for virtually every
+   object and silently collapses CLFUS to FIFO. This was the regression introduced
+   in GitHub PR #11733; the division is now forced to floating point.
+
+When a *Put* finds the incoming object in the History List, its value is
+compared against the least recently used members of the Cached List. To be
+admitted, the candidate must outrank enough of them to free the bytes it needs.
+Each time an object is considered for replacement the CLOCK advances.
+If the candidate wins it is moved into the Cached List and the objects it
+displaces are removed from memory, their (data-less) list entries moving to the
+History List; if it loses it is returned to the History List. Objects passed
+over for replacement (at least one) have their ``hits`` reduced and are
+reinserted -- this is the CLOCK (second chance) on the Cached List.
+
+Aging the Cached List
+---------------------
+
+Frequency counts on resident objects would otherwise only ever grow, so an
+object that was hot days ago keeps winning replacement long after it has gone
+cold, and the cache cannot follow a changing working set. To prevent this, once
+per *turnover* (one *Put* for every resident object) ``_age_resident()`` halves
+every resident ``hits`` count *and* halves ``_average_value`` in the same pass.
+Halving the bar matters: ``_average_value`` is otherwise updated only inside the
+replacement loop, which a low-value candidate never reaches, so without it the
+decayed counts would be invisible to the admission decision and a warming
+working set could never break in.
 
 History List
 ============
 
-Each CLOCK, the least recently used entry in the History List is dequeued and
-if the ``hits`` field is not greater than ``1`` (it was hit at least once in
-the History or Cached List) it is deleted. Otherwise, the ``hits`` is set to
-``0`` and it is requeued on the History List.
+The *History List* (``_lru[1]``) is a bounded record of keys recently evicted
+from, or considered for, the Cached List. Its entries carry no data (the
+``IOBufferData`` pointer is null); they exist so that an object requested again
+soon after eviction can be cheaply re-admitted, and so a newly seen object can
+accumulate enough value to earn admission before it is forgotten.
+
+Each CLOCK tick (``_tick()``, run once per eviction) ages the oldest History
+entry -- halving its ``hits`` -- and *keeps* it, freeing entries only to hold the
+list at its target size, ``_objects / HISTORY_DIVISOR + HISTORY_HYSTERIA``. An
+earlier version freed an entry the moment its aged ``hits`` reached ``0``, which
+held the list nearly empty and denied re-requested objects their second chance.
+
+The list is deliberately capped well below a full cache-worth
+(``HISTORY_DIVISOR`` defaults to 4): it only needs to remember recent candidates
+long enough to be requested again, and a full cache-worth of history entries is a
+large memory cost (see `Memory overhead`_) for caches holding many small objects.
+
+Following a shifting working set
+================================
+
+The combination of the bounded, persistent History List and resident aging is
+what lets CLFUS track a working set that changes over time. New objects survive
+in history long enough to prove themselves and be admitted, while the frequency
+advantage of the previous working set decays until its members fall below the
+admission bar and are evicted. The ``ram_cache_adaptivity`` (an abrupt change of
+the entire hot set) and ``ram_cache_drift`` (a gradually rolling working set)
+regression tests in ``CacheTest.cc`` exercise this and compare CLFUS against the
+simpler LRU RAM cache.
+
+Memory overhead
+===============
+
+Every object in the Cached List has a resident list entry; this per-object
+overhead (roughly ``ENTRY_OVERHEAD``) is counted against
+``proxy.config.cache.ram_cache.size``. Every object in the History List has a
+list entry too (about 88 bytes), but these are **not** counted against the
+configured size -- they are memory the process uses in addition to it.
+
+Because the overhead is per object, it is largest for a big cache holding many
+small objects. ``HISTORY_DIVISOR`` bounds the History List to roughly
+``_objects / 4`` entries to keep this cost modest. For example, a 32 GB cache of
+1 KB objects holds about 32 million resident objects and therefore about 8
+million history entries (~700 MB), rather than ~2.8 GB at a full cache-worth.
 
 Compression and Decompression
 =============================
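Note on the "Memory overhead" example in the documentation hunk above: the arithmetic can be checked with a short sketch. The 88-byte entry size and the divisor of 4 are taken from the documentation text; the helper names are hypothetical, and the ``HISTORY_HYSTERIA`` term and the resident per-entry overhead are deliberately ignored, so the figures are approximations.

```cpp
#include <cassert>
#include <cstdint>

// Approximate size of one History List entry, as quoted in the documentation
// text above (an assumption here, not read from the source code).
constexpr int64_t kHistoryEntryBytes = 88;

// Number of resident objects if every object has the same size. Ignores the
// per-object resident overhead, so it slightly overestimates the count.
int64_t
resident_objects(int64_t cache_bytes, int64_t object_bytes)
{
  assert(object_bytes > 0);
  return cache_bytes / object_bytes;
}

// Bytes used by History List entries when the list is capped at
// objects / divisor entries. Ignores the HISTORY_HYSTERIA slack.
int64_t
history_bytes(int64_t objects, int64_t divisor)
{
  assert(divisor > 0);
  return (objects / divisor) * kHistoryEntryBytes;
}
```

With a 32 GiB cache of 1 KiB objects, `resident_objects(32LL << 30, 1024)` is 33,554,432; `history_bytes(33554432, 4)` is 738,197,504 bytes (704 MiB) and `history_bytes(33554432, 1)` is 2,952,790,016 bytes (2,816 MiB), which agrees with the rounded ~700 MB and ~2.8 GB quoted in the text.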

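Note on the ``CACHE_VALUE`` warning in the documentation hunk above: a minimal sketch of why the division must be done in floating point. The function names and the overhead constant of 256 are assumptions made for illustration; the real macro and the value of ``ENTRY_OVERHEAD`` live in ``RamCacheCLFUS.cc`` and are not shown in this diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for ENTRY_OVERHEAD; the real value is not shown here.
constexpr int64_t kEntryOverhead = 256;

// Integer form: both operands are integers, so the quotient truncates
// toward zero and is 0 whenever hits + 1 < size + kEntryOverhead.
int64_t
cache_value_int(uint64_t hits, int64_t size)
{
  assert(size >= 0);
  return static_cast<int64_t>(hits + 1) / (size + kEntryOverhead);
}

// Floating-point form: converting the operands first keeps the fraction,
// so objects can actually be ranked against each other.
double
cache_value_fp(uint64_t hits, int64_t size)
{
  assert(size >= 0);
  return static_cast<double>(hits + 1) / static_cast<double>(size + kEntryOverhead);
}
```

For a 16 KiB object, `cache_value_int` returns 0 both for a never-hit object and for one with 1000 hits, so every comparison ties. `cache_value_fp` ranks the hot object above the cold one, and ranks a 1 KiB object above a 16 KiB object with the same hit count, which is the "by size" part of the policy.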
diff --git a/src/iocore/cache/CacheTest.cc b/src/iocore/cache/CacheTest.cc
@@ -684,3 +684,184 @@ REGRESSION_TEST(ram_cache)(RegressionTest *t, int level, int *pstatus)
     }
   }
 }
+
+// Measures how well a RAM cache adapts when the hot working set shifts. Phase 1 warms set A
+// into (most of) the cache; phase 2 shifts every reference to a disjoint set B of the same size.
+// A frequency policy that never ages resident hit counts keeps the now-cold A pinned and starves
+// B (low B hit rate, high A retention); a recency-based or properly aged policy releases A.
+struct RamCacheAdaptResult {
+  double b_hit_rate = 0.0;
+  int    a_retained = 0;
+  int    a_total    = 0;
+};
+
+static RamCacheAdaptResult
+test_RamCache_adaptivity(RamCache *cache, int64_t cache_size, StripeSM *stripe)
+{
+  cache->init(cache_size, stripe);
+
+  int const      obj    = BUFFER_SIZE_FOR_INDEX(BUFFER_SIZE_INDEX_16K);
+  int const      nhot   = static_cast<int>((cache_size / obj) * 7 / 8); // working set ~7/8 of capacity
+  int const      p1     = 20;                                           // rounds warming A
+  int const      p2     = 26;                                           // rounds referencing B
+  uint64_t const a_base = 1;
+  uint64_t const b_base = 1000000;
+
+  std::vector<Ptr<IOBufferData>> keep;
+  auto                           access = [&](uint64_t k) -> bool {
+    CryptoHash hash;
+    hash.u64[0] = (k << 32) + k;
+    hash.u64[1] = (k << 32) + k;
+    Ptr<IOBufferData> got;
+    if (cache->get(&hash, &got)) {
+      return true;
+    }
+    IOBufferData *d = THREAD_ALLOC(ioDataAllocator, this_thread());
+    d->alloc(BUFFER_SIZE_INDEX_16K);
+    memset(d->data(), 0, d->block_size());
+    keep.push_back(make_ptr(d));
+    cache->put(&hash, d, d->block_size());
+    return false;
+  };
+
+  for (int r = 0; r < p1; r++) { // warm A into the cache
+    for (int i = 0; i < nhot; i++) {
+      access(a_base + i);
+    }
+    keep.clear();
+  }
+
+  int b_hits = 0, b_total = 0;
+  for (int r = 0; r < p2; r++) { // shift all references to B
+    for (int i = 0; i < nhot; i++) {
+      bool hit = access(b_base + i);
+      if (r >= p2 / 2) { // measure once B has had a chance to establish
+        b_total++;
+        b_hits += hit ? 1 : 0;
+      }
+    }
+    keep.clear();
+  }
+
+  int a_ret = 0;
+  for (int i = 0; i < nhot; i++) {
+    CryptoHash hash;
+    hash.u64[0] = ((a_base + i) << 32) + (a_base + i);
+    hash.u64[1] = ((a_base + i) << 32) + (a_base + i);
+    Ptr<IOBufferData> got;
+    if (cache->get(&hash, &got)) {
+      a_ret++;
+    }
+  }
+  keep.clear();
+
+  RamCacheAdaptResult res;
+  res.b_hit_rate = b_total ? static_cast<double>(b_hits) / b_total : 0.0;
+  res.a_retained = a_ret;
+  res.a_total    = nhot;
+  return res;
+}
+
+REGRESSION_TEST(ram_cache_adaptivity)(RegressionTest *t, int level, int *pstatus)
+{
+  if (REGRESSION_TEST_NIGHTLY > level) {
+    *pstatus = REGRESSION_TEST_PASSED;
+    return;
+  }
+  if (cacheProcessor.IsCacheEnabled() != CacheInitState::INITIALIZED) {
+    rprintf(t, "cache not initialized\n");
+    *pstatus = REGRESSION_TEST_FAILED;
+    return;
+  }
+
+  CacheKey  key;
+  StripeSM *stripe     = theCache->key_to_stripe(&key, "example.com"sv);
+  int64_t   cache_size = 1LL << 21; // 2 MB
+
+  RamCacheAdaptResult lru   = test_RamCache_adaptivity(new_RamCacheLRU(), cache_size, stripe);
+  RamCacheAdaptResult clfus = test_RamCache_adaptivity(new_RamCacheCLFUS(), cache_size, stripe);
+
+  rprintf(t, "RamCache adaptivity after working-set shift (higher B-hit-rate / lower A-retained is better)\n");
+  rprintf(t, "RamCache LRU   B-hit-rate %.3f  A-retained %d/%d\n", lru.b_hit_rate, lru.a_retained, lru.a_total);
+  rprintf(t, "RamCache CLFUS B-hit-rate %.3f  A-retained %d/%d\n", clfus.b_hit_rate, clfus.a_retained, clfus.a_total);
+
+  // With resident aging and History retention, CLFUS must follow the shift: serve the new set and release the stale one.
+  *pstatus = (clfus.b_hit_rate >= 0.90 && clfus.a_retained <= clfus.a_total / 3) ? REGRESSION_TEST_PASSED : REGRESSION_TEST_FAILED;
+}
+
+// Gradual-drift adaptivity: a rolling working set. Each round accesses a window of keys (a few
+// times each, so they stay hot and get admitted) and slides the window forward by a few keys.
+// Keys that roll off the trailing edge go cold while still carrying high hit counts; a policy
+// that never ages resident counts keeps that stale trailing edge and starves the leading edge.
+// Returns the hit rate on the current window over the second half of the run.
+static double
+test_RamCache_drift(RamCache *cache, int64_t cache_size, StripeSM *stripe)
+{
+  cache->init(cache_size, stripe);
+
+  int const cap    = static_cast<int>(cache_size / BUFFER_SIZE_FOR_INDEX(BUFFER_SIZE_INDEX_16K));
+  int const win    = cap * 3 / 4; // active working-set window (fits with room to spare)
+  int const reps   = 2;           // accesses per key per round (keeps the window hot/admitted)
+  int const slide  = 3;           // keys retired and introduced per round
+  int const rounds = 40;
+
+  std::vector<Ptr<IOBufferData>> keep;
+  auto                           access = [&](uint64_t k) -> bool {
+    CryptoHash hash;
+    hash.u64[0] = (k << 32) + k;
+    hash.u64[1] = (k << 32) + k;
+    Ptr<IOBufferData> got;
+    if (cache->get(&hash, &got)) {
+      return true;
+    }
+    IOBufferData *d = THREAD_ALLOC(ioDataAllocator, this_thread());
+    d->alloc(BUFFER_SIZE_INDEX_16K);
+    memset(d->data(), 0, d->block_size());
+    keep.push_back(make_ptr(d));
+    cache->put(&hash, d, d->block_size());
+    return false;
+  };
+
+  int hits = 0, total = 0;
+  for (int r = 0; r < rounds; r++) {
+    uint64_t base = static_cast<uint64_t>(r) * slide; // window = [base, base + win)
+    for (int rep = 0; rep < reps; rep++) {
+      for (int i = 0; i < win; i++) {
+        bool hit = access(base + i);
+        if (r >= rounds / 2) {
+          total++;
+          hits += hit ? 1 : 0;
+        }
+      }
+    }
+    keep.clear();
+  }
+  return total ? static_cast<double>(hits) / total : 0.0;
+}
+
+REGRESSION_TEST(ram_cache_drift)(RegressionTest *t, int level, int *pstatus)
+{
+  if (REGRESSION_TEST_NIGHTLY > level) {
+    *pstatus = REGRESSION_TEST_PASSED;
+    return;
+  }
+  if (cacheProcessor.IsCacheEnabled() != CacheInitState::INITIALIZED) {
+    rprintf(t, "cache not initialized\n");
+    *pstatus = REGRESSION_TEST_FAILED;
+    return;
+  }
+
+  CacheKey  key;
+  StripeSM *stripe     = theCache->key_to_stripe(&key, "example.com"sv);
+  int64_t   cache_size = 1LL << 21; // 2 MB
+
+  double lru   = test_RamCache_drift(new_RamCacheLRU(), cache_size, stripe);
+  double clfus = test_RamCache_drift(new_RamCacheCLFUS(), cache_size, stripe);
+
+  rprintf(t, "RamCache gradual-drift current-window hit rate (higher is better)\n");
+  rprintf(t, "RamCache LRU   drift-hit-rate %.3f\n", lru);
+  rprintf(t, "RamCache CLFUS drift-hit-rate %.3f\n", clfus);
+
+  // With resident aging and History retention, CLFUS must track a rolling working set, not freeze on the initial cohort.
+  *pstatus = (clfus >= 0.80) ? REGRESSION_TEST_PASSED : REGRESSION_TEST_FAILED;
+}
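Note on the adaptivity metric used by `ram_cache_adaptivity` above: the two-phase measurement can be reproduced outside Traffic Server with a toy cache, which may help when reading the reported numbers. The sketch below is an assumption-laden stand-in: `ToyLru` is a hypothetical object-count LRU (not `RamCacheLRU`), `measure_shift` and `AdaptResult` are illustrative names, and retention is probed with a side-effect-free `contains` rather than `get`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Toy LRU that counts objects instead of bytes. Hypothetical, for illustration.
class ToyLru
{
public:
  explicit ToyLru(std::size_t capacity) : capacity_(capacity) { assert(capacity > 0); }

  // Returns true on a hit. On a miss the key is admitted and, if the cache
  // is over capacity, the least recently used key is evicted.
  bool
  access(uint64_t key)
  {
    auto it = index_.find(key);
    if (it != index_.end()) {
      order_.splice(order_.begin(), order_, it->second); // mark as most recent
      return true;
    }
    order_.push_front(key);
    index_[key] = order_.begin();
    if (index_.size() > capacity_) {
      index_.erase(order_.back());
      order_.pop_back();
    }
    return false;
  }

  // Residency probe that does not change the recency order.
  bool
  contains(uint64_t key) const
  {
    return index_.count(key) != 0;
  }

private:
  std::size_t                                                 capacity_;
  std::list<uint64_t>                                         order_; // front = most recent
  std::unordered_map<uint64_t, std::list<uint64_t>::iterator> index_;
};

struct AdaptResult {
  double b_hit_rate = 0.0;
  int    a_retained = 0;
  int    a_total    = 0;
};

// Same phase structure as the regression test: warm set A, shift to a
// disjoint set B, and measure B's hit rate over the second half of phase 2.
AdaptResult
measure_shift(std::size_t capacity)
{
  ToyLru         cache(capacity);
  int const      nhot   = static_cast<int>(capacity * 7 / 8);
  int const      p1     = 20;
  int const      p2     = 26;
  uint64_t const a_base = 1;
  uint64_t const b_base = 1000000;

  for (int r = 0; r < p1; r++) {
    for (int i = 0; i < nhot; i++) {
      cache.access(a_base + i);
    }
  }

  int b_hits = 0, b_total = 0;
  for (int r = 0; r < p2; r++) {
    for (int i = 0; i < nhot; i++) {
      bool hit = cache.access(b_base + i);
      if (r >= p2 / 2) {
        b_total++;
        b_hits += hit ? 1 : 0;
      }
    }
  }

  AdaptResult res;
  res.a_total = nhot;
  for (int i = 0; i < nhot; i++) {
    res.a_retained += cache.contains(a_base + i) ? 1 : 0;
  }
  res.b_hit_rate = b_total ? static_cast<double>(b_hits) / b_total : 0.0;
  return res;
}
```

With a capacity of 128 objects (2 MB of 16 KB buffers), `measure_shift(128)` yields a B hit rate of 1.0 and 16 of 112 A keys still resident: the first pass over B evicts the 96 oldest A keys, and the 16 that remain are never displaced because B fits alongside them. This is the recency baseline that the test's thresholds (B hit rate at least 0.90, at most a third of A retained) ask CLFUS to approach.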