113 changes: 89 additions & 24 deletions doc/developer-guide/cache-architecture/ram-cache.en.rst
@@ -107,34 +107,99 @@ with the CLOCK rate of the Cached and History Lists.
Cached List
===========

The *Cached List* contains objects actually in memory. The basic operation is
LRU with new entries inserted into a FIFO queue and hits causing objects to be
reinserted. The interesting bit comes when an object is being considered for
insertion. A check is first made against the Object Hash to see if the object
is in the Cached List or History. Hits result in updating the ``hit`` field and
reinsertion of the object. History hits result in the ``hit`` field being
updated and a comparison to see if this object should be kept in memory. The
comparison is against the least recently used members of the Cache List, and
is based on a weighted frequency::

CACHE_VALUE = hits / (size + overhead)

A new object must be enough bytes worth of currently cached objects to cover
itself. Each time an object is considered for replacement the CLOCK moves
forward. If the History object has a greater value then it is inserted into the
Cached List and the replaced objects are removed from memory and their list
entries are inserted into the History List. If the History object has a lesser
value it is reinserted into the History List. Objects considered for replacement
(at least one) but not replaced have their ``hits`` field set to ``0`` and are
reinserted into the Cached List. This is the CLOCK operation on the Cached List.
The *Cached List* (``_lru[0]`` in ``RamCacheCLFUS.cc``) holds the objects
actually resident in memory. New entries are inserted into a FIFO queue and a
hit reinserts the object at the tail. The interesting work happens when an
object is considered for insertion (a *Put*, after a read from secondary
storage). A check is first made against the object hash to see if the object is
already in the Cached List or the History List.

Each object is ranked by a weighted frequency, its *value*::

CACHE_VALUE = (hits + 1) / (size + ENTRY_OVERHEAD)

Smaller and more frequently used objects rank higher, which is what is meant by
least frequently used *by size*. The value of a candidate is compared against
``_average_value``, an exponential moving average of the value of the objects
passed over for replacement -- in effect a floating admission bar.

.. note::

``CACHE_VALUE`` must be evaluated in floating point. Because ``hits`` is
small and ``size`` is large, computing ``(hits + 1) / (size + overhead)`` in
integer arithmetic truncates to ``0`` for every normal object, which silently
collapses CLFUS to FIFO. This was the regression introduced in GitHub PR
#11733; the division is now forced to floating point.
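
The truncation described in the note can be illustrated with a minimal sketch. It is not the code in ``RamCacheCLFUS.cc``: the function names and the overhead constant ``kEntryOverhead`` are assumptions made for illustration, and only the formula comes from the text above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical per-entry overhead in bytes, standing in for ENTRY_OVERHEAD.
constexpr int64_t kEntryOverhead = 128;

// Integer division truncates toward zero, so every object whose
// size + overhead exceeds hits + 1 scores exactly 0.
inline int64_t
cache_value_integer(int64_t hits, int64_t size)
{
  assert(hits >= 0 && size >= 0);
  return (hits + 1) / (size + kEntryOverhead);
}

// Floating-point division keeps the relative ranking between objects.
inline double
cache_value_float(int64_t hits, int64_t size)
{
  assert(hits >= 0 && size >= 0);
  return static_cast<double>(hits + 1) / static_cast<double>(size + kEntryOverhead);
}
```

With the integer form, a 16 KB object with nine hits and one with no hits both score ``0`` and cannot be told apart. With the floating-point form, the hotter object and the smaller object each rank higher, as the policy intends.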

When a *Put* finds the incoming object in the History List, its value is
compared against the least recently used members of the Cached List. The
candidate must be worth at least as many bytes of currently cached objects as it
displaces. Each time an object is considered for replacement the CLOCK advances.
If the candidate wins it is moved into the Cached List and the objects it
displaces are removed from memory, their (data-less) list entries moving to the
History List; if it loses it is returned to the History List. Objects passed
over for replacement (at least one) have their ``hits`` reduced and are
reinserted -- this is the CLOCK (second chance) on the Cached List.
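
The replacement decision above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the implementation: ``Entry``, ``try_admit``, ``kEntryOverhead`` and the choice to halve ``hits`` for a passed-over object are all hypothetical (the text only says the count is reduced), and the moving average and CLOCK bookkeeping are left out.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

struct Entry {
  uint64_t key;
  int64_t  size;
  uint32_t hits;
};

constexpr int64_t kEntryOverhead = 128; // hypothetical stand-in for ENTRY_OVERHEAD

inline double
value_of(const Entry &e)
{
  return static_cast<double>(e.hits + 1) / static_cast<double>(e.size + kEntryOverhead);
}

// Tries to admit a candidate found in the History List. The front of `cached`
// is the least recently used entry. Returns true if the candidate was admitted.
inline bool
try_admit(std::deque<Entry> &cached, std::deque<Entry> &history, const Entry &candidate, int64_t &free_bytes)
{
  assert(candidate.size > 0);
  std::vector<Entry> victims;
  int64_t reclaimed = 0;

  while (free_bytes + reclaimed < candidate.size) {
    bool lost = cached.empty();
    if (!lost) {
      Entry victim = cached.front();
      cached.pop_front();
      if (value_of(victim) > value_of(candidate)) {
        victim.hits /= 2; // passed over: second chance with a reduced count (assumed halving)
        cached.push_back(victim);
        lost = true;
      } else {
        reclaimed += victim.size;
        victims.push_back(victim);
      }
    }
    if (lost) {
      // The candidate loses: restore any collected victims and return it to history.
      for (const Entry &v : victims) {
        cached.push_back(v);
      }
      history.push_back(candidate);
      return false;
    }
  }

  // The candidate wins: displaced entries move to history without their data.
  for (const Entry &v : victims) {
    history.push_back(v);
  }
  free_bytes += reclaimed - candidate.size;
  cached.push_back(candidate);
  return true;
}
```

In this sketch a candidate is admitted only if every object it would displace ranks no higher than it does, so a single more valuable resident object is enough to send the candidate back to history.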

Aging the cached list
---------------------

Frequency counts on resident objects would otherwise only ever grow, so an
object that was hot days ago keeps winning replacement long after it has gone
cold, and the cache cannot follow a changing working set. To prevent this, once
per *turnover* (one *Put* for every resident object) ``_age_resident()`` halves
every resident ``hits`` count *and* halves ``_average_value`` in the same pass.
Halving the bar matters: ``_average_value`` is otherwise updated only inside the
replacement loop, which a low-value candidate never reaches, so without it the
decayed counts would be invisible to the admission decision and a warming
working set could never break in.
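
A minimal sketch of the aging pass, assuming a turnover is counted as one *Put* per resident object as described above. The types and the ``on_put`` trigger are hypothetical, and the real ``_age_resident()`` may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ResidentEntry {
  uint32_t hits;
};

struct AgingSketch {
  std::vector<ResidentEntry> resident;         // stands in for the Cached List
  double                     average_value    = 0.0; // stands in for _average_value
  std::size_t                puts_since_aging = 0;

  // Halve every resident hit count and the admission bar in the same pass.
  void
  age_resident()
  {
    assert(average_value >= 0.0);
    for (ResidentEntry &e : resident) {
      e.hits /= 2;
    }
    average_value /= 2.0;
  }

  // Call once per Put. Returns true when a full turnover triggered an aging pass.
  bool
  on_put()
  {
    if (resident.empty() || ++puts_since_aging < resident.size()) {
      return false;
    }
    age_resident();
    puts_since_aging = 0;
    return true;
  }
};
```

Because the bar is halved together with the counts, a candidate that was just below the old bar can now clear it, even if no replacement loop has run since the last update.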

History List
============

Each CLOCK, the least recently used entry in the History List is dequeued and
if the ``hits`` field is not greater than ``1`` (it was hit at least once in
the History or Cached List) it is deleted. Otherwise, the ``hits`` is set to
``0`` and it is requeued on the History List.
The *History List* (``_lru[1]``) is a bounded record of keys recently evicted
from, or considered for, the Cached List. Its entries carry no data (the
``IOBufferData`` pointer is null); they exist so that an object requested again
soon after eviction can be cheaply re-admitted, and so a newly seen object can
accumulate enough value to earn admission before it is forgotten.

Each CLOCK tick (``_tick()``, run once per eviction) ages the oldest History
entry -- halving its ``hits`` -- and *keeps* it, freeing entries only to hold the
list at its target size, ``_objects / HISTORY_DIVISOR + HISTORY_HYSTERIA``. An
earlier version freed an entry the moment its aged ``hits`` reached ``0``, which
held the list nearly empty and denied re-requested objects their second chance.

The list is deliberately capped well below a full cache-worth
(``HISTORY_DIVISOR`` defaults to 4): it only needs to remember recent candidates
long enough to be requested again, and a full cache-worth of history entries is a
large memory cost (see `Memory overhead`_) for caches holding many small objects.
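
The target size can be expressed as a short helper. This is a sketch under assumptions: the divisor of 4 comes from the text, while the value used here for ``HISTORY_HYSTERIA`` and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kHistoryDivisor  = 4;  // default stated in the text
constexpr int64_t kHistoryHysteria = 10; // assumed slack value, for illustration only

// Target length of the History List for a given number of resident objects.
inline int64_t
history_target(int64_t resident_objects)
{
  assert(resident_objects >= 0);
  return resident_objects / kHistoryDivisor + kHistoryHysteria;
}

// Number of history entries to free so the list returns to its target size.
inline int64_t
history_entries_to_free(int64_t history_entries, int64_t resident_objects)
{
  int64_t const target = history_target(resident_objects);
  return history_entries > target ? history_entries - target : 0;
}
```

For 128 resident objects the target under these assumptions is 42 entries, so a list of 50 sheds 8 entries while a list of 40 sheds none, regardless of how low the aged ``hits`` of its entries have fallen.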

Following a shifting working set
================================

The combination of the bounded, persistent History List and resident aging is
what lets CLFUS track a working set that changes over time. New objects survive
in history long enough to prove themselves and be admitted, while the frequency
advantage of the previous working set decays until its members fall below the
admission bar and are evicted. The ``ram_cache_adaptivity`` (an abrupt change of
the entire hot set) and ``ram_cache_drift`` (a gradually rolling working set)
regression tests in ``CacheTest.cc`` exercise this and compare CLFUS against the
simpler LRU RAM cache.

Memory overhead
===============

Every object in the Cached List has a resident list entry; this per-object
overhead (roughly ``ENTRY_OVERHEAD``) is counted against
``proxy.config.cache.ram_cache.size``. Every object in the History List has a
list entry too (about 88 bytes), but these are **not** counted against the
configured size -- they are memory the process uses in addition to it.

Because the overhead is per object, it is largest for a big cache holding many
small objects. ``HISTORY_DIVISOR`` bounds the History List to roughly
``_objects / 4`` entries to keep this cost modest: for example a 32 GB cache of
1 KB objects holds about 32 million resident objects and therefore about 8
million history entries (~700 MB), rather than ~2.8 GB at a full cache-worth.
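
The arithmetic in that example can be reproduced with a small helper. It is a sketch only: it uses decimal units to match the rounded figures above, takes the 88-byte entry size from the text, and ignores ``HISTORY_HYSTERIA`` and the resident entry overhead. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kHistoryEntryBytes = 88; // approximate size of one history entry, from the text

// Estimated memory used by history entries for a cache of uniformly sized objects.
inline int64_t
history_overhead_bytes(int64_t cache_bytes, int64_t object_bytes, int64_t history_divisor)
{
  assert(cache_bytes >= 0 && object_bytes > 0 && history_divisor > 0);
  int64_t const resident_objects = cache_bytes / object_bytes;
  int64_t const history_entries  = resident_objects / history_divisor;
  return history_entries * kHistoryEntryBytes;
}
```

With 32,000,000,000 bytes of 1,000-byte objects, a divisor of 4 gives 8 million entries and 704,000,000 bytes, while a divisor of 1 (a full cache-worth of history) gives 2,816,000,000 bytes.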

Compression and Decompression
=============================
181 changes: 181 additions & 0 deletions src/iocore/cache/CacheTest.cc
@@ -684,3 +684,184 @@ REGRESSION_TEST(ram_cache)(RegressionTest *t, int level, int *pstatus)
}
}
}

// Measures how well a RAM cache adapts when the hot working set shifts. Phase 1 warms set A until
// it fills most of the cache; phase 2 shifts every reference to a disjoint set B of the same size.
// A frequency policy that never ages resident hit counts keeps the now-cold A pinned and starves
// B (low B hit rate, high A retention); a recency or properly-aged policy releases A.
struct RamCacheAdaptResult {
double b_hit_rate = 0.0;
int a_retained = 0;
int a_total = 0;
};

static RamCacheAdaptResult
test_RamCache_adaptivity(RamCache *cache, int64_t cache_size, StripeSM *stripe)
{
cache->init(cache_size, stripe);

int const obj = BUFFER_SIZE_FOR_INDEX(BUFFER_SIZE_INDEX_16K);
int const nhot = static_cast<int>((cache_size / obj) * 7 / 8); // working set ~7/8 of capacity
int const p1 = 20; // rounds warming A
int const p2 = 26; // rounds referencing B
uint64_t const a_base = 1;
uint64_t const b_base = 1000000;

std::vector<Ptr<IOBufferData>> keep;
auto access = [&](uint64_t k) -> bool {
CryptoHash hash;
hash.u64[0] = (k << 32) + k;
hash.u64[1] = (k << 32) + k;
Ptr<IOBufferData> got;
if (cache->get(&hash, &got)) {
return true;
}
IOBufferData *d = THREAD_ALLOC(ioDataAllocator, this_thread());
d->alloc(BUFFER_SIZE_INDEX_16K);
memset(d->data(), 0, d->block_size());
keep.push_back(make_ptr(d));
cache->put(&hash, d, d->block_size());
return false;
};

for (int r = 0; r < p1; r++) { // warm A to the cache
for (int i = 0; i < nhot; i++) {
access(a_base + i);
}
keep.clear();
}

int b_hits = 0, b_total = 0;
for (int r = 0; r < p2; r++) { // shift all references to B
for (int i = 0; i < nhot; i++) {
bool hit = access(b_base + i);
if (r >= p2 / 2) { // measure once B has had a chance to establish
b_total++;
b_hits += hit ? 1 : 0;
}
}
keep.clear();
}

int a_ret = 0;
for (int i = 0; i < nhot; i++) {
CryptoHash hash;
hash.u64[0] = ((a_base + i) << 32) + (a_base + i);
hash.u64[1] = ((a_base + i) << 32) + (a_base + i);
Ptr<IOBufferData> got;
if (cache->get(&hash, &got)) {
a_ret++;
}
}
keep.clear();

RamCacheAdaptResult res;
res.b_hit_rate = b_total ? static_cast<double>(b_hits) / b_total : 0.0;
res.a_retained = a_ret;
res.a_total = nhot;
return res;
}

REGRESSION_TEST(ram_cache_adaptivity)(RegressionTest *t, int level, int *pstatus)
{
if (REGRESSION_TEST_NIGHTLY > level) {
*pstatus = REGRESSION_TEST_PASSED;
return;
}
if (cacheProcessor.IsCacheEnabled() != CacheInitState::INITIALIZED) {
rprintf(t, "cache not initialized\n");
*pstatus = REGRESSION_TEST_FAILED;
return;
}

CacheKey key;
StripeSM *stripe = theCache->key_to_stripe(&key, "example.com"sv);
int64_t cache_size = 1LL << 21; // 2 MB

RamCacheAdaptResult lru = test_RamCache_adaptivity(new_RamCacheLRU(), cache_size, stripe);
RamCacheAdaptResult clfus = test_RamCache_adaptivity(new_RamCacheCLFUS(), cache_size, stripe);

rprintf(t, "RamCache adaptivity after working-set shift (higher B-hit-rate / lower A-retained is better)\n");
rprintf(t, "RamCache LRU B-hit-rate %.3f A-retained %d/%d\n", lru.b_hit_rate, lru.a_retained, lru.a_total);
rprintf(t, "RamCache CLFUS B-hit-rate %.3f A-retained %d/%d\n", clfus.b_hit_rate, clfus.a_retained, clfus.a_total);

// With the F2 fixes CLFUS must follow the shift: serve the new working set and release the stale one.
*pstatus = (clfus.b_hit_rate >= 0.90 && clfus.a_retained <= clfus.a_total / 3) ? REGRESSION_TEST_PASSED : REGRESSION_TEST_FAILED;
}

// Gradual-drift adaptivity: a rolling working set. Each round accesses a window of keys (a few
// times each, so they stay hot and get admitted) and slides the window forward by a few keys.
// Keys that roll off the trailing edge go cold while still carrying high hit counts; a policy
// that never ages resident counts keeps that stale trailing edge and starves the leading edge.
// Returns the hit rate on the current window over the second half of the run.
static double
test_RamCache_drift(RamCache *cache, int64_t cache_size, StripeSM *stripe)
{
cache->init(cache_size, stripe);

int const cap = static_cast<int>(cache_size / BUFFER_SIZE_FOR_INDEX(BUFFER_SIZE_INDEX_16K));
int const win = cap * 3 / 4; // active working-set window (fits with room)
int const reps = 2; // accesses per key per round (keeps the window hot/admitted)
int const slide = 3; // keys retired and introduced per round
int const rounds = 40;

std::vector<Ptr<IOBufferData>> keep;
auto access = [&](uint64_t k) -> bool {
CryptoHash hash;
hash.u64[0] = (k << 32) + k;
hash.u64[1] = (k << 32) + k;
Ptr<IOBufferData> got;
if (cache->get(&hash, &got)) {
return true;
}
IOBufferData *d = THREAD_ALLOC(ioDataAllocator, this_thread());
d->alloc(BUFFER_SIZE_INDEX_16K);
memset(d->data(), 0, d->block_size());
keep.push_back(make_ptr(d));
cache->put(&hash, d, d->block_size());
return false;
};

int hits = 0, total = 0;
for (int r = 0; r < rounds; r++) {
uint64_t base = static_cast<uint64_t>(r) * slide; // window = [base, base + win)
for (int rep = 0; rep < reps; rep++) {
for (int i = 0; i < win; i++) {
bool hit = access(base + i);
if (r >= rounds / 2) {
total++;
hits += hit ? 1 : 0;
}
}
}
keep.clear();
}
return total ? static_cast<double>(hits) / total : 0.0;
}

REGRESSION_TEST(ram_cache_drift)(RegressionTest *t, int level, int *pstatus)
{
if (REGRESSION_TEST_NIGHTLY > level) {
*pstatus = REGRESSION_TEST_PASSED;
return;
}
if (cacheProcessor.IsCacheEnabled() != CacheInitState::INITIALIZED) {
rprintf(t, "cache not initialized\n");
*pstatus = REGRESSION_TEST_FAILED;
return;
}

CacheKey key;
StripeSM *stripe = theCache->key_to_stripe(&key, "example.com"sv);
int64_t cache_size = 1LL << 21; // 2 MB

double lru = test_RamCache_drift(new_RamCacheLRU(), cache_size, stripe);
double clfus = test_RamCache_drift(new_RamCacheCLFUS(), cache_size, stripe);

rprintf(t, "RamCache gradual-drift current-window hit rate (higher is better)\n");
rprintf(t, "RamCache LRU drift-hit-rate %.3f\n", lru);
rprintf(t, "RamCache CLFUS drift-hit-rate %.3f\n", clfus);

// With the F2 fixes CLFUS must track a rolling working set, not freeze on the initial cohort.
*pstatus = (clfus >= 0.80) ? REGRESSION_TEST_PASSED : REGRESSION_TEST_FAILED;
}