Append is not cancel safe #52

@fulmicoton

If the future returned by append_record is not polled to completion, for instance because it is wrapped in a tokio timeout or because the task running it is cancelled,
then we can end up in a corrupted state.

The code does the following, in order:

- get next record position by looking at the last record in RAM.
- write on disk
- (C) write on RAM

If the task is stopped in the middle of (C), we end up in a state where what is on disk does not match what is in RAM.
In particular, on the next append, we will compute a record position from RAM that might already have been written to disk.
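The sequence above can be reproduced with a toy model. The following is a minimal sketch using only the standard library, not the mrecordlog code: `ToyLog`, `YieldOnce`, `NoopWake`, `block_on`, `poll_once_then_cancel` and `demo` are all hypothetical names, two `Vec<u64>` stand in for the disk and RAM state, and the position of the suspension point is an assumption. Polling the future once and then dropping it plays the role of a timeout firing.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the log: `disk` and `ram` should hold the same positions.
#[derive(Default)]
struct ToyLog {
    disk: Vec<u64>,
    ram: Vec<u64>,
}

/// Returns `Pending` on the first poll and `Ready` on the second.
/// It stands in for any `.await` at which the task may be suspended.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

impl ToyLog {
    async fn append_record(&mut self) -> u64 {
        // Get the next position by looking at the last record in RAM.
        let position = self.ram.last().map_or(0, |last| last + 1);
        // Write on disk.
        self.disk.push(position);
        // Assumed suspension point: the future can be dropped here.
        YieldOnce(false).await;
        // (C) Write on RAM.
        self.ram.push(position);
        position
    }
}

/// Waker that does nothing; enough to poll futures by hand.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the future until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Polls the future once, then drops it, as a timeout or task cancellation would.
fn poll_once_then_cancel<F: Future>(fut: F) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let _ = fut.as_mut().poll(&mut cx);
}

/// Runs one complete append, one cancelled append, then one more complete append.
fn demo() -> (Vec<u64>, Vec<u64>) {
    let mut log = ToyLog::default();
    block_on(log.append_record());
    poll_once_then_cancel(log.append_record());
    block_on(log.append_record());
    (log.disk, log.ram)
}
```

In this sketch `demo()` returns a disk state of `[0, 1, 1]` and a RAM state of `[0, 1]`: the cancelled append left position 1 on disk only, and the following append derived position 1 from RAM again.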

When we reload the mrecordlog from disk, this is identified as a corruption.
This has been observed in prod.

A second case, also observed, is a straight panic.
Here the cancellation is assumed to have happened after we appended the record meta to record_metas
and before we populated the concatenated_records rolling buffer.

        self.record_metas.push(record_meta);
        self.concatenated_records.extend(payload);

The reported panic is:

2024-02-28T23:41:54Z app[7816406b969758] iad [info]thread 'tokio-runtime-worker' panicked at /usr/local/cargo/git/checkouts/mrecordlog-34aad39ce3e0e659/bc6a998/src/mem/queue.rs:87:46:
2024-02-28T23:41:54Z app[7816406b969758] iad [info]slice index starts at 928 but ends at 0
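This message is the one Rust's slice indexing emits when a range starts after it ends. As a small illustration, and not the code at queue.rs:87, the sketch below reads a byte range through `slice::get`, which returns `None` for such a range instead of panicking. The names `record_bytes` and `demo_inverted_range` and the 1024-byte buffer are hypothetical.

```rust
/// Hypothetical helper: returns the bytes in `start..end`, or `None` if the
/// range is inverted (start > end) or runs past the end of the buffer.
fn record_bytes(buffer: &[u8], start: usize, end: usize) -> Option<&[u8]> {
    buffer.get(start..end)
}

/// Looks up one valid range and one inverted range, returning the lengths found.
fn demo_inverted_range() -> (Option<usize>, Option<usize>) {
    let buffer = vec![0u8; 1024];
    let valid = record_bytes(&buffer, 0, 928).map(|bytes| bytes.len());
    // Same shape as the reported panic: the range starts at 928 and ends at 0.
    let inverted = record_bytes(&buffer, 928, 0).map(|bytes| bytes.len());
    (valid, inverted)
}
```

Here `demo_inverted_range()` returns `(Some(928), None)`. A checked lookup like this only turns the panic into a recoverable error; it does not restore the consistency between record_metas and concatenated_records.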
