Fix garbled non-ASCII text in Compare with Server #322

Open
Mosh-K wants to merge 2 commits into Power-Maverick:main from Mosh-K:fix/decode-base64-utf8

@Mosh-K commented May 19, 2026

Problem

Compare with Server shows garbled text in the server pane for any web resource containing non-ASCII characters (Hebrew, Arabic, CJK, accented Latin, etc.). Because every multi-byte character is corrupted into different bytes, the diff editor falsely reports every line containing non-ASCII as changed — even when local and server are identical. This makes the compare feature unusable for any localized web resource.

Before / After

Before — server pane (left) shows double-encoded mojibake; the diff falsely reports every line as changed even though the files are identical:

[screenshot: before]

After — server pane decodes correctly; the diff now reflects real differences only:

[screenshot: after]

Root cause

In src/utils/ExtensionMethods.ts:

export const decodeFromBase64 = (str: string): string =>
    Buffer.from(str, "base64").toString("binary");

The Dataverse Web API returns the web resource content as base64-encoded UTF-8 bytes. Decoding with "binary" (an alias for Latin-1) maps each byte to one char, so a multi-byte UTF-8 sequence like 0xD7 0x90 (א) becomes the two Latin-1 chars × + \x90.

The result is then handed to fs.writeFileSync(path, data) with no encoding, which defaults to "utf8" — re-encoding those Latin-1 chars as UTF-8. Net effect: the original UTF-8 bytes are wrapped inside another layer of UTF-8 encoding — classic double-encoding / mojibake.
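
The double-encoding described above can be reproduced without touching the file system. The following is a minimal sketch, not code from the repo: `buggyDecode` mirrors the original one-liner, and `Buffer.from(decoded, "utf8")` is used as a stand-in for the bytes that `fs.writeFileSync(path, data)` would write with its default encoding. All variable names are made up for illustration.

```typescript
import { Buffer } from "node:buffer";

// Mirrors the original decodeFromBase64 ("binary" is an alias for Latin-1).
const buggyDecode = (str: string): string =>
    Buffer.from(str, "base64").toString("binary");

// What the Web API would return for a resource containing only "א" (UTF-8 bytes d7 90).
const serverBase64 = Buffer.from("א", "utf8").toString("base64");

// Each byte becomes one Latin-1 char: U+00D7 followed by U+0090.
const decoded = buggyDecode(serverBase64);

// Stand-in for fs.writeFileSync(path, decoded): strings are written as UTF-8 by default.
const bytesOnDisk = Buffer.from(decoded, "utf8");

console.log(decoded.length);              // 2
console.log(bytesOnDisk.toString("hex")); // c397c290
```

The two original bytes `d7 90` end up as four bytes on disk, which is what the diff editor then renders as mojibake.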

ASCII-only resources are unaffected because Latin-1 and UTF-8 agree on bytes 0x00-0x7F, which is why this has gone unnoticed since the file was introduced in 2021.

Fix

One-line change — decode as UTF-8 instead of Latin-1:

-export const decodeFromBase64 = (str: string): string => Buffer.from(str, "base64").toString("binary");
+export const decodeFromBase64 = (str: string): string => Buffer.from(str, "base64").toString("utf8");
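
One way to sanity-check the change is to put both decodings side by side. This is only a sketch: `decodeLatin1`, `decodeUtf8`, and `toBase64` are hypothetical helper names, not functions from the repo, and the sample strings are arbitrary.

```typescript
import { Buffer } from "node:buffer";

// Old behavior: one char per byte.
const decodeLatin1 = (str: string): string =>
    Buffer.from(str, "base64").toString("binary");

// New behavior: interpret the bytes as UTF-8.
const decodeUtf8 = (str: string): string =>
    Buffer.from(str, "base64").toString("utf8");

// Simulates the server side: UTF-8 bytes, base64-encoded.
const toBase64 = (text: string): string =>
    Buffer.from(text, "utf8").toString("base64");

const hebrew = "שלום";
const ascii = "alert('hi');";

console.log(decodeUtf8(toBase64(hebrew)) === hebrew);   // true
console.log(decodeLatin1(toBase64(hebrew)) === hebrew); // false
// For ASCII-only content the two decodings are indistinguishable.
console.log(decodeLatin1(toBase64(ascii)) === decodeUtf8(toBase64(ascii))); // true
```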

Why encodeToBase64 was intentionally not changed

encodeToBase64 has the mirror "binary" argument, but it is harmless dead-code on every call path in the repo. The only callers are:

encodeToBase64(readFileSync(fullPath))

…and readFileSync (in src/utils/FileSystem.ts) calls fs.readFileSync(source) with no encoding, which returns a Buffer. When the first argument to Buffer.from is already a Buffer, the encoding argument is ignored per Node's docs — the bytes are simply copied. So uploads round-trip correctly today.

Changing encodeToBase64 would be a no-op for current callers and risks silently changing behavior if someone in the future starts passing a string instead of a Buffer. Keeping the PR strictly to the broken decode path.
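
The difference between the two input types can be illustrated as follows. The body of encodeToBase64 is not shown in this PR, so `encodeSketch` below is an assumption about its shape (a `Buffer.from(data, "binary")` followed by `toString("base64")`), and the cast exists only so the sketch type-checks for both inputs.

```typescript
import { Buffer } from "node:buffer";

// Assumed shape of encodeToBase64; the real implementation is not shown in this PR.
// The cast only exists so the sketch accepts both Buffer and string inputs.
const encodeSketch = (data: Buffer | string): string =>
    Buffer.from(data as unknown as string, "binary").toString("base64");

// Bytes d7 90, as fs.readFileSync(path) with no encoding would return them.
const original = Buffer.from("א", "utf8");

// Buffer input: the bytes are copied and "binary" plays no role.
const fromBuffer = encodeSketch(original);
console.log(fromBuffer === original.toString("base64")); // true

// String input: "binary" now applies, and U+05D0 does not fit in one Latin-1 byte.
const fromString = encodeSketch("א");
console.log(fromString === original.toString("base64")); // false
```

Under that assumption, the upload path is only safe as long as callers keep passing a Buffer, which matches the reasoning above for leaving the function alone.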

Test plan

  • Compared a web resource containing Hebrew before/after the fix — see screenshots above.
  • Verified an ASCII-only web resource still diffs identically to before (no false diffs, no encoding changes).
  • Confirmed upload path (uploadWebResource → encodeToBase64(readFileSync(...))) is unchanged and still produces correct base64 for both ASCII and non-ASCII content.

decodeFromBase64 was decoding base64 using "binary" (Latin-1), but the
Dataverse Web API returns UTF-8 bytes. Each multi-byte UTF-8 sequence
became one Latin-1 char per byte, and writeFileSync then re-encoded
those chars as UTF-8 — producing double-encoded mojibake in the server
pane of the diff view for any web resource containing non-ASCII content
(Hebrew, Arabic, CJK, accented Latin, etc.).

Switching the decode to "utf8" preserves the original bytes. ASCII-only
resources are unaffected because Latin-1 and UTF-8 agree on 0x00-0x7F,
which is why the bug went unnoticed for years.

encodeToBase64 is intentionally left untouched: it is only ever called
with a Buffer (readFileSync without an encoding returns a Buffer), and
Buffer.from ignores the encoding argument when the input is already a
Buffer — so the "binary" there is harmless dead-code on that path.
@Mosh-K requested a review from Power-Maverick as a code owner May 19, 2026 11:05
Comment thread src/utils/ExtensionMethods.ts Outdated
Add decodeFromBase64ToUTF8 alongside the existing decodeFromBase64 and
switch the Compare-with-Server path to it. The existing function is kept
unchanged for symmetry with encodeToBase64.

Root cause: the Dataverse Web API returns web resource content as
base64-encoded UTF-8 bytes. decodeFromBase64 decoded with "binary"
(Latin-1), turning each multi-byte UTF-8 sequence into one Latin-1 char
per byte; fs.writeFileSync then re-encoded those chars as UTF-8 — net
result was double-encoded mojibake in the server pane of the diff,
making every line containing non-ASCII appear as a false diff.

ASCII-only resources were unaffected because Latin-1 and UTF-8 agree
on bytes 0x00-0x7F, which is why the bug went unnoticed since 2021.