Fix garbled non-ASCII text in Compare with Server #322
Open
Mosh-K wants to merge 2 commits into
decodeFromBase64 was decoding base64 using "binary" (Latin-1), but the Dataverse Web API returns UTF-8 bytes. Each multi-byte UTF-8 sequence became one Latin-1 char per byte, and writeFileSync then re-encoded those chars as UTF-8 — producing double-encoded mojibake in the server pane of the diff view for any web resource containing non-ASCII content (Hebrew, Arabic, CJK, accented Latin, etc.). Switching the decode to "utf8" preserves the original bytes. ASCII-only resources are unaffected because Latin-1 and UTF-8 agree on 0x00-0x7F, which is why the bug went unnoticed for years. encodeToBase64 is intentionally left untouched: it is only ever called with a Buffer (readFileSync without an encoding returns a Buffer), and Buffer.from ignores the encoding argument when the input is already a Buffer — so the "binary" there is harmless dead-code on that path.
Power-Maverick
requested changes
May 19, 2026
Add decodeFromBase64ToUTF8 alongside the existing decodeFromBase64 and switch the Compare-with-Server path to it. The existing function is kept unchanged for symmetry with encodeToBase64. Root cause: the Dataverse Web API returns web resource content as base64-encoded UTF-8 bytes. decodeFromBase64 decoded with "binary" (Latin-1), turning each multi-byte UTF-8 sequence into one Latin-1 char per byte; fs.writeFileSync then re-encoded those chars as UTF-8 — net result was double-encoded mojibake in the server pane of the diff, making every line containing non-ASCII appear as a false diff. ASCII-only resources were unaffected because Latin-1 and UTF-8 agree on bytes 0x00-0x7F, which is why the bug went unnoticed since 2021.
Problem
Compare with Server shows garbled text in the server pane for any web resource containing non-ASCII characters (Hebrew, Arabic, CJK, accented Latin, etc.). Because every multi-byte character is corrupted into different bytes, the diff editor falsely reports every line containing non-ASCII as changed, even when local and server are identical. This makes the compare feature unusable for any localized web resource.
Before / After
Before — server pane (left) shows double-encoded mojibake; the diff falsely reports every line as changed even though the files are identical:
After — server pane decodes correctly; the diff now reflects real differences only:
Root cause
In src/utils/ExtensionMethods.ts:
The Dataverse Web API returns the web resource content as base64-encoded UTF-8 bytes. Decoding with "binary" (an alias for Latin-1) maps each byte to one char, so a multi-byte UTF-8 sequence like 0xD7 0x90 (א) becomes the two Latin-1 chars × + \x90.
The result is then handed to fs.writeFileSync(path, data) with no encoding, which defaults to "utf8", re-encoding those Latin-1 chars as UTF-8. Net effect: the original UTF-8 bytes are wrapped inside another layer of UTF-8 encoding, which is classic double-encoding (mojibake).
ASCII-only resources are unaffected because Latin-1 and UTF-8 agree on bytes 0x00–0x7F, which is why this has gone unnoticed since the file was introduced in 2021.
Fix
One-line change — decode as UTF-8 instead of Latin-1:
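The original diff is not reproduced in this text. As a rough illustration of the root cause and the fix, the following TypeScript sketch replays both decode paths on a single Hebrew letter using plain Buffer calls. The variable names are made up for the example and are not taken from the repo.

```typescript
import { Buffer } from "node:buffer";

// Simulate the server payload: base64 of the UTF-8 bytes of "א" (0xD7 0x90).
const original = "א";
const payload = Buffer.from(original, "utf8").toString("base64");

// Old decode path: "binary" (Latin-1) turns every byte into one char.
const latin1Decoded = Buffer.from(payload, "base64").toString("binary");

// Fixed decode path: interpret the bytes as UTF-8.
const utf8Decoded = Buffer.from(payload, "base64").toString("utf8");

// writeFileSync encodes string data as UTF-8 by default; Buffer.from(str, "utf8")
// yields the same bytes that would land on disk.
const bytesOnDiskOld = Buffer.from(latin1Decoded, "utf8");
const bytesOnDiskNew = Buffer.from(utf8Decoded, "utf8");

console.log(bytesOnDiskOld.toString("hex")); // c397c290
console.log(bytesOnDiskNew.toString("hex")); // d790
```

The old path leaves four bytes on disk (c3 97 c2 90) where the server sent two (d7 90): each original byte has been re-encoded as its own UTF-8 character, which is the extra encoding layer that shows up as mojibake in the server pane.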
Why encodeToBase64 was intentionally not changed
encodeToBase64 has the mirror "binary" argument, but it is harmless dead code on every call path in the repo. The only callers are:
…and readFileSync (in src/utils/FileSystem.ts) calls fs.readFileSync(source) with no encoding, which returns a Buffer. When the first argument to Buffer.from is already a Buffer, the encoding argument is ignored per Node's docs; the bytes are simply copied. So uploads round-trip correctly today.
Changing encodeToBase64 would be a no-op for current callers and risks silently changing behavior if someone in the future starts passing a string instead of a Buffer. Keeping the PR strictly to the broken decode path.
Test plan
uploadWebResource → encodeToBase64(readFileSync(...))) is unchanged and still produces correct base64 for both ASCII and non-ASCII content.
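The claim that the upload path is unaffected can be checked with a short sketch. encodeSketch below is a hypothetical stand-in, not the repo's encodeToBase64 (whose body is not shown here); it only assumes the function forwards its input and an encoding name to Buffer.from.

```typescript
import { Buffer } from "node:buffer";

// Hypothetical stand-in for encodeToBase64: forwards input and encoding to Buffer.from.
// The parameter is typed `any` because the sketch is called with both Buffers and strings.
function encodeSketch(content: any, encoding: "binary" | "utf8"): string {
  return Buffer.from(content, encoding).toString("base64");
}

const text = "שלום, héllo, 你好";

// fs.readFileSync(path) without an encoding returns a Buffer of the raw file bytes;
// a UTF-8 file on disk is simulated here.
const fileBytes = Buffer.from(text, "utf8");

// Buffer input: the encoding argument is ignored, so both calls agree.
const bufBinary = encodeSketch(fileBytes, "binary");
const bufUtf8 = encodeSketch(fileBytes, "utf8");
console.log(bufBinary === bufUtf8); // true

// Decoding the upload as UTF-8 recovers the original text.
const roundTripped = Buffer.from(bufBinary, "base64").toString("utf8");
console.log(roundTripped === text); // true

// String input: the encoding argument now matters, so the results diverge.
const strBinary = encodeSketch(text, "binary");
const strUtf8 = encodeSketch(text, "utf8");
console.log(strBinary === strUtf8); // false
```

With Buffer input the two encodings are indistinguishable, which matches the "no-op for current callers" argument. The divergence only appears once a string is passed in, which is the future risk the description mentions.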