Feature/pretrain multilingual ready#59

Merged
ranieri merged 9 commits into main from feature/pretrain-multilingual-ready on May 19, 2026

Conversation

@fleurvanl
Contributor

See also the observation.

@ranieri ranieri changed the base branch from bugfix/off-by-one-operand-search to main May 19, 2026 14:09
Comment thread asmtransformers/scripts/pretrain.py Outdated

  # Load the tokenizer and model
- tokenizer = BertTokenizer.from_pretrained(tokenizer)
+ tokenizer = ASMTokenizer.from_pretrained(tokenizer)
Member

There's no vocab, so the tokenizer ends up with only the five special tokens. In practice, this means every data item will be tokenized to nothing but UNK and PAD.

A small (temporary) vocab would be better.
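
To illustrate the concern, here is a minimal sketch using a toy whitespace tokenizer, not the project's `ASMTokenizer`. The special-token names are assumed to follow the BERT convention, and `build_vocab`, `tokenize`, and the sample input are hypothetical names made up for this example.

```python
# Toy stand-in for a vocab lookup; this is NOT the project's ASMTokenizer.
# Assumption: the five special tokens use BERT-style names.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(extra_tokens=()):
    """Map each token to an id, with the special tokens first."""
    return {tok: i for i, tok in enumerate([*SPECIAL_TOKENS, *extra_tokens])}


def tokenize(text, vocab, max_length=6):
    """Split on whitespace, pad to max_length, replace unknown tokens with [UNK]."""
    tokens = text.split()[:max_length]
    tokens += ["[PAD]"] * (max_length - len(tokens))
    return [tok if tok in vocab else "[UNK]" for tok in tokens]


sample = "mov eax ebx"  # hypothetical assembly-like input

# With no real vocab, only the special tokens exist.
print(tokenize(sample, build_vocab()))
# ['[UNK]', '[UNK]', '[UNK]', '[PAD]', '[PAD]', '[PAD]']

# With a small temporary vocab, the input survives tokenization.
print(tokenize(sample, build_vocab(["mov", "eax", "ebx"])))
# ['mov', 'eax', 'ebx', '[PAD]', '[PAD]', '[PAD]']
```

With only special tokens available, every input collapses to the same UNK/PAD pattern, so a pretraining run would have nothing meaningful to learn from.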

Comment thread asmtransformers/asmtransformers/models/multilingual_asmbert/tokenizer_config.json Outdated
@ranieri ranieri merged commit f0b17b5 into main May 19, 2026
4 checks passed
2 participants