Feature/pretrain multilingual ready by fleurvanl · Pull Request #59 · NetherlandsForensicInstitute/asmtransformers

fleurvanl · 2026-05-19T12:36:42Z

See also observation

ranieri · 2026-05-19T14:22:49Z


    # Load the tokenizer and model
-    tokenizer = BertTokenizer.from_pretrained(tokenizer)
+    tokenizer = ASMTokenizer.from_pretrained(tokenizer)


There's no vocab file, so the tokenizer ends up with only the five special tokens. In practice, this means every data item will be tokenized to nothing but UNK and PAD.

A small (temporary) vocab would be better.
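
To illustrate the concern, here is a minimal sketch of what a vocab-less tokenizer does to an input. It is not the `ASMTokenizer` implementation: `build_vocab` and `encode` are hypothetical helpers, the whitespace splitting is a simplification, and the five special tokens are assumed to be the usual BERT ones (`[PAD]`, `[UNK]`, `[CLS]`, `[SEP]`, `[MASK]`).

```python
# Assumption: the five special tokens are the standard BERT ones, in this order.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_vocab(tokens):
    """Map each token to an id by position, like the lines of a vocab.txt."""
    return {token: index for index, token in enumerate(tokens)}


def encode(text, vocab, max_length):
    """Hypothetical encoder: split on whitespace, fall back to [UNK], pad with [PAD]."""
    ids = [vocab.get(token, vocab["[UNK]"]) for token in text.split()]
    ids = ids[:max_length]
    return ids + [vocab["[PAD]"]] * (max_length - len(ids))


# Tokenizer without a vocab file: only the special tokens are known.
empty_vocab = build_vocab(SPECIAL_TOKENS)
# Small temporary vocab with a few illustrative (made-up) entries.
small_vocab = build_vocab(SPECIAL_TOKENS + ["mov", "add", "x0", "x1"])

instruction = "mov x0 x1"
print(encode(instruction, empty_vocab, max_length=5))  # [1, 1, 1, 0, 0]
print(encode(instruction, small_vocab, max_length=5))  # [5, 7, 8, 0, 0]
```

With only the special tokens, every input token collapses to the `[UNK]` id and the rest is padding, so the model sees no signal. Even a handful of real entries is enough to make a smoke test of the pretraining script meaningful.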

fleurvanl added 4 commits May 19, 2026 14:26

pretrain.py multilingual-ready

fa9fdc0

create tokenizer for multilingual training NOTE vocab.txt is NOT included as it is not definitive

a758d4f

create tokenizer for multilingual training NOTE vocab.txt is NOT included as it is not definitive

b55ae36

Merge remote-tracking branch 'origin/feature/pretrain-multilingual-ready' into feature/pretrain-multilingual-ready

81a758c

ranieri changed the base branch from bugfix/off-by-one-operand-search to main May 19, 2026 14:09

ranieri approved these changes May 19, 2026

Refactor model config loading to use packaged resources and update tokenizer argument handling

a1b7fc5

ranieri approved these changes May 19, 2026

ranieri added 4 commits May 19, 2026 16:47

Add configuration files for arm64bert and multilingual_asmbert models

3824b5d

Update tokenizer class to ASMTokenizer in multilingual_asmbert config

9a98959

Add basic vocab.txt for multilingual

c08abe2

Update default model and tokenizer to multilingual_asmbert in pretrain script

459b885

ranieri merged commit f0b17b5 into main May 19, 2026
4 checks passed

ranieri merged 9 commits into main from feature/pretrain-multilingual-ready
