Model loading doesn't work for SentencePieceTokenizer #12

@siin-lab

This code fails with an error:

import tkseem as tk

tokenizer_path = 'model.pl'
tokenizer = tk.SentencePieceTokenizer()
tokenizer.train(dataset_file)

# save the tokenizer to a file
tokenizer.save_model(tokenizer_path)

# load the tokenizer from a file
tokenizer = tk.SentencePieceTokenizer()
tokenizer.load_model(tokenizer_path)

# test the tokenizer
a = tokenizer.tokenize("السلام عليكم")

The error message is:

Traceback (most recent call last):
  File "/Users/user/Desktop/Projects/train-tokenizer.py", line 15, in <module>
    a = tokenizer.tokenize("السلام عليكم")
  File "/Users/user/.pyenv/versions/3.10.0/lib/python3.10/site-packages/tkseem/sentencepiece_tokenizer.py", line 50, in tokenize
    return self.sp.encode(text, out_type=str)
AttributeError: 'bool' object has no attribute 'encode'

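The traceback shows that self.sp is a bool rather than a SentencePieceProcessor after loading. This can be fixed by updating "load_model" so that self.sp holds the processor object itself: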

    def load_model(self, file_path):
        """Load a saved sp model

        Args:
            file_path (str): file path of the trained model
        """
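        # Read the serialized model inside a context manager so the file is closed promptly
        with open(file_path, "rb") as model_file:
            self.sp = spm.SentencePieceProcessor(model_proto=model_file.read())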
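
For anyone hitting the same traceback, here is a minimal sketch of the kind of mistake that produces a 'bool' object where a processor was expected. It is an assumption about the cause, since the current load_model source is not shown in this issue. FakeProcessor, load_buggy, load_fixed and try_encode are hypothetical names made up for illustration; none of them belong to tkseem or sentencepiece.

```python
class FakeProcessor:
    """Hypothetical stand-in for a tokenizer backend (not the sentencepiece API)."""

    def __init__(self):
        self.loaded = False

    def load(self, path):
        # Mimics a loader that reports success with a flag instead of returning self.
        self.loaded = True
        return True

    def encode(self, text):
        # Toy behaviour: split on whitespace.
        return text.split()


def load_buggy(path):
    # Bug pattern: keeps the success flag and discards the processor.
    return FakeProcessor().load(path)


def load_fixed(path):
    # Fix pattern: keep the processor object and ignore the flag.
    proc = FakeProcessor()
    proc.load(path)
    return proc


def try_encode(obj, text):
    # Return the tokens, or the AttributeError message if obj cannot encode.
    try:
        return obj.encode(text)
    except AttributeError as exc:
        return str(exc)


buggy_result = try_encode(load_buggy("model.pl"), "السلام عليكم")
fixed_result = try_encode(load_fixed("model.pl"), "السلام عليكم")
print(buggy_result)
print(fixed_result)
```

If load_model currently follows the first pattern, the change proposed above avoids it by constructing the processor directly from the model bytes, so no return value of a load call is ever assigned to self.sp.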
