Commit 0d1b102
committed
fix: enable addLeadingSpace for SentencePiece unigram models
SetSentencePiece(true) now sets addLeadingSpace=true as a persistent
field on BPETokenizer, matching llama.cpp / SentencePiece default
behavior. Previously addLeadingSpace was only a parameter passed
through the call chain — making it a field ensures the first word
always gets the ▁ prefix prepended, so tokens like ▁What are found
by the Viterbi DP instead of falling back to character-level tokens.
Also adds SetAddLeadingSpace() for GGUF models that override the
default via tokenizer.ggml.add_space_prefix metadata.1 parent 45a5ae0 commit 0d1b102
2 files changed
Lines changed: 90 additions & 3 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
37 | 37 | | |
38 | 38 | | |
39 | 39 | | |
| 40 | + | |
| 41 | + | |
| 42 | + | |
| 43 | + | |
40 | 44 | | |
41 | 45 | | |
42 | 46 | | |
| |||
92 | 96 | | |
93 | 97 | | |
94 | 98 | | |
95 | | - | |
96 | | - | |
| 99 | + | |
| 100 | + | |
| 101 | + | |
97 | 102 | | |
98 | 103 | | |
99 | 104 | | |
100 | 105 | | |
101 | 106 | | |
102 | 107 | | |
103 | 108 | | |
104 | | - | |
| 109 | + | |
105 | 110 | | |
106 | 111 | | |
107 | 112 | | |
| |||
276 | 281 | | |
277 | 282 | | |
278 | 283 | | |
| 284 | + | |
| 285 | + | |
| 286 | + | |
| 287 | + | |
| 288 | + | |
| 289 | + | |
| 290 | + | |
| 291 | + | |
| 292 | + | |
279 | 293 | | |
280 | 294 | | |
281 | 295 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
938 | 938 | | |
939 | 939 | | |
940 | 940 | | |
| 941 | + | |
| 942 | + | |
| 943 | + | |
| 944 | + | |
| 945 | + | |
| 946 | + | |
| 947 | + | |
| 948 | + | |
| 949 | + | |
| 950 | + | |
| 951 | + | |
| 952 | + | |
| 953 | + | |
| 954 | + | |
| 955 | + | |
| 956 | + | |
| 957 | + | |
| 958 | + | |
| 959 | + | |
| 960 | + | |
| 961 | + | |
| 962 | + | |
| 963 | + | |
| 964 | + | |
| 965 | + | |
| 966 | + | |
| 967 | + | |
| 968 | + | |
| 969 | + | |
| 970 | + | |
| 971 | + | |
| 972 | + | |
| 973 | + | |
| 974 | + | |
| 975 | + | |
| 976 | + | |
| 977 | + | |
| 978 | + | |
| 979 | + | |
| 980 | + | |
| 981 | + | |
| 982 | + | |
| 983 | + | |
| 984 | + | |
| 985 | + | |
| 986 | + | |
| 987 | + | |
| 988 | + | |
| 989 | + | |
| 990 | + | |
| 991 | + | |
| 992 | + | |
| 993 | + | |
| 994 | + | |
| 995 | + | |
| 996 | + | |
| 997 | + | |
| 998 | + | |
| 999 | + | |
| 1000 | + | |
| 1001 | + | |
| 1002 | + | |
| 1003 | + | |
| 1004 | + | |
| 1005 | + | |
| 1006 | + | |
| 1007 | + | |
| 1008 | + | |
| 1009 | + | |
| 1010 | + | |
| 1011 | + | |
| 1012 | + | |
| 1013 | + | |
941 | 1014 | | |
942 | 1015 | | |
943 | 1016 | | |
| |||
0 commit comments