Expert questions / Re: TikToken
« Last post by John Z on Today at 10:45:25 AM »For OpenAI a token is roughly a group of three or four characters. The estimate I use is to divide the length of each word by three and add 1 when the word length is greater than three.
OpenAI tokenizer https://platform.openai.com/tokenizer
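A minimal sketch of that rule in Python (my own reading of it; the function name is arbitrary, and the real tiktoken library will give the exact count):

```python
def estimate_tokens(text):
    """Rough token estimate: length of each word divided by three,
    plus 1 extra when the word is longer than three characters."""
    total = 0
    for word in text.split():
        total += len(word) // 3
        if len(word) > 3:
            total += 1
    return total

# estimate_tokens("hello") -> 2   (5 // 3 = 1, plus 1 for length > 3)
# estimate_tokens("cat")   -> 1   (3 // 3 = 1, no extra)
```

It tends to overestimate slightly for common words, which the real tokenizer keeps as single tokens, but it is cheap and needs no lookup tables.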
One way to get a better estimate (at least for English) is to look for common prefixes and suffixes and break the word there first; as I understand it, the tokenizer tends to produce tokens at those boundaries.
To that end, the following link lists the most common of each:
https://www.scholastic.com/content/dam/teachers/lesson-plans/migrated-files-in-body/prefixes_suffixes.pdf
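A small sketch of that prefix/suffix splitting idea (the affix lists here are a tiny hypothetical sample; the linked PDF has a much fuller set, and this only mimics how a tokenizer might break words, not tiktoken's actual algorithm):

```python
# Hypothetical sample lists; see the linked PDF for common English affixes.
PREFIXES = ["un", "re", "pre", "dis", "mis"]
SUFFIXES = ["tion", "ness", "ing", "ed", "ly"]

def split_affixes(word):
    """Split a word into prefix / stem / suffix pieces at common
    morpheme boundaries, longest affix first."""
    parts = []
    for p in sorted(PREFIXES, key=len, reverse=True):
        # Only strip a prefix if a reasonable stem remains.
        if word.startswith(p) and len(word) > len(p) + 2:
            parts.append(p)
            word = word[len(p):]
            break
    suffix = None
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = s
            word = word[:-len(s)]
            break
    parts.append(word)
    if suffix:
        parts.append(suffix)
    return parts

# split_affixes("replaying") -> ["re", "play", "ing"]
# split_affixes("cat")       -> ["cat"]
```

Counting the resulting pieces (instead of raw character groups) should track the real token count more closely for affixed words.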
John Z