Author Topic: TikToken  (Read 6445 times)

Offline HellOfMice

  • Member
  • *
  • Posts: 105
  • Never be pleased, always improve
TikToken
« on: May 13, 2024, 06:34:21 AM »
Hello,


Is it possible to interface TikToken with Pelles C? I have searched but did not find anything.
TikToken is written in Python.
If it is possible, how can it be done, please?


Thank you for your help
--------------------------------
Kenavo

Offline frankie

  • Global Moderator
  • Member
  • *****
  • Posts: 2113
Re: TikToken
« Reply #1 on: May 14, 2024, 12:27:12 PM »
"It is better to be hated for what you are than to be loved for what you are not." - Andre Gide

Offline Vortex

  • Member
  • *
  • Posts: 864
    • http://www.vortex.masmcode.com
Re: TikToken
« Reply #2 on: May 14, 2024, 10:11:23 PM »
Hello,

Are you referring to this project?

https://github.com/openai/tiktoken
Code it... That's all...

Offline HellOfMice

Re: TikToken
« Reply #3 on: May 16, 2024, 12:36:27 PM »
Yes. It computes the tokens.
--------------------------------
Kenavo

Offline HellOfMice

Re: TikToken
« Reply #4 on: May 16, 2024, 12:48:26 PM »
For OpenAI, a token is roughly a group of three or four characters. My solution is to divide the length of each word by three and add 1 if the word length is greater than three.


OpenAI tokenizer https://platform.openai.com/tokenizer
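
The word-length estimate described above can be sketched in C roughly as follows. It only illustrates that rule of thumb and is not TikToken's actual algorithm. The function names estimate_word_tokens and estimate_text_tokens are made up for this example, words are assumed to be separated by whitespace, and counting at least one token for one- and two-letter words is an added assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Estimate for a single word of `len` characters:
   len / 3 (integer division), plus 1 when len > 3.
   ASSUMPTION (not in the original post): a non-empty word
   always counts as at least one token. */
size_t estimate_word_tokens(size_t len)
{
    size_t n = len / 3;

    if (len > 3)
        n += 1;
    if (len > 0 && n == 0)
        n = 1;
    return n;
}

/* Sum the per-word estimate over a whitespace-separated text. */
size_t estimate_text_tokens(const char *text)
{
    size_t total = 0;
    size_t len = 0;   /* length of the word currently being scanned */
    const char *p;

    for (p = text; ; ++p) {
        if (*p != '\0' && !isspace((unsigned char)*p)) {
            ++len;
        } else {
            /* End of a word (or a run of spaces, where len is 0). */
            total += estimate_word_tokens(len);
            len = 0;
            if (*p == '\0')
                break;
        }
    }
    return total;
}
```

For example, estimate_text_tokens("a cat says hello") adds up 1 + 1 + 2 + 2 and returns 6. The OpenAI tokenizer page linked above can be used to check how far off such an estimate is for real text.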

--------------------------------
Kenavo

Offline WiiLF23

  • Member
  • *
  • Posts: 89
Re: TikToken
« Reply #5 on: May 16, 2024, 09:52:22 PM »
I would love to convert this, just to stick it to Python.

I’m not a fan of it; however, given the use of vectors and a range of “modules”, I would just grab the bindings and cave in.

A pure rewrite would utilize AVX/AVX2 or the SSE instructions (with CPU vendor detection of course). That alone is worth considering if you want a from-scratch implementation in C. Pelles C has vector support; you will find it in the project settings.

Basically, you would need the API documentation; the rest is a matter of writing C code that matches the OpenAI API documentation.

It looks like some work outside of the Python C bindings.
« Last Edit: May 16, 2024, 09:54:23 PM by WiiLF23 »

Offline John Z

  • Member
  • *
  • Posts: 860
Re: TikToken
« Reply #6 on: May 18, 2024, 10:45:25 AM »
For OpenAI, a token is roughly a group of three or four characters. My solution is to divide the length of each word by three and add 1 if the word length is greater than three.


OpenAI tokenizer https://platform.openai.com/tokenizer

One way to get better performance (at least for English) is to look for common prefixes and common suffixes and break the word there first. This creates more efficient tokens, as I understand it.

To that end, the following link might be useful for the most common of each:

https://www.scholastic.com/content/dam/teachers/lesson-plans/migrated-files-in-body/prefixes_suffixes.pdf
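
A minimal C sketch of that pre-splitting idea, assuming tiny hand-written affix lists. The names PREFIXES, SUFFIXES, word_split and split_affixes are hypothetical, and the entries are just a few well-known English affixes chosen for illustration; a real table would be much longer. Note that TikToken itself uses byte pair encoding with a learned vocabulary, so this only illustrates the prefix/suffix approach and will not reproduce its tokens.

```c
#include <stddef.h>
#include <string.h>

/* ASSUMPTION: small sample lists for illustration only. */
static const char *const PREFIXES[] = { "un", "re", "dis", "pre", "mis" };
static const char *const SUFFIXES[] = { "ing", "ed", "ly", "ness", "able" };

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Lengths of the three pieces a word is broken into. */
typedef struct {
    size_t prefix_len;
    size_t stem_len;
    size_t suffix_len;
} word_split;

/* Split `word` into prefix + stem + suffix by plain string matching.
   The longest matching affix wins, and at least one character is
   always left for the stem. No linguistic checks are made, so
   "red" is split as "re" + "d". */
word_split split_affixes(const char *word)
{
    word_split s = { 0, 0, 0 };
    size_t len = strlen(word);
    size_t rest;
    size_t i;

    for (i = 0; i < COUNT(PREFIXES); ++i) {
        size_t n = strlen(PREFIXES[i]);
        if (n < len && n > s.prefix_len &&
            strncmp(word, PREFIXES[i], n) == 0)
            s.prefix_len = n;
    }

    rest = len - s.prefix_len;   /* characters left after the prefix */
    for (i = 0; i < COUNT(SUFFIXES); ++i) {
        size_t n = strlen(SUFFIXES[i]);
        if (n < rest && n > s.suffix_len &&
            strcmp(word + len - n, SUFFIXES[i]) == 0)
            s.suffix_len = n;
    }

    s.stem_len = len - s.prefix_len - s.suffix_len;
    return s;
}
```

With these lists, "unhelpfulness" is split as "un" + "helpful" + "ness", and a word with no listed affix, such as "cat", comes back as a single stem.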

John Z

Offline HellOfMice

Re: TikToken
« Reply #7 on: May 18, 2024, 01:21:19 PM »
I used the web interface to create a database of more than 174,000 English words and their tokens.
Thank you everybody for your help

--------------------------------
Kenavo