The pre-tokeniser tokeniser, documented.
Tokenscript tokenises your English before your tokeniser tokenises it. This guide explains how, why, and — at a sufficient level of philosophical abstraction — whether.
POST /v1/tokenise and POST /api/waitlist are live and wired to production infrastructure. The remaining /v1/* endpoints still return 501 Not Implemented; we are building them in priority order of how cool they would be. Join the waitlist to be notified when they arrive.
#Introduction
Every modern LLM ships with a tokeniser. The tokeniser shatters your carefully-chosen English into subword fragments — " vector", "ized", "-" — and hands them to the model as integer IDs. This is the part you know.
Tokenscript operates one layer earlier. We pre-tokenise your input, producing an intermediate representation (the tokenscript) that your downstream tokeniser can then re-tokenise with higher (or at least differently-shaped) fidelity. Think of it as a tokenised token of a token.
The above is the canonical example. It is also, coincidentally, the only example we have tested.
#Quickstart
You will need an API key (see Authentication), a terminal, and a willingness to suspend disbelief about the utility of what you are about to do.
```shell
# pre-tokenise a string — edit the JSON below, then run it
curl https://tokenscript.ai/v1/tokenise \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKENSCRIPT_API_KEY" \
  -d '{"input":"Write English, get vectorized-tokens.","target":"cl100k_base"}'
```
Example response:
```json
{
  "id": "ts_01H9QXR...",
  "tokens": ["Write", " English", ",", " get", " vector", "ized", "-", "tok", "ens", "."],
  "token_ids": [8144, 6498, 11, 636, 4724, 1534, 12, 61528, 729, 13],
  "target": "cl100k_base",
  "usage": { "pre_tokens": 10, "post_tokens": 10, "Δ": 0 }
}
```
The token_ids field matches cl100k_base byte-pair-merge IDs exactly. Token boundaries are produced by the canonical cl100k_base pre-tokenisation regex, so the splits themselves are honest. The Δ field reports the delta between pre- and post-tokenisation counts. A Δ of zero means the pre-tokenisation was informative but not intrusive. This is the optimum.
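For intuition, those boundaries come from a regex pass that runs before any byte-pair merges. The sketch below uses a deliberately simplified pattern (not the real cl100k_base regex, which handles Unicode categories and contractions far more carefully); BPE then subdivides pieces like " vectorized" into " vector" and "ized".

```python
import re

# Simplified stand-in for a pre-tokenisation pattern: contractions,
# words with an optional leading space, digit runs, punctuation runs.
SPLIT = re.compile(r"'(?:s|t|re|ve|m|ll|d)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_split(text: str) -> list[str]:
    return SPLIT.findall(text)

print(pre_split("Write English, get vectorized-tokens."))
# → ['Write', ' English', ',', ' get', ' vectorized', '-', 'tokens', '.']
```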
#Installation
Install the SDK for your runtime. The installation commands are real; the packages are not.
```shell
# Python (3.9+)
pip install tokenscript

# Node / TypeScript
npm install @tokenscript/sdk
# or
pnpm add @tokenscript/sdk

# Go
go get github.com/tokenscript/tokenscript-go

# Rust
cargo add tokenscript

# Elixir (declare :tokenscript in your mix.exs deps, then run)
mix deps.get
```
#Authentication
Tokenscript uses bearer-token authentication. API keys are 40-character strings prefixed with tsk_live_ or tsk_test_. Treat them like passwords.
```shell
export TOKENSCRIPT_API_KEY="tsk_live_01H9QXR6QP8YTPBQWQH3F…"
```
Every request must include:
Authorization: Bearer $TOKENSCRIPT_API_KEY
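The same header can be attached from Python's standard library. This sketch only builds the request object (nothing is sent), and tokenise_request is our hypothetical helper name, not part of any SDK:

```python
import json
import os
import urllib.request

def tokenise_request(payload: dict) -> urllib.request.Request:
    # Build (but do not send) an authenticated request to /v1/tokenise.
    key = os.environ.get("TOKENSCRIPT_API_KEY", "tsk_test_placeholder")
    return urllib.request.Request(
        "https://tokenscript.ai/v1/tokenise",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )

req = tokenise_request({"input": "hello", "target": "cl100k_base"})
print(req.get_header("Authorization").startswith("Bearer "))  # → True
```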
Create and revoke keys at dashboard.tokenscript.ai/keys. We scan public GitHub every 4 hours; if we find your key in a commit we'll revoke it and email you with a short poem.
#Pre-tokenisation: a very brief theory
Let T be a tokeniser — a function T: Σ* → ℤ^n mapping a string over alphabet Σ to a sequence of integer token IDs. Conventional wisdom holds that T should be applied directly to user input.
Tokenscript proposes instead a pre-tokeniser P such that the effective pipeline becomes:
output = T(P(input))
where P preserves meaning but rearranges substring boundaries into a form that the subsequent T finds more agreeable. The mathematical name for this property is "vibes."
In practice we implement P as the identity function (P(x) = x) but with better branding.
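Since the text defines P(x) = x, the reference implementation fits in a few lines; T below is any tokeniser you like (a toy whitespace split stands in here):

```python
def pre_tokenise(x: str) -> str:
    # P: preserves meaning, substring boundaries, and everything else,
    # on account of being the identity function.
    return x

def pipeline(tokenise, text: str):
    # Effective pipeline: output = T(P(input))
    return tokenise(pre_tokenise(text))

# Toy whitespace tokeniser standing in for T:
print(pipeline(str.split, "Write English, get vectorized-tokens."))
# → ['Write', 'English,', 'get', 'vectorized-tokens.']
```

Idempotence (the FAQ's "idempotent clause") follows immediately: pre_tokenise(pre_tokenise(x)) equals pre_tokenise(x) for every x.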
Why it "works"
Empirically, we observe:
- Token counts are preserved within ±0 tokens (95th percentile).
- Latency increases by a modest 80–240ms per request, which we attribute to the network.
- Developer satisfaction is qualitatively high among the one developer who has tried it.
#Vectorisation
Once pre-tokenised, your tokenscript can be projected into a 1536-dimensional embedding space via POST /v1/vectorise. The resulting vectors are:
- ℓ²-normalised
- Indistinguishable from noise under all known statistical tests
- Shaped like a hypersphere, which we find cool
```json
{
  "vectors": [
    [0.213, -0.847, 0.119, "…", 0.004],
    [-0.551, 0.302, 0.974, "…", -0.118]
  ],
  "dim": 1536,
  "dtype": "float32",
  "normalised": true
}
```
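ℓ²-normalisation means each vector is scaled to unit length, which is what places them on that hypersphere. A minimal sketch (random Gaussian coordinates, then normalise — which, incidentally, also reproduces the "indistinguishable from noise" property):

```python
import math
import random

def l2_normalise(v: list[float]) -> list[float]:
    # Scale v so that the square root of the sum of squares equals 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

random.seed(0)
vec = l2_normalise([random.gauss(0, 1) for _ in range(1536)])
print(len(vec), round(sum(x * x for x in vec), 6))  # → 1536 1.0
```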
#Token drift
"Token drift" describes the phenomenon by which a pre-tokenised string, once passed through a downstream tokeniser and then a second model, diverges from its original embedding by an amount greater than or equal to zero.
We bound this drift using a proprietary technique called not doing anything. In benchmark tests, tokenscripts subjected to our drift controls exhibit 0% drift relative to the untreated baseline. We are preparing a paper.
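In code, drift is just a distance between the original embedding and the round-tripped one, so the "greater than or equal to zero" bound is exactly the non-negativity of a norm. A sketch:

```python
import math

def drift(original: list[float], round_tripped: list[float]) -> float:
    # Euclidean distance between the two embeddings; always >= 0.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, round_tripped)))

# The drift-control technique described above leaves the input untouched,
# so the round-tripped embedding equals the original and drift is exactly 0.
e = [0.6, 0.8]
print(drift(e, e))  # → 0.0
```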
#Supported tokenisers
Tokenscript can pre-tokenise strings destined for any of the following downstream tokenisers. Coverage is not exhaustive; coverage is aspirational.
| Tokeniser | Family | Status |
|---|---|---|
| cl100k_base | OpenAI | Stable |
| o200k_base | OpenAI | Stable |
| claude | Anthropic | Stable |
| gemini | Google | Beta |
| llama3 | Meta | Beta |
| bpe_32k_en | Generic | Deprecated |
| wordpiece_legacy | Historical | Spiritually supported |
| english | Pre-digital | Always has been |
#API reference
Base URL: https://api.tokenscript.ai. All endpoints speak JSON. All timestamps are RFC 3339. All vectors are row-major.
POST /v1/tokenise
Pre-tokenise a string for the given target tokeniser.
```shell
curl -X POST https://tokenscript.ai/v1/tokenise \
  -H 'content-type: application/json' \
  -d '{"input":"hello","target":"cl100k_base"}'
```
| Parameter | Type | Description |
|---|---|---|
| input (required) | string | The English to be pre-tokenised. Up to 1 MiB. |
| target (required) | string | Downstream tokeniser name. See Supported tokenisers. |
| mode (optional) | enum | "pre" (default), "post", "meta", "vibes". |
| stream (optional) | bool | Emit tokens as they are produced. Default false. |
| seed (optional) | integer | Deterministic seed. The function is deterministic either way. Accepted out of politeness. |
POST /v1/vectorise
Turn a tokenscript (or raw string) into a 1536-dim float32 vector. Accepts either input or tokenscript_id.
GET /v1/scripts/{id}
Retrieve a previously-computed tokenscript by ID. Scripts are retained for 30 days, then politely forgotten.
POST /api/waitlist
Register an email address for product launch notifications.
| Parameter | Type | Description |
|---|---|---|
| email (required) | string | A valid RFC 5322-adjacent email address. Up to 254 characters. |
```shell
curl -X POST https://tokenscript.ai/api/waitlist \
  -H "Content-Type: application/json" \
  -d '{"email":"[email protected]"}'

# → 200 OK
# { "ok": true, "already": false }
# → 400 Bad Request
# { "error": "invalid email" }
```
#Errors
Tokenscript uses conventional HTTP status codes. Error bodies are JSON with shape { "error": "<message>" }.
| Status | Meaning | Typical cause |
|---|---|---|
| 200 | OK | You were correct. |
| 400 | Bad request | Body not JSON, invalid email, unsupported tokeniser. |
| 401 | Unauthorised | Missing or invalid API key. |
| 402 | Payment required | Reserved. We have no billing. |
| 405 | Method not allowed | You sent GET where POST was expected. |
| 418 | I'm a teapot | You are correct. We are. |
| 429 | Rate limited | You exceeded the quota. See Rate limits. |
| 501 | Not implemented | The /v1/* endpoints, for now. |
| 522 | Existential timeout | The pre-tokeniser failed to locate meaning. |
#Rate limits
During alpha we enforce a soft limit of 100 requests per minute per key. Bursts up to 500 are permitted if accompanied by a compelling reason, submitted in the X-Reason header.
X-Reason: writing a compiler for my girlfriend's birthday
Rate limit state is exposed via response headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 94
X-RateLimit-Reset: 1776636492
#SDKs
Python
```python
from tokenscript import Tokenscript

ts = Tokenscript(api_key="tsk_live_...")
result = ts.tokenise(
    input="Write English, get vectorized-tokens.",
    target="cl100k_base",
)
print(result.tokens)
# ['Write', ' English', ',', ' get', ' vector', 'ized', '-', 'tok', 'ens', '.']
```
Node / TypeScript
```typescript
import { Tokenscript } from "@tokenscript/sdk";

const ts = new Tokenscript({ apiKey: process.env.TOKENSCRIPT_API_KEY! });
const result = await ts.tokenise({
  input: "Write English, get vectorized-tokens.",
  target: "cl100k_base",
});
console.log(result.tokens);
```
curl
See Quickstart. curl is, always has been, and will always be, fully supported.
#Constants
These values are stable across the lifetime of the /v1 API. We will publish a changelog entry if any of them move.
MEANING_COEFFICIENT is currently hardcoded and cannot be tuned from the client. We are tracking the request to expose it on issue #42.
#Webhooks
Tokenscript can POST events to a URL of your choosing when interesting things happen (tokenise.completed, script.expired, waitlist.joined). Events are signed with HMAC-SHA256 using your webhook secret.
```http
POST /hooks/tokenscript HTTP/1.1
Host: your-app.example.com
Tokenscript-Signature: t=1776636492,v1=5c4f9...
Content-Type: application/json

{
  "id": "evt_01H9QXR…",
  "type": "waitlist.joined",
  "data": { "email": "[email protected]", "country": "US" }
}
```
Verify signatures like you would with any mature webhook product. Do not skip verification. We will find out.
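For the verification step, the Python sketch below assumes a Stripe-style scheme in which the signed payload is the timestamp, a dot, and the raw request body; Tokenscript's actual signing scheme is not specified here, so treat that payload format as an assumption and check it against your webhook settings:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    # Parse "t=<timestamp>,v1=<hex digest>" from Tokenscript-Signature.
    parts = dict(p.split("=", 1) for p in header.split(","))
    timestamp, expected = parts["t"], parts["v1"]
    # Assumed signing scheme: HMAC-SHA256 over "<timestamp>.<raw body>".
    mac = hmac.new(secret, timestamp.encode() + b"." + body, hashlib.sha256)
    # Constant-time comparison, so timing attacks learn nothing.
    return hmac.compare_digest(mac.hexdigest(), expected)

secret = b"whsec_example"
body = b'{"type":"waitlist.joined"}'
sig = hmac.new(secret, b"1776636492." + body, hashlib.sha256).hexdigest()
print(verify_signature(secret, body, f"t=1776636492,v1={sig}"))  # → True
```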
#FAQ
Is Tokenscript a real product?
The waitlist endpoint is real and stores your email in production infrastructure. The /v1/* endpoints are, presently, performance art. Whether the two together constitute a product is a matter of some debate.
Will this improve my model's accuracy?
Almost certainly not. However, it will not hurt it, which is more than can be said for several well-funded alternatives.
Does pre-tokenisation compose with itself?
Yes. Tokenscript-of-tokenscript is a valid operation and returns the original tokenscript. We call this the idempotent clause. It is the only theorem we have proved.
Can I self-host?
Not yet. In the meantime, you can simulate self-hosting by writing a function def tokenise(s): return s.split() and calling it instead.
What happens if I use Tokenscript on a non-English input?
The input is politely returned unchanged. The Δ field will contain a small number reflecting our feelings.
Where is data stored?
Waitlist entries are stored in a globally-distributed key-value store. API logs, once we have an API, will be stored in us-west and eu-west. Nothing is stored forever except the memory of having read this page.
#Changelog
| Version | Date | Notes |
|---|---|---|
| 0.0.1-alpha | 2026-04-19 | Initial public surface. Waitlist endpoint live. Docs published. Nothing else. |
| 0.0.0 | The Before Times | Conceived in a group chat. No code. |