Universal Script Encoding (USE)

A compact binary encoding for multilingual text. Instead of one massive shared codepoint space (like Unicode/UTF), USE segments text by encoding context. Each segment declares which encoding it uses, then lists characters as compact integers within that encoding's symbol set.

Stream Format

A USE byte stream is a sequence of encoding blocks. Each block starts with an encoding ID, followed by character codes, followed by a stop code (0x00) that signals the end of that block's characters.

<encoding VLI><char VLI><char VLI>...<0x00><encoding VLI><char VLI>...<0x00>

Encoding IDs and character codes both use the same self-delimiting variable-length integer (VLI) format. The continuation bit (MSB) of each byte tells the decoder where one integer ends and the next begins. The stop code 0x00 (a single zero byte) tells the decoder that the current encoding block is finished and the next VLI is a new encoding ID. Character codes start from 1, since 0 is reserved as the stop code.

Variable-Length Integer Encoding

Both encoding identifiers and character codes use the same variable-length integer scheme. Each byte contributes 7 bits of value. The most significant bit (MSB) of each byte is a continuation flag.

bit layout per byte:
[C][V6][V5][V4][V3][V2][V1][V0]

C = continuation flag
    1 = another byte follows
    0 = this is the final byte
V = 7 value bits

Bytes are ordered big-endian. The first byte in the stream holds the most significant 7 bits. The last byte (where C=0) holds the least significant 7 bits.

Decoding Algorithm

Read bytes until one has C=0. Collect all the 7-bit payloads. The first payload read is the most significant.

value = 0
for each byte in sequence:
  value = (value << 7) | (byte & 0x7F)

Examples

Single-byte value (1):

00000001
C=0, payload=0000001 = 1
value = 1

Single-byte value (127):

01111111
C=0, payload=1111111 = 127
value = 127

Two-byte value (129):

10000001 00000001
byte 1: C=1, payload=0000001 = 1
byte 2: C=0, payload=0000001 = 1
value = (1 << 7) | 1 = 128 + 1 = 129

Two-byte value (128):

10000001 00000000
byte 1: C=1, payload=0000001 = 1
byte 2: C=0, payload=0000000 = 0
value = (1 << 7) | 0 = 128
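
The section above only covers decoding. As a rough sketch of the reverse direction, a value can be split into 7-bit groups, most significant first, with C=1 on every byte except the last. The names encodeVLI and toBits are hypothetical helpers, not part of the format definition.

```typescript
// Hypothetical helper (an assumption, not part of the spec): encode a
// non-negative integer as a USE variable-length integer.
function encodeVLI(value: number): number[] {
  if (!Number.isInteger(value) || value < 0) {
    throw new RangeError(`cannot encode ${value} as a VLI`)
  }

  // The final byte carries the least significant 7 bits with C=0.
  const bytes: number[] = [value % 128]
  let rest = Math.floor(value / 128)

  // Prepend the more significant 7-bit groups, each with C=1.
  while (rest > 0) {
    bytes.unshift((rest % 128) | 0x80)
    rest = Math.floor(rest / 128)
  }

  return bytes
}

// Format bytes as 8-digit binary strings, matching the examples.
const toBits = (bytes: number[]): string =>
  bytes.map((b) => ('0000000' + b.toString(2)).slice(-8)).join(' ')

console.log(toBits(encodeVLI(1)))   // 00000001
console.log(toBits(encodeVLI(127))) // 01111111
console.log(toBits(encodeVLI(128))) // 10000001 00000000
console.log(toBits(encodeVLI(129))) // 10000001 00000001
```

The sketch uses division rather than bit shifts so that values above 32 bits are not truncated by JavaScript's 32-bit bitwise operators.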

Capacity Per Byte Count

Bytes	Range
1	0 to 127
2	128 to 16,383
3	16,384 to 2,097,151
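
The ranges in the table follow from each byte carrying 7 value bits: n bytes cover every value below 128^n. A minimal sketch of that calculation, using a hypothetical vliByteLength helper that is not part of the format definition:

```typescript
// Hypothetical helper (an assumption, not part of the spec): number of
// bytes a VLI needs for a given value.
function vliByteLength(value: number): number {
  if (!Number.isInteger(value) || value < 0) {
    throw new RangeError(`invalid VLI value: ${value}`)
  }

  let length = 1
  // n bytes carry 7n value bits, so they hold values below 128^n.
  let limit = 128

  while (value >= limit) {
    length += 1
    limit *= 128
  }

  return length
}

console.log(vliByteLength(127))     // 1
console.log(vliByteLength(128))     // 2
console.log(vliByteLength(16383))   // 2
console.log(vliByteLength(16384))   // 3
console.log(vliByteLength(2097151)) // 3
```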

Reserved Values

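The single-byte VLI 0x00 is the stop code. It marks the end of a character sequence within an encoding block. A 0x00 byte that ends a multi-byte VLI (as in the two-byte encoding of 128) is part of that integer, not a stop code. Character codes start from 1. Encoding IDs also start from 1 (encoding ID 0 is unused).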

Local Encoding Map

The encoding identifier in the byte stream is a local integer, not a global ID. Each document or rendering context defines a map from local IDs to global encoding system IDs. This keeps inline encoding tags compact regardless of how many encoding systems exist globally.

document header (out of band):
  local 1 -> global "latin-use-v1"    (script_encoding_system.id = 42)
  local 2 -> global "arabic-use-v1"   (script_encoding_system.id = 87)
  local 3 -> global "katakana-use-v1" (script_encoding_system.id = 203)

The stream then uses 1, 2, 3 as encoding IDs. A rendering engine resolves these through the map to find the correct symbol table. This means you never run out of encoding keys in the stream. Even if thousands of encoding systems exist globally, a single document only needs local IDs for the few it actually uses.
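
One possible way to produce such a map when writing a document is to hand out local IDs in order of first use. The sketch below assumes that policy. The names assignLocalIds and LocalMapEntry are hypothetical, and the entry shape simply mirrors the localMap parameter that buildRenderingContext accepts.

```typescript
// Hypothetical header entry (an assumption): local ID -> global system slug.
type LocalMapEntry = { localId: number; systemSlug: string }

// Assign local IDs 1, 2, 3, ... in order of first use. ID 0 is never
// handed out, because encoding ID 0 is unused.
function assignLocalIds(slugsInOrder: string[]): LocalMapEntry[] {
  const seen = new Map<string, number>()

  for (const slug of slugsInOrder) {
    if (!seen.has(slug)) {
      seen.set(slug, seen.size + 1)
    }
  }

  return Array.from(seen, ([systemSlug, localId]) => ({ localId, systemSlug }))
}

const localMap = assignLocalIds([
  'latin-use-v1',
  'arabic-use-v1',
  'latin-use-v1', // repeated use keeps local ID 1
  'katakana-use-v1',
])

console.log(localMap.map((e) => `${e.localId}:${e.systemSlug}`).join(' '))
// 1:latin-use-v1 2:arabic-use-v1 3:katakana-use-v1
```

With this policy a document needs a 2-byte encoding ID only once it references more than 127 distinct encoding systems.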

Full Stream Example

Text containing Latin and Arabic characters, using local encoding IDs 1 and 2.

byte stream:
00000001 00000101 10000111 01101000 00001111 00000000
00000010 00000001 00101100 00000011

breakdown:
  00000001                encoding 1 (Latin)
  00000101                char 5
  10000111 01101000       char 1000 (C=1 on first byte, two-byte VLI)
  00001111                char 15
  00000000                stop code
  00000010                encoding 2 (Arabic)
  00000001                char 1
  00101100                char 44
  00000011                char 3

The two-byte character (1000) works as follows:

First byte 10000111: C=1 (continue), payload = 0000111 = 7
Second byte 01101000: C=0 (final), payload = 1101000 = 104
Value: (7 << 7) | 104 = 896 + 104 = 1000

Total: 10 bytes for 6 characters across 2 encodings. The final block does not need a trailing stop code if it is at the end of the stream, but one can be included for consistency.
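
The byte stream above can be reproduced with a small encoder. This is only a sketch: the Block type and the encodeStream function are hypothetical names, and encodeVLI is restated in compact form so the block runs on its own. The trailingStop flag models the optional stop code after the final block.

```typescript
// Hypothetical block shape (an assumption): one encoding ID plus its codes.
type Block = { encodingId: number; codes: number[] }

// Encode a non-negative integer as a big-endian 7-bit VLI.
function encodeVLI(value: number): number[] {
  const bytes: number[] = [value % 128]
  let rest = Math.floor(value / 128)
  while (rest > 0) {
    bytes.unshift((rest % 128) | 0x80)
    rest = Math.floor(rest / 128)
  }
  return bytes
}

function encodeStream(blocks: Block[], trailingStop = false): Uint8Array {
  const out: number[] = []

  blocks.forEach((block, index) => {
    if (block.encodingId < 1 || block.codes.some((code) => code < 1)) {
      throw new RangeError('encoding IDs and character codes start from 1')
    }

    out.push(...encodeVLI(block.encodingId))
    for (const code of block.codes) {
      out.push(...encodeVLI(code))
    }

    // Stop code between blocks; optional after the final block.
    if (index < blocks.length - 1 || trailingStop) {
      out.push(0x00)
    }
  })

  return new Uint8Array(out)
}

const stream = encodeStream([
  { encodingId: 1, codes: [5, 1000, 15] },
  { encodingId: 2, codes: [1, 44, 3] },
])

console.log(stream.length)              // 10
console.log(Array.from(stream).join(' ')) // 1 5 135 104 15 0 2 1 44 3
```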

Comparison to UTF

Property	UTF-8	USE
Codepoint space	Shared, 1.1M+	Per-encoding, compact
Bytes per Latin char	1 (ASCII)	1 (+ encoding header)
Bytes per CJK char	3 (4 outside the BMP)	1-2 (if < 16K symbols)
Bytes per Arabic char	2	1 (if < 128 symbols)
Script switching cost	None (implicit)	1 stop byte + 1 encoding VLI
Extensibility	Unicode consortium	Any encoding system

USE is more compact for text that stays within one script for extended runs. It is less compact for text that switches scripts every few characters, due to the stop code and encoding ID overhead (2 bytes minimum per switch).
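
The break-even point can be worked out with simple arithmetic. The sketch below assumes a block whose encoding ID and character codes each fit in one VLI byte and that ends with a stop code. The function names are hypothetical.

```typescript
// Bytes for one USE block under the assumptions above:
// 1 (encoding ID) + n (characters) + 1 (stop code).
function useRunBytes(charCount: number): number {
  return 1 + charCount + 1
}

// Bytes for the same run in UTF-8, given a fixed width per character
// (2 for Arabic letters, 3 for most CJK ideographs).
function utf8RunBytes(charCount: number, bytesPerChar: number): number {
  return charCount * bytesPerChar
}

for (const n of [1, 2, 3, 10]) {
  console.log(n, useRunBytes(n), utf8RunBytes(n, 2))
}
// 1 3 2
// 2 4 4
// 3 5 6
// 10 12 20
```

Under these assumptions an Arabic run breaks even at 2 characters and USE is smaller from 3 characters on. For ASCII Latin text, where UTF-8 already uses 1 byte per character, the 2-byte block overhead is never recovered.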

Advantages Over Unicode

Unicode allocates a single shared codepoint space across all scripts. Every script gets a reserved block of integers, and a centralized consortium decides which ranges belong to which script. This creates several structural problems that USE avoids.

No codepoint ceiling. Unicode has a hard cap of 1,114,112 codepoints (17 planes of 65,536). That number was chosen decades ago, and part of it is already allocated. USE has no global cap. Each encoding system has its own independent integer space. Because code 0 is reserved as the stop code, a script with 50 symbols uses codes 1-50, and a script with 80,000 symbols uses codes 1-80,000. New scripts never compete with existing ones for space.

No central authority required. Adding a script to Unicode requires a formal proposal to the Unicode Consortium, review cycles, and versioned releases. A new script cannot be encoded until the consortium approves it. USE lets anyone define an encoding system for any script at any time. Register a script_encoding_system row with a slug and it exists. Minority scripts, historical scripts, and constructed scripts get first-class support immediately.

No wasted range allocations. Unicode pre-allocates large contiguous blocks even for scripts with few characters, because blocks cannot overlap and must leave room for future additions. Latin has scattered blocks across multiple planes. CJK Unified Ideographs alone occupy over 90,000 codepoints. USE assigns compact sequential codes per encoding system, so there are no gaps and no wasted ranges.

Compact per-script representation. In UTF-8, a Latin character is 1 byte, but an Arabic character is 2 bytes and a CJK character is 3 bytes, because their Unicode codepoints fall in higher ranges. In USE, every script's character codes start from 1. Arabic characters (28 base letters) fit in 1 byte each. CJK characters fit in at most 2 bytes each as long as their codes do not exceed 16,383 (the largest value a 2-byte VLI can hold). An encoding with 80,000+ symbols needs 3 bytes for codes above 16,383, so the most frequent characters benefit from the lowest codes. The byte cost reflects the script's actual symbol count, not its position in a global numbering scheme.

Infinite extensibility. New scripts, new versions of existing scripts, and specialized encodings (e.g. a compact encoding for the 500 most common Kanji) can be added without any coordination. The VLI format for encoding IDs supports billions of encoding systems. The local encoding map means a document only carries the cost of the encodings it actually uses.

Version-safe. Unicode codepoint assignments are permanent. If a mistake is made, the codepoint is stuck. USE encoding systems are versioned by slug (e.g. "latin-use-v1", "latin-use-v2"). A new version can reorder codes for better compression or fix errors without breaking documents that reference the old version.

Rendering Engine

A rendering engine (browser, JavaScript runtime, native application) interprets a USE byte stream by combining three pieces of data: the byte stream itself, a local encoding map, and a set of symbol tables that map USE codes to Unicode codepoints, which the font assigned to each script then renders as glyphs.

Data Model

/**
 * Maps a USE integer code to the information needed to render it.
 * One entry per symbol in an encoding system. Loaded from the
 * script_symbol_encoding table or shipped as a static asset.
 */

type SymbolEntry = {
  code: number
  unicodeCode: number
  name: string | null
}

/**
 * A resolved encoding system ready for rendering. Contains the
 * symbol lookup table (USE code -> Unicode codepoint) and the font
 * family to use for rendering glyphs in this script.
 */

type ResolvedEncoding = {
  systemId: number
  scriptSlug: string
  fontFamily: string
  symbols: Map<number, SymbolEntry>
}

/**
 * The rendering context for a document. Maps local encoding IDs
 * (the small integers that appear in the byte stream) to resolved
 * encoding systems with their symbol tables and font assignments.
 */

type RenderingContext = {
  encodings: Map<number, ResolvedEncoding>
}

/**
 * A decoded character ready for rendering. Produced by the decoder
 * from the raw byte stream.
 */

type DecodedCharacter = {
  unicodeCode: number
  fontFamily: string
  encodingId: number
  useCode: number
}

Decoding and Rendering Pipeline

The rendering engine reads VLIs from the byte stream one at a time. It alternates between two states: expecting an encoding ID, and reading characters. When it reads a stop code (0x00), it transitions back to expecting an encoding ID.

/**
 * Read one variable-length integer from the byte stream starting
 * at the given offset. Returns the decoded value and the new offset.
 */

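function readVLI(
  bytes: Uint8Array,
  offset: number,
): { value: number; offset: number } {
  let value = 0
  let byte: number

  do {
    // Fail loudly on a truncated VLI instead of reading past the end.
    if (offset >= bytes.length) {
      throw new RangeError('Truncated VLI at end of stream')
    }
    byte = bytes[offset++]
    // Multiply instead of shifting: << operates on signed 32-bit
    // integers and would overflow for VLIs of 5 bytes or more.
    value = value * 128 + (byte & 0x7f)
  } while (byte & 0x80)

  return { value, offset }
}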

/**
 * Decode a USE byte stream into renderable characters using the
 * provided rendering context. The context maps local encoding IDs
 * to resolved encoding systems with symbol tables and font families.
 *
 * The stream alternates between encoding IDs and character codes.
 * A stop code (0x00) ends each character sequence and signals that
 * the next VLI is an encoding ID.
 */

function decode(
  bytes: Uint8Array,
  context: RenderingContext,
): Array<DecodedCharacter> {
  const result: Array<DecodedCharacter> = []
  let offset = 0
  let currentEncodingId = -1
  let currentEncoding: ResolvedEncoding | null = null
  let expectingEncoding = true

  while (offset < bytes.length) {
    const vli = readVLI(bytes, offset)
    offset = vli.offset

    if (expectingEncoding) {
      // This VLI is an encoding ID.
      currentEncodingId = vli.value
      currentEncoding = context.encodings.get(vli.value) ?? null
      expectingEncoding = false
      continue
    }

    // Stop code: end of character sequence for this encoding block.
    if (vli.value === 0) {
      expectingEncoding = true
      continue
    }

    // Character code in the current encoding.
    if (!currentEncoding) {
      continue
    }

    const symbol = currentEncoding.symbols.get(vli.value)

    if (symbol) {
      result.push({
        unicodeCode: symbol.unicodeCode,
        fontFamily: currentEncoding.fontFamily,
        encodingId: currentEncodingId,
        useCode: vli.value,
      })
    }
  }

  return result
}
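
For debugging or tests it can be useful to inspect a stream without any symbol tables. The sketch below reduces the same two-state loop to a function that only splits a stream into blocks of raw codes. The names splitBlocks and RawBlock are hypothetical, and a trailing VLI whose final byte is missing is simply ignored here.

```typescript
// Hypothetical result shape (an assumption): one encoding ID plus raw codes.
type RawBlock = { encodingId: number; codes: number[] }

function splitBlocks(bytes: Uint8Array): RawBlock[] {
  const blocks: RawBlock[] = []
  let current: RawBlock | null = null // null = expecting an encoding ID
  let value = 0

  for (let i = 0; i < bytes.length; i++) {
    const byte = bytes[i]
    value = value * 128 + (byte & 0x7f)

    if (byte & 0x80) {
      continue // C=1: more bytes belong to this VLI
    }

    if (current === null) {
      // First VLI of a block is the encoding ID.
      current = { encodingId: value, codes: [] }
      blocks.push(current)
    } else if (value === 0) {
      // Stop code: the next VLI is an encoding ID.
      current = null
    } else {
      current.codes.push(value)
    }

    value = 0
  }

  return blocks
}

const example = new Uint8Array([
  0x01, 0x05, 0x87, 0x68, 0x0f, 0x00, // encoding 1: 5, 1000, 15, stop
  0x02, 0x01, 0x2c, 0x03,             // encoding 2: 1, 44, 3
])

console.log(JSON.stringify(splitBlocks(example)))
// [{"encodingId":1,"codes":[5,1000,15]},{"encodingId":2,"codes":[1,44,3]}]
```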

Building the Rendering Context

The rendering context is constructed before decoding begins. The application fetches the symbol tables for each encoding system referenced in the document header, and assigns a font family to each one.

/**
 * Build a rendering context from a document's encoding map and the
 * symbol data from the database (or a static asset bundle).
 */

function buildRenderingContext(
  localMap: Array<{ localId: number; systemSlug: string }>,
  systems: Map<
    string,
    {
      systemId: number
      scriptSlug: string
      symbols: Array<SymbolEntry>
    }
  >,
  fontMap: Map<string, string>,
): RenderingContext {
  const encodings = new Map<number, ResolvedEncoding>()

  for (const entry of localMap) {
    const system = systems.get(entry.systemSlug)

    if (!system) {
      continue
    }

    const symbolLookup = new Map<number, SymbolEntry>()

    for (const symbol of system.symbols) {
      symbolLookup.set(symbol.code, symbol)
    }

    encodings.set(entry.localId, {
      systemId: system.systemId,
      scriptSlug: system.scriptSlug,
      fontFamily: fontMap.get(system.scriptSlug) ?? 'sans-serif',
      symbols: symbolLookup,
    })
  }

  return { encodings }
}

Font Resolution

Each encoding system maps to a script, and each script maps to one or more fonts. The fontMap parameter in buildRenderingContext provides this mapping. In a browser context, these are CSS font family names. The rendering engine sets the font family per character (or per run of characters in the same encoding) so the correct glyphs are displayed.

// Example font map: script slug -> CSS font family
const fontMap = new Map<string, string>([
  ['latin', 'Inter, sans-serif'],
  ['arabic', 'Noto Naskh Arabic, serif'],
  ['katakana', 'Noto Sans JP, sans-serif'],
  ['devanagari', 'Noto Sans Devanagari, sans-serif'],
  ['han', 'Noto Sans SC, sans-serif'],
])

The decoded characters carry their font family, so the renderer can produce styled spans or text nodes grouped by font.

// Group decoded characters into runs by font for efficient rendering.
function groupByFont(
  chars: Array<DecodedCharacter>,
): Array<{ fontFamily: string; text: string }> {
  const runs: Array<{ fontFamily: string; text: string }> = []
  let currentFont = ''
  let currentChars: Array<string> = []

  for (const char of chars) {
    if (char.fontFamily !== currentFont) {
      if (currentChars.length > 0) {
        runs.push({
          fontFamily: currentFont,
          text: currentChars.join(''),
        })
      }
      currentFont = char.fontFamily
      currentChars = []
    }
    currentChars.push(String.fromCodePoint(char.unicodeCode))
  }

  if (currentChars.length > 0) {
    runs.push({ fontFamily: currentFont, text: currentChars.join('') })
  }

  return runs
}

Each run becomes a <span> with the appropriate font-family style, or a text node in a native rendering context with the matching font applied.
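
In a browser context, turning runs into markup could look like the following sketch. The names runsToHtml, escapeHtml, and FontRun are hypothetical. The run shape matches what groupByFont returns, and the escaping covers only the characters that matter in HTML text and double-quoted attribute values.

```typescript
// Same shape as the runs returned by groupByFont.
type FontRun = { fontFamily: string; text: string }

// Escape characters that are significant in HTML text and attribute values.
function escapeHtml(input: string): string {
  return input
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
}

// Emit one <span> per run with an inline font-family style.
function runsToHtml(runs: FontRun[]): string {
  return runs
    .map(
      (run) =>
        `<span style="font-family: ${escapeHtml(run.fontFamily)}">` +
        `${escapeHtml(run.text)}</span>`,
    )
    .join('')
}

console.log(runsToHtml([{ fontFamily: 'Inter, sans-serif', text: 'a<b' }]))
// <span style="font-family: Inter, sans-serif">a&lt;b</span>
```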

lancejpollard/spec.md
