@basilesimon
Last active February 12, 2026 12:58

Extracting data with the Webrecorder browser extension

On ChatGPT (web)

Key Findings

  • ✅ Webrecorder's browser extension effectively preserves ChatGPT web conversations.
  • All data is available for each actively viewed conversation
  • Full conversation structure with parent/child relationships
  • Chronological ordering via timestamps, which act as temporal anchors with microsecond precision

Limitations

  • ❌ Replay does not work in replayweb.page
  • ❌ Privacy leak: Conversation list is always captured (titles, IDs, timestamps)
  • Must create separate archives for each conversation of interest
  • Timestamps come from the client, and capture takes place in the browser. Worth checking whether they can be forged by changing the clock on the local machine before capture.

Method: Browser extension. In the extension settings, opt in to "archive cookies" and "archive local storage".

Important Notes

  1. ⚠️ Careful: From Webrecorder's UI: "Sharing content created with this setting enabled may compromise your login credentials. Archived items created with this settings should generally be kept private!"

  2. A conversation's detailed messages are only captured if the archive was created while actively viewing that specific conversation. Otherwise, only metadata appears in the conversation list.

  3. Fortunately, a conversation you have not explicitly opened contributes only metadata (e.g. title and time) to the archive, not its content.


Locations

File: archive/data.warc.gz → decompressed to data.warc
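
As a rough illustration of how the captured API responses could be pulled out of that file, here is a minimal standard-library sketch. It assumes the response bodies were stored as plain JSON (not chunked or compressed inside the record), which may not hold for every capture; a dedicated WARC library such as warcio handles those cases. The function names are hypothetical, and the URL filter simply matches the backend-api conversation endpoints described in this section.

```python
import gzip
import json
import zipfile
from typing import Dict, Iterator, Tuple


def load_warc_from_wacz(wacz_path: str, member: str = "archive/data.warc.gz") -> bytes:
    """Read the gzipped WARC out of the WACZ (a ZIP file) and decompress it."""
    with zipfile.ZipFile(wacz_path) as zf:
        return gzip.decompress(zf.read(member))


def iter_warc_records(raw: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, block) for each record in an uncompressed WARC."""
    pos = 0
    while True:
        start = raw.find(b"WARC/", pos)
        if start == -1:
            return
        header_end = raw.find(b"\r\n\r\n", start)
        if header_end == -1:
            return
        lines = raw[start:header_end].decode("utf-8", "replace").split("\r\n")
        headers: Dict[str, str] = {}
        for line in lines[1:]:  # lines[0] is the "WARC/1.x" version line
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers.get("content-length", "0"))
        block_start = header_end + 4
        yield headers, raw[block_start:block_start + length]
        pos = block_start + length


def conversation_payloads(raw: bytes) -> Iterator[Tuple[str, dict]]:
    """Yield (target URI, parsed JSON) for captured backend-api conversation responses."""
    for headers, block in iter_warc_records(raw):
        if headers.get("warc-type") != "response":
            continue
        uri = headers.get("warc-target-uri", "")
        if "/backend-api/conversation" not in uri:
            continue
        # The record block is the raw HTTP response: status line, headers, blank line, body.
        _, _, body = block.partition(b"\r\n\r\n")
        try:
            yield uri, json.loads(body)
        except ValueError:
            continue  # body was not plain JSON (assumption above did not hold)
```

Typical use would be `for uri, data in conversation_payloads(load_warc_from_wacz("my-archive.wacz")): ...`, where the file name is a placeholder.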

API Endpoints: https://chatgpt.com/backend-api/conversations (conversation list) and https://chatgpt.com/backend-api/conversation/{conversation_id} (single conversation)

Conversation list structure:

{
  "items": [
    {
      "id": "conversation-uuid",
      "title": "Conversation Title",
      "create_time": "2026-01-30T15:14:58.016910Z",
      "update_time": "2026-01-30T15:18:33.779223Z",
      "is_archived": false,
      ...
    }
  ],
  "total": 29,
  "limit": 28,
  "offset": 0
}
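
To make the privacy point concrete, the sketch below turns a conversation-list payload of the shape shown above into a small table of what leaks for every conversation (ID, title, timestamps). It is an illustration only: the function names are hypothetical, and it assumes the timestamps are ISO 8601 strings with a trailing "Z" as in the sample.

```python
from datetime import datetime


def parse_iso_utc(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; older Pythons reject a trailing 'Z'."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def summarize_conversation_list(payload: dict) -> list:
    """Return one row per listed conversation, most recently updated first."""
    rows = []
    for item in payload.get("items", []):
        rows.append({
            "id": item["id"],
            "title": item.get("title"),
            "created": parse_iso_utc(item["create_time"]),
            "updated": parse_iso_utc(item["update_time"]),
        })
    rows.sort(key=lambda row: row["updated"], reverse=True)
    return rows
```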

Complete conversation structure:

{
  "conversation_id": "uuid",
  "title": "Conversation Title",
  "create_time": 1234567890.123,
  "update_time": 1234567891.456,
  "mapping": {
    "node-id-1": {
      "id": "node-id-1",
      "message": {
        "id": "message-uuid",
        "author": {
          "role": "user"
        },
        "content": {
          "content_type": "text",
          "parts": [
            "This is the actual message text"
          ]
        },
        "create_time": 1234567890.123
      },
      "parent": "parent-node-id",
      "children": ["child-node-id"]
    }
  }
}

Key Fields

  • mapping: Dictionary of conversation nodes (messages and system nodes)
  • message.author.role: "user" (your prompts) or "assistant" (ChatGPT responses)
  • message.content.parts: Array containing the actual message text
  • create_time: Unix timestamp for chronological ordering
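
A minimal sketch of how these fields combine into a readable transcript, assuming a conversation payload of the shape shown above. The function name is hypothetical; nodes without a message are assumed to be structural nodes and are skipped, and non-string entries in parts are ignored.

```python
from datetime import datetime, timezone


def flatten_messages(conversation: dict) -> list:
    """Return the messages in a conversation's mapping, oldest first."""
    rows = []
    for node_id, node in conversation.get("mapping", {}).items():
        message = node.get("message")
        if not message:
            continue  # assumed structural node (no message attached)
        content = message.get("content") or {}
        parts = content.get("parts") or []
        ts = message.get("create_time")
        rows.append({
            "node_id": node_id,
            "role": (message.get("author") or {}).get("role"),
            "text": "\n".join(p for p in parts if isinstance(p, str)),
            "create_time": ts,
            # Human-readable UTC form of the Unix timestamp, if present.
            "created_utc": (
                datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
                if ts is not None else None
            ),
        })
    # Messages without a timestamp sort last.
    rows.sort(key=lambda r: (r["create_time"] is None, r["create_time"] or 0.0))
    return rows
```

Sorting by create_time gives a flat chronological view; it does not distinguish between alternative branches of the tree, which requires following the parent/children links instead.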

Alternative: Pages Metadata

Location

File: pages/pages.jsonl (JSONL format - one JSON object per line)

Structure

{
  "title": "ChatGPT",
  "url": "https://chatgpt.com/",
  "id": "page-session-id",
  "ts": "2026-02-12T09:09:40.044Z",
  "text": "Extracted text from page including:\nYou said:\nShow me a cat picture\nChatGPT said:\n..."
}
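
The extracted page text can serve as a fallback when the API responses are unavailable. The sketch below reads page records and splits the text field on the "You said:" / "ChatGPT said:" markers seen in the sample above. Both function names are hypothetical; skipping lines without a url field assumes the file may start with a header line, and the splitting is a heuristic that will misfire if a message itself contains one of the markers on its own line.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# Marker lines observed in the extracted page text, mapped to a role label.
MARKERS = {"You said:": "user", "ChatGPT said:": "assistant"}


def iter_pages(lines: Iterable[str]) -> Iterator[dict]:
    """Yield page records from pages.jsonl lines, skipping non-page lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if "url" not in record:
            continue  # assumed header line rather than a page entry
        yield record


def split_turns(text: str) -> List[Tuple[str, str]]:
    """Split extracted page text into (role, text) turns using the markers."""
    turns: List[Tuple[str, str]] = []
    speaker = None
    buffer: List[str] = []
    for line in text.splitlines():
        marker = MARKERS.get(line.strip())
        if marker is not None:
            if speaker is not None:
                turns.append((speaker, "\n".join(buffer).strip()))
            speaker, buffer = marker, []
        elif speaker is not None:
            buffer.append(line)
    if speaker is not None:
        turns.append((speaker, "\n".join(buffer).strip()))
    return turns
```

Unlike the API payload, this text carries no per-message IDs or timestamps, so it is better suited to quick inspection than to verification.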

Data Integrity & Authenticity Anchors

WACZ archives contain multiple layers of cryptographic and temporal anchors that can be used to verify the authenticity and integrity of conversations and individual messages.

  • WARC-Level Integrity (per HTTP transaction): each WARC record contains cryptographic digests and metadata for verification

  • File-Level Integrity (WACZ Package): datapackage.json contains SHA-256 hashes of all archive components

  • Conversation-Level Anchors:

    {
      "conversation_id": "697ccad2-0994-832f-a69b-0e2a1456e747",
      "title": "Banking Choices in Germany",
      "create_time": 1769786098.01691,
      "update_time": 1769786315.00338,
      "current_node": "4428eb41-7d52-4427-a3cc-ca43441ce839",
      "default_model_slug": "auto"
    }
  • Message-Level Anchors: Each message contains multiple unique identifiers and timestamps:

    {
      "id": "36c27e0d-7917-49ff-a4d9-fc77f26dd29d",
      "author": {
        "role": "user",
        "name": null,
        "metadata": {}
      },
      "create_time": 1769786097.538748,
      "update_time": null,
      "status": "finished_successfully",
      "metadata": {
        "request_id": "247a8438-e77a-4d4d-b5c5-4362cd7ab256",
        "turn_exchange_id": "27723209-488e-46a5-b593-0767b8c61dc1",
        "message_source": null,
        "triggered_by_system_hint_suggestion": false
      },
      "content": {
        "content_type": "text",
        "parts": ["Message text"]
      }
    }

Unique Identifiers:

  • message.id: Unique UUID for this specific message
  • metadata.request_id: Backend request UUID (tracks API call)
  • metadata.turn_exchange_id: UUID linking user prompt to assistant response
  • Node ID: The key in the mapping dictionary (often same as message.id)

Timestamps:

  • create_time: Message creation (Unix timestamp with microsecond precision)
  • update_time: Set if the message was edited (null if never edited)

Authorship:

  • author.role: "user", "assistant", "system", or "tool"
  • author.name: User identifier (if available)
  • status: Message processing status ("finished_successfully", etc.)
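
The file-level layer described earlier in this section can be checked mechanically. The sketch below recomputes the SHA-256 of each file listed in datapackage.json and compares it to the recorded value. It assumes the manifest has a resources array whose entries carry path and a hash of the form "sha256:<hex>"; the function name is hypothetical.

```python
import hashlib
import json
import zipfile


def verify_wacz_resources(wacz) -> dict:
    """Map each resource path in datapackage.json to True/False (hash matches)."""
    results = {}
    with zipfile.ZipFile(wacz) as zf:
        package = json.loads(zf.read("datapackage.json"))
        for resource in package.get("resources", []):
            path = resource["path"]
            algo, _, expected = resource.get("hash", "").partition(":")
            if algo != "sha256":
                results[path] = False  # unknown or missing hash format
                continue
            try:
                actual = hashlib.sha256(zf.read(path)).hexdigest()
            except KeyError:
                results[path] = False  # listed in the manifest but not in the ZIP
                continue
            results[path] = actual == expected.lower()
    return results
```

A passing check only shows that the package is internally consistent. Someone who rebuilds the archive can regenerate the hashes too, so it does not by itself prove when or by whom the capture was made.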

Parent-Child Relationship Chain

The conversation tree structure provides internal consistency verification:

{
  "node-id-1": {
    "id": "36c27e0d-7917-49ff-a4d9-fc77f26dd29d",
    "parent": "89d462ab-2d11-48bb-af9c-d9331c666a2a",
    "children": ["03004e31-4833-4012-8cd8-ce39a52ee775"],
    "message": { ... }
  }
}
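
One way to use this structure for consistency checking is sketched below: every parent link should be mirrored by a children entry and vice versa, and walking parent links from current_node should reach the root. The function names are hypothetical, and the sketch assumes that mapping keys are the node IDs used in parent and children and that the root node has a null parent.

```python
from typing import List


def check_tree(mapping: dict) -> List[str]:
    """Return a list of inconsistencies between parent and children links."""
    problems = []
    for node_id, node in mapping.items():
        parent = node.get("parent")
        if parent is not None:
            if parent not in mapping:
                problems.append(f"{node_id}: parent {parent} is missing")
            elif node_id not in (mapping[parent].get("children") or []):
                problems.append(f"{node_id}: not listed in children of {parent}")
        for child in node.get("children") or []:
            if child not in mapping:
                problems.append(f"{node_id}: child {child} is missing")
            elif mapping[child].get("parent") != node_id:
                problems.append(f"{node_id}: child {child} points to another parent")
    return problems


def active_branch(mapping: dict, current_node: str) -> List[str]:
    """Walk parent links from current_node up to the root; return root-first path."""
    path = []
    seen = set()
    node_id = current_node
    while node_id is not None and node_id in mapping and node_id not in seen:
        seen.add(node_id)  # guards against cycles in a tampered mapping
        path.append(node_id)
        node_id = mapping[node_id].get("parent")
    path.reverse()
    return path
```

An empty result from check_tree means the links agree with each other; it says nothing about whether the message text itself was altered.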

On WhatsApp Web

❌ Webrecorder WACZ archives are NOT suitable for preserving WhatsApp Web conversations.

The mismatch between WhatsApp Web's WebSocket/IndexedDB architecture and Webrecorder's HTTP-based capture means that:

  • Message content is not captured
  • No structured data available
  • No verification mechanisms
  • High privacy risk with minimal benefit

My recommendation remains to use WhatsApp's official export feature or mobile backup methods instead.

Limitations

  • ❌ Replay looks like an empty web app
  • Some privacy leak: conversation titles and last messages are still captured, but nothing else

WhatsApp Web appears to use WebSockets for real-time message synchronization, not HTTP REST APIs. Furthermore, WhatsApp messages appear to be stored in the browser's IndexedDB/LocalStorage rather than fetched via HTTP.
