Signature Extraction From PDFs

Related Links: Link #1 Link #2 Link #3

The core principle behind extracting a signature from a PDF involves identifying and isolating the signature's visual representation, whether it's a digitally embedded signature or a scanned image of a handwritten signature.

Digital signatures

Digital signatures are embedded within the PDF structure as data objects, usually associated with a signature handler that uses public/private key cryptography. These objects record which parts of the document were signed, the signer’s identity, and a cryptographic hash. Extraction involves locating the signature fields, reading their content, and interpreting the digital certificate and signature metadata. A digital signature normally lives in an interactive form field of type /Sig whose value is a signature dictionary. The PDF’s AcroForm dictionary may carry a /SigFlags entry indicating the presence of signatures (bit 1, value 1, means at least one signature field exists; bit 2, value 2, marks the document as append-only). Each signature dictionary holds a binary PKCS#7 (CMS) blob in its /Contents entry, along with metadata such as the signing time /M, the reason, the location, etc. The certificate and signer identity are embedded within the PKCS#7 content, which also records the hash algorithm (e.g., SHA-256) and the signature algorithm (e.g., RSA) that were used.
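As a small illustration of the /SigFlags entry, the sketch below decodes the integer into its two defined bits. The function name `describe_sig_flags` and the constant names are made up for this example; only the bit values (1 and 2) come from the PDF specification.

```python
# Bit values of the AcroForm /SigFlags entry as defined in the PDF specification.
SIGNATURES_EXIST = 1  # at least one signature field exists in the document
APPEND_ONLY = 2       # document should only be saved via incremental updates


def describe_sig_flags(value: int) -> dict:
    """Decode a /SigFlags integer into named booleans (hypothetical helper)."""
    return {
        "signatures_exist": bool(value & SIGNATURES_EXIST),
        "append_only": bool(value & APPEND_ONLY),
    }


if __name__ == "__main__":
    print(describe_sig_flags(3))
```

A signed document typically reports the value 3, meaning both bits are set, but a value of 1 is enough to know that a signature field is present.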

To detect whether a PDF is digitally signed:

from pypdf import PdfReader

reader = PdfReader("document.pdf")
fields = reader.get_fields() or {}     # Get all form fields (None if the PDF has no form)
sig_fields = [f for f in fields.values() if f.field_type == "/Sig"]
print(f"Found {len(sig_fields)} digital signature fields.")
for sig in sig_fields:
    print("Signature field name:", sig.name)

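This finds signature fields if any are present. Under the hood, pypdf reads the PDF's form structure and lets you filter fields by type. Another approach is to use pikepdf (a QPDF-based library) to scan the PDF’s objects for dictionaries with /Type /Sig. Since /Type is optional in a signature dictionary, that scan can miss signatures that the field-based check finds.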

Once a signature field is found, we can extract details like the signer's name, signing time, and certificate info. The raw /Contents is a PKCS#7 signature container (binary DER data) which must be parsed to get signer information. Python’s cryptography library doesn’t directly parse PKCS#7 signatures (its PKCS7 APIs are limited), but asn1crypto or pyOpenSSL can parse the ASN.1 structure. A recommended approach (as of 2023) is to use asn1crypto.cms.ContentInfo to parse the signature blob:

from pypdf import PdfReader
from asn1crypto import cms

reader = PdfReader("document.pdf")
for field in (reader.get_fields() or {}).values():
    if field.field_type == "/Sig":
        sig_dict = field.value               # Signature field value (dictionary)
        if sig_dict is None:                 # Empty (unsigned) signature field
            continue
        raw_pkcs7 = sig_dict['/Contents']    # Raw PKCS#7 bytes (usually zero-padded)
        # Parse PKCS#7/CMS data into plain Python containers
        content_info = cms.ContentInfo.load(raw_pkcs7).native
        signer_infos = content_info['content']['signer_infos']
        certs = content_info['content']['certificates']
        # Assuming one signer:
        signer = signer_infos[0]
        # Extract signing time from signed attributes if present
        signing_time = None
        for attr in signer.get('signed_attrs') or []:
            if attr['type'] == 'signing_time':
                signing_time = attr['values'][0]
        # Extract signer's certificate and name.
        # Assumes the first embedded certificate is the signer's; the set is unordered,
        # so a stricter version would match it against signer['sid'].
        cert_data = certs[0]['tbs_certificate']
        signer_name = cert_data['subject'].get('common_name')  # already a plain dict after .native
        print("Signer Name:", signer_name)
        print("Signing Time (from signature):", signing_time)
        print("Digest Algorithm:", signer['digest_algorithm']['algorithm'])

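In many cases, the PDF also stores the signing date in the signature dictionary's `/M` (modification time) key, in the `D:YYYYMMDDHHMMSS` format followed by an optional UTC offset. It can be parsed as the signing timestamp as well, although it is self-reported by the signer and not cryptographically protected on its own.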

from datetime import datetime

sig_time_str = sig_dict.get("/M")  # e.g., "D:20230601122345+05'30'"
if sig_time_str:
    # Strip the leading "D:" and the apostrophes so the offset reads +HHMM, which %z accepts.
    # Assumes the full form with seconds and a UTC offset is present.
    sig_time = datetime.strptime(sig_time_str[2:].replace("'", ""), "%Y%m%d%H%M%S%z")
    print("Signing Time (from PDF):", sig_time)
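PDF date strings are not always complete: everything after the year is optional, and the offset may be `Z`, `+HH'mm'`, or missing. The following is a minimal sketch of a more tolerant parser using only the standard library. The name `parse_pdf_date` and the regular expression are illustrative, not part of any PDF library.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSSOHH'mm' where every component after the year is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<oh>\d{2})'?(?P<om>\d{2})?'?)?)?$"
)


def parse_pdf_date(text: str) -> datetime:
    """Parse a PDF date string; returns a naive datetime if no offset is given."""
    match = _PDF_DATE.match(text.strip())
    if match is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    p = match.groupdict()

    tz = None
    if p["sign"] == "Z":
        tz = timezone.utc
    elif p["sign"] in ("+", "-"):
        offset = timedelta(hours=int(p["oh"] or 0), minutes=int(p["om"] or 0))
        tz = timezone(-offset if p["sign"] == "-" else offset)

    # Missing month/day default to 1, missing time components default to 0.
    return datetime(
        int(p["year"]), int(p["month"] or 1), int(p["day"] or 1),
        int(p["hour"] or 0), int(p["minute"] or 0), int(p["second"] or 0),
        tzinfo=tz,
    )


if __name__ == "__main__":
    print(parse_pdf_date("D:20230601122345+05'30'"))
```

Returning a naive datetime when the offset is missing keeps the ambiguity visible to the caller instead of silently assuming UTC.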

The signer's certificate contains the identity (subject Common Name, email, etc.) and the validity period. These can be extracted from the parsed data. For example, the signer's Common Name might be a person’s name or an organization.

The following is typical output from extracting a PDF's signature details. It shows the signer’s name (from the certificate), the signing timestamp, and the certificate validity period:

Signature: ... signature_type=ETSI.CAdES.detached, signature_handler=Adobe.PPKLite
Signer: EIDAS CERTIFICADO PRUEBAS – 99999999R
Signing time: 2024-02-03 08:33:38+01:00
Signer's certificate: ... common_name=EIDAS CERTIFICADO PRUEBAS – 99999999R
– not before: 2020-11-05 13:04:20+00:00
– not after: 2024-11-05 13:04:20+00:00
– issuer: ... common_name=AC FNMT Usuarios

The popular package PyPDF2 (now maintained as pypdf) falls short when signatures come into play. It can at most detect signature fields but cannot interpret their metadata. To extract signatures and potentially verify them, packages like endesive, pyHanko, pikepdf, and pdfminer.six can be used. For a higher-level solution, pyHanko is a dedicated library for PDF signatures. It can validate signatures and extract metadata with minimal code.

Image Signatures

A good article and its second part outline how to extract signatures that have been embedded as images in the PDF. Another article shows how OpenCV can be used to detect and extract signatures as images. The program flow is roughly this: Parse the PDF page content -> Identify all embedded images -> Locate the likely signature image (often small and near the bottom of a page) -> Save it in any image format. A signature image is typically a squiggly, mostly black (or blue) stroke on a white background. If it can be loaded (e.g., via Pillow), we can check whether it has a white background and dark strokes. Approaches like checking the fraction of dark pixels can help distinguish a signature from a photograph or logo.
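The dark-pixel check can be sketched on top of a 256-bin grayscale histogram, which is what Pillow's `Image.histogram()` returns for an image in mode "L". The function names and every threshold below (`dark_below`, `light_from`, `min_dark`, `max_dark`, `min_light`) are assumptions chosen for illustration and would need tuning on real documents.

```python
def dark_and_light_fractions(hist, dark_below=100, light_from=200):
    """Return (dark, light) pixel fractions from a 256-bin grayscale histogram."""
    total = sum(hist)
    if total == 0:
        return 0.0, 0.0
    dark = sum(hist[:dark_below]) / total    # ink-like pixels
    light = sum(hist[light_from:]) / total   # paper-like pixels
    return dark, light


def looks_like_signature(hist, min_dark=0.005, max_dark=0.25, min_light=0.60):
    """Heuristic: a little dark ink on a mostly light background."""
    dark, light = dark_and_light_fractions(hist)
    return min_dark <= dark <= max_dark and light >= min_light


def grayscale_histogram(path):
    """Load an image with Pillow and return its grayscale histogram."""
    from PIL import Image  # imported lazily so the heuristic itself needs no dependency

    with Image.open(path) as img:
        return img.convert("L").histogram()


if __name__ == "__main__":
    # Synthetic example: 10% black strokes on a white background.
    hist = [0] * 256
    hist[0], hist[255] = 1000, 9000
    print(looks_like_signature(hist))
```

A photograph usually spreads its pixels across the whole histogram and fails the light-background test, while a blank white image fails the minimum-ink test. Images with a transparent background may need to be flattened onto white first, since converting to grayscale ignores the alpha channel.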

Packages like PyMuPDF, pdfplumber, and pdfminer.six can be used to extract the embedded images.

Example (using PyMuPDF):

import fitz  # PyMuPDF
doc = fitz.open("signed_doc.pdf")
page = doc.load_page(0)  # examine first page (index 0)
page_dict = page.get_text("dict")          # get all content as dict
img_blocks = [b for b in page_dict["blocks"] if b["type"] == 1]  # type 1 = image
for img in img_blocks:
    x0, y0, x1, y1 = img["bbox"]             # image location on page (rect coords)
    width_px = img["width"]                 # image pixel width
    height_px = img["height"]               # image pixel height
    img_bytes = img["image"]                # image binary data (e.g. JPEG/PNG bytes)
    img_ext = img["ext"]                    # image format ('png', 'jpeg', etc.)
    print(f"Found image at {img['bbox']} with size {width_px}x{height_px}px")
    # Save the image bytes to file (optional)
    with open(f"extracted_sig_{x0:.0f}_{y0:.0f}.{img_ext}", "wb") as f:
        f.write(img_bytes)

The bbox gives the position of the image on the page in points (72 points = 1 inch). Note that PyMuPDF reports coordinates with the origin at the top-left corner of the page and y growing downward, unlike the native PDF coordinate space, whose origin is at the bottom-left. In the example above, bbox might be something like (100.0, 135.88, 300.0, 364.12), which is the rectangular region occupied by the image. The code saves each image to a file (with a name derived from its coordinates), which can then be inspected or processed further.
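The bbox values can feed simple position and size checks, such as the "small and near the bottom of a page" heuristic mentioned earlier. The sketch below assumes a top-left origin with y growing downward, as PyMuPDF reports it. The helper names and the `fraction` parameter are hypothetical.

```python
POINTS_PER_INCH = 72.0


def bbox_size_inches(bbox):
    """Return (width, height) of a bbox given in points, converted to inches."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0) / POINTS_PER_INCH, (y1 - y0) / POINTS_PER_INCH


def in_lower_part(bbox, page_height, fraction=0.5):
    """True if the bbox's vertical centre lies in the lower `fraction` of the page.

    Assumes y grows downward from the top of the page.
    """
    _, y0, _, y1 = bbox
    centre_y = (y0 + y1) / 2
    return centre_y >= page_height * (1 - fraction)


if __name__ == "__main__":
    bbox = (100.0, 135.88, 300.0, 364.12)
    width_in, height_in = bbox_size_inches(bbox)
    print(f"{width_in:.2f} x {height_in:.2f} in")
    print(in_lower_part(bbox, page_height=792.0))  # 792 pt = US Letter height
```

With PyMuPDF, the page height in points is available as `page.rect.height`, so each image block's bbox can be passed to these helpers directly.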

Pros/Cons of Packages

PyPDF2/pypdf: easy field access but no built-in signature parsing.
pyHanko: powerful for signature data and validation but a heavier dependency.
pikepdf: great for low-level PDF tasks but requires manual parsing for signatures.
PyMuPDF: fast extraction of images and their coordinates but does not classify content.
pdfplumber: high-level, can get positions of all elements but slightly slower and requires pdfminer’s complex objects for image extraction.
pdfminer.six: underlying engine for pdfplumber which is flexible and highly customizable but low-level hence complex.
OpenCV/Pillow: needed if further image analysis or signature shape recognition is required; operates on image data after extraction.

Issues

No single built-in way to determine whether a PDF contains an image signature, a digital signature, or both; the two checks have to be run separately.
No optimization techniques discovered yet. Scanning every page individually and extracting the required signature can be resource- and time-intensive.
No way to verify the authenticity of image signatures.
No well-defined way to determine whether a signature is missing at the end of a clause that should have one.

Conclusion

Paid services like Aspose.PDF exist, but given the wide availability of free and often open-source alternatives, they are not required. Extracting signatures can be a bit tricky, but careful handling of image signatures can give consistent, accurate results.

AalbatrossGuy/extract_sig.md
