Expand --sanitize to mask SSNs, bank account numbers, and broader sensitive identifiers #550

@codeAtNectworks

Description

The current --sanitize option is useful for masking contact-style sensitive data such as emails, URLs, and phone numbers. It would be valuable to expand sanitization coverage to other common sensitive identifiers found in PDFs, especially financial and government ID fields.

Problem

Many real-world PDFs contain sensitive values beyond emails, URLs, and phone numbers, for example:

  • Social Security Numbers: 123-45-6789
  • Bank account numbers: Account: 000123456789
  • Routing numbers: Routing: 091000019
  • Credit card-like numbers
  • IBAN / SWIFT / BIC values
  • Tax IDs / EINs
  • Passport numbers
  • Driver's license numbers
  • National IDs such as Aadhaar, PAN, etc.
  • Medical record / patient IDs
  • API keys, tokens, or secret-looking values

When --sanitize is enabled, users may reasonably expect these fields to be masked as well, especially when preparing extracted PDF output for LLM/RAG workflows.

Proposed Behavior

When --sanitize is enabled, OpenDataLoader should optionally mask additional sensitive patterns in all text-based outputs:

  • JSON
  • Markdown
  • HTML
  • Plain text
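One way to keep the formats consistent is to route every output through the same masking function. The sketch below is only an illustration of that idea for JSON: it walks a parsed JSON tree and masks every string value. The `mask_text` helper (emails only) and the field names in the usage example are assumptions made for this sketch, not OpenDataLoader's actual schema or implementation.

```python
import re
from typing import Any, Callable

# Stand-in masking rule (emails only); a real implementation would plug in
# the full rule set used for the other output formats.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_text(text: str) -> str:
    """Replace every email address in `text` with the [EMAIL] placeholder."""
    return EMAIL.sub("[EMAIL]", text)


def mask_json(node: Any, mask: Callable[[str], str] = mask_text) -> Any:
    """Return a copy of a parsed JSON tree with `mask` applied to string values.

    Keys, numbers, booleans, and nulls are left untouched so the document
    structure stays intact for downstream parsing.
    """
    if isinstance(node, str):
        return mask(node)
    if isinstance(node, list):
        return [mask_json(item, mask) for item in node]
    if isinstance(node, dict):
        return {key: mask_json(value, mask) for key, value in node.items()}
    return node


# Usage with hypothetical field names:
doc = {"type": "paragraph", "page": 1,
       "content": "Email: jane@example.com",
       "kids": [{"content": "Contact jane@example.com"}]}
masked = mask_json(doc)
```

Because only string values are rewritten, the same `mask` callable could be reused for Markdown, HTML text nodes, and plain text.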

Example input:

Name: Jane Example
SSN: 123-45-6789
Account: 000123456789
Routing: 091000019
Email: jane@example.com
Phone: +1 (415) 555-0198
Payment URL: https://example.com/pay?token=secret

Example sanitized output:

Name: Jane Example
SSN: [SSN]
Account: [BANK_ACCOUNT]
Routing: [ROUTING_NUMBER]
Email: [EMAIL]
Phone: [PHONE]
Payment URL: [URL]

Suggested Design

A configurable sanitization profile would be helpful so users can avoid over-masking:

opendataloader-pdf file.pdf --sanitize
opendataloader-pdf file.pdf --sanitize --sanitize-profile contact,financial,national-id
opendataloader-pdf file.pdf --sanitize --sanitize-profile all

Possible profiles:

  • contact: emails, URLs, phone numbers
  • financial: bank accounts, routing numbers, IBAN, SWIFT/BIC, credit cards
  • national-id: SSN, Aadhaar, PAN, passport, driver's license, tax IDs
  • healthcare: medical record numbers, patient IDs
  • secrets: API keys, bearer tokens, access tokens
  • all: all supported patterns
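
The profile list above could be resolved into a set of enabled pattern names roughly as follows. This is a sketch under stated assumptions: `--sanitize-profile` is the flag proposed in this issue and does not exist yet, the placeholder names other than `EMAIL`, `URL`, `PHONE`, `SSN`, `BANK_ACCOUNT`, and `ROUTING_NUMBER` are invented for illustration, and defaulting to `contact` is one possible way to preserve the current `--sanitize` behavior.

```python
from typing import Dict, Optional, Set

# Hypothetical mapping from profile name to the placeholder types it enables.
PROFILES: Dict[str, Set[str]] = {
    "contact": {"EMAIL", "URL", "PHONE"},
    "financial": {"BANK_ACCOUNT", "ROUTING_NUMBER", "IBAN", "SWIFT_BIC",
                  "CREDIT_CARD"},
    "national-id": {"SSN", "AADHAAR", "PAN", "PASSPORT", "DRIVER_LICENSE",
                    "TAX_ID"},
    "healthcare": {"MEDICAL_RECORD_NUMBER", "PATIENT_ID"},
    "secrets": {"API_KEY", "BEARER_TOKEN", "ACCESS_TOKEN"},
}


def resolve_profiles(value: Optional[str] = None) -> Set[str]:
    """Turn a comma-separated profile string into a set of pattern names.

    With no value, fall back to "contact" only (an assumption made here so
    that plain --sanitize keeps masking emails, URLs, and phone numbers).
    """
    if value is None:
        return set(PROFILES["contact"])
    names = [name.strip() for name in value.split(",") if name.strip()]
    if not names:
        raise ValueError("no sanitize profile given")
    if "all" in names:
        return set().union(*PROFILES.values())
    unknown = [name for name in names if name not in PROFILES]
    if unknown:
        raise ValueError("unknown sanitize profile(s): " + ", ".join(unknown))
    enabled: Set[str] = set()
    for name in names:
        enabled |= PROFILES[name]
    return enabled
```

Rejecting unknown profile names, rather than ignoring them, avoids a typo silently disabling masking for a whole category.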

Acceptance Criteria

  • --sanitize continues to support the current documented behavior.
  • Additional sensitive identifiers can be masked via either the default sanitize behavior or explicit sanitize profiles.
  • Sanitization applies consistently across JSON, Markdown, HTML, and text artifacts.
  • Placeholder names are stable and machine-readable.
  • False positives are minimized, especially for normal invoice numbers, dates, quantities, and totals.
  • Documentation clearly states which field types are supported.
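
For the false-positive criterion, one common technique is to validate a candidate with its checksum before masking it. The sketch below shows two such checks, the Luhn algorithm for card-like numbers and the ABA checksum for US routing numbers. The function names are hypothetical, and nothing here is confirmed as part of OpenDataLoader.

```python
ASCII_DIGITS = "0123456789"


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters such as spaces or hyphens are ignored. The length
    bound of 13 to 19 digits is an assumption about typical card numbers.
    """
    digits = [int(ch) for ch in number if ch in ASCII_DIGITS]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def aba_routing_valid(number: str) -> bool:
    """Return True if `number` is 9 digits and passes the ABA checksum."""
    if len(number) != 9 or any(ch not in ASCII_DIGITS for ch in number):
        return False
    d = [int(ch) for ch in number]
    checksum = (3 * (d[0] + d[3] + d[6])
                + 7 * (d[1] + d[4] + d[7])
                + (d[2] + d[5] + d[8]))
    return checksum % 10 == 0
```

The routing number used in the example earlier in this issue, 091000019, passes the ABA check, while an arbitrary 9-digit value such as 123456789 does not, so a check like this could help keep ordinary invoice numbers from being masked.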

Why This Matters

PDF extraction is often used before sending content into LLMs, vector databases, or downstream automation. Expanding sanitization would reduce accidental leakage of financial, government ID, and other high-risk data while preserving document structure for parsing and RAG use cases.

Labels: enhancement (New feature or request)

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions