The current --sanitize option is useful for masking contact-style sensitive data such as emails, URLs, and phone numbers. It would be valuable to expand sanitization coverage to other common sensitive identifiers found in PDFs, especially financial and government ID fields.
Problem
Many real-world PDFs contain sensitive values beyond emails, URLs, and phone numbers, for example:
- Social Security Numbers, e.g. 123-45-6789
- Bank account numbers, e.g. Account: 000123456789
- Routing numbers, e.g. Routing: 091000019
- Credit card-like numbers
- IBAN / SWIFT / BIC values
- Tax IDs / EINs
- Passport numbers
- Driver license numbers
- National IDs such as Aadhaar, PAN, etc.
- Medical record / patient IDs
- API keys, tokens, or secret-looking values
When --sanitize is enabled, users may reasonably expect these fields to be masked as well, especially when preparing extracted PDF output for LLM/RAG workflows.
Proposed Behavior
When --sanitize is enabled, OpenDataLoader should optionally mask additional sensitive patterns in all text-based outputs:
- JSON
- Markdown
- HTML
- Plain text
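For structured output such as JSON, one way to keep masking consistent is to rewrite string values only and leave keys, numbers, and nesting untouched. The following is a minimal Python sketch of that idea; `sanitize_json`, `mask_email`, and the field names in the sample are hypothetical and do not reflect OpenDataLoader's actual JSON schema or code.

```python
import re


def sanitize_json(node, mask):
    """Return a copy of a parsed JSON value with `mask` applied to every string value."""
    if isinstance(node, str):
        return mask(node)
    if isinstance(node, list):
        return [sanitize_json(item, mask) for item in node]
    if isinstance(node, dict):
        # Keys are kept as-is so downstream parsers still see the same structure.
        return {key: sanitize_json(value, mask) for key, value in node.items()}
    return node  # numbers, booleans, and None pass through unchanged


def mask_email(text):
    """Stand-in masker used only for this illustration."""
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]", text)


# Hypothetical document shape, for illustration only.
doc = {
    "type": "paragraph",
    "page": 1,
    "content": "Email: jane@example.com",
    "kids": [{"type": "text", "content": "jane@example.com"}],
}
clean = sanitize_json(doc, mask_email)
```

The same masking function could then be applied to Markdown, HTML text nodes, and plain text, so that every artifact carries identical placeholders.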
Example input:
Name: Jane Example
SSN: 123-45-6789
Account: 000123456789
Routing: 091000019
Email: jane@example.com
Phone: +1 (415) 555-0198
Payment URL: https://example.com/pay?token=secret
Example sanitized output:
Name: Jane Example
SSN: [SSN]
Account: [BANK_ACCOUNT]
Routing: [ROUTING_NUMBER]
Email: [EMAIL]
Phone: [PHONE]
Payment URL: [URL]
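As a rough illustration of the transformation above, here is a minimal Python sketch built on an ordered list of regular expressions. The rule table, the `sanitize` name, and the patterns are assumptions for discussion, not OpenDataLoader's implementation. The account and routing rules are anchored to the labels used in the example so that unlabeled digit runs are not claimed by them.

```python
import re

# Ordered (placeholder, pattern) pairs. Hypothetical patterns for illustration only.
RULES = [
    ("[URL]", re.compile(r"https?://\S+")),
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # Label-anchored: only digits that follow an explicit label are masked.
    ("[BANK_ACCOUNT]", re.compile(r"(?<=Account: )\d{6,17}\b")),
    ("[ROUTING_NUMBER]", re.compile(r"(?<=Routing: )\d{9}\b")),
    # Deliberately loose; it runs last so it only sees what earlier rules left behind.
    ("[PHONE]", re.compile(r"\+?\d[\d ().-]{7,}\d")),
]


def sanitize(text):
    """Apply each rule in order, replacing matches with the rule's placeholder."""
    for placeholder, pattern in RULES:
        text = pattern.sub(placeholder, text)
    return text


sample = "\n".join([
    "Name: Jane Example",
    "SSN: 123-45-6789",
    "Account: 000123456789",
    "Routing: 091000019",
    "Email: jane@example.com",
    "Phone: +1 (415) 555-0198",
    "Payment URL: https://example.com/pay?token=secret",
])
print(sanitize(sample))
```

Rule order matters in this sketch: more specific patterns run before the loose phone pattern, which would otherwise also match long unlabeled digit runs.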
Suggested Design
A configurable sanitization profile would be helpful so users can avoid over-masking:
opendataloader-pdf file.pdf --sanitize
opendataloader-pdf file.pdf --sanitize --sanitize-profile contact,financial,national-id
opendataloader-pdf file.pdf --sanitize --sanitize-profile all
Possible profiles:
- contact: emails, URLs, phone numbers
- financial: bank accounts, routing numbers, IBAN, SWIFT/BIC, credit cards
- national-id: SSN, Aadhaar, PAN, passport, driver license, tax IDs
- healthcare: medical record numbers, patient IDs
- secrets: API keys, bearer tokens, access tokens
- all: all supported patterns
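To make the profile idea concrete, the sketch below resolves a comma-separated profile value into a set of placeholder types. The mapping, the placeholder names beyond those shown in the example output, and the choice to fall back to `contact` when no profile is given (so that plain --sanitize keeps its current behavior) are all assumptions, not a description of existing code.

```python
# Hypothetical mapping from profile name to the placeholder types it enables.
PROFILES = {
    "contact": {"EMAIL", "URL", "PHONE"},
    "financial": {"BANK_ACCOUNT", "ROUTING_NUMBER", "IBAN", "SWIFT_BIC", "CREDIT_CARD"},
    "national-id": {"SSN", "AADHAAR", "PAN", "PASSPORT", "DRIVER_LICENSE", "TAX_ID"},
    "healthcare": {"MEDICAL_RECORD", "PATIENT_ID"},
    "secrets": {"API_KEY", "BEARER_TOKEN", "ACCESS_TOKEN"},
}


def resolve_profiles(value=None):
    """Turn a value such as 'contact,financial' into a set of enabled placeholder types."""
    if not value:
        # Assumed default: no profile given means today's contact-only masking.
        return set(PROFILES["contact"])
    enabled = set()
    for name in (part.strip() for part in value.split(",")):
        if not name:
            continue  # tolerate stray commas
        if name == "all":
            for types in PROFILES.values():
                enabled |= types
        elif name in PROFILES:
            enabled |= PROFILES[name]
        else:
            raise ValueError(f"unknown sanitize profile: {name!r}")
    return enabled
```

Rejecting unknown profile names, rather than ignoring them, would avoid a typo silently disabling masking that the user expected.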
Acceptance Criteria
- --sanitize continues to support the current documented behavior.
- Additional sensitive identifiers can be masked via either the default sanitize behavior or explicit sanitize profiles.
- Sanitization applies consistently across JSON, Markdown, HTML, and text artifacts.
- Placeholder names are stable and machine-readable.
- False positives are minimized, especially for normal invoice numbers, dates, quantities, and totals.
- Documentation clearly states which field types are supported.
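One possible way to reduce false positives for numeric identifiers is to validate a candidate with its checksum before masking it. The sketch below shows the Luhn check used by payment card numbers and the standard checksum for US ABA routing numbers; the function names are hypothetical, and whether OpenDataLoader would adopt checksum validation is an open design question.

```python
import re


def luhn_valid(candidate):
    """Luhn check for a card-like number; spaces and hyphens are ignored."""
    digits = re.sub(r"[ -]", "", candidate)
    if not re.fullmatch(r"[0-9]{13,19}", digits):
        return False
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def aba_valid(candidate):
    """Checksum for a 9-digit US routing number, using weights 3, 7, 1 repeated."""
    if not re.fullmatch(r"[0-9]{9}", candidate):
        return False
    d = [int(char) for char in candidate]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0
```

With checks like these, the routing number from the example (091000019) passes, while an arbitrary nine-digit value such as 123456789 fails and could be left unmasked. Identifiers without a checksum would still need label or context cues.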
Why This Matters
PDF extraction is often used before sending content into LLMs, vector databases, or downstream automation. Expanding sanitization would reduce accidental leakage of financial, government ID, and other high-risk data while preserving document structure for parsing and RAG use cases.