Expand `--sanitize` to mask SSNs, bank account numbers, and broader sensitive identifiers #550

The current `--sanitize` option is useful for masking contact-style sensitive data such as emails, URLs, and phone numbers. It would be valuable to expand sanitization coverage to other common sensitive identifiers found in PDFs, especially financial and government ID fields.

### Problem

Many real-world PDFs contain sensitive values beyond emails, URLs, and phone numbers, for example:

- Social Security Numbers: `123-45-6789`
- Bank account numbers: `Account: 000123456789`
- Routing numbers: `Routing: 091000019`
- Credit card-like numbers
- IBAN / SWIFT / BIC values
- Tax IDs / EINs
- Passport numbers
- Driver license numbers
- National IDs such as Aadhaar, PAN, etc.
- Medical record / patient IDs
- API keys, tokens, or secret-looking values

When `--sanitize` is enabled, users may reasonably expect these fields to be masked as well, especially when preparing extracted PDF output for LLM/RAG workflows.

### Proposed Behavior

When `--sanitize` is enabled, OpenDataLoader should optionally mask additional sensitive patterns in all text-based outputs:

- JSON
- Markdown
- HTML
- Plain text

Example input:

```text
Name: Jane Example
SSN: 123-45-6789
Account: 000123456789
Routing: 091000019
Email: jane@example.com
Phone: +1 (415) 555-0198
Payment URL: https://example.com/pay?token=secret
```

Example sanitized output:

```text
Name: Jane Example
SSN: [SSN]
Account: [BANK_ACCOUNT]
Routing: [ROUTING_NUMBER]
Email: [EMAIL]
Phone: [PHONE]
Payment URL: [URL]
```
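
To make the example above concrete, here is a minimal Python sketch of ordered, regex-based masking that reproduces the sample output. It is only an illustration and not OpenDataLoader's actual implementation: the `RULES` table, the `sanitize` function, and the regexes themselves are assumptions, and the account and routing rules rely on the `Account:` / `Routing:` labels from the sample rather than on any general detection logic.

```python
import re

# Hypothetical rule table (not OpenDataLoader's real one).
# Order matters: specific patterns run before the broad phone pattern,
# which would otherwise also match SSNs and long digit runs.
RULES = [
    ("URL", re.compile(r"https?://\S+")),
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # Label-anchored rules: only mask digits that follow the field label.
    ("ROUTING_NUMBER", re.compile(r"(?<=routing: )\d{9}\b", re.IGNORECASE)),
    ("BANK_ACCOUNT", re.compile(r"(?<=account: )\d{6,17}\b", re.IGNORECASE)),
    ("PHONE", re.compile(r"\+?\d[\d ().-]{8,}\d")),
]


def sanitize(text: str) -> str:
    """Replace each match with a stable, machine-readable placeholder."""
    for label, pattern in RULES:
        text = pattern.sub(f"[{label}]", text)
    return text
```

Anchoring the financial rules to a nearby label is one way to keep ordinary digit runs (invoice numbers, totals) from being masked, at the cost of missing unlabeled values.
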

### Suggested Design

A configurable sanitization profile would be helpful so users can avoid over-masking:

```
opendataloader-pdf file.pdf --sanitize
opendataloader-pdf file.pdf --sanitize --sanitize-profile contact,financial,national-id
opendataloader-pdf file.pdf --sanitize --sanitize-profile all
```

Possible profiles:

- `contact`: emails, URLs, phone numbers
- `financial`: bank accounts, routing numbers, IBAN, SWIFT/BIC, credit cards
- `national-id`: SSN, Aadhaar, PAN, passport, driver license, tax IDs
- `healthcare`: medical record numbers, patient IDs
- `secrets`: API keys, bearer tokens, access tokens
- `all`: all supported patterns
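
One possible way to resolve the proposed `--sanitize-profile` value into a set of placeholder types is sketched below in Python. Everything here is hypothetical: the `PROFILES` mapping, the `resolve_profiles` helper, and the placeholder names other than those shown in the example output are illustrative, and the fallback to `contact` when no profile is given is an assumption made to preserve the current documented behavior.

```python
from typing import Optional, Set

# Hypothetical mapping from profile name to placeholder types.
PROFILES = {
    "contact": {"EMAIL", "URL", "PHONE"},
    "financial": {"BANK_ACCOUNT", "ROUTING_NUMBER", "IBAN", "SWIFT_BIC", "CREDIT_CARD"},
    "national-id": {"SSN", "AADHAAR", "PAN", "PASSPORT", "DRIVER_LICENSE", "TAX_ID"},
    "healthcare": {"MEDICAL_RECORD", "PATIENT_ID"},
    "secrets": {"API_KEY", "BEARER_TOKEN", "ACCESS_TOKEN"},
}


def resolve_profiles(spec: Optional[str]) -> Set[str]:
    """Turn a comma-separated profile list into the set of types to mask."""
    if not spec:
        # Assumed default: plain --sanitize keeps today's contact-only masking.
        return set(PROFILES["contact"])
    names = [name.strip() for name in spec.split(",") if name.strip()]
    if "all" in names:
        return set().union(*PROFILES.values())
    unknown = [name for name in names if name not in PROFILES]
    if unknown:
        raise ValueError(f"unknown sanitize profile(s): {', '.join(unknown)}")
    enabled: Set[str] = set()
    for name in names:
        enabled |= PROFILES[name]
    return enabled
```

Rejecting unknown profile names, rather than silently ignoring them, avoids a typo quietly disabling masking the user expected.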

### Acceptance Criteria

- `--sanitize` continues to support the current documented behavior.
- Additional sensitive identifiers can be masked via either the default sanitize behavior or explicit sanitize profiles.
- Sanitization applies consistently across JSON, Markdown, HTML, and text artifacts.
- Placeholder names are stable and machine-readable.
- False positives are minimized, especially for normal invoice numbers, dates, quantities, and totals.
- Documentation clearly states which field types are supported.
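
For the false-positive criterion, one option is to check a candidate's checksum before masking it, so that arbitrary digit runs such as invoice numbers are less likely to be treated as card or routing numbers. The Python sketch below uses the standard Luhn check for card-like numbers and the ABA checksum for US routing numbers. The function names and the 13 to 19 digit length bound are assumptions for illustration, not part of any existing OpenDataLoader API.

```python
ASCII_DIGITS = "0123456789"


def luhn_valid(candidate: str) -> bool:
    """Luhn check for card-like numbers; separators such as spaces are ignored."""
    digits = [int(ch) for ch in candidate if ch in ASCII_DIGITS]
    if not 13 <= len(digits) <= 19:  # assumed plausible card lengths
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def aba_routing_valid(candidate: str) -> bool:
    """ABA checksum: weights 3, 7, 1 repeated across the nine digits."""
    if len(candidate) != 9 or any(ch not in ASCII_DIGITS for ch in candidate):
        return False
    d = [int(ch) for ch in candidate]
    total = 3 * (d[0] + d[3] + d[6]) + 7 * (d[1] + d[4] + d[7]) + (d[2] + d[5] + d[8])
    return total % 10 == 0
```

The routing number from the example input, `091000019`, passes the ABA check, while an arbitrary nine-digit value such as `123456789` does not. A checksum only reduces false positives; roughly one in ten random digit strings will still pass, so label or context checks remain useful.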

### Why This Matters

PDF extraction is often used before sending content into LLMs, vector databases, or downstream automation. Expanding sanitization would reduce accidental leakage of financial, government ID, and other high-risk data while preserving document structure for parsing and RAG use cases.
