Beta (v0.5.2): SemanticQAGen is functional and actively maintained. The public API is largely stable, but some details may still change ahead of the 1.0 release.
SemanticQAGen is a powerful Python library for generating high-quality question-answer pairs from text documents. It uses advanced semantic understanding to intelligently process content, analyze information density, and create diverse questions across multiple cognitive levels.
SemanticQAGen features enhanced semantic chunking, dynamic question generation, validation of questions and answers, and flexible LLM routing capabilities. You can run all tasks locally on an OpenAI-compatible server, run them via a remote API, or split specific tasks (e.g., validation, analysis, generation) between local and remote servers. The library is designed with a "for Humans" philosophy - simple for basic use cases while providing advanced capabilities for power users.
pip install semantic-qa-gen

The core install is enough to process plain-text and Markdown documents and to talk to any OpenAI-compatible LLM endpoint (OpenAI, Azure OpenAI, OpenRouter, or a local server such as Ollama) over the library's built-in HTTP client.
Install extras for additional document formats and capabilities:
# PDF reading
pip install semantic-qa-gen[pdf]
# DOCX reading
pip install semantic-qa-gen[docx]
# OCR for scanned PDFs (also requires the Tesseract binary; see System Dependencies)
pip install semantic-qa-gen[ocr]
# Advanced NLP analysis (scikit-learn, numpy)
pip install semantic-qa-gen[advanced]
# All document-format support (PDF + DOCX + OCR + advanced analysis)
pip install semantic-qa-gen[formats]
# Sentence-aware chunking helpers (NLTK)
pip install semantic-qa-gen[nlp]
# Everything above, plus the official OpenAI SDK and Rich console output
pip install semantic-qa-gen[full]
# Development tools (tests, linting, docs)
pip install semantic-qa-gen[dev]

A note on LLM providers: You do not need a provider-specific extra to use hosted or local OpenAI-compatible APIs. The library communicates with them directly over its built-in HTTP client, so OpenAI, Azure OpenAI, OpenRouter, and local servers all work with the core install.
A couple of features rely on libraries outside of pip:
- File-type detection uses `python-magic`, which requires the system `libmagic` library. On Debian/Ubuntu: `sudo apt-get install libmagic1`. On macOS: `brew install libmagic`. On Windows, install `python-magic-bin` instead of relying on the system library.
- OCR (`[ocr]`) requires the Tesseract OCR engine on your `PATH`. On Debian/Ubuntu: `sudo apt-get install tesseract-ocr`. On macOS: `brew install tesseract`.
- Python 3.10 or higher
- Core Python dependencies install automatically; the system libraries above are only needed for their corresponding features.
from semantic_qa_gen import SemanticQAGen
# Initialize with default settings
qa_gen = SemanticQAGen()
# Process a document
result = qa_gen.process_document("path/to/document.txt")
# Save the questions to a JSON file
qa_gen.save_questions(result, "output")

# Generate questions from a document with default settings
semantic-qa-gen process document.pdf -o questions_output
# Create a config file interactively
semantic-qa-gen init-config config.yml --interactive
# Process with a specific configuration
semantic-qa-gen process document.txt --config config.yml --format json

SemanticQAGen offers a comprehensive set of features designed to produce high-quality question and answer sets:
| Feature Category | Capability | Status |
|---|---|---|
| Document Processing | Document format support: TXT, PDF, DOCX, MD | ✅ |
| | Automatic document type detection | ✅ |
| | Cross-page content handling | ✅ |
| | Header/footer detection and removal | ✅ |
| Content Analysis | Semantic document chunking | ✅ |
| | Information density analysis | ✅ |
| | Topic coherence evaluation | ✅ |
| | Key concept extraction | ✅ |
| Question Generation | Multi-level cognitive questions (factual, inferential, conceptual) | ✅ |
| | Adaptive generation based on content quality | ✅ |
| | Question diversity enforcement | ✅ |
| | Custom question categories | ✅ |
| Answer Validation | Faithfulness verification (accuracy + completeness) | ✅ |
| | Standalone / decontextualization rewriting | ✅ |
| | Answer-leakage filtering | ✅ |
| | Diversity filtering | ✅ |
| LLM Integration | OpenAI-compatible API support (OpenAI, Azure, OpenRouter) | ✅ |
| | Local LLM support (Ollama, etc.) | ✅ |
| | Hybrid task routing | ✅ |
| | Automatic fallback mechanisms | ✅ |
| Processing Control | Checkpoint and resume capability | ✅ |
| | Concurrent processing | ✅ |
| | Progress tracking and reporting | ✅ |
| Output Options | Multiple export formats (JSON, CSV, JSONL) | ✅ |
| | Metadata inclusion | ✅ |
| | Statistics and analytics | ✅ |
| Extensibility | Custom document loaders | ✅ |
| | Custom chunking strategies | ✅ |
| | Custom validators | ✅ |
SemanticQAGen can read and process a variety of document formats including plain text, PDF, Markdown, and DOCX. Each format is handled by specialized loaders that extract content while preserving document structure. (PDF support requires the [pdf] extra and DOCX support requires the [docx] extra; TXT and Markdown work with the core install.)
# Process different file types the same way
result_txt = qa_gen.process_document("document.txt")
result_pdf = qa_gen.process_document("document.pdf")
result_md = qa_gen.process_document("document.md")
result_docx = qa_gen.process_document("document.docx")

Process multiple files from a directory:
# Process all files in a directory
batch_results = qa_gen.process_input_directory()

The system automatically detects document types using both file extensions and content analysis, so the correct loader is used even when a file extension is missing or incorrect.
For PDF documents, the system intelligently handles sentences and paragraphs that span across page boundaries, creating a seamless text flow for better semantic analysis.
Repeating headers and footers in PDF documents are detected automatically and can optionally be removed, so they are not included in generated questions.
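One common way to detect repeating headers and footers is to look for first and last lines that recur on most pages. The sketch below illustrates that heuristic only; it is an assumption about the general technique, not the library's PDF loader, and the function names and the `min_fraction` parameter are hypothetical.

```python
from collections import Counter


def find_repeating_lines(pages: list[list[str]], min_fraction: float = 0.6) -> set[str]:
    """Return first/last page lines that recur on at least min_fraction of pages."""
    counts: Counter[str] = Counter()
    for lines in pages:
        if not lines:
            continue
        # A set ensures a one-line page contributes its line only once.
        candidates = {lines[0].strip(), lines[-1].strip()}
        counts.update(c for c in candidates if c)
    threshold = min_fraction * len(pages)
    return {line for line, n in counts.items() if n >= threshold}


def strip_repeating_lines(pages: list[list[str]], repeated: set[str]) -> list[list[str]]:
    """Drop a page's first or last line when it is a known repeating line."""
    cleaned = []
    for lines in pages:
        last = len(lines) - 1
        cleaned.append(
            [ln for i, ln in enumerate(lines)
             if not (i in (0, last) and ln.strip() in repeated)]
        )
    return cleaned
```

Exact matching misses footers that change per page, such as "Page 3"; a more thorough version would normalize digits before counting.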
Documents are intelligently broken down into semantically coherent chunks based on content structure rather than arbitrary size limits. This preserves context and produces more meaningful question-answer pairs.
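As a rough illustration of structure-aware chunking, the sketch below packs whole paragraphs into chunks and never splits a paragraph. It is a simplified stand-in rather than the library's semantic strategy, and it assumes sizes are measured in characters, which the documentation does not state.

```python
def chunk_paragraphs(text: str, target_chunk_size: int = 1500) -> list[str]:
    """Greedily pack whole paragraphs into chunks of about target_chunk_size characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        # Close the current chunk when this paragraph would overshoot the target.
        if current and current_len + len(para) > target_chunk_size:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the target becomes its own oversized chunk here; a fuller strategy would split it at sentence boundaries.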
# Configure chunking strategy
config = {
    "chunking": {
        "strategy": "semantic",  # Options: semantic, fixed_size
        "target_chunk_size": 1500,
        "preserve_headings": True
    }
}

Each chunk is analyzed for information density: how rich in facts and teachable content it is. This analysis guides question generation to focus on content-rich sections.
The system evaluates how well each chunk maintains a coherent topic or theme, which helps ensure generated questions relate to a consistent subject area.
Important concepts, terms, and ideas are automatically identified in each chunk, forming the basis for targeted question generation.
The system generates questions across three cognitive domains:
- Factual: Direct recall of information stated in the content
- Inferential: Questions requiring connecting multiple pieces of information
- Conceptual: Higher-order questions about principles, implications, or broader understanding
# Configure question categories
config = {
    "question_generation": {
        "categories": {
            "factual": {"min_questions": 3, "weight": 1.0},
            "inferential": {"min_questions": 2, "weight": 1.2},
            "conceptual": {"min_questions": 1, "weight": 1.5}
        }
    }
}

The number and types of questions generated adapt automatically based on content quality. Information-dense chunks yield more questions, while sparse chunks yield fewer.
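The sketch below shows one way such adaptation could work: scale a per-chunk question budget by a density score, then divide it among categories by weight. This is purely illustrative. The function name, the density range of 0 to 1, and the way `weight` is applied are all assumptions, and `min_questions` is ignored for brevity.

```python
def plan_question_counts(
    categories: dict[str, dict[str, float]],
    density: float,
    max_questions_per_chunk: int = 10,
) -> dict[str, int]:
    """Split a density-scaled question budget across categories by weight."""
    density = min(max(density, 0.0), 1.0)  # clamp to the assumed [0, 1] range
    budget = int(max_questions_per_chunk * density)
    total_weight = sum(cfg["weight"] for cfg in categories.values())
    if total_weight <= 0:
        return {name: 0 for name in categories}
    # Round down so the plan never exceeds the budget.
    return {
        name: int(budget * cfg["weight"] / total_weight)
        for name, cfg in categories.items()
    }
```

With the three default categories, a chunk scored at full density receives a larger plan than one scored at half density.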
To avoid repetitive or overly similar questions, the system enforces diversity by comparing newly generated questions with existing ones and filtering out duplicates.
Users can define custom question categories beyond the standard factual/inferential/conceptual to target specific learning objectives.
Generated answers are scored for faithfulness against the source content — both factual accuracy and completeness — to ensure they do not contain errors or hallucinations and that they fully address the question.
Question-answer pairs are evaluated for whether they make sense without the source passage, and can be rewritten to stand alone while preserving their grounded meaning. This produces self-contained pairs suitable for fine-tuning datasets.
A dedicated filter detects and removes questions that inadvertently reveal their answer (or that restate the source verbatim), keeping the generated set genuinely question-shaped.
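A deterministic leak check can be as simple as testing whether the normalized answer appears inside the normalized question. The following is a minimal sketch of that idea, not the library's `leak_filter` module; the helper names and the `min_answer_words` cutoff are hypothetical.

```python
import re


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def leaks_answer(question: str, answer: str, min_answer_words: int = 2) -> bool:
    """Return True when the whole answer appears verbatim inside the question."""
    q = _normalize(question)
    a = _normalize(answer)
    if len(a.split()) < min_answer_words:
        # One-word answers such as "yes" occur in questions by chance.
        return False
    # Pad with spaces so only whole-word matches count.
    return f" {a} " in f" {q} "
```

A check like this needs no LLM call, so it can run before any model-based validation.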
Newly generated questions are compared against the existing set so near-duplicates are filtered out, keeping the output varied.
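The comparison step can be sketched with the standard library's `difflib`. The library's actual similarity measure is not documented here, so `SequenceMatcher` is a stand-in and `filter_near_duplicates` is a hypothetical name; the 0.85 default mirrors the diversity threshold used in the configuration example.

```python
from difflib import SequenceMatcher


def filter_near_duplicates(questions: list[str], threshold: float = 0.85) -> list[str]:
    """Keep each question only if it is not too similar to one already kept."""
    kept: list[str] = []
    for question in questions:
        is_duplicate = any(
            SequenceMatcher(None, question.lower(), other.lower()).ratio() > threshold
            for other in kept
        )
        if not is_duplicate:
            kept.append(question)
    return kept
```

Character-level similarity catches rewordings that differ by a few characters but not paraphrases; an embedding-based measure would be needed for those.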
SemanticQAGen talks to OpenAI-compatible chat-completion endpoints over its own built-in HTTP client, with optimized prompting strategies for each task in the pipeline. This covers OpenAI, Azure OpenAI, OpenRouter, and any other service that exposes an OpenAI-compatible API — no provider SDK is required.
Local LLM deployments are supported through Ollama and similar OpenAI-compatible servers, so models such as Mistral can run on your own hardware without external API access.
Intelligently route different tasks to the most appropriate LLM based on task complexity and model capability. For example, use a strong remote model for complex question generation but a local model for simple validation tasks.
config = {
    "llm_services": {
        "local": {
            "enabled": True,
            "url": "http://localhost:11434",
            "model": "mistral:7b",
            "preferred_tasks": ["validation"]
        },
        "remote": {
            "enabled": True,
            "provider": "openai",
            "model": "gpt-4o",
            "preferred_tasks": ["analysis", "generation"]
        }
    }
}

If a primary LLM service fails, the system automatically tries fallback services, ensuring robustness in production environments.
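The routing and fallback behavior can be pictured as follows: try the service that prefers the task first, then any other enabled service. This is a sketch under assumptions, not the library's `TaskRouter`; `call_with_fallback`, `AllServicesFailed`, and the `call` callback are hypothetical names.

```python
from typing import Callable


class AllServicesFailed(RuntimeError):
    """Raised when every enabled service fails for a task."""


def call_with_fallback(
    task: str,
    services: dict[str, dict],
    call: Callable[[str, str], str],
) -> str:
    """Try the preferred service for a task, then the remaining enabled ones."""
    enabled = [name for name, cfg in services.items() if cfg.get("enabled")]
    # False sorts before True, so services that prefer this task come first.
    ordered = sorted(
        enabled, key=lambda name: task not in services[name].get("preferred_tasks", [])
    )
    errors = []
    for name in ordered:
        try:
            return call(name, task)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllServicesFailed("; ".join(errors) or "no enabled services")
```

Passing the `llm_services` dictionary from the example above as `services` would route `generation` to `remote` first and fall back to `local` if the remote call raises.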
Processing can be interrupted and resumed later using a checkpoint system. This is essential for large documents or when processing must be paused.
config = {
    "processing": {
        "enable_checkpoints": True,
        "checkpoint_dir": "./checkpoints",
        "checkpoint_interval": 10  # Save every 10 chunks
    }
}

Multi-threaded processing of chunks with configurable concurrency levels to maximize throughput on multi-core systems.
Detailed progress reporting during processing, with support for both simple console output and rich interactive displays (when installed with Rich, e.g. via the [full] extra).
Export question-answer pairs in various formats including JSON, CSV, and JSONL with customizable formatting options.
# Save questions in different formats
qa_gen.save_questions(result, "questions_output", format_name="json")
qa_gen.save_questions(result, "questions_output", format_name="csv")
qa_gen.save_questions(result, "questions_output", format_name="jsonl")

Include rich metadata about source documents, generation parameters, and validation results with the generated questions.
Comprehensive statistics about generated questions, including category distribution, validation success rates, and content coverage.
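As an illustration of the kind of summary described above, the sketch below computes a category distribution and a validation success rate from a list of question records. The `category` and `is_valid` field names and the `summarize` helper are assumptions for this example; only the `categories` key mirrors the `result['statistics']['categories']` lookup used in the usage examples.

```python
from collections import Counter


def summarize(questions: list[dict]) -> dict:
    """Compute simple statistics over question records (hypothetical field names)."""
    categories = Counter(q.get("category", "unknown") for q in questions)
    valid = sum(1 for q in questions if q.get("is_valid"))
    return {
        "total_questions": len(questions),
        "categories": dict(categories),
        # Avoid dividing by zero when nothing was generated.
        "validation_success_rate": valid / len(questions) if questions else 0.0,
    }
```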
.
├── pyproject.toml
├── tox.ini
├── README.md
└── src
    ├── main.py
    └── semantic_qa_gen
        ├── __init__.py
        ├── version.py
        ├── semantic_qa_gen.py
        ├── chunking
        │   ├── __init__.py
        │   ├── analyzer.py
        │   ├── engine.py
        │   └── strategies
        │       ├── __init__.py
        │       ├── base.py
        │       ├── fixed_size.py
        │       ├── nlp_helpers.py
        │       └── semantic.py
        ├── cli
        │   ├── __init__.py
        │   └── commands.py
        ├── config
        │   ├── __init__.py
        │   ├── manager.py
        │   └── schema.py
        ├── document
        │   ├── __init__.py
        │   ├── models.py
        │   ├── processor.py
        │   └── loaders
        │       ├── __init__.py
        │       ├── base.py
        │       ├── docx.py
        │       ├── markdown.py
        │       ├── pdf.py
        │       └── text.py
        ├── llm
        │   ├── __init__.py
        │   ├── router.py
        │   ├── adapters
        │   │   ├── __init__.py
        │   │   ├── base.py
        │   │   └── openai_adapter.py
        │   └── prompts
        │       ├── __init__.py
        │       ├── manager.py
        │       └── templates
        │           ├── analysis_prompts.yaml
        │           ├── generation_prompts.yaml
        │           ├── validation_prompts.yaml
        │           └── decontextualize_prompts.yaml
        ├── output
        │   ├── __init__.py
        │   ├── formatter.py
        │   └── adapters
        │       ├── __init__.py
        │       ├── csv.py
        │       ├── json.py
        │       └── jsonl.py
        ├── pipeline
        │   ├── __init__.py
        │   └── semantic.py
        ├── question
        │   ├── __init__.py
        │   ├── generator.py
        │   ├── processor.py
        │   ├── filters
        │   │   ├── __init__.py
        │   │   └── leak_filter.py
        │   └── validation
        │       ├── __init__.py
        │       ├── base.py
        │       ├── decontextualizer.py
        │       ├── diversity.py
        │       ├── engine.py
        │       └── factual.py
        └── utils
            ├── __init__.py
            ├── checkpoint.py
            ├── error.py
            ├── logging.py
            ├── progress.py
            └── project.py
Separately from the package source above, SemanticQAGen can manage a QAGenProject working directory — created by semantic-qa-gen create-project or by passing project_path= to the constructor. Relative input, output, checkpoint, and config paths are resolved against this directory:
QAGenProject/
├── config/ # Configuration files (system.yaml by default)
├── prompts/ # Optional project-local prompt template overrides
├── input/ # Place source documents here
├── output/ # Generated Q&A files are written here
├── checkpoints/ # Checkpoint files for resumable runs
├── logs/ # Log files
└── temp/ # Temporary working files
If no project is supplied, the library uses the current working directory and only auto-creates a QAGenProject scaffold when neither a config file nor a config dictionary is provided.
SemanticQAGen implements a modular pipeline architecture with clearly defined components and interfaces:
ARCHITECTURE OVERVIEW
┌───────────────────────────────────────────────────────────────────────────────────┐
│ ┌─────────────────────────────┐ │
│ │ SemanticQAGen │ Main user interface │
│ │ (sync + async entrypoints) │ (process_document, │
│ └──────────────┬──────────────┘ save_questions, ...) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────┐ │
│ │ SemanticPipeline │ Orchestrates the │
│ │ (orchestrator) │ staged workflow │
│ └──────────────┬──────────────┘ │
│ │ │
│ STAGE 1 ┌──────────────────▼──────────────────┐ │
│ Load + sectionize │ DocumentProcessor + Loaders │ │
│ │ (TXT · Markdown · PDF · DOCX) │ │
│ └──────────────────┬──────────────────┘ │
│ STAGE 2 ┌──────────────────▼──────────────────┐ │
│ Chunk │ ChunkingEngine │ │
│ │ (semantic · fixed_size strategies)│ │
│ └──────────────────┬──────────────────┘ │
│ STAGE 3 ┌──────────────────▼──────────────────┐ ┌───────────────┐ │
│ Analyze (per │ SemanticAnalyzer │──▶│ │ │
│ chunk, batched) │ density · coherence · concepts │ │ │ │
│ └──────────────────┬──────────────────┘ │ TaskRouter │ │
│ STAGE 4 ┌──────────────────▼──────────────────┐ │ + │ │
│ Generate + │ QuestionProcessor │◀─▶│ PromptManager │ │
│ validate + │ ┌───────────────────────────────┐ │ │ │ │
│ repair │ │ 1. QuestionGenerator │ │ │ routes tasks │ │
│ (per chunk, in │ │ (factual·inferential· │ │ │ to services: │ │
│ concurrency │ │ conceptual + custom) │ │ │ ┌───────────┐ │ │
│ waves) │ └──────────────┬────────────────┘ │ │ │ Remote │ │ │
│ │ ┌──────────────▼────────────────┐ │ │ │ LLM │ │ │
│ │ │ 2. ValidationEngine │ │ │ │ (OpenAI, │ │ │
│ │ │ • leak filter (no LLM) │ │ │ │ Azure, │ │ │
│ │ │ • faithfulness: │ │ │ │ OpenRtr) │ │ │
│ │ │ factual + completeness │ │ │ └───────────┘ │ │
│ │ │ • standalone judge │ │ │ ┌───────────┐ │ │
│ │ │ • diversity (no LLM) │ │ │ │ Local LLM │ │ │
│ │ └──────────────┬────────────────┘ │ │ │ (Ollama, │ │ │
│ │ ┌──────────────▼────────────────┐ │ │ │ LM Studio│ │ │
│ │ │ 3. Decontextualizer (repair) │ │ │ │ …) │ │ │
│ │ │ rewrite → re-validate → │ │ │ └───────────┘ │ │
│ │ │ keep-better-of-two merge │ │ └───────────────┘ │
│ │ └───────────────────────────────┘ │ │
│ └──────────────────┬──────────────────┘ │
│ STAGE 5 ┌──────────────────▼──────────────────┐ │
│ Format + save │ OutputFormatter │ │
│ │ ┌────────┐ ┌────────┐ ┌─────────┐ │ │
│ │ │ JSON │ │ CSV │ │ JSONL │ │ │
│ │ └────────┘ └────────┘ └─────────┘ │ │
│ └──────────────────┬──────────────────┘ │
│ ┌─────────────▼────────────────┐ │
│ │ Output Results │ • Questions & answers │
│ │ │ • Document metadata │
│ └──────────────────────────────┘ • Statistics │
│ │
│ Cross-cutting: ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ CheckpointManager │ │ ProgressReporter │ │
│ │ (save / resume) │ │ (console feedback) │ │
│ └──────────────────────┘ └──────────────────────┘ │
└───────────────────────────────────────────────────────────────────────────────────┘
- Document Processor & Loaders: Load documents (TXT, Markdown, PDF, DOCX) and split them into structural sections.
- Chunking Engine: Splits documents into semantically coherent chunks using the configured strategy (`semantic` or `fixed_size`).
- Semantic Analyzer: Evaluates each chunk for information density, topic coherence, and key concepts to guide generation.
- Task Router & Prompt Manager: Route each task (analysis, generation, validation) to the preferred LLM service and supply the matching prompt templates.
- Question Generator: Creates diverse questions across cognitive categories based on the chunk analysis.
- Validation Engine: Runs the deterministic leak filter plus the faithfulness (factual accuracy + completeness), standalone, and diversity validators, and aggregates a single verdict per question.
- Decontextualizer: An optional repair stage that rewrites flagged or non-standalone questions and re-validates them, keeping the better of the original and the rewrite.
- Question Processor: Orchestrates generation → validation → repair for each chunk and produces the per-chunk statistics.
- Output Formatter: Formats and exports the surviving Q&A pairs as JSON, CSV, or JSONL.
- Checkpoint Manager & Progress Reporter: Cross-cutting services for resumable processing and progress feedback.
Document → Sections → Chunks → Analysis → (Generate → Validate → Repair) → Output
Internally the pipeline runs five ordered stages, each tracked for timing and progress: Loading (load the document and extract sections), Chunking, Analysis (chunks analyzed concurrently in batches), Question Generation (each chunk's questions are generated, validated, and optionally repaired, with chunks processed in concurrency-sized waves), and Completion (final statistics, checkpoint, and result assembly). Checkpoints are loaded before chunking when enabled and saved at the configured interval, so an interrupted run resumes from the last completed wave of chunks.
SemanticQAGen uses a hierarchical YAML configuration system with schema validation.
# SemanticQAGen configuration
version: 1.0

# Document processing settings
document:
  loaders:
    text:
      enabled: true
      encoding: utf-8
    pdf:
      enabled: true
      extract_images: false
      ocr_enabled: false
      detect_headers_footers: true
    markdown:
      enabled: true
      extract_metadata: true
    docx:
      enabled: true
      extract_images: false

# Chunking settings
chunking:
  strategy: semantic
  target_chunk_size: 1500
  overlap_size: 150
  preserve_headings: true
  min_chunk_size: 500
  max_chunk_size: 2500

# LLM services configuration.
# Define 'local', 'remote', or both. A task may be preferred by only one
# service; the same task cannot appear in both 'preferred_tasks' lists.
llm_services:
  local:
    enabled: true
    # Base URL of an OpenAI-compatible local server (Ollama, LM Studio, etc.).
    url: "http://localhost:11434"
    model: "mistral:7b"
    preferred_tasks: [validation]
    timeout: 60
  remote:
    enabled: true
    provider: openai  # openai | azure | openrouter | anthropic | ...
    model: gpt-4o
    api_key: ${OPENAI_API_KEY}
    # api_base: "..."  # required for Azure and some other providers
    preferred_tasks: [analysis, generation]
    timeout: 120
    max_retries: 3  # retries on transient errors
    initial_delay: 1.0  # seconds before the first retry
    max_delay: 60.0  # cap on exponential backoff

# Question generation settings
question_generation:
  max_questions_per_chunk: 10
  adaptive_generation: true
  categories:
    factual:
      min_questions: 2
      weight: 1.0
    inferential:
      min_questions: 2
      weight: 1.2
    conceptual:
      min_questions: 1
      weight: 1.5

# Validation settings.
# A deterministic leak filter always runs first (no LLM). The validators below
# are the faithfulness pair (factual_accuracy + answer_completeness), the
# 'standalone' judge (configured here under 'question_clarity'), and a
# diversity check. A question must pass all enabled validators to be kept.
validation:
  factual_accuracy:
    enabled: true
    threshold: 0.7
  answer_completeness:
    enabled: true
    threshold: 0.7
  question_clarity:  # controls the source-free 'standalone' judge
    enabled: true
    threshold: 0.7
  diversity:
    enabled: true
    threshold: 0.85  # similarity above which a near-duplicate is rejected

# Decontextualization (repair) settings.
# When enabled, questions that the leak filter only FLAGged or that failed the
# standalone judge are rewritten and re-validated; the better of the original
# and the rewrite is kept.
decontextualization:
  enabled: true
  max_attempts: 1  # rewrite attempts per question (1-3)
  mode: rewrite  # 'rewrite' (single-pass); 'qa_rewrite' reserved

# Processing settings
processing:
  concurrency: 3
  enable_checkpoints: true
  checkpoint_interval: 10
  checkpoint_dir: "./checkpoints"
  log_level: "INFO"
  debug_mode: false

# Output settings
output:
  format: json
  include_metadata: true
  include_statistics: true
  output_dir: "./output"
  fine_tuning_format: "default"
  json_indent: 2
  json_ensure_ascii: false
  csv_delimiter: ","
  csv_quotechar: "\""

Configuration values can be specified using environment variables:
llm_services:
  remote:
    api_key: ${OPENAI_API_KEY}

Configuration is resolved in the following order:
- Default values
- Configuration file
- Environment variables
- Command-line arguments
- Programmatic overrides
class SemanticQAGen:
    """Main interface for generating question-answer pairs from text documents."""

    def __init__(self, config_path: Optional[str] = None,
                 config_dict: Optional[Dict[str, Any]] = None,
                 verbose: bool = False,
                 project_path: Optional[str] = None):
        """Initialize SemanticQAGen with optional configuration."""

    def process_document(self, document_path: str) -> Dict[str, Any]:
        """
        Process a document to generate question-answer pairs.

        Args:
            document_path: Path to the document. May be absolute or relative
                to the project's input directory.

        Returns:
            Dictionary containing questions, statistics, and metadata.

        Raises:
            RuntimeError: If called from within a running event loop (e.g.
                Jupyter, FastAPI). Use `process_document_async` instead.
        """

    async def process_document_async(self, document_path: str) -> Dict[str, Any]:
        """
        Async variant of `process_document`. Use this when calling from inside
        an existing event loop (Jupyter, FastAPI, etc.).

        Args:
            document_path: Path to the document. May be absolute or relative
                to the project's input directory.

        Returns:
            Dictionary containing questions, statistics, and metadata.
        """

    def process_input_directory(self, output_format: Optional[str] = None) -> Dict[str, Any]:
        """
        Processes all readable files in the project's input directory.

        Args:
            output_format: Optional output format to override config.

        Returns:
            A dictionary summarizing the batch processing results.
        """

    def save_questions(self, result: Dict[str, Any],
                       output_path: str,
                       format_name: Optional[str] = None) -> str:
        """
        Save generated questions to a file.

        Args:
            result: Results from process_document.
            output_path: Path where to save the output.
            format_name: Format to save in (json, csv, jsonl).

        Returns:
            Path to the saved file.
        """

    def create_default_config_file(self, output_path: str, include_comments: bool = True) -> None:
        """Create a default configuration file."""

    def dump_failed_chunks(self, output_path: Optional[str] = None) -> int:
        """
        Generate a detailed report of failed chunks for debugging.

        Args:
            output_path: Optional path to write the report.

        Returns:
            Number of failed chunks reported.
        """

SemanticQAGen provides a command-line interface:
semantic-qa-gen process <document> [-o OUTPUT] [-f {json,csv}] [-c CONFIG] [-p PROJECT] [-v]
semantic-qa-gen create-project [path]
semantic-qa-gen init-config [output] [-i] [-p PROJECT]
semantic-qa-gen interactive
semantic-qa-gen formats
semantic-qa-gen info
semantic-qa-gen version
Note: The `process` command's `-f`/`--format` flag accepts `json` and `csv`. JSONL output is fully supported, but is selected via the configuration (`output.format: jsonl`) or the `format_name` argument of `save_questions()` rather than this flag.
process                 Process a document and generate questions
  document              Path to the document file
  -o, --output          Path for output file
  -f, --format          Output format (json, csv); defaults to config
  -c, --config          Path to config file
  -p, --project         Path to QAGenProject directory
  -v, --verbose         Enable verbose output
create-project          Create a new QAGenProject structure
  path                  Path for the new project (default: current directory)
init-config             Create a default configuration file
  output                Path for the config file. Optional when --project is given,
                        in which case it defaults to <project>/config/system.yaml
  -i, --interactive     Create config interactively
  -p, --project         Path to QAGenProject directory
interactive             Run in interactive mode
formats                 List supported file formats
info                    Show system information
version                 Show the version and exit
# Process a PDF document
semantic-qa-gen process document.pdf -o questions_output
# Create a new project
semantic-qa-gen create-project my_qa_project
# Create a default configuration file
semantic-qa-gen init-config config.yml
# Create a configuration file interactively
semantic-qa-gen init-config config.yml --interactive

from semantic_qa_gen import SemanticQAGen
# Initialize with default settings
qa_gen = SemanticQAGen()
# Process a document
result = qa_gen.process_document("path/to/document.txt")
# Save the questions to a JSON file
qa_gen.save_questions(result, "qa_pairs")
# Display stats
print(f"Generated {len(result['questions'])} questions")
print(f"Factual questions: {result['statistics']['categories'].get('factual', 0)}")
print(f"Inferential questions: {result['statistics']['categories'].get('inferential', 0)}")
print(f"Conceptual questions: {result['statistics']['categories'].get('conceptual', 0)}")

from semantic_qa_gen import SemanticQAGen
# Create or use an existing project structure
qa_gen = SemanticQAGen(project_path="my_qa_project")
# Process a document (can be in project's input directory)
result = qa_gen.process_document("input/document.txt")
# Save questions (will save to project's output directory)
qa_gen.save_questions(result, "questions_output")
# Process all documents in the input directory
batch_results = qa_gen.process_input_directory()

from semantic_qa_gen import SemanticQAGen
# Configuration for hybrid LLM setup
config = {
    "llm_services": {
        "local": {
            "enabled": True,
            "url": "http://localhost:11434",
            "model": "mistral:7b",
            "preferred_tasks": ["validation"]
        },
        "remote": {
            "enabled": True,
            "provider": "openai",
            "model": "gpt-4o",
            "api_key": "YOUR_API_KEY",
            "preferred_tasks": ["analysis", "generation"]
        }
    }
}
# Initialize with hybrid LLM config
qa_gen = SemanticQAGen(config_dict=config)
# Process document using hybrid approach
# - Local model will handle validation
# - Remote model will handle analysis and question generation
result = qa_gen.process_document("document.pdf")

from semantic_qa_gen import SemanticQAGen
config = {
    "question_generation": {
        "max_questions_per_chunk": 12,
        "categories": {
            "factual": {
                "min_questions": 4,  # Prefer more factual questions
                "weight": 1.5
            },
            "inferential": {
                "min_questions": 3,
                "weight": 1.2
            },
            "conceptual": {
                "min_questions": 2,
                "weight": 1.0
            },
            "applied": {  # Custom category - practical applications
                "min_questions": 3,
                "weight": 1.3
            }
        }
    }
}
qa_gen = SemanticQAGen(config_dict=config)

from semantic_qa_gen import SemanticQAGen
config = {
    "processing": {
        "enable_checkpoints": True,
        "checkpoint_interval": 5  # Save checkpoints every 5 chunks
    }
}
qa_gen = SemanticQAGen(config_dict=config)
result = qa_gen.process_document("large_document.pdf")

SemanticQAGen is designed to be easily extended with custom components.
import os
from typing import Any, Dict, Optional

from semantic_qa_gen.document.loaders.base import BaseLoader
from semantic_qa_gen.document.models import Document, DocumentType, DocumentMetadata
from semantic_qa_gen.utils.error import DocumentError

class CustomFileLoader(BaseLoader):
    """Loader for custom file format."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)

    def load(self, path: str) -> Document:
        """Load a document from a custom file format."""
        if not self.supports_type(path):
            raise DocumentError(f"Unsupported file type: {path}")
        # Implementation for loading custom format
        with open(path, 'r', encoding='utf-8') as file:
            content = file.read()
        # Create and return document
        return Document(
            content=content,
            doc_type=DocumentType.TEXT,
            path=path,
            metadata=self.extract_metadata(path)
        )

    def supports_type(self, file_path: str) -> bool:
        """Check if this loader supports the given file type."""
        _, ext = os.path.splitext(file_path.lower())
        return ext == '.custom'

    def extract_metadata(self, path: str) -> DocumentMetadata:
        """Extract metadata from the custom file."""
        # Implementation for extracting metadata
        return DocumentMetadata(
            title=os.path.basename(path),
            source=path
        )

from typing import Any, Dict, Optional

from semantic_qa_gen.question.validation.base import BaseValidator, ValidationResult
from semantic_qa_gen.document.models import Question, Chunk

class CustomValidator(BaseValidator):
    """Custom validator for specialized validation logic."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        self.threshold = self.config.get('threshold', 0.7)

    async def validate(self, question: Question,
                       chunk: Chunk,
                       llm_validation_data: Optional[Dict[str, Any]] = None) -> ValidationResult:
        """Implement custom validation logic."""
        # Custom validation implementation
        score = 0.8  # Example score
        return ValidationResult(
            question_id=question.id,
            validator_name=self.name,
            is_valid=score >= self.threshold,
            scores={"custom_score": score},
            reasons=[f"Custom validation: {score:.2f}"],
            suggested_improvements=None if score >= self.threshold else "Suggestion for improvement"
        )

from typing import Any, Dict, List, Optional

from semantic_qa_gen.chunking.strategies.base import BaseChunkingStrategy
from semantic_qa_gen.document.models import Document, Section, Chunk

class CustomChunkingStrategy(BaseChunkingStrategy):
    """Custom strategy for document chunking."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        super().__init__(config)
        self.target_size = self.config.get('target_chunk_size', 1500)

    def chunk_document(self, document: Document, sections: List[Section]) -> List[Chunk]:
        """Break a document into chunks using a custom strategy."""
        chunks = []
        # Custom implementation of chunking algorithm
        return chunks

Issue: Missing dependencies when installing
Solution: Install with the appropriate extra dependencies:
pip install semantic-qa-gen[full]

Issue: ImportError or "failed to find libmagic" on startup
Solution: python-magic needs the system libmagic library. Install it (`sudo apt-get install libmagic1`, or `brew install libmagic`), or on Windows install `python-magic-bin`.

Issue: OCR produces no text from scanned PDFs
Solution: Install the `[ocr]` extra and the Tesseract engine itself (`sudo apt-get install tesseract-ocr` or `brew install tesseract`), then set `ocr_enabled: true` in the PDF loader config.

Issue: Conflicts with existing packages
Solution: Use a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install semantic-qa-gen

Issue: Out of memory errors with large documents
Solution: Adjust chunking and processing settings:
config = {
    "chunking": {
        "target_chunk_size": 1000,  # Smaller chunks
        "max_chunk_size": 1500
    },
    "processing": {
        "concurrency": 1,  # Reduce concurrency
        "enable_checkpoints": True,
        "checkpoint_interval": 3  # More frequent checkpoints
    }
}

Issue: Slow processing with PDF documents
Solution: Disable unnecessary PDF features:
config = {
    "document": {
        "loaders": {
            "pdf": {
                "extract_images": False,
                "ocr_enabled": False,
                "use_advanced_reading_order": False
            }
        }
    }
}

Issue: OpenAI rate limits
Solution: Adjust rate limiting settings:
config = {
    "llm_services": {
        "remote": {
            "rate_limit_tokens": 60000,  # Reduce token usage
            "rate_limit_requests": 50  # Reduce requests per minute
        }
    }
}

Issue: Local LLM not responding
Solution: Check connection settings and increase timeout:
config = {
    "llm_services": {
        "local": {
            "url": "http://localhost:11434",  # Verify URL
            "timeout": 120  # Increase timeout
        }
    }
}

To enable detailed logging for troubleshooting:
from semantic_qa_gen import SemanticQAGen
import logging
# Enable debug logging
logging.basicConfig(level=logging.DEBUG)
# Or enable verbose mode
qa_gen = SemanticQAGen(verbose=True)

For CLI usage:
semantic-qa-gen process document.pdf -o output --verbose

SemanticQAGen is released under the MIT License.
Copyright © 2025–2026 Stephen Genusa