Advanced Data Processing¶
LLMBuilder provides sophisticated data processing capabilities designed to handle diverse document formats, perform intelligent deduplication, and prepare high-quality training datasets. This guide covers the advanced features that go beyond basic text processing.
Overview¶
The advanced data processing system includes:
- Multi-format Document Ingestion: Process HTML, Markdown, EPUB, PDF, and plain text files
- Intelligent Text Extraction: OCR fallback for scanned PDFs, clean HTML parsing, structured document handling
- Advanced Deduplication: Both exact and semantic duplicate detection with configurable thresholds
- Flexible Tokenizer Training: Support for multiple algorithms including BPE, SentencePiece, and Unigram
- GGUF Model Conversion: Convert trained models to GGUF format with various quantization levels
- Configuration Management: Template-based configuration system with validation
Multi-Format Document Ingestion¶
Supported Formats¶
LLMBuilder can process the following document formats:
| Format | Extensions | Features |
|---|---|---|
| HTML | .html, .htm | Clean text extraction, script/style removal, configurable parsers |
| Markdown | .md, .markdown | Syntax removal, structure preservation, metadata extraction |
| EPUB | .epub | Chapter organization, table of contents, metadata handling |
| PDF | .pdf | Text extraction with OCR fallback, quality assessment |
| Plain Text | .txt | Direct processing with encoding detection |
Basic Usage¶
# Process a single file
llmbuilder data load --input document.pdf --output processed.txt
# Process a directory of mixed formats
llmbuilder data load --input documents/ --output dataset.txt --format all
# Enable OCR for scanned PDFs
llmbuilder data load --input scanned.pdf --output text.txt --enable-ocr
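The same ingestion can be scripted in Python by dispatching files to the per-format processor classes documented later in this guide. The sketch below is illustrative, not LLMBuilder's own pipeline: it assumes the constructor defaults and the process_file return value shown in the per-format examples, and it skips Markdown and plain-text files for brevity.
from pathlib import Path
from llmbuilder.data.ingest import HTMLProcessor, PDFProcessor, EPUBProcessor

# Map extensions to the processor classes described in the sections below.
processors = {
    ".html": HTMLProcessor(),
    ".htm": HTMLProcessor(),
    ".pdf": PDFProcessor(enable_ocr=True),
    ".epub": EPUBProcessor(),
}

with open("dataset.txt", "w", encoding="utf-8") as out:
    for path in Path("documents/").rglob("*"):
        processor = processors.get(path.suffix.lower())
        if processor is None:
            continue  # unsupported extension in this sketch
        out.write(processor.process_file(str(path)) + "\n")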
Configuration Options¶
The ingestion system can be configured through the configuration file:
{
"data": {
"ingestion": {
"supported_formats": ["html", "markdown", "epub", "pdf", "txt"],
"batch_size": 100,
"num_workers": 4,
"output_format": "txt",
"preserve_structure": false,
"extract_metadata": true,
"html_parser": "lxml",
"remove_scripts": true,
"remove_styles": true,
"enable_ocr": true,
"ocr_quality_threshold": 0.5,
"ocr_language": "eng",
"pdf_extraction_method": "auto",
"extract_toc": true,
"chapter_separator": "\n\n---\n\n"
}
}
}
HTML Processing¶
The HTML processor provides clean text extraction with configurable options:
from llmbuilder.data.ingest import HTMLProcessor
processor = HTMLProcessor(
parser='lxml', # html.parser, lxml, html5lib
remove_scripts=True,
remove_styles=True
)
text = processor.process_file('webpage.html')
Features:
- Multiple parser backends (html.parser, lxml, html5lib)
- Script and style tag removal
- Encoding detection and handling
- Malformed HTML recovery
PDF Processing with OCR¶
The PDF processor includes intelligent OCR fallback:
from llmbuilder.data.ingest import PDFProcessor
processor = PDFProcessor(
enable_ocr=True,
ocr_quality_threshold=0.5,
ocr_language='eng'
)
text = processor.process_file('document.pdf')
Features:
- Primary text extraction using PyMuPDF
- Quality assessment to trigger OCR
- OCR fallback using Tesseract
- Mixed content handling (text + scanned pages)
EPUB Processing¶
Extract text from EPUB files with chapter organization:
from llmbuilder.data.ingest import EPUBProcessor
processor = EPUBProcessor(
extract_toc=True,
chapter_separator='\n\n---\n\n'
)
text = processor.process_file('book.epub')
Features:
- Chapter-by-chapter extraction
- Table of contents handling
- Metadata preservation
- Multiple EPUB format support
Advanced Deduplication¶
Exact Deduplication¶
Remove exact duplicates using normalized hashing:
# Basic exact deduplication
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method exact
# With custom normalization
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method exact --normalize
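Conceptually, exact deduplication hashes a normalized form of each record and keeps only the first occurrence. The sketch below is illustrative rather than LLMBuilder's internal implementation; the lowercasing and whitespace collapsing mirror the kind of normalization the --normalize flag implies.
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not defeat exact matching.
    return re.sub(r"\s+", " ", text.strip().lower())

def exact_deduplicate(lines):
    seen = set()
    unique = []
    for line in lines:
        digest = hashlib.sha256(normalize(line).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(line)
    return unique

with open("dataset.txt", encoding="utf-8") as f:
    deduped = exact_deduplicate(f.readlines())
with open("clean_dataset.txt", "w", encoding="utf-8") as f:
    f.writelines(deduped)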
Semantic Deduplication¶
Detect near-duplicates using sentence embeddings:
# Semantic deduplication with default threshold
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method semantic
# Custom similarity threshold
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method semantic --threshold 0.9
# Use GPU for faster processing
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method semantic --use-gpu
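Under the hood, semantic deduplication embeds each record with a sentence-embedding model (the default configuration uses all-MiniLM-L6-v2) and drops records whose cosine similarity to an already-kept record exceeds the threshold. The sketch below shows the idea using the sentence-transformers library directly; it is not LLMBuilder's internal code, and its pairwise comparisons only suit small corpora.
from sentence_transformers import SentenceTransformer
import numpy as np

def semantic_deduplicate(texts, threshold=0.85):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(texts, convert_to_numpy=True)
    # Normalize so cosine similarity becomes a plain dot product.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept_idx, kept_vecs = [], []
    for i, vec in enumerate(embeddings):
        if kept_vecs and np.max(np.stack(kept_vecs) @ vec) >= threshold:
            continue  # too similar to a record we already kept
        kept_idx.append(i)
        kept_vecs.append(vec)
    return [texts[i] for i in kept_idx]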
Combined Deduplication¶
Use both exact and semantic deduplication:
llmbuilder data deduplicate --input dataset.txt --output clean_dataset.txt --method both --threshold 0.85
Configuration¶
Configure deduplication behavior:
{
"data": {
"deduplication": {
"enable_exact_deduplication": true,
"enable_semantic_deduplication": true,
"similarity_threshold": 0.85,
"embedding_model": "all-MiniLM-L6-v2",
"batch_size": 1000,
"chunk_size": 512,
"min_text_length": 50,
"normalize_text": true,
"use_gpu_for_embeddings": true,
"embedding_cache_size": 10000,
"similarity_metric": "cosine"
}
}
}
Flexible Tokenizer Training¶
Supported Algorithms¶
LLMBuilder supports multiple tokenization algorithms:
| Algorithm | Library | Best For |
|---|---|---|
| BPE | Hugging Face Tokenizers | General purpose, subword tokenization |
| SentencePiece | Google SentencePiece | Language-agnostic, handles any text |
| Unigram | Hugging Face Tokenizers | Probabilistic subword segmentation |
| WordPiece | Hugging Face Tokenizers | BERT-style tokenization |
Basic Training¶
# Train a BPE tokenizer
llmbuilder data tokenizer --input corpus.txt --output tokenizer/ --algorithm bpe --vocab-size 16000
# Train a SentencePiece tokenizer
llmbuilder data tokenizer --input corpus.txt --output tokenizer/ --algorithm sentencepiece --vocab-size 32000
# Train with custom special tokens
llmbuilder data tokenizer --input corpus.txt --output tokenizer/ --algorithm bpe --vocab-size 16000 --special-tokens "<pad>,<unk>,<s>,</s>,<mask>"
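The CLI builds on the standard tokenizer-training libraries listed in the table above. For orientation, this is roughly what BPE training looks like when done directly with the Hugging Face tokenizers library; the vocabulary size and special tokens match the examples above, and the output path is only illustrative.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=16000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
)
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save("tokenizer.json")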
Advanced Configuration¶
{
"tokenizer_training": {
"vocab_size": 32000,
"algorithm": "sentencepiece",
"min_frequency": 2,
"special_tokens": ["<pad>", "<unk>", "<s>", "</s>", "<mask>"],
"character_coverage": 0.9995,
"max_sentence_length": 4192,
"shuffle_input_sentence": true,
"normalization_rule_name": "nmt_nfkc_cf",
"remove_extra_whitespaces": true,
"add_dummy_prefix": true,
"continuing_subword_prefix": "##",
"end_of_word_suffix": "",
"num_threads": 4,
"max_training_time": 3600,
"validation_split": 0.1
}
}
Validation and Testing¶
Validate trained tokenizers:
# Test tokenizer on sample text
llmbuilder tokenizer test --tokenizer tokenizer/ --text "Hello world, this is a test."
# Benchmark tokenizer performance
llmbuilder tokenizer benchmark --tokenizer tokenizer/ --test-file test_corpus.txt
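If you prefer to inspect a trained SentencePiece model from Python rather than through the CLI, the standard sentencepiece API works on the .model file produced by training (the exact file name inside tokenizer/ is an assumption here).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="tokenizer/tokenizer.model")  # path is illustrative
pieces = sp.encode("Hello world, this is a test.", out_type=str)
ids = sp.encode("Hello world, this is a test.", out_type=int)
print(pieces)          # subword pieces
print(ids)             # corresponding token ids
print(sp.decode(ids))  # round-trip back to text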
GGUF Model Conversion¶
Basic Conversion¶
Convert trained models to GGUF format for inference:
# Basic conversion with Q8_0 quantization
llmbuilder convert gguf model/ --output model.gguf
# Specify quantization level
llmbuilder convert gguf model/ --output model.gguf --quantization Q4_0
# Enable validation
llmbuilder convert gguf model/ --output model.gguf --quantization Q4_0 --validate
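The --validate flag asks LLMBuilder to check the converted file. As an extra sanity check you can run yourself, every GGUF file begins with the ASCII magic bytes "GGUF"; the snippet below verifies only that header and is not a substitute for the built-in validation.
def looks_like_gguf(path: str) -> bool:
    # GGUF files start with the 4-byte ASCII magic "GGUF".
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

assert looks_like_gguf("model.gguf"), "not a GGUF file"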
Batch Conversion¶
Convert multiple models with different quantization levels:
# Convert with all quantization levels
llmbuilder convert gguf model/ --output model.gguf --all-quantizations
# Batch convert multiple models
llmbuilder convert batch --input-dir models/ --output-dir gguf_models/ --quantization Q8_0 Q4_0 Q4_1
Supported Quantization Levels¶
| Level | Description | Use Case |
|---|---|---|
| F32 | Full precision | Maximum quality, large size |
| F16 | Half precision | Good quality, moderate size |
| Q8_0 | 8-bit quantization | Balanced quality/size |
| Q5_1 | 5-bit quantization | Good compression |
| Q5_0 | 5-bit quantization | Better compression |
| Q4_1 | 4-bit quantization | High compression |
| Q4_0 | 4-bit quantization | Maximum compression |
Configuration¶
{
"gguf_conversion": {
"quantization_level": "Q4_0",
"validate_output": true,
"conversion_timeout": 3600,
"preferred_script": "auto",
"output_naming": "quantization_suffix",
"preserve_metadata": true
}
}
Configuration Management¶
Using Templates¶
LLMBuilder provides pre-configured templates for common use cases:
# List available templates
llmbuilder config templates
# Create config from template
llmbuilder config from-template basic_config --output my_config.json
# Create with overrides
llmbuilder config from-template basic_config --output my_config.json \
--override model.vocab_size=24000 \
--override training.batch_size=64
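Overrides use dotted keys that map onto the nested JSON structure. The sketch below shows how such a dotted override can be applied to a loaded config; it is illustrative only, not LLMBuilder's own loader.
import json

def apply_override(config: dict, dotted_key: str, value):
    # "model.vocab_size" with value 24000 -> config["model"]["vocab_size"] = 24000
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value

with open("my_config.json", encoding="utf-8") as f:
    config = json.load(f)
apply_override(config, "model.vocab_size", 24000)
apply_override(config, "training.batch_size", 64)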
Available Templates¶
| Template | Description | Best For |
|---|---|---|
| basic_config | Balanced configuration | General use, moderate resources |
| cpu_optimized_config | CPU-optimized settings | CPU-only training |
| advanced_processing_config | Full feature set | High-end GPU, large datasets |
| inference_optimized_config | Inference settings | Production deployment |
| large_scale_config | Large model training | High-end hardware, large models |
Configuration Validation¶
Validate configuration files:
# Basic validation
llmbuilder config validate my_config.json
# Detailed validation with summary
llmbuilder config validate my_config.json --detailed
# Show configuration summary
llmbuilder config summary my_config.json
End-to-End Workflows¶
Complete Data Processing Pipeline¶
# 1. Ingest multi-format documents
llmbuilder data load --input documents/ --output raw_text.txt --format all --clean
# 2. Deduplicate the dataset
llmbuilder data deduplicate --input raw_text.txt --output clean_text.txt --method both --threshold 0.85
# 3. Train a tokenizer
llmbuilder data tokenizer --input clean_text.txt --output tokenizer/ --algorithm sentencepiece --vocab-size 32000
# 4. Train the model (using existing training commands)
llmbuilder train model --data clean_text.txt --tokenizer tokenizer/ --output model/ --config advanced_config.json
# 5. Convert to GGUF format
llmbuilder convert gguf model/ --output model_q4.gguf --quantization Q4_0 --validate
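The same five steps can be scripted so the pipeline runs unattended. The sketch below simply shells out to the documented CLI commands and aborts on the first failure.
import subprocess

steps = [
    "llmbuilder data load --input documents/ --output raw_text.txt --format all --clean",
    "llmbuilder data deduplicate --input raw_text.txt --output clean_text.txt --method both --threshold 0.85",
    "llmbuilder data tokenizer --input clean_text.txt --output tokenizer/ --algorithm sentencepiece --vocab-size 32000",
    "llmbuilder train model --data clean_text.txt --tokenizer tokenizer/ --output model/ --config advanced_config.json",
    "llmbuilder convert gguf model/ --output model_q4.gguf --quantization Q4_0 --validate",
]
for step in steps:
    # check=True stops the pipeline if any step exits non-zero.
    subprocess.run(step.split(), check=True)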
Configuration-Driven Workflow¶
Create a comprehensive configuration file:
{
"data": {
"ingestion": {
"supported_formats": ["html", "markdown", "epub", "pdf"],
"enable_ocr": true,
"batch_size": 200
},
"deduplication": {
"enable_exact_deduplication": true,
"enable_semantic_deduplication": true,
"similarity_threshold": 0.9
}
},
"tokenizer_training": {
"algorithm": "sentencepiece",
"vocab_size": 32000
},
"gguf_conversion": {
"quantization_level": "Q4_0",
"validate_output": true
}
}
Then use it throughout the pipeline:
# All commands will use the same configuration
llmbuilder data load --config my_config.json --input docs/ --output dataset.txt
llmbuilder data deduplicate --config my_config.json --input dataset.txt --output clean.txt
llmbuilder data tokenizer --config my_config.json --input clean.txt --output tokenizer/
Performance Optimization¶
Memory Management¶
For large datasets, optimize memory usage:
{
"data": {
"ingestion": {
"batch_size": 50,
"num_workers": 2
},
"deduplication": {
"batch_size": 500,
"embedding_cache_size": 5000,
"use_gpu_for_embeddings": false
}
}
}
GPU Acceleration¶
Enable GPU acceleration where available:
{
"data": {
"deduplication": {
"use_gpu_for_embeddings": true,
"batch_size": 2000
}
},
"system": {
"device": "cuda",
"mixed_precision": true
}
}
Parallel Processing¶
Optimize for multi-core systems:
{
"data": {
"ingestion": {
"num_workers": 8
}
},
"tokenizer_training": {
"num_threads": 8
},
"system": {
"num_workers": 8
}
}
Troubleshooting¶
Common Issues¶
OCR Dependencies Missing
# Install Tesseract
sudo apt-get install tesseract-ocr # Ubuntu/Debian
brew install tesseract # macOS
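To confirm that the Tesseract binary is actually visible to your Python environment, you can query it through pytesseract; treating pytesseract as the installed binding is an assumption about your setup.
import pytesseract

try:
    print("Tesseract version:", pytesseract.get_tesseract_version())
except pytesseract.TesseractNotFoundError:
    print("Tesseract binary not found on PATH - install it as shown above")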
Memory Issues with Large Files
- Reduce batch sizes in configuration
- Use streaming processing for very large files
- Enable disk-based caching for embeddings
Slow Semantic Deduplication
- Enable GPU acceleration
- Reduce embedding model size
- Increase batch size
- Use embedding caching
GGUF Conversion Failures
- Ensure llama.cpp is installed and accessible
- Check model format compatibility
- Verify sufficient disk space
- Use longer conversion timeout
Debug Mode¶
Enable verbose logging for troubleshooting:
# Enable debug logging
export LLMBUILDER_LOG_LEVEL=DEBUG
# Run with verbose output
llmbuilder data load --input docs/ --output dataset.txt --verbose
Best Practices¶
Data Quality¶
- Always validate input data before processing
- Use appropriate OCR thresholds for PDF processing
- Configure deduplication carefully to avoid over-removal
- Monitor processing statistics for quality assessment
Performance¶
- Use GPU acceleration for semantic deduplication
- Optimize batch sizes for your hardware
- Enable parallel processing for multi-core systems
- Use appropriate quantization levels for your use case
Configuration Management¶
- Use templates as starting points
- Validate configurations before use
- Version control your configuration files
- Document custom settings and their rationale
Workflow Organization¶
- Process data incrementally for large datasets
- Keep intermediate results for debugging
- Use consistent naming conventions for outputs
- Monitor resource usage during processing
This comprehensive guide covers all aspects of LLMBuilder's advanced data processing capabilities. For specific API documentation, see the API Reference.