Data Processing API Reference

Core Classes

IngestionPipeline

from llmbuilder.data.ingest import IngestionPipeline

# config is loaded with ConfigManager.load_config (see Configuration below)
pipeline = IngestionPipeline(config)
pipeline.process_directory("data/raw/")

DeduplicationPipeline

from llmbuilder.data.dedup import DeduplicationPipeline

dedup = DeduplicationPipeline(config)
dedup.process_files(["file1.txt", "file2.txt"])
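
The call above passes process_files a list of file paths. As a minimal sketch, assuming the inputs are plain .txt files under a single directory, that list could be built with the standard library. The helper name collect_text_files is hypothetical and not part of llmbuilder:

```python
from pathlib import Path


def collect_text_files(root, pattern="*.txt"):
    """Return sorted paths (as strings) of files under root matching pattern."""
    root_path = Path(root)
    if not root_path.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # rglob walks subdirectories too; sorting keeps the order stable between runs.
    return sorted(str(p) for p in root_path.rglob(pattern) if p.is_file())
```

The result could then be handed to the call shown above, for example dedup.process_files(collect_text_files("data/cleaned/")), where data/cleaned/ is an illustrative path only.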

TokenizerTrainer

from llmbuilder.training.tokenizer import TokenizerTrainer

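# corpus_files and output_dir are placeholders for your corpus file paths
# and the directory that receives the trained tokenizer
trainer = TokenizerTrainer(config)
trainer.train(corpus_files, output_dir)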

GGUFConverter

from llmbuilder.tools.convert_to_gguf import GGUFConverter

# model_path and output_path are placeholders for the source model
# and the GGUF output location
converter = GGUFConverter()
converter.convert(model_path, output_path)

Configuration

The config object passed to IngestionPipeline, DeduplicationPipeline, and TokenizerTrainer above is loaded through the unified config system:

from llmbuilder.config import ConfigManager

config = ConfigManager.load_config("config.json")
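
Because load_config reads a JSON file, a small pre-flight check can catch a missing or malformed file before any pipeline is built. The sketch below uses only the standard library. The helper name check_config_file is hypothetical, and the check assumes only that the file holds a JSON object at the top level; it makes no claim about which keys llmbuilder expects:

```python
import json
from pathlib import Path


def check_config_file(path):
    """Raise a descriptive error unless path is a readable JSON object."""
    config_path = Path(path)
    if not config_path.is_file():
        raise FileNotFoundError(f"config file not found: {config_path}")
    with config_path.open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on malformed JSON
    # Assumption: the config is a JSON object (dict), not a list or scalar.
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object at the top level of {config_path}")
    return data
```

Calling check_config_file("config.json") before ConfigManager.load_config("config.json") surfaces file problems early with a clear message.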