Code Style Guide¶
This guide outlines the coding standards and style conventions for LLMBuilder to ensure consistent, readable, and maintainable code.
Overview¶
LLMBuilder follows PEP 8 with some modifications and uses automated tools to enforce consistency:
- Black for code formatting
- isort for import sorting
- flake8 for style checking
- mypy for type checking
- pre-commit for automated checks
Automated Formatting¶
Setup¶
# Install development dependencies
pip install -e ".[dev]"
# Set up pre-commit hooks
pre-commit install
Usage¶
# Format code automatically
black llmbuilder/ tests/ examples/
isort llmbuilder/ tests/ examples/
# Check style
flake8 llmbuilder/ tests/
mypy llmbuilder/
# Run all pre-commit checks
pre-commit run --all-files
Python Code Style¶
General Principles¶
- Readability counts - Code is read more often than written
- Explicit is better than implicit - Be clear about intentions
- Simple is better than complex - Prefer straightforward solutions
- Consistency matters - Follow established patterns in the codebase
Naming Conventions¶
# Classes: PascalCase
class DataProcessor:
pass
class HTMLProcessor(DataProcessor):
pass
# Functions and methods: snake_case
def process_documents():
pass
def load_configuration_file():
pass
# Variables: snake_case
input_file = "data.txt"
processing_results = []
user_config = {}
# Constants: UPPER_CASE
MAX_BATCH_SIZE = 1000
DEFAULT_TIMEOUT = 30
SUPPORTED_FORMATS = ["html", "pdf", "txt"]
# Private attributes: leading underscore
class MyClass:
def __init__(self):
self._internal_state = {}
self.__private_data = [] # Name mangling for truly private
# Type variables: PascalCase with T suffix
from typing import TypeVar
T = TypeVar('T')
ConfigT = TypeVar('ConfigT', bound='BaseConfig')
Function and Method Design¶
def process_file(
input_path: str,
output_path: str,
config: Optional[ProcessingConfig] = None,
*, # Force keyword-only arguments after this
validate_output: bool = True,
progress_callback: Optional[Callable[[int], None]] = None
) -> ProcessingResult:
"""
Process a single file with the specified configuration.
Args:
input_path: Path to the input file
output_path: Path where processed output will be saved
config: Processing configuration (uses default if None)
validate_output: Whether to validate the output file
progress_callback: Optional callback for progress updates
Returns:
ProcessingResult with statistics and status information
Raises:
FileNotFoundError: If input_path doesn't exist
PermissionError: If unable to write to output_path
ProcessingError: If processing fails
Example:
>>> config = ProcessingConfig(batch_size=100)
>>> result = process_file("input.txt", "output.txt", config)
>>> print(f"Processed {result.items_count} items")
"""
# Implementation here
pass
Class Design¶
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Protocol, runtime_checkable
# Use dataclasses for simple data containers
@dataclass
class ProcessingResult:
"""Result of a processing operation."""
success: bool
items_processed: int
processing_time: float
error_message: Optional[str] = None
def __post_init__(self):
"""Validate the result after initialization."""
if self.items_processed < 0:
raise ValueError("items_processed cannot be negative")
# Use ABC for base classes with abstract methods
class BaseProcessor(ABC):
"""Base class for all document processors."""
def __init__(self, config: ProcessorConfig):
self.config = config
self._stats = ProcessingStats()
@abstractmethod
def process_file(self, file_path: str) -> ProcessingResult:
"""Process a single file."""
pass
@property
def stats(self) -> ProcessingStats:
"""Get processing statistics."""
return self._stats
# Use Protocol for structural typing
@runtime_checkable
class Configurable(Protocol):
"""Protocol for objects that can be configured."""
def configure(self, config: Dict[str, Any]) -> None:
"""Configure the object with the given settings."""
...
def get_config(self) -> Dict[str, Any]:
"""Get current configuration."""
...
Error Handling¶
# Define custom exceptions
class LLMBuilderError(Exception):
"""Base exception for LLMBuilder."""
pass
class ProcessingError(LLMBuilderError):
"""Raised when processing fails."""
pass
class ConfigurationError(LLMBuilderError):
"""Raised when configuration is invalid."""
pass
# Use specific exception handling
def process_document(file_path: str) -> ProcessingResult:
"""Process a document with proper error handling."""
try:
# Validate input
if not Path(file_path).exists():
raise FileNotFoundError(f"File not found: {file_path}")
# Process file
result = _do_processing(file_path)
# Validate result
if not result.success:
raise ProcessingError(f"Processing failed: {result.error_message}")
return result
except FileNotFoundError:
# Re-raise file system errors as-is
raise
except ProcessingError:
# Re-raise processing errors as-is
raise
except Exception as e:
# Wrap unexpected errors
raise ProcessingError(f"Unexpected error processing {file_path}: {e}") from e
# Use context managers for resource management
from contextlib import contextmanager
@contextmanager
def processing_context(config: ProcessingConfig):
"""Context manager for processing operations."""
logger.info("Starting processing context")
try:
# Setup
setup_processing(config)
yield
except Exception as e:
logger.error(f"Processing failed: {e}")
raise
finally:
# Cleanup
cleanup_processing()
logger.info("Processing context closed")
Type Hints¶
from typing import (
Any, Dict, List, Optional, Union, Tuple, Callable,
TypeVar, Generic, Protocol, Literal, overload
)
from pathlib import Path
# Use specific types
def load_config(config_path: Union[str, Path]) -> Dict[str, Any]:
"""Load configuration from file."""
pass
# Use generics for reusable code
T = TypeVar('T')
class Cache(Generic[T]):
"""Generic cache implementation."""
def __init__(self) -> None:
self._data: Dict[str, T] = {}
def get(self, key: str) -> Optional[T]:
"""Get item from cache."""
return self._data.get(key)
def set(self, key: str, value: T) -> None:
"""Set item in cache."""
self._data[key] = value
# Use Literal for constrained values
ProcessingMode = Literal["fast", "accurate", "balanced"]
def set_processing_mode(mode: ProcessingMode) -> None:
"""Set the processing mode."""
pass
# Use overload for different signatures
@overload
def process_data(data: str) -> str: ...
@overload
def process_data(data: List[str]) -> List[str]: ...
def process_data(data: Union[str, List[str]]) -> Union[str, List[str]]:
"""Process data with different input types."""
if isinstance(data, str):
return _process_single(data)
else:
return [_process_single(item) for item in data]
Documentation¶
def complex_processing_function(
input_data: List[Dict[str, Any]],
config: ProcessingConfig,
*,
parallel: bool = True,
max_workers: Optional[int] = None,
progress_callback: Optional[Callable[[float], None]] = None
) -> ProcessingResult:
"""
Process complex data with configurable parallelization.
This function processes a list of data items according to the provided
configuration. It supports parallel processing for improved performance
on multi-core systems.
Args:
input_data: List of data items to process. Each item should be a
dictionary with required keys 'id' and 'content'.
config: Processing configuration specifying how to handle the data.
Must include 'batch_size' and 'processing_mode' settings.
parallel: Whether to use parallel processing. Defaults to True.
Set to False for debugging or when processing order matters.
max_workers: Maximum number of worker processes. If None, uses
the number of CPU cores. Ignored if parallel=False.
progress_callback: Optional callback function that receives progress
updates as a float between 0.0 and 1.0.
Returns:
ProcessingResult containing:
- success: Whether all items were processed successfully
- items_processed: Number of items successfully processed
- processing_time: Total processing time in seconds
- error_message: Error description if success=False
- detailed_stats: Dictionary with per-item processing statistics
Raises:
ValueError: If input_data is empty or contains invalid items
ConfigurationError: If config is missing required settings
ProcessingError: If processing fails for any item
Example:
Basic usage:
>>> data = [{"id": "1", "content": "text1"}, {"id": "2", "content": "text2"}]
>>> config = ProcessingConfig(batch_size=10, processing_mode="fast")
>>> result = complex_processing_function(data, config)
>>> print(f"Processed {result.items_processed} items in {result.processing_time:.2f}s")
With progress tracking:
>>> def progress_handler(progress: float):
... print(f"Progress: {progress:.1%}")
>>> result = complex_processing_function(
... data, config, progress_callback=progress_handler
... )
Sequential processing for debugging:
>>> result = complex_processing_function(data, config, parallel=False)
Note:
When using parallel processing, ensure that your data items are
independent and don't require shared state. The function uses
multiprocessing, so large data items may impact performance due
to serialization overhead.
See Also:
- ProcessingConfig: Configuration options
- ProcessingResult: Return value structure
- simple_processing_function: For basic use cases
"""
# Implementation here
pass
Import Organization¶
# Standard library imports (alphabetical)
import json
import logging
import os
import sys
from pathlib import Path
from typing import Any, Dict, List, Optional
# Third-party imports (alphabetical)
import click
import numpy as np
import torch
from transformers import AutoTokenizer
# Local imports (alphabetical)
from llmbuilder.config import Config, ModelConfig
from llmbuilder.data import DataLoader, TextProcessor
from llmbuilder.utils import setup_logging
# Relative imports (if needed, but prefer absolute)
from .base import BaseProcessor
from .utils import validate_input
Testing Style¶
Test Structure¶
import pytest
import tempfile
from pathlib import Path
from unittest.mock import Mock, patch, MagicMock
from llmbuilder.data.processor import DocumentProcessor
from llmbuilder.config import ProcessingConfig
class TestDocumentProcessor:
"""Test cases for DocumentProcessor."""
def setup_method(self):
"""Set up test environment before each test."""
self.temp_dir = Path(tempfile.mkdtemp())
self.config = ProcessingConfig(batch_size=10)
self.processor = DocumentProcessor(self.config)
def teardown_method(self):
"""Clean up after each test."""
import shutil
shutil.rmtree(self.temp_dir, ignore_errors=True)
def test_process_valid_document_returns_success(self):
"""Test that processing a valid document returns success."""
# Arrange
document_content = "This is a test document with valid content."
document_file = self.temp_dir / "test.txt"
document_file.write_text(document_content)
# Act
result = self.processor.process_file(str(document_file))
# Assert
assert result.success is True
assert result.items_processed == 1
assert result.processing_time > 0
assert result.error_message is None
def test_process_nonexistent_file_raises_file_not_found(self):
"""Test that processing non-existent file raises FileNotFoundError."""
# Arrange
nonexistent_file = str(self.temp_dir / "nonexistent.txt")
# Act & Assert
with pytest.raises(FileNotFoundError, match="File not found"):
self.processor.process_file(nonexistent_file)
@pytest.mark.parametrize("content,expected_items", [
("Single line", 1),
("Line 1\nLine 2", 2),
("", 0),
("Line 1\n\nLine 3", 2), # Empty lines ignored
])
def test_process_file_counts_items_correctly(self, content, expected_items):
"""Test that file processing counts items correctly."""
# Arrange
test_file = self.temp_dir / "test.txt"
test_file.write_text(content)
# Act
result = self.processor.process_file(str(test_file))
# Assert
assert result.items_processed == expected_items
def test_process_file_with_mock_dependencies(self):
"""Test file processing with mocked dependencies."""
# Arrange
test_file = self.temp_dir / "test.txt"
test_file.write_text("test content")
with patch.object(self.processor, '_validate_content') as mock_validate:
mock_validate.return_value = True
# Act
result = self.processor.process_file(str(test_file))
# Assert
assert result.success is True
mock_validate.assert_called_once_with("test content")
@pytest.fixture
def sample_documents(self):
"""Provide sample documents for testing."""
documents = []
for i in range(3):
doc_file = self.temp_dir / f"doc_{i}.txt"
doc_file.write_text(f"Document {i} content")
documents.append(str(doc_file))
return documents
def test_batch_processing_with_fixture(self, sample_documents):
"""Test batch processing using fixture."""
# Act
results = self.processor.process_batch(sample_documents)
# Assert
assert len(results) == 3
assert all(result.success for result in results)
Test Naming¶
# Good test names - describe what is being tested
def test_process_valid_html_file_extracts_text_content():
"""Test that HTML processor extracts text from valid HTML."""
pass
def test_process_empty_file_returns_empty_result():
"""Test that processing empty file returns appropriate result."""
pass
def test_process_file_with_invalid_encoding_raises_encoding_error():
"""Test that invalid encoding raises appropriate error."""
pass
# Bad test names - not descriptive
def test_process(): # What does it process? What's expected?
pass
def test_html(): # What about HTML?
pass
def test_error(): # What error? When?
pass
Configuration and Data Classes¶
from dataclasses import dataclass, field
from typing import List, Dict, Any, Optional
@dataclass
class ProcessingConfig:
"""Configuration for document processing operations."""
# Required fields first
batch_size: int
output_format: str
# Optional fields with defaults
max_workers: int = 4
timeout_seconds: int = 300
enable_validation: bool = True
# Complex defaults using field()
supported_formats: List[str] = field(
default_factory=lambda: ["txt", "html", "pdf"]
)
processing_options: Dict[str, Any] = field(default_factory=dict)
def __post_init__(self):
"""Validate configuration after initialization."""
if self.batch_size <= 0:
raise ValueError("batch_size must be positive")
if self.output_format not in ["txt", "json", "xml"]:
raise ValueError(f"Unsupported output format: {self.output_format}")
if self.max_workers <= 0:
raise ValueError("max_workers must be positive")
@classmethod
def from_dict(cls, data: Dict[str, Any]) -> 'ProcessingConfig':
"""Create configuration from dictionary."""
return cls(**data)
def to_dict(self) -> Dict[str, Any]:
"""Convert configuration to dictionary."""
from dataclasses import asdict
return asdict(self)
CLI Style¶
import click
from typing import Optional
@click.group()
@click.version_option()
@click.option('--verbose', '-v', is_flag=True, help='Enable verbose output')
@click.option('--config', '-c', type=click.Path(exists=True), help='Configuration file path')
@click.pass_context
def main(ctx: click.Context, verbose: bool, config: Optional[str]):
"""
LLMBuilder - A comprehensive toolkit for language model development.
Use 'llmbuilder COMMAND --help' for more information on specific commands.
"""
ctx.ensure_object(dict)
ctx.obj['verbose'] = verbose
ctx.obj['config'] = config
if verbose:
click.echo("Verbose mode enabled")
@main.command()
@click.argument('input_file', type=click.Path(exists=True))
@click.argument('output_file', type=click.Path())
@click.option('--format', 'output_format',
type=click.Choice(['txt', 'json', 'xml']),
default='txt',
help='Output format')
@click.option('--batch-size', default=100,
help='Processing batch size')
@click.option('--workers', default=4,
help='Number of worker processes')
@click.option('--dry-run', is_flag=True,
help='Show what would be done without executing')
@click.pass_context
def process(ctx: click.Context, input_file: str, output_file: str,
output_format: str, batch_size: int, workers: int, dry_run: bool):
"""Process documents with specified options."""
if ctx.obj['verbose']:
click.echo(f"Processing {input_file} -> {output_file}")
click.echo(f"Format: {output_format}, Batch: {batch_size}, Workers: {workers}")
if dry_run:
click.echo("DRY RUN: Would process file with these settings")
return
try:
# Processing logic here
with click.progressbar(length=100, label='Processing') as bar:
# Simulate progress
for i in range(100):
# Do work
bar.update(1)
click.echo(f"✅ Successfully processed {input_file}")
except Exception as e:
click.echo(f"❌ Processing failed: {e}", err=True)
ctx.exit(1)
Performance Considerations¶
# Use generators for memory efficiency
def process_large_file(file_path: str) -> Iterator[ProcessedItem]:
"""Process large file without loading everything into memory."""
with open(file_path, 'r') as f:
for line_num, line in enumerate(f, 1):
if line.strip(): # Skip empty lines
yield ProcessedItem(
line_number=line_num,
content=line.strip(),
processed_at=datetime.now()
)
# Use appropriate data structures
from collections import defaultdict, deque
from typing import DefaultDict
# For counting/grouping
item_counts: DefaultDict[str, int] = defaultdict(int)
# For FIFO operations
processing_queue: deque[ProcessingTask] = deque()
# Use context managers for resources
@contextmanager
def managed_processor(config: ProcessingConfig):
"""Context manager for processor lifecycle."""
processor = None
try:
processor = create_processor(config)
processor.initialize()
yield processor
finally:
if processor:
processor.cleanup()
Pre-commit Configuration¶
The project uses pre-commit hooks to enforce code quality:
# .pre-commit-config.yaml
repos:
- repo: https://github.com/psf/black
rev: 23.12.1
hooks:
- id: black
args: [--line-length=88]
- repo: https://github.com/pycqa/isort
rev: 5.13.2
hooks:
- id: isort
args: [--profile=black]
- repo: https://github.com/pycqa/flake8
rev: 7.0.0
hooks:
- id: flake8
args: [--max-line-length=88, --extend-ignore=E203,W503]
IDE Configuration¶
VS Code Settings¶
{
"python.formatting.provider": "black",
"python.linting.enabled": true,
"python.linting.flake8Enabled": true,
"python.linting.mypyEnabled": true,
"editor.formatOnSave": true,
"editor.codeActionsOnSave": {
"source.organizeImports": true
},
"python.testing.pytestEnabled": true
}
Summary¶
Following these style guidelines ensures:
- Consistency across the codebase
- Readability for all contributors
- Maintainability over time
- Quality through automated checks
The automated tools handle most formatting concerns, allowing you to focus on writing great code. When in doubt, run the pre-commit hooks to catch any style issues before committing.
For questions about code style, check existing code for examples or ask in GitHub Discussions.