MetaPulsar Performance Guide

Performance Optimization

File Discovery Optimization

Batch Processing

Process multiple PTAs in batches to keep peak memory usage bounded:

# Good: Process in batches of a fixed size
ptas = ["epta_dr2", "ppta_dr2", "nanograv_15y"]
batch_size = 2
for batch in [ptas[i:i + batch_size] for i in range(0, len(ptas), batch_size)]:
    file_data = discovery.discover_all_files_in_ptas(batch)
    # Process batch...

# Avoid: Process all at once
file_data = discovery.discover_all_files_in_ptas(ptas)  # May use too much memory
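
If batching recurs in several places, a small generator keeps the slicing logic in one spot (Python 3.12+ also ships itertools.batched with the same behavior). This helper is a generic sketch, not part of the MetaPulsar API:

def batched(items, batch_size):
    """Yield consecutive fixed-size slices of a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

for batch in batched(ptas, 2):
    file_data = discovery.discover_all_files_in_ptas(batch)
    # Process batch...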

Specific PTA Selection

Use specific PTA names instead of discovering all PTAs:

# Good: Specific PTAs
file_data = discovery.discover_all_files_in_ptas(["epta_dr2", "ppta_dr2"])

# Avoid: All PTAs (slower)
file_data = discovery.discover_all_files_in_ptas(discovery.list_ptas())

Memory Management

Object Cleanup

Clean up large objects when no longer needed:

# Good: Clean up after use
metapulsar = factory.create_metapulsar(file_data, strategy="composite")
# Use metapulsar...
del metapulsar  # Free memory

# Or use context managers
with factory.create_metapulsar(file_data, strategy="composite") as metapulsar:
    # Use metapulsar...
    pass  # Automatically cleaned up
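
When many MetaPulsars are built in sequence, dropping each reference inside the loop keeps peak memory flat. A minimal sketch, assuming `batches` holds pre-batched file data as in the batching section above:

import gc

for batch in batches:
    metapulsar = factory.create_metapulsar(batch, strategy="composite")
    # Use metapulsar...
    del metapulsar  # Drop the reference before the next iteration

gc.collect()  # Reclaim anything held alive by reference cycles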

Data Type Optimization

Use appropriate data types to reduce memory usage:

# Good: Use float32 for large arrays when precision allows
timing_data = np.array(data, dtype=np.float32)

# Avoid: Default float64 for everything
timing_data = np.array(data)  # Uses more memory
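
The saving is easy to verify: a float32 array occupies half the bytes of a float64 array of the same shape.

import numpy as np

data = np.random.rand(1_000_000)  # float64 by default
print(f"float64: {data.nbytes / 1024 / 1024:.1f} MB")  # ~7.6 MB
print(f"float32: {data.astype(np.float32).nbytes / 1024 / 1024:.1f} MB")  # ~3.8 MB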

File I/O Optimization

Efficient File Formats

Use efficient file formats for large datasets:

# Good: Use HDF5 for large datasets
import h5py
with h5py.File('data.h5', 'w') as f:
    f.create_dataset('timing_data', data=timing_data, compression='gzip')

# Avoid: Plain text files for large data
np.savetxt('data.txt', timing_data)  # Slow and large
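
Reading the data back is symmetric; slicing with [:] loads the whole dataset, while a partial slice reads only the requested range from disk:

# Read the compressed dataset back
with h5py.File('data.h5', 'r') as f:
    timing_data = f['timing_data'][:]   # Whole array
    head = f['timing_data'][:1000]      # Partial read, no full load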

Caching

Cache frequently accessed data:

# Good: Cache discovery results
import functools

@functools.lru_cache(maxsize=128)
def discover_pta_files(pta_name):
    return discovery.discover_files_in_pta(pta_name)
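
Repeated calls with the same name are then served from memory; cache_info() confirms whether the cache is actually being hit:

discover_pta_files("epta_dr2")  # First call scans the file system
discover_pta_files("epta_dr2")  # Second call returns the cached result
print(discover_pta_files.cache_info())  # CacheInfo(hits=1, misses=1, ...)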

Algorithm Optimization

Vectorized Operations

Use NumPy vectorized operations instead of loops:

# Good: One vectorized call over the whole TOA array
residuals = calc_residuals_vectorized(toas)  # Fast

# Avoid: Python-level loops and list comprehensions
residuals = np.array([calc_residual(toa) for toa in toas])  # Slow
for i, toa in enumerate(toas):
    residuals[i] = calc_residual(toa)  # Equally slow
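
As a concrete illustration, whole-array NumPy arithmetic replaces the element-wise loop entirely; `predicted_toas` here is a hypothetical array of model-predicted arrival times:

import numpy as np

# Element-wise subtraction runs in compiled code, not the interpreter
residuals = toas - predicted_toas
rms = np.sqrt(np.mean(residuals ** 2))  # Reductions are vectorized too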

Parallel Processing

Use parallel processing for independent operations:

# Good: Parallel file discovery
from concurrent.futures import ThreadPoolExecutor

def discover_pta(pta_name):
    return discovery.discover_files_in_pta(pta_name)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(discover_pta, pta_names))
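
Threads suit file discovery because it is I/O-bound. For CPU-bound per-PTA work, Python's GIL serializes threads, so a process pool is the better fit; the interface is the same. `process_pta` here is a hypothetical CPU-heavy function:

from concurrent.futures import ProcessPoolExecutor

# CPU-bound work: separate processes sidestep the GIL.
# process_pta must be defined at module top level so it can be pickled.
with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(process_pta, pta_names))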

Performance Monitoring

Memory Usage

Monitor memory usage during processing:

import psutil
import os

def monitor_memory():
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    print(f"Memory usage: {memory_info.rss / 1024 / 1024:.1f} MB")

# Use before and after operations
monitor_memory()
metapulsar = factory.create_metapulsar(file_data)
monitor_memory()
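
For a finer-grained view of where allocations happen, the standard library's tracemalloc attributes memory to individual source lines:

import tracemalloc

tracemalloc.start()
metapulsar = factory.create_metapulsar(file_data)
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:5]:
    print(stat)  # Top five allocation sites by size
tracemalloc.stop()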

Timing

Measure execution time for optimization:

import time

start_time = time.perf_counter()  # Monotonic, high-resolution clock
metapulsar = factory.create_metapulsar(file_data)
end_time = time.perf_counter()

print(f"MetaPulsar creation took: {end_time - start_time:.2f} seconds")

Best Practices

  1. Profile First: Use profiling tools to identify bottlenecks (see the sketch after this list)

  2. Measure Changes: Always measure performance before and after optimizations

  3. Test with Real Data: Performance measured on synthetic or small test data may not carry over to full datasets

  4. Monitor Resources: Keep track of memory and CPU usage

  5. Use Appropriate Data Types: Choose data types based on precision requirements

  6. Cache When Possible: Cache expensive operations that are repeated

  7. Parallelize Independent Operations: Use parallel processing for independent tasks
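
For item 1, the standard library's cProfile is usually enough to locate hot spots; sorting by cumulative time shows which calls dominate:

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
metapulsar = factory.create_metapulsar(file_data)
profiler.disable()

pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)  # Top ten calls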

Common Performance Issues

Slow File Discovery

  • Cause: Too many files or slow file system

  • Solution: Use specific PTA names, cache results, or use faster storage

High Memory Usage

  • Cause: Large datasets or inefficient data types

  • Solution: Use appropriate data types, process in batches, clean up objects

Slow Parameter Processing

  • Cause: Inefficient parameter mapping or validation

  • Solution: Use vectorized operations, cache parameter mappings

Slow MetaPulsar Creation

  • Cause: Complex parameter consistency checks

  • Solution: Use the composite strategy if consistency checks are not needed, optimize parameter mapping