# MetaPulsar Performance Guide

## Performance Optimization

### File Discovery Optimization
#### Batch Processing

Process multiple PTAs in batches to bound memory usage:

```python
# Good: process in batches of two
ptas = ["epta_dr2", "ppta_dr2", "nanograv_15y"]
for batch in [ptas[i:i + 2] for i in range(0, len(ptas), 2)]:
    file_data = discovery.discover_all_files_in_ptas(batch)
    # Process batch...

# Avoid: processing everything at once
file_data = discovery.discover_all_files_in_ptas(ptas)  # May use too much memory
```
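The slicing pattern above can be factored into a small reusable helper that works for any iterable; this is a minimal sketch independent of the discovery API (the `batched` helper is not part of MetaPulsar):

```python
from itertools import islice

def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while batch := list(islice(it, batch_size)):
        yield batch

ptas = ["epta_dr2", "ppta_dr2", "nanograv_15y"]
batches = list(batched(ptas, 2))  # [["epta_dr2", "ppta_dr2"], ["nanograv_15y"]]
```

From Python 3.12 onward, `itertools.batched` provides the same behavior (yielding tuples).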
#### Specific PTA Selection

Name the PTAs you need instead of discovering all of them:

```python
# Good: specific PTAs only
file_data = discovery.discover_all_files_in_ptas(["epta_dr2", "ppta_dr2"])

# Avoid: discovering every PTA (slower)
file_data = discovery.discover_all_files_in_ptas(discovery.list_ptas())
```
## Memory Management

### Object Cleanup

Release large objects as soon as they are no longer needed:

```python
# Good: delete after use
metapulsar = factory.create_metapulsar(file_data, strategy="composite")
# Use metapulsar...
del metapulsar  # Drop the reference so the memory can be reclaimed

# Or use a context manager (if MetaPulsar supports the context-manager protocol)
with factory.create_metapulsar(file_data, strategy="composite") as metapulsar:
    # Use metapulsar...
    pass  # Cleaned up automatically on exit
```
### Data Type Optimization

Choose data types that match your precision requirements:

```python
import numpy as np

# Good: float32 for large arrays when single precision suffices
timing_data = np.array(data, dtype=np.float32)

# Avoid: defaulting to float64 for everything
timing_data = np.array(data)  # float64 by default; twice the memory
```
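The saving is easy to verify with NumPy's `nbytes` attribute; this minimal sketch compares the same array in both precisions:

```python
import numpy as np

data64 = np.zeros(1_000_000)        # float64 by default: 8 bytes per element
data32 = data64.astype(np.float32)  # single precision: 4 bytes per element

print(data64.nbytes // 1_000_000, "MB vs", data32.nbytes // 1_000_000, "MB")
```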
## File I/O Optimization

### Efficient File Formats

Prefer binary, compressed formats for large datasets:

```python
# Good: HDF5 with compression for large datasets
import h5py

with h5py.File('data.h5', 'w') as f:
    f.create_dataset('timing_data', data=timing_data, compression='gzip')

# Avoid: plain text for large data
np.savetxt('data.txt', timing_data)  # Slow to read/write and much larger on disk
```
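The size gap between binary and text formats can be verified without HDF5; this sketch compares NumPy's binary `.npy` format against `savetxt` on synthetic data:

```python
import os
import tempfile

import numpy as np

timing_data = np.random.default_rng(0).normal(size=10_000)

with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "data.txt")
    npy_path = os.path.join(tmp, "data.npy")
    np.savetxt(txt_path, timing_data)  # ~25 text bytes per value
    np.save(npy_path, timing_data)     # 8 binary bytes per value + small header
    txt_size = os.path.getsize(txt_path)
    npy_size = os.path.getsize(npy_path)

print(f"text: {txt_size} bytes, binary: {npy_size} bytes")
```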
### Caching

Cache results that are expensive to compute and frequently reused:

```python
import functools

# Good: cache discovery results per PTA
@functools.lru_cache(maxsize=128)
def discover_pta_files(pta_name):
    return discovery.discover_files_in_pta(pta_name)
```
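The cache can be inspected and invalidated through the wrapper that `lru_cache` adds; this self-contained sketch uses a stand-in function because the discovery object is not available here:

```python
import functools

@functools.lru_cache(maxsize=128)
def discover_pta_files(pta_name):
    # Stand-in for discovery.discover_files_in_pta (hypothetical here)
    return f"files for {pta_name}"

discover_pta_files("epta_dr2")  # first call: computed (cache miss)
discover_pta_files("epta_dr2")  # second call: served from cache (hit)
info = discover_pta_files.cache_info()
print(info)
```

If the underlying files can change on disk, call `discover_pta_files.cache_clear()` to drop stale entries.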
## Algorithm Optimization

### Vectorized Operations

Use NumPy vectorized operations instead of Python loops:

```python
# Good: one vectorized call over the whole array
residuals = calc_residuals_vectorized(toas)  # Fast

# Avoid: Python-level loops or per-element calls
residuals = np.array([calc_residual(toa) for toa in toas])  # Slow
for i, toa in enumerate(toas):
    residuals[i] = calc_residual(toa)
```
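A self-contained comparison with a toy linear model (the model function is purely illustrative) shows that both styles produce identical results, while the vectorized form runs as a single NumPy expression:

```python
import numpy as np

toas = np.linspace(0.0, 1.0, 1_000)

def model(t):
    # Hypothetical timing model; any NumPy-compatible function works
    return 2.0 * t + 1.0

# Python loop: one call per element
residuals_loop = np.empty_like(toas)
for i, toa in enumerate(toas):
    residuals_loop[i] = toa - model(toa)

# Vectorized: one expression over the whole array
residuals_vec = toas - model(toas)
```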
### Parallel Processing

Use parallel processing for independent operations; threads suit I/O-bound work such as file discovery:

```python
from concurrent.futures import ThreadPoolExecutor

def discover_pta(pta_name):
    return discovery.discover_files_in_pta(pta_name)

with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(discover_pta, pta_names))
```
## Performance Monitoring

### Memory Usage

Monitor memory usage during processing (requires the third-party `psutil` package):

```python
import os

import psutil

def monitor_memory():
    process = psutil.Process(os.getpid())
    memory_info = process.memory_info()
    print(f"Memory usage: {memory_info.rss / 1024 / 1024:.1f} MB")

# Call before and after expensive operations
monitor_memory()
metapulsar = factory.create_metapulsar(file_data)
monitor_memory()
```
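If `psutil` is not available, the standard library's `tracemalloc` can track Python-level allocations instead; a minimal sketch:

```python
import tracemalloc

tracemalloc.start()
buffer = bytearray(10_000_000)  # allocate ~10 MB so there is something to measure
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB")
```

Note that `tracemalloc` only sees allocations made through the Python allocator, so it reports less than the process-level RSS that `psutil` shows.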
### Timing

Measure execution time to verify optimizations; `time.perf_counter()` is the recommended clock for interval timing:

```python
import time

start_time = time.perf_counter()
metapulsar = factory.create_metapulsar(file_data)
elapsed = time.perf_counter() - start_time
print(f"MetaPulsar creation took: {elapsed:.2f} seconds")
```
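The start/stop pattern can be wrapped in a small reusable context manager built on `time.perf_counter` (a sketch; `Timer` is not part of MetaPulsar):

```python
import time

class Timer:
    """Context manager that records the wall-clock time of its block."""

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, *exc_info):
        self.elapsed = time.perf_counter() - self.start

with Timer() as t:
    time.sleep(0.01)  # stand-in for factory.create_metapulsar(file_data)

print(f"Block took {t.elapsed:.3f} seconds")
```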
## Best Practices

- **Profile first**: use profiling tools to identify actual bottlenecks before optimizing
- **Measure changes**: benchmark before and after every optimization
- **Test with real data**: performance characteristics often differ between synthetic and real datasets
- **Monitor resources**: keep track of memory and CPU usage
- **Use appropriate data types**: choose precision based on actual requirements
- **Cache when possible**: cache expensive operations that are repeated
- **Parallelize independent operations**: use parallel processing where tasks do not depend on each other
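"Profile first" can be done entirely with the standard library; this sketch profiles a toy workload with `cProfile` and prints the hottest entries (the `workload` function is illustrative — profile your real pipeline instead):

```python
import cProfile
import io
import pstats

def workload():
    # Illustrative CPU-bound work
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```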
## Common Performance Issues

### Slow File Discovery

- **Cause**: too many files or a slow file system
- **Solution**: use specific PTA names, cache results, or move data to faster storage

### High Memory Usage

- **Cause**: large datasets or inefficient data types
- **Solution**: use appropriate data types, process in batches, and clean up objects promptly

### Slow Parameter Processing

- **Cause**: inefficient parameter mapping or validation
- **Solution**: use vectorized operations and cache parameter mappings

### Slow MetaPulsar Creation

- **Cause**: complex parameter-consistency checks
- **Solution**: use the composite strategy if cross-PTA consistency is not needed, and optimize parameter mapping