Athena-File
Large-scale file archive management system with JSON-based cataloging, intelligent deduplication, multi-format support, and thumbnail generation. Designed to efficiently handle 1M+ files.
Features
- Scalable Architecture: Sharded JSON storage handles 1M+ files with < 100 MB memory
- Fast Indexing: High-throughput batch processing with parallel execution
- Intelligent Deduplication: Find duplicate files by content hash in seconds
- Multi-Format Support: Automatic detection for images, videos, archives, documents
- Thumbnail Generation: Create thumbnails for images and videos (Pillow + ffmpeg)
- Plugin System: Extensible architecture for custom file type handlers
- Integrity Verification: Hash-based verification with corruption detection
- Comprehensive API: Python API for all archive operations
Quick Start
Installation
# Install with uv
uv pip install athena-file
# Or install from source
git clone <repository-url>
cd athena-file
uv sync
System Requirements
- Python 3.13+
- exiftool (for EXIF metadata extraction)
- ffmpeg (optional, for video thumbnails)
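Since exiftool and ffmpeg are external binaries, it can be worth confirming they are on the PATH before indexing. A minimal pre-flight check using only the standard library (a convenience snippet, not part of athena-file):
import shutil

# Hypothetical pre-flight check; not part of the athena-file API.
for tool, purpose in [("exiftool", "EXIF extraction"), ("ffmpeg", "video thumbnails")]:
    found = shutil.which(tool)
    print(f"{tool:<8} ({purpose}): {'OK' if found else 'missing'}")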
Basic Usage
from pathlib import Path
from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig
# Configure archive
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    default_hash_algo="blake3",
    batch_size=1000,
)
# Create archiver
archiver = FileArchiver(config)
# Index a directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
)
print(f"Indexed {report.file_count:,} files in {report.elapsed_seconds:.1f}s")
print(f"Total size: {report.total_size / (1024**3):.2f} GB")
# Find duplicates
dup_report = archiver.find_duplicates()
print(f"Found {dup_report.duplicate_groups_count} groups of duplicates")
print(f"Wasted space: {dup_report.total_wasted_space / (1024**2):.1f} MB")
# Verify integrity
verification = archiver.verify_integrity(
    check_exists=True,
    recompute_hashes=True,
)
print(f"Valid: {verification.valid_count}")
print(f"Corrupted: {verification.corrupted_count}")
Performance
- Fast: BLAKE3 hashing with parallel batch processing
- Memory Efficient: < 100 MB memory usage even with large archives
- Scalable: Sharded storage with O(1) hash lookups
- Optimized: Thread pool execution for multi-core systems
See PERFORMANCE.md for analysis and optimization tips.
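To check these claims against your own data, the report returned by scan_and_index already carries what a quick benchmark needs. A sketch using only the report fields shown in Basic Usage (the directory path is a placeholder):
from pathlib import Path

from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

archiver = FileArchiver(ArchiveConfig(max_workers=8))
report = archiver.scan_and_index(Path("/data/photos"), recursive=True)

# Derive throughput from the report fields used in Basic Usage.
files_per_sec = report.file_count / max(report.elapsed_seconds, 1e-9)
mb_per_sec = report.total_size / (1024**2) / max(report.elapsed_seconds, 1e-9)
print(f"{files_per_sec:,.0f} files/s, {mb_per_sec:,.1f} MB/s")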
Architecture
Sharded JSON Storage
Instead of a single large JSON file, athena-file uses sharded storage:
.athena_archive/
├── index.json          # Master index (~1-2 MB)
├── metadata.json       # Archive metadata
└── shards/
    ├── shard_0000.json # 10,000 files (~1 MB)
    ├── shard_0001.json
    └── ...
Benefits:
- Memory efficient: Only loads active shard (~1 MB)
- Fast lookups: O(1) hash-based index
- Scalable: Handles 1M+ files easily
- Human-readable: JSON format for debugging
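One way such a layout can support O(1) lookups: the master index maps each content hash to a shard id, so a query consults index.json (kept in memory in practice) plus exactly one ~1 MB shard. A minimal sketch of that idea, assuming a hash-to-shard-id index; athena-file's actual on-disk schema may differ:
import json
from pathlib import Path

def lookup(storage_dir: Path, file_hash: str) -> dict | None:
    # Master index: content hash -> shard id (assumed layout, matching the
    # tree above but not necessarily athena-file's exact schema).
    index = json.loads((storage_dir / "index.json").read_text())
    shard_id = index.get(file_hash)
    if shard_id is None:
        return None
    # Load only the single ~1 MB shard that holds this record.
    shard_file = storage_dir / "shards" / f"shard_{shard_id:04d}.json"
    return json.loads(shard_file.read_text()).get(file_hash)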
Plugin System
Extensible file type handlers:
from athena_file.plugins import FileTypeHandler, HandlerRegistry

class CustomHandler(FileTypeHandler):
    def can_handle(self, file_path, mime_type):
        return file_path.suffix == ".custom"

    def extract_metadata(self, fw):
        return {"custom_field": "value"}

registry = HandlerRegistry()
registry.register(CustomHandler())
Built-in handlers for:
- Images: JPEG, PNG, GIF, WEBP, BMP, TIFF
- Videos: MP4, AVI, MOV, MKV, WEBM
- Archives: ZIP, TAR, GZ, BZ2, XZ
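For a slightly fuller handler built on the same two methods, here is a hypothetical plugin for .log files. The fw.path attribute is an assumption, since the worker's interface is only partially documented below:
from athena_file.plugins import FileTypeHandler, HandlerRegistry

class LogHandler(FileTypeHandler):
    # Hypothetical handler for plain-text log files.
    def can_handle(self, file_path, mime_type):
        return file_path.suffix == ".log"

    def extract_metadata(self, fw):
        # Assumes the worker exposes its path as fw.path (not confirmed
        # by the FileWorker reference below).
        text = fw.path.read_text(errors="replace")
        return {"line_count": text.count("\n")}

registry = HandlerRegistry()
registry.register(LogHandler())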
Documentation
- ARCHIVE_GUIDE.md - Comprehensive user guide with examples
- PERFORMANCE.md - Performance analysis and benchmarks
- PROJECT_SUMMARY.md - Implementation details and architecture
Testing
# Run all tests (fast, ~5 seconds)
make test
# Run with large-scale tests (10k+ files, slower)
pytest -m largescale
# Run with slow tests
pytest -m slow
Test Coverage: 130 tests, 77% code coverage, 100% pass rate
Use Cases
Photo Library Management
config = ArchiveConfig(
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",
)
archiver = FileArchiver(config)
archiver.scan_and_index(Path("/photos"), recursive=True)
Backup Verification
# Index backup
archiver.scan_and_index(Path("/backup"), recursive=True)
# Export checksums
archiver.export_checksums(Path("backup_checksums.blake3"))
# Later: verify integrity
verification = archiver.verify_integrity(recompute_hashes=True)
if verification.corrupted_count > 0:
    print("⚠ Backup corruption detected!")
Deduplication Analysis
dup_report = archiver.find_duplicates()
# Get largest duplicates
for group in dup_report.get_largest_groups(n=10):
    print(f"\n{group.count} copies ({group.wasted_space / 1024**2:.1f} MB wasted):")
    for file in group.files:
        print(f"  - {file.path}")
Configuration Options
config = ArchiveConfig(
    # Storage
    storage_dir=Path(".athena_archive"),
    shard_size=10000,            # Files per shard
    compression=True,            # Enable gzip compression

    # Hashing
    default_hash_algo="blake3",  # blake3, sha256, md5, sha1
    compute_all_hashes=False,    # Only compute the default hash

    # Performance
    batch_size=1000,             # Files per batch
    max_workers=4,               # Thread pool size

    # Thumbnails
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",     # webp, jpg, png
    thumbnail_quality=85,

    # Error Handling
    retry_attempts=3,
    continue_on_error=True,

    # Progress
    enable_progress=True,
    progress_interval=100,
)
API Reference
FileArchiver
Main API for archive operations:
- scan_and_index(directory, recursive=True) - Index a directory
- verify_integrity(check_exists=True, recompute_hashes=False) - Verify files
- find_duplicates() - Find duplicate files
- export_checksums(path, hash_type="blake3") - Export checksums
- search_by_name(pattern) - Search by filename
- search_by_type(file_type) - Search by file type
- get_stats() - Get archive statistics
- get_file_metadata(hash) - Get metadata for a specific file
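A short session exercising the search and stats calls above. It assumes search results are records with a path attribute (as in the deduplication example) and that search_by_name takes a glob-like pattern; get_stats's return structure is not specified here, so it is printed as-is:
from pathlib import Path

from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

archiver = FileArchiver(ArchiveConfig(storage_dir=Path(".athena_archive")))

# Find files by name pattern and by detected type.
for record in archiver.search_by_name("*.jpg"):
    print(record.path)
for record in archiver.search_by_type("video"):
    print(record.path)

# Archive-wide statistics (structure not documented here).
print(archiver.get_stats())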
FileWorker
Low-level file operations (original API, maintained for backward compatibility):
- blake3_hash - BLAKE3 hash
- sha256_hash - SHA-256 hash
- md5_hash - MD5 hash
- sha1_hash - SHA-1 hash
- fast_hash - Fast hash for quick comparisons
- exif_data - EXIF metadata
- copy(destination) - Copy file
- move(destination) - Move file
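A sketch of typical FileWorker usage. The import path and the path-based constructor are assumptions; the hash and EXIF names are taken from the list above:
from pathlib import Path

from athena_file.worker import FileWorker  # import path is an assumption

fw = FileWorker(Path("/data/photos/IMG_0001.jpg"))  # constructor assumed

print(fw.blake3_hash)  # content hash, as used for dedup and verification
print(fw.fast_hash)    # cheaper hash for quick comparisons
print(fw.exif_data)    # EXIF metadata extracted via exiftool

# Copy the file, leaving the original in place.
fw.copy(Path("/backup/photos/IMG_0001.jpg"))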
Dependencies
Production
- blake3>=1.0.8 - Fast cryptographic hashing
- pyexiftool>=0.5.6 - EXIF metadata extraction
- Pillow>=10.0.0 - Image processing and thumbnails
Development
- pytest>=9.0.2 - Testing framework
- pytest-cov>=7.0.0 - Coverage reporting
- mypy>=1.19.1 - Static type checking
- ruff>=0.14.10 - Linting and formatting
Contributing
Contributions are welcome! Please ensure:
- All tests pass (make test)
- Code coverage remains > 75%
- Code follows ruff formatting
- New features include tests and documentation
License
See LICENSE file for details.
References
- blake3-py - BLAKE3 Python bindings
- pyexiftool - ExifTool Python wrapper
- Pillow - Python Imaging Library
Project Status
✓ Production Ready - All 5 implementation phases complete:
- Core Infrastructure
- Batch Operations & Utils
- Archive Operations
- Detection, Plugins & Thumbnails
- Performance Testing & Optimization
The system handles archives of 1M+ files within the memory and throughput targets described above.