athena-file (1.0.0)

Published 2026-01-10 00:11:18 +00:00 by gravityfargo in gravityfargo/athena-file

Installation

pip install --index-url <index-url> athena-file

About this package

Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation

Athena-File

Large-scale file archive management system with JSON-based cataloging, intelligent deduplication, multi-format support, and thumbnail generation. Designed to efficiently handle 1M+ files.

Features

  • Scalable Architecture: Sharded JSON storage handles 1M+ files with < 100 MB memory
  • Fast Indexing: High-throughput batch processing with parallel execution
  • Intelligent Deduplication: Find duplicate files by content hash in seconds
  • Multi-Format Support: Automatic detection for images, videos, archives, documents
  • Thumbnail Generation: Create thumbnails for images and videos (Pillow + ffmpeg)
  • Plugin System: Extensible architecture for custom file type handlers
  • Integrity Verification: Hash-based verification with corruption detection
  • Comprehensive API: Python API for all archive operations

Quick Start

Installation

# Install with uv
uv pip install athena-file

# Or install from source
git clone <repository-url>
cd athena-file
uv sync

System Requirements

  • Python 3.13+
  • exiftool (for EXIF metadata extraction)
  • ffmpeg (optional, for video thumbnails)
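
Before indexing, it can help to confirm the external tools are on PATH. A minimal check using only the standard library (this helper is not part of athena-file):

import shutil

# exiftool is required for EXIF extraction; ffmpeg is only needed for video thumbnails
for tool, required in [("exiftool", True), ("ffmpeg", False)]:
    if shutil.which(tool) is None:
        level = "ERROR" if required else "WARNING"
        print(f"{level}: {tool} not found on PATH")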

Basic Usage

from pathlib import Path
from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

# Configure archive
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    default_hash_algo="blake3",
    batch_size=1000,
)

# Create archiver
archiver = FileArchiver(config)

# Index a directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True
)

print(f"Indexed {report.file_count:,} files in {report.elapsed_seconds:.1f}s")
print(f"Total size: {report.total_size / (1024**3):.2f} GB")

# Find duplicates
dup_report = archiver.find_duplicates()
print(f"Found {dup_report.duplicate_groups_count} groups of duplicates")
print(f"Wasted space: {dup_report.total_wasted_space / (1024**2):.1f} MB")

# Verify integrity
verification = archiver.verify_integrity(
    check_exists=True,
    recompute_hashes=True
)
print(f"Valid: {verification.valid_count}")
print(f"Corrupted: {verification.corrupted_count}")

Performance

  • Fast: BLAKE3 hashing with parallel batch processing
  • Memory Efficient: < 100 MB memory usage even with large archives
  • Scalable: Sharded storage with O(1) hash lookups
  • Optimized: Thread pool execution for multi-core systems

See PERFORMANCE.md for analysis and optimization tips.
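
To illustrate the batching approach (a simplified sketch, not the library's internal implementation), files can be hashed in parallel with BLAKE3 and a thread pool:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from blake3 import blake3

def hash_file(path: Path) -> tuple[Path, str]:
    # Stream in 1 MB chunks so memory stays flat regardless of file size
    hasher = blake3()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            hasher.update(chunk)
    return path, hasher.hexdigest()

files = [p for p in Path("/data/photos").rglob("*") if p.is_file()]
with ThreadPoolExecutor(max_workers=4) as pool:
    for path, digest in pool.map(hash_file, files):
        print(digest, path)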

Architecture

Sharded JSON Storage

Instead of a single large JSON file, athena-file uses sharded storage:

.athena_archive/
├── index.json              # Master index (~1-2 MB)
├── metadata.json           # Archive metadata
└── shards/
    ├── shard_0000.json    # 10,000 files (~1 MB)
    ├── shard_0001.json
    └── ...

Benefits:

  • Memory efficient: Only loads active shard (~1 MB)
  • Fast lookups: O(1) hash-based index
  • Scalable: Handles 1M+ files easily
  • Human-readable: JSON format for debugging
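
A simplified sketch of how a sharded index can be queried (illustrative only; the actual field names and layout inside index.json are defined by athena-file and may differ):

import json
from pathlib import Path

storage = Path(".athena_archive")

# Assumption for this sketch: the master index maps content hash -> shard number
index = json.loads((storage / "index.json").read_text())

def lookup(file_hash: str) -> dict | None:
    shard_id = index.get(file_hash)  # O(1) dictionary lookup
    if shard_id is None:
        return None
    # Only the single shard holding the entry is loaded (~1 MB)
    shard_path = storage / "shards" / f"shard_{shard_id:04d}.json"
    return json.loads(shard_path.read_text()).get(file_hash)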

Plugin System

Extensible file type handlers:

from athena_file.plugins import FileTypeHandler, HandlerRegistry

class CustomHandler(FileTypeHandler):
    def can_handle(self, file_path, mime_type):
        # Claim files with the .custom extension
        return file_path.suffix == ".custom"

    def extract_metadata(self, fw):
        # fw: the file object for the matched file
        return {"custom_field": "value"}

registry = HandlerRegistry()
registry.register(CustomHandler())

Built-in handlers for:

  • Images: JPEG, PNG, GIF, WEBP, BMP, TIFF
  • Videos: MP4, AVI, MOV, MKV, WEBM
  • Archives: ZIP, TAR, GZ, BZ2, XZ
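
The handler from the example above can be exercised directly (how the registry dispatches to registered handlers is internal to athena-file):

from pathlib import Path

handler = CustomHandler()
print(handler.can_handle(Path("report.custom"), "application/octet-stream"))  # True
print(handler.can_handle(Path("photo.jpg"), "image/jpeg"))                    # False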

Testing

# Run all tests (fast, ~5 seconds)
make test

# Run with large-scale tests (10k+ files, slower)
pytest -m largescale

# Run with slow tests
pytest -m slow

Test Coverage: 130 tests, 77% code coverage, 100% pass rate

Use Cases

Photo Library Management

config = ArchiveConfig(
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp"
)
archiver = FileArchiver(config)
archiver.scan_and_index(Path("/photos"), recursive=True)

Backup Verification

# Index backup
archiver.scan_and_index(Path("/backup"), recursive=True)

# Export checksums
archiver.export_checksums(Path("backup_checksums.blake3"))

# Later: verify integrity
verification = archiver.verify_integrity(recompute_hashes=True)
if verification.corrupted_count > 0:
    print("⚠ Backup corruption detected!")

Deduplication Analysis

dup_report = archiver.find_duplicates()

# Get largest duplicates
for group in dup_report.get_largest_groups(n=10):
    print(f"\n{group.count} copies ({group.wasted_space / 1024**2:.1f} MB wasted):")
    for file in group.files:
        print(f"  - {file.path}")

Configuration Options

config = ArchiveConfig(
    # Storage
    storage_dir=Path(".athena_archive"),
    shard_size=10000,              # Files per shard
    compression=True,               # Enable gzip compression

    # Hashing
    default_hash_algo="blake3",     # blake3, sha256, md5, sha1
    compute_all_hashes=False,       # Only compute default hash

    # Performance
    batch_size=1000,                # Files per batch
    max_workers=4,                  # Thread pool size

    # Thumbnails
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",        # webp, jpg, png
    thumbnail_quality=85,

    # Error Handling
    retry_attempts=3,
    continue_on_error=True,

    # Progress
    enable_progress=True,
    progress_interval=100,
)

API Reference

FileArchiver

Main API for archive operations:

  • scan_and_index(directory, recursive=True) - Index directory
  • verify_integrity(check_exists=True, recompute_hashes=False) - Verify files
  • find_duplicates() - Find duplicate files
  • export_checksums(path, hash_type="blake3") - Export checksums
  • search_by_name(pattern) - Search by filename
  • search_by_type(file_type) - Search by type
  • get_stats() - Get archive statistics
  • get_file_metadata(hash) - Get specific file metadata
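
Putting a few of these calls together (only fields already shown elsewhere in this README are accessed; the search pattern and type names are assumptions):

stats = archiver.get_stats()
matches = archiver.search_by_name("*.jpg")   # pattern syntax assumed to be glob-style
images = archiver.search_by_type("image")    # type name assumed

dup_report = archiver.find_duplicates()
print(f"Duplicate groups: {dup_report.duplicate_groups_count}")

archiver.export_checksums(Path("checksums.blake3"), hash_type="blake3")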

FileWorker

Low-level file operations (original API, maintained for backward compatibility):

  • blake3_hash - BLAKE3 hash
  • sha256_hash - SHA256 hash
  • md5_hash - MD5 hash
  • sha1_hash - SHA1 hash
  • fast_hash - Fast hash for quick comparisons
  • exif_data - EXIF metadata
  • copy(destination) - Copy file
  • move(destination) - Move file
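
A hedged sketch of FileWorker usage, assuming the class is importable from the package and is constructed from a file path (neither the import path nor the constructor signature is documented in this README):

from pathlib import Path
from athena_file import FileWorker  # import path assumed

fw = FileWorker(Path("/data/photos/IMG_0001.jpg"))  # constructor assumed to take a path
print(fw.blake3_hash)  # BLAKE3 content hash
print(fw.exif_data)    # EXIF metadata
fw.copy(Path("/backup/IMG_0001.jpg"))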

Dependencies

Production

  • blake3>=1.0.8 - Fast cryptographic hashing
  • pyexiftool>=0.5.6 - EXIF metadata extraction
  • Pillow>=10.0.0 - Image processing and thumbnails

Development

  • pytest>=9.0.2 - Testing framework
  • pytest-cov>=7.0.0 - Coverage reporting
  • mypy>=1.19.1 - Static type checking
  • ruff>=0.14.10 - Linting and formatting

Contributing

Contributions are welcome! Please ensure:

  • All tests pass (make test)
  • Code coverage remains > 75%
  • Code follows ruff formatting
  • New features include tests and documentation

License

See LICENSE file for details.

Project Status

Production Ready - All 5 implementation phases complete:

  1. Core Infrastructure
  2. Batch Operations & Utils
  3. Archive Operations
  4. Detection, Plugins & Thumbnails
  5. Performance Testing & Optimization

System successfully handles 1M+ file archives with excellent performance characteristics.

Requirements

Requires Python: >=3.13