Athena-File
Large-scale file archive management system with JSON-based cataloging, intelligent deduplication, multi-format support, and thumbnail generation. Designed to efficiently handle 1M+ files.
Features
- Scalable Architecture: Sharded JSON storage handles 1M+ files with < 100 MB memory
- Fast Indexing: High-throughput batch processing with parallel execution
- Intelligent Deduplication: Find duplicate files by content hash in seconds
- Multi-Format Support: Automatic detection for images, videos, archives, documents
- Thumbnail Generation: Create thumbnails for images and videos (Pillow + ffmpeg)
- Plugin System: Extensible architecture for custom file type handlers
- Integrity Verification: Hash-based verification with corruption detection
- Comprehensive API: Python API for all archive operations
Quick Start
Installation
# Install with uv
uv pip install athena-file
# Or install from source
git clone <repository-url>
cd athena-file
uv sync
System Requirements
- Python 3.13+
- exiftool (for EXIF metadata extraction)
- ffmpeg (optional, for video thumbnails)
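Since exiftool and ffmpeg are external binaries, it can be worth confirming they are on the PATH before indexing. A minimal pre-flight check using only the standard library (a convenience snippet, not part of athena-file):
import shutil

# Hypothetical pre-flight check; not part of the athena-file API.
for tool, purpose in [("exiftool", "EXIF extraction"), ("ffmpeg", "video thumbnails")]:
    found = shutil.which(tool)
    print(f"{tool:<8} ({purpose}): {'OK' if found else 'missing'}")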
Basic Usage
from pathlib import Path
from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig
# Configure archive
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    default_hash_algo="blake3",
    batch_size=1000,
)
# Create archiver
archiver = FileArchiver(config)
# Index a directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
)
print(f"Indexed {report.file_count:,} files in {report.elapsed_seconds:.1f}s")
print(f"Total size: {report.total_size / (1024**3):.2f} GB")
# Find duplicates
dup_report = archiver.find_duplicates()
print(f"Found {dup_report.duplicate_groups_count} groups of duplicates")
print(f"Wasted space: {dup_report.total_wasted_space / (1024**2):.1f} MB")
# Verify integrity
verification = archiver.verify_integrity(
    check_exists=True,
    recompute_hashes=True,
)
print(f"Valid: {verification.valid_count}")
print(f"Corrupted: {verification.corrupted_count}")
Performance
- Fast: BLAKE3 hashing with parallel batch processing
- Memory Efficient: < 100 MB memory usage even with large archives
- Scalable: Sharded storage with O(1) hash lookups
- Optimized: Thread pool execution for multi-core systems
See PERFORMANCE.md for analysis and optimization tips.
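To check these claims against your own data, the report returned by scan_and_index already carries what a quick benchmark needs. A sketch using only the report fields shown in Basic Usage (the directory path is a placeholder):
from pathlib import Path

from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

archiver = FileArchiver(ArchiveConfig(max_workers=8))
report = archiver.scan_and_index(Path("/data/photos"), recursive=True)

# Derive throughput from the report fields used in Basic Usage.
files_per_sec = report.file_count / max(report.elapsed_seconds, 1e-9)
mb_per_sec = report.total_size / (1024**2) / max(report.elapsed_seconds, 1e-9)
print(f"{files_per_sec:,.0f} files/s, {mb_per_sec:,.1f} MB/s")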
Architecture
Sharded JSON Storage
Instead of a single large JSON file, athena-file uses sharded storage:
.athena_archive/
├── index.json          # Master index (~1-2 MB)
├── metadata.json       # Archive metadata
└── shards/
    ├── shard_0000.json # 10,000 files (~1 MB)
    ├── shard_0001.json
    └── ...
Benefits:
- Memory efficient: Only loads active shard (~1 MB)
- Fast lookups: O(1) hash-based index
- Scalable: Handles 1M+ files easily
- Human-readable: JSON format for debugging
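One way such a layout can support O(1) lookups: the master index maps each content hash to a shard id, so a query consults index.json (kept in memory in practice) plus exactly one ~1 MB shard. A minimal sketch of that idea, assuming a hash-to-shard-id index; athena-file's actual on-disk schema may differ:
import json
from pathlib import Path

def lookup(storage_dir: Path, file_hash: str) -> dict | None:
    # Master index: content hash -> shard id (assumed layout, matching the
    # tree above but not necessarily athena-file's exact schema).
    index = json.loads((storage_dir / "index.json").read_text())
    shard_id = index.get(file_hash)
    if shard_id is None:
        return None
    # Load only the single ~1 MB shard that holds this record.
    shard_file = storage_dir / "shards" / f"shard_{shard_id:04d}.json"
    return json.loads(shard_file.read_text()).get(file_hash)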
Plugin System
Extensible file type handlers:
from athena_file.plugins import FileTypeHandler, HandlerRegistry

class CustomHandler(FileTypeHandler):
    def can_handle(self, file_path, mime_type):
        return file_path.suffix == ".custom"

    def extract_metadata(self, fw):
        return {"custom_field": "value"}

registry = HandlerRegistry()
registry.register(CustomHandler())
Built-in handlers for:
- Images: JPEG, PNG, GIF, WEBP, BMP, TIFF
- Videos: MP4, AVI, MOV, MKV, WEBM
- Archives: ZIP, TAR, GZ, BZ2, XZ
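For a slightly fuller handler built on the same two methods, here is a hypothetical plugin for .log files. The fw.path attribute is an assumption, since the worker's interface is only partially documented below:
from athena_file.plugins import FileTypeHandler, HandlerRegistry

class LogHandler(FileTypeHandler):
    # Hypothetical handler for plain-text log files.
    def can_handle(self, file_path, mime_type):
        return file_path.suffix == ".log"

    def extract_metadata(self, fw):
        # Assumes the worker exposes its path as fw.path (not confirmed
        # by the FileWorker reference below).
        text = fw.path.read_text(errors="replace")
        return {"line_count": text.count("\n")}

registry = HandlerRegistry()
registry.register(LogHandler())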
Documentation
- ARCHIVE_GUIDE.md - Comprehensive user guide with examples
- PERFORMANCE.md - Performance analysis and benchmarks
- PROJECT_SUMMARY.md - Implementation details and architecture
Testing
# Run all tests (fast, ~5 seconds)
make test
# Run with large-scale tests (10k+ files, slower)
pytest -m largescale
# Run with slow tests
pytest -m slow
Test Coverage: 130 tests, 77% code coverage, 100% pass rate
Use Cases
Photo Library Management
config = ArchiveConfig(
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",
)
archiver = FileArchiver(config)
archiver.scan_and_index(Path("/photos"), recursive=True)
Backup Verification
# Index backup
archiver.scan_and_index(Path("/backup"), recursive=True)
# Export checksums
archiver.export_checksums(Path("backup_checksums.blake3"))
# Later: verify integrity
verification = archiver.verify_integrity(recompute_hashes=True)
if verification.corrupted_count > 0:
    print("⚠ Backup corruption detected!")
Deduplication Analysis
dup_report = archiver.find_duplicates()
# Get largest duplicates
for group in dup_report.get_largest_groups(n=10):
    print(f"\n{group.count} copies ({group.wasted_space / 1024**2:.1f} MB wasted):")
    for file in group.files:
        print(f"  - {file.path}")
Configuration Options
config = ArchiveConfig(
    # Storage
    storage_dir=Path(".athena_archive"),
    shard_size=10000,            # Files per shard
    compression=True,            # Enable gzip compression

    # Hashing
    default_hash_algo="blake3",  # blake3, sha256, md5, sha1
    compute_all_hashes=False,    # Only compute the default hash

    # Performance
    batch_size=1000,             # Files per batch
    max_workers=4,               # Thread pool size

    # Thumbnails
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",     # webp, jpg, png
    thumbnail_quality=85,

    # Error Handling
    retry_attempts=3,
    continue_on_error=True,

    # Progress
    enable_progress=True,
    progress_interval=100,
)
API Reference
FileArchiver
Main API for archive operations:
- scan_and_index(directory, recursive=True) - Index a directory
- verify_integrity(check_exists=True, recompute_hashes=False) - Verify files
- find_duplicates() - Find duplicate files
- export_checksums(path, hash_type="blake3") - Export checksums
- search_by_name(pattern) - Search by filename
- search_by_type(file_type) - Search by file type
- get_stats() - Get archive statistics
- get_file_metadata(hash) - Get metadata for a specific file
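A short session exercising the search and stats calls above. It assumes search results are records with a path attribute (as in the deduplication example) and that search_by_name takes a glob-like pattern; get_stats's return structure is not specified here, so it is printed as-is:
from pathlib import Path

from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

archiver = FileArchiver(ArchiveConfig(storage_dir=Path(".athena_archive")))

# Find files by name pattern and by detected type.
for record in archiver.search_by_name("*.jpg"):
    print(record.path)
for record in archiver.search_by_type("video"):
    print(record.path)

# Archive-wide statistics (structure not documented here).
print(archiver.get_stats())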
FileWorker
Low-level file operations (original API, maintained for backward compatibility):
- blake3_hash - BLAKE3 hash
- sha256_hash - SHA-256 hash
- md5_hash - MD5 hash
- sha1_hash - SHA-1 hash
- fast_hash - Fast hash for quick comparisons
- exif_data - EXIF metadata
- copy(destination) - Copy file
- move(destination) - Move file
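A sketch of typical FileWorker usage. The import path and the path-based constructor are assumptions; the hash and EXIF names are taken from the list above:
from pathlib import Path

from athena_file.worker import FileWorker  # import path is an assumption

fw = FileWorker(Path("/data/photos/IMG_0001.jpg"))  # constructor assumed

print(fw.blake3_hash)  # content hash, as used for dedup and verification
print(fw.fast_hash)    # cheaper hash for quick comparisons
print(fw.exif_data)    # EXIF metadata extracted via exiftool

# Copy the file, leaving the original in place.
fw.copy(Path("/backup/photos/IMG_0001.jpg"))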
Dependencies
Production
- blake3>=1.0.8 - Fast cryptographic hashing
- pyexiftool>=0.5.6 - EXIF metadata extraction
- Pillow>=10.0.0 - Image processing and thumbnails
Development
- pytest>=9.0.2 - Testing framework
- pytest-cov>=7.0.0 - Coverage reporting
- mypy>=1.19.1 - Static type checking
- ruff>=0.14.10 - Linting and formatting
Contributing
Contributions are welcome! Please ensure:
- All tests pass (make test)
- Code coverage remains > 75%
- Code follows ruff formatting
- New features include tests and documentation
License
See LICENSE file for details.
References
- blake3-py - BLAKE3 Python bindings
- pyexiftool - ExifTool Python wrapper
- Pillow - Python Imaging Library
Project Status
✓ Production Ready - All 5 implementation phases complete:
- Core Infrastructure
- Batch Operations & Utils
- Archive Operations
- Detection, Plugins & Thumbnails
- Performance Testing & Optimization
The system handles archives of 1M+ files within the memory and throughput targets described above.