athena-file (1.3.3)

Published 2026-01-12 20:29:16 +00:00 by gravityfargo in gravityfargo/athena-file

Installation

pip install --index-url <registry-index-url> athena-file

Athena File

Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation.

Features

  • Fast File Hashing: BLAKE3 primary hashing with optional SHA256, MD5, and SHA1 support
  • Metadata Extraction: Automatic EXIF data extraction for images and videos
  • Deduplication: Identify duplicate files by hash or size
  • Thumbnail Generation: Automatic thumbnail creation for images, videos, and PDFs
  • Sharded Storage: Efficient JSON-based storage with automatic sharding for large archives
  • Batch Processing: Parallel file processing with progress tracking
  • Integrity Verification: Verify file integrity and detect corruption
  • Flexible Search: Search by filename, type, or custom metadata

Installation

pip install athena-file

System Dependencies

For thumbnail generation, install the following system packages:

# Ubuntu/Debian
sudo apt-get install exiftool poppler-utils ffmpeg

# macOS
brew install exiftool poppler ffmpeg

# Arch Linux
sudo pacman -S perl-image-exiftool poppler ffmpeg

Quick Start

from pathlib import Path
from athena_file import FileArchiver

# Initialize archiver with default settings
archiver = FileArchiver()

# Index a directory
report = archiver.scan_and_index(Path("/path/to/photos"), recursive=True)
print(f"Indexed {report.success_count} files in {report.elapsed_seconds:.2f}s")

# Find duplicates
duplicates = archiver.find_duplicates()
print(f"Found {len(duplicates.duplicate_groups)} groups of duplicates")

# Search for files
results = archiver.search_by_name("vacation")
for file in results:
    print(f"{file.path} - {file.size} bytes")

Examples

Basic File Indexing

from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Create archive with custom configuration
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    compute_all_hashes=True,  # Compute SHA256, MD5, SHA1 in addition to BLAKE3
    generate_thumbnails=True,  # Generate thumbnails for media files
)

archiver = FileArchiver(config)

# Index a specific directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
    file_pattern="*.jpg"  # Only process JPEG files
)

# Print summary
summary = report.get_summary()
print(f"Total files: {summary['total_files']}")
print(f"Success: {summary['successful']}")
print(f"Failed: {summary['failed']}")
print(f"Processing rate: {summary['files_per_second']:.2f} files/sec")

Working with File Metadata

from pathlib import Path
from athena_file import FileArchiver
from athena_file.main import FileWorker

archiver = FileArchiver()

# Hash and inspect a single file
file_path = Path("/data/image.jpg")
fw = FileWorker(file_path)

# Get detailed hash information
print(f"BLAKE3: {fw.blake3_hash}")
print(f"Fast hash: {fw.fast_hash}")
print(f"SHA256: {fw.sha256_hash}")

# Access EXIF data
if fw.exif_data:
    print(f"Camera: {fw.exif_data.get('Make')} {fw.exif_data.get('Model')}")
    print(f"Date taken: {fw.exif_data.get('DateTimeOriginal')}")

# Retrieve metadata from archive
metadata = archiver.get_file_metadata(fw.blake3_hash)
if metadata:
    print(f"File type: {metadata.file_type}")
    print(f"MIME type: {metadata.mime_type}")
    print(f"Indexed at: {metadata.indexed_at}")

Deduplication

from athena_file import FileArchiver

archiver = FileArchiver()

# Find exact duplicates by hash
duplicates = archiver.find_duplicates()

print(f"Found {duplicates.total_duplicates} duplicate files")
print(f"Wasted space: {duplicates.wasted_space / 1024**3:.2f} GB")

# Iterate through duplicate groups
for group in duplicates.duplicate_groups:
    print(f"\nHash: {group.hash}")
    print(f"Size: {group.size} bytes")
    print(f"Count: {group.count} files")
    for file_meta in group.files:
        print(f"  - {file_meta.path}")

# Find potential duplicates by size (faster but less accurate)
size_duplicates = archiver.find_duplicates_by_size(min_size=1024*1024)  # 1MB minimum
print(f"Found {len(size_duplicates.duplicate_groups)} groups with same size")

Thumbnail Generation

from pathlib import Path
from athena_file import ArchiveConfig
from athena_file.media import ThumbnailGenerator
from athena_file.main import FileWorker

# Configure thumbnail settings
config = ArchiveConfig(
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128), (64, 64)],
    thumbnail_format="webp",  # or "jpeg"
    generate_thumbnails=True,
)

config.ensure_directories()

# Generate thumbnails for a file
file_worker = FileWorker(Path("/data/photo.jpg"))
generator = ThumbnailGenerator(config, file_worker.blake3_hash)

# Generate all configured sizes
thumbnails = generator.generate_thumbnail(file_worker)

for size, thumb_path in thumbnails.items():
    print(f"{size}: {thumb_path}")

# Thumbnails are organized by unique_id:
# .athena_archive/thumbnails/
# └── [unique_id]/
#     ├── 256x256.webp
#     ├── 128x128.webp
#     └── 64x64.webp

File Verification

from athena_file import FileArchiver

archiver = FileArchiver()

# Verify all files in archive
report = archiver.verify_integrity(
    check_exists=True,      # Verify files still exist
    recompute_hashes=True   # Recompute and verify hashes
)

print(f"Total files: {report.total_files}")
print(f"Verified: {report.verified_count}")
print(f"Missing: {report.missing_count}")
print(f"Corrupted: {report.corrupted_count}")

# Check specific files
if report.missing_files:
    print("\nMissing files:")
    for file_meta in report.missing_files:
        print(f"  - {file_meta.path}")

if report.corrupted_files:
    print("\nCorrupted files:")
    for file_meta in report.corrupted_files:
        print(f"  - {file_meta.path}")

Search and Query

from athena_file import FileArchiver

archiver = FileArchiver()

# Search by filename pattern
results = archiver.search_by_name("vacation")
print(f"Found {len(results)} files matching 'vacation'")

# Search by file type
images = archiver.search_by_type("image")
videos = archiver.search_by_type("video")
documents = archiver.search_by_type("document")

print(f"Images: {len(images)}")
print(f"Videos: {len(videos)}")
print(f"Documents: {len(documents)}")

# Iterate through all files
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        if file_meta.size > 100 * 1024 * 1024:  # Files larger than 100MB
            print(f"Large file: {file_meta.path} - {file_meta.size / 1024**2:.2f} MB")

Archive Statistics

from athena_file import FileArchiver

archiver = FileArchiver()

# Get archive statistics
stats = archiver.get_stats()

print(f"Total files: {stats['total_files']}")
print(f"Storage shards: {stats['storage_shards']}")
print(f"Storage size: {stats['storage_size_mb']:.2f} MB")
print(f"Average shard size: {stats['average_shard_size_mb']:.2f} MB")

# Get storage-level details
storage_stats = archiver.storage.get_stats()
print(f"Files per shard: {storage_stats['files_per_shard']}")
print(f"Total storage bytes: {storage_stats['total_storage_bytes']}")

Batch Processing with Progress

from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Enable progress tracking
config = ArchiveConfig(
    enable_progress=True,
    progress_interval=1.0,  # Update every 1 second
    batch_size=100,         # Process 100 files per batch
    max_workers=4,          # Use 4 parallel workers
)

archiver = FileArchiver(config)

# Process large directory with progress
report = archiver.scan_and_index(
    Path("/data/large_archive"),
    recursive=True
)

print(f"Processed {report.file_count} files")
print(f"Success rate: {(report.success_count / report.file_count) * 100:.1f}%")
print(f"Time: {report.elapsed_seconds:.2f}s")

Custom Metadata

from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
from athena_file.metadata import FileMetadata
from athena_file.main import FileWorker

archiver = FileArchiver()

# Create metadata with custom fields
config = ArchiveConfig()
fw = FileWorker(Path("/data/photo.jpg"))
metadata = FileMetadata.from_file_worker(fw, config)

# Add custom metadata
metadata.custom_metadata["photographer"] = "Jane Doe"
metadata.custom_metadata["location"] = "Paris, France"
metadata.custom_metadata["tags"] = ["landscape", "travel", "europe"]

# Store in archive
archiver.storage.add_file(metadata)

# Retrieve and access custom metadata
retrieved = archiver.get_file_metadata(fw.blake3_hash)
print(retrieved.custom_metadata["photographer"])  # "Jane Doe"
print(retrieved.custom_metadata["tags"])          # ["landscape", "travel", "europe"]

Error Handling

from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Continue processing even if some files fail
config = ArchiveConfig(
    continue_on_error=True,
    max_retries=3,
)

archiver = FileArchiver(config)

report = archiver.scan_and_index(Path("/data/mixed_files"))

# Check for errors
if report.failure_count > 0:
    print(f"Failed to process {report.failure_count} files")

    # Get error details
    error_summary = archiver.batch_processor.get_error_summary()
    print(f"Error types: {error_summary}")

Configuration

ArchiveConfig Options

from pathlib import Path
from athena_file import ArchiveConfig

config = ArchiveConfig(
    # Storage settings
    storage_dir=Path(".athena_archive"),     # Archive storage directory
    shard_size=10000,                        # Files per shard

    # Hashing
    compute_all_hashes=False,                # Compute SHA256, MD5, SHA1

    # Thumbnails
    generate_thumbnails=False,               # Auto-generate thumbnails
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",                 # "webp" or "jpeg"

    # Processing
    batch_size=1000,                         # Files per batch
    max_workers=4,                           # Parallel workers
    continue_on_error=True,                  # Continue on failure
    max_retries=3,                           # Retry failed operations

    # Progress
    enable_progress=False,                   # Enable progress tracking
    progress_interval=1.0,                   # Progress update interval (seconds)
)

Architecture

Storage Structure

.athena_archive/
├── index.json              # Master index with shard mappings
├── metadata.json           # Archive-level metadata
├── shards/
│   ├── shard_0000.json    # Files 0-9999
│   ├── shard_0001.json    # Files 10000-19999
│   └── ...
└── thumbnails/
    ├── [unique_id_1]/
    │   ├── 256x256.webp
    │   └── 128x128.webp
    └── [unique_id_2]/
        └── ...
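
Because the catalog is plain JSON, it can be inspected without going through the library. A minimal sketch, assuming index.json and each shard file deserialize to ordinary JSON objects (the exact schema is an assumption; only the on-disk layout above is documented):

import json
from pathlib import Path

archive = Path(".athena_archive")

# Load the master index; treating it as a flat JSON object is an assumption.
index = json.loads((archive / "index.json").read_text())
print(f"Index entries: {len(index)}")

# Count records per shard. Each shard is assumed to hold one JSON
# object (or list) of file metadata records.
for shard_path in sorted((archive / "shards").glob("shard_*.json")):
    shard = json.loads(shard_path.read_text())
    print(f"{shard_path.name}: {len(shard)} records")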

File Metadata Schema

Each file in the archive stores:

  • Hashes: BLAKE3 (primary), fast hash, optional SHA256/MD5/SHA1
  • File attributes: size, mtime, ctime
  • Type information: MIME type, file category
  • EXIF data: Camera settings, GPS, timestamps (images/videos)
  • Thumbnails: Paths to generated thumbnails
  • Custom metadata: User-defined fields
  • Archive tracking: Indexed timestamp, verification timestamp
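
Put together, a stored record might look roughly like the sketch below. This is illustrative only: field names not already used in the examples above (hashes, exif_data, thumbnails, verified_at) are assumptions, not the package's actual schema.

# Illustrative record; nested field names and values are assumptions.
record = {
    "path": "/data/photos/IMG_1234.jpg",
    "size": 4821337,
    "hashes": {"blake3": "af13...", "fast": "9c2e...", "sha256": None},
    "file_type": "image",
    "mime_type": "image/jpeg",
    "exif_data": {"Make": "Canon", "DateTimeOriginal": "2025:07:04 12:00:00"},
    "thumbnails": {"256x256": ".athena_archive/thumbnails/af13.../256x256.webp"},
    "custom_metadata": {"photographer": "Jane Doe"},
    "indexed_at": "2026-01-12T20:29:16+00:00",
    "verified_at": None,
}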

Performance

Athena File is designed for large-scale archives:

  • Sharded storage: Prevents JSON files from growing too large
  • Parallel processing: Multi-threaded batch operations
  • Incremental indexing: Add files without reprocessing entire archive
  • Fast hashing: BLAKE3 is significantly faster than SHA256 (see the timing sketch below)
  • Generator-based discovery: Avoids loading all file paths into memory

Typical performance on modern hardware:

  • 1000-2000 files/second (indexing without thumbnails)
  • 200-500 files/second (indexing with thumbnails)
  • 5000+ files/second (duplicate detection by hash)
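
The BLAKE3-vs-SHA256 gap is easy to confirm on your own hardware. A self-contained timing sketch using the blake3 package and the standard library (throughput will vary by machine and buffer size):

import hashlib
import os
import time

import blake3

data = os.urandom(64 * 1024 * 1024)  # 64 MB of random bytes

start = time.perf_counter()
blake3.blake3(data).hexdigest()
b3_seconds = time.perf_counter() - start

start = time.perf_counter()
hashlib.sha256(data).hexdigest()
sha_seconds = time.perf_counter() - start

print(f"BLAKE3:  {64 / b3_seconds:.0f} MB/s")
print(f"SHA-256: {64 / sha_seconds:.0f} MB/s")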

Requirements

  • Python >= 3.13
  • Pillow >= 10.0.0
  • blake3 >= 1.0.8
  • pdf2image >= 1.17.0 (for PDF thumbnails)
  • pyexiftool >= 0.5.6 (for EXIF extraction)

Optional System Dependencies

  • exiftool: EXIF metadata extraction
  • poppler-utils: PDF thumbnail generation
  • ffmpeg: Video thumbnail generation
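
Since these tools are optional, it can be useful to check which binaries are on PATH before enabling thumbnail generation or EXIF extraction. A small sketch using only the standard library (pdftoppm is the binary that poppler-utils provides):

import shutil

# Map each optional feature to the external binary it relies on.
tools = {
    "EXIF extraction": "exiftool",
    "PDF thumbnails": "pdftoppm",  # provided by poppler-utils
    "Video thumbnails": "ffmpeg",
}

for feature, binary in tools.items():
    status = "found" if shutil.which(binary) else "missing"
    print(f"{feature}: {binary} ({status})")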

License

See LICENSE file for details.

Author

Nathan Price <nathan@modernleft.org>
