athena-file (1.3.3)
Athena File
Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation.
Features
- Fast File Hashing: BLAKE3 primary hashing with optional SHA256, MD5, and SHA1 support
- Metadata Extraction: Automatic EXIF data extraction for images and videos
- Deduplication: Identify duplicate files by hash or size
- Thumbnail Generation: Automatic thumbnail creation for images, videos, and PDFs
- Sharded Storage: Efficient JSON-based storage with automatic sharding for large archives
- Batch Processing: Parallel file processing with progress tracking
- Integrity Verification: Verify file integrity and detect corruption
- Flexible Search: Search by filename, type, or custom metadata
Installation
pip install athena-file
System Dependencies
For EXIF metadata extraction and thumbnail generation, install the following system packages:
# Ubuntu/Debian
sudo apt-get install exiftool poppler-utils ffmpeg
# macOS
brew install exiftool poppler ffmpeg
# Arch Linux
sudo pacman -S perl-image-exiftool poppler ffmpeg
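Before enabling thumbnails it can be useful to confirm the external tools are actually on the PATH. The short standard-library check below is an illustrative sketch, not part of athena-file's API; pdftoppm is provided by poppler-utils.
import shutil
# Report which of the external tools used for metadata and thumbnails are installed.
for tool in ("exiftool", "pdftoppm", "ffmpeg"):
    status = "found" if shutil.which(tool) else "MISSING"
    print(f"{tool}: {status}")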
Quick Start
from pathlib import Path
from athena_file import FileArchiver
# Initialize archiver with default settings
archiver = FileArchiver()
# Index a directory
report = archiver.scan_and_index(Path("/path/to/photos"), recursive=True)
print(f"Indexed {report.success_count} files in {report.elapsed_seconds:.2f}s")
# Find duplicates
duplicates = archiver.find_duplicates()
print(f"Found {len(duplicates.duplicate_groups)} groups of duplicates")
# Search for files
results = archiver.search_by_name("vacation")
for file in results:
    print(f"{file.path} - {file.size} bytes")
Examples
Basic File Indexing
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Create archive with custom configuration
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    compute_all_hashes=True,   # Compute SHA256, MD5, SHA1 in addition to BLAKE3
    generate_thumbnails=True,  # Generate thumbnails for media files
)
archiver = FileArchiver(config)
# Index a specific directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
    file_pattern="*.jpg"  # Only process JPEG files
)
# Print summary
summary = report.get_summary()
print(f"Total files: {summary['total_files']}")
print(f"Success: {summary['successful']}")
print(f"Failed: {summary['failed']}")
print(f"Processing rate: {summary['files_per_second']:.2f} files/sec")
Working with File Metadata
from pathlib import Path
from athena_file import FileArchiver
from athena_file.main import FileWorker
archiver = FileArchiver()
# Index a single file
file_path = Path("/data/image.jpg")
fw = FileWorker(file_path)
# Get detailed hash information
print(f"BLAKE3: {fw.blake3_hash}")
print(f"Fast hash: {fw.fast_hash}")
print(f"SHA256: {fw.sha256_hash}")
# Access EXIF data
if fw.exif_data:
    print(f"Camera: {fw.exif_data.get('Make')} {fw.exif_data.get('Model')}")
    print(f"Date taken: {fw.exif_data.get('DateTimeOriginal')}")
# Retrieve metadata from archive
metadata = archiver.get_file_metadata(fw.blake3_hash)
if metadata:
    print(f"File type: {metadata.file_type}")
    print(f"MIME type: {metadata.mime_type}")
    print(f"Indexed at: {metadata.indexed_at}")
Deduplication
from athena_file import FileArchiver
archiver = FileArchiver()
# Find exact duplicates by hash
duplicates = archiver.find_duplicates()
print(f"Found {duplicates.total_duplicates} duplicate files")
print(f"Wasted space: {duplicates.wasted_space / 1024**3:.2f} GB")
# Iterate through duplicate groups
for group in duplicates.duplicate_groups:
    print(f"\nHash: {group.hash}")
    print(f"Size: {group.size} bytes")
    print(f"Count: {group.count} files")
    for file_meta in group.files:
        print(f" - {file_meta.path}")
# Find potential duplicates by size (faster but less accurate)
size_duplicates = archiver.find_duplicates_by_size(min_size=1024*1024) # 1MB minimum
print(f"Found {len(size_duplicates.duplicate_groups)} groups with same size")
Thumbnail Generation
from pathlib import Path
from athena_file import ArchiveConfig
from athena_file.media import ThumbnailGenerator
from athena_file.main import FileWorker
# Configure thumbnail settings
config = ArchiveConfig(
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128), (64, 64)],
    thumbnail_format="webp",  # or "jpeg"
    generate_thumbnails=True,
)
config.ensure_directories()
# Generate thumbnails for a file
file_worker = FileWorker(Path("/data/photo.jpg"))
generator = ThumbnailGenerator(config, file_worker.blake3_hash)
# Generate all configured sizes
thumbnails = generator.generate_thumbnail(file_worker)
for size, thumb_path in thumbnails.items():
    print(f"{size}: {thumb_path}")
# Thumbnails are organized by unique_id:
# .athena_archive/thumbnails/
# └── [unique_id]/
#     ├── 256x256.webp
#     ├── 128x128.webp
#     └── 64x64.webp
File Verification
from athena_file import FileArchiver
archiver = FileArchiver()
# Verify all files in archive
report = archiver.verify_integrity(
    check_exists=True,      # Verify files still exist
    recompute_hashes=True   # Recompute and verify hashes
)
print(f"Total files: {report.total_files}")
print(f"Verified: {report.verified_count}")
print(f"Missing: {report.missing_count}")
print(f"Corrupted: {report.corrupted_count}")
# Check specific files
if report.missing_files:
    print("\nMissing files:")
    for file_meta in report.missing_files:
        print(f" - {file_meta.path}")
if report.corrupted_files:
    print("\nCorrupted files:")
    for file_meta in report.corrupted_files:
        print(f" - {file_meta.path}")
Search and Query
from athena_file import FileArchiver
archiver = FileArchiver()
# Search by filename pattern
results = archiver.search_by_name("vacation")
print(f"Found {len(results)} files matching 'vacation'")
# Search by file type
images = archiver.search_by_type("image")
videos = archiver.search_by_type("video")
documents = archiver.search_by_type("document")
print(f"Images: {len(images)}")
print(f"Videos: {len(videos)}")
print(f"Documents: {len(documents)}")
# Iterate through all files
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        if file_meta.size > 100 * 1024 * 1024:  # Files larger than 100MB
            print(f"Large file: {file_meta.path} - {file_meta.size / 1024**2:.2f} MB")
Archive Statistics
from athena_file import FileArchiver
archiver = FileArchiver()
# Get archive statistics
stats = archiver.get_stats()
print(f"Total files: {stats['total_files']}")
print(f"Storage shards: {stats['storage_shards']}")
print(f"Storage size: {stats['storage_size_mb']:.2f} MB")
print(f"Average shard size: {stats['average_shard_size_mb']:.2f} MB")
# Get storage-level details
storage_stats = archiver.storage.get_stats()
print(f"Files per shard: {storage_stats['files_per_shard']}")
print(f"Total storage bytes: {storage_stats['total_storage_bytes']}")
Batch Processing with Progress
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Enable progress tracking
config = ArchiveConfig(
    enable_progress=True,
    progress_interval=1.0,  # Update every 1 second
    batch_size=100,         # Process 100 files per batch
    max_workers=4,          # Use 4 parallel workers
)
archiver = FileArchiver(config)
# Process large directory with progress
report = archiver.scan_and_index(
    Path("/data/large_archive"),
    recursive=True
)
print(f"Processed {report.file_count} files")
print(f"Success rate: {(report.success_count / report.file_count) * 100:.1f}%")
print(f"Time: {report.elapsed_seconds:.2f}s")
Custom Metadata
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
from athena_file.metadata import FileMetadata
from athena_file.main import FileWorker
archiver = FileArchiver()
# Create metadata with custom fields
config = ArchiveConfig()
fw = FileWorker(Path("/data/photo.jpg"))
metadata = FileMetadata.from_file_worker(fw, config)
# Add custom metadata
metadata.custom_metadata["photographer"] = "Jane Doe"
metadata.custom_metadata["location"] = "Paris, France"
metadata.custom_metadata["tags"] = ["landscape", "travel", "europe"]
# Store in archive
archiver.storage.add_file(metadata)
# Retrieve and access custom metadata
retrieved = archiver.get_file_metadata(fw.blake3_hash)
print(retrieved.custom_metadata["photographer"]) # "Jane Doe"
print(retrieved.custom_metadata["tags"]) # ["landscape", "travel", "europe"]
Error Handling
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Continue processing even if some files fail
config = ArchiveConfig(
    continue_on_error=True,
    max_retries=3,
)
archiver = FileArchiver(config)
report = archiver.scan_and_index(Path("/data/mixed_files"))
# Check for errors
if report.failure_count > 0:
    print(f"Failed to process {report.failure_count} files")
# Get error details
error_summary = archiver.batch_processor.get_error_summary()
print(f"Error types: {error_summary}")
Configuration
ArchiveConfig Options
from pathlib import Path
from athena_file import ArchiveConfig
config = ArchiveConfig(
    # Storage settings
    storage_dir=Path(".athena_archive"),  # Archive storage directory
    shard_size=10000,                     # Files per shard

    # Hashing
    compute_all_hashes=False,             # Compute SHA256, MD5, SHA1

    # Thumbnails
    generate_thumbnails=False,            # Auto-generate thumbnails
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",              # "webp" or "jpeg"

    # Processing
    batch_size=1000,                      # Files per batch
    max_workers=4,                        # Parallel workers
    continue_on_error=True,               # Continue on failure
    max_retries=3,                        # Retry failed operations

    # Progress
    enable_progress=False,                # Enable progress tracking
    progress_interval=1.0,                # Progress update interval (seconds)
)
Architecture
Storage Structure
.athena_archive/
├── index.json               # Master index with shard mappings
├── metadata.json            # Archive-level metadata
├── shards/
│   ├── shard_0000.json      # Files 0-9999
│   ├── shard_0001.json      # Files 10000-19999
│   └── ...
└── thumbnails/
    ├── [unique_id_1]/
    │   ├── 256x256.webp
    │   └── 128x128.webp
    └── [unique_id_2]/
        └── ...
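As a quick way to inspect this layout on disk, the standard-library sketch below lists the shard files and their sizes; it makes no assumptions about the JSON schema inside each shard.
from pathlib import Path
archive = Path(".athena_archive")
# List each shard file and its on-disk size.
for shard in sorted((archive / "shards").glob("shard_*.json")):
    print(f"{shard.name}: {shard.stat().st_size / 1024:.1f} KiB")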
File Metadata Schema
Each file in the archive stores:
- Hashes: BLAKE3 (primary), fast hash, optional SHA256/MD5/SHA1
- File attributes: size, mtime, ctime
- Type information: MIME type, file category
- EXIF data: Camera settings, GPS, timestamps (images/videos)
- Thumbnails: Paths to generated thumbnails
- Custom metadata: User-defined fields
- Archive tracking: Indexed timestamp, verification timestamp
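Putting those fields together, here is a hedged sketch of walking the archive and printing a few of them, assuming the records yielded by storage.iterate_files() expose the same attributes used in the examples above (path, size, file_type, mime_type, indexed_at, custom_metadata):
from athena_file import FileArchiver
archiver = FileArchiver()
# Walk every stored record and print a subset of its metadata fields.
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        print(file_meta.path)
        print(f"  {file_meta.size} bytes, {file_meta.file_type} ({file_meta.mime_type})")
        print(f"  indexed at: {file_meta.indexed_at}")
        if file_meta.custom_metadata:
            print(f"  custom: {file_meta.custom_metadata}")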
Performance
Athena File is designed for large-scale archives:
- Sharded storage: Prevents JSON files from growing too large
- Parallel processing: Multi-threaded batch operations
- Incremental indexing: Add files without reprocessing entire archive
- Fast hashing: BLAKE3 is significantly faster than SHA256
- Generator-based discovery: Avoids loading all file paths into memory
Typical performance on modern hardware:
- 1000-2000 files/second (indexing without thumbnails)
- 200-500 files/second (indexing with thumbnails)
- 5000+ files/second (duplicate detection by hash)
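To get a rough feel for the hashing speed difference on your own hardware, the self-contained snippet below times BLAKE3 against SHA256 on an in-memory buffer using the blake3 package and the standard library; it is a benchmark sketch, not part of athena-file.
import hashlib
import time
import blake3
data = b"\x00" * (256 * 1024 * 1024)  # 256 MiB buffer of zeros
start = time.perf_counter()
blake3.blake3(data).hexdigest()
print(f"BLAKE3: {time.perf_counter() - start:.2f}s")
start = time.perf_counter()
hashlib.sha256(data).hexdigest()
print(f"SHA256: {time.perf_counter() - start:.2f}s")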
Requirements
- Python >= 3.13
- Pillow >= 10.0.0
- blake3 >= 1.0.8
- pdf2image >= 1.17.0 (for PDF thumbnails)
- pyexiftool >= 0.5.6 (for EXIF extraction)
Optional System Dependencies
- exiftool: EXIF metadata extraction
- poppler-utils: PDF thumbnail generation
- ffmpeg: Video thumbnail generation
License
See LICENSE file for details.
Author
Nathan Price (nathan@modernleft.org)