athena-file (1.3.3)
Athena File
Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation.
Features
- Fast File Hashing: BLAKE3 primary hashing with optional SHA256, MD5, and SHA1 support
- Metadata Extraction: Automatic EXIF data extraction for images and videos
- Deduplication: Identify duplicate files by hash or size
- Thumbnail Generation: Automatic thumbnail creation for images, videos, and PDFs
- Sharded Storage: Efficient JSON-based storage with automatic sharding for large archives
- Batch Processing: Parallel file processing with progress tracking
- Integrity Verification: Verify file integrity and detect corruption
- Flexible Search: Search by filename, type, or custom metadata
Installation
pip install athena-file
System Dependencies
For EXIF metadata extraction and thumbnail generation, install the following system packages:
# Ubuntu/Debian
sudo apt-get install exiftool poppler-utils ffmpeg
# macOS
brew install exiftool poppler ffmpeg
# Arch Linux
sudo pacman -S perl-image-exiftool poppler ffmpeg
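Before enabling thumbnails it can be useful to confirm the external tools are actually on the PATH. The short standard-library check below is an illustrative sketch, not part of athena-file's API; pdftoppm is provided by poppler-utils.
import shutil
# Report which of the external tools used for metadata and thumbnails are installed.
for tool in ("exiftool", "pdftoppm", "ffmpeg"):
    status = "found" if shutil.which(tool) else "MISSING"
    print(f"{tool}: {status}")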
Quick Start
from pathlib import Path
from athena_file import FileArchiver
# Initialize archiver with default settings
archiver = FileArchiver()
# Index a directory
report = archiver.scan_and_index(Path("/path/to/photos"), recursive=True)
print(f"Indexed {report.success_count} files in {report.elapsed_seconds:.2f}s")
# Find duplicates
duplicates = archiver.find_duplicates()
print(f"Found {len(duplicates.duplicate_groups)} groups of duplicates")
# Search for files
results = archiver.search_by_name("vacation")
for file in results:
    print(f"{file.path} - {file.size} bytes")
Examples
Basic File Indexing
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Create archive with custom configuration
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    compute_all_hashes=True,   # Compute SHA256, MD5, SHA1 in addition to BLAKE3
    generate_thumbnails=True,  # Generate thumbnails for media files
)
archiver = FileArchiver(config)
# Index a specific directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
    file_pattern="*.jpg"  # Only process JPEG files
)
# Print summary
summary = report.get_summary()
print(f"Total files: {summary['total_files']}")
print(f"Success: {summary['successful']}")
print(f"Failed: {summary['failed']}")
print(f"Processing rate: {summary['files_per_second']:.2f} files/sec")
Working with File Metadata
from pathlib import Path
from athena_file import FileArchiver
from athena_file.main import FileWorker
archiver = FileArchiver()
# Index a single file
file_path = Path("/data/image.jpg")
fw = FileWorker(file_path)
# Get detailed hash information
print(f"BLAKE3: {fw.blake3_hash}")
print(f"Fast hash: {fw.fast_hash}")
print(f"SHA256: {fw.sha256_hash}")
# Access EXIF data
if fw.exif_data:
    print(f"Camera: {fw.exif_data.get('Make')} {fw.exif_data.get('Model')}")
    print(f"Date taken: {fw.exif_data.get('DateTimeOriginal')}")
# Retrieve metadata from archive
metadata = archiver.get_file_metadata(fw.blake3_hash)
if metadata:
    print(f"File type: {metadata.file_type}")
    print(f"MIME type: {metadata.mime_type}")
    print(f"Indexed at: {metadata.indexed_at}")
Deduplication
from athena_file import FileArchiver
archiver = FileArchiver()
# Find exact duplicates by hash
duplicates = archiver.find_duplicates()
print(f"Found {duplicates.total_duplicates} duplicate files")
print(f"Wasted space: {duplicates.wasted_space / 1024**3:.2f} GB")
# Iterate through duplicate groups
for group in duplicates.duplicate_groups:
    print(f"\nHash: {group.hash}")
    print(f"Size: {group.size} bytes")
    print(f"Count: {group.count} files")
    for file_meta in group.files:
        print(f" - {file_meta.path}")
# Find potential duplicates by size (faster but less accurate)
size_duplicates = archiver.find_duplicates_by_size(min_size=1024*1024) # 1MB minimum
print(f"Found {len(size_duplicates.duplicate_groups)} groups with same size")
Thumbnail Generation
from pathlib import Path
from athena_file import ArchiveConfig
from athena_file.media import ThumbnailGenerator
from athena_file.main import FileWorker
# Configure thumbnail settings
config = ArchiveConfig(
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128), (64, 64)],
    thumbnail_format="webp",  # or "jpeg"
    generate_thumbnails=True,
)
config.ensure_directories()
# Generate thumbnails for a file
file_worker = FileWorker(Path("/data/photo.jpg"))
generator = ThumbnailGenerator(config, file_worker.blake3_hash)
# Generate all configured sizes
thumbnails = generator.generate_thumbnail(file_worker)
for size, thumb_path in thumbnails.items():
    print(f"{size}: {thumb_path}")
# Thumbnails are organized by unique_id:
# .athena_archive/thumbnails/
# └── [unique_id]/
#     ├── 256x256.webp
#     ├── 128x128.webp
#     └── 64x64.webp
File Verification
from athena_file import FileArchiver
archiver = FileArchiver()
# Verify all files in archive
report = archiver.verify_integrity(
    check_exists=True,      # Verify files still exist
    recompute_hashes=True   # Recompute and verify hashes
)
print(f"Total files: {report.total_files}")
print(f"Verified: {report.verified_count}")
print(f"Missing: {report.missing_count}")
print(f"Corrupted: {report.corrupted_count}")
# Check specific files
if report.missing_files:
    print("\nMissing files:")
    for file_meta in report.missing_files:
        print(f" - {file_meta.path}")
if report.corrupted_files:
    print("\nCorrupted files:")
    for file_meta in report.corrupted_files:
        print(f" - {file_meta.path}")
Search and Query
from athena_file import FileArchiver
archiver = FileArchiver()
# Search by filename pattern
results = archiver.search_by_name("vacation")
print(f"Found {len(results)} files matching 'vacation'")
# Search by file type
images = archiver.search_by_type("image")
videos = archiver.search_by_type("video")
documents = archiver.search_by_type("document")
print(f"Images: {len(images)}")
print(f"Videos: {len(videos)}")
print(f"Documents: {len(documents)}")
# Iterate through all files
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        if file_meta.size > 100 * 1024 * 1024:  # Files larger than 100MB
            print(f"Large file: {file_meta.path} - {file_meta.size / 1024**2:.2f} MB")
Archive Statistics
from athena_file import FileArchiver
archiver = FileArchiver()
# Get archive statistics
stats = archiver.get_stats()
print(f"Total files: {stats['total_files']}")
print(f"Storage shards: {stats['storage_shards']}")
print(f"Storage size: {stats['storage_size_mb']:.2f} MB")
print(f"Average shard size: {stats['average_shard_size_mb']:.2f} MB")
# Get storage-level details
storage_stats = archiver.storage.get_stats()
print(f"Files per shard: {storage_stats['files_per_shard']}")
print(f"Total storage bytes: {storage_stats['total_storage_bytes']}")
Batch Processing with Progress
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Enable progress tracking
config = ArchiveConfig(
    enable_progress=True,
    progress_interval=1.0,  # Update every 1 second
    batch_size=100,         # Process 100 files per batch
    max_workers=4,          # Use 4 parallel workers
)
archiver = FileArchiver(config)
# Process large directory with progress
report = archiver.scan_and_index(
    Path("/data/large_archive"),
    recursive=True
)
print(f"Processed {report.file_count} files")
print(f"Success rate: {(report.success_count / report.file_count) * 100:.1f}%")
print(f"Time: {report.elapsed_seconds:.2f}s")
Custom Metadata
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
from athena_file.metadata import FileMetadata
from athena_file.main import FileWorker
archiver = FileArchiver()
# Create metadata with custom fields
config = ArchiveConfig()
fw = FileWorker(Path("/data/photo.jpg"))
metadata = FileMetadata.from_file_worker(fw, config)
# Add custom metadata
metadata.custom_metadata["photographer"] = "Jane Doe"
metadata.custom_metadata["location"] = "Paris, France"
metadata.custom_metadata["tags"] = ["landscape", "travel", "europe"]
# Store in archive
archiver.storage.add_file(metadata)
# Retrieve and access custom metadata
retrieved = archiver.get_file_metadata(fw.blake3_hash)
print(retrieved.custom_metadata["photographer"]) # "Jane Doe"
print(retrieved.custom_metadata["tags"]) # ["landscape", "travel", "europe"]
Error Handling
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
# Continue processing even if some files fail
config = ArchiveConfig(
    continue_on_error=True,
    max_retries=3,
)
archiver = FileArchiver(config)
report = archiver.scan_and_index(Path("/data/mixed_files"))
# Check for errors
if report.failure_count > 0:
    print(f"Failed to process {report.failure_count} files")
# Get error details
error_summary = archiver.batch_processor.get_error_summary()
print(f"Error types: {error_summary}")
Configuration
ArchiveConfig Options
from pathlib import Path
from athena_file import ArchiveConfig
config = ArchiveConfig(
    # Storage settings
    storage_dir=Path(".athena_archive"),  # Archive storage directory
    shard_size=10000,                     # Files per shard

    # Hashing
    compute_all_hashes=False,             # Compute SHA256, MD5, SHA1

    # Thumbnails
    generate_thumbnails=False,            # Auto-generate thumbnails
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",              # "webp" or "jpeg"

    # Processing
    batch_size=1000,                      # Files per batch
    max_workers=4,                        # Parallel workers
    continue_on_error=True,               # Continue on failure
    max_retries=3,                        # Retry failed operations

    # Progress
    enable_progress=False,                # Enable progress tracking
    progress_interval=1.0,                # Progress update interval (seconds)
)
Architecture
Storage Structure
.athena_archive/
├── index.json               # Master index with shard mappings
├── metadata.json            # Archive-level metadata
├── shards/
│   ├── shard_0000.json      # Files 0-9999
│   ├── shard_0001.json      # Files 10000-19999
│   └── ...
└── thumbnails/
    ├── [unique_id_1]/
    │   ├── 256x256.webp
    │   └── 128x128.webp
    └── [unique_id_2]/
        └── ...
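As a quick way to inspect this layout on disk, the standard-library sketch below lists the shard files and their sizes; it makes no assumptions about the JSON schema inside each shard.
from pathlib import Path
archive = Path(".athena_archive")
# List each shard file and its on-disk size.
for shard in sorted((archive / "shards").glob("shard_*.json")):
    print(f"{shard.name}: {shard.stat().st_size / 1024:.1f} KiB")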
File Metadata Schema
Each file in the archive stores:
- Hashes: BLAKE3 (primary), fast hash, optional SHA256/MD5/SHA1
- File attributes: size, mtime, ctime
- Type information: MIME type, file category
- EXIF data: Camera settings, GPS, timestamps (images/videos)
- Thumbnails: Paths to generated thumbnails
- Custom metadata: User-defined fields
- Archive tracking: Indexed timestamp, verification timestamp
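Putting those fields together, here is a hedged sketch of walking the archive and printing a few of them, assuming the records yielded by storage.iterate_files() expose the same attributes used in the examples above (path, size, file_type, mime_type, indexed_at, custom_metadata):
from athena_file import FileArchiver
archiver = FileArchiver()
# Walk every stored record and print a subset of its metadata fields.
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        print(file_meta.path)
        print(f"  {file_meta.size} bytes, {file_meta.file_type} ({file_meta.mime_type})")
        print(f"  indexed at: {file_meta.indexed_at}")
        if file_meta.custom_metadata:
            print(f"  custom: {file_meta.custom_metadata}")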
Performance
Athena File is designed for large-scale archives:
- Sharded storage: Prevents JSON files from growing too large
- Parallel processing: Multi-threaded batch operations
- Incremental indexing: Add files without reprocessing entire archive
- Fast hashing: BLAKE3 is significantly faster than SHA256
- Generator-based discovery: Avoids loading all file paths into memory
Typical performance on modern hardware:
- 1000-2000 files/second (indexing without thumbnails)
- 200-500 files/second (indexing with thumbnails)
- 5000+ files/second (duplicate detection by hash)
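To get a rough feel for the hashing speed difference on your own hardware, the self-contained snippet below times BLAKE3 against SHA256 on an in-memory buffer using the blake3 package and the standard library; it is a benchmark sketch, not part of athena-file.
import hashlib
import time
import blake3
data = b"\x00" * (256 * 1024 * 1024)  # 256 MiB buffer of zeros
start = time.perf_counter()
blake3.blake3(data).hexdigest()
print(f"BLAKE3: {time.perf_counter() - start:.2f}s")
start = time.perf_counter()
hashlib.sha256(data).hexdigest()
print(f"SHA256: {time.perf_counter() - start:.2f}s")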
Requirements
- Python >= 3.13
- Pillow >= 10.0.0
- blake3 >= 1.0.8
- pdf2image >= 1.17.0 (for PDF thumbnails)
- pyexiftool >= 0.5.6 (for EXIF extraction)
Optional System Dependencies
- exiftool: EXIF metadata extraction
- poppler-utils: PDF thumbnail generation
- ffmpeg: Video thumbnail generation
License
See LICENSE file for details.
Author
Nathan Price (nathan@modernleft.org)