Metadata-Version: 2.3
Name: athena-file
Version: 1.3.1
Summary: Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation
Author: Nathan Price
Author-email: Nathan Price <nathan@modernleft.org>
Requires-Dist: pillow>=10.0.0
Requires-Dist: blake3>=1.0.8
Requires-Dist: pdf2image>=1.17.0
Requires-Dist: pyexiftool>=0.5.6
Requires-Python: >=3.13
Description-Content-Type: text/markdown

# Athena File

Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation.

## Features

-   **Fast File Hashing**: BLAKE3 primary hashing with optional SHA256, MD5, and SHA1 support
-   **Metadata Extraction**: Automatic EXIF data extraction for images and videos
-   **Deduplication**: Identify duplicate files by hash or size
-   **Thumbnail Generation**: Automatic thumbnail creation for images, videos, and PDFs
-   **Sharded Storage**: Efficient JSON-based storage with automatic sharding for large archives
-   **Batch Processing**: Parallel file processing with progress tracking
-   **Integrity Verification**: Verify file integrity and detect corruption
-   **Flexible Search**: Search by filename, type, or custom metadata

## Installation

```bash
pip install athena-file
```

### System Dependencies

For EXIF extraction and thumbnail generation, install the following system packages:

```bash
# Ubuntu/Debian
sudo apt-get install exiftool poppler-utils ffmpeg

# macOS
brew install exiftool poppler ffmpeg

# Arch Linux
sudo pacman -S perl-image-exiftool poppler ffmpeg
```

## Quick Start

```python
from pathlib import Path
from athena_file import FileArchiver

# Initialize archiver with default settings
archiver = FileArchiver()

# Index a directory
report = archiver.scan_and_index(Path("/path/to/photos"), recursive=True)
print(f"Indexed {report.success_count} files in {report.elapsed_seconds:.2f}s")

# Find duplicates
duplicates = archiver.find_duplicates()
print(f"Found {len(duplicates.duplicate_groups)} groups of duplicates")

# Search for files
results = archiver.search_by_name("vacation")
for file in results:
    print(f"{file.path} - {file.size} bytes")
```

## Examples

### Basic File Indexing

```python
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Create archive with custom configuration
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    compute_all_hashes=True,  # Compute SHA256, MD5, SHA1 in addition to BLAKE3
    generate_thumbnails=True,  # Generate thumbnails for media files
)

archiver = FileArchiver(config)

# Index a specific directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True,
    file_pattern="*.jpg"  # Only process JPEG files
)

# Print summary
summary = report.get_summary()
print(f"Total files: {summary['total_files']}")
print(f"Success: {summary['successful']}")
print(f"Failed: {summary['failed']}")
print(f"Processing rate: {summary['files_per_second']:.2f} files/sec")
```

### Working with File Metadata

```python
from pathlib import Path
from athena_file import FileArchiver
from athena_file.main import FileWorker

archiver = FileArchiver()

# Index a single file
file_path = Path("/data/image.jpg")
fw = FileWorker(file_path)

# Get detailed hash information
print(f"BLAKE3: {fw.blake3_hash}")
print(f"Fast hash: {fw.fast_hash}")
print(f"SHA256: {fw.sha256_hash}")

# Access EXIF data
if fw.exif_data:
    print(f"Camera: {fw.exif_data.get('Make')} {fw.exif_data.get('Model')}")
    print(f"Date taken: {fw.exif_data.get('DateTimeOriginal')}")

# Retrieve metadata from archive
metadata = archiver.get_file_metadata(fw.blake3_hash)
if metadata:
    print(f"File type: {metadata.file_type}")
    print(f"MIME type: {metadata.mime_type}")
    print(f"Indexed at: {metadata.indexed_at}")
```

### Deduplication

```python
from athena_file import FileArchiver

archiver = FileArchiver()

# Find exact duplicates by hash
duplicates = archiver.find_duplicates()

print(f"Found {duplicates.total_duplicates} duplicate files")
print(f"Wasted space: {duplicates.wasted_space / 1024**3:.2f} GB")

# Iterate through duplicate groups
for group in duplicates.duplicate_groups:
    print(f"\nHash: {group.hash}")
    print(f"Size: {group.size} bytes")
    print(f"Count: {group.count} files")
    for file_meta in group.files:
        print(f"  - {file_meta.path}")

# Find potential duplicates by size (faster, but equal size does not guarantee equal content)
size_duplicates = archiver.find_duplicates_by_size(min_size=1024*1024)  # 1 MiB minimum
print(f"Found {len(size_duplicates.duplicate_groups)} groups with same size")
```
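
Size matching is cheap because it never reads file contents, but two unrelated files can share a size. One common way to combine both passes is to group by size first and hash only the files whose sizes collide. The block below is a minimal standalone sketch of that idea, not athena-file's implementation: `hash_file` and `find_duplicate_groups` are hypothetical helpers, and the standard library's `hashlib.blake2b` stands in for BLAKE3.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files never sit fully in memory."""
    digest = hashlib.blake2b()  # stdlib stand-in for BLAKE3 (assumption)
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(paths: list[Path]) -> list[list[Path]]:
    """Group files by size first, then confirm duplicates by content hash."""
    by_size: dict[int, list[Path]] = defaultdict(list)
    for path in paths:
        by_size[path.stat().st_size].append(path)

    groups: list[list[Path]] = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash: dict[str, list[Path]] = defaultdict(list)
        for path in same_size:
            by_hash[hash_file(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Only files that survive the size filter are read from disk, which is why a size pass is a useful pre-filter on large archives.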

### Thumbnail Generation

```python
from pathlib import Path
from athena_file import ArchiveConfig
from athena_file.media import ThumbnailGenerator
from athena_file.main import FileWorker

# Configure thumbnail settings
config = ArchiveConfig(
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128), (64, 64)],
    thumbnail_format="webp",  # or "jpeg"
    generate_thumbnails=True,
)

config.ensure_directories()

# Generate thumbnails for a file
file_worker = FileWorker(Path("/data/photo.jpg"))
generator = ThumbnailGenerator(config, file_worker.blake3_hash)

# Generate all configured sizes
thumbnails = generator.generate_thumbnail(file_worker)

for size, thumb_path in thumbnails.items():
    print(f"{size}: {thumb_path}")

# Thumbnails are organized by unique_id (here, the file's BLAKE3 hash):
# .athena_archive/thumbnails/
# └── [unique_id]/
#     ├── 256x256.webp
#     ├── 128x128.webp
#     └── 64x64.webp
```

### File Verification

```python
from athena_file import FileArchiver

archiver = FileArchiver()

# Verify all files in archive
report = archiver.verify_integrity(
    check_exists=True,      # Verify files still exist
    recompute_hashes=True   # Recompute and verify hashes
)

print(f"Total files: {report.total_files}")
print(f"Verified: {report.verified_count}")
print(f"Missing: {report.missing_count}")
print(f"Corrupted: {report.corrupted_count}")

# Check specific files
if report.missing_files:
    print("\nMissing files:")
    for file_meta in report.missing_files:
        print(f"  - {file_meta.path}")

if report.corrupted_files:
    print("\nCorrupted files:")
    for file_meta in report.corrupted_files:
        print(f"  - {file_meta.path}")
```
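
Conceptually, the two flags correspond to two checks per file: does the path still exist, and does its content still produce the recorded hash. The sketch below illustrates that classification with the standard library only. `verify_file` is a hypothetical helper and uses SHA256 for convenience; it is not how `verify_integrity` is necessarily implemented.

```python
import hashlib
from pathlib import Path


def verify_file(path: Path, expected_sha256: str) -> str:
    """Classify a file as 'missing', 'corrupted', or 'verified'."""
    if not path.is_file():
        return "missing"
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in 1 MiB chunks so large files are never loaded whole.
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return "verified" if digest.hexdigest() == expected_sha256 else "corrupted"
```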

### Search and Query

```python
from athena_file import FileArchiver

archiver = FileArchiver()

# Search by filename pattern
results = archiver.search_by_name("vacation")
print(f"Found {len(results)} files matching 'vacation'")

# Search by file type
images = archiver.search_by_type("image")
videos = archiver.search_by_type("video")
documents = archiver.search_by_type("document")

print(f"Images: {len(images)}")
print(f"Videos: {len(videos)}")
print(f"Documents: {len(documents)}")

# Iterate through all files
for batch in archiver.storage.iterate_files():
    for file_meta in batch:
        if file_meta.size > 100 * 1024 * 1024:  # Files larger than 100MB
            print(f"Large file: {file_meta.path} - {file_meta.size / 1024**2:.2f} MB")
```
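
The examples above use the categories `"image"`, `"video"`, and `"document"`. How athena-file derives them is not documented here; the following sketch shows one plausible mapping from a guessed MIME type to a coarse category using the standard library. `categorize` and the `"other"` fallback are assumptions made for illustration.

```python
import mimetypes


def categorize(filename: str) -> str:
    """Map a filename to a coarse category via its guessed MIME type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "other"  # unknown extension
    major = mime_type.split("/", 1)[0]
    if major in {"image", "video", "audio"}:
        return major
    if major == "text" or mime_type == "application/pdf":
        return "document"
    return "other"
```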

### Archive Statistics

```python
from athena_file import FileArchiver

archiver = FileArchiver()

# Get archive statistics
stats = archiver.get_stats()

print(f"Total files: {stats['total_files']}")
print(f"Storage shards: {stats['storage_shards']}")
print(f"Storage size: {stats['storage_size_mb']:.2f} MB")
print(f"Average shard size: {stats['average_shard_size_mb']:.2f} MB")

# Get storage-level details
storage_stats = archiver.storage.get_stats()
print(f"Files per shard: {storage_stats['files_per_shard']}")
print(f"Total storage bytes: {storage_stats['total_storage_bytes']}")
```

### Batch Processing with Progress

```python
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Enable progress tracking
config = ArchiveConfig(
    enable_progress=True,
    progress_interval=1.0,  # Update every 1 second
    batch_size=100,         # Process 100 files per batch
    max_workers=4,          # Use 4 parallel workers
)

archiver = FileArchiver(config)

# Process large directory with progress
report = archiver.scan_and_index(
    Path("/data/large_archive"),
    recursive=True
)

print(f"Processed {report.file_count} files")
if report.file_count:  # avoid dividing by zero when no files were found
    print(f"Success rate: {(report.success_count / report.file_count) * 100:.1f}%")
print(f"Time: {report.elapsed_seconds:.2f}s")
```
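
To make the roles of `batch_size` and `max_workers` concrete, here is a minimal sketch of batched, thread-pooled processing built on the standard library. The helper names `batched` and `process_in_batches` are hypothetical and only illustrate the general pattern; athena-file's batch processor may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Callable, Iterable, Iterator


def batched(items: Iterable, batch_size: int) -> Iterator[list]:
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def process_in_batches(items: Iterable, worker: Callable,
                       batch_size: int = 100, max_workers: int = 4) -> list:
    """Apply worker to every item, one batch at a time, using a thread pool."""
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(items, batch_size):
            # pool.map preserves input order within each batch
            results.extend(pool.map(worker, batch))
    return results
```

Working batch by batch bounds how many pending tasks exist at once, which matters when the input is a lazy stream of paths rather than a list.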

### Custom Metadata

```python
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig
from athena_file.metadata import FileMetadata
from athena_file.main import FileWorker

archiver = FileArchiver()

# Create metadata with custom fields
config = ArchiveConfig()
fw = FileWorker(Path("/data/photo.jpg"))
metadata = FileMetadata.from_file_worker(fw, config)

# Add custom metadata
metadata.custom_metadata["photographer"] = "Jane Doe"
metadata.custom_metadata["location"] = "Paris, France"
metadata.custom_metadata["tags"] = ["landscape", "travel", "europe"]

# Store in archive
archiver.storage.add_file(metadata)

# Retrieve and access custom metadata (guard against a missing entry)
retrieved = archiver.get_file_metadata(fw.blake3_hash)
if retrieved:
    print(retrieved.custom_metadata["photographer"])  # "Jane Doe"
    print(retrieved.custom_metadata["tags"])          # ["landscape", "travel", "europe"]
```

### Error Handling

```python
from pathlib import Path
from athena_file import FileArchiver, ArchiveConfig

# Continue processing even if some files fail
config = ArchiveConfig(
    continue_on_error=True,
    max_retries=3,
)

archiver = FileArchiver(config)

report = archiver.scan_and_index(Path("/data/mixed_files"))

# Check for errors
if report.failure_count > 0:
    print(f"Failed to process {report.failure_count} files")

    # Get error details
    error_summary = archiver.batch_processor.get_error_summary()
    print(f"Error types: {error_summary}")
```
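
The interaction between `continue_on_error` and `max_retries` can be pictured with a small standalone loop. This is a sketch under stated assumptions, not the library's code: `process_all` is a hypothetical helper, and it assumes `max_retries` counts attempts made after the first failure.

```python
from typing import Callable, Iterable


def process_all(items: Iterable, worker: Callable, max_retries: int = 3,
                continue_on_error: bool = True) -> tuple[list, dict]:
    """Run worker on each item, retrying failures up to max_retries times."""
    results, errors = [], {}
    for item in items:
        for attempt in range(1 + max_retries):  # first try plus retries
            try:
                results.append(worker(item))
                break
            except Exception as exc:
                if attempt == max_retries:  # retries exhausted
                    if not continue_on_error:
                        raise
                    errors[item] = repr(exc)
    return results, errors
```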

## Configuration

### ArchiveConfig Options

```python
from pathlib import Path
from athena_file import ArchiveConfig

config = ArchiveConfig(
    # Storage settings
    storage_dir=Path(".athena_archive"),     # Archive storage directory
    shard_size=10000,                        # Files per shard

    # Hashing
    compute_all_hashes=False,                # Compute SHA256, MD5, SHA1

    # Thumbnails
    generate_thumbnails=False,               # Auto-generate thumbnails
    thumbnail_dir=Path(".athena_archive/thumbnails"),
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",                 # "webp" or "jpeg"

    # Processing
    batch_size=1000,                         # Files per batch
    max_workers=4,                           # Parallel workers
    continue_on_error=True,                  # Continue on failure
    max_retries=3,                           # Retry failed operations

    # Progress
    enable_progress=False,                   # Enable progress tracking
    progress_interval=1.0,                   # Progress update interval (seconds)
)
```

## Architecture

### Storage Structure

```
.athena_archive/
├── index.json              # Master index with shard mappings
├── metadata.json           # Archive-level metadata
├── shards/
│   ├── shard_0000.json    # Files 0-9999
│   ├── shard_0001.json    # Files 10000-19999
│   └── ...
└── thumbnails/
    ├── [unique_id_1]/
    │   ├── 256x256.webp
    │   └── 128x128.webp
    └── [unique_id_2]/
        └── ...
```
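
The shard comments above imply a simple mapping from a file's position to its shard file. A minimal sketch of that arithmetic follows, assuming files are assigned to shards by sequential index; `shard_for` is a hypothetical helper, and the library may assign shards differently.

```python
def shard_for(file_index: int, shard_size: int = 10_000) -> str:
    """Return the shard filename that would hold the given file index."""
    if file_index < 0:
        raise ValueError("file_index must be non-negative")
    # Integer division picks the shard; :04d zero-pads to four digits.
    return f"shard_{file_index // shard_size:04d}.json"
```

With the default `shard_size` of 10000, index 9999 lands in `shard_0000.json` and index 10000 in `shard_0001.json`, matching the layout shown.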

### File Metadata Schema

Each file in the archive stores:

-   **Hashes**: BLAKE3 (primary), fast hash, optional SHA256/MD5/SHA1
-   **File attributes**: size, mtime, ctime
-   **Type information**: MIME type, file category
-   **EXIF data**: Camera settings, GPS, timestamps (images/videos)
-   **Thumbnails**: Paths to generated thumbnails
-   **Custom metadata**: User-defined fields
-   **Archive tracking**: Indexed timestamp, verification timestamp
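
As a rough picture of how these groups could fit together in one JSON-serializable record, consider the sketch below. `FileRecord` and its field names are illustrative assumptions loosely based on the attributes used in the examples; they are not the actual `FileMetadata` definition.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FileRecord:
    """Hypothetical record mirroring the schema groups listed above."""
    path: str
    size: int
    mtime: float
    blake3_hash: str
    sha256_hash: Optional[str] = None   # optional extra hash
    mime_type: Optional[str] = None
    file_type: Optional[str] = None     # coarse category, e.g. "image"
    exif_data: dict = field(default_factory=dict)
    thumbnails: dict = field(default_factory=dict)       # "256x256" -> path
    custom_metadata: dict = field(default_factory=dict)  # user-defined fields
    indexed_at: Optional[str] = None    # ISO 8601 timestamp
    verified_at: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the record to a JSON string."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "FileRecord":
        """Rebuild a record from its JSON string."""
        return cls(**json.loads(text))
```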

## Performance

Athena File is designed for large-scale archives:

-   **Sharded storage**: Prevents JSON files from growing too large
-   **Parallel processing**: Multi-threaded batch operations
-   **Incremental indexing**: Add files without reprocessing entire archive
-   **Fast hashing**: BLAKE3 is significantly faster than SHA256
-   **Generator-based discovery**: Avoids loading all file paths into memory
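
Generator-based discovery means paths are produced one at a time as the directory tree is walked, instead of being collected into a list first. The sketch below shows the general technique with `os.walk`; `discover_files` is a hypothetical helper and not part of athena-file's API.

```python
import fnmatch
import os
from pathlib import Path
from typing import Iterator


def discover_files(root: Path, pattern: str = "*",
                   recursive: bool = True) -> Iterator[Path]:
    """Lazily yield files under root whose names match a glob pattern."""
    for dirpath, dirnames, filenames in os.walk(root):
        if not recursive:
            dirnames.clear()  # stop os.walk from descending further
        for name in filenames:
            if fnmatch.fnmatch(name, pattern):
                yield Path(dirpath) / name
```

Because the function is a generator, a consumer can start hashing the first files while the rest of the tree is still unread.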

Typical throughput on modern hardware (actual rates vary with file size, storage speed, and worker count):

-   1000-2000 files/second (indexing without thumbnails)
-   200-500 files/second (indexing with thumbnails)
-   5000+ files/second (duplicate detection by hash)

## Requirements

-   Python >= 3.13
-   Pillow >= 10.0.0
-   blake3 >= 1.0.8
-   pdf2image >= 1.17.0 (for PDF thumbnails)
-   pyexiftool >= 0.5.6 (for EXIF extraction)

### Optional System Dependencies

-   `exiftool`: EXIF metadata extraction
-   `poppler-utils`: PDF thumbnail generation
-   `ffmpeg`: Video thumbnail generation

## License

See LICENSE file for details.

## Author

Nathan Price <nathan@modernleft.org>
