Metadata-Version: 2.3
Name: athena-file
Version: 1.0.0
Summary: Large-scale file archive management system with JSON cataloging, deduplication, and thumbnail generation
Author: Nathan Price
Author-email: Nathan Price <nathan@modernleft.org>
Requires-Dist: pillow>=10.0.0
Requires-Dist: blake3>=1.0.8
Requires-Dist: pyexiftool>=0.5.6
Requires-Python: >=3.13
Description-Content-Type: text/markdown

# Athena-File

Large-scale file archive management system with JSON-based cataloging, intelligent deduplication, multi-format support, and thumbnail generation. Designed to efficiently handle 1M+ files.

[![Tests](https://img.shields.io/badge/tests-130%20passing-brightgreen)]()
[![Coverage](https://img.shields.io/badge/coverage-77%25-green)]()
[![Python](https://img.shields.io/badge/python-3.13%2B-blue)]()

## Features

- **Scalable Architecture**: Sharded JSON storage handles 1M+ files with < 100 MB memory
- **Fast Indexing**: High-throughput batch processing with parallel execution
- **Intelligent Deduplication**: Find duplicate files by content hash in seconds
- **Multi-Format Support**: Automatic detection for images, videos, archives, documents
- **Thumbnail Generation**: Create thumbnails for images and videos (Pillow + ffmpeg)
- **Plugin System**: Extensible architecture for custom file type handlers
- **Integrity Verification**: Hash-based verification with corruption detection
- **Comprehensive API**: Python API for all archive operations

## Quick Start

### Installation

```bash
# Install with uv
uv pip install athena-file

# Or install from source
git clone <repository-url>
cd athena-file
uv sync
```

### System Requirements

- Python 3.13+
- `exiftool` (for EXIF metadata extraction)
- `ffmpeg` (optional, for video thumbnails)

### Basic Usage

```python
from pathlib import Path
from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

# Configure archive
config = ArchiveConfig(
    storage_dir=Path(".athena_archive"),
    default_hash_algo="blake3",
    batch_size=1000,
)

# Create archiver
archiver = FileArchiver(config)

# Index a directory
report = archiver.scan_and_index(
    Path("/data/photos"),
    recursive=True
)

print(f"Indexed {report.file_count:,} files in {report.elapsed_seconds:.1f}s")
print(f"Total size: {report.total_size / (1024**3):.2f} GB")

# Find duplicates
dup_report = archiver.find_duplicates()
print(f"Found {dup_report.duplicate_groups_count} groups of duplicates")
print(f"Wasted space: {dup_report.total_wasted_space / (1024**2):.1f} MB")

# Verify integrity
verification = archiver.verify_integrity(
    check_exists=True,
    recompute_hashes=True
)
print(f"Valid: {verification.valid_count}")
print(f"Corrupted: {verification.corrupted_count}")
```

## Performance

- **Fast**: BLAKE3 hashing with parallel batch processing
- **Memory Efficient**: < 100 MB memory usage even with large archives
- **Scalable**: Sharded storage with O(1) hash lookups
- **Optimized**: Thread pool execution for multi-core systems

See [PERFORMANCE.md](PERFORMANCE.md) for analysis and optimization tips.
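
The combination of batching and thread-pool execution listed above can be sketched with the standard library alone. This is an illustration of the general technique, not athena-file's internals: `batched`, `hash_file`, and `hash_in_batches` are hypothetical names, and `hashlib.sha256` stands in for BLAKE3 so the sketch needs no third-party package.

```python
import hashlib
import tempfile
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of at most `size` items so only one batch is held at a time."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def hash_file(path: Path) -> tuple[str, str]:
    # sha256 is a stand-in here; the README's default algorithm is BLAKE3
    return path.name, hashlib.sha256(path.read_bytes()).hexdigest()


def hash_in_batches(paths: Iterable[Path], batch_size: int, max_workers: int) -> dict[str, str]:
    """Hash files batch by batch, spreading each batch across a thread pool."""
    digests: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in batched(paths, batch_size):
            # pool.map preserves input order and yields (name, digest) pairs
            digests.update(pool.map(hash_file, batch))
    return digests


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for i in range(10):
        (root / f"f{i}.dat").write_bytes(bytes([i]) * 256)
    digests = hash_in_batches(sorted(root.iterdir()), batch_size=4, max_workers=4)
    batch_sizes = [len(b) for b in batched(range(10), 4)]
```

Processing one batch at a time bounds how many pending results exist at once, which is one plausible way to keep memory flat as the file count grows.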

## Architecture

### Sharded JSON Storage

Instead of a single large JSON file, athena-file uses sharded storage:

```
.athena_archive/
├── index.json              # Master index (~1-2 MB)
├── metadata.json           # Archive metadata
└── shards/
    ├── shard_0000.json    # 10,000 files (~1 MB)
    ├── shard_0001.json
    └── ...
```

**Benefits**:
- Memory efficient: Only loads active shard (~1 MB)
- Fast lookups: O(1) hash-based index
- Scalable: Handles 1M+ files easily
- Human-readable: JSON format for debugging

### Plugin System

Extensible file type handlers:

```python
from athena_file.plugins import FileTypeHandler, HandlerRegistry

class CustomHandler(FileTypeHandler):
    def can_handle(self, file_path, mime_type):
        return file_path.suffix == ".custom"

    def extract_metadata(self, fw):
        return {"custom_field": "value"}

registry = HandlerRegistry()
registry.register(CustomHandler())
```

Built-in handlers for:
- **Images**: JPEG, PNG, GIF, WEBP, BMP, TIFF
- **Videos**: MP4, AVI, MOV, MKV, WEBM
- **Archives**: ZIP, TAR, GZ, BZ2, XZ

## Documentation

- **[ARCHIVE_GUIDE.md](ARCHIVE_GUIDE.md)** - Comprehensive user guide with examples
- **[PERFORMANCE.md](PERFORMANCE.md)** - Performance analysis and benchmarks
- **[PROJECT_SUMMARY.md](PROJECT_SUMMARY.md)** - Implementation details and architecture

## Testing

```bash
# Run all tests (fast, ~5 seconds)
make test

# Run only the large-scale tests (10k+ files, slower)
pytest -m largescale

# Run only the tests marked slow
pytest -m slow
```

**Test Coverage**: 130 tests, 77% code coverage, 100% pass rate

## Use Cases

### Photo Library Management
```python
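from pathlib import Path
from athena_file.archive import FileArchiver
from athena_file.config import ArchiveConfig

config = ArchiveConfig(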
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp"
)
archiver = FileArchiver(config)
archiver.scan_and_index(Path("/photos"), recursive=True)
```

### Backup Verification
```python
from pathlib import Path

# `archiver` is a FileArchiver instance created as in Basic Usage
# Index backup
archiver.scan_and_index(Path("/backup"), recursive=True)

# Export checksums
archiver.export_checksums(Path("backup_checksums.blake3"))

# Later: verify integrity
verification = archiver.verify_integrity(recompute_hashes=True)
if verification.corrupted_count > 0:
    print("⚠ Backup corruption detected!")
```
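
The check behind hash-based verification can be sketched with the standard library. The manifest layout (`<hex digest>  <path>` per line), the helper names, and the use of `hashlib.sha256` in place of BLAKE3 are all assumptions for illustration; the file written by `export_checksums` may use a different format.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def write_manifest(files: list[Path], manifest: Path) -> None:
    """Write one '<hex digest>  <path>' line per file (an assumed format)."""
    lines = [f"{sha256_of(f)}  {f}" for f in files]
    manifest.write_text("\n".join(lines) + "\n")


def verify_manifest(manifest: Path) -> dict[str, list[str]]:
    """Recompute each hash and sort file names into valid, corrupted, or missing."""
    result: dict[str, list[str]] = {"valid": [], "corrupted": [], "missing": []}
    for line in manifest.read_text().splitlines():
        expected, _, name = line.partition("  ")
        path = Path(name)
        if not path.exists():
            result["missing"].append(path.name)
        elif sha256_of(path) == expected:
            result["valid"].append(path.name)
        else:
            result["corrupted"].append(path.name)
    return result


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    files = [root / "one.txt", root / "two.txt", root / "three.txt"]
    for f in files:
        f.write_text(f"contents of {f.name}")
    manifest = root / "checksums.sha256"
    write_manifest(files, manifest)

    files[1].write_text("bit rot")  # simulate corruption
    files[2].unlink()               # simulate a lost file
    outcome = verify_manifest(manifest)
```

Separating "missing" from "corrupted" mirrors the two checks named in `verify_integrity`: existence (`check_exists`) and content (`recompute_hashes`).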

### Deduplication Analysis
```python
# `archiver` is a FileArchiver instance created as in Basic Usage
dup_report = archiver.find_duplicates()

# Get largest duplicates
for group in dup_report.get_largest_groups(n=10):
    print(f"\n{group.count} copies ({group.wasted_space / 1024**2:.1f} MB wasted):")
    for file in group.files:
        print(f"  - {file.path}")
```
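
The grouping that underlies a duplicate report can be sketched independently of the library: hash every file, bucket paths by digest, and keep buckets with more than one entry. The function names are hypothetical and `hashlib.sha256` stands in for BLAKE3 so the example runs on the standard library alone.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never sit in memory whole."""
    hasher = hashlib.sha256()  # stand-in for BLAKE3, which needs the blake3 package
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_duplicates(paths: list[Path]) -> list[list[Path]]:
    """Return groups of two or more files that share a content hash."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_hash[file_digest(path)].append(path)
    return [sorted(group) for group in by_hash.values() if len(group) > 1]


def wasted_bytes(group: list[Path]) -> int:
    """Space reclaimed if only one copy in the group were kept."""
    return group[0].stat().st_size * (len(group) - 1)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.bin").write_bytes(b"x" * 1000)
    (root / "b.bin").write_bytes(b"x" * 1000)  # same content as a.bin
    (root / "c.bin").write_bytes(b"y" * 1000)  # unique
    groups = group_duplicates(sorted(root.iterdir()))
    group_names = [[p.name for p in g] for g in groups]
    wasted = sum(wasted_bytes(g) for g in groups)
```

Wasted space is counted as the file size times the number of extra copies, since one copy in each group has to stay.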

## Configuration Options

```python
from pathlib import Path
from athena_file.config import ArchiveConfig

config = ArchiveConfig(
    # Storage
    storage_dir=Path(".athena_archive"),
    shard_size=10000,              # Files per shard
    compression=True,               # Enable gzip compression

    # Hashing
    default_hash_algo="blake3",     # blake3, sha256, md5, sha1
    compute_all_hashes=False,       # Only compute default hash

    # Performance
    batch_size=1000,                # Files per batch
    max_workers=4,                  # Thread pool size

    # Thumbnails
    generate_thumbnails=True,
    thumbnail_sizes=[(256, 256), (128, 128)],
    thumbnail_format="webp",        # webp, jpg, png
    thumbnail_quality=85,

    # Error Handling
    retry_attempts=3,
    continue_on_error=True,

    # Progress
    enable_progress=True,
    progress_interval=100,
)
```
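
To get a feel for how `shard_size` and `batch_size` scale, the arithmetic below estimates shard and batch counts for a given archive size. The figure of roughly 1 MB per 10,000-file shard comes from the storage layout shown earlier; `storage_estimate` is a hypothetical helper, and the estimate ignores the effect of `compression=True`.

```python
import math


def storage_estimate(file_count: int, shard_size: int = 10_000,
                     bytes_per_shard: int = 1_000_000) -> tuple[int, int]:
    """Return (number of shards, approximate uncompressed bytes on disk)."""
    shards = math.ceil(file_count / shard_size)
    return shards, shards * bytes_per_shard


shards, total_bytes = storage_estimate(1_000_000)
shards_plus_one, _ = storage_estimate(1_000_001)  # one extra file opens a new shard
batches = math.ceil(1_000_000 / 1000)             # batch_size=1000

print(f"{shards} shards, ~{total_bytes / 1_000_000:.0f} MB, {batches} batches")
```

Under these assumptions a million-file archive occupies about 100 shards, and only one of them needs to be in memory at a time.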

## API Reference

### FileArchiver

Main API for archive operations:

- `scan_and_index(directory, recursive=True)` - Index directory
- `verify_integrity(check_exists=True, recompute_hashes=False)` - Verify files
- `find_duplicates()` - Find duplicate files
- `export_checksums(path, hash_type="blake3")` - Export checksums
- `search_by_name(pattern)` - Search by filename
- `search_by_type(file_type)` - Search by type
- `get_stats()` - Get archive statistics
- `get_file_metadata(hash)` - Get specific file metadata

### FileWorker

Low-level file operations (original API, maintained for backward compatibility):

- `blake3_hash` - BLAKE3 hash
- `sha256_hash` - SHA256 hash
- `md5_hash` - MD5 hash
- `sha1_hash` - SHA1 hash
- `fast_hash` - Fast hash for quick comparisons
- `exif_data` - EXIF metadata
- `copy(destination)` - Copy file
- `move(destination)` - Move file

## Dependencies

### Production
- `blake3>=1.0.8` - Fast cryptographic hashing
- `pyexiftool>=0.5.6` - EXIF metadata extraction
- `Pillow>=10.0.0` - Image processing and thumbnails

### Development
- `pytest>=9.0.2` - Testing framework
- `pytest-cov>=7.0.0` - Coverage reporting
- `mypy>=1.19.1` - Static type checking
- `ruff>=0.14.10` - Linting and formatting

## Contributing

Contributions are welcome! Please ensure:
- All tests pass (`make test`)
- Code coverage remains > 75%
- Code follows ruff formatting
- New features include tests and documentation

## License

See LICENSE file for details.

## References

- [blake3-py](https://github.com/oconnor663/blake3-py) - BLAKE3 Python bindings
- [pyexiftool](https://sylikc.github.io/) - ExifTool Python wrapper
- [Pillow](https://python-pillow.org/) - Python Imaging Library

## Project Status

✓ **Production Ready** - All 5 implementation phases complete:
1. Core Infrastructure
2. Batch Operations & Utils
3. Archive Operations
4. Detection, Plugins & Thumbnails
5. Performance Testing & Optimization

The system handles archives of 1M+ files; see [PERFORMANCE.md](PERFORMANCE.md) for the supporting analysis.
