- Switch transcription to Whisper models hosted on Hugging Face Hub
- Use VAD-based audio segmentation prior to transcription and fix device placement
- Extract word timestamps via attention-based DTW alignment with fallbacks (experimental)
- Improve reproducibility and avoid build isolation issues in CI
- Restore TensorFlow Metal device support and clean up dependencies
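
For reviewers unfamiliar with the word-timestamp item, here is a minimal sketch of DTW alignment over an attention matrix. It is not the code in this PR: the function names, the `1 - attention` cost, and the 20 ms frame duration are all assumptions made for illustration, and the fallbacks mentioned above are not shown.

```python
import numpy as np


def dtw_path(cost):
    """Return the minimum-cost monotonic path through a (tokens x frames) cost matrix."""
    n, m = cost.shape
    # acc[i, j] holds the cheapest cumulative cost of reaching cell (i - 1, j - 1).
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the last cell to the first, preferring the diagonal on ties.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin((acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]


def token_start_frames(attention):
    """Map each token to the first audio frame it is aligned with.

    `attention` is a hypothetical (tokens x frames) array of weights in [0, 1];
    high attention is turned into low cost (an assumption of this sketch).
    """
    starts = {}
    for token, frame in dtw_path(1.0 - attention):
        starts.setdefault(token, frame)
    return [starts[t] for t in range(attention.shape[0])]


def token_start_times(attention, seconds_per_frame=0.02):
    """Convert start frames to seconds; 0.02 s per frame is an assumed value."""
    return [f * seconds_per_frame for f in token_start_frames(attention)]


# Toy input: three tokens, each attending to two consecutive frames.
toy_attention = np.array(
    [
        [1, 1, 0, 0, 0, 0],
        [0, 0, 1, 1, 0, 0],
        [0, 0, 0, 0, 1, 1],
    ],
    dtype=float,
)
print(token_start_frames(toy_attention))
```

On the toy input each token is aligned to the first frame of its own attention block. Real attention maps are noisier, which is presumably why this path is marked experimental and ships with fallbacks.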