Integrating Faster-Whisper with Subtitle Edit for Local Speech-to-Text

TLDR

  • Faster-Whisper Advantages: Built on the CTranslate2 engine, it runs up to about 4x faster than the original Whisper and significantly reduces VRAM requirements through 8-bit quantization.
  • Recommended Solution: Use Subtitle Edit with Purfview's Faster-Whisper-XXL to avoid common dependency conflict issues encountered when installing Python environments directly.
  • Model Selection Advice: For a balance of speed and accuracy, large-v3-turbo is the top choice; for maximum accuracy, choose large-v3.
  • Performance Benchmark: On an RTX 4070 Ti Super, processing 5 minutes and 16 seconds of audio takes approximately 16 seconds with large-v3-turbo and about 32 seconds with large-v3, roughly 20x and 10x faster than real time, respectively.

Introduction to Faster-Whisper Technology

When would you need Faster-Whisper? It is ideal when users want to perform local Speech-to-Text (STT) but are constrained by hardware resources or wish to improve processing speed.

Faster-Whisper is an implementation of Whisper based on CTranslate2 (a fast inference engine for Transformer models). Compared to the original OpenAI Whisper, its core advantages include:

  • Faster Speed: Inference runs up to about 4x faster.
  • Lower Memory Usage: Significantly reduces VRAM requirements through 8-bit quantization.
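
For readers who want to see what the library looks like outside Subtitle Edit, here is a minimal sketch using the faster-whisper Python package. It assumes the package is installed and a CUDA GPU is available; the helper names `format_srt_timestamp` and `transcribe_to_srt` are made up for this illustration, and the 8-bit quantization mentioned above is requested through the `compute_type` argument.

```python
def format_srt_timestamp(seconds: float) -> str:
    """Convert a time in seconds to the SRT form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def transcribe_to_srt(audio_path: str, model_size: str = "large-v3") -> str:
    """Transcribe audio_path and return the result as SRT text (illustrative helper)."""
    # Imported lazily so the timestamp helper works without the package installed.
    from faster_whisper import WhisperModel

    # "int8_float16" asks CTranslate2 for 8-bit weights on the GPU.
    # On a machine without CUDA, device="cpu" with compute_type="int8" is the usual choice.
    model = WhisperModel(model_size, device="cuda", compute_type="int8_float16")
    segments, _info = model.transcribe(audio_path, beam_size=5)

    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_timestamp(segment.start)
        end = format_srt_timestamp(segment.end)
        blocks.append(f"{index}\n{start} --> {end}\n{segment.text.strip()}\n")
    return "\n".join(blocks)
```

Calling `transcribe_to_srt("audio.mp3")` (a placeholder file name) would download the model on first use and return subtitle text that can be saved as an .srt file. This is only a sketch of the scripting route; the Subtitle Edit route described next avoids the Python setup entirely.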

Subtitle Edit Integration

When might you encounter installation difficulties? When attempting to install the Faster-Whisper-XXL standalone package directly, users often face execution failures due to dependency issues with Python multimedia packages. Integrating it via Subtitle Edit helps avoid these environment configuration pitfalls.

Integration Steps

  1. Open Subtitle Edit and select "Video" -> "Audio to text (Whisper)..." from the menu.
  2. If the system prompts you to download ffmpeg, follow the instructions to complete the installation.
  3. In the Engine option, select "Purfview's Faster-Whisper-XXL".
  4. Download the model from the "Choose model" dropdown menu. faster-whisper-large-v3 or faster-whisper-large-v3-turbo is recommended; the two are compared under Model Differences below.
  5. Drag and drop the video/audio file into the window and click "Generate" to start the transcription.

Model Differences

  • Large-v3: Currently the most accurate model with the most parameters; inference is slower and requires more memory.
  • Large-v3-Turbo: A pruned and fine-tuned version of large-v3 that reduces the decoder layers from 32 to 4. Parameters are reduced by approximately 48%, but speed is increased by about 8x, with English recognition accuracy nearly identical to the full version.
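
The parameter reduction quoted for large-v3-turbo can be checked with a little arithmetic. The sketch below assumes the parameter counts published in OpenAI's Whisper model table (1550M for large-v3, 809M for large-v3-turbo) and estimates the size of the weights alone at 16-bit and 8-bit precision. Actual VRAM use is higher, since activations and runtime overhead are not counted.

```python
# Parameter counts in millions, as published in OpenAI's Whisper model table.
PARAMS_MILLIONS = {"large-v3": 1550, "large-v3-turbo": 809}


def reduction_percent(full: float, reduced: float) -> float:
    """Percentage by which `reduced` is smaller than `full`."""
    return (full - reduced) / full * 100


def weight_memory_gb(params_millions: float, bytes_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_millions * 1e6 * bytes_per_weight / 1e9


saved = reduction_percent(PARAMS_MILLIONS["large-v3"], PARAMS_MILLIONS["large-v3-turbo"])
print(f"Parameter reduction: {saved:.1f}%")

for name, params in PARAMS_MILLIONS.items():
    fp16 = weight_memory_gb(params, 2)  # 2 bytes per weight at float16
    int8 = weight_memory_gb(params, 1)  # 1 byte per weight at int8
    print(f"{name}: about {fp16:.2f} GB at float16, {int8:.2f} GB at int8 (weights only)")
```

Under these assumptions the reduction comes out just under 48%, which matches the figure above, and 8-bit weights take half the space of 16-bit weights for either model.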

Performance Benchmark and Analysis

When will you notice a significant difference? When processing longer audio files or when high-quality recognition is required.

The following test was conducted using a 5-minute and 16-second mp3 file on a PNY RTX 4070 Ti Super 16GB:

  • Test Results:
    • large-v3-turbo: Took approximately 16 seconds.
    • large-v3: Took approximately 32 seconds.

Conclusion: large-v3-turbo (16 seconds) is slightly slower than the older WhisperDesktop running the Medium model (11 seconds), and large-v3 takes 32 seconds. Even so, large-v3 transcribes this 316-second file at roughly 10x real-time speed. For audio with heavy background music interference, the recognition quality of large-v3 is superior to the Medium model, and its execution speed is more than sufficient for daily local transcription needs.
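
The timings above can also be expressed as speed relative to real time, which makes results easier to compare across files of different lengths. This is a small sketch using only the figures reported in this test; the function name `speed_vs_real_time` is chosen here for illustration.

```python
AUDIO_SECONDS = 5 * 60 + 16  # the 5 min 16 s test file

# Elapsed times in seconds, taken from the test results in this article.
ELAPSED_SECONDS = {
    "WhisperDesktop (Medium)": 11,
    "large-v3-turbo": 16,
    "large-v3": 32,
}


def speed_vs_real_time(audio_seconds: float, elapsed_seconds: float) -> float:
    """How many seconds of audio are processed per second of wall-clock time."""
    return audio_seconds / elapsed_seconds


for name, elapsed in ELAPSED_SECONDS.items():
    factor = speed_vs_real_time(AUDIO_SECONDS, elapsed)
    print(f"{name}: {factor:.1f}x real time")
```

A factor above 1 means transcription finishes faster than the audio plays. All three configurations are well above that threshold on this hardware.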


Changelog

    • Initial document creation.