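Introduction

Whether someone is currently speaking is usually determined with VAD (voice activity detection). In this post, I use turn detection instead of VAD to make that determination.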
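
To make the distinction concrete, here is a toy energy-based VAD. It is only a sketch for illustration: it is not Silero VAD or smart-turn, and the names `frame_rms` and `energy_vad`, the frame size, the threshold, and the 16 kHz sample rate are all assumptions made for this example. A VAD like this only answers "does this frame contain sound that looks like speech?", whereas turn detection asks whether the speaker has finished what they were saying.

```python
import math


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples in [-1, 1]."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(samples, frame_size=160, threshold=0.02):
    """Return one bool per full frame: True if the frame is loud enough.

    frame_size and threshold are illustrative values, not tuned ones.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        flags.append(frame_rms(frame) >= threshold)
    return flags


# Synthetic signal: 2 silent frames, 3 frames of a 440 Hz tone, 2 silent frames.
sample_rate = 16000  # assumed for this example only
silence = [0.0] * 320
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(480)]

flags = energy_vad(silence + tone + silence)
print(flags)
```

A pause in the middle of a sentence and the end of a sentence both look like silent frames here, which is the gap turn detection tries to close.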

Development environment

macOS
uv

Setup

First, install the library needed to handle audio input:

brew install portaudio

Next, install the Python dependencies via pip (through uv):

uv pip install -r requirements_inference.txt

Finally, download the model. To download it from the command line, run the following:

python -c "from huggingface_hub import hf_hub_download; \
  model_path = hf_hub_download(repo_id='pipecat-ai/smart-turn-v3', \
  filename='smart-turn-v3.0.onnx', local_dir='.'); \
  print(f'Model downloaded to: {model_path}')"
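
The one-liner above is hard to read and reuse, so here is the same download written as a small function. This is a sketch of my own rather than anything shipped with the model: `download_model`, `REPO_ID`, and `FILENAME` are hypothetical names, while the repository ID, filename, and the `hf_hub_download` call are taken from the command above.

```python
from pathlib import Path

# Values taken from the one-liner above.
REPO_ID = "pipecat-ai/smart-turn-v3"
FILENAME = "smart-turn-v3.0.onnx"


def download_model(local_dir: str = ".") -> Path:
    """Download the ONNX model into local_dir and return its path."""
    # Imported here so this module can be imported even before
    # huggingface_hub is installed.
    from huggingface_hub import hf_hub_download

    model_path = hf_hub_download(
        repo_id=REPO_ID,
        filename=FILENAME,
        local_dir=local_dir,
    )
    return Path(model_path)
```

Calling `download_model()` with no arguments places the file in the current directory, matching `local_dir='.'` in the command above.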

Turn detection

Run the following:

python record_and_predict.py

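After starting it, speaking into the microphone produces a log like the one below. As I alternated between talking and staying silent, it produced detections in near real time.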

python record_and_predict.py
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Downloading Silero VAD ONNX model...
ONNX model downloaded.
Listening for speech... (Ctrl+C to stop)
Processing segment (2.08s)...
--------
Prediction: Incomplete
Probability of complete: 0.0154
Inference time: 34.38 ms
Listening for speech...
Processing segment (4.93s)...
--------
Prediction: Incomplete
Probability of complete: 0.0307
Inference time: 43.75 ms
Listening for speech...
Processing segment (2.24s)...
--------
Prediction: Complete
Probability of complete: 0.6616
Inference time: 46.82 ms
Listening for speech...
Processing segment (3.81s)...
--------
Prediction: Incomplete
Probability of complete: 0.0986
Inference time: 45.21 ms
Listening for speech...
Processing segment (1.79s)...
--------
Prediction: Incomplete
Probability of complete: 0.0122
Inference time: 47.23 ms
Listening for speech...
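
To compare runs, a log like this can be summarized with a small parser. The following is a minimal sketch that assumes the log format stays exactly as shown above; `SEGMENT_RE` and `parse_log` are hypothetical names and are not part of `record_and_predict.py`. The sample text is copied from the log in this post.

```python
import re
from statistics import mean

# One match per "Processing segment" block in the log.
SEGMENT_RE = re.compile(
    r"Processing segment \((?P<duration>[\d.]+)s\)\.\.\.\s*\n"
    r"-+\s*\n"
    r"Prediction: (?P<prediction>\w+)\s*\n"
    r"Probability of complete: (?P<probability>[\d.]+)\s*\n"
    r"Inference time: (?P<inference_ms>[\d.]+) ms"
)


def parse_log(text):
    """Turn the raw log text into a list of per-segment dicts."""
    records = []
    for m in SEGMENT_RE.finditer(text):
        records.append({
            "duration_s": float(m["duration"]),
            "prediction": m["prediction"],
            "probability": float(m["probability"]),
            "inference_ms": float(m["inference_ms"]),
        })
    return records


SAMPLE_LOG = """\
Listening for speech... (Ctrl+C to stop)
Processing segment (2.08s)...
--------
Prediction: Incomplete
Probability of complete: 0.0154
Inference time: 34.38 ms
Listening for speech...
Processing segment (4.93s)...
--------
Prediction: Incomplete
Probability of complete: 0.0307
Inference time: 43.75 ms
Listening for speech...
Processing segment (2.24s)...
--------
Prediction: Complete
Probability of complete: 0.6616
Inference time: 46.82 ms
Listening for speech...
Processing segment (3.81s)...
--------
Prediction: Incomplete
Probability of complete: 0.0986
Inference time: 45.21 ms
Listening for speech...
Processing segment (1.79s)...
--------
Prediction: Incomplete
Probability of complete: 0.0122
Inference time: 47.23 ms
Listening for speech...
"""

records = parse_log(SAMPLE_LOG)
completed = [r for r in records if r["prediction"] == "Complete"]
avg_ms = mean(r["inference_ms"] for r in records)
print(f"{len(completed)} of {len(records)} segments complete")
print(f"mean inference time: {avg_ms:.2f} ms")
```

For the five segments above, one was judged complete and the mean inference time works out to about 43.5 ms.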