yousan's Notes

Running DiariZen, a Speaker Diarization Toolkit, on Windows

AI

Introduction
Development environment
Environment setup
Downloading the models
Running

Introduction

DiariZen, a speaker diarization toolkit built on AudioZen and Pyannote 3.1, has been released.

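Its main features are:

Uses WavLM models based on self-supervised learning (SSL)
Supports model compression through structured pruning
Easy inference through Hugging Face integration
Achieves high accuracy on multiple benchmark datasets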

Note, however, that the released pre-trained models are licensed for non-commercial use only (CC BY-NC 4.0).

Development environment

Windows 11
uv 0.9.x

Environment setup

uv venv --python 3.10
.\.venv\Scripts\activate

# Install PyTorch (CUDA 12.1)
uv pip install torch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 --index-url https://download.pytorch.org/whl/cu121


uv pip install einops flit h5py joblib jupyterlab tensorboard librosa matplotlib "numpy==1.26.4" onnxruntime-gpu openpyxl pandas pre-commit pyyaml scipy soundfile tabulate toml torchinfo tqdm "accelerate==1.6.0" thop
uv pip install -e .

**pesq and pystoi require Visual Studio Build Tools to install on Windows.**

uv pip install -e "pyannote-audio[dev,testing]"

git submodule init
git submodule update
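
Before moving on, it can help to confirm that the virtual environment actually contains what the later steps need. The following is a minimal sketch, not part of DiariZen: the `REQUIRED` list and the helper names `missing_packages` and `cuda_summary` are my own assumptions, and the list is illustrative rather than complete.

```python
import importlib.util
import sys

# Hypothetical checklist of packages used later in this article (not exhaustive).
REQUIRED = ["torch", "torchaudio", "numpy", "huggingface_hub", "diarizen", "pyannote.audio"]


def missing_packages(names):
    """Return the names that cannot be located in the current environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # A dotted name whose parent package is absent raises instead of returning None.
            found = False
        if not found:
            missing.append(name)
    return missing


def cuda_summary():
    """Describe CUDA availability, or report that torch is not installed."""
    if missing_packages(["torch"]):
        return "torch is not installed"
    import torch

    if torch.cuda.is_available():
        device = torch.cuda.get_device_name(0)
        return f"torch {torch.__version__}, CUDA {torch.version.cuda}, device: {device}"
    return f"torch {torch.__version__}, CUDA not available (CPU only)"


print("Python:", sys.version.split()[0])
print("Missing:", missing_packages(REQUIRED) or "none")
print(cuda_summary())
```

If the last line reports that CUDA is not available, the CPU-only wheel was probably installed, so the PyTorch install command and its `--index-url` are worth rechecking.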

Downloading the models

# DiariZen Large model (recommended, high accuracy)
python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='BUT-FIT/diarizen-wavlm-large-s80-md', cache_dir='./models')"

# Speaker embedding model (required)
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='pyannote/wespeaker-voxceleb-resnet34-LM', filename='pytorch_model.bin', cache_dir='./models')"
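
To check that both downloads landed in `./models`, the cache directory can be inspected directly. This is a rough sketch that relies on the Hugging Face cache layout, where each model repo is stored in a folder named `models--<org>--<name>`; the helper `cached_repo_ids` is a name I made up, and it would misread a repo whose name itself contains `--`.

```python
from pathlib import Path


def cached_repo_ids(cache_dir):
    """List Hugging Face model repos found directly under cache_dir."""
    root = Path(cache_dir)
    if not root.is_dir():
        return []
    repo_ids = []
    for entry in root.iterdir():
        # Model repos are stored as "models--<org>--<name>" directories.
        if entry.is_dir() and entry.name.startswith("models--"):
            repo_ids.append(entry.name[len("models--"):].replace("--", "/"))
    return sorted(repo_ids)


# The two repos downloaded by the commands above.
expected = [
    "BUT-FIT/diarizen-wavlm-large-s80-md",
    "pyannote/wespeaker-voxceleb-resnet34-LM",
]
found = cached_repo_ids("./models")
for repo_id in expected:
    print(f"{repo_id}: {'found' if repo_id in found else 'missing'}")
```

Run it from the same directory as the download commands, since `./models` is resolved relative to the current working directory.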

Running

Let's run it on the sample audio file that ships with the repository. Create the following inference script:

"""
Smoke-test script for the DiariZen model
"""
import os
import sys
from diarizen.pipelines.inference import DiariZenPipeline

# Force UTF-8 output on Windows consoles
if sys.platform == 'win32':
    sys.stdout.reconfigure(encoding='utf-8')

# Use the local ./models directory as the model cache
cache_dir = os.path.join(os.getcwd(), 'models')

print("=" * 60)
print("DiariZen Model Test")
print("=" * 60)

# Load the model
print("\n[1/3] Loading model...")
print("Model: BUT-FIT/diarizen-wavlm-large-s80-md")
diar_pipeline = DiariZenPipeline.from_pretrained(
    "BUT-FIT/diarizen-wavlm-large-s80-md",
    cache_dir=cache_dir
)
print("OK: Model loaded successfully")

# Path to the audio file
audio_file = './example/EN2002a_30s.wav'
print(f"\n[2/3] Processing audio: {audio_file}")

# Run diarization
diar_results = diar_pipeline(audio_file)
print("OK: Diarization completed")

# Print the results
print("\n[3/3] Results:")
print("-" * 60)
print(f"{'Start':>10} | {'End':>10} | Speaker")
print("-" * 60)
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    print(f"{turn.start:>9.1f}s | {turn.end:>9.1f}s | speaker_{speaker}")
print("-" * 60)

print("\nOK: Test completed successfully!")
print("=" * 60)

The result looks like this:

Results:
------------------------------------------------------------
     Start |        End | Speaker
------------------------------------------------------------
      0.0s |       2.7s | speaker_0
      0.8s |      13.6s | speaker_3
      5.8s |       6.4s | speaker_0
      8.0s |      10.6s | speaker_0
     10.6s |      10.7s | speaker_1
     10.7s |      13.6s | speaker_0
     13.7s |      18.4s | speaker_1
     17.8s |      18.2s | speaker_3
     18.9s |      19.3s | speaker_0
     19.6s |      20.0s | speaker_1
     20.3s |      22.2s | speaker_3
     23.3s |      23.5s | speaker_1
     23.5s |      30.4s | speaker_2
------------------------------------------------------------
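
As a small follow-up, the segments can be summed into a total speaking time per speaker. This is only a sketch: `SEGMENTS` is copied by hand from the table above (already rounded to 0.1 s), and `speaking_time` is a helper name of my own. With the real pipeline output, the same tuples could be built from `diar_results.itertracks(yield_label=True)` as `(turn.start, turn.end, speaker)`.

```python
from collections import defaultdict

# (start, end, speaker) rows copied from the result table above.
SEGMENTS = [
    (0.0, 2.7, 0), (0.8, 13.6, 3), (5.8, 6.4, 0), (8.0, 10.6, 0),
    (10.6, 10.7, 1), (10.7, 13.6, 0), (13.7, 18.4, 1), (17.8, 18.2, 3),
    (18.9, 19.3, 0), (19.6, 20.0, 1), (20.3, 22.2, 3), (23.3, 23.5, 1),
    (23.5, 30.4, 2),
]


def speaking_time(segments):
    """Sum segment durations per speaker, in seconds."""
    totals = defaultdict(float)
    for start, end, speaker in segments:
        totals[speaker] += end - start
    return dict(totals)


totals = speaking_time(SEGMENTS)
for speaker, seconds in sorted(totals.items()):
    print(f"speaker_{speaker}: {seconds:.1f}s")
# speaker_0: 9.2s
# speaker_1: 5.4s
# speaker_2: 6.9s
# speaker_3: 15.1s
```

The totals add up to more than the length of the clip because overlapping speech is counted once for each speaker involved.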