Introduction

A new CLIP model with Japanese support has been released, so I'll try it out.

blog.recruit.co.jp

Environment

L4 GPU
Ubuntu 22.04

Setup

Install the required libraries.

!pip install pillow requests transformers torch torchvision sentencepiece

Running

Loading the model

import io
import requests

import torch
import torchvision
from PIL import Image
from transformers import AutoTokenizer, AutoModel


model_name = "recruit-jp/japanese-clip-vit-b-32-roberta-base"
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).to(device)


def _convert_to_rgb(image):
    return image.convert('RGB')


preprocess = torchvision.transforms.Compose([
    torchvision.transforms.Resize(size=224, interpolation=torchvision.transforms.InterpolationMode.BICUBIC, max_size=None),
    torchvision.transforms.CenterCrop(size=(224, 224)),
    _convert_to_rgb,
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073], std=[0.26862954, 0.26130258, 0.27577711])
])


def tokenize(tokenizer, texts):
    texts = ["[CLS]" + text for text in texts]
    encodings = [
        # NOTE: the maximum token length that can be fed into this model is 77
        tokenizer(text, max_length=77, padding="max_length", truncation=True, add_special_tokens=False)["input_ids"]
        for text in texts
    ]
    return torch.LongTensor(encodings)
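
To make the geometry of the preprocess pipeline above more concrete, here is a small pure-Python sketch of what a shorter-side resize to 224 followed by a 224x224 center crop does to an image's dimensions. The function names and the 1260x750 input size are hypothetical, chosen only for illustration, and the arithmetic is meant to approximate what torchvision's Resize and CenterCrop do with these arguments; the library's exact rounding may differ in edge cases.

```python
def resize_shorter_side(width, height, size=224):
    """Return (new_width, new_height) after scaling the shorter side to `size`."""
    if width <= height:
        return size, int(size * height / width)
    return int(size * width / height), size


def center_crop_box(width, height, crop=224):
    """Return the (left, top, right, bottom) box of a centered crop-by-crop window."""
    left = int(round((width - crop) / 2.0))
    top = int(round((height - crop) / 2.0))
    return left, top, left + crop, top + crop


# Hypothetical 1260x750 landscape input, used only as an example.
resized = resize_shorter_side(1260, 750)
box = center_crop_box(*resized)
print("resized:", resized)
print("crop box:", box)
```

For a landscape image, the crop keeps the full height and discards equal strips on the left and right, so subjects near the horizontal edges may be cut off before the model sees them.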

CLIP test on the sample image

I'll test with the following image, which is used in the model's sample code.

# Run!
image = Image.open(
    io.BytesIO(
        requests.get(
            'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260'
        ).content
    )
)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenize(tokenizer, texts=["犬", "猫", "象"]).to(device)
with torch.inference_mode():
    image_features = model.get_image_features(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features = model.get_text_features(input_ids=text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = image_features @ text_features.T
print("Label probs:", probs.cpu().numpy()[0])

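The result is below; the score for 犬 (dog) is the highest, well above 猫 (cat) and 象 (elephant). Because the features are L2-normalized before the matrix product and no softmax is applied, these values are cosine similarities rather than probabilities, despite the "Label probs" label.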

Label probs: [0.49223694 0.23412797 0.25611094]
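
If probability-like values are preferred, the scores printed above can be passed through a softmax. The following is a minimal sketch in plain Python; the `softmax` helper is my own, and the `scale=100.0` factor is only an assumed placeholder for a CLIP-style temperature, not a value confirmed for this model.

```python
import math


def softmax(scores, scale=1.0):
    """Convert raw scores into probabilities that sum to 1."""
    scaled = [s * scale for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["犬", "猫", "象"]
similarities = [0.49223694, 0.23412797, 0.25611094]  # scores printed above

# ASSUMPTION: scale=100.0 is a placeholder temperature, not taken from the model.
probs = softmax(similarities, scale=100.0)
for label, p in zip(labels, probs):
    print(f"{label}: {p:.4f}")
```

A larger scale makes the distribution sharper, so the ranking of labels stays the same while the gap between them widens.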

CLIP test on a Tsukuyomi-chan image

Next, I'll test with the following image of Tsukuyomi-chan.

Illustration by えみゃコーラ

# Run!
image = Image.open(
    io.BytesIO(
        requests.get(
            'https://tyc.rei-yumesaki.net/wp-content/uploads/emya-furisode.png'
        ).content
    )
)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenize(tokenizer, texts=["女の子", "男の子", "猫"]).to(device)
with torch.inference_mode():
    image_features = model.get_image_features(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features = model.get_text_features(input_ids=text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = image_features @ text_features.T
print("Label probs:", probs.cpu().numpy()[0])

The result was as follows: 女の子 (girl) scored highest, while 男の子 (boy) and 猫 (cat) were nearly tied.

Label probs: [0.41579735 0.25743386 0.2505836 ]
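
When there are more than a few labels, reading the raw array gets tedious. As a small sketch, the scores printed above can be paired with their labels and sorted; `rank_labels` is a hypothetical helper, not part of the model's API.

```python
def rank_labels(labels, scores):
    """Pair each label with its score and sort from highest to lowest."""
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


labels = ["女の子", "男の子", "猫"]
scores = [0.41579735, 0.25743386, 0.2505836]  # scores printed above

ranking = rank_labels(labels, scores)
for label, score in ranking:
    print(f"{label}: {score:.4f}")

# Gap between the best and second-best label, as a rough confidence signal.
margin = ranking[0][1] - ranking[1][1]
print(f"margin: {margin:.4f}")
```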

Mood test

Using the same Tsukuyomi-chan image, I'll test whether the model can pick up on the mood.

# Run!
image = Image.open(
    io.BytesIO(
        requests.get(
            'https://tyc.rei-yumesaki.net/wp-content/uploads/emya-furisode.png'
        ).content
    )
)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenize(tokenizer, texts=["かわいい", "かっこいい"]).to(device)
with torch.inference_mode():
    image_features = model.get_image_features(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features = model.get_text_features(input_ids=text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = image_features @ text_features.T
print("Label probs:", probs.cpu().numpy()[0])

The result was as follows: かわいい (cute) scored higher than かっこいい (cool).

Label probs: [0.44720185 0.32953736]
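
Each test above L2-normalizes the image and text features before taking their matrix product, which means every printed score is a cosine similarity. The sketch below shows the same computation in plain Python on hypothetical 3-dimensional toy vectors (the real features are much higher-dimensional tensors).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the two vectors after L2 normalization."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


# Hypothetical toy vectors, not real model features.
image_vec = [3.0, 4.0, 0.0]
same_direction = [6.0, 8.0, 0.0]
orthogonal = [0.0, 0.0, 5.0]

print(cosine_similarity(image_vec, same_direction))
print(cosine_similarity(image_vec, orthogonal))
```

Since the vector length is divided out, only the direction of the features matters, and scores always fall between -1 and 1.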

yousanのメモ

Running recruit-jp/japanese-clip-vit-b-32-roberta-base
