初めに

(非商用ですが)性能がいいとのことなので、実際に試してみます

🚀Introducing new (synthetic) RLHF Dataset Nectar and new open model Starling-LM-7B-alpha🚀

🌟 Model & Dataset Highlights:

📊 Scores 8.09 in MT Bench: Surpassing all existing models except OpenAI's GPT-4 and GPT-4 Turbo.

📚 183K Chat Prompts + 7 responses in Nectar: With 3.8M… pic.twitter.com/OGxPdUIuny
— Banghua Zhu (@BanghuaZ) 2023年11月27日

モデル

huggingface.co

環境

GPU L4
Python 3.10

準備

今回はStreamで実行したいので、その部分も書いていきます

8bit量子化でロードしようとするとエラーが出たので、諦めてfloat16でロードします

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import TextStreamer

tokenizer = AutoTokenizer.from_pretrained("berkeley-nest/Starling-LM-7B-alpha")
model = AutoModelForCausalLM.from_pretrained("berkeley-nest/Starling-LM-7B-alpha",torch_dtype=torch.float16, low_cpu_mem_usage=True)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model = model.to("cuda:0")

推論

以下のコードで推論を実行していきます

text = "自然言語処理とは何か"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=10000,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        pad_token_id = 32000,
        streamer=streamer,
    )[0]