Running TheBloke/calm2-7B-chat-AWQ on Google Colab

Introduction

It's a bit late to the party, but let's try running the AWQ version of CALM2.

https://huggingface.co/TheBloke/calm2-7B-chat-AWQ

Environment

  • Google Colab A100 (inference uses about 25 GB of GPU RAM, so the free tier wasn't enough; see the quick GPU check below)
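Before installing anything, it can help to confirm which GPU Colab actually assigned and how much memory it has. A minimal check (my addition, run in a Colab cell):

# Show the assigned GPU and its total memory
!nvidia-smi --query-gpu=name,memory.total --format=csv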

Setup

Install the required libraries:

!pip -q install --upgrade accelerate autoawq
!pip install torch==2.1.0+cu121 torchtext==0.16.0+cpu torchdata==0.7.0 --index-url https://download.pytorch.org/whl/cu121

Installing only autoawq failed with the error below, so I followed the workaround from the issue linked underneath (reinstalling the cu121 build of PyTorch).

ImportError                               Traceback (most recent call last)
<ipython-input-2-e1b236244288> in <cell line: 1>()
----> 1 from awq import AutoAWQForCausalLM
      2 from transformers import AutoTokenizer
      3 
      4 model_name_or_path = "TheBloke/calm2-7B-chat-AWQ"
      5 

5 frames
/usr/local/lib/python3.10/dist-packages/awq/modules/linear.py in <module>
      2 import torch
      3 import torch.nn as nn
----> 4 import awq_inference_engine  # with CUDA kernels
      5 
      6 

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

github.com
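After reinstalling PyTorch for CUDA 12.1, the import should go through. A quick sanity check (my addition, not from the original post):

import torch
from awq import AutoAWQForCausalLM  # should no longer raise ImportError

print(torch.__version__)   # expect 2.1.0+cu121
print(torch.version.cuda)  # expect 12.1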

Inference

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

print("start")

model_name_or_path = "TheBloke/calm2-7B-chat-AWQ"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=False)
# Load model
model = AutoAWQForCausalLM.from_quantized(model_name_or_path, fuse_layers=True,
                                          trust_remote_code=False, safetensors=True)

prompt = "Tell me about AI"
prompt_template=f'''USER: {prompt}
ASSISTANT:
'''

print("*** Running model.generate:")

token_input = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

# Generate output
generation_output = model.generate(
    token_input,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    max_new_tokens=512
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("LLM output: ", text_output)

"""
# Inference should be possible with transformers pipeline as well in future
# But currently this is not yet supported by AutoAWQ (correct as of September 25th 2023)
from transformers import pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    top_k=40,
    repetition_penalty=1.1
)

print(pipe(prompt_template)[0]['generated_text'])
"""

Let's also try asking my usual question.
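The code above can be reused by simply swapping in a different prompt while keeping the USER/ASSISTANT template. For example (a placeholder Japanese question, not necessarily the one used in this post):

prompt = "日本の首都はどこですか？"  # placeholder question
prompt_template = f'''USER: {prompt}
ASSISTANT:
'''

token_input = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
generation_output = model.generate(token_input, do_sample=True, temperature=0.7,
                                   top_p=0.95, top_k=40, max_new_tokens=512)
print(tokenizer.decode(generation_output[0]))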

Resources required for inference