Running Dolphin, a multimodal VLM for document image analysis, on Windows

Introduction

ByteDance has released an open-source multimodal model for document image analysis. In this post, we'll run it locally.

The model is published on Hugging Face:

huggingface.co

Development environment

Environment setup

First, create a Python environment:

uv venv --python 3.11
.\.venv\Scripts\activate
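
To confirm the virtual environment is active and using the expected interpreter, a quick sanity check (not part of the original steps):

python --version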

Next, install the required libraries:

uv pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
uv pip install numpy==1.24.4 omegaconf==2.3.0 opencv-python==4.5.5.64 opencv-python-headless==4.5.5.64 pillow==9.3.0 timm==0.5.4 transformers==4.47.0 accelerate==1.6.0 pymupdf==1.26
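
Before going further, it is worth confirming that the CUDA build of PyTorch actually sees the GPU. This check is an addition to the original write-up:

uv run python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints 2.1.0+cu121 and True, the GPU is usable from the venv; False usually means a CPU-only wheel was installed or the driver does not support CUDA 12.1.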

Next, download the model:

uv run hf download ByteDance/Dolphin --local-dir ./hf_model
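
If the hf CLI is not available in your environment, the same files can be fetched from Python via huggingface_hub's snapshot_download (huggingface_hub comes in as a dependency of transformers). This is an alternative sketch, not the command used in the article:

uv run python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='ByteDance/Dolphin', local_dir='./hf_model')"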

Running the demos

Several sample files are included, so we'll use those to run the model.

First, let's process one of the sample page images (the first page of the LLaMA paper).
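
The invocation has the same shape as the PDF example later in this post; only --input_path changes. The image filename below is a placeholder, substitute one of the sample images under demo/page_imgs:

uv run python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_1.jpeg --save_dir ./results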

Running this produces the following output:

# LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron; Thibaut Lavril; Gautier Izacard; Xavier Martinet

Marie-Anne Lachaux, Timothee Lacroix, Baptiste Rozière, Naman Goyal Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin Edouard Grave; Guillaume Lample*

Meta AI

## **Abstract**

We introduce LLaMA, a collection of foundation language models ranging from 7B to 65B parameters. We train our models on trillions of tokens, and show that it is possible to train state-of-the-art models using publicly available datasets exclusively, without resorting to proprietary and inaccessible datasets. In particular, LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA65B is competitive with the best models, Chinchilla-70B and PaLM-540B. We release all our models to the research community $^1$ .

## 1 Introduction

Large Languages Models (LLMs) trained on massive corpora of texts have shown their ability to perform new tasks from textual instructions or from a few examples ( Brown et al. , 2020 ) . These few-shot properties first appeared when scaling models to a sufficient size ( Kaplan et al. , 2020 ) , resulting in a line of work that focuses on further scaling these models ( Chowdhery et al. , 2022 ; Rae et al. , 2021 ) . These efforts are based on the assumption that more parameters will lead to better performance. However, recent work from Hoffmann et al. ( 2022 ) shows that, for a given compute budget, the best performances are not achieved by the largest models, but by smaller models trained on more data.

The objective of the scaling laws from Hoffmann et al. ( 2022 ) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of

performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. ( 2022 ) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

The focus of this work is to train a series of language models that achieve the best possible performance at various inference budgets, by training on more tokens than what is typically used. The resulting models, called LLaMA , ranges from 7B to 65B parameters with competitive performance compared to the best existing LLMs. For instance, LLaMA-13B outperforms GPT-3 on most benchmarks, despite being 10 $\times$ smaller. We believe that this model will help democratize the access and study of LLMs, since it can be run on a single GPU. At the higher-end of the scale, our 65B-parameter model is also competitive with the best large language models such as Chinchilla or PaLM-540B.

Unlike Chinchilla, PaLM, or GPT-3, we only use publicly available data, making our work compatible with open-sourcing, while most existing models rely on data which is either not publicly available or undocumented (e.g. “ Books – 2TB ” or “ Social media conversations ” ). There exist some exceptions, notably OPT ( Zhang et al. , 2022 ) , GPT-NeoX ( Black et al. , 2022 ) , BLOOM ( Scao et al. , 2022 ) and GLM ( Zeng et al. , 2022 ) , but none that are competitive with PaLM-62B or Chinchilla.

In the rest of this paper, we present an overview of the modifications we made to the transformer architecture ( Vaswani et al. , 2017 ) , as well as our training method. We then report the performance of our models and compare with others LLMs on a set of standard benchmarks. Finally, we expose some of the biases and toxicity encoded in our models, using some of the most recent benchmarks from the responsible AI community.

* Equal contribution. Correspondence: {htouvron thibautlav,gizacard,egrave,glample}@meta.com

https://github.com/facebookresearch/llama

arXiv:2302.1397lvl [cs.CL] 27 Feb 2023

The extracted content looks mostly correct.

Next, let's parse a PDF file:

uv run python demo_page_hf.py --model_path ./hf_model --input_path ./demo/page_imgs/page_6.pdf --save_dir ./results

The log output looks like this:

Number of files to process: 1

Processing: ./demo/page_imgs/page_6.pdf
Successfully converted 9 pages from the PDF
Processing page 1/9
Legacy behavior is being used. The current behavior will be deprecated in version 5.0.0. In the new behavior, if both images and text are provided, the default value of `add_special_tokens` will be changed to `False` when calling the tokenizer if `add_special_tokens` is unset. To test the new behavior, set `legacy=False`as a processor call argument.
Processing page 2/9
Processing page 3/9
Processing page 4/9
Processing page 5/9
Processing page 6/9
Processing page 7/9
Processing page 8/9
Processing page 9/9
Processing complete. Results saved to: ./results
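
To see what was written, list the results directory. The exact file layout (markdown, JSON, and so on) depends on the demo script, so treat this as a generic check rather than documented output:

Get-ChildItem -Recurse .\results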