DevAIs - Prototyping a VLM-Powered App with Gemma4 (with SFT)

Varun

May 1, 2026

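Background and Purpose

I recently tried image recognition with Gemma4, an open-source VLM.
(Earlier article: "OSSのVLMモデル：Gemma4を試した", i.e. "Trying Gemma4, an OSS VLM")

I wanted to find out whether a VLM is a practical way to add an imagery evaluation feature to a business-specific app, so I ran this verification.
One of our company's projects is a disaster-preparedness app, so I treat this as a candidate add-on feature for it.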

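SFT Execution Plan for the VLM

I run additional training (SFT, supervised fine-tuning) on Gemma4.

SFT goal: given photos taken with a smartphone around the neighborhood, get the model to output the correct coordinates of a spot considered relatively safe during an earthquake.

The environment is as follows.
1. Base VLM
  Gemma4 31B, which gave good results last time. (31B: a model with 30.7B total parameters)
2. Execution environment
  ・Google Colab Pro+
  ・A100 runtime

The procedure is as follows.
1. Run inference without SFT (to identify the problems)
2. Prepare the SFT training data
  ・Prepare "conversation-style supervised data paired with images."
  ・The source images are the following two.

3. Run SFT
  SFT with PEFT/LoRA, using the Transformers + TRL libraries.
4. Run inference with the SFT-trained model

I work through these steps in order.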

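Inference Without SFT (Identifying the Problems)

Using the two images shown earlier, I ask the model for a safe spot in the event of an earthquake.

I used the following prompt.

“この画像において、地震発生時に比較的安全な場所を１箇所教えて下さい。またその位置も教えて下さい。以下の書式で返却下さい。\n安全な場所：(文章)\n位置：[x, y]”
(In English: "In this image, name one spot that is relatively safe during an earthquake, and give its position. Reply in the following format.\nSafe place: (text)\nPosition: [x, y]")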
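
Because the prompt fixes the reply format, the reply can be turned into structured data before plotting or scoring. The following is a minimal sketch, not code from the original notebook: the names `SafeSpot` and `parse_safe_spot` are hypothetical, and it assumes the model may emit either a full-width or an ASCII colon after each label.

```python
import re
from typing import NamedTuple, Optional


class SafeSpot(NamedTuple):
    description: str
    x: int
    y: int


# Description runs until the position label, a line break, or the end of the reply.
_DESC_RE = re.compile(r"安全な場所\s*[：:]\s*(.+?)\s*(?=位置\s*[：:]|\n|$)")
# Position is two integers inside square brackets.
_POS_RE = re.compile(r"位置\s*[：:]\s*\[\s*(-?\d+)\s*,\s*(-?\d+)\s*\]")


def parse_safe_spot(answer: str) -> Optional[SafeSpot]:
    """Return the parsed reply, or None if it does not follow the requested format."""
    desc = _DESC_RE.search(answer)
    pos = _POS_RE.search(answer)
    if desc is None or pos is None:
        return None
    return SafeSpot(desc.group(1).strip(), int(pos.group(1)), int(pos.group(2)))


print(parse_safe_spot("安全な場所：開けた歩道\n位置：[350, 2000]"))
```

Returning None for a malformed reply keeps the caller in charge of what to do when the model ignores the format.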

Create a new ipynb file on Colab and proceed as follows.
・Write the following installation code in a cell and run it with Shift+Enter.

Google Colab

!pip install -U pip

# Pin to the CUDA 12.8 builds
!pip install torch==2.10.0 torchvision==0.25.0 torchaudio==2.10.0 \
  --index-url https://download.pytorch.org/whl/cu128

# Gemma4-compatible version
!pip install transformers==5.5.4

# Supporting packages
!pip install -U datasets accelerate evaluate trl pillow sentencepiece tensorboard
!pip install protobuf==5.29.5
!pip install -U peft
!pip install -U torchao

Once installation finishes, move on to loading the model.

・Run the following code to load the model.

Google Colab

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "google/gemma-4-31B-it"

processor = AutoProcessor.from_pretrained(model_id)

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype=torch.bfloat16,   
    device_map="auto",
)

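With the model loaded, upload the images to Colab and run inference.

The following code runs inference. The prompt shown earlier is specified here as well, along with a system prompt that casts the model as a disaster-preparedness expert.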

Google Colab

from PIL import Image
import torch

model.eval()

image_paths = [
    "/content/PXL_20260412_003329131.jpg",
    "/content/PXL_20260412_003046216.jpg",
]

for image_path in image_paths:
    image = Image.open(image_path).convert("RGB")

    messages = [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "あなたは防災の専門家です。画像を確認し、地震発生時に比較的安全と考えられる場所を判断してください。"}
            ],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "この画像において、地震発生時に比較的安全な場所を１箇所教えて下さい。またその位置も教えて下さい。以下の書式で返却下さい。\n安全な場所：(文章)\n位置：[x, y]"}
            ],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
        )

    answer = processor.decode(
        outputs[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )

    print("Image:", image_path)
    print(answer)
    print("-" * 50)

I ran it with Shift+Enter and got the following results.

The output alone is hard to interpret, so let's plot the positions on the images and take a look.

Google Colab

from PIL import Image, ImageDraw
from IPython.display import display

data = [
    ("/content/PXL_20260412_003329131.jpg", (650, 100)),
    ("/content/PXL_20260412_003046216.jpg", (750, 500)),
]

GREEN = (0, 255, 0)
radius = 70

for path, (x, y) in data:
    img = Image.open(path).convert("RGB")
    draw = ImageDraw.Draw(img)

    draw.ellipse(
        [x - radius, y - radius, x + radius, y + radius],
        outline=GREEN,
        width=20
    )

    print(path)
    display(img)

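The results are below. The positions are wrong: they point at the sky. ★ Position: the green circle in each figure below.

Image A / Image B

Looking closely, the spot described for Image B may itself be questionable. (Is it really "an open sidewalk away from trees"?)

For now, I treat correcting the position as the only issue and move on to SFT.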

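Preparing the SFT Training Data

To correct the reported positions, I prepare a dataset in which the A (answer) of each QA pair is changed.

As shown below, each training A is the sentence obtained from base-model inference with only the position part corrected.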

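Base A	Training A
安全な場所：周囲に落下物となる看板や電柱、崩落の危険がある壁などが少なく、開けた空間である歩道部分。位置：[650, 100]	安全な場所：周囲に落下物となる看板や電柱、崩落の危険がある壁などが少なく、開けた空間である歩道部分。位置：[350, 2000]
(The description, roughly "a sidewalk area in an open space, with few nearby signs or utility poles that could fall and few walls at risk of collapse", is unchanged. Only the position changes, from [650, 100] to [350, 2000].)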
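
The correction in the table can also be applied programmatically, which helps once there are more than a handful of images. A small sketch under that assumption: the helper name `with_corrected_position` is hypothetical, and the corrected coordinates are still chosen by hand.

```python
import re

# Matches the position field, with either a full-width or an ASCII colon.
_POS_RE = re.compile(r"位置\s*[：:]\s*\[[^\]]*\]")


def with_corrected_position(base_answer: str, x: int, y: int) -> str:
    """Keep the description from the base answer and replace only the position."""
    corrected, count = _POS_RE.subn(f"位置：[{x}, {y}]", base_answer, count=1)
    if count == 0:
        raise ValueError("no position field found in the base answer")
    return corrected


base = "安全な場所：開けた空間である歩道部分。\n位置：[650, 100]"
print(with_corrected_position(base, 350, 2000))
```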

The data I actually prepared is as follows.

JSON

{"image":"/content/PXL_20260412_003329131.jpg","messages":[{"role":"system","content":[{"type":"text","text":"あなたは防災の専門家です。画像を確認し、地震発生時に比較的安全と考えられる場所を判断してください。"}]},{"role":"user","content":[{"type":"text","text":"この画像において、地震発生時に比較的安全な場所を１箇所教えて下さい。またその位置も教えて下さい。以下の書式で返却下さい。\n安全な場所：(文章)\n位置：[x, y]"},{"type":"image"}]},{"role":"assistant","content":[{"type":"text","text":"安全な場所：周囲に落下物となる看板や電柱、崩落の危険がある壁などが少なく、開けた空間である歩道部分。\n位置：[350, 2000]"}]}]}
{"image":"/content/PXL_20260412_003046216.jpg","messages":[{"role":"system","content":[{"type":"text","text":"あなたは防災の専門家です。画像を確認し、地震発生時に比較的安全と考えられる場所を判断してください。"}]},{"role":"user","content":[{"type":"text","text":"この画像において、地震発生時に比較的安全な場所を１箇所教えて下さい。またその位置も教えて下さい。以下の書式で返却下さい。\n安全な場所：(文章)\n位置：[x, y]"},{"type":"image"}]},{"role":"assistant","content":[{"type":"text","text":"安全な場所：建物や電柱、樹木から離れた、開けた歩道の中央付近\n位置：[1100, 2000]"}]}]}

Save this as "train.jsonl"; it is loaded when SFT runs.
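
Since the file is written by hand, a quick structural check before training can catch a missing image placeholder or a misordered role. The sketch below is an assumption about what is worth checking, based only on the record layout shown above; `validate_row` and `validate_jsonl` are hypothetical helper names.

```python
import json

REQUIRED_ROLES = ["system", "user", "assistant"]


def validate_row(row: dict) -> list:
    """Return a list of problems found in one train.jsonl record (empty if none)."""
    problems = []
    if not isinstance(row.get("image"), str):
        problems.append("missing 'image' path")
    messages = row.get("messages", [])
    roles = [m.get("role") for m in messages]
    if roles != REQUIRED_ROLES:
        problems.append(f"unexpected roles: {roles}")
    # Count {"type": "image"} placeholders across all messages.
    n_images = sum(
        1
        for m in messages
        for item in m.get("content", [])
        if isinstance(item, dict) and item.get("type") == "image"
    )
    if n_images != 1:
        problems.append(f"expected exactly 1 image placeholder, found {n_images}")
    return problems


def validate_jsonl(path: str) -> dict:
    """Map 1-based line numbers to their problems, for lines that have any."""
    report = {}
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue
            problems = validate_row(json.loads(line))
            if problems:
                report[lineno] = problems
    return report
```

Calling `validate_jsonl("/content/train.jsonl")` would return an empty dict when every record has the expected shape.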

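Running SFT

The issue was the reported positions, so I run SFT to correct them.
・The installation and model-loading code is the same as in the earlier section on inference without SFT. (→ see the code there)
Once the model is loaded, move on.
・Upload the data prepared for SFT (the JSONL file train.jsonl and the two images) to Colab.

・Run the following code to load the dataset.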

Google Colab

import json
from PIL import Image

with open("/content/train.jsonl", "r", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]

def convert_jsonl_row(row):
    image = Image.open(row["image"]).convert("RGB")

    # replace {"type": "image"} with actual PIL image
    messages = row["messages"]
    for msg in messages:
        if isinstance(msg.get("content"), list):
            for item in msg["content"]:
                if item.get("type") == "image":
                    item["image"] = image

    return {"messages": messages}

dataset_train = [convert_jsonl_row(row) for row in rows]

print(len(dataset_train))
print(dataset_train[0]["messages"])

Next, define functions that shape the data prepared as JSONL into a form usable for training.
・Run the following code cell.

Google Colab

def process_vision_info(messages):
    image_inputs = []

    for msg in messages:
        content = msg.get("content", [])
        if not isinstance(content, list):
            continue

        for item in content:
            if isinstance(item, dict) and item.get("type") == "image":
                image_inputs.append(item["image"].convert("RGB"))

    return image_inputs


def collate_fn(examples):
    texts = []
    images = []

    for example in examples:
        messages = example["messages"]

        text = processor.apply_chat_template(
            messages,
            add_generation_prompt=False,
            tokenize=False,
        )

        image_inputs = process_vision_info(messages)

        texts.append(text.strip())
        images.append(image_inputs)

    batch = processor(
        text=texts,
        images=images,
        return_tensors="pt",
        padding=True,
    )

    labels = batch["input_ids"].clone()

    if processor.tokenizer.pad_token_id is not None:
        labels[labels == processor.tokenizer.pad_token_id] = -100

    if hasattr(processor.tokenizer, "image_token_id"):
        labels[labels == processor.tokenizer.image_token_id] = -100

    batch["labels"] = labels
    return batch
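
The last part of collate_fn sets some label positions to -100, which is the default ignore_index of PyTorch's cross-entropy loss, so padding and image placeholder tokens do not contribute to the training loss. The same idea in plain Python, as an illustration only: the token ids and the helper name `mask_labels` are made up and are not Gemma4's real ids.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_labels(input_ids, pad_token_id, image_token_id):
    """Copy input_ids into labels, hiding padding and image placeholder tokens."""
    masked = {pad_token_id, image_token_id}
    return [IGNORE_INDEX if tok in masked else tok for tok in input_ids]


# Hypothetical ids: 0 = padding, 9 = image placeholder
print(mask_labels([9, 9, 15, 27, 31, 0, 0], pad_token_id=0, image_token_id=9))
# → [-100, -100, 15, 27, 31, -100, -100]
```

Note that all remaining tokens, including the system and user turns, stay in the labels, so the loss here covers the prompt text as well as the assistant answer.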

・Define the LoRA parameter settings and run the code cell.

Google Colab

from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
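
To get a feel for what r=8 and lora_alpha=16 mean: LoRA adds two low-rank matrices per adapted linear layer, for r * (d_in + d_out) extra parameters, and scales their product by lora_alpha / r. The sketch below works through that arithmetic; the 4096 x 4096 layer size is a hypothetical example, not a dimension taken from Gemma4.

```python
def lora_added_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by LoRA to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update (standard LoRA scaling)."""
    return lora_alpha / r


r, lora_alpha = 8, 16        # values from the LoraConfig above
d_in, d_out = 4096, 4096     # hypothetical layer size for illustration

added = lora_added_params(d_in, d_out, r)
full = d_in * d_out
print(added, full, f"{added / full:.2%}", lora_scaling(lora_alpha, r))
```

For this example layer the adapter is well under 1% of the frozen weight matrix, which is why LoRA is practical for a model of this size on a single GPU.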

・Define the SFT training settings, the output directory, and so on, then run the code cell.

Google Colab

from trl import SFTConfig

args = SFTConfig(
    output_dir="/content/gemma4-31b-earthquake-sft",
    num_train_epochs=15,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="adamw_torch",
    logging_steps=1,
    save_strategy="epoch",
    eval_strategy="no",
    learning_rate=2e-4,
    max_grad_norm=0.3,
    lr_scheduler_type="constant",
    push_to_hub=False,
    report_to="none",
    dataset_text_field="",
    dataset_kwargs={"skip_prepare_dataset": True},
    remove_unused_columns=False,
)
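
With only two training examples, these settings make for a very short run. A rough calculation of the number of optimizer steps, assuming a single GPU and ignoring how the Trainer handles a partial final accumulation window (the function name is hypothetical):

```python
import math


def total_optimizer_steps(num_examples, per_device_batch_size,
                          grad_accum_steps, num_epochs, num_devices=1):
    """Approximate number of optimizer updates over the whole run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    steps_per_epoch = max(1, math.ceil(batches_per_epoch / grad_accum_steps))
    return steps_per_epoch * num_epochs


# 2 examples, batch size 1, no accumulation, 15 epochs (values from the config above)
print(total_optimizer_steps(2, 1, 1, 15))  # → 30
```

Because save_strategy="epoch", a checkpoint is also written at the end of each of the 15 epochs, which is worth keeping in mind for Colab disk usage.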

・Define the SFTTrainer and run the code cell.

Google Colab

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset_train,
    peft_config=peft_config,
    processing_class=processor,
    data_collator=collate_fn,
)

SFT is now ready, so run the trainer.
・Run the following code cell.

Google Colab

trainer.train()

SFT is complete! The results are shown above.
I also confirmed that the trained model was saved to a directory on Colab.
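
The Trainer writes each checkpoint to a subdirectory named checkpoint-<step> under output_dir. One way to confirm what was saved, sketched here with a hypothetical helper `latest_checkpoint`, is to list those subdirectories and pick the one with the highest step number.

```python
import re
from pathlib import Path
from typing import Optional

_CKPT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    found = []
    for child in root.iterdir():
        match = _CKPT_RE.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    if not found:
        return None
    return max(found, key=lambda pair: pair[0])[1]


# output_dir from the SFTConfig above; prints None if nothing has been saved yet
print(latest_checkpoint("/content/gemma4-31b-earthquake-sft"))
```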

Now check what the SFT-trained model learned on the images.
・Set the image paths in the following code cell and run it.

Google Colab

from PIL import Image
import torch

model.eval()

system_prompt = "あなたは防災の専門家です。画像を確認し、地震発生時に比較的安全と考えられる場所を判断してください。"

user_prompt = "この画像において、地震発生時に比較的安全な場所を１箇所教えて下さい。またその位置も教えて下さい。以下の書式で返却下さい。\n安全な場所：(文章)\n位置：[x, y]"

def ask_safe_point(image_path):
    image = Image.open(image_path).convert("RGB")

    messages = [
        {
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        },
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": user_prompt},
            ],
        }
    ]

    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            do_sample=False,
        )

    answer = processor.decode(
        outputs[0][inputs["input_ids"].shape[-1]:],
        skip_special_tokens=True,
    )

    return answer.strip()

test_paths = [
    "/content/PXL_20260412_003329131.jpg",
    "/content/PXL_20260412_003046216.jpg",
]

for path in test_paths:
    print("Image:", path)
    print(ask_safe_point(path))
    print("-" * 50)
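
To judge whether SFT moved the reported position toward the corrected one, the coordinates can be compared numerically rather than only by eye. A minimal sketch, assuming pixel coordinates and a hypothetical tolerance of 100 pixels (not a value from this experiment). The pairs below are the base-model positions and the corrected positions from train.jsonl; positions parsed from the post-SFT replies could be substituted for the base ones.

```python
import math


def pixel_error(predicted, target):
    """Euclidean distance, in pixels, between two (x, y) points."""
    return math.hypot(predicted[0] - target[0], predicted[1] - target[1])


def is_close(predicted, target, tolerance_px=100.0):
    """True if the predicted point lies within tolerance_px of the target."""
    return pixel_error(predicted, target) <= tolerance_px


# (base-model position, corrected position used in train.jsonl)
pairs = {
    "image A": ((650, 100), (350, 2000)),
    "image B": ((750, 500), (1100, 2000)),
}
for name, (base, target) in pairs.items():
    print(name, round(pixel_error(base, target), 1), is_close(base, target))
```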
