We fine-tune two families of VLMs, LLaVA and Qwen, along with their respective variants.
You can create a dataset of your own by using the Mapillary API to fetch public street-level images and then using the GPT API to generate conversations for those images. Here are the steps:
- **Install dependencies:**

  ```bash
  pip install -r requirements.datacreation.txt
  ```
- **Setup:**
  1. An access token for Mapillary and an API key for OpenAI.
  2. A geopackage file containing a polygon for the area you want to collect data from. You can create one easily in QGIS by drawing a layer and exporting it as `.gpkg`.
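The polygon in the geopackage is ultimately reduced to a bounding box for the Mapillary query. A rough sketch of that reduction is below; the helper name is ours, and in practice the script would more likely read the `.gpkg` with geopandas and take `gdf.total_bounds`:

```python
# Illustrative only: collapse a polygon's (lon, lat) vertices into the
# (west, south, east, north) bounding box used for the Mapillary query.
# A real pipeline would read the .gpkg with geopandas instead.

def bbox_from_coords(coords):
    """coords: iterable of (lon, lat) polygon vertices."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (min(lons), min(lats), max(lons), max(lats))
```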
- **Download images from the Mapillary API:**
  This script downloads Mapillary images for the area defined in the geopackage.

  ```bash
  python download_images_mapillary.py \
      --access_token YOUR_MAPILLARY_TOKEN \
      --gpkg_path area_bbox.gpkg
  ```
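Under the hood, this step amounts to querying Mapillary's Graph API (v4) for image IDs and thumbnail URLs inside a bounding box, then fetching each thumbnail. A minimal sketch follows, assuming the public v4 `images` endpoint; the function names and output paths are illustrative, not the script's actual internals:

```python
# Sketch of a Mapillary v4 image download, not the repo's actual script.
import json
import urllib.parse
import urllib.request

API_URL = "https://graph.mapillary.com/images"

def build_image_query(access_token, bbox):
    """bbox = (west, south, east, north) in lon/lat order."""
    params = {
        "access_token": access_token,
        "fields": "id,thumb_2048_url",
        "bbox": ",".join(str(c) for c in bbox),
        "limit": "100",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def download_images(access_token, bbox, out_dir="data/images"):
    # One page of results; a full script would follow pagination links.
    with urllib.request.urlopen(build_image_query(access_token, bbox), timeout=30) as resp:
        items = json.load(resp).get("data", [])
    for item in items:
        with urllib.request.urlopen(item["thumb_2048_url"], timeout=30) as img:
            with open(f"{out_dir}/{item['id']}.jpg", "wb") as f:
                f.write(img.read())
```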
- **Use GPT to create the training dataset:**
  This script generates conversation data for a given set of images.

  ```bash
  python create_conversation_GPT.py \
      --data data/images \
      --OPENAI_KEY YOUR_OPENAI_KEY
  ```
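Conceptually, the script sends each image to a GPT vision model and asks for conversation turns about it. The sketch below builds the request payload only; the message layout follows the OpenAI chat-completions vision format, while the prompt text and helper name are our assumptions:

```python
# Sketch: build an OpenAI vision request for one image. Illustrative only.
import base64

def build_messages(image_bytes, prompt):
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# The actual call (requires the openai package and an API key) would look like:
# from openai import OpenAI
# client = OpenAI(api_key=OPENAI_KEY)
# reply = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_messages(image_bytes, "Describe road-safety issues in this image."))
```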
Ensure your dataset follows the structure below:

```
data/
├── images/
├── train.json
└── val.json
```
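For orientation, one record in `train.json` might look like the following, based on the common LLaVA conversation schema (`id` / `image` / `conversations` with alternating `human` and `gpt` turns). The exact fields expected by the training scripts may differ, so treat this as an assumption:

```python
# A hypothetical train.json record in LLaVA-style conversation format.
import json

record = {
    "id": "0001",
    "image": "images/0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat road-safety hazards are visible?"},
        {"from": "gpt", "value": "The crosswalk markings are faded and there is no pedestrian signal."},
    ],
}

# train.json holds a list of such records.
print(json.dumps([record], indent=2))
```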
- To fine-tune LLaVA, start by installing the required packages, including `llava` and `flash-attention`, using the installation script defined in the Makefile:

  > Note: This version of LLaVA requires CUDA 11.8 and Python 3.10 to run correctly.

  ```bash
  make install-llava
  ```
- Similarly, for Qwen run:

  ```bash
  make install-qwen
  ```
- LLaVA model-specific training configurations are located in `config/llava.yaml`.

  > 💡 Note: The provided configuration is designed for LoRA fine-tuning with 4-bit quantization. If you prefer to train without quantization, simply remove the `bits` argument from the config file.
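As a purely hypothetical illustration of the note above, the quantization-related portion of such a config might look like the fragment below. The key names here are assumptions for illustration; consult `config/llava.yaml` itself for the actual fields:

```yaml
# Illustrative fragment only; the real keys in config/llava.yaml may differ.
lora_enable: true
lora_r: 16
lora_alpha: 32
bits: 4          # remove this line to fine-tune without quantization
```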
- Qwen-specific configuration can be found in `config/qwen.yaml`.
Once everything is set up, start the fine-tuning process with:

```bash
python finetune.py --model <model_name>
```

| Argument | Description | Options |
|---|---|---|
| `--model` | Specify which model to fine-tune | `llava`, `qwen` |
We provide two 4-bit quantized, fine-tuned Qwen3 variants for road-safety applications.
| Model | Size | Description | Link |
|---|---|---|---|
| Qwen3 8B Instruct (4-bit) | 8B | High-capacity model for complex reasoning and better results | Hugging Face |
| Qwen3 4B Instruct (4-bit) | 4B | Lightweight model for faster inference | Hugging Face |
Inference with the fine-tuned Qwen models can be run using `infer.py`:

```bash
python infer.py \
    --model_name checkpoints/Road-Safety-Qwen3-VL-8B-Instruct-bnb-4bit \
    --image infer_image.jpeg \
    --max_new_tokens 512
```
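For reference, a minimal sketch of what such an inference script does with a Hugging Face checkpoint is shown below. The class and argument names follow the standard transformers image-text-to-text API and are assumptions on our part; the repo's `infer.py` may load the model differently (for example via Unsloth):

```python
# Sketch of VLM inference; the heavy model-loading part is shown as comments
# since it requires the checkpoint and a GPU.

def build_chat(image_path, question):
    """Chat-template input in the format expected by VLM processors."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]

# The actual inference (assumed API, requires transformers and the checkpoint):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# name = "checkpoints/Road-Safety-Qwen3-VL-8B-Instruct-bnb-4bit"
# processor = AutoProcessor.from_pretrained(name)
# model = AutoModelForImageTextToText.from_pretrained(name, device_map="auto")
# inputs = processor.apply_chat_template(
#     build_chat("infer_image.jpeg", "Describe the road-safety issues."),
#     add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt").to(model.device)
# out = model.generate(**inputs, max_new_tokens=512)
# print(processor.decode(out[0], skip_special_tokens=True))
```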