We fine-tune two families of VLMs, LLaVA and Qwen, along with their respective variants.
You can create a dataset of your own by using the Mapillary API to fetch public street-level images and then using the GPT API to generate conversations for those images. Here are the steps:
- **Install dependencies:**

  ```bash
  pip install -r requirements.datacreation.txt
  ```
- **Setup:**
  1. An access token for Mapillary and an API key for OpenAI.
  2. A geopackage file containing a polygon for the area you want to collect data from. You can create one easily in QGIS by drawing a layer and exporting it as `.gpkg`.
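The polygon in the geopackage is ultimately reduced to a bounding box for the Mapillary query. A rough sketch of that reduction is below; the helper name is ours, and in practice the script would more likely read the `.gpkg` with geopandas and take `gdf.total_bounds`:

```python
# Illustrative only: collapse a polygon's (lon, lat) vertices into the
# (west, south, east, north) bounding box used for the Mapillary query.
# A real pipeline would read the .gpkg with geopandas instead.

def bbox_from_coords(coords):
    """coords: iterable of (lon, lat) polygon vertices."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return (min(lons), min(lats), max(lons), max(lats))
```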
- **Download images from the Mapillary API:**
  This script downloads Mapillary images for the area defined in the geopackage.

  ```bash
  python download_images_mapillary.py \
      --access_token YOUR_MAPILLARY_TOKEN \
      --gpkg_path area_bbox.gpkg
  ```
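Under the hood, this step amounts to querying Mapillary's Graph API (v4) for image IDs and thumbnail URLs inside a bounding box, then fetching each thumbnail. A minimal sketch follows, assuming the public v4 `images` endpoint; the function names and output paths are illustrative, not the script's actual internals:

```python
# Sketch of a Mapillary v4 image download, not the repo's actual script.
import json
import urllib.parse
import urllib.request

API_URL = "https://graph.mapillary.com/images"

def build_image_query(access_token, bbox):
    """bbox = (west, south, east, north) in lon/lat order."""
    params = {
        "access_token": access_token,
        "fields": "id,thumb_2048_url",
        "bbox": ",".join(str(c) for c in bbox),
        "limit": "100",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)

def download_images(access_token, bbox, out_dir="data/images"):
    # One page of results; a full script would follow pagination links.
    with urllib.request.urlopen(build_image_query(access_token, bbox), timeout=30) as resp:
        items = json.load(resp).get("data", [])
    for item in items:
        with urllib.request.urlopen(item["thumb_2048_url"], timeout=30) as img:
            with open(f"{out_dir}/{item['id']}.jpg", "wb") as f:
                f.write(img.read())
```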
- **Use GPT to create the training dataset:**
  This script generates conversation data for a given set of images.

  ```bash
  python create_conversation_GPT.py \
      --data data/images \
      --OPENAI_KEY YOUR_OPENAI_KEY
  ```
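Conceptually, the script sends each image to a GPT vision model and asks for conversation turns about it. The sketch below builds the request payload only; the message layout follows the OpenAI chat-completions vision format, while the prompt text and helper name are our assumptions:

```python
# Sketch: build an OpenAI vision request for one image. Illustrative only.
import base64

def build_messages(image_bytes, prompt):
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }]

# The actual call (requires the openai package and an API key) would look like:
# from openai import OpenAI
# client = OpenAI(api_key=OPENAI_KEY)
# reply = client.chat.completions.create(
#     model="gpt-4o",
#     messages=build_messages(image_bytes, "Describe road-safety issues in this image."))
```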
Ensure your dataset follows the structure below:

```
data/
├── images/
├── train.json
└── val.json
```
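For orientation, one record in `train.json` might look like the following, based on the common LLaVA conversation schema (`id` / `image` / `conversations` with alternating `human` and `gpt` turns). The exact fields expected by the training scripts may differ, so treat this as an assumption:

```python
# A hypothetical train.json record in LLaVA-style conversation format.
import json

record = {
    "id": "0001",
    "image": "images/0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat road-safety hazards are visible?"},
        {"from": "gpt", "value": "The crosswalk markings are faded and there is no pedestrian signal."},
    ],
}

# train.json holds a list of such records.
print(json.dumps([record], indent=2))
```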
- To fine-tune LLaVA, start by installing the required packages, including `llava` and `flash-attention`, using the installation script defined in the Makefile:

  > Note: This version of LLaVA requires CUDA 11.8 and Python 3.10 to run correctly.

  ```bash
  make install-llava
  ```
- Similarly, for Qwen run:

  ```bash
  make install-qwen
  ```
- LLaVA model-specific training configurations are located in `config/llava.yaml`.

  > 💡 Note: The provided configuration is designed for LoRA fine-tuning with 4-bit quantization. If you prefer to train without quantization, simply remove the `bits` argument from the config file.
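As a purely hypothetical illustration of the note above, the quantization-related portion of such a config might look like the fragment below. The key names here are assumptions for illustration; consult `config/llava.yaml` itself for the actual fields:

```yaml
# Illustrative fragment only; the real keys in config/llava.yaml may differ.
lora_enable: true
lora_r: 16
lora_alpha: 32
bits: 4          # remove this line to fine-tune without quantization
```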
- Qwen-specific configuration can be found in `config/qwen.yaml`.
Once everything is set up, start the fine-tuning process with:

```bash
python finetune.py --model <model_name>
```

| Argument | Description | Options |
|---|---|---|
| `--model` | Specify which model to fine-tune | `llava`, `qwen` |
We provide two 4-bit quantized, fine-tuned Qwen3 variants for road-safety applications.
| Model | Size | Description | Link |
|---|---|---|---|
| Qwen3 8B Instruct (4-bit) | 8B | High-capacity model for complex reasoning and better results | Hugging Face |
| Qwen3 4B Instruct (4-bit) | 4B | Lightweight model for faster inference | Hugging Face |
Inference with the fine-tuned Qwen models can be run using `infer.py`:

```bash
python infer.py \
    --model_name checkpoints/Road-Safety-Qwen3-VL-8B-Instruct-bnb-4bit \
    --image infer_image.jpeg \
    --max_new_tokens 512
```
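For reference, a minimal sketch of what such an inference script does with a Hugging Face checkpoint is shown below. The class and argument names follow the standard transformers image-text-to-text API and are assumptions on our part; the repo's `infer.py` may load the model differently (for example via Unsloth):

```python
# Sketch of VLM inference; the heavy model-loading part is shown as comments
# since it requires the checkpoint and a GPU.

def build_chat(image_path, question):
    """Chat-template input in the format expected by VLM processors."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": question},
        ],
    }]

# The actual inference (assumed API, requires transformers and the checkpoint):
# from transformers import AutoProcessor, AutoModelForImageTextToText
# name = "checkpoints/Road-Safety-Qwen3-VL-8B-Instruct-bnb-4bit"
# processor = AutoProcessor.from_pretrained(name)
# model = AutoModelForImageTextToText.from_pretrained(name, device_map="auto")
# inputs = processor.apply_chat_template(
#     build_chat("infer_image.jpeg", "Describe the road-safety issues."),
#     add_generation_prompt=True, tokenize=True,
#     return_dict=True, return_tensors="pt").to(model.device)
# out = model.generate(**inputs, max_new_tokens=512)
# print(processor.decode(out[0], skip_special_tokens=True))
```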