Road-Safety-Vision-Language-Models

We fine-tune two families of VLMs, LLaVA and Qwen, each with their respective variants.

  1. Dataset
  2. Installation
  3. Configuration
  4. Fine-tuning
  5. Fine-Tuned Qwen3-VL Models
  6. Inference

Dataset

You can create your own dataset by using the Mapillary API to fetch public street-level images and then using the GPT API to generate conversations for those images. Here are the steps:

  1. Install the dependencies:

    pip install -r requirements.datacreation.txt
  2. Setup

    (i) An access token for Mapillary (created via the Mapillary developer dashboard) and, similarly, an API key from OpenAI.

    (ii) A GeoPackage file containing a polygon for the area you want to fetch data from. You can create one easily in QGIS by drawing a layer and exporting it as a .gpkg file.

  3. Download Images from Mapillary API

    This script allows you to download Mapillary images for a specified area defined in a geopackage.

    python download_images_mapillary.py \
        --access_token YOUR_MAPILLARY_TOKEN \
        --gpkg_path area_bbox.gpkg
  4. Use GPT to create training dataset

    This script generates conversation data for a given set of images using the OpenAI API.

    python create_conversation_GPT.py \
        --data data/images \
        --OPENAI_KEY YOUR_OPENAI_KEY
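For orientation, the area-to-query step can be sketched in Python. This is a minimal, hypothetical helper (not part of the repo's scripts) that turns the polygon's vertices into the `bbox` query parameter used by the Mapillary Graph API image search:

```python
from urllib.parse import urlencode

def bbox_param(coords):
    """Build the Mapillary 'bbox' value (min_lon,min_lat,max_lon,max_lat)
    from a list of (lon, lat) polygon vertices."""
    lons = [lon for lon, _ in coords]
    lats = [lat for _, lat in coords]
    return f"{min(lons)},{min(lats)},{max(lons)},{max(lats)}"

def image_search_url(token, coords, fields="id,thumb_2048_url"):
    """Assemble a Mapillary Graph API image-search URL for the area."""
    query = urlencode({
        "access_token": token,
        "bbox": bbox_param(coords),
        "fields": fields,
    })
    return f"https://graph.mapillary.com/images?{query}"
```

In practice `download_images_mapillary.py` reads the polygon from the .gpkg file; the helper above just shows how the bounding box reaches the API.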

Ensure your dataset follows the structure below:

data/
├── images/
├── train.json
└── val.json
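The exact JSON schema of `train.json`/`val.json` depends on `create_conversation_GPT.py`, but a typical LLaVA-style conversation entry looks like the following (field values here are illustrative):

```json
[
  {
    "id": "0001",
    "image": "mapillary_0001.jpg",
    "conversations": [
      { "from": "human", "value": "<image>\nDescribe any road-safety issues visible in this scene." },
      { "from": "gpt", "value": "The pavement shows a large pothole near the centre line, and the right-hand guardrail is damaged." }
    ]
  }
]
```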

Installation

  1. To fine-tune LLaVA, start by installing the required packages, including llava and flash-attention, using the installation script defined in the Makefile:

    Note: This version of LLaVA requires CUDA 11.8 and Python 3.10 to run correctly.

    make install-llava
  2. Similarly, for Qwen, run:

    make install-qwen

Configuration

  1. The LLaVA-specific training configuration is located in:

    config/llava.yaml
    

    💡 Note: The provided configuration is designed for LoRA fine-tuning with 4-bit quantization. If you prefer to train without quantization, simply remove the bits argument from the config file.

  2. The Qwen-specific configuration can be found here:

    config/qwen.yaml
    

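For orientation, a LoRA + 4-bit quantization config typically contains fields along these lines. The key names below are illustrative, not the repo's actual schema; check config/llava.yaml and config/qwen.yaml for the real options:

```yaml
model_name: Qwen/Qwen3-VL-4B-Instruct   # base checkpoint (illustrative)
data_path: data/train.json
eval_path: data/val.json
output_dir: checkpoints/

# LoRA adapter settings
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

# Quantization: remove `bits` to train without 4-bit quantization
bits: 4

# Optimization
learning_rate: 2e-4
num_train_epochs: 3
per_device_train_batch_size: 4
```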
Fine-tuning

Once everything is set up, start the fine-tuning process with:

python finetune.py --model <model_name>
| Argument  | Description                       | Options        |
| --------- | --------------------------------- | -------------- |
| `--model` | Specify which model to fine-tune  | `llava`, `qwen` |
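The argument handling in `finetune.py` can be sketched as follows. This is a guess at the structure, not the script's actual code:

```python
import argparse

def build_parser():
    """CLI for selecting which model family to fine-tune."""
    parser = argparse.ArgumentParser(
        description="Fine-tune a VLM for road-safety inspection")
    parser.add_argument("--model", choices=["llava", "qwen"], required=True,
                        help="Which model family to fine-tune")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    # Each model family loads its own YAML config from config/
    config_path = f"config/{args.model}.yaml"
    print(f"Loading {config_path} ...")
```

Passing anything other than `llava` or `qwen` makes argparse exit with a usage error, which matches the options table above.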

Fine-Tuned Qwen3-VL Models

We provide two 4-bit quantized fine-tuned variants of Qwen3 for road-safety applications.

| Model                      | Size | Description                                                  | Link         |
| -------------------------- | ---- | ------------------------------------------------------------ | ------------ |
| Qwen3 8B Instruct (4-bit)  | 8B   | High-capacity model for complex reasoning and better results | Hugging Face |
| Qwen3 4B Instruct (4-bit)  | 4B   | Lightweight model for faster inference                       | Hugging Face |

Inference

Inference with the fine-tuned Qwen models can be run using infer.py:

python infer.py \
  --model_name checkpoints/Road-Safety-Qwen3-VL-8B-Instruct-bnb-4bit \
  --image infer_image.jpeg \
  --max_new_tokens 512
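Under the hood, Qwen-VL chat models expect the image and prompt packed into a messages structure before the processor's chat template is applied. A minimal sketch of that packing, assuming the standard Qwen-VL message convention (infer.py may differ):

```python
def build_messages(image_path, question):
    """Pack an image and a text prompt into the Qwen-VL chat-message layout."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": question},
            ],
        }
    ]

# A list in this shape is what the processor's chat template consumes
messages = build_messages("infer_image.jpeg",
                          "List any road-safety hazards in this image.")
```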

About

Fine-tuning of Vision Language Models (VLMs) for road-safety inspection. Available models: LLaVA and Qwen3.
