Synth-SONAR

Sonar Image Synthesis with Enhanced Diversity and Realism via Dual Diffusion Models and GPT Prompting

MIT License Python 3.8+ PyTorch Stable Diffusion GPT Prompting

Why Synth-SONAR?

Powerful, flexible, and easy-to-use sonar image synthesis framework

Style Injection

Enhance diversity by blending stylistic elements from real sonar images into generated content using pre-trained diffusion models.

Dual Diffusion Models

Two-phase generation process produces coarse images first, then refines them for fine-grained details and realism.

GPT-Based Prompting

Generate high-quality text prompts using GPT models to guide the diffusion process for better text-image alignment.
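The core idea is to expand low-level scene attributes into richer captions before they reach the diffusion model. A minimal sketch of that prompt-assembly step, using a hypothetical `build_sonar_prompt` helper (not part of the repository — in the actual pipeline a GPT model performs this expansion):

```python
def build_sonar_prompt(target, seabed="sandy", sensor="side-scan sonar"):
    """Assemble a diffusion prompt from low-level scene attributes.

    Hypothetical illustration: Synth-SONAR uses GPT models to expand
    such attributes into detailed captions automatically.
    """
    return (
        f"A {sensor} image of a {target} resting on a {seabed} seabed, "
        "with acoustic shadows and grainy speckle texture"
    )

prompt = build_sonar_prompt("shipwreck", seabed="rocky")
print(prompt)
```

The resulting string is what gets passed as the `prompt` argument to the diffusion pipeline.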

VLM Enhancement

Visual Language Models provide improved captions with domain-specific language instructions for realistic textures.

Flexible Fine-Tuning

Support for both standard fine-tuning and LoRA-based techniques to adapt models with limited computational resources.

Complete Utilities

Built-in tools for metadata creation, caption generation, and style-based clustering of sonar images.
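The style-based clustering idea can be sketched as a tiny k-means over per-image feature vectors. This is illustrative only — the repository's utility may use different features and libraries:

```python
import numpy as np

def kmeans(features, k=2, iters=20, seed=0):
    """Minimal k-means for grouping per-image style feature vectors."""
    rng = np.random.default_rng(seed)
    # initialize centers from k randomly chosen images
    centers = features[rng.choice(len(features), k, replace=False)]
    for _ in range(iters):
        # assign each image to its nearest style center
        dists = np.linalg.norm(features[:, None] - centers[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each center from its current members
        for j in range(k):
            if (labels == j).any():
                centers[j] = features[labels == j].mean(axis=0)
    return labels

# toy intensity histograms standing in for real style features
feats = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
print(kmeans(feats, k=2))
```

In practice the feature vectors would come from an image encoder or simple intensity/texture statistics rather than the toy two-dimensional histograms shown here.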

Architecture

How Synth-SONAR combines multiple AI techniques for sonar image synthesis

Synth-SONAR Architecture

Three-Phase Pipeline

  • Phase 1: Data Acquisition

    Gather publicly available sonar images, S3 simulated images, and stylized sonar images. Create low-level descriptions manually or via GPT.

  • Phase 2: Training & Coarse Generation

    Use diffusion models with GPT-generated prompts to produce coarse sonar images. Fine-tune text transformers with LoRA for better alignment.

  • Phase 3: Fine-Grained Tuning

    Apply a second diffusion model to refine coarse images. Use VLM with domain-specific instructions for enhanced realism and details.

Documentation

Everything you need to get started with Synth-SONAR

Quick Start

Get started with Synth-SONAR in minutes

Installation

# Clone the repository
git clone https://github.com/Purushothaman-natarajan/Synth-SONAR.git
cd Synth-SONAR

# Create and activate the conda environment
conda env create -f environment.yaml
conda activate Synth-SONAR

# Download Stable Diffusion weights
# from https://huggingface.co/CompVis/stable-diffusion-v1-4-original
ln -s <path/to/model.ckpt> models/ldm/stable-diffusion-v1/model.ckpt

Step 1: Style Injection

# Run with default configuration
python run_styleid.py --cnt data/cnt --sty data/sty --gamma 0.75 --T 1.5

# High style fidelity
python run_styleid.py --cnt data/cnt --sty data/sty --gamma 0.3 --T 1.5

Step 2: Fine-Tuning

# Standard fine-tuning
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

accelerate launch --mixed_precision="fp16" ./text_to_image/train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --use_ema --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 --gradient_accumulation_steps=4 \
  --gradient_checkpointing --max_train_steps=5000 \
  --learning_rate=1e-05 --output_dir="sd-sonar-model"
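For the LoRA-based route mentioned under Flexible Fine-Tuning, diffusers ships a companion training script. A sketch assuming the repository mirrors the standard `train_text_to_image_lora.py` layout (script path, flags, and output directory are illustrative and may differ in your checkout):

```shell
# LoRA fine-tuning: far fewer trainable parameters, fits on smaller GPUs
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export TRAIN_DIR="path_to_your_dataset"

accelerate launch --mixed_precision="fp16" ./text_to_image/train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir=$TRAIN_DIR \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 --gradient_accumulation_steps=4 \
  --max_train_steps=5000 \
  --learning_rate=1e-04 \
  --output_dir="sd-sonar-lora"
```

Note that LoRA runs typically use a higher learning rate than full fine-tuning, since only the low-rank adapter weights are updated.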

Step 3: Inference

import torch
from diffusers import StableDiffusionPipeline

model_path = "path_to_saved_model"
pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16)
pipe.to("cuda")

image = pipe(prompt="A sonar image of an underwater scene").images[0]
image.save("sonar_image.png")
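The second diffusion stage (Phase 3) refines a coarse first-stage output. One way to sketch this with diffusers' img2img pipeline — the model path, prompt, and `strength` value here are illustrative assumptions, not the repository's exact settings:

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

# hypothetical path to the fine-tuned second-stage model
refiner = StableDiffusionImg2ImgPipeline.from_pretrained(
    "path_to_fine_grained_model", torch_dtype=torch.float16
)
refiner.to("cuda")

# coarse image saved by the first-stage pipeline above
coarse = Image.open("sonar_image.png")

# low strength keeps the coarse layout while adding fine-grained texture
refined = refiner(
    prompt="A sonar image of an underwater scene, fine-grained speckle texture",
    image=coarse,
    strength=0.3,
).images[0]
refined.save("sonar_image_refined.png")
```

Lower `strength` values preserve more of the coarse image's structure; higher values let the refiner deviate further from it.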

Utility: Create Metadata

python create_metadata(data_to_json).py <image_folder>

Utility: Generate Captions

# Using GPT-3.5 Turbo
python generate_captions_GPT.py <json_file>

# Using LLaMA
python generate_captions_llama.py <jsonl_file> --batch_size 8

Citation

If you use Synth-SONAR in your research, please cite our paper:

@misc{natarajan2024synthsonar,
  title={Synth-SONAR: Sonar Image Synthesis with Enhanced Diversity and Realism via Dual Diffusion Models and GPT Prompting},
  author={Purushothaman Natarajan and Kamal Basha and Athira Nambiar},
  year={2024},
  eprint={2410.08612},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}