
AgentThink · Tool‑Augmented CoT for Autonomous Driving

EMNLP 2025 Findings

AgentThink: A Unified Framework for Tool‑Augmented Chain‑of‑Thought Reasoning in Vision‑Language Models for Autonomous Driving

Contact: qka23@mails.tsinghua.edu.cn

📜 “A gentleman is not inherently different from others; he excels by skillfully leveraging external tools.”

— Xunzi. This philosophy aligns with AgentThink: by orchestrating tools & models, we achieve robust understanding and responses in complex driving scenarios.


🎬 Demo Showcase · “Experience AgentThink’s perception and planning in complex traffic.”


🎬 Demo Showcase

Experience AgentThink's real‑world performance through demonstration materials that illustrate its capabilities in autonomous driving scenarios.

Video Demonstration

Watch this video to see AgentThink's environmental perception in complex traffic conditions:

Visualization Gallery

Complementing the video, these visualizations demonstrate key capabilities:

| Scenario | Description | Image |
| --- | --- | --- |
| High-level planning | Visualizes high-level planning | View |
| Spatial Understanding | Demonstrates spatial relationship analysis | View |
| Environment Adaptability | Shows performance in adverse weather or low-light | View |

✨ Highlights

📰 Project Updates

🚀 Quick Navigation

| Section | Description | Link |
| --- | --- | --- |
| Environment Setup | Install dependencies and setup | Environment Setup |
| Model Inference | Real-time inference on val set | Model Inference |
| Demo Inference | Run demo on test set | Demo Inference |
| Evaluation Metrics | Scoring pipeline using LLM-as-Judge | Evaluation Metrics |
| Benchmark Results | Quantitative performance comparisons | Benchmark Results |

⚙️ Getting Started · Environment Setup

🛠️ Basic

| Component | Version | Check Command |
| --- | --- | --- |
| OS | Ubuntu 20.04 | `cat /etc/issue` |
| Python | 3.10.12 | `python --version` |
| CUDA Toolkit | 12.4 | `nvcc --version` |
| GPU Driver | 535.129.03 | `nvidia-smi` |
| PyTorch | 2.6.0 | `python -c "import torch; print(torch.__version__)"` |

Basic Setup

```bash
# Create virtual environment
conda create -n agentthink python=3.10
conda activate agentthink

# Install dependencies
pip install -r requirements.txt

# Install ms-swift
bash scripts/env.sh

# Install drivemllm dependency
bash scripts/env_drivemllm.sh
```

Clone ms‑swift

```bash
cd third_party
git clone https://github.com/modelscope/ms-swift.git
```

Model Inference

🎬 Use your trained AgentThink checkpoint to run inference on the AgentThink‑CoT‑val samples:

```bash
# Inference script
bash scripts/inference_scripts/inference.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Inference with tools
bash scripts/inference_scripts/inference_withtool.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Multi-node / multi-GPU inference
bash scripts/inference_scripts/inference_multigpu.sh [your_CKPT_PATH] [your_OUTPUT_DIR]

# Full AgentThink inference
bash scripts/inference_agentthink.sh [your_CKPT_PATH] [your_OUTPUT_DIR]
```
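Conceptually, the tool-augmented variant alternates between model reasoning and tool execution until the model commits to a final answer. Below is a minimal, hypothetical sketch of such a loop — the tool names, the dispatch format, and the `model_step` contract are illustrative assumptions, not the repo's actual interfaces (those live under `scripts/tools/`).

```python
# Hypothetical sketch of a tool-augmented reasoning loop.
# Tool names and the step/dispatch format are illustrative assumptions.

def run_tool(name, args):
    """Dispatch a tool call to a (stubbed) vision tool."""
    tools = {
        "detect_objects": lambda a: {"boxes": []},    # e.g. YOLO-World
        "estimate_depth": lambda a: {"depth_m": 0.0}, # e.g. Depth Anything V2
    }
    if name not in tools:
        raise ValueError(f"unknown tool: {name}")
    return tools[name](args)

def agent_loop(model_step, question, max_turns=5):
    """Alternate model reasoning and tool execution until a final answer."""
    context = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        step = model_step(context)  # the VLM emits a tool call or an answer
        if step.get("final_answer") is not None:
            return step["final_answer"]
        result = run_tool(step["tool"], step.get("args", {}))
        context.append({"role": "tool", "content": result})
    return None  # gave up after max_turns
```

The key design point is that tool outputs are appended back into the context, so each reasoning step can condition on earlier perception results.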

Evaluation Metrics

📊 Use LLM‑as‑Judge to calculate performance metrics:

```bash
# Evaluate reasoning ability and MCQ accuracy
python evaluation/evaluation_script.py
```
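For intuition, the judge's per-sample, per-dimension scores can be averaged into the benchmark numbers reported below. This is a minimal sketch assuming a JSONL file of numeric scores; the field names are assumptions, and the actual schema is defined in `evaluation/evaluation_script.py`.

```python
import json
from statistics import mean

# Sketch: average LLM-as-Judge scores per dimension over a JSONL file.
# Dimension/field names are assumptions, not the repo's actual schema.

def aggregate_scores(jsonl_path, dimensions):
    """Average each judged dimension over all samples (scores in [0, 100])."""
    per_dim = {d: [] for d in dimensions}
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            for d in dimensions:
                per_dim[d].append(float(record[d]))
    return {d: round(mean(vals), 2) for d, vals in per_dim.items()}
```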

🚀 Quick Start

Download the model

Our AgentThink model is based on Qwen2.5‑VL‑7B.

Download the tool model

Clone Depth Anything V2 and YOLO‑World:

```bash
git clone https://github.com/DepthAnything/Depth-Anything-V2
git clone https://github.com/AILab-CVC/YOLO-World
```

Then download the pretrained weights from the YOLO‑World and Depth Anything V2 repositories.

Download the basic tool results

Download the val.pkl file from USC‑GVL / Agent‑Driver.
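Before wiring `val.pkl` into the pipeline, it is worth peeking at its structure. The sketch below only assumes the file is a standard pickle; the exact layout of Agent-Driver's tool results is an assumption, so inspect the keys before relying on any field.

```python
import pickle

# Sketch: load the precomputed tool results (e.g. val.pkl from
# USC-GVL/Agent-Driver). The pickle's internal layout is an assumption.

def load_tool_results(path):
    """Load a pickled tool-result file and print a quick structural summary."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    if isinstance(data, dict):
        print(f"{len(data)} entries, sample keys: {list(data)[:3]}")
    else:
        print(f"loaded object of type {type(data).__name__}")
    return data
```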

Folder structure

```
AgentThink/
├── 📂 data/                    # Dataset and processed data
│   ├── DriveLMMo1_TEST_tool_results.jsonl
│   ├── DriveLMMo1_TEST.jsonl
│   ├── 📂 image2concat/        # Concatenated image files
│   └── 📂 tool_results/        # Results from tool processing
│
├── 📂 demo_image/              # Demonstration images
│   ├── nuscenes_CAM_FRONT_3590.webp
│   ├── nuscenes_CAM_FRONT_3757.webp
│   └── nuscenes_CAM_FRONT_3896.webp
│
├── 📂 pretrained_model/        # Pre-trained model files
│   ├── 📂 AgentThink/
│   │   └── checkpoint-700-merged
│   ├── depth_anything_v2_vitb.pth
│   └── yolov8x-world2.pt
│
├── 📂 assets/                  # Visual assets and resources
├── 📂 evaluation/              # Evaluation scripts and benchmarks
├── 📂 Inference/               # Inference-related scripts and data
├── 📂 results/                 # Output and result files
├── 📂 scripts/                 # Various utility scripts
├── 📂 third_party/             # Third-party libraries and resources
├── README.cn.md                # Chinese documentation
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
└── ...                         # Other project files
```

Demo Inference

```bash
# drivemllm
python Inference/inference_demo_drivemllm.py

# drivelmm-o1
python Inference/inference_demo_drivelmm.py
```
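The demo scripts read their inputs from the JSON files under `Inference/` (e.g. `inference_demo_data_drivemllm.json`). The sketch below shows one plausible way such a file could be consumed — the field names (`image`, `question`) are assumptions, not the repo's actual schema.

```python
import json

# Sketch: iterate over demo samples from a JSON file pairing an image
# with a question. Field names are assumptions, not the repo schema.

def load_demo_samples(path):
    """Yield (image, question) pairs from a demo-data JSON file."""
    with open(path) as f:
        samples = json.load(f)
    for sample in samples:
        yield sample.get("image"), sample.get("question")
```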

📋 TODO List · Development Roadmap

| Status | Task Description |
| --- | --- |
| ✅ | AgentThink demo implementation |
| ✅ | General reasoning evaluation metrics |
| 🔜 | Tool-specific evaluation metrics |
| 🔜 | Data preprocessing pipeline |
| ✅ | Debug example implementation |
| 🔜 | Multi-stage training framework |
| 🔜 | Tool function interaction environment |

AgentThink Results

📊 DriveLMM‑o1 Performance

| Vision Language Models | Risk Assess. (%) ↑ | Rule Adh. (%) ↑ | Scene Aware. (%) ↑ | Relevance (%) ↑ | Missing (%) ↑ | Reason. (%) ↑ | MCQ (%) ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT‑4o [16] | 71.32 | 80.72 | 72.96 | 76.65 | 71.43 | 72.52 | 57.84 |
| Ovis1.5‑Gemma2‑9B [21] | 51.34 | 66.36 | 54.74 | 55.72 | 55.74 | 55.62 | 48.85 |
| Mulberry‑7B [45] | 51.89 | 63.66 | 56.68 | 57.27 | 57.45 | 57.65 | 52.86 |
| LLaVA‑CoT [43] | 57.62 | 69.01 | 60.84 | 62.72 | 60.67 | 61.41 | 49.27 |
| LlamaV‑o1 [34] | 60.20 | 73.52 | 62.67 | 64.66 | 63.41 | 63.13 | 50.02 |
| InternVL2.5‑8B [4] | 69.02 | 78.43 | 71.52 | 75.80 | 70.54 | 71.62 | 54.87 |
| Qwen2.5‑VL‑7B [1] | 46.44 | 60.45 | 51.02 | 50.15 | 52.19 | 51.77 | 37.81 |
| DriveLMM‑o1 [15] | 73.01 | 81.56 | 75.39 | 79.42 | 74.49 | 75.24 | 62.36 |
| AgentThink (Ours) | 80.51 | 84.98 | 82.11 | 84.99 | 79.56 | 79.68 | 71.35 |

📊 DriveMLLM Comparison

| Type | Model | L/R | F/B | RHD | RD | PPos | BBox | CVD | CD | AccS | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Zero‑shot | GPT‑4o [16] | 91.72 | 67.60 | 9.58 | 14.69 | 40.90 | 4.07 | 46.11 | 70.65 | 43.16 | 25.63 |
| Zero‑shot | GPT‑4o‑mini | 67.67 | 50.13 | 70.44 | 0.00 | 29.28 | 3.78 | 0.00 | 46.40 | 33.46 | 16.68 |
| Zero‑shot | LLaVA‑ov‑72B [19] | 85.42 | 49.48 | 13.76 | 45.27 | 16.46 | 0.00 | 42.97 | 27.09 | 35.06 | 21.10 |
| Zero‑shot | Qwen2.5‑VL‑7B [1] | 76.55 | 55.24 | 7.14 | 17.11 | 55.97 | 38.31 | 55.94 | 51.52 | 44.72 | 13.36 |
| Zero‑shot | Qwen + CoT | 87.06 | 63.09 | 16.69 | 22.56 | 52.51 | 38.87 | 76.90 | 38.71 | 49.55 | 19.31 |
| Zero‑shot | Qwen + DirectTool | 78.95 | 48.96 | 58.43 | 67.57 | 58.20 | 42.22 | 51.76 | 51.38 | 57.18 | 24.05 |
| Zero‑shot | AgentThink (Ours) | 82.33 | 54.40 | 56.14 | 61.45 | 70.45 | 56.23 | 23.09 | 51.60 | 56.96 | 26.52 |
| One‑shot | GPT‑4o | 91.08 | 69.37 | 36.51 | 71.17 | 42.44 | 5.10 | 0.00 | 63.88 | 47.44 | 33.17 |
| One‑shot | GPT‑4o‑mini | 66.00 | 48.95 | 83.02 | 58.47 | 25.71 | 3.97 | 52.73 | 55.23 | 49.26 | 22.13 |
| One‑shot | LLaVA‑ov‑72B [19] | 79.12 | 62.97 | 49.26 | 68.04 | 28.57 | 2.20 | 53.12 | 60.90 | 50.52 | 36.66 |
| One‑shot | Qwen2.5‑VL‑7B [1] | 80.30 | 53.14 | 36.96 | 39.13 | 62.69 | 22.63 | 49.88 | 48.32 | 49.13 | 33.53 |
| One‑shot | Qwen + CoT | 86.35 | 59.95 | 43.29 | 31.81 | 53.64 | 26.93 | 51.02 | 42.30 | 49.41 | 32.06 |
| One‑shot | Qwen + DirectTool | 84.57 | 55.50 | 67.32 | 59.54 | 85.58 | 26.07 | 52.34 | 53.25 | 60.52 | 42.27 |
| One‑shot | AgentThink (Ours) | 78.71 | 48.46 | 60.64 | 60.71 | 72.36 | 64.46 | 52.26 | 52.04 | 61.21 | 47.24 |

📁 Repository Structure

```
AgentThink/
├── assets/                 # Visual assets and resources
├── data/                   # Data files and datasets
├── evaluation/             # Evaluation scripts and benchmarks
│   ├── evaluation_script.py
│   └── inference_agentthink.py
├── Inference/              # Inference-related scripts and data
│   ├── inference_demo_data_drivemllm.json
│   ├── inference_demo_data_drivelmm.json
│   └── inference_demo_drivemllm.py
├── results/                # Output and result files
│   └── agentthink/
├── scripts/                # Various utility scripts
│   ├── debug_scripts/
│   ├── inference_scripts/
│   └── tools/              # Tool library implementations
├── third_party/            # Third-party libraries and resources
│   ├── 🐍 inference.py         # Main inference script
│   ├── 🐍 prepare_data.py      # Data preparation script
│   ├── 🐍 utlis.py             # Utility functions
│   ├── 🐚 env.sh               # Environment setup script
│   ├── 🐚 env_drivemllm.sh     # DriveMLLM environment script
│   └── 🐚 prepare_json_data.sh # Long JSON data preparation script
├── 📄 README.md            # Project documentation
├── 📄 README_CN.md         # Chinese documentation
├── 📄 requirements.txt     # Python dependencies
```

🪪 License & Citation

License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Citation

```bibtex
@misc{qian2025agentthinkunifiedframeworktoolaugmented,
  title={AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving},
  author={Kangan Qian and others},
  year={2025},
  eprint={2505.15298},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2505.15298},
}
```