Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
- Website
- Community
- Solutions
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2504.16072

Describe Anything

Multimodal Large Language Models for Detailed Localized Image and Video Captioning

Paused

Agents

Featured

344

Describe Anything

⚡

344

Describe any selected part of an image
nvidia/DAM-3B

Image-Text-to-Text • Updated May 7, 2025 • 19.3k • 132
nvidia/DAM-3B-Video

Image-Text-to-Text • Updated May 7, 2025 • 196 • 59
nvidia/DAM-3B-Self-Contained

Image-Text-to-Text • Updated May 7, 2025 • 521 • 25

AI Paper of the Day

A collection of papers that I think are interesting, one added each day

Can Large Language Models Understand Context?

Paper • 2402.00858 • Published Feb 1, 2024 • 24
OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 86
Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18, 2024 • 156
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30, 2024 • 25

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment

Paper • 2410.09604 • Published Oct 12, 2024
Geospatial Mechanistic Interpretability of Large Language Models

Paper • 2505.03368 • Published May 6, 2025 • 13
Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Paper • 2505.02836 • Published May 5, 2025 • 8

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Paper • 2504.08641 • Published Apr 11, 2025 • 6
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 140
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
A Survey of Interactive Generative Video

Paper • 2504.21853 • Published Apr 30, 2025 • 46

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 31
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Paper • 2404.15653 • Published Apr 24, 2024 • 29
MoDE: CLIP Data Experts via Clustering

Paper • 2404.16030 • Published Apr 24, 2024 • 14
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Paper • 2405.12130 • Published May 20, 2024 • 50
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Paper • 2405.12981 • Published May 21, 2024 • 33

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Paper • 2505.01456 • Published May 1, 2025 • 2
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66

TTRL: Test-Time Reinforcement Learning

Paper • 2504.16084 • Published Apr 22, 2025 • 122
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
Reinforcing General Reasoning without Verifiers

Paper • 2505.21493 • Published May 27, 2025 • 27
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Paper • 2505.24760 • Published May 30, 2025 • 74

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13, 2025 • 17
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

Paper • 2503.10630 • Published Mar 13, 2025 • 6
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Paper • 2503.09516 • Published Mar 12, 2025 • 40
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10, 2025 • 88

Describe Anything

Multimodal Large Language Models for Detailed Localized Image and Video Captioning

Paused

Agents

Featured

344

Describe Anything

⚡

344

Describe any selected part of an image
nvidia/DAM-3B

Image-Text-to-Text • Updated May 7, 2025 • 19.3k • 132
nvidia/DAM-3B-Video

Image-Text-to-Text • Updated May 7, 2025 • 196 • 59
nvidia/DAM-3B-Self-Contained

Image-Text-to-Text • Updated May 7, 2025 • 521 • 25

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 31
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 45
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 24

AI Paper of the Day

A collection of papers that I think are interesting, one added each day

Can Large Language Models Understand Context?

Paper • 2402.00858 • Published Feb 1, 2024 • 24
OLMo: Accelerating the Science of Language Models

Paper • 2402.00838 • Published Feb 1, 2024 • 86
Self-Rewarding Language Models

Paper • 2401.10020 • Published Jan 18, 2024 • 156
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity

Paper • 2401.17072 • Published Jan 30, 2024 • 25

CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data

Paper • 2404.15653 • Published Apr 24, 2024 • 29
MoDE: CLIP Data Experts via Clustering

Paper • 2404.16030 • Published Apr 24, 2024 • 14
MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

Paper • 2405.12130 • Published May 20, 2024 • 50
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Paper • 2405.12981 • Published May 21, 2024 • 33

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation

Paper • 2505.01456 • Published May 1, 2025 • 2
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66

Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
EmbodiedCity: A Benchmark Platform for Embodied Agent in Real-world City Environment

Paper • 2410.09604 • Published Oct 12, 2024
Geospatial Mechanistic Interpretability of Large Language Models

Paper • 2505.03368 • Published May 6, 2025 • 13
Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Paper • 2505.02836 • Published May 5, 2025 • 8

TTRL: Test-Time Reinforcement Learning

Paper • 2504.16084 • Published Apr 22, 2025 • 122
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
Reinforcing General Reasoning without Verifiers

Paper • 2505.21493 • Published May 27, 2025 • 27
REASONING GYM: Reasoning Environments for Reinforcement Learning with Verifiable Rewards

Paper • 2505.24760 • Published May 30, 2025 • 74

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

Paper • 2504.08641 • Published Apr 11, 2025 • 6
PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters

Paper • 2504.08791 • Published Apr 7, 2025 • 140
Describe Anything: Detailed Localized Image and Video Captioning

Paper • 2504.16072 • Published Apr 22, 2025 • 66
A Survey of Interactive Generative Video

Paper • 2504.21853 • Published Apr 30, 2025 • 46

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization

Paper • 2503.10615 • Published Mar 13, 2025 • 17
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation

Paper • 2503.10630 • Published Mar 13, 2025 • 6
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Paper • 2503.09516 • Published Mar 12, 2025 • 40
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

Paper • 2503.07536 • Published Mar 10, 2025 • 88

Previous
1
2
Next

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs