Papers
arxiv:2402.08327

PreFLMR: Scaling Up Fine-Grained Late-Interaction Multi-modal Retrievers

Published on Feb 13, 2024
Authors:
,
,

Abstract

The M2KR framework and PreFLMR, a pre-trained version of FLMR, achieve state-of-the-art results in Knowledge-based Visual Question Answering through extensive training and evaluation across multiple vision and language tasks.

Large Multimodal Models (LMMs) excel in natural language and visual understanding but are challenged by exacting tasks such as Knowledge-based Visual Question Answering (KB-VQA) which involve the retrieval of relevant information from document collections to use in shaping answers to questions. We present an extensive training and evaluation framework, M2KR, for KB-VQA. M2KR contains a collection of vision and language tasks which we have incorporated into a single suite of benchmark tasks for training and evaluating general-purpose multi-modal retrievers. We use M2KR to develop PreFLMR, a pre-trained version of the recently developed Fine-grained Late-interaction Multi-modal Retriever (FLMR) approach to KB-VQA, and we report new state-of-the-art results across a range of tasks. We also present investigations into the scaling behaviors of PreFLMR intended to be useful in future developments in general-purpose multi-modal retrievers.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2402.08327
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 9

Browse 9 models citing this paper

Datasets citing this paper 5

Browse 5 datasets citing this paper

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2402.08327 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.