Image Comprehension
7 papers with code • 0 benchmarks • 1 dataset
Most implemented papers
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.
ArtGPT-4: Towards Artistic-understanding Large Vision-Language Models with Enhanced Adapter
However, a grand challenge in exploiting LLMs for multimodal learning is their size: pre-trained LLMs typically contain billions of parameters.
JourneyDB: A Benchmark for Generative Image Understanding
On our dataset, we have devised four benchmarks to assess generated-image comprehension in terms of both content and style interpretation.
Hierarchical Open-vocabulary Universal Image Segmentation
Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension
To this end, we propose to extract features corresponding to regional objects as soft prompts for LLM, which provides a straightforward and scalable approach and eliminates the need for LLM fine-tuning.
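The soft-prompt idea described above can be sketched as follows: regional features are projected into the LLM's embedding space and prepended to the text token embeddings, so a frozen LLM can attend to regional content without any fine-tuning. This is a minimal illustrative sketch, not the paper's actual implementation; the module name, dimensions, and projection design are all assumptions.

```python
import torch
import torch.nn as nn

class RegionSoftPrompt(nn.Module):
    """Project regional object features into the LLM embedding space
    and prepend them as soft prompts (hypothetical sketch)."""

    def __init__(self, region_dim: int = 256, llm_dim: int = 512):
        super().__init__()
        # A linear layer maps region features to the LLM's hidden size.
        self.proj = nn.Linear(region_dim, llm_dim)

    def forward(self, region_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        # region_feats: (batch, n_regions, region_dim)
        # text_embeds:  (batch, n_tokens, llm_dim)
        soft_prompts = self.proj(region_feats)          # (batch, n_regions, llm_dim)
        # Prepend soft prompts to the text sequence along the token axis.
        return torch.cat([soft_prompts, text_embeds], dim=1)

# Toy usage: 2 region prompts + 5 text tokens -> 7-token LLM input.
module = RegionSoftPrompt()
regions = torch.randn(1, 2, 256)
tokens = torch.randn(1, 5, 512)
out = module(regions, tokens)
print(tuple(out.shape))
```

Because only the small projection layer is trained, the approach scales to new region types without touching the LLM's weights, which matches the paper's stated goal of avoiding LLM fine-tuning.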
InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition
We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Multi-modal large language models (MLLMs) have demonstrated remarkable success in vision and visual-language tasks within the natural image domain.