
🥇 The 5 AI Papers You Should Read This Week

Fresh out the Neural Network. Our model analyzed and ranked 1000+ papers to provide you with the following summary. Enjoy!

AlphaSignal

Hey,

Welcome back to AlphaSignal, where we bring you the latest developments in the world of AI.

In the past few days, an impressive number of AI papers have been released, and among them, we have handpicked the top six that truly stand out.

On Today’s Summary:

  • OtterHD: A High-Resolution Multi-modality Model

  • CogVLM: Visual Expert for Pretrained Language Models

  • Holistic Evaluation of Text-to-Image Models

  • Other notable papers

Reading time: 4 min 45 sec

📄 TOP PUBLICATIONS

OtterHD: A High-Resolution Multi-modality Model

Score: 9.9 • Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu

Objective
The paper aims to devise a Large Multimodal Model (LMM), OtterHD-8B, tailored specifically for high-resolution image understanding. OtterHD-8B accepts images at their native resolution, allowing the model to pick up on minute details. Additionally, the authors propose the MagnifierBench benchmark for evaluating LMMs' capacity to detect small features in high-resolution images.

Central Problem
Most LMMs process images at a fixed resolution, e.g. 224x224 or 336x336. Large, high-resolution images must be downscaled and reshaped to this input resolution, which can obscure fine-grained features and prevent the model from comprehending those details. Moreover, standard approaches to scaling up LMMs primarily scale the text decoder while keeping the image encoder unchanged.

Proposed Solution

  1. The authors build upon Fuyu-8B and introduce OtterHD-8B, a multimodal model specifically designed to process images of varying resolutions up to 1024x1024. OtterHD-8B is instruction-tuned from Fuyu-8B on images ranging in size from 448x448 to 1024x1024 (a minimal training sketch follows this list).

  2. The MagnifierBench benchmark was constructed from the Panoptic Video Scene Graph Generation (PVSG) dataset, which consists of video data such as first-person videos of household chores. Annotators manually identified small objects in the dataset to construct 283 question-answer pairs, spanning multiple-choice and free-form answering.
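
To make the training recipe concrete, here is a minimal sketch of dynamic-resolution instruction tuning: each example is resized so its longest side matches a randomly sampled target between 448 and 1024 pixels, instead of being squashed to one fixed shape. The helper names below are illustrative assumptions, not code from the OtterHD or Fuyu-8B releases.

    # Sketch of dynamic-resolution sampling for instruction tuning (illustrative only;
    # not taken from the OtterHD or Fuyu-8B codebases).
    import random
    from PIL import Image

    CANDIDATE_TARGETS = [448, 512, 768, 896, 1024]  # longest-side targets in pixels

    def resize_keep_aspect(image: Image.Image, target: int) -> Image.Image:
        """Resize so the longest side equals `target`, preserving aspect ratio."""
        w, h = image.size
        scale = target / max(w, h)
        return image.resize((round(w * scale), round(h * scale)))

    def build_training_example(image: Image.Image, instruction: str, answer: str) -> dict:
        """Each example gets its own randomly sampled resolution, so the model
        sees varied image sizes during instruction tuning."""
        target = random.choice(CANDIDATE_TARGETS)
        return {
            "pixels": resize_keep_aspect(image, target),
            "prompt": instruction,
            "label": answer,
        }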

Results
OtterHD-8B outperforms other LMMs of similar parameter size, such as InstructBLIP, LLaVA, and Qwen-VL, on MagnifierBench. Ablation studies on image resolution reveal that the dynamic sampling scheme (training with images of varied resolution) generalizes better to higher resolutions unseen during training, highlighting the crucial role of resolution flexibility.

Must Watch: The Best Deep Dive on Synthetic Data

Join Gretel's groundbreaking virtual event on Tuesday, November 14th at 9am PST/12pm EST. Dive into the innovative world of synthetic data and its exciting applications.

What to Expect:

  • Live Demos: See Gretel's platform in action, showcasing the future of synthetic data.

  • Expert Insights: Hear from Alex Watson, Gretel's Co-Founder and CPO, on their vision and upcoming plans.

  • Exclusive Previews: First look at new models, including the Model Playground interface and Tabular LLM.

  • Operational Insights: Learn how "Gretel Workflows" can streamline synthetic data use in your enterprise.

  • Interactive Q&A: Get your questions answered live.

Reserve your spot now for this must-attend event and be at the forefront of the synthetic data revolution with Gretel.

CogVLM: Visual Expert for Pretrained Language Models

Score: 9.3 • Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu

Objective
The paper aims to devise a Visual Language Model (VLM) that focuses on the alignment between text and images. To this end, the authors propose a visual expert module that introduces new parameters in the attention and FFN layers to deeply align image and text features without degrading the underlying LLM's performance.

Central Problem
Bridging the modality gap between text and images is hard. VLMs such as BLIP-2 that rely on shallow alignment methods to incorporate visual information into pretrained language models struggle on more nuanced tasks. On the other hand, jointly training image and text models, as in PaLI-X, causes catastrophic forgetting in the language model, significantly hampering performance on language benchmarks.

Proposed Solution
The vision-language model CogVLM-17B introduces deep fusion between visual and language information via new attention and feed-forward network (FFN) layers dedicated to visual features. This visual expert module is trained while the language model's weights remain frozen. The model is built from Vicuna-7B, and the model weights are open-sourced.
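
The core mechanism is easiest to see in code. Below is a rough PyTorch-style sketch of the visual-expert idea, assuming image and text tokens share one sequence: positions holding image tokens are routed through new, trainable QKV projections, while text positions keep the frozen language-model projections. This is an illustration of the mechanism described above, not the released CogVLM implementation.

    # Rough sketch of a visual-expert QKV projection: image-token positions use new,
    # trainable weights; text-token positions keep the frozen LLM weights.
    # Illustrative only; not the released CogVLM code.
    import torch
    import torch.nn as nn

    class VisualExpertQKV(nn.Module):
        def __init__(self, hidden: int):
            super().__init__()
            self.text_qkv = nn.Linear(hidden, 3 * hidden)   # frozen LLM projection
            self.image_qkv = nn.Linear(hidden, 3 * hidden)  # new, trainable expert
            for p in self.text_qkv.parameters():
                p.requires_grad_(False)

        def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq, hidden); image_mask: (batch, seq) bool, True at image tokens
            qkv = torch.where(image_mask.unsqueeze(-1), self.image_qkv(x), self.text_qkv(x))
            return qkv  # later split into Q, K, V and fed to the frozen attention

    # Example: a 10-token sequence whose first 4 positions are image patches.
    x = torch.randn(2, 10, 64)
    image_mask = torch.zeros(2, 10, dtype=torch.bool)
    image_mask[:, :4] = True
    out = VisualExpertQKV(64)(x, image_mask)  # shape: (2, 10, 192)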

Results
CogVLM-17B achieves state-of-the-art or the second-best performance across 14 classic cross-modal benchmarks including image captioning datasets, VQA datasets, and visual grounding datasets.

Holistic Evaluation of Text-to-Image Models

Score: 7.1 • Tony Lee, Michihiro Yasunaga, Chenlin Meng, Yifan Mai, Joon Sung Park, Agrim Gupta, Yunzhi Zhang, Deepak Narayanan

Objective
The paper aims to develop a benchmark for the Holistic Evaluation of Text-to-Image Models (HEIM) that covers text-image alignment, image quality, aesthetics, originality, reasoning, knowledge, bias, toxicity, fairness, robustness, multilinguality, and efficiency. 26 state-of-the-art text-to-image (T2I) models are extensively evaluated on this benchmark.

Central Problem
Most T2I models are evaluated only in terms of text-image alignment (e.g. CLIPScore) and image quality (e.g. FID). These evaluations tend to overlook other critical aspects, such as originality and aesthetics, as well as the presence of toxic or biased content. Moreover, papers in the literature adopt different evaluation datasets and metrics, making direct comparison challenging.

Proposed Solution
The paper introduces the Holistic Evaluation of Text-to-Image Models (HEIM) benchmark, covering 12 diverse aspects of model quality. The benchmark comprises 62 scenarios (datasets of prompts) and 25 metrics that assess the generated images along each aspect. The authors additionally conduct crowdsourced human evaluations. Using this benchmark, 26 text-to-image models were compared in a standardized fashion.
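
As a rough picture of how such a benchmark is wired together, the sketch below crosses scenarios (prompt datasets) with per-aspect metrics and averages each metric over a scenario's prompts. Names and signatures are placeholders, not the actual HEIM/HELM API.

    # Sketch of a HEIM-style harness: scenarios (prompt sets) x per-aspect metrics.
    # All names are placeholders, not the actual HEIM/HELM codebase.
    from typing import Any, Callable, Dict, List

    Metric = Callable[[str, Any], float]  # (prompt, generated image) -> score

    def evaluate_model(generate: Callable[[str], Any],
                       scenarios: Dict[str, List[str]],
                       metrics: Dict[str, Metric]) -> Dict[str, Dict[str, float]]:
        """Average every metric over each scenario's prompts for one T2I model."""
        results: Dict[str, Dict[str, float]] = {}
        for scenario_name, prompts in scenarios.items():
            images = [generate(p) for p in prompts]
            results[scenario_name] = {
                aspect: sum(metric(p, img) for p, img in zip(prompts, images)) / len(prompts)
                for aspect, metric in metrics.items()
            }
        return results

    # Usage (placeholder scenario and metric names):
    #   scenarios = {"mscoco_base": [...], "multilingual_prompts": [...]}
    #   metrics = {"alignment": clip_score, "aesthetics": aesthetic_score}
    #   table = evaluate_model(my_model.generate, scenarios, metrics)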

Results

  • Overall Performance: No single T2I model excelled in all assessment areas.

  • Correlations: Observed positive correlations between general alignment and reasoning, as well as between aesthetics and originality in leading models.

  • Trade-offs: Noted trade-offs in some models, particularly aesthetics against photorealism, and bias and toxicity against text-image alignment and photorealism.

  • Underperforming Areas: All evaluated models showed weaker performance in reasoning, photorealism, and multilinguality, indicating these as key areas for future improvements.

🏅 NOTABLE PAPERS

Levels of AGI: Operationalizing Progress on the Path to AGI
Score: 7.4 • Proposes a framework for categorizing AGI models based on performance and generality, outlines six guiding principles for AGI assessment, and emphasizes safe deployment considerations.

Simplifying Transformer Blocks
Score: 6.6 • Simplifies transformer blocks, removing elements like skip connections without affecting training speed. Resulting models are equally effective, 15% faster, and use 15% fewer parameters than standard transformers.
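
For intuition, here is a rough structural sketch of a transformer block with the residual (skip) connections removed and the attention and MLP branches run side by side. It only illustrates the block layout; the paper's full recipe (e.g. shaped attention and careful initialization, which also let it drop value and projection parameters) is what makes such blocks trainable in practice.

    # Structural sketch only: a parallel transformer block with no skip connections.
    # Not the paper's exact recipe; training such a block stably requires the
    # signal-propagation techniques described in the paper.
    import torch
    import torch.nn as nn

    class SimplifiedParallelBlock(nn.Module):
        def __init__(self, d_model: int, n_heads: int, d_ff: int):
            super().__init__()
            self.norm = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = self.norm(x)
            attn_out, _ = self.attn(h, h, h)
            return attn_out + self.mlp(h)  # note: no "+ x" residual path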

Black-Box Prompt Optimization: Aligning Large Language Models without Model Training
Score: 6.2 • Introduces Black-Box Prompt Optimization (BPO) to align Large Language Models (LLMs) like GPTs without additional training, improving user intent realization by 22% for ChatGPT and 10% for GPT-4.
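
Conceptually, BPO sits entirely outside the model: a separate prompt-rewriting step improves the user's prompt before it reaches a frozen, black-box LLM. The sketch below shows that flow with placeholder functions; it is not the released BPO code or any real API.

    # Sketch of the black-box prompt optimization flow (placeholder functions only).
    def rewrite_prompt(user_prompt: str) -> str:
        # In BPO this is a trained seq2seq rewriter learned from preference data;
        # here we simply append generic guidance as a stand-in.
        return user_prompt + "\n\nBe specific, answer step by step, and state assumptions."

    def call_llm(prompt: str) -> str:
        # Stand-in for any black-box LLM API call.
        return f"<model response to: {prompt!r}>"

    def bpo_generate(user_prompt: str) -> str:
        return call_llm(rewrite_prompt(user_prompt))

    print(bpo_generate("Summarize the attention mechanism."))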


Thank You

Hyungjin Chung is a contributing writer at AlphaSignal and a second-year Ph.D. student at KAIST's Bio-Imaging, Signal Processing & Learning Lab (BISPL). He was previously a research intern in the applied mathematics and plasma physics group (T-5) at Los Alamos National Laboratory (LANL).

Jacob Marks is an editor at AlphaSignal and an ML engineer at Voxel51, recognized as a leading AI voice on Medium and LinkedIn. Formerly at Google X and Samsung, he holds a Ph.D. in Theoretical Physics from Stanford.

Want to promote your company, product, job, or event to 150,000+ AI researchers and engineers? You can reach out here.