
🥇 The 3 AI Papers You Should Read This Week

AlphaSignal

Hey,

Welcome back to your weekly research summary.

In the last 7 days, over 1,500 AI research papers were released. Worry not: our models and team have identified the few that truly stand out.

Let's get into it!

Lior

On Today’s Summary:

  • LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

  • DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

  • Aligning large multimodal models with factually augmented RLHF

  • Other notable papers

📄 TOP PUBLICATIONS

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

Score: 9.9 • Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang

Objective
The paper focuses on building a better text-to-video (T2V) model on top of a text-to-image (T2I) diffusion model. The goal is to generate videos that are visually realistic and temporally coherent, while preserving the creative strengths of the underlying T2I model.

Central Problem
Current T2V models are not as good as T2I models. The paper identifies three main issues: the design of the neural network, how T2V models are trained from T2I models, and the quality of the training data.

Proposed Solution

  • Network: Three latent diffusion models (LDMs) are chained in a cascade. The first generates a short, low-resolution base video; the second temporally interpolates it for smoother motion; the last upscales it to a final size of 61 (temporal) x 1280 x 2048. A new temporal attention block handles the temporal dimension (a minimal sketch of such a block follows this list).

  • Training: Both videos and images are used for training. Images are mixed into video batches, and a balanced loss function preserves the strengths of the original T2I model.

  • Dataset: A new high-quality dataset, Vimeo25M, is proposed. It has 25 million text-video pairs that are clean and high-res. The authors narrowed this down to 20 million videos and 400 million images for training.
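
For intuition, here is a minimal PyTorch sketch of the kind of temporal attention block described above: self-attention applied along the frame axis at each spatial location. The layer layout, dimensions, and initialization are illustrative assumptions, not LAVIE's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalAttentionBlock(nn.Module):
    """Self-attention over the frame (time) axis of a video feature map.
    Input/output shape: (batch, channels, frames, height, width).
    A sketch of the idea only; LAVIE's actual block may differ."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, t, h, w = x.shape
        # Fold spatial positions into the batch so attention mixes information
        # across frames at each spatial location independently.
        tokens = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        normed = self.norm(tokens)
        attended, _ = self.attn(normed, normed, normed)
        tokens = tokens + attended  # residual connection
        return tokens.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)

# Toy check: shapes are preserved.
block = TemporalAttentionBlock(channels=64)
video_features = torch.randn(2, 64, 16, 8, 8)  # (B, C, T, H, W)
print(block(video_features).shape)  # torch.Size([2, 64, 16, 8, 8])
```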

Methodology
Data was collected from Vimeo and cleaned with Videochat, spanning ten different content categories. The base T2I model is Stable Diffusion v1.4, which LAVIE fine-tunes for T2V. Training started on an easier dataset and moved to Vimeo25M later.

Results
The new model, LAVIE, outperforms most other T2V models in quantitative evaluations. It is only outdone by Make-a-Video, which used more training data and targeted a lower resolution. In human evaluation studies, LAVIE also beat models like ModelScope and VideoCrafter.

LAST CHANCE: Save Your Spot At The Most In-Depth Virtual Conference On LLMs In Production (Free)

The MLOps Community is running its third LLMs in Production virtual conference. This is the place to be if you want to go deep on the latest developments in gen AI. Join now to hear from real practitioners with hands-on experience incorporating LLMs into real-life products, rather than Twitter demos.

On the agenda: 40+ expert speakers, musical interludes, guided meditation, prompt injection games, and plenty of surprises. Sign up now to avoid FOMO later.

October 3rd | Online | 100% free

DreamGaussian: Generative Gaussian Splatting for Efficient 3D Content Creation

Score: 9.0 • Jiaxiang Tang, Jiawei Ren, Hang Zhou, Ziwei Liu, Gang Zeng

Objective
The paper applies 3D Gaussian splatting (GS), which has recently gained interest in the NeRF community, to the generative tasks of text-to-3D and image-to-3D. By switching from an implicit neural representation to explicit Gaussian parameters, DreamGaussian aims to be a fast generator that can create assets within 2 minutes.

Central problem
Current text-to-3D and image-to-3D methods often rely on implicit NeRF representations, which makes them slow: generating a single 3D asset can take tens of minutes. GS is faster and doesn't need neural networks, but it hasn't been applied to generative tasks yet.

Proposed Solution

  1. Generative Gaussian Splatting: GS parameters like position, rotation, scaling, color, and opacity are optimized using a standard score distillation sampling (SDS) loss from a pre-trained diffusion model (a generic sketch of this loss follows the list).

  2. Mesh Extraction: The generated 3D asset can be blurry. A mesh is created using a grid query, marching cubes, and post-processing steps like decimation and remeshing. UV coordinates are unwrapped for future texturing.

  3. UV-Space Refinement: In UV-space, texture is improved using an SDEdit method, applying forward-reverse diffusion to a smooth UV image. A Mean Square Error (MSE) loss optimizes this texture.
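
To make step 1 concrete, here is a generic, hedged sketch of an SDS-style loss in PyTorch, using the common detached-residual formulation: render the current Gaussians, add noise, and let a frozen diffusion prior's noise prediction drive the gradient. The function names, weighting, and guidance scale are illustrative, not DreamGaussian's exact code.

```python
import torch

def sds_loss(rendered_latents, eps_pred_fn, text_embeds, t, alphas_cumprod,
             guidance_scale=100.0):
    """Score distillation sampling (SDS) loss, detached-residual form.
    rendered_latents: latents of the view rendered from the current Gaussians
                      (requires_grad=True so gradients reach the splat parameters).
    eps_pred_fn: a frozen diffusion model's noise predictor; returns the
                 unconditional and text-conditioned noise estimates."""
    noise = torch.randn_like(rendered_latents)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noisy = a_t.sqrt() * rendered_latents + (1 - a_t).sqrt() * noise  # forward diffusion

    with torch.no_grad():  # the diffusion prior stays frozen
        eps_uncond, eps_text = eps_pred_fn(noisy, t, text_embeds)
        eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)   # classifier-free guidance

    w = 1 - a_t                           # a common SDS weighting choice
    grad = w * (eps - noise)
    # Detaching the residual makes backprop reproduce the SDS gradient w.r.t. the latents.
    return (grad.detach() * rendered_latents).sum()

# Toy usage with a dummy noise predictor standing in for a real diffusion UNet.
latents = torch.randn(1, 4, 64, 64, requires_grad=True)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)  # placeholder alpha-bar schedule
dummy_eps = lambda x, t, emb: (torch.randn_like(x), torch.randn_like(x))
sds_loss(latents, dummy_eps, None, torch.tensor([500]), alphas_cumprod).backward()
print(latents.grad.shape)  # torch.Size([1, 4, 64, 64])
```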

Methodology
3D Gaussians start off randomly and are updated every 50 steps. The model is trained in two stages: 500 steps for stage one and 50 for stage two. Different diffusion priors are used for text-to-3D and image-to-3D.

Results
DreamGaussian cuts generation time from roughly 20 minutes to 2 minutes. This is faster than Zero-1-to-3 and comparable to inference-only methods like One-2-3-45, Point-E, and Shap-E, which take 30 to 75 seconds. Importantly, this speedup does not sacrifice performance.

Speechmatics has launched Real-Time Translation as part of its all-in-one Speech API.

Their new self-supervised model can bring your product or service to the largest audience possible, without the hassle of multiple different language APIs and lengthy setup times.

Aligning large multimodal models with factually augmented RLHF

Score: 9.1 • Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang

Objective
The paper aims to solve "hallucinations" in Large Multimodal Models (LMMs), where the text output doesn't align with the image context.

Central Problem
In LMMs, the vision and language modalities are often misaligned, leading to inaccurate outputs, a serious issue for tasks like image captioning and visual question answering.

Proposed Solution
The paper introduces Fact-RLHF, an algorithm that augments the reward model with additional facts, such as ground-truth image captions. This makes it harder for the language model to game the reward model and helps it distinguish between "unhelpful" and "inaccurate" responses.
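
As a toy illustration of the factual-augmentation idea, the sketch below assembles a reward-model input that places ground-truth image captions next to the question and the candidate response, so hallucinations can be checked against known facts. The template, field names, and scoring instruction are assumptions for illustration, not the paper's actual prompt format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RewardSample:
    question: str
    response: str
    image_captions: List[str]  # ground-truth captions (the "facts") for the image

def build_reward_input(sample: RewardSample) -> str:
    """Concatenate the conversation with factual context so the reward model can
    judge factual consistency, not just fluency. Illustrative template only."""
    facts = "\n".join(f"- {c}" for c in sample.image_captions)
    return (
        f"Image facts:\n{facts}\n\n"
        f"Question: {sample.question}\n"
        f"Response to score: {sample.response}\n"
        "Rate the response for helpfulness AND consistency with the image facts."
    )

sample = RewardSample(
    question="What is the person in the image holding?",
    response="A red umbrella.",
    image_captions=["A man holding a blue guitar on stage."],
)
print(build_reward_input(sample))  # the reward model can now spot the hallucination
```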

Methodology

  1. Multimodal Supervised Fine-tuning: Uses the LLaVA architecture with a pre-trained CLIP image encoder. The vision encoder and language model are jointly fine-tuned.

  2. Multimodal Preference Modeling: A reward model is trained based on human feedback to better score the outputs.

  3. Reinforcement Learning: Utilizes Proximal Policy Optimization (PPO) for policy updates. LoRA with a rank of 64 is used to stabilize the original network during reinforcement learning.
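
As a rough sketch of the LoRA setup in step 3, the snippet below attaches rank-64 adapters to a causal language model with Hugging Face peft before RL fine-tuning. The base model name, target modules, and other hyperparameters are placeholder assumptions, not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base LM; the paper builds on a LLaVA-style model instead.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=64,                                  # rank 64, as described above
    lora_alpha=16,                         # assumed scaling factor
    lora_dropout=0.05,                     # assumed dropout
    target_modules=["q_proj", "v_proj"],   # attention projections; an assumption
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora_cfg)
policy.print_trainable_parameters()  # only the adapters train; original weights stay frozen

# `policy` would then be optimized with PPO against the factually augmented reward model.
```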

Evaluation Metrics
Introduces MMHAL-BENCH, a benchmark focusing on hallucinations and covering realistic, open-ended questions, unlike previous yes/no benchmarks.

Results
Reaches 94% of the performance level of text-only GPT-4 on LLaVA-Bench. Also shows a 60% improvement on MMHAL-BENCH, which explicitly penalizes hallucinations.

🏅 NOTABLE PAPERS

Making PPO even better: Value-Guided Monte-Carlo Tree Search decoding
Score: 9.2 • The paper shows that combining Monte-Carlo Tree Search (MCTS) with Proximal Policy Optimization (PPO) can improve text generation. It introduces PPO-MCTS, an algorithm that uses the value network learned during PPO training to guide decoding alongside the policy network.
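
As a much-simplified sketch of that idea (greedy re-ranking rather than a full search tree), the snippet below scores the policy's top-k candidate tokens with a value estimate of the extended prefix. PPO-MCTS itself runs a proper tree search, so treat this only as an illustration of how the value network can steer token selection.

```python
import torch

def value_guided_step(policy_logits, value_fn, prefix, k=5, alpha=1.0):
    """Pick the next token by combining policy log-probabilities with a value
    estimate of each candidate continuation (illustrative simplification)."""
    log_probs = torch.log_softmax(policy_logits, dim=-1)
    topk = torch.topk(log_probs, k)
    scores = []
    for log_p, tok in zip(topk.values, topk.indices):
        v = value_fn(prefix + [int(tok)])      # value of the extended prefix
        scores.append(log_p + alpha * v)       # blend policy and value signals
    best_idx = int(torch.stack(scores).argmax())
    return int(topk.indices[best_idx])

# Toy usage with a dummy value function that prefers even token ids.
logits = torch.randn(100)
value_fn = lambda seq: 1.0 if seq[-1] % 2 == 0 else 0.0
print(value_guided_step(logits, value_fn, prefix=[1, 7, 42]))
```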

Demystifying CLIP Data
Score: 9.1 • This paper introduces MetaCLIP, a method for curating better training data for language-image models. By focusing only on data and keeping the model and training constant, MetaCLIP outperforms CLIP on key benchmarks. For example, it achieves 70.8% accuracy in zero-shot ImageNet classification, beating CLIP's 68.3%. Code and data are available.

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
Score: 7.2 • The paper presents Show-1, a hybrid text-to-video model that combines pixel-based and latent-based approaches. Show-1 produces low-res video with pixel-based methods, then upscales it using latent-based methods. It achieves high-quality and well-aligned videos while being more efficient in GPU memory use. Code and weights are available.

Hyungjin Chung is a contributing writer at AlphaSignal and a second-year Ph.D. student at the KAIST bio-imaging, signal processing & learning lab (BISPL). He was previously a research intern at the Los Alamos National Laboratory (LANL) applied math and plasma physics group (T-5).

Thank You

How was today’s email?

Not Great      Good      Amazing

Want to promote your company, product, job, or event to 100,000+ AI researchers and engineers? You can reach out here.