
🥇 Top 5 AI Papers You Should Read This Week

AlphaSignal

Hey,

Welcome back to AlphaSignal, where we bring you the latest developments in the world of AI. In the past few days, 3,086 AI papers have been released, and among them we have handpicked the six that truly stand out.

This week, there has been a surge in papers on Large Language Models (LLMs). These papers explore new models, training strategies, and applications spanning domains such as video, audio, and protein modeling. Given the recent NeurIPS submission deadline, this sudden influx of interesting work comes as no surprise.

In an exciting development, Microsoft Research introduces Orca, a 13B model slated for open-source release. Like other open-source models such as Alpaca and Vicuna, Orca is trained on data generated by Large Foundation Models (LFMs) such as ChatGPT and GPT-4. What sets Orca apart is that it learns from the step-by-step reasoning traces of LFMs and uses a dataset of 5M examples, far larger than those used by previous open-source models. Through rigorous evaluations, the authors demonstrate that Orca not only captures the style of LFMs but also outperforms ChatGPT, albeit by a small margin. The imminent release of this model could reshape the open-source LLM community.

There is growing interest in making LLM inference more cost-effective through improved quantization and compression techniques. SpQR presents a method that achieves near-lossless compression even when large models are quantized to 3-4 bits per parameter, enabling a 33B language model to run on a single 24GB commodity GPU. With such advances, we may soon be able to deploy capable LMs on edge devices like mobile phones.
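
To make the idea concrete, here is a minimal sketch of the core mechanism behind SpQR-style compression, assuming group-wise uniform quantization plus a small fraction of outlier weights kept sparse at full precision. The function names, group size, and outlier fraction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: group-wise 3-bit quantization with sparse fp32 outliers.
# Illustrative only; SpQR's actual scheme is more elaborate.
import numpy as np

def quantize_with_outliers(w, bits=3, group_size=16, outlier_frac=0.01):
    """Quantize weights per group; store the largest-magnitude weights
    separately at full precision so they do not inflate group scales."""
    w = w.astype(np.float32).copy()
    # 1. Extract outliers into a sparse (index, value) representation.
    k = max(1, int(outlier_frac * w.size))
    outlier_idx = np.argsort(np.abs(w.ravel()))[-k:]
    outlier_val = w.ravel()[outlier_idx].copy()
    w.ravel()[outlier_idx] = 0.0
    # 2. Uniformly quantize the remaining weights, one scale per group.
    levels = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / (levels / 2) + 1e-12
    q = np.round(groups / scale + levels / 2).clip(0, levels).astype(np.uint8)
    return q, scale, (outlier_idx, outlier_val)

def dequantize(q, scale, outliers, bits=3, shape=None):
    levels = 2 ** bits - 1
    w = ((q.astype(np.float32) - levels / 2) * scale).reshape(shape)
    idx, val = outliers
    w.ravel()[idx] = val  # restore the sparse outliers at full precision
    return w

w = np.random.randn(64, 64) * 0.02
q, s, out = quantize_with_outliers(w)
w_hat = dequantize(q, s, out, shape=w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```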

LLMs have become a universal foundation across modalities. This week, we saw them applied to speech-to-speech translation, video/audio comprehension, and protein sequence alignment. Embedding features into the shared LLM token space has proven advantageous in most cases, though some settings call for alternative approaches that better exploit what LLMs can offer.

If you have any questions, suggestions, or feedback, simply reply to this email to reach out to us, and we will promptly respond.

On Today’s Summary:

  • The Wordcloud of 1000+ papers ☁️

  • Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

  • Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

  • Orca: Progressive Learning from Complex Explanation Traces of GPT-4

  • Other notable papers

FREE Webinar: How to Streamline Workflows with AWS Intelligent Document Processing (IDP)
Are you struggling to manage and process large volumes of documents in your organization? Do you want to improve the accuracy and compliance of your document processing workflows? Then, this webinar is for you.


☁️ ABSTRACTS WORDCLOUD

📄 TOP PUBLICATIONS

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

Score: 9.9  Hang Zhang, Xin Li, Lidong Bing

Summary
The authors propose Video-LLaMA, a multi-modal framework grounded on LLMs that comprehends both the visual and the auditory content of video. For the first time, a model is shown to faithfully accomplish text-guided tasks from a video's combined visual and auditory signals.

The model consists of two major branches: a vision-language branch and an audio-language branch. The vision-language branch uses a frozen visual encoder to map individual frames into feature space, followed by a video Q-Former and linear layers that turn the tokens into video soft prompts from which the LLM generates the following text. The audio-language branch uses a frozen ImageBind audio encoder followed by an audio Q-Former and linear layers that embed audio features into the LLM token space.
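
Here is a simplified sketch of that vision-language branch with toy stand-in modules: the real model uses a frozen ViT-based encoder and a BLIP-2-style Q-Former, while the dimensions, module choices, and names below are illustrative assumptions only.

```python
# Toy sketch of the vision-language branch: frozen per-frame encoder ->
# video Q-Former -> linear projection into the LLM token-embedding space.
import torch
import torch.nn as nn

class VisionLanguageBranch(nn.Module):
    def __init__(self, frame_dim=1024, n_queries=32, llm_dim=4096):
        super().__init__()
        # Stand-in for a frozen image encoder applied to each frame.
        self.frame_encoder = nn.Linear(3 * 224 * 224, frame_dim)
        self.frame_encoder.requires_grad_(False)
        # "Video Q-Former": learned queries attend over all frame tokens
        # to produce a fixed-length video representation.
        self.queries = nn.Parameter(torch.randn(n_queries, frame_dim))
        self.qformer = nn.MultiheadAttention(frame_dim, num_heads=8,
                                             batch_first=True)
        # Projection into the LLM embedding space; the outputs act as
        # "video soft prompts" prepended to the text token embeddings.
        self.proj = nn.Linear(frame_dim, llm_dim)

    def forward(self, frames):                        # (B, T, 3, 224, 224)
        B, T = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(2))  # (B, T, frame_dim)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        video_tokens, _ = self.qformer(q, feats, feats)
        return self.proj(video_tokens)               # (B, n_queries, llm_dim)

branch = VisionLanguageBranch()
soft_prompts = branch(torch.randn(2, 8, 3, 224, 224))  # 8 frames per video
print(soft_prompts.shape)  # torch.Size([2, 32, 4096])
```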

Each branch of Video-LLaMA is trained separately. In the first stage, large-scale video-caption datasets such as WebVid-2M, along with image-caption datasets such as CC595k, are used to train the model on video-to-text generation. In the second stage, the model is fine-tuned on image/video instruction-following data from, e.g., MiniGPT-4, LLaVA, and VideoChat. Surprisingly, the audio-language branch is also trained on visual-text data, yet exhibits zero-shot audio comprehension at inference, owing to ImageBind's shared feature space across modalities.
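
The zero-shot transfer noted above follows from that shared embedding space. The toy snippet below (not ImageBind's actual API; all names are stand-ins) illustrates why a projector trained only on image embeddings also accepts audio embeddings at inference.

```python
# If one encoder maps images and audio into a single shared space, a
# projector trained on image embeddings applies unchanged to audio.
import torch
import torch.nn as nn

shared_dim, llm_dim = 512, 4096
projector = nn.Linear(shared_dim, llm_dim)  # trained with image-text pairs

image_emb = torch.randn(1, shared_dim)  # stand-in for ImageBind(image)
audio_emb = torch.randn(1, shared_dim)  # stand-in for ImageBind(audio)

# Both embeddings live in the same 512-d space, so the same projector
# produces LLM-space tokens for either modality:
print(projector(image_emb).shape, projector(audio_emb).shape)
```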

The resulting model can integrate audio-visual signals, grasp common knowledge concepts, and capture temporal dynamics in videos: a combination of abilities not previously reported in the literature.

Details
Submitted on June 5th • Computer Vision

Want to promote your company, product, job, or event to 100,000+ AI researchers and engineers? You can reach out here.

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

Score: 9.2  Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi

Summary
Reinforcement Learning from Human Feedback (RLHF) is a critical ingredient of modern Large Language Models (LLMs), enabling significant gains in instruction following. In a broad sense, reward models (RMs) offer a better objective function than the one used during pre-training. Typically, a single RM is trained on preferences from human labelers and then used to train the policy LM through RL. But what if, instead of relying on one general preference, we could employ multiple finely-tailored RMs, each serving a specific purpose?

The authors of this paper propose exactly that: training distinct types of RMs designed to assess relevance, factuality, and information completeness. During the RL phase, a weighted combination of these rewards drives policy updates. Moreover, departing from the conventional practice of a single holistic reward, they assign denser rewards on a per-sentence basis, which further improves performance.
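
A minimal sketch of that reward combination, with illustrative names rather than the paper's code: each sentence of a generated response is scored by several specialized reward models, and a weighted sum of those per-sentence scores replaces the usual single holistic scalar.

```python
# Dense, fine-grained rewards: one weighted score per sentence.
from typing import Callable, Dict, List

def combined_sentence_rewards(
    sentences: List[str],
    reward_models: Dict[str, Callable[[str], float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted combination of specialized RM scores, per sentence."""
    return [
        sum(weights[name] * rm(sent) for name, rm in reward_models.items())
        for sent in sentences
    ]

# Toy stand-ins for the relevance / factuality / completeness RMs.
rms = {
    "relevance":    lambda s: 1.0 if "answer" in s else 0.2,
    "factuality":   lambda s: 0.9,
    "completeness": lambda s: min(len(s.split()) / 20.0, 1.0),
}
# Re-weighting changes behavior: a high relevance weight favors short,
# on-topic sentences; a high completeness weight favors longer ones.
weights = {"relevance": 1.0, "factuality": 0.5, "completeness": 0.3}

print(combined_sentence_rewards(
    ["Here is the answer.", "It rests on two observations."], rms, weights))
```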

By leveraging such fine-grained rewards, the resulting model outperforms the baseline RLHF model. Furthermore, adjusting the weights assigned to each RM customizes the LM's behavior: a high weight on the relevance RM produces shorter sentences, while prioritizing the information-completeness RM yields longer, more informative responses.

This approach not only surpasses the baseline RLHF model but also provides a means to tailor the LM's behavior to specific requirements, opening up possibilities for nuanced control over model outputs.

Details
Submitted on June 2nd • NLP • Computation and Language • LLMs

Orca: Progressive Learning from Complex Explanation Traces of GPT-4

Score: 8.5  Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawahar, Sahaj Agarwal, Hamid Palangi, Ahmed Awadallah

Summary
Recent evidence suggests that OpenAI and Google hold a distinct advantage, often referred to as a "moat", over open-source language models. While models like Alpaca and Vicuna replicate the stylistic aspects of large foundation models (LFMs), they struggle to capture the underlying content. Consequently, under comprehensive assessments that go beyond a limited set of benchmarks, significant gaps persist between these open-source models and the corporate LFMs.

Addressing this gap, a team of researchers from Microsoft introduces Orca, a 13-billion-parameter model. Unlike its counterparts, Orca not only imitates the reasoning processes of LFMs but also achieves remarkable performance gains, surpassing other open-source models by a substantial margin and even outperforming ChatGPT on multiple benchmarks. The authors attribute this to identifying and resolving two key limitations of previous methods.

First, instead of training solely on instruction-output pairs, Orca is trained to emulate the intricate reasoning traces exhibited by LFMs, inducing comprehensive step-by-step thought processes. Second, while earlier approaches relied on a limited number of training samples, Orca leverages a much larger dataset of 5 million examples. Combined with a progressive learning strategy of initial training on ChatGPT outputs followed by fine-tuning on GPT-4 outputs, Orca-13B achieves exceptional performance on the BigBench-Hard benchmark, surpassing the current state-of-the-art open-source model, Vicuna-13B, by more than 100%. It even manages to outperform ChatGPT, albeit by a slight margin, while still lagging behind GPT-4.
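
To illustrate the data format this implies, here is a hypothetical sketch of an "explanation tuning" example: the target is the teacher's full reasoning trace rather than just the final answer, paired with a system message that elicits step-by-step reasoning. The field names, helper, and sample traces are illustrative assumptions, not Orca's actual pipeline.

```python
# Hypothetical construction of explanation-tuning examples.
import json

SYSTEM_MESSAGE = (
    "You are a helpful assistant. Think step by step and justify your answer."
)

def make_explanation_example(instruction: str, teacher_trace: str) -> dict:
    """One supervised example: the target is the full reasoning trace,
    not just the final answer, so the student imitates the process."""
    return {
        "system": SYSTEM_MESSAGE,
        "instruction": instruction,
        "response": teacher_trace,  # step-by-step output from ChatGPT/GPT-4
    }

# Progressive learning: train first on a large set of ChatGPT traces,
# then continue on a smaller, higher-quality set of GPT-4 traces.
stage1 = [make_explanation_example(
    "Is 1331 a perfect cube?",
    "1331 = 11 * 11 * 11 = 11^3, so yes, it is a perfect cube.")]
stage2 = [make_explanation_example(
    "Why does ice float on water?",
    "Ice is less dense than liquid water, so it floats.")]

for split, data in [("stage1_chatgpt", stage1), ("stage2_gpt4", stage2)]:
    print(split, json.dumps(data[0], indent=2)[:80], "...")
```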

Given the substantial gains from learning through step-by-step explanations, it will be interesting to see how much further other models can improve by adopting the same recipe.

Details
Submitted on June 5th • Foundation Models • GPT

🏅 NOTABLE PAPERS

PolyVoice: Language Models for Speech to Speech Translation
Score: 8.0  Qianqian Dong, Zhiying Huang, Chen Xu, Yunlong Zhao, Kexin Wang, Xuxin Cheng, et al.

Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation
Score: 7.6  Le Zhang, Jiayang Chen, Tao Shen, Yu Li, Siqi Sun

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Score: 7.0  Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, et al.


Hyungjin Chung is a contributing writer at AlphaSignal and a second-year Ph.D. student at the KAIST bio-imaging, signal processing & learning lab (BISPL). He was previously a research intern in the applied mathematics and plasma physics group (T-5) at Los Alamos National Laboratory (LANL).

Thank You