Top Papers - May 8th 2023

AlphaSignal

Hey,

Greetings and welcome back to AlphaSignal!

Over the past 10 days, a staggering 1638 papers have been released. From this wealth of material, we have identified the top 6 papers that stand out from the rest.

Meta AI researchers have released a comprehensive guide to self-supervised learning (SSL), following the recent release of DINOv2. While generative AI focuses primarily on producing realistic samples, SSL concentrates on learning better representations that can be used effectively for downstream tasks. It will be intriguing to watch these two fields compete and converge, as that interplay will help set the direction of the AI community in the coming years.

An open-source community has curated a new dataset, DATACOMP-1B, on which a CLIP model trained from scratch has surpassed OpenAI's original CLIP for the first time. Some may argue that the comparison is not entirely fair, since DATACOMP's filtering strategy itself relies on OpenAI's CLIP model, but it remains a significant achievement, one that had eluded the community for over two years. This win for researchers outside of large industry labs is particularly notable amid the ongoing debate about whether those labs have a "moat" at all.

OpenAI has also introduced a powerful 3D generative model called Shap-E, which generates the parameters of a NeRF directly. The model's code and pre-trained checkpoints are open-sourced, which is unsurprising considering that Alex Nichol is one of the authors. Unlike 2D generative models, the 3D vision community has yet to agree on the best approach to building 3D generative models that balance computational and memory efficiency while maintaining high quality. Shap-E focuses on directly modeling the implicit representation, emphasizing a specific direction for this field.

If you have any questions, suggestions, or feedback, please do not hesitate to reply to this email, and we will get back to you promptly.

On Today’s Summary:

  • The Wordcloud of 1000+ papers ☁️

  • A Cookbook of Self-Supervised Learning

  • Shap-E: Generating Conditional 3D Implicit Functions

  • DataComp: In search of the next generation of multimodal datasets

  • Other notable papers

☁️ ABSTRACTS WORDCLOUD

📄 TOP PUBLICATIONS

A Cookbook of Self-Supervised Learning

Score: 9.9 • Randall Balestriero, Mark Ibrahim, Vlad Sobal, Ari Morcos, Shashank Shekhar, Tom Goldstein, Florian Bordes, Adrien Bardes, et al.

Summary
The objective of this study is to offer a comprehensive overview of the rapidly evolving field of self-supervised learning (SSL), which has recently gained significant attention. The authors categorize and summarize different SSL methods and provide practical advice for training and evaluating these models. Additionally, they conduct experiments to shed light on some unresolved issues in the field, such as the role of projectors, showing that projectors increase robustness to the noise introduced by image augmentations.

The authors classify SSL methods into three categories: 1) the Deep Metric Learning (DML) family (e.g., SimCLR), 2) the self-distillation family (e.g., BYOL, DINO), and 3) the Canonical Correlation Analysis (CCA) family (e.g., VICReg, Barlow Twins). For each family, the authors provide historical background, outlining how it originated and evolved into the modern deep SSL approaches. For instance, the shift from classical DML to modern contrastive SSL came with the use of data augmentation in place of sampling, together with deep networks and projectors.
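To make the contrastive (DML) family concrete, here is a minimal PyTorch sketch of a SimCLR-style NT-Xent loss; the function name and temperature value are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Minimal SimCLR-style NT-Xent loss.

    z1, z2: (N, D) projector outputs for two augmented views of the same batch.
    Each sample's positive is its other view; the remaining 2N - 2 samples act as negatives.
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D) unit-norm embeddings
    sim = z @ z.t() / temperature                        # (2N, 2N) scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # a sample is never its own positive
    n = z1.shape[0]
    # Row i of view 1 pairs with row i of view 2, and vice versa.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```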

Given that the literature on SSL is vast, the authors provide a summary of each major component of SSL, including data augmentation, projectors, and standard hyperparameters such as batch size and learning rate. They offer helpful tips for training SSL models on limited resources, as well as strategies for better convergence in general. This paper can serve as a valuable reference for novice SSL practitioners, allowing them to comprehend and integrate even the most recent advancements.
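As a concrete illustration of two of those components, the sketch below shows a typical SimCLR-style augmentation pipeline and an MLP projector in PyTorch/torchvision; the specific augmentation parameters and projector widths are common defaults assumed here, not values prescribed by the cookbook.

```python
import torch.nn as nn
from torchvision import transforms

# Typical SimCLR-style augmentation pipeline (parameter values are common defaults).
ssl_augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# The projector maps backbone features into the space where the SSL loss is applied;
# it is usually discarded after pretraining, keeping only the backbone for downstream tasks.
projector = nn.Sequential(
    nn.Linear(2048, 8192),   # 2048 = ResNet-50 feature dim; widths are illustrative
    nn.BatchNorm1d(8192),
    nn.ReLU(inplace=True),
    nn.Linear(8192, 256),
)
```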

Details
Submitted on Apr 24 • Computation and Language • Self-Supervised Learning

Want to promote a product, job, or event to 100,000+ AI researchers and engineers? Reach out here.

Shap-E: Generating Conditional 3D Implicit Functions

Score: 9.2 • Heewoo Jun, Alex Nichol

Summary
A team of prominent researchers at OpenAI has introduced a novel 3D generative model called Shap-E, which directly generates the parameters of an implicit neural radiance field (NeRF) MLP. Unlike DreamFusion-based methods, which require optimizing a NeRF from scratch for each object at inference time, Shap-E is considerably faster, taking only 13 seconds to generate a 3D object on a V100 GPU. Shap-E can also directly generate high-resolution textured meshes without requiring any additional super-resolution modules, unlike its predecessor, Point-E. The researchers trained Shap-E on over 1 million 3D assets that were text-labeled by human annotators.

Shap-E consists of three main parts: a 3D encoder that maps both point clouds and 20-view renderings of the asset into the latent space, a latent diffusion model that models the distribution of the latents, and a NeRF MLP that uses the latents as parameters for rendering. In addition, to enable the model to generate textured 3D meshes, an STF output head is added and fine-tuned during the second stage.
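The defining trick is that the latent vector is literally used as the weights of the NeRF MLP. Below is a toy, self-contained illustration of that idea; the layer sizes and function name are made up for clarity, and the real model (and the released code) is much larger and structured differently.

```python
import torch
import torch.nn.functional as F

def nerf_mlp_from_latent(latent, x, hidden=64):
    """Toy illustration of Shap-E's core idea: the latent vector *is* the MLP's weights.

    latent: flat vector produced by the encoder (or sampled by the latent diffusion model).
    x:      (N, 3) query coordinates; returns (N, 4) density + RGB per point.
    """
    i = 0
    w1 = latent[i:i + 3 * hidden].view(hidden, 3); i += 3 * hidden
    b1 = latent[i:i + hidden];                     i += hidden
    w2 = latent[i:i + hidden * 4].view(4, hidden); i += hidden * 4
    b2 = latent[i:i + 4]
    h = F.relu(F.linear(x, w1, b1))
    return F.linear(h, w2, b2)

latent_dim = 3 * 64 + 64 + 64 * 4 + 4      # number of parameters the latent must supply
latent = torch.randn(latent_dim)            # stands in for an encoder output or diffusion sample
out = nerf_mlp_from_latent(latent, torch.rand(1024, 3))  # out.shape == (1024, 4)
```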

The model is inherently multi-representational, as it can be rendered both as textured meshes and as NeRFs. Moreover, in Appendix D, the researchers provide a method to guide Shap-E in image space, which allows researchers to leverage the score distillation loss from DreamFusion, combining the best of both worlds. As the inference code and the model are open-sourced, it will be fascinating to see how researchers utilize this new tool.
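For reference, the score distillation sampling (SDS) gradient from DreamFusion that this guidance builds on has the form below, where $\theta$ parameterizes the 3D representation, $x = g(\theta)$ is a rendered view, $x_t$ is its noised version, and $\hat{\epsilon}_\phi$ is the frozen diffusion model's noise prediction (notation follows DreamFusion, not Shap-E's appendix):

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]$$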

Details
Submitted on May 3 • Computer Vision and Pattern Recognition • Diffusion

DataComp: In search of the next generation of multimodal datasets

Score: 8.5 • Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, et al.

Summary
In the realm of machine learning (ML), much research has been focused on developing better algorithms and optimization strategies to improve performance on fixed benchmark datasets. However, what happens when the situation is reversed, and the goal is to design a better dataset while fixing the algorithm? With the growing prominence of large foundation models, the importance of large-scale data collection and curation has become increasingly crucial. This is the motivation behind DATACOMP, a benchmark proposed by researchers from various organizations that presents new training sets while fixing the training code.

The authors introduce COMMONPOOL, a dataset of 12.8 billion image-text pairs collected from Common Crawl, as the candidate pool for DATACOMP. They apply a filtering strategy that combines LAION-style CLIP score thresholding with image-based filtering using ImageNet features, resulting in DATACOMP-1B, a set of 1.4 billion image-text pairs that can be used to train a state-of-the-art, open-sourced CLIP model from scratch. Remarkably, a CLIP ViT-L/14 model trained on it with a compute budget of 12.8 billion samples seen achieves 79.2% zero-shot accuracy on ImageNet.
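To give a sense of what CLIP score-based filtering looks like in practice, here is a hedged sketch using the open_clip library; the model choice and the 0.3 threshold are assumptions for illustration, not the exact DataComp recipe.

```python
import torch
import open_clip
from PIL import Image

# Illustrative only: the backbone and threshold below are assumptions, not DataComp's settings.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def clip_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings for one pair."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    tokens = tokenizer([caption])
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(tokens)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def keep_pair(image_path: str, caption: str, threshold: float = 0.3) -> bool:
    # Keep only pairs whose image-text similarity exceeds the threshold.
    return clip_score(image_path, caption) > threshold
```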

Apart from the main findings, the paper and project also include more than 300 baseline experiments with varying compute budgets and model sizes. There is also a BYOD (bring your own data) track, which enables users to utilize external datasets in addition to the proposed benchmark datasets. As this is the start of a new generation of multimodal datasets, it will be intriguing to see what contributions this initiative brings to the community.

Details
Submitted on Apr 27 • Computer Vision and Pattern Recognition

🏅 NOTABLE PAPERS

Stable and low-precision training for large-scale vision-language models
Score: 7.5 • Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari Morcos, Ali Farhadi, Ludwig Schmidt

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
Score: 6.9 • Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, Shinji Watanabe

Patch-based 3D Natural Scene Generation from a Single Example
Score: 6.3 • Weiyu Li, Xuelin Chen, Jue Wang, Baoquan Chen

How was today’s email?

Not Great      Good      Amazing

Hyungjin Chung is a contributing writer at AlphaSignal and a second-year Ph.D. student in the Bio-Imaging Signal Processing & Learning lab (BISPL) @KAIST. He was previously a research intern in the applied math and plasma physics group (T-5) at Los Alamos National Laboratory (LANL).

Thank You