The field of artificial intelligence is changing rapidly, so reviewing Papers on Hugging Face is essential for keeping up with the most recent research. Hugging Face has created a unique space where researchers not only share their work but also engage with the community by upvoting, commenting, and discussing it with others. The platform helps users discover the latest breakthroughs in AI and spotlights the papers that the community considers most popular and influential. In this article, I highlight the collective interests of researchers and practitioners on Hugging Face by presenting papers that have attracted attention for their innovative approaches and findings.
Recent research explores new approaches to language model reasoning, such as the SELF-DISCOVER framework, which enables models to autonomously compose reasoning structures and thereby improves performance on complex tasks. Other studies show that chain-of-thought reasoning can emerge without explicit prompting, enhancing logical consistency and model confidence.
This paper introduces the SELF-DISCOVER framework, which allows LLMs to autonomously construct reasoning structures for specific tasks. The authors argue that traditional prompting methods are limited in handling complex reasoning tasks. SELF-DISCOVER enables LLMs to select from various atomic reasoning modules, like critical thinking and step-by-step reasoning. These modules are then composed into a coherent structure for task execution. The framework significantly improves performance on benchmarks like BigBench-Hard and MATH, outperforming existing methods by up to 32%. It also requires 10-40 times fewer inference steps, reducing computational effort. Additionally, the self-discovered reasoning structures align with human reasoning patterns, improving interpretability and adaptability across models like GPT-4 and Llama2.
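To make the two-stage procedure concrete, here is a minimal sketch of the idea, assuming a generic `llm(prompt)` completion helper; the module list and the SELECT/ADAPT/IMPLEMENT prompts are illustrative placeholders rather than the authors' implementation.

```python
# Minimal sketch of the SELF-DISCOVER two-stage idea; `llm(prompt: str) -> str`
# is a hypothetical completion helper, and the prompts/modules are illustrative.

ATOMIC_MODULES = [
    "Use critical thinking to analyze the problem from different angles.",
    "Break the problem down into smaller, sequential steps.",
    "Reflect on possible edge cases and verify intermediate results.",
]

def self_discover_structure(task_examples: list[str], llm) -> str:
    """Stage 1: compose a task-specific reasoning structure (no labels needed)."""
    modules = "\n".join(f"- {m}" for m in ATOMIC_MODULES)
    selected = llm(f"Select the reasoning modules useful for these tasks:\n"
                   f"{task_examples}\nModules:\n{modules}")
    adapted = llm(f"Rephrase the selected modules so they are specific to the task:\n{selected}")
    structure = llm(f"Operationalize the adapted modules into a step-by-step "
                    f"reasoning plan in JSON:\n{adapted}")
    return structure

def solve(instance: str, structure: str, llm) -> str:
    """Stage 2: follow the discovered structure to solve each task instance."""
    return llm(f"Follow this reasoning structure step by step and fill in the values:\n"
               f"{structure}\n\nTask: {instance}\nAnswer:")
```

Because the structure is discovered once per task and then reused for every instance, the extra prompting cost is amortized, which is where the reported savings in inference compute come from.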
This study investigates whether LLMs can engage in chain-of-thought (CoT) reasoning without explicit prompting. Traditionally, CoT prompting involves providing examples that guide models to generate logical reasoning steps before arriving at an answer. This paper posits that LLMs can inherently produce CoT paths through a modified decoding process called CoT decoding. By examining top-k alternative tokens during decoding rather than relying solely on greedy decoding, the authors find that CoT paths emerge naturally and lead to higher confidence in the model's responses. Empirical results indicate that this approach significantly enhances performance on various reasoning benchmarks compared to standard decoding methods.
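The core mechanism is easy to sketch: branch on the top-k candidates for the first decoded token, continue each branch greedily, and keep the path whose tokens are decoded with the largest probability margin. The snippet below is a simplified illustration of that idea using a small stand-in model; it is not the paper's exact scoring, which measures confidence only over the answer span.

```python
# Simplified sketch of CoT decoding: branch on the top-k first tokens, continue
# greedily, and score each continuation by its average top-1 vs. top-2 probability gap.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model; the paper evaluates much larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 64):
    inputs = tok(prompt, return_tensors="pt")
    first_logits = model(**inputs).logits[0, -1]      # logits for the next token
    top_k = torch.topk(first_logits, k).indices       # branch on top-k alternatives
    best = None
    for token_id in top_k:
        branch = torch.cat([inputs.input_ids, token_id.view(1, 1)], dim=-1)
        out = model.generate(branch, max_new_tokens=max_new_tokens, do_sample=False,
                             output_scores=True, return_dict_in_generate=True)
        # Confidence: average gap between top-1 and top-2 probabilities per step.
        gaps = []
        for step_scores in out.scores:
            top2 = torch.topk(step_scores.softmax(dim=-1)[0], 2).values
            gaps.append((top2[0] - top2[1]).item())
        confidence = sum(gaps) / len(gaps)
        text = tok.decode(out.sequences[0], skip_special_tokens=True)
        if best is None or confidence > best[0]:
            best = (confidence, text)
    return best  # (confidence, most confident decoded path)
```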
The research paper “ReFT: Representation Finetuning for Language Models” introduces Representation Finetuning (ReFT), a family of methods that modify the hidden representations of large language models (LLMs) rather than their weights. The authors propose Low-rank Linear Subspace ReFT (LoReFT), which uses a low-rank projection matrix to learn task-specific interventions while keeping the base model frozen. LoReFT is more parameter-efficient than traditional parameter-efficient finetuning (PEFT) techniques, matching or exceeding existing methods while using 15 to 65 times fewer parameters across benchmarks that include commonsense reasoning and arithmetic tasks.
The paper also presents DiReFT, an ablation of LoReFT that trades some performance for efficiency, and situates the work within the broader context of PEFT strategies. The study shows that representation editing can enhance model control without significant computational cost, and the authors advocate further exploration of ReFT as a viable alternative to conventional finetuning. Their findings highlight the potential for improved interpretability of model behavior and provide valuable insights into the development of efficient adaptation methods for LLMs.
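For intuition, the sketch below implements a LoReFT-style intervention following the paper's formulation, editing a hidden state only within a low-rank subspace; the wiring and parameter names are illustrative and do not reproduce the authors' released code.

```python
# Sketch of a LoReFT-style intervention on a frozen model's hidden states,
# following the formulation h <- h + R^T (W h + b - R h), where only the small
# matrices R, W, b are trained. Illustrative, not the authors' implementation.
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # R projects into an r-dimensional subspace (rows initialized orthonormal).
        self.R = nn.Parameter(torch.empty(rank, hidden_dim))
        nn.init.orthogonal_(self.R)
        self.W = nn.Linear(hidden_dim, rank)  # learned edit source: W h + b

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Edit h only within the subspace spanned by the rows of R.
        return h + (self.W(h) - h @ self.R.T) @ self.R

# Usage: attach to a frozen transformer layer's output (e.g. via a forward hook)
# and train only the intervention parameters on the downstream task.
intervention = LoReFTIntervention(hidden_dim=768, rank=4)
h = torch.randn(2, 16, 768)          # (batch, seq, hidden) hidden states
print(intervention(h).shape)         # torch.Size([2, 16, 768])
```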
Research on vision-language models (VLMs) examines key architectural decisions, showing that fully autoregressive architectures outperform cross-attention ones. The Idefics2 model achieves state-of-the-art results in its size category, and the ShareGPT4Video initiative demonstrates how precise captions improve video understanding and generation in multimodal models.
The paper “What matters when building vision-language models?” by Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh examines the critical design choices in developing vision-language models (VLMs). The authors observe that many decisions regarding model architecture, data selection, and training methods are often made without sufficient justification, hindering progress in the field. To address this, they conduct extensive experiments focusing on pre-trained models, architectural choices, data, and training methodologies. Their findings highlight that advancements in VLMs are largely driven by improvements in unimodal backbones, and they emphasize the superiority of fully autoregressive architectures over cross-attention ones, provided that training stability is maintained.
As a practical application of their research, the authors introduce Idefics2, an efficient foundational VLM comprising 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks and often rivals models four times its size. The model, along with the datasets created for its training, has been made publicly available, contributing valuable resources to the research community.
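Since the model is openly released, trying it out is straightforward. The sketch below assumes a recent `transformers` release with Idefics2 support; the image URL and prompt are placeholders.

```python
# Minimal usage sketch for Idefics2, assuming a recent `transformers` version
# with Idefics2 support; the image URL and prompt are placeholders.
import requests, torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto")

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)  # placeholder
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```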
The paper “ShareGPT4Video: Improving Video Understanding and Generation with Better Captions” introduces the ShareGPT4Video series, a comprehensive initiative aimed at enhancing video understanding in large video-language models (LVLMs) and improving video generation in text-to-video models (T2VMs) through the provision of dense and precise captions.
This series includes three key components: (1) ShareGPT4Video, a dataset with 40,000 dense video captions annotated by GPT-4V, covering videos of various lengths and sources. It was developed using meticulous data filtering and annotation strategies. (2) ShareCaptioner-Video, an efficient captioning model that annotates arbitrary videos. It has generated 4.8 million high-quality aesthetic video captions. (3) ShareGPT4Video-8B, a streamlined and effective LVLM that achieves state-of-the-art performance across advanced multimodal benchmarks.
The authors highlight the importance of high-quality, detailed captions for advancing LVLMs and T2VMs: by providing precise, extensive video descriptions, ShareGPT4Video improves model performance in video comprehension and generation and deepens the understanding of video content. The dataset and models are publicly available, offering valuable resources that encourage further exploration and development in video understanding and generation.
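For readers who want to inspect the captions, a brief loading sketch is shown below; the Hub repository id and field names are assumptions, so check the project page for the exact names.

```python
# Sketch of browsing the caption data with the `datasets` library; the repository
# id "ShareGPT4Video/ShareGPT4Video" and the field names are assumptions -- consult
# the paper's project page for the exact dataset name and configuration.
from datasets import load_dataset

ds = load_dataset("ShareGPT4Video/ShareGPT4Video", split="train", streaming=True)
sample = next(iter(ds))
print(sample.keys())   # e.g. video id, source, and the dense GPT-4V caption fields
```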
Models like Depth Anything V2 enhance monocular depth estimation by combining synthetic data with large-scale pseudo-labeled images for better accuracy and efficiency, while Visual Autoregressive Modeling presents a new method for scalable image generation that offers faster and more accurate results.
The paper “Depth Anything V2” presents an enhanced approach to monocular depth estimation (MDE). It focuses on achieving finer and more robust depth predictions. The authors identify three key practices: replacing all labeled real images with synthetic images for label precision, scaling up the teacher model to enhance learning, and using large-scale pseudo-labeled real images to train student models. This bridges the domain gap between synthetic and real-world data. The methodology results in models that are over ten times faster and more accurate than recent models built on Stable Diffusion. The authors provide models of varying scales, from 25 million to 1.3 billion parameters, for diverse applications.
In addition to the model advancements, the authors address the limitations of current test sets, which often suffer from limited diversity and noise. To facilitate future research, they construct a versatile evaluation benchmark with precise annotations and diverse scenes. This comprehensive approach not only enhances the precision and efficiency of MDE models but also provides valuable resources for the research community to further explore and develop in the field of depth estimation.
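For a quick start, the sketch below runs depth estimation through the `transformers` pipeline, assuming a recent release with Depth Anything support; the checkpoint id is an assumption and may differ from the actual Hub repository.

```python
# Usage sketch via the transformers depth-estimation pipeline; the checkpoint id
# below is assumed and may differ from the released repository names.
from PIL import Image
from transformers import pipeline

depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open("example.jpg")          # placeholder input image
result = depth(image)
result["depth"].save("example_depth.png")  # predicted depth map as a PIL image
```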
The paper “Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction” introduces a novel paradigm for image generation by redefining autoregressive learning on images as a coarse-to-fine “next-scale prediction” process, diverging from the traditional raster-scan “next-token prediction” approach. This methodology enables autoregressive transformers to learn visual distributions more efficiently and generalize effectively. Notably, the proposed Visual AutoRegressive (VAR) model surpasses diffusion transformers in image generation tasks. On the ImageNet 256×256 benchmark, VAR significantly improves the Fréchet Inception Distance (FID) from 18.65 to 1.73 and the Inception Score (IS) from 80.4 to 350.2, achieving these enhancements with approximately 20 times faster inference speed.
Furthermore, the authors empirically demonstrate that VAR outperforms the Diffusion Transformer (DiT) across multiple dimensions, including image quality, inference speed, data efficiency, and scalability. Scaling up VAR models reveals clear power-law scaling laws akin to those observed in large language models, with linear correlation coefficients near -0.998, indicating strong evidence of scalability. Additionally, VAR exhibits zero-shot generalization capabilities in downstream tasks such as image in-painting, out-painting, and editing. These findings suggest that VAR has begun to emulate two crucial properties of large language models: scaling laws and zero-shot task generalization. The authors have made all models and codes publicly available to encourage further exploration of autoregressive models for visual generation and unified learning.
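The generation loop behind next-scale prediction can be summarized schematically as follows; `transformer` and `decode_to_image` are hypothetical helpers, and the scale schedule is illustrative rather than taken from the released code.

```python
# Schematic sketch of coarse-to-fine "next-scale prediction"; `transformer` and
# `decode_to_image` are hypothetical callables standing in for the autoregressive
# transformer and the VQ decoder, and the scale schedule is illustrative.
import torch

SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]   # token-map side lengths, coarse to fine

def generate(transformer, decode_to_image, class_label: int) -> torch.Tensor:
    token_maps = []                           # all coarser scales generated so far
    for side in SCALES:
        # Predict every token of the next scale in parallel, conditioned on the
        # class label and on all previously generated (coarser) token maps.
        logits = transformer(class_label, token_maps, target_side=side)  # (side*side, vocab)
        next_map = logits.argmax(dim=-1).reshape(side, side)
        token_maps.append(next_map)
    # The multi-scale token maps are decoded back to pixels by the VQ decoder.
    return decode_to_image(token_maps)
```

Predicting an entire scale in parallel, instead of one raster-scan token at a time, is what gives VAR its large inference-speed advantage over token-by-token autoregressive generation.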
The Megalodon architecture efficiently handles unlimited context lengths, improving long-sequence processing over traditional Transformers. In the legal domain, SaulLM-54B and SaulLM-141B advance domain adaptation through specialized pretraining and alignment, achieving state-of-the-art results on legal benchmarks.
The paper “Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length” introduces a novel architecture that addresses the limitations of Transformers in handling long sequences, namely quadratic attention complexity and a bounded context length. Megalodon builds on the MEGA architecture with several key enhancements: a complex exponential moving average (CEMA), timestep normalization layers, a normalized attention mechanism, and a pre-norm configuration with two-hop residual connections. Together, these innovations allow Megalodon to efficiently process sequences of unlimited context length.
In empirical evaluations, Megalodon demonstrates superior efficiency compared to Transformers, particularly at the scale of 7 billion parameters and 2 trillion training tokens. It achieves a training loss of 1.70, positioning it between Llama2-7B (1.75) and Llama2-13B (1.67). Furthermore, Megalodon outperforms Transformers across various benchmarks, showcasing its robustness across different tasks and modalities. The authors have made the code publicly available, facilitating further research and development in efficient sequence modeling with extended context lengths.
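To give a flavor of the CEMA component, here is a toy sketch of a complex-valued damped moving average over a sequence; it illustrates the general recurrence only and is not the paper's exact parameterization.

```python
# Toy sketch of a complex-valued damped moving average, illustrating the general
# idea behind CEMA; NOT the paper's exact parameterization, just a recurrence of
# the form h_t = a * x_t + (1 - a) * e^{i*theta} * h_{t-1}.
import torch

def complex_ema(x: torch.Tensor, alpha: float = 0.3, theta: float = 0.1) -> torch.Tensor:
    """x: (seq_len, dim) real input; returns the real part of the filtered sequence."""
    rot = torch.polar(torch.tensor(1.0), torch.tensor(theta))   # e^{i*theta}
    h = torch.zeros(x.shape[1], dtype=torch.cfloat)
    out = []
    for t in range(x.shape[0]):
        h = alpha * x[t].to(torch.cfloat) + (1 - alpha) * rot * h
        out.append(h.real)
    return torch.stack(out)

y = complex_ema(torch.randn(128, 16))
print(y.shape)   # torch.Size([128, 16])
```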
The paper “SaulLM-54B & SaulLM-141B” introduces two LLMs tailored for legal applications, with 54 billion and 141 billion parameters respectively, built on the Mixtral architecture. The models were developed through large-scale domain adaptation: continued pretraining on a corpus of over 540 billion legal tokens, a specialized legal instruction-following protocol, and alignment of model outputs with human preferences in legal interpretations. The integration of synthetic data further boosts their ability to process legal texts, and the resulting models surpass previous open-source models on benchmarks such as LegalBench-Instruct.
This work explores the trade-offs involved in domain-specific adaptation at such a large scale, offering insights that may inform future studies on domain adaptation using strong decoder models. Building upon the earlier SaulLM-7B, this study refines the approach to produce LLMs better equipped for legal tasks. To facilitate reuse and collaborative research, the authors have released base, instruct, and aligned versions of SaulLM-54B and SaulLM-141B under the MIT License.
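Using the released checkpoints should follow the standard `transformers` workflow; the sketch below is a generic example with a hypothetical repository id, so consult the authors' organization page for the exact model names.

```python
# Generic loading sketch; "Equall/SaulLM-54B-Instruct" is a hypothetical repository
# id used for illustration -- check the authors' Hugging Face organization for the
# exact released names of the base, instruct, and aligned variants.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "Equall/SaulLM-54B-Instruct"  # hypothetical id
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user",
             "content": "Summarize the doctrine of consideration in contract law."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tok.decode(output[0], skip_special_tokens=True))
```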
This article on “Top Upvoted Papers on HuggingFace” highlights influential research that resonated with the Hugging Face community. The selection celebrates the work of researchers and promotes knowledge sharing among AI practitioners, while the dynamic engagement on Hugging Face reflects current trends and helps readers stay informed about cutting-edge AI research. As AI evolves, staying aware of such influential studies is crucial for practitioners.