We launched a referral program with perks like CodeHub AI free for 1 year (!) and 1:1 expert career coaching. You can unlock these perks starting with just 1 referral!
Welcome to another edition of AI Insights, your gateway to the dynamic realm of artificial intelligence. In this issue, our CTO and AI Researcher, Vishwas Mruthyunjaya, has hand-picked 5 papers that guide you through the groundbreaking discoveries, inventive methodologies, and visionary ideas currently shaping the trajectory of AI.
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples (🔗 Read the Paper)
As large language models are trained on ever-larger datasets, concerns about benchmark reliability arise due to potential contamination. This paper shows that prevailing decontamination methods fail to catch rephrased variants of test data, and the authors propose a more robust LLM-based decontamination method to mitigate these risks.
Insufficiency of String Matching: Traditional decontamination methods, such as n-gram overlap string matching, are shown to be insufficient: simple rephrasings of test samples evade them (see the sketch after this list).
Overfitting Risk: If rephrased variants of test data are not removed, a 13B model can overfit benchmarks and achieve drastically inflated scores, as demonstrated on benchmarks like MMLU, GSM8k, and HumanEval.
Identifying Contamination: The proposed LLM-based decontamination method reveals significant test-set contamination in widely used training datasets, including 8-18% overlap with the HumanEval benchmark.
Community Action: The authors advocate for stronger decontamination approaches and encourage the active development of new exams to safeguard against unintentional contamination risks.
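To make the string-matching weakness concrete, here is a minimal sketch of an n-gram overlap check, the style of decontamination the authors argue is insufficient. The whitespace tokenizer, the choice of n, and the toy strings are illustrative assumptions, not the paper's exact setup:

```python
# Minimal sketch of n-gram overlap decontamination (the baseline the paper
# argues is insufficient). Tokenization and n are simplified assumptions.

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps(train_doc: str, test_doc: str, n: int = 8) -> bool:
    """Flag a training document if it shares any n-gram with a test sample."""
    return bool(ngrams(train_doc, n) & ngrams(test_doc, n))

test = "What is the capital of France? Answer with the city name only."
verbatim = "What is the capital of France? Answer with the city name only."
rephrased = "Name the French capital city and reply with just the city."

print(overlaps(verbatim, test))   # True  -- verbatim copy is caught
print(overlaps(rephrased, test))  # False -- rephrased contamination slips through
```

The rephrased sample carries the same answer yet shares no long n-gram with the test set, which is exactly why the authors turn to an LLM-based detector instead.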
Pretrained Transformers from Language Models for Visual Encoding (🔗 Read the Paper)
This paper presents a surprising revelation about the latent capabilities of pretrained transformers from Large Language Models (LLMs). Despite being trained exclusively on textual data, these transformers prove to be remarkably capable encoders for visual tasks, challenging the conventional view that confines them to producing text embeddings or token outputs.
Unexpected Strength in Visual Encoding: The research showcases the untapped potential of LLMs as versatile encoders for visual data, expanding beyond their conventional roles.
Innovative Approach: A frozen transformer block from a pre-trained LLM is incorporated directly into the visual encoder, a previously overlooked strategy.
Simple Three-Step Procedure: The proposed method extracts a frozen transformer block from a pre-trained LLM and appends it to the visual encoder, aligns feature dimensions with trainable linear layers, and keeps the LLM block frozen while optimizing all other modules during training (see the sketch after this list).
Visual Representation: The paper intuitively illustrates the process, emphasizing the simplicity of the three-step approach for leveraging the power of LLMs in visual encoding.
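For intuition, here is a minimal PyTorch sketch of the three-step recipe. The dimensions, the backbone interface, and the use of a generic nn.TransformerEncoderLayer as a stand-in for a real LLM block are all illustrative assumptions; the paper works with actual pretrained LLM layers:

```python
import torch
import torch.nn as nn

class LLMBoostedVisualEncoder(nn.Module):
    """Visual encoder with a frozen (LLM-style) transformer block appended.

    Stand-ins: `vit` is any backbone that emits patch tokens; the frozen block
    here is a generic TransformerEncoderLayer in place of a real LLM layer.
    """
    def __init__(self, vit: nn.Module, vit_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit  # trainable visual backbone
        # Step 2: trainable linear layers align feature dimensions.
        self.proj_in = nn.Linear(vit_dim, llm_dim)
        self.proj_out = nn.Linear(llm_dim, vit_dim)
        # Step 1: one transformer block "extracted" from a pretrained LLM
        # (placeholder here; in practice, e.g., a decoder block from an LLM).
        self.llm_block = nn.TransformerEncoderLayer(
            d_model=llm_dim, nhead=32, batch_first=True
        )
        # Step 3: freeze the LLM block; everything else stays trainable.
        for p in self.llm_block.parameters():
            p.requires_grad = False

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.vit(images)    # (batch, num_tokens, vit_dim)
        x = self.proj_in(tokens)     # up-project to the LLM width
        x = self.llm_block(x)        # frozen LLM transformer block
        return self.proj_out(x)      # project back to the visual width
```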
JaSPICE: Automatic Evaluation Metric Using Predicate-Argument Structures for Image Captioning Models (🔗 Read the Paper)
This study addresses the limitations of n-gram-based metrics like BLEU and METEOR in the context of image captioning evaluations. While alternative metrics like SPICE have been proposed for English, there's a gap for non-English languages. In response, this research introduces an automatic evaluation metric, JaSPICE, specifically designed for assessing Japanese captions using scene graphs.
Limitations of N-Gram Metrics: The study acknowledges the shortcomings of traditional n-gram-based metrics in accurately reflecting human evaluation for image captioning.
Proposed Metric: JaSPICE is an automatic evaluation metric tailored for Japanese captions that leverages scene graphs for a more nuanced evaluation.
Scene Graph Generation: The method generates a scene graph from dependencies and predicate-argument structures, extending it with synonyms to improve evaluation accuracy (a simplified scoring sketch follows this list).
Experimental Validation: The study validates JaSPICE using 10 image captioning models trained on STAIR Captions and PFN-PIC, demonstrating stronger correlation with human evaluation than baseline metrics.
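As a rough illustration of the SPICE-family scoring that JaSPICE builds on, here is a simplified sketch that matches candidate tuples against synonym-expanded reference tuples and returns an F1 score. Parsing Japanese captions into tuples is assumed to have happened upstream, and the English placeholder tuples are purely illustrative:

```python
# Simplified SPICE-style scoring: captions are parsed into scene-graph tuples,
# reference tuples are expanded with synonyms, and the score is the F1 between
# candidate and reference tuple sets.

def tuple_f1(candidate: set, reference: set, synonyms: dict) -> float:
    def variants(tup):
        out = {tup}
        for i, word in enumerate(tup):
            for syn in synonyms.get(word, []):
                out.add(tup[:i] + (syn,) + tup[i + 1:])
        return out

    if not candidate or not reference:
        return 0.0
    # A reference tuple matches if any synonym variant appears in the candidate.
    matched_ref = {t for t in reference if variants(t) & candidate}
    # A candidate tuple matches if it is a variant of some reference tuple.
    all_ref_variants = set().union(*(variants(t) for t in reference))
    matched_cand = candidate & all_ref_variants
    if not matched_cand:
        return 0.0
    p = len(matched_cand) / len(candidate)
    r = len(matched_ref) / len(reference)
    return 2 * p * r / (p + r)

# Toy example with English placeholders standing in for Japanese tuples:
ref = {("dog", "run"), ("dog", "attr", "brown")}
cand = {("dog", "dash"), ("dog", "attr", "brown")}
print(tuple_f1(cand, ref, synonyms={"run": ["dash"]}))  # 1.0
```

Without the synonym expansion, ("dog", "dash") would not match ("dog", "run"), which is the kind of brittleness the extension is meant to fix.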
RobustMat: Neural Diffusion for Street Landmark Patch Matching under Challenging Environments (🔗 Read the Paper)
In the domain of autonomous vehicles (AVs), robust visual perception is indispensable, especially through camera-based methods for information processing. This research addresses the intricacies of matching landmark patches captured in diverse environmental conditions. The proposed RobustMat leverages neural differential equations to achieve superior matching results under challenging perturbations.
Importance of Visual Perception for AVs: The study underscores the significance of visual perception techniques, particularly camera-based, in facilitating information processing for autonomous vehicles.
Challenges in Landmark Patch Matching: Recognizing the difficulties posed by changing environmental conditions, the research focuses on matching landmark patches captured at different times or stored in image databases.
RobustMat Approach: RobustMat derives its robustness from neural differential equations, combining a convolutional neural ODE diffusion module for feature representation with a graph neural PDE diffusion module that aggregates information from neighboring patches (a minimal sketch of the ODE feature block follows this list).
Evaluation and Performance: The study evaluates RobustMat on various street scene datasets, demonstrating its effectiveness in achieving state-of-the-art matching results, even under challenging environmental perturbations.
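To give a flavor of the neural-ODE ingredient, here is a minimal sketch of a convolutional ODE feature block using the torchdiffeq library. The architecture, channel counts, and integration horizon are illustrative assumptions, not RobustMat's actual modules:

```python
import torch
import torch.nn as nn
from torchdiffeq import odeint  # pip install torchdiffeq

class ConvODEFunc(nn.Module):
    """dh/dt = f(t, h): a small conv net defines the diffusion dynamics."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.Tanh(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, t, h):
        return self.net(h)

class ODEFeatureBlock(nn.Module):
    """Evolve patch features along a learned ODE, then read the end state."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.func = ConvODEFunc(channels)
        self.t = torch.tensor([0.0, 1.0])

    def forward(self, h0: torch.Tensor) -> torch.Tensor:
        # odeint integrates h from t=0 to t=1 and returns states at both times.
        return odeint(self.func, h0, self.t.to(h0.device))[-1]

patches = torch.randn(8, 64, 32, 32)   # a batch of landmark patch features
features = ODEFeatureBlock()(patches)  # diffusion-smoothed representations
```

The appeal of the continuous-time formulation is that the learned diffusion can smooth away perturbations (lighting, weather, blur) before patches are compared.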
Enhancing Language Models Without Retraining (🔗 Read the Paper)
DARE (Drop And REscale) Operation: Introduces an operation that sets a large proportion (90% to 99%) of delta parameters, the differences between a supervised fine-tuned (SFT) LM and its base model, to zero while rescaling the remainder.
Sparsification Technique: Utilizes DARE as a general preprocessing technique to sparsify delta parameters across multiple homologous SFT models.
Unified Model Formation: Merges the sparsified models into a single, enriched model through parameter averaging, an efficient synthesis of their capabilities (a minimal sketch of DARE and merging follows this list).
Retraining-Free Advancement: Demonstrates the potential for LM capability enhancement without the need for resource-intensive retraining or specialized GPUs.
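Here is a minimal sketch of the drop-and-rescale operation and the subsequent parameter averaging, assuming model weights are plain dicts of tensors. The function names and the uniform averaging are illustrative simplifications:

```python
import torch

def dare(delta: dict, drop_rate: float = 0.9) -> dict:
    """Drop And REscale: zero out most delta parameters, rescale the rest."""
    sparse = {}
    for name, d in delta.items():
        mask = torch.bernoulli(torch.full_like(d, 1.0 - drop_rate))
        sparse[name] = d * mask / (1.0 - drop_rate)  # rescaling preserves E[delta]
    return sparse

def merge(base: dict, sft_models: list, drop_rate: float = 0.9) -> dict:
    """Average DARE-sparsified deltas from homologous SFT models onto the base."""
    merged = {k: v.clone() for k, v in base.items()}
    for sft in sft_models:
        delta = {k: sft[k] - base[k] for k in base}   # delta parameters
        sparse = dare(delta, drop_rate)
        for k in merged:
            merged[k] += sparse[k] / len(sft_models)  # parameter averaging
    return merged
```

Because the surviving deltas are rescaled by 1/(1 - drop_rate), each sparsified model stays close to its original behavior, which is what makes the subsequent averaging work without any retraining.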
Looking for a job? Check out HackerPulse Jobs, where tech companies are looking for ambitious talents like you!