We launched a referral program with perks like CodeHub AI free for 1 year (!) and 1:1 expert career coaching. You can get this stuff starting with just 1 referral!
Welcome to the newest issue of AI Fridays, your gateway to the ever-evolving landscape of artificial intelligence. In this edition, we've selected 5 important papers from across the field.
These picks, curated by our CTO and AI Researcher, Vishwas Mruthyunjaya, offer insights into the latest developments, innovative approaches, and ideas shaping AI.
Before you start, we’ve launched a CV guide on ProductHunt
We put together a practical CV guide for your job search. It's packed with ChatGPT prompts, CV templates, and free access to Coverdoc.ai, a cover letter generator.
Please support us so more people find the guide and have a better job search experience.
Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation (🔗 Read the Paper)
This analysis provides a comprehensive evaluation of GPT-4V(ision), a recently released Large Multimodal Model (LMM), with a focus on its Optical Character Recognition (OCR) capabilities. The study assesses its performance across various OCR tasks and discusses strategies for enhancing its efficiency.
GPT-4V's OCR performance is strong in recognizing and understanding Latin text content.
Challenges arise in multilingual scenarios and complex OCR tasks, suggesting room for improvement.
The study emphasizes the need for specialized OCR models while considering how general LMMs like GPT-4V can be effectively employed for downstream OCR tasks, offering valuable insights for future OCR research involving LMMs.
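If you'd like to poke at GPT-4V's OCR behavior yourself, here's a minimal sketch of one way to do it through the OpenAI API. The model name, prompt, and image path are illustrative assumptions, and this is not the paper's evaluation harness:

```python
# Minimal sketch: probing a GPT-4 vision model's OCR ability via the
# OpenAI API. Assumes OPENAI_API_KEY is set in the environment; the
# model name and prompt are illustrative, not the paper's setup.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ocr_image(path: str) -> str:
    """Ask the model to transcribe all text visible in an image."""
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # a vision-capable model at the time
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text in this image, preserving line breaks."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        max_tokens=512,
    )
    return response.choices[0].message.content

print(ocr_image("receipt.png"))  # "receipt.png" is a placeholder path
```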
LLM-FP4: 4-Bit Floating-Point Quantized Transformers (🔗 Read the Paper)
In this study, LLM-FP4 is introduced as a novel post-training quantization method for large language models (LLMs). It focuses on reducing both weights and activations to 4-bit floating-point values, offering more flexibility than existing integer-based quantization techniques. This research optimizes quantization parameters and addresses challenges in activation quantization, resulting in impressive performance gains for LLMs.
Flexible FP Quantization: LLM-FP4 introduces a flexible 4-bit floating-point quantization method, outperforming integer-based solutions.
Optimal Quantization Parameters: The research identifies optimal quantization parameters through an extensive search, enhancing the FP-PTQ baseline.
Per-Channel Activation Quantization: By addressing inter-channel and intra-channel variance patterns, the study introduces per-channel activation quantization.
Impressive Performance: LLM-FP4 achieves remarkable results, quantizing both weights and activations in LLaMA-13B to 4 bits and significantly outperforming the previous state-of-the-art in zero-shot reasoning tasks.
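To make the 4-bit floating-point idea concrete, here's a minimal NumPy sketch that snaps a weight matrix to an E2M1-style FP4 grid with a per-channel scale. It's illustrative only: LLM-FP4 additionally searches for optimal quantization parameters (such as the exponent bias) and handles activation quantization, which this sketch doesn't reproduce:

```python
# Minimal sketch of 4-bit floating-point (FP4, E2M1-style) quantization
# with per-channel scaling. Illustrative only, not the paper's method.
import numpy as np

# The 8 non-negative magnitudes representable in an E2M1-style FP4 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(w: np.ndarray) -> np.ndarray:
    """Fake-quantize a [out_channels, in_channels] weight matrix to FP4.

    Each output channel gets its own scale so that its largest magnitude
    maps to the largest FP4 value (6.0); every entry is then snapped to
    the nearest representable FP4 magnitude and rescaled back.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)   # avoid divide-by-zero
    scaled = np.abs(w) / scale                 # map into the FP4 range
    idx = np.abs(scaled[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(w) * FP4_GRID[idx] * scale  # dequantized values

w = np.random.randn(4, 16).astype(np.float32)
w_q = quantize_fp4(w)
print("max abs error:", np.abs(w - w_q).max())
```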
CommonCanvas: An Open Diffusion Model Trained with Creative-Commons Images (🔗 Read the Paper)
This paper harnesses Creative-Commons-licensed (CC) images to train text-to-image generative models that are competitive with Stable Diffusion 2 (SD2).
Unique Dataset Creation: A dataset of CC images serves as the basis for training open diffusion models that are qualitatively competitive with SD2.
Challenges Addressed: To overcome the relative scarcity of CC images and their lack of captions at high resolution, a novel transfer learning technique pairs synthetic captions with curated CC images.
Resource-Efficient Recipe: A training recipe is developed that requires significantly less data than existing SD2 models while achieving comparable quality.
Speed Optimizations: Several optimizations in the training recipe yield roughly 3x faster training, facilitating rapid model development. The resulting CommonCanvas models rival SD2's performance in human evaluations.
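The synthetic-captioning step is easy to picture in code. Here's a minimal sketch using an off-the-shelf captioner from Hugging Face; BLIP is an illustrative stand-in here, not necessarily the paper's exact captioner:

```python
# Minimal sketch of generating synthetic captions for uncaptioned
# Creative-Commons images, the core idea behind CommonCanvas's
# transfer-learning recipe. BLIP is an illustrative choice of captioner.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def caption(path: str) -> str:
    """Produce a synthetic caption for one CC-licensed image."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

# Each (image, synthetic caption) pair can then serve as training data
# for a text-to-image diffusion model built on openly licensed images.
print(caption("cc_photo.jpg"))  # "cc_photo.jpg" is a placeholder path
```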
VidChapters-7M: Video Chapters at Scale (🔗 Read the Paper)
Efficiently segmenting lengthy videos into chapters is invaluable for users to quickly access desired content. However, the lack of publicly available datasets has hindered research in this area. This problem is addressed through the introduction of VidChapters-7M, a dataset containing 817,000 user-chaptered videos, encompassing a total of 7 million chapters. Notably, this dataset is generated in an automated fashion by extracting user-annotated chapters from online videos, eliminating the need for manual annotation.
VidChapters-7M Creation: VidChapters-7M is established from online videos in a scalable manner, incorporating 817,000 user-chaptered videos and 7 million chapters. These chapters are sourced from user annotations without requiring additional manual work.
Three Key Tasks: The dataset enables three tasks. The primary task, video chapter generation, involves temporally segmenting a video and generating a chapter title for each segment. Two variants are also introduced: generating chapter titles for already-annotated video segments, and temporally localizing chapters given their titles.
Benchmarking and Pretraining: Both basic and state-of-the-art video-language models are benchmarked on these tasks. Pretraining on VidChapters-7M significantly enhances performance on dense video captioning, improving the state of the art on the YouCook2 and ViTT benchmarks.
Scaling Pretraining: The experiments indicate that downstream performance improves with the size of the pretraining dataset, emphasizing the scalability of this approach.
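Since the dataset is built by harvesting chapter lists that creators already type into video descriptions, the core extraction step can be sketched in a few lines. The timestamp format and regex below are illustrative assumptions about how such lists are usually written:

```python
# Minimal sketch of extracting user-annotated chapters from a video
# description, the kind of automated step behind VidChapters-7M. The
# timestamp format and regex are illustrative assumptions.
import re
from typing import List, Tuple

TIMESTAMP = re.compile(r"^\s*((?:\d{1,2}:)?\d{1,2}:\d{2})\s+(.+)$")

def to_seconds(ts: str) -> int:
    """Convert 'H:MM:SS' or 'M:SS' into seconds."""
    seconds = 0
    for part in ts.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def parse_chapters(description: str) -> List[Tuple[int, str]]:
    """Return (start_seconds, title) pairs from lines like '1:23 Intro'."""
    chapters = []
    for line in description.splitlines():
        m = TIMESTAMP.match(line)
        if m:
            chapters.append((to_seconds(m.group(1)), m.group(2).strip()))
    return chapters

desc = """Great tutorial!
0:00 Introduction
2:15 Setting up the environment
1:03:40 Final thoughts"""
print(parse_chapters(desc))
# [(0, 'Introduction'), (135, 'Setting up the environment'), (3820, 'Final thoughts')]
```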
Boosting Recommender Systems with LLMs (🔗 Read the Paper)
This paper introduces the RLMRec framework, a model-agnostic approach designed to augment conventional recommenders with the power of Large Language Models (LLMs). RLMRec merges representation learning with LLMs to capture the intricate semantic facets of user behaviors and preferences, combining auxiliary textual signals, LLM-driven user/item profiling, and a cross-view alignment between the semantic space of LLMs and the representation space of collaborative relational signals.
RLMRec Framework: RLMRec is a model-agnostic framework that enhances existing recommendation systems by combining representation learning with LLMs to grasp the nuanced semantic aspects of user preferences and behaviors.
Incorporating Textual Signals: RLMRec integrates auxiliary textual signals into the recommendation process. This incorporation allows the system to harness text-based information to refine recommendations and better understand user preferences.
User/Item Profiling with LLMs: The framework employs LLMs to create user and item profiles, enriching the recommendation process. These profiles, driven by the capabilities of LLMs, offer a deeper understanding of user behavior and preferences, enhancing recommendation accuracy.
Cross-View Alignment: RLMRec aligns the semantic space of LLMs with the representation space of collaborative relational signals through a cross-view alignment framework. This alignment facilitates a cohesive and holistic understanding of user preferences, leading to more effective and context-aware recommendations.
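To give a feel for the cross-view alignment idea, here's a minimal PyTorch sketch of a contrastive (InfoNCE-style) loss that pulls each user's collaborative-filtering embedding toward the embedding of their LLM-generated profile. The dimensions and the linear projection are illustrative, and RLMRec derives its actual objective from mutual-information maximization rather than exactly this loss:

```python
# Minimal sketch of cross-view alignment: a contrastive (InfoNCE-style)
# loss pulling each user's collaborative-filtering (CF) embedding toward
# the embedding of their LLM-written profile. Illustrative only.
import torch
import torch.nn.functional as F

def alignment_loss(cf_emb: torch.Tensor,
                   profile_emb: torch.Tensor,
                   proj: torch.nn.Linear,
                   temperature: float = 0.1) -> torch.Tensor:
    """cf_emb: [B, d_cf] collaborative embeddings;
    profile_emb: [B, d_llm] embeddings of LLM-generated user profiles."""
    z_cf = F.normalize(cf_emb, dim=-1)
    z_txt = F.normalize(proj(profile_emb), dim=-1)  # map LLM space -> CF space
    logits = z_cf @ z_txt.t() / temperature          # [B, B] similarity matrix
    labels = torch.arange(cf_emb.size(0))            # positives on the diagonal
    # Symmetric InfoNCE: each view must identify its own counterpart.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

# Toy usage: 8 users, 64-dim CF embeddings, 384-dim profile embeddings
# (all sizes are hypothetical).
proj = torch.nn.Linear(384, 64)
loss = alignment_loss(torch.randn(8, 64), torch.randn(8, 384), proj)
print(loss.item())
```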