Introduction
arXiv hosts preprints that have not yet been peer reviewed. I like browsing it for papers on topics that interest me, but tracking everything down by hand is tedious, so in this article I search it via the API instead.
The repository for this article is published below.
Development Environment
- Python 3.10
- uv
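The only external dependency is feedparser; if the project is managed with uv it can be added with `uv add feedparser` (or `pip install feedparser` in a plain virtual environment).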
Details
What I want to do:
* Find the latest submissions on specific topics I am interested in
* Look at the title and summary of each
The following code pulls the latest submissions matching those topics.
```python
import urllib.parse

import feedparser

# Define the search query
query = 'all:"LLM" OR all:"Text to Speech" OR all:"Speech to Text" OR all:"AI Character"'

# URL-encode the query
encoded_query = urllib.parse.quote(query)

# Base URL of the arXiv API
base_url = 'http://export.arxiv.org/api/query?'

# API parameters
params = {
    'search_query': encoded_query,
    'start': 0,                 # offset of the first result
    'max_results': 10,          # maximum number of results to fetch
    'sortBy': 'submittedDate',  # sort by submission date, newest first
    'sortOrder': 'descending',
}

# Build the query string from the parameters
query_string = '&'.join(f'{key}={value}' for key, value in params.items())

# Build the full API request URL
url = base_url + query_string

# Parse the feed
feed = feedparser.parse(url)

# Print the title and summary of each paper
for entry in feed.entries:
    title = entry.title
    summary = entry.summary.replace('\n', ' ')  # strip newlines for readability
    print(f'Title: {title}')
    print(f'Summary: {summary}')
    print('-' * 80)
```
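A few notes on the request: the `all:` prefix searches every field (title, abstract, authors, and so on), multiple terms are combined with the boolean `OR`, and `sortBy='submittedDate'` together with `sortOrder='descending'` returns the newest submissions first. `start` and `max_results` control paging, so older results can be fetched by increasing `start`.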
The results look like this:
```
Title: Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Summary: LLM-as-a-Judge has been widely utilized as an evaluation method in various benchmarks and served as supervised rewards in model training. However, despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. Therefore, we identify 12 key potential biases and propose a new automated bias quantification framework-CALM-which systematically quantifies and analyzes each type of bias in LLM-as-a-Judge by using automated and principle-guided modification. Our experiments cover multiple popular language models, and the results indicate that while advanced models have achieved commendable overall performance, significant biases persist in certain specific tasks. Empirical results suggest that there remains room for improvement in the reliability of LLM-as-a-Judge. Moreover, we also discuss the explicit and implicit influence of these biases and give some suggestions for the reliable application of LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
--------------------------------------------------------------------------------
Title: Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation
Summary: Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B's win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50-75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.
--------------------------------------------------------------------------------
```
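A single request only returns the first `max_results` entries, so going further back means paging with `start`. The following is a minimal sketch of that idea, reusing the same query as above; the page size, number of pages, and the 3-second pause between requests (to keep the request rate low) are values chosen for this example, not anything prescribed by the article.

```python
import time
import urllib.parse

import feedparser

# Same query and encoding as in the script above
query = 'all:"LLM" OR all:"Text to Speech" OR all:"Speech to Text" OR all:"AI Character"'
encoded_query = urllib.parse.quote(query)
base_url = 'http://export.arxiv.org/api/query?'

page_size = 10  # results per request (arbitrary for this example)
num_pages = 3   # number of pages to fetch (arbitrary for this example)

for page in range(num_pages):
    params = {
        'search_query': encoded_query,
        'start': page * page_size,  # advance the offset for each page
        'max_results': page_size,
        'sortBy': 'submittedDate',
        'sortOrder': 'descending',
    }
    query_string = '&'.join(f'{key}={value}' for key, value in params.items())
    feed = feedparser.parse(base_url + query_string)

    # Print only the titles for this page
    for entry in feed.entries:
        print(entry.title)

    time.sleep(3)  # pause between requests to avoid hammering the API
```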