У нас вы можете посмотреть бесплатно The blind spot of large-scale AI! The mystery of why irrelevant information causes a 56% drop in ... или скачать в максимальном доступном качестве, видео которое было загружено на ютуб. Для загрузки выберите вариант из формы ниже:
Если кнопки скачивания не
загрузились
НАЖМИТЕ ЗДЕСЬ или обновите страницу
Если возникают проблемы со скачиванием видео, пожалуйста напишите в поддержку по адресу внизу
страницы.
Спасибо за использование сервиса ClipSaver.ru
[Compass for the AI Era] Paper Commentary Series Exploring LLM Reasoning Through Controlled Prompt Variations Giannis Chatziveroglou, Richard Yun, Maura Kelleher https://arxiv.org/abs/2504.02111 ⭐️Story Description The story of this video is about a fisherman grandfather explaining to Nyanta that AI is not good at "selecting and discarding information," and using MIT research as an example, talking about the limitations of AI and points to be careful about in the real world. It is revealed that large-scale language models are vulnerable to unnecessary information and can sometimes fall into a "death spiral." ⭐️Point Commentary 1. Main Findings: The most important finding of this study is that in the evaluation of LLM inference robustness, the introduction of irrelevant context causes the largest performance degradation (-55.89%). Surprisingly, the complexity of the inference step did not significantly affect the performance degradation for various input perturbations, and model size and tolerance were not necessarily correlated. Combinatorial perturbations were also confirmed to further worsen performance. 2. Methodology: In the study, we applied four types of input perturbations (irrelevant context, pathological instructions, relevant context, and combined perturbations) to 13 open-source and closed-source models using the GSM8K mathematical problem-solving task and evaluated them. Potential improvements include evaluating the model on a wider variety of problem types, verifying stability by running inference multiple times, and developing benchmarks to evaluate inference robustness. 3. Study limitations: The main limitations of this study are the limited number of data samples (only 4.6% of the entire GSM8K test set was used), the focus on only a single problem domain (mathematical inference), and the inability to fully evaluate some models (Anthropic) due to context window restrictions. These limitations could be addressed by using more diverse datasets, ensuring access to models without API restrictions, and expanding evaluation in multiple domains. 4. Related Work: This work extends previous work such as GSM-Symbolic, GSM-IC, and GSM-PLUS. While these studies investigated LLM inference robustness, this work analyzes the effects of input perturbations more broadly, particularly examining the effects of irrelevant context and combinatorial perturbations in detail. It also provides new insights into model behavior, such as the accidental chain of sorts phenomenon and the death spiral phenomenon, which is an important position to deepen existing knowledge. 5. Future Impact: This work provides important suggestions for improving LLM inference robustness. It is particularly likely to have a large impact on real-world applications, and will contribute to the development of robust inference capabilities in noisy environments. It is also expected that new training methods and prompting strategies will be developed to improve resistance to input perturbations. Improvements in complex context window handling and strengthening the model's filtering capabilities will be important directions for future research. ▶︎Membership only! Early access to videos here: / @compassinai ▶︎Qiita: https://qiita.com/compassinai Arxiv monthly rankings now available!