On KL Regularization in RL Training of Large Language Models

Paper title: A COMEDY OF ESTIMATORS: ON KL REGULARIZATION IN RL TRAINING OF LLMS ...

A Deep Dive into Alibaba's New Work "Let It Flow: Agentic Crafting on Rock and Roll"

Paper title: Building the ROME Model within an Open Agentic Learning Ecosystem Paper link ...

DeepSeek's New Work mHC: Why Constrain Hyper-Connections to a Manifold?

Paper title: mHC: Manifold-Constrained Hyper-Connections Paper link: https://arxiv.org/pd ...

ByteDance Seed's New Work: Tightly Coupling MoE Experts and Routers via an Auxiliary Loss (ERC Loss)

Happy New Year, everyone! Paper title: Coupling Experts and Routers in Mixture-of-Experts via an Auxi ...

Bottom-up Policy Optimization: The Sub-Policies Hidden Inside Your Language Model

Paper title: Bottom-up Policy Optimization: Your Language Model Policy Secretly Cont ...

Google DeepMind's New Work: Emergent Temporal Abstractions in Autoregressive Models Enable Hierarchical Reinforcement Learning

Paper title: Emergent temporal abstractions in autoregressive models enable hierarch ...

Sacrificed Metacognition: How Efficiency-Oriented Optimization Reshapes a Model's Reasoning Structure

Paper title: Schoenfeld’s Anatomy of Mathematical Reasoning by Language Models Paper link ...

Who is Adam? Revisiting Optimizer Choice in the RLVR Stage of Large Models

Blog title: Who is Adam? SGD Might Be All We Need For RLVR In LLMs Blog link: https://w ...

Scaling Laws for Code LLMs: A Study of Programming-Language Differences and Multilingual Mixing Strategies

Paper title: Scaling Laws for Code: Every Programming Language Matters Paper link: https: ...

From 0.5B to 72B: Unpacking the Compute, Data, and Model-Scale Trade-offs in RL Post-Training

Paper title: Scaling Behaviors of LLM Reinforcement Learning Post-Training: An Empir ...