0xC001
Sharing machine learning knowledge
168 articles · 0 comments · 389 likes
Differential Smoothing: Mitigating Distribution Collapse in RL Fine-Tuning and Improving LLM Reasoning
Paper title: Differential Smoothing Mitigates Sharpening and Improves LLM Reasoning
...
Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space (NLAC)
Paper title: Natural Language Actor-Critic: SCALABLE OFF-POLICY LEARNING IN LANGUAGE
...
AAAI 2026: DeltaEdit Enables Sequential Knowledge Editing for LLMs
Paper title: On the Superimposed Noise Accumulation Problem in Sequential Knowledge
...
Search-R1 Reproductions Keep Failing? The Real Culprit Behind Unstable GRPO Training and How to Counter It
Paper title: On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death S
...
PretrainZero: An Active Learning Framework That Moves Reinforcement Learning Forward into the Pretraining Stage
Paper title: PretrainZero: Reinforcement Active Pretraining
Paper link: https://arxiv.org
...
Bias Correction and Confidence Interval Construction in LLM-as-a-Judge Evaluation
Paper title: How to Correctly Report LLM-as-a-Judge Evaluations
Paper link: https://arxiv
...
Qwen Introduces MiniRL: Research and Practice on the Stability of Large-Scale RL Training
Paper title: Stabilizing Reinforcement Learning with LLMs: Formulation and Practices
...
An In-Depth Look at the DeepSeek-V3.2 Technical Report: Architecture Evolution, RL Scaling, and Agent Synthetic Data
Paper title: DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
Paper link
...
An In-Depth Look at the Qwen3-VL Technical Report
Paper title: Qwen3-VL Technical Report
Paper link: https://arxiv.org/pdf/2511.21631
TL;
...
Qwen Team Introduces SAPO, More Stable and Better-Performing Than GRPO and GSPO
Paper title: Soft Adaptive Policy Optimization
Paper link: https://arxiv.org/pdf/2511.203
...