Mem$^{2}$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation

Zihao Cheng1, Zeming Liu1†, Yingyu Shan2, Xinyi Wang3, Xiangrong Zhu3, Yunpu Ma4, Hongru Wang5, Yuhang Guo2, Wei Lin3, Yunhong Wang1
1Beihang University   2Beijing Institute of Technology   3Independent Researcher
4Munich Center for Machine Learning   5University of Edinburgh

Corresponding author   Email: zihaocheng@buaa.edu.cn

Note: We are actively preparing the codebase, which is currently undergoing the company's internal review process prior to public release.

Abstract

While large language model-powered agents can self-evolve by accumulating experience or by dynamically creating new assets (i.e., tools or expert agents), existing frameworks typically treat these two evolutionary processes in isolation. This separation overlooks their intrinsic interdependence: the former is inherently bounded by a manually predefined static toolset, while the latter generates new assets from scratch without experiential guidance, leading to limited capability growth and unstable evolution. To address this limitation, we introduce a novel paradigm of co-evolutionary Capability Expansion and Experience Distillation. Guided by this paradigm, we propose Mem$^{2}$Evolve, which integrates two core components: Experience Memory and Asset Memory. Specifically, Mem$^{2}$Evolve leverages accumulated experience to guide the dynamic creation of assets, thereby expanding the agent's capability space while simultaneously acquiring new experience to achieve co-evolution. Extensive experiments across 6 task categories and 8 benchmarks demonstrate that Mem$^{2}$Evolve achieves average improvements of 18.53% over standard LLMs, 11.80% over agents evolving solely through experience, and 6.46% over those evolving solely through asset creation, establishing it as a substantially more effective and stable self-evolving agent framework.

Methodology

Overview of the Mem$^{2}$Evolve framework

Overview of Mem$^{2}$Evolve, a self-evolving agent framework built on a Dual-Memory mechanism. The evolution proceeds in two phases. During Forward Inference, the agent recruits tools and expert agents from Asset Memory to execute the current task. When the task exceeds its current capability boundary, Experience Memory is leveraged to guide the stable creation of new assets on demand. During Backward Evolution, newly validated assets are preserved in Asset Memory to achieve persistent capability expansion, while strategic insights distilled from execution trajectories are accumulated into Experience Memory. This forward-backward loop enables the co-evolution of capabilities and experience, forming a stable self-evolving cycle.
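The forward-backward loop described above can be sketched in code. The sketch below is a minimal illustration, not the paper's actual implementation: all names (`AssetMemory`, `ExperienceMemory`, `solve`) are hypothetical, and asset creation, execution, validation, and distillation are left as caller-supplied callbacks, since those steps are performed by the LLM in the real framework.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the Dual-Memory forward-backward loop.
# Class and function names are illustrative, not from the released codebase.

@dataclass
class AssetMemory:
    """Persistent store of reusable assets (tools and expert agents)."""
    assets: dict = field(default_factory=dict)

    def recruit(self, task: str) -> list:
        # Retrieve assets relevant to the task (naive substring match here;
        # the real framework would use semantic retrieval).
        return [asset for name, asset in self.assets.items() if name in task]

    def preserve(self, name: str, asset) -> None:
        # Persistent capability expansion: keep the validated asset for reuse.
        self.assets[name] = asset

@dataclass
class ExperienceMemory:
    """Strategic insights distilled from past execution trajectories."""
    insights: list = field(default_factory=list)

    def guide(self, task: str) -> list:
        # Return accumulated insights to ground new asset creation.
        return list(self.insights)

    def accumulate(self, insight: str) -> None:
        self.insights.append(insight)

def solve(task, asset_mem, exp_mem, create_asset, execute, distill, validate):
    # --- Forward Inference: recruit existing assets, or create on demand ---
    assets = asset_mem.recruit(task)
    created = []
    if not assets:  # task exceeds the current capability boundary
        name, asset = create_asset(task, guidance=exp_mem.guide(task))
        created.append((name, asset))
        assets = [asset]
    result, trajectory = execute(task, assets)
    # --- Backward Evolution: persist validated assets, distill experience ---
    for name, asset in created:
        if validate(asset, result):
            asset_mem.preserve(name, asset)
    exp_mem.accumulate(distill(trajectory))
    return result
```

On a second, similar task, `recruit` finds the preserved asset, so no new creation is needed; meanwhile every run, successful or not, still deposits a distilled insight into `ExperienceMemory`, which is what stabilizes later asset creation.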

Comparison of self-evolving agent frameworks

Comparison of self-evolving agent frameworks. Optimization indicates whether experience is used to optimize the agent (e.g., its prompts). Persistence denotes whether experiences are persistently stored for future reuse. Source denotes where experience is collected: the agent's task-execution trajectory and/or the tool-creation process. Tool Crea. and Agent Crea. indicate whether the framework supports creation of tools and expert agents, respectively. Tool/Agent denotes whether the toolset and expert agents are static or dynamic. Crea. Grounding indicates the knowledge sources used for asset creation: parametric knowledge, web search information, and experience. Exp.-Guided Creation indicates whether new assets are created under the guidance of past experience.

Experiments

Task categories: GAIA (L1–L3, Total) · Embodied (ALFWorld) · Multi-Hop QA (HotpotQA, 2Wiki) · Math (AIME24, AIME25) · Planning (TravelPlanner) · Web Interaction (WebShop).

| Method | GAIA L1 | GAIA L2 | GAIA L3 | GAIA Total | ALFWorld | HotpotQA | 2Wiki | AIME24 | AIME25 | TravelPlanner | WebShop | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| *Naive Large Language Model* | | | | | | | | | | | | |
| GPT-5-Chat (Direct) | 16.98 | 12.79 | 7.69 | 12.49 | 83.58 | 50.40 | 81.80 | 60.00 | 46.67 | 38.68 | 22.31 | 49.49 |
| GPT-5-Chat (CoT) | 24.53 | 17.44 | 11.54 | 17.84 | 83.58 | 47.40 | 74.40 | 66.67 | 56.67 | 39.51 | 27.49 | 51.71 |
| GPT-5-Chat (ReAct) | 26.42 | 17.44 | 11.54 | 18.47 | 86.87 | 41.40 | 48.40 | 66.67 | 60.00 | 39.13 | 25.10 | 48.27 |
| OpenAI-DeepResearch$^{\dagger}$ | 74.29 | 69.06 | 47.60 | 67.36 | – | – | – | – | – | – | – | – |
| *Experience-Centric Evolving* | | | | | | | | | | | | |
| DyLAN | 24.53 | 19.78 | 11.54 | 18.62 | 91.20 | 52.00 | 65.00 | 46.67 | 43.33 | 43.15 | 36.40 | 49.55 |
| EvoAgent | 22.64 | 19.78 | 11.54 | 17.99 | 92.50 | 54.40 | 75.00 | 66.67 | 43.33 | 49.20 | 37.80 | 54.61 |
| AFLOW | 26.42 | 17.44 | 15.38 | 19.75 | 93.40 | **60.80** | 72.40 | 66.67 | 63.33 | 53.24 | 37.90 | 58.44 |
| DSPy | 30.19 | 15.12 | 11.54 | 18.95 | 92.80 | 55.60 | 76.40 | 66.67 | 50.00 | 44.90 | 35.50 | 55.10 |
| *Capability-Centric Evolving* | | | | | | | | | | | | |
| Alita | 81.13 | 75.58 | 46.15 | 72.73 | 86.13 | 58.80 | 77.40 | 70.00 | 66.67 | 48.32 | 30.21 | 63.78 |
| AgentVerse | 30.19 | 16.28 | 19.23 | 21.90 | 88.32 | 38.60 | 74.60 | 60.00 | 50.00 | 47.25 | 32.53 | 51.65 |
| AutoAgents | 35.85 | 24.42 | 19.23 | 26.50 | 87.92 | 54.20 | 73.80 | 40.00 | 36.67 | 43.52 | 31.40 | 49.25 |
| SwarmAgentic | 28.30 | 18.60 | 13.46 | 20.40 | 88.79 | 56.00 | 80.00 | 46.67 | 40.00 | 59.14 | 34.12 | 53.14 |
| *Ours* | | | | | | | | | | | | |
| Mem$^{2}$Evolve | **88.68** | **82.56** | **57.69** | **76.31** | **94.31** | **60.80** | **82.00** | **76.70** | **73.33** | **59.25** | **39.20** | **70.24** |

Main results across 6 task categories and 8 benchmarks, reported as Pass@1 for each benchmark. The best result in each column is highlighted in bold. $^{\dagger}$ Results are from the original paper.

BibTeX

@misc{cheng2026mem2evolveselfevolvingagentscoevolutionary,
      title={Mem$^2$Evolve: Towards Self-Evolving Agents via Co-Evolutionary Capability Expansion and Experience Distillation}, 
      author={Zihao Cheng and Zeming Liu and Yingyu Shan and Xinyi Wang and Xiangrong Zhu and Yunpu Ma and Hongru Wang and Yuhang Guo and Wei Lin and Yunhong Wang},
      year={2026},
      eprint={2604.10923},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.10923}, 
}