In this post, we break down SkillOpt: Executive Strategy for Self-Evolving Agent Skills, a new Microsoft research paper that introduces a gradient-descent-like approach for automatically improving AI agent skills without fine-tuning the underlying model.

SkillOpt paper authors: Yifan Yang, Ziyang Gong, Weiquan Huang, Qihao Yang, Ziwei Zhou, Zisu Huang, Yan Li, Xuemei Gao, Qi Dai, Bei Liu, Kai Qiu, Yuqing Yang, Dongdong Chen, Xue Yang, Chong Luo (Source)

Introduction

Training Skills Like Neural Networks

What if AI skills could be trained like neural networks? That’s exactly the idea behind Microsoft’s SkillOpt paper, which turns agent skills into something that can be automatically optimized, much like neural network weights.

As AI agents become increasingly capable and common, success depends on more than just choosing the right foundation model. It also depends on the procedures surrounding that model: how tools are called, what instructions are followed, how outputs are structured and more. These procedures are often captured as agent skills, which are text documents that define the rules, guidelines, and strategies an agent should follow when performing a specific task.

But unlike large language models that go through rigorous training, agent skills are usually generated by asking a model to create them or by manually editing the skill text file ourself. Since adapting these skills is crucial for optimal agent execution, the skills themselves should become trainable. But today there is no equivalent of gradient descent for improving them.

This is exactly what Microsoft’s SkillOpt tries to do, by treating skills as trainable artifacts that can be iteratively improved.

SkillOpt High-Level Overview

Simplified SkillOpt Optimization Step Illustration

Let’s first build intuition using a simplified overview before diving into the full details.

As an example, we’ll use the SpreadsheetBench benchmark. In this benchmark, each sample consists of an Excel workbook along with a prompt describing a task that should be performed on that workbook. In practice, SkillOpt operates on batches of examples, but for simplicity, we’ll focus on a single sample.

We also start with an initial skill file. This skill contains domain-specific guidance for the agent, such as instructions to use Python libraries that can manipulate spreadsheet data. The agent receives the workbook, the task description, and the skill file, then attempts to solve the task.

In traditional model training, we would evaluate the result and use that feedback to update the model’s weights. But in SkillOpt, the underlying model remains completely fixed. Instead, we optimize the skill file.

The goal is to gradually improve the agent’s expertise at spreadsheet tasks by improving the instructions contained within the skill file. How does SkillOpt do this?

The Optimizer Model

Instead of only checking whether the final answer is correct, the entire trajectory, including tool usage, intermediate steps, and final output, is passed to another large language model that acts as an optimizer.

This optimizer analyzes the full rollout and proposes edits to the skill file. These edits might replace existing instructions, remove them, or add entirely new rules that could help solve future tasks more effectively.

For SpreadsheetBench, one example edit shown in the paper is to inspect workbook structure and formulas, then write evaluated static values instead of relying on Excel recalculation.

Neural Network Analogy

If we compare this process to neural network training:

The skill file plays the role of the model weights.
The proposed edits act like gradients suggesting how those parameters should change.

Ranking and Applying Skill Updates

SkillOpt does not blindly apply every suggested edit. The optimizer first ranks its proposed changes, and only the most promising edits are considered. You can think of the number of edits allowed in each iteration as a learning rate. Limiting the size of each update prevents the skill from changing too drastically and helps keep the optimization process stable.

Finally, we have the edits we want to apply to our skill file. But before committing to a new version of the skill, SkillOpt performs a what-if analysis. It applies the candidate edits, runs the agent again on a validation set, and measures whether performance actually improves. Only skill updates that demonstrate real improvement are accepted. Otherwise, the system keeps the previous version of the skill.

By repeating this process over many iterations, SkillOpt gradually improves the skill file, ending with a final output skill. On SpreadsheetBench, SkillOpt improves GPT-5.5 from 41.8% accuracy to 80.7%.

SkillOpt Architecture

We already understand the big picture. Now let’s see how SkillOpt actually implements it. The first half of this figure we see here should look familiar from the high-level overview. We’ll move through it quickly and focus on the additional details, before diving into the second half.

Fast Update Path

Starting from the left, we have a dataset split into training, validation, and test sets. At the bottom, we have a fixed agent together with the skill file that we want to optimize.

On each iteration, a batch of training samples is selected, and the agent processes them using the current version of the skill. This produces a collection of rollouts containing the agent’s complete execution traces and final outputs.

Optimizer Analysis

These rollouts are then divided into smaller groups and passed to an optimizer model. The optimizer should be a strong frontier language model. In the paper, GPT-5.5 is used for this role.

Each mini-batch is analyzed independently, and the optimizer proposes edits that could improve the skill instructions.

Consolidation and Ranking

At this point, multiple sets of candidate edits have been generated. Rather than applying them directly, SkillOpt performs a second optimization step that consolidates and ranks all proposed edits across the mini-batches.

Then, the learning-rate is applied. Only a limited number of the highest-ranked edits are allowed to move forward. By restricting how much the skill can change in a single iteration, SkillOpt avoids making overly aggressive updates that could destabilize the optimization process.

Validation Gate

The selected edits are then applied to create a candidate skill file. But before actually replacing the skill file, we pass the new skill file via the validation gate.

The agent is run on the validation set using the candidate skill, and the resulting performance is compared against the current skill version. If performance improves, the update is accepted and the new skill replaces the old one. If performance does not improve, the update is rejected and the previous skill remains active.

Importantly, the failed updates are shown to the optimizer in future iterations, helping it avoid repeatedly proposing changes that have already been proven ineffective.

Slow Update Path

So far, we’ve only discussed small local updates to the skill. But there is another pathway that is looking at broader patterns that emerge over many iterations, which we can see at the bottom of the figure. Unlike the fast update path, which runs continuously throughout training, the slow update path only runs once at the end of each epoch.

Comparing Beginning vs End of Epoch

To perform this analysis, SkillOpt compares the skill from the beginning of the epoch with the skill produced after all fast updates have been completed. A set of training samples is sampled and the agent is processing the same samples twice, each time using a different skill.

Categorizing Outcomes

The resulting rollouts are categorized into four groups:

Improvements: A task was previously unsuccessful but is now solved correctly.
Regressions: A task was previously solved correctly but now fails.
Persistent Failures: Both versions fail.
Stable Successes: Both versions succeed.

Epoch-Wise Reflection

These categorized examples are then analyzed by the optimizer in a step called epoch-wise reflection.

The purpose of this reflection step is not to make another small local edit. Instead, it looks for higher-level patterns. It tries to understand what kinds of changes are helping, what kinds of changes are hurting performance, and what failure modes still remain unsolved.

From this analysis, the optimizer can modify a dedicated portion of the skill that is intentionally left untouched by the fast update pathway. Just like any other update, these changes must still pass the validation gate before being accepted.

Meta-Skill and Long-Term Memory

The optimizer also maintains a meta-skill, that is a memory of which edits have worked, which ones failed, and which challenges remain unresolved. This meta-skill is used by the optimizer in the next epoch and helps guide future optimization decisions.

After several epochs, this process produces the final optimized skill file, allowing the agent to acquire stronger domain expertise without modifying the underlying model.

SkillOpt Experimental Results

Now that we understand how SkillOpt works, let’s look at the results reported in the paper.

Overall Benchmark Performance

The first table evaluates SkillOpt across six different benchmarks. The upper section uses direct chat environments, while the lower sections evaluate agents running inside Codex and Claude Code. In all cases, SkillOpt is compared against other skill-generation and optimization approaches.

Across all benchmarks and execution environments, it consistently matches or outperforms competing methods, often by a significant margin. What’s particularly impressive is that these gains are achieved without touching the weights of the underlying language model.

Does SkillOpt Help Smaller Models?

SkillOpt Benchmark Comparison - Smaller Models — SkillOpt Benchmark Comparison – Smaller Models (Source)

An important question is whether SkillOpt only benefits the strongest frontier models or whether the same idea can help smaller models as well.

The above table shows similar performance gains also for smaller models.

Transferability Experiments

Another interesting question is whether these optimized skills are tied to a specific model. If a skill is truly capturing useful expertise, we would hope that it remains valuable even when used with a different model. The above table investigates exactly this question.

Model Transferability

The upper section evaluates model transferability. Here, a skill optimized using GPT-5.4 is transferred to its mini and nano variants.

First, every transferred skill performs better than the baseline without a skill at all. As expected, directly optimizing a skill for a specific model generally produces the strongest results. However, a large portion of the improvement is often preserved after transfer. Yet, we do see cases here where a smaller portion of the improvement is preserved. So, transferability works, but not very consistent.

Harness Transferability

The middle section analyzes harness transferability. Here, skills optimized using Codex are transferred to Claude Code, and vice versa.

The pattern is very similar. Direct optimization remains the strongest option. Transferred skills still provide meaningful improvements over the baseline in most cases. In other words, the skill appears to contain portable knowledge that remains useful even when the surrounding agent framework changes.

Benchmark Transferability

Finally, the bottom section explores benchmark transferability. In this setting, a skill optimized for one benchmark is applied to a related benchmark. Once again, the optimized skills continue to provide gains. While the absolute numbers are not exceptionally large, we can see a consistent gain comparing to the baselines.

Results & Links

Paper Page
Code
Join our newsletter to receive concise 1-minute read summaries for the papers we review – Newsletter

All credit for the research goes to the researchers who wrote the paper we covered in this post.