
Developer Experience Insights from the State of AI Survey

· 8 min read

Watercolor illustration of a developer surveying an AI landscape of models, clouds, and tools from a hilltop desk

I love the State of AI survey results put together by Devographics. I participate every year and I genuinely look forward to the results — wondering how closely my own experience lines up with the industry, and what signals I can find to affirm or change direction. This year's report is based on over 7,000 respondents and there's a lot to dig into. I'm going to call out a few areas I find most interesting, but please explore it yourself — the data rewards time.

Model Providers vs. Models vs. Runtime Location

You can read a lot into how a survey is designed, not just the results. One area I've been focusing heavily on in the last six months of my own development is what I call model value — using the right model for the right purpose.

Most survey respondents (and most headlines) are focused on cloud models from the big-name vendors. Anthropic and OpenAI are obvious winners in that category, and the survey backs that up clearly. But I'm interested in extending that conversation to include model catalogs and model runtime location. That means places like Hugging Face, NVIDIA, Ollama, and Azure AI Foundry.

This isn't because I work at Microsoft — though I do appreciate where Foundry is going, especially its commitment to making the full breadth of the model ecosystem accessible in one place. It's because I genuinely appreciate the diversity that comes from having hundreds of models with different strengths, sizes, and licenses available to me. They all have value.

I also appreciate that I can do real work on my local machine with a downloaded small language model (SLM) and not worry about token costs. That's not a fallback strategy; that's a workflow. Tools like Azure Foundry Local make it straightforward to run these models on your own hardware. For certain tasks, such as summarization, local code review, and quick drafts, the SLM on my laptop is fast, free, and private.

The survey supports the idea of local AI, though it's tucked in a back corner under Usage: Local AI — you have to go looking for it. I expect that section to grow substantially. As more developers want to maximize the compute they already own and reduce API spend, they'll branch out into other model providers and runtime locations beyond the cloud. Local-first isn't fringe thinking anymore.

My prediction: Next year's survey will have more questions that clearly differentiate model vendor, model provider, model runtime, and model cost. These are four distinct dimensions and right now they're blurred together.
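As a minimal sketch of why those four dimensions are worth separating, each one can be modeled as its own field. Everything here is illustrative: the type names, the placeholder entries, and the prices are assumptions, not survey data.

```typescript
// Where inference actually runs.
type Runtime = "cloud" | "local";

// Hypothetical descriptor: one field per dimension.
interface ModelOption {
  vendor: string;         // who trained the model
  provider: string;       // catalog or service it is obtained from
  runtime: Runtime;       // where it executes
  costPerMTokens: number; // illustrative USD per million tokens; 0 for local
}

// Placeholder entries; names and prices are made up for the example.
const catalog: ModelOption[] = [
  { vendor: "VendorA", provider: "CloudCatalog", runtime: "cloud", costPerMTokens: 3 },
  { vendor: "VendorB", provider: "LocalRunner", runtime: "local", costPerMTokens: 0 },
];

// Pick the cheapest option, optionally restricted to one runtime location.
function cheapest(candidates: ModelOption[], runtime?: Runtime): ModelOption | undefined {
  const pool = candidates.filter((m) => runtime === undefined || m.runtime === runtime);
  return pool.sort((a, b) => a.costPerMTokens - b.costPerMTokens)[0];
}
```

With the fields split this way, "which vendor" and "where does it run" become independent questions, which is the distinction the survey currently blurs.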

Risk, Reward, and the Agents Question

The survey also asked for opinions on broader concerns — whether AI poses an existential risk, whether people are worried about job security. I found those questions interesting but not surprising in their results. What will be interesting is how those answers shift next year as the technology matures and the hype cycle normalizes.

The section I'm most confident will grow and evolve is Agents and Assistants. "Agents" can mean a lot of things depending on who you ask. In my own use and development, an agent is something that completes well-defined work autonomously — not a chatbot, not a suggestion engine, but a teammate with a scope.

A year ago, that meant LangChain and LangGraph. Those are still valid tools. But now the agent framework landscape includes many more options, including the GitHub Copilot SDK, which I've been using heavily.

One of my favorite tools in this space is Squad — a team of AI agents, each with different expertise and a different work surface, collaborating to help me complete work. Squad isn't just a productivity tool. It's changed how I think about AI and how I interact with it. I went from asking an AI a question to managing an AI team with defined roles. That shift in mental model matters.

My prediction: Next year's survey will have significantly more questions about agent frameworks — not just "do you use agents" but what frameworks, what patterns, what governance.

New Tools Adopted from AI's Demands

Here's something the survey doesn't capture yet but I see every day: AI hasn't just changed how I write code — it's changed what tools I need to write code.

Take Dev Containers as the clearest example. Instead of installing tools locally and hoping versions stay consistent, I define my entire development environment as a Docker container — declarative, reproducible, isolated. I run these containers as development environments on my own compute. Every project gets exactly the dependencies it needs, and nothing bleeds between them.

I didn't adopt Dev Containers because it was trendy. I adopted them because AI agents need consistent, predictable environments — and they need compute that doesn't stop. When you have multiple agents running builds concurrently, or a cloud agent like GitHub Copilot provisioning its own environment from scratch, you need something better than "I hope the right version of Go is on PATH." A devcontainer.json defines exactly what's available, every time, no drift. The agent gets the same environment I do.
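For illustration, here is roughly what such a definition contains, written as a typed TypeScript object that serializes to the JSON in a devcontainer.json file. Only a small subset of fields is shown, and the image tag, feature, and command are placeholder values chosen for the example; check them against the Dev Containers documentation before relying on them.

```typescript
// A minimal subset of devcontainer.json fields, typed for illustration.
interface DevContainerConfig {
  name: string;
  image: string;
  features?: Record<string, Record<string, string>>;
  postCreateCommand?: string;
}

// Placeholder values (assumptions): pin the toolchain so agents and humans match.
const devContainer: DevContainerConfig = {
  name: "api-service",
  image: "mcr.microsoft.com/devcontainers/go:1.22",
  features: { "ghcr.io/devcontainers/features/node:1": { version: "20" } },
  postCreateCommand: "go mod download",
};

// Serialize to the JSON that would live in .devcontainer/devcontainer.json.
function toDevContainerJson(config: DevContainerConfig): string {
  return JSON.stringify(config, null, 2);
}
```

The point is that the Go version is a declared value in a file, not whatever happens to be on PATH.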

The same pattern extends to DevBox — the key difference is where the compute lives and whether it shuts down. Dev Containers run on my own machine, always available, always under my control. DevBox moves the environment to cloud compute that can be spun up and torn down on demand. Both solve the same core problem — declarative, reproducible environments that agents can parse and modify — but the tradeoff is local persistence versus elastic cloud resources. These aren't tools I sought out for their own sake — they're tools that became necessary because the work exposed the fragility of manually-managed environments.

My prediction: Next year's survey will start asking about developer environment tooling in the context of AI — not just "what IDE do you use" but "how do you manage environments for AI-assisted development?" The infrastructure layer is shifting underneath us, and the tools we adopt next will be shaped by what AI needs to function well alongside us.

What's Missing: The Survey's Blind Spots

This is the part I think about every year with Devographics surveys — the sections I expected to find but didn't. A few stand out.

Fundamentals: MCPs, skills, and pipelines. Model Context Protocol (MCP) has become the connective tissue of modern AI development — it's how tools talk to models, how context flows between systems, how you compose AI capabilities. It wasn't in the survey. Skills and pipeline composition weren't either. These are core developer experience concerns right now.

Thought leadership and learning sources. Who are developers reading and listening to in order to keep up? What communities are they in? Is large language model / small language model the only path people see toward AGI, or are there other architectures getting attention? These questions would reveal a lot about how the community sees the next few years.

Security. AI in the developer space has huge and immediate security implications — prompt injection, model poisoning, data exfiltration through context, supply chain concerns with models themselves. This deserves its own section in a developer survey.

My prediction: Next year's report will address most of these gaps. I've seen Devographics do exactly this with previous surveys — State of JS, State of CSS — where year two and three add the depth that year one reveals is missing. The State of AI survey is following the same natural progression, and that's a good thing.


If you haven't read the State of AI survey yet, carve out an hour. Look at the cross-tabs, read the opinions section, and notice what questions they asked and didn't ask. The shape of the survey tells you as much as the answers do.

I'll be participating again next year and hoping to see my predictions validated — or proven wrong, which is honestly just as useful.

GitHub Copilot: From Basics to AI Agents

· 22 min read

Watercolor illustration of a woodworker in blue meeting his first AI helper in green at a furniture workshop

Imagine a furniture workshop. You're the craftsperson in the blue shirt — the one with the vision, the taste, the final say. The helpers in green shirts? Those are your AI agents. At first there's just one, handing you the right chisel at the right moment. By the end of this journey, you'll have a whole crew in green building furniture to your specifications while you direct, decide, and review.

A year ago, I was tab-completing function signatures. Today, I manage a team of named AI agents that handle PR reviews, documentation sweeps, and infrastructure audits.

That sounds like a sales pitch. It's not. It's a progression that happened one level at a time, each building on the last. And the best part? You can start the same journey in about 15 minutes.

Here's the path I took — four levels, from "ooh that's cool" to "wait, this changes everything."

The TL;DR

| Level | What Changes | Time to Value |
| --- | --- | --- |
| 1. First Day | You get an AI pair programmer (IDE + CLI) | 15 minutes |
| 2. Making It Yours | Copilot learns YOUR codebase (instructions, MCPs, skills) | 1-2 hours |
| 3. Squad | A team of agents working in concert | 1 day |
| 4. Autonomous Ops | Fully defined work executes itself | 2-3 days |

Each level builds on the previous one, and each is independently useful. Once you see what's possible at each stage, you'll want to keep climbing.

Badge legend: 🖥️ VS Code · ⌨️ CLI · 👤 Interactive · 🤖 Autonomous · 💻 Local · ☁️ Cloud · 🌐 GitHub.com


Level 1: Your First Day with Copilot

🖥️ VS Code · ⌨️ CLI · 👤 Interactive · 💻 Local

Watercolor illustration of a blue-shirted craftsperson at the workbench while a green-shirted helper steadies the joint

Your first day in the workshop. You're at the bench with your mallet (blue shirt), fitting a dovetail joint. Your one helper in green steadies the piece, hands you the right tool before you ask, and suggests a better angle — but you swing the mallet.

This is where everyone starts — and honestly, where most of the immediate productivity gains live. Level 1 spans two environments: Copilot in your IDE (VS Code, JetBrains, etc.) and the standalone Copilot CLI in your terminal.

In the IDE: Inline Completions & Inline Chat

🖥️ VS Code · 👤 Interactive · 💻 Local

Inline completions — the thing most people think of as "Copilot." You type, it suggests. But it's more than autocomplete. It reads your open files, your comments, your function signatures, and generates contextually aware suggestions. This happens directly in your editor as you type.

Inline chat — highlight code, press Ctrl+I, ask a question. "Explain this regex." "Refactor this to use async/await." "Add error handling." It edits in place within the current file.

In the IDE: Copilot Chat Panel

🖥️ VS Code · 👤 Interactive · 💻 Local

The Chat panel (Ctrl+Shift+I or the sidebar) opens a conversation with Copilot that has broader awareness:

  • Open file context — ask questions about the file you're looking at: "What does this function do?" "Find the bug in this logic."
  • @workspace — ask about the entire repository: "Where is authentication handled?" "Show me all API routes." Copilot searches across your project.
  • @terminal — get help with shell commands without leaving the IDE: "How do I find large files?" "What's the git command to squash commits?"
  • Agent mode — Copilot Chat also has an "agent" mode where it can make multi-step edits, run terminal commands, and iterate. This is powerful for IDE-based workflows, but note: this is different from the Squad "agents" discussed later. Agent mode is a single AI working iteratively; Squad agents are specialized team members working in concert.

The Standalone Copilot CLI

⌨️ CLI · 👤 Interactive · 💻 Local

The copilot command brings the full Copilot agent to your terminal — file editing, shell commands, sub-agents, and more:

# Non-interactive prompt mode:
copilot -p "extract a .tar.gz file preserving permissions"

# Ask about git:
copilot -p "undo my last commit but keep the changes"

# Start an interactive session:
copilot

The standalone CLI (copilot) is a full agent runtime — it can read/write files, run commands, and orchestrate complex tasks from your terminal. It's distinct from the IDE chat panel but equally powerful.

When to Use Each

| Context | Best For |
| --- | --- |
| Inline completions | Flow-state coding, writing new functions |
| Inline chat (Ctrl+I) | Quick edits to selected code |
| Chat panel (open file) | Understanding code you're reading |
| Chat panel (@workspace) | Finding things across a project |
| Chat panel (@terminal) | Shell command help inside IDE |
| Agent mode (IDE) | Multi-step edits within a project |
| copilot CLI | Terminal-first workflows, scripting, automation |

Try This Now

  1. Install GitHub Copilot in VS Code
  2. Open any project, start a new file, write a comment:
// Parse a CSV string into an array of objects using the first row as headers

Copilot will generate the implementation. Tab to accept.

  3. Install the standalone Copilot CLI and try:
copilot -p "explain why this Node.js app leaks memory when processing large CSV uploads"
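For reference, the CSV comment in step 2 tends to produce something along these lines. This is a hand-written sketch, not captured Copilot output, and it is deliberately naive: it assumes no quoted fields or embedded commas.

```typescript
// Parse a CSV string into an array of objects using the first row as headers.
// Naive by design: quoted fields and embedded commas are not handled.
function parseCsv(csv: string): Record<string, string>[] {
  const lines = csv.split(/\r?\n/).filter((line) => line.trim() !== "");
  if (lines.length === 0) return [];
  const headers = lines[0].split(",").map((header) => header.trim());
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const row: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Missing trailing cells become empty strings.
      row[header] = (cells[i] ?? "").trim();
    });
    return row;
  });
}
```

Calling `parseCsv("name,age\nAda,36")` returns a single row with `name` set to `"Ada"` and `age` set to `"36"`.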

What I Learned at Level 1

The biggest gain wasn't the code generation — it was the velocity shift in unfamiliar territory. Working in a language I don't know well? Copilot bridges the gap between "I know what I want" and "I know the syntax." It turned 30-minute research into 30-second completions.

The limitation: Copilot at this level generates generic best-practice code. It knows nothing about your specific conventions or preferences. That leads to ...


Level 2: Making Copilot Yours

🖥️ VS Code · ⌨️ CLI · 👤 Interactive · 💻 Local

Watercolor illustration of a blue-shirted craftsperson alone, setting up custom jigs and labeled drawers

No green shirts in sight — this is setup time. You're alone at the bench, labeling drawers, building custom jigs, and pinning reference cards to the pegboard. You're not building furniture yet; you're building the system that makes your workshop uniquely yours. When the green-shirted helpers return, they'll know exactly where everything goes.

Level 1 Copilot is smart but generic. Level 2 is where it starts feeling like a teammate who's read your wiki. This level works in both the IDE and CLI — the same instruction files and MCP configs are picked up by Copilot Chat in the IDE and Copilot CLI.

Custom Instruction Files

Drop instruction files in your repo and Copilot learns your conventions:

.github/copilot-instructions.md — global instructions for all Copilot interactions:

# Project Conventions

- Use TypeScript strict mode with explicit return types
- Prefer `Result<T, Error>` pattern over throwing exceptions
- All API responses follow our envelope format: `{ data, error, meta }`
- Tests use vitest with the `describe/it` pattern
- Never use `any` — prefer `unknown` with type guards

AGENTS.md — agent instructions that can live anywhere in your repo. Unlike copilot-instructions.md (which must be in .github/), you can place multiple AGENTS.md files at different directory levels — the nearest one in the directory tree takes precedence. This makes it ideal for monorepos where each package needs its own agent behavior:

my-monorepo/
├── AGENTS.md                  ← shared team-wide instructions
├── packages/
│   ├── frontend/
│   │   └── AGENTS.md          ← React-specific agent rules (wins here)
│   └── backend/
│       └── AGENTS.md          ← API-specific agent rules (wins here)
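The "nearest one wins" rule can be sketched as a walk up the directory tree. This illustrates the rule only; it is not Copilot's actual lookup code, and the function name and repo-relative path convention are assumptions.

```typescript
// Given the directories that contain an AGENTS.md, return the one nearest to
// a file by walking up from the file's own directory.
// Paths are repo-relative and use "/"; "" stands for the repo root.
function nearestAgentsDir(filePath: string, agentsDirs: string[]): string | undefined {
  const parts = filePath.split("/").slice(0, -1); // drop the file name
  for (let depth = parts.length; depth >= 0; depth--) {
    const dir = parts.slice(0, depth).join("/");
    if (agentsDirs.indexOf(dir) !== -1) return dir;
  }
  return undefined; // no AGENTS.md anywhere above this file
}
```

For the monorepo above, a file under `packages/frontend/src/` resolves to `packages/frontend`, while a file under `docs/` falls back to the root.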

Every suggestion Copilot makes now respects these rules. No more "helpful" suggestions that violate your architecture.
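To make the sample conventions concrete, code that follows them might look like the sketch below. The `User` type, the `meta.count` field, and the function names are assumptions added for the example; only the `Result` pattern, the `{ data, error, meta }` envelope, explicit return types, and `unknown` with type guards come from the instruction file.

```typescript
// Result type: failures are returned as values instead of thrown.
type Result<T, E = Error> = { ok: true; value: T } | { ok: false; error: E };

// Envelope format from the instructions: { data, error, meta }.
interface Envelope<T> {
  data: T | null;
  error: string | null;
  meta: { count: number }; // the shape of meta is an assumption
}

interface User { id: number; name: string }

// Type guard: narrows `unknown` instead of reaching for `any`.
function isUser(value: unknown): value is User {
  if (typeof value !== "object" || value === null) return false;
  const candidate = value as { id?: unknown; name?: unknown };
  return typeof candidate.id === "number" && typeof candidate.name === "string";
}

function parseUser(raw: unknown): Result<User> {
  return isUser(raw)
    ? { ok: true, value: raw }
    : { ok: false, error: new Error("invalid user") };
}

function toEnvelope(result: Result<User>): Envelope<User> {
  return result.ok
    ? { data: result.value, error: null, meta: { count: 1 } }
    : { data: null, error: result.error.message, meta: { count: 0 } };
}
```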

MCP Servers: Giving Copilot New Abilities

Model Context Protocol (MCP) servers let you plug external data sources and tools into Copilot's context. Think of them as APIs that Copilot can call mid-conversation — in both the IDE and CLI.

// .copilot/mcp.json
{
  "mcpServers": {
    "azure": {
      "command": "npx",
      "args": ["-y", "@azure/mcp@latest", "server", "start"]
    }
  }
}

Now Copilot can query your Azure resources, check deployment status, or read your database schema — all within the conversation.
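A malformed entry in this file is easy to miss, so a quick shape check can help. The sketch below validates only the two fields used in the config above (`command` and `args`); the helper names are hypothetical, and it expects plain JSON with no comment line.

```typescript
// Shape of one server entry as it appears in the config above.
interface McpServer { command: string; args: string[] }

function isMcpServer(value: unknown): value is McpServer {
  if (typeof value !== "object" || value === null) return false;
  const entry = value as { command?: unknown; args?: unknown };
  return typeof entry.command === "string"
    && Array.isArray(entry.args)
    && entry.args.every((arg: unknown) => typeof arg === "string");
}

// Return the names of servers whose entries are malformed.
// Throws if the input is not valid JSON.
function invalidServers(configJson: string): string[] {
  const parsed: unknown = JSON.parse(configJson);
  if (typeof parsed !== "object" || parsed === null) return ["<root>"];
  const servers = (parsed as { mcpServers?: unknown }).mcpServers;
  if (typeof servers !== "object" || servers === null) return ["<mcpServers>"];
  const entries = servers as Record<string, unknown>;
  return Object.keys(entries).filter((name) => !isMcpServer(entries[name]));
}
```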

Some MCP servers I use daily:

Skills: Repeatable, Deterministic Work

Skills are the underrated powerhouse of Level 2. A skill is a directory with a SKILL.md file that defines a repeatable pattern — including deterministic steps from scripts and code.

.<directory>/skills/
├── pr-review/
│   └── SKILL.md          # "Run lint, check test coverage, review diff"
├── doc-sync/
│   └── SKILL.md          # "Compare API surface to docs, flag drift"
└── sdk-sample-check/
    └── SKILL.md          # "Validate all samples compile and match SDK version"

Read the Visual Studio documentation for the best directory location for your skill usage.

Skills differ from instructions in that they define executable workflows — not just preferences. A skill can include shell commands to run, files to check, and decision trees to follow. They're reusable across sessions and agents.

Why skills matter:

  • Repeatable — same process every time, no drift
  • Composable — skills can reference other skills
  • Deterministic where needed — embed scripts and validation steps that always run the same way
  • Shareable — check them into your repo, the whole team benefits
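The "deterministic where needed" idea can be sketched as a fixed list of checks that give the same answer for the same input. This models the concept only; it is not the SKILL.md format, and the field names and thresholds are assumptions.

```typescript
// Hypothetical model of a skill's deterministic section: ordered checks
// that return pass/fail for the same input every time.
interface Step<T> {
  name: string;
  check: (input: T) => boolean;
}

interface StepReport { name: string; passed: boolean }

// Run every step in order and report each outcome; no step is skipped.
function runSteps<T>(steps: Step<T>[], input: T): StepReport[] {
  return steps.map((step) => ({ name: step.name, passed: step.check(input) }));
}

// Example input for a review-style skill (field names are illustrative).
interface PullRequest { lintErrors: number; coverage: number; filesChanged: number }

// Thresholds below are placeholders, not recommendations.
const quickReview: Step<PullRequest>[] = [
  { name: "lint is clean", check: (pr) => pr.lintErrors === 0 },
  { name: "coverage at least 80%", check: (pr) => pr.coverage >= 0.8 },
  { name: "diff is reviewable", check: (pr) => pr.filesChanged <= 25 },
];
```

The model still writes the review commentary; the checklist portion runs the same way every time.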

Try This Now

  1. Create .github/copilot-instructions.md with your project's conventions
  2. Add an MCP server for a tool you use daily (Azure, database, etc.)
  3. Create a .github/skills/quick-review/SKILL.md that describes your code review checklist

Then open Copilot Chat or run copilot and notice the difference — it follows YOUR patterns now.

What I Learned at Level 2

Custom instructions are absurdly high-leverage. A 50-line markdown file eliminates 80% of the "no, not like that" moments. MCP servers bridge "Copilot that knows code" and "Copilot that knows your infrastructure." Skills turn tribal knowledge into executable processes.

The limitation: everything is still per-session. Copilot doesn't automatically carry context between sessions — it won't remember decisions from yesterday's refactor. It doesn't have persistent context about your project's evolving state. It doesn't coordinate with other instances of itself.

Enter Squad.


Level 3: Squad — A Team Working in Concert

🖥️ VS Code · ⌨️ CLI · 👤 Interactive · 💻 Local

Watercolor illustration of craftspeople collaborating at a shared workbench in a woodworking shop

The workshop is getting busy. You're at the bench, studying the blueprint. Around you, a small team of helpers is assembling a cabinet together — one holds the frame, another drives the dowels, another checks the level. Each knows their role. Each stays in their lane. The work moves faster because they each know their job and coordinate with each other, not just with you.

This is where the mental model shifts from "AI assistant" to "AI team."

Squad gives you a team of specialized agents working in concert to complete tasks, where each member's expertise and boundaries positively impact the result. It's not just "named agents with memory" — it's a coordinated system where the reviewer's standards shape the coder's output, and the docs writer's perspective catches gaps the implementer missed.

Squad runs on the Copilot CLI (copilot --agent squad) and adds the organizational layer that makes agents feel like a real team rather than a single assistant wearing different hats.

What Makes Squad Different

| Feature | Regular Copilot | Squad |
| --- | --- | --- |
| Memory | Session-based | Persistent across sessions |
| Identity | Generic assistant | Named agents with charters |
| Coordination | You manage context | Agents hand off to each other |
| Specialization | Same agent for everything | Domain-specific agents with boundaries |
| Result quality | One perspective | Diverse perspectives improve output |

Installing Squad

# Install Squad CLI
npm install -g @bradygaster/squad-cli

# Initialize in your project
npx @bradygaster/squad-cli init

# Start Copilot with Squad (standalone CLI)
copilot --agent squad

This scaffolds a .squad/ directory:

.squad/
├── agents/
│   ├── ralph/            # Orchestrator
│   │   └── charter.md
│   ├── reviewer/
│   │   └── charter.md
│   └── docs-writer/
│       └── charter.md
├── ceremonies/
│   └── sweep.md
└── memory/
    └── decisions.md

Agent Charters: Expertise + Boundaries

Each agent has a charter, a markdown file that defines who it is, what it does, and, critically, what it won't do:

# Reviewer Agent Charter

## Identity
You are the code reviewer for this project. You focus on:
- Security vulnerabilities
- Performance anti-patterns
- Consistency with project conventions

## What I Own
- TypeScript files and build system

## Boundaries
- Never approve your own changes
- Escalate architectural concerns to the team lead
- Don't refactor code that isn't in the PR scope

The boundaries matter as much as the expertise. A reviewer that knows when to escalate produces better outcomes than one that tries to handle everything. The interplay between agents — where one's output becomes another's input — is what makes Squad feel like a team rather than parallel solo workers.
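To show why boundaries are as concrete as expertise, here is a sketch of two of them expressed as rules: no self-approval, and stay within owned files. Squad's real charters are the markdown shown above; this in-memory shape and the `decide` function are hypothetical.

```typescript
// Hypothetical in-memory view of a charter.
interface Charter {
  agent: string;
  owns: string[];             // file suffixes this agent is responsible for
  canApproveOwnWork: boolean;
}

interface ReviewRequest { author: string; reviewer: string; files: string[] }

type Decision = "review" | "escalate";

// Escalate on self-review or when any file falls outside the owned suffixes.
function decide(charter: Charter, request: ReviewRequest): Decision {
  if (request.author === request.reviewer && !charter.canApproveOwnWork) return "escalate";
  const outOfScope = request.files.filter((file) =>
    charter.owns.every((suffix) => file.slice(-suffix.length) !== suffix)
  );
  return outOfScope.length === 0 ? "review" : "escalate";
}
```

An agent that returns "escalate" here is doing its job, not failing at it.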

Ceremonies: On-Demand Structured Workflows

Ceremonies are repeatable workflows you trigger when needed:

# Ceremonies & Rituals

## Design Review

**When:** Before PRD implementation begins
**Who:** <list of named agents>
**Purpose:** Validate requirements, issue templates, and process flow before work starts

## Retrospectives

**When:** After major deliveries (GitHub Projects setup, issue templates, Actions automation)
**Who:** All team members
**Facilitator:** <single agent name>
**Purpose:** Reflect on what worked, what didn't, continuous improvement

## Cross-Repo Sync

**When:** As needed
**Owner:** <single agent name>
**Purpose:** Ensure coordination across all projects (reads repos.json for scope)

Ceremonies encode your team's best practices into executable workflows that any agent can run consistently.

Try This Now

With Squad running in an interactive Copilot CLI session, assign work to the team.

# Then talk to the team:
"Team, fan out and review this PR for security issues"
"Ralph go"

What I Learned at Level 3

The agents and charter system is what makes Squad click. With it, you have agents that maintain consistent behavior, remember decisions, and build expertise over time. Without it, you have "Copilot with extra steps."

The real insight: diversity, expertise, coordination, and boundaries create quality. When the reviewer can't approve its own work, when the docs writer must verify against actual code, when the security agent escalates instead of guessing, the team produces better results than any single agent could alone.

The honest trade-off: Squad requires investment in codifying your work patterns and practices. A poorly-defined agent is worse than no agent because it gives inconsistent results. Spend the time upfront.


Level 4: Autonomous Operations

🖥️ VS Code · ⌨️ CLI · 🤖 Autonomous · 💻 Local · ☁️ Cloud

Watercolor illustration of workers building furniture at separate workbenches in a woodworking shop

The helpers are working independently across the shop — each at their own bench, each building a different piece from your specifications. One saws, one hammers, one planes. You glance across the room and trust the work because you wrote clear blueprints for your team of experts. They don't need you hovering.

Level 4 is where the work has been fully defined and you just need it completed. You've already figured out what needs to happen — now you hand it off and let the system execute.

This is the difference between "AI that helps me work" and "AI that does the work I've specified."

Five Ways to Run Autonomously

1. VS Code Agent Mode

🖥️ VS Code · 🤖 Autonomous · 💻 Local

In VS Code, Copilot's agent mode executes multi-step tasks — reading files, running commands, editing code — without manual intervention. You describe the outcome, and agent mode figures out the steps:

# In VS Code Copilot Chat (agent mode):
"Refactor all API handlers to use the new error envelope format"

Agent mode uses your custom instructions and MCPs from Level 2, so it already knows your project's conventions. Best for: well-scoped tasks while you're in the IDE.

2. Copilot CLI Agent Mode

⌨️ CLI · 🤖 Autonomous · 💻 Local

The standalone CLI provides the same autonomous execution outside VS Code:

# CLI agent mode — executes the full task autonomously
copilot -p "Refactor all API handlers to use the new error envelope format"

Best for: well-scoped tasks from the terminal, scripted workflows, or when you prefer the command line over the IDE.

3. "Ralph, go" — Squad Work Queue (In-Session)

⌨️ CLI · 🤖 Autonomous · 💻 Local

Ralph, the Squad work monitor, processes your entire work queue autonomously within a Copilot session. First, connect Squad to your repo's issues ("pull issues from owner/repo"). Then Ralph triages those issues, assigns work to the right specialist agents, monitors progress, and keeps going until the board is clear:

# In a Copilot CLI session with Squad:
copilot --agent squad

# Then:
"Ralph, go" # → Starts processing the work queue
"Ralph, status" # → Shows what's open, stalled, or ready to merge

Ralph monitors GitHub issues, triages incoming work, and drives tasks through your agent team without you intervening. It doesn't stop between tasks — it keeps cycling until everything is done.

Best for: in-session work queue processing, multi-agent coordination, and batching related tasks.
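The "open, stalled, or ready to merge" view can be sketched as a simple classification over work items. The fields, the 24-hour stall threshold, and the function names are assumptions for illustration; this is not how Ralph is implemented.

```typescript
// Hypothetical work item; not Squad's actual data model.
interface WorkItem {
  id: number;
  hasOpenPr: boolean;
  checksPassing: boolean;
  hoursSinceUpdate: number;
}

type Status = "open" | "stalled" | "ready to merge";

// Ready beats stalled: a passing PR is actionable even if it has sat a while.
function classify(item: WorkItem, stalledAfterHours: number = 24): Status {
  if (item.hasOpenPr && item.checksPassing) return "ready to merge";
  if (item.hoursSinceUpdate >= stalledAfterHours) return "stalled";
  return "open";
}

// Group item ids by status for a board-style summary.
function summarize(items: WorkItem[]): Record<Status, number[]> {
  const board: Record<Status, number[]> = { "open": [], "stalled": [], "ready to merge": [] };
  for (const item of items) {
    board[classify(item)].push(item.id);
  }
  return board;
}
```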

4. Squad Watch — Persistent Local Monitoring

⌨️ CLI · 🤖 Autonomous · 💻 Local

When you're away from the keyboard but your machine is on, squad watch provides persistent polling of your GitHub issues:

# Polls every 10 minutes (default)
npx @bradygaster/squad-cli watch

# Custom intervals
npx @bradygaster/squad-cli watch --interval 5 # every 5 minutes
npx @bradygaster/squad-cli watch --interval 30 # every 30 minutes

This runs as a standalone local process (not inside Copilot) that auto-triages issues from your connected repo, assigns work based on team roles and keywords, and routes issues to agents or @copilot for pickup. It runs until you Ctrl+C. (Requires the same repo connection set up via Squad.)

Best for: overnight monitoring, catching issues while you're in meetings, and persistent triage between active sessions.
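Keyword-based assignment of the kind described above can be sketched in a few lines. The role shape, the keywords, and the matching rule (first role with any keyword found in the title) are assumptions; the tool's real routing logic may differ.

```typescript
// Hypothetical role definition: an agent name plus keywords it responds to.
interface Role { agent: string; keywords: string[] }

// Route an issue title to the first role with a matching keyword
// (case-insensitive substring match), or to a fallback assignee.
function routeIssue(title: string, roles: Role[], fallback: string): string {
  const text = title.toLowerCase();
  for (const role of roles) {
    if (role.keywords.some((keyword) => text.indexOf(keyword.toLowerCase()) !== -1)) {
      return role.agent;
    }
  }
  return fallback;
}
```

Substring matching is crude, so in practice the keyword lists need care to avoid accidental matches.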

5. Copilot Cloud Agent (GitHub Issues)

☁️ Cloud · 🤖 Autonomous · 🌐 GitHub.com

Assign a GitHub issue to Copilot and it works independently — no terminal open, no local setup. The cloud agent runs in a GitHub Actions-powered ephemeral environment: it researches your repo, creates a plan, makes code changes, and opens a PR.

Trigger it from a GitHub issue comment:

<!-- In a GitHub issue comment: -->
@copilot implement this

Or assign the issue to Copilot directly from the GitHub Issues UI, VS Code, JetBrains, or the GitHub CLI.

The cloud agent works best for well-scoped, clearly described issues: "add a new endpoint that follows the existing pattern," "write tests for this module," "update the config schema to support the new field." Think of this as a single async task you hand off — describe the outcome clearly, and come back to a PR.

It uses GitHub Actions minutes and Copilot premium requests, so you're trading compute for time. No local session required; the work happens entirely on GitHub's infrastructure.

When to Use Each

| Approach | Best For | Runs On | Requires Active Session? |
| --- | --- | --- | --- |
| VS Code agent mode | IDE-scoped tasks | Your machine | Yes |
| copilot CLI | Terminal-scoped tasks | Your machine | Yes |
| "Ralph, go" | Work queue + coordination | Your machine (Squad) | Yes |
| squad watch | Persistent monitoring | Your machine (background) | No (standalone process) |
| Copilot cloud agent | Issue-driven implementation | GitHub's cloud | No (fully async) |

The Autonomy Spectrum

These options form a spectrum from "I'm here watching" to "I'm asleep":

You're present              You're away              You're asleep
─────────────────────────────────────────────────────────────────
VS Code agent mode     →    squad watch         →    Copilot cloud agent
Copilot CLI            →    (machine on)        →    (GitHub cloud)
Ralph, go

What I Learned at Level 4

The key insight: autonomous execution requires fully-defined work. The quality of autonomous output is directly proportional to how clearly the task was specified. Vague issues get vague results. A well-written issue with acceptance criteria, examples, and constraints? That's where autonomous execution shines.
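One way to enforce that before handing work off is a small readiness check on the issue text. The section names mirror the three ingredients named above, but treating them as literal headings in the issue body is an assumed convention, as are the function names.

```typescript
// Sections an issue should contain before it is handed to an agent.
const requiredSections: string[] = ["Acceptance criteria", "Examples", "Constraints"];

// Return the required sections that do not appear in the issue body
// (case-insensitive substring match).
function missingSections(issueBody: string): string[] {
  const body = issueBody.toLowerCase();
  return requiredSections.filter((section) => body.indexOf(section.toLowerCase()) === -1);
}

function isReadyForAgent(issueBody: string): boolean {
  return missingSections(issueBody).length === 0;
}
```

A check like this only proves the sections exist; whether they are well written still takes a human read.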

The cloud agent on GitHub is the lowest-friction option — no local setup, just assign an issue. squad watch bridges the gap between active sessions and cloud — your machine monitors and triages even when you're not in a Copilot session. Ralph is best when you're actively working through a backlog and want coordinated multi-agent execution.


Finding Your Path

Not everyone takes the same route through these levels:

| Role | Start Here | Quick Win | Level Up |
| --- | --- | --- | --- |
| Engineer | Level 1 (completions + CLI) | Custom instructions for your stack | Skills for repeatable reviews |
| PM/Content | Level 1 (chat for drafting) | Custom instructions for voice/style | Squad ceremonies for sweeps |
| Team Lead | Level 2 (instructions + MCPs) | Skills for team processes | Squad for coordinated reviews |
| Platform | Level 2 (MCP + infra context) | Squad for monitoring | Squad for always-on monitoring |

The Ecosystem at a Glance

The Copilot ecosystem is growing fast. Here are the key resources:

Essential Tools

Learning & Community

Infrastructure


What Actually Changed for Me

I want to be honest about what's different after six months at Level 3+:

What improved:

  • PR turnaround dropped from days to hours (the green shirts handle first-pass review)
  • Documentation stays in sync with code (sweep ceremonies catch drift)
  • I work in unfamiliar codebases with dramatically less ramp-up time
  • Boilerplate tasks that used to take 30 minutes take 2 minutes
  • Skills encode my best practices — I define a process once, it runs the same way forever

What didn't change:

  • Architecture decisions still require human judgment
  • Debugging subtle logic errors still requires deep thought
  • Agent output needs review — trust but verify
  • Writing good charters and instructions is a skill that takes time to develop, update, and improve

The mental model shift: I stopped thinking "what code do I need to write?" and started thinking "what work needs to happen, and who should do it?" Sometimes the answer is me — blue shirt at the bench, swinging the mallet. Often it's a green shirt with clear instructions and a well-scoped task.


Start Today

You don't need to plan all four levels. Start where you are:

Never used Copilot? → Install the extension, write a comment, press Tab. That's it.

Using Copilot but it's generic? → Write a copilot-instructions.md file and one skill. 10 minutes, massive payoff.

Want more than autocomplete? → Install Squad CLI, write one agent charter, run copilot --agent squad.

Ready for autonomous execution? → Try copilot --autopilot on a well-defined task, or assign an issue to the cloud agent.

The progression is natural. Each level solves a real problem you'll discover at the previous one. And unlike most "AI transformation" pitches, you can validate the value at every step before investing in the next.

The future of development isn't AI replacing developers. It's developers who know how to orchestrate AI systems outperforming those who don't. The tools are here. The ecosystem is open source. The only question is which level you start at.

Want to go further? The next post in this series covers Cloud-Scale Agent Fleets for Level 5 — coming soon.


📣 GitHub Copilot Dev Days — Next Week!

Want to go deeper? GitHub Copilot Dev Days are happening next week with sessions in multiple languages and time zones:

These are free, virtual events covering the latest in Copilot extensibility, agentic development, and the ecosystem tools discussed in this post. See you there!


Have questions or want to share your own journey? Find me on GitHub at @dfberry or check out my other posts on the Copilot ecosystem.

Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills

· 13 min read

Watercolor illustration of a craftsperson's workbench being tidied and organized

Optimizing skills felt less like deleting content and more like reorganizing a workshop — same tools, better drawers.

I'd been adding to the .copilot/skills/ directory for a while without taking inventory. Every feature or domain onboarding meant a new skill — sometimes three. The assumption was obvious: more skills = more consistency. For the first few dozen, that was true.

Here's what's weird: I had no actual count.

When I finally looked: 413,591 tokens across 136 skill and reference files (117 distinct skills). Just measuring it revealed the bloat:

  • 6 SDK sample review skills: 140K tokens (34% of total budget)
  • Dead redirect stubs: consuming tokens for no routing purpose
  • Duplicated prose: same guidance repeated across language variants

Not a disaster, but the kind of creeping growth that happens when you build fast and don't audit.

Why this matters: Skills load on demand, so optimizing them doesn't free your active context window. But faster agent spawns and cheaper skill loading? That's a different lever, and I wanted to pull it.

The patterns I found — reference extraction, checklist compression, shared references — work with any tool. I used GitHub Copilot CLI with Squad orchestration to run them in parallel, but you could apply them manually in any editor. The techniques are the point, not the tooling.

Measuring Token Usage with microsoft/waza

The first move: measure. Waza is a skill quality toolkit, and waza_tokens count does exactly that — scans your skills directory and gives you sorted token usage. No guessing. Here's the breakdown:

$ waza_tokens count .copilot/skills/
┌─────────────────────────────────┬────────┐
│ Skill │ Tokens │
├─────────────────────────────────┼────────┤
│ data-plus-ai-sdk-java-sample... │ 25,841 │
│ data-plus-ai-sdk-python-samp... │ 23,921 │
│ ... │ │
│ dina-small-utility │ 312 │
├─────────────────────────────────┼────────┤
│ Total: 117 skills │413,591 │
└─────────────────────────────────┴────────┘

25K tokens for a single skill. That's the starting point. Waza has other tools too — waza_tokens suggest for optimization ideas, waza_quality to verify post-changes, and waza_dev --copilot for frontmatter — but for this work, count was the diagnostic tool.
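If you can't run waza, a rough stand-in is easy to sketch. This is not how waza counts tokens: it assumes a crude 4-characters-per-token heuristic and a layout where each top-level folder under the skills directory is one skill. The function names are hypothetical.

```python
from pathlib import Path

# Assumption: ~4 characters per token. A real tokenizer will give different numbers.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Approximate a token count from character length, rounding up."""
    return (len(text) + CHARS_PER_TOKEN - 1) // CHARS_PER_TOKEN


def count_skill_tokens(skills_dir: str) -> list[tuple[str, int]]:
    """Sum estimated tokens of every .md file per top-level skill folder, largest first."""
    root = Path(skills_dir)
    totals: dict[str, int] = {}
    for md_file in root.rglob("*.md"):
        # The first path component under the root is treated as the skill name.
        skill_name = md_file.relative_to(root).parts[0]
        text = md_file.read_text(encoding="utf-8")
        totals[skill_name] = totals.get(skill_name, 0) + estimate_tokens(text)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Point `count_skill_tokens` at `.copilot/skills/` and read the result as relative sizes rather than exact token counts.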

Planning the Work

The strategy: decompose into phases, ordered by savings potential. Clear the big ones first; small wins come after.

| Phase | Target | Est. Savings |
| --- | --- | --- |
| 1. Kill stubs | 3 empty redirect skills | ~73 tokens |
| 2. Refactor giants | 6 SDK review skills (140K!) | ~120K tokens |
| 3. Optimize large | 14 skills (5K–10K each) | ~30–50K tokens |
| 4. Optimize medium | 60 skills (1K–5K each) | ~10–20K tokens |
| 5. Trim small | 20 skills (under 1K each) | minimal |
| 6. Audit references | Large reference files | ~10–15K tokens |

Why this order matters: I started with stubs not because they saved much, but because they reduced noise before the real work. Phases 2–3 capture the bulk of savings. Phases 4–5 are diminishing returns per skill, but we completed them efficiently by applying the same patterns we'd already proven earlier.
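The phase plan amounts to bucketing skills by size and working from the biggest bucket down. Here is a minimal sketch, assuming you already have per-skill token counts. The 5K and 1K thresholds and the 50-token stub cutoff come from the numbers in this post; the 10K cutoff for "giant" and the skill names in the usage are my own assumptions.

```python
def plan_phases(token_counts: dict[str, int]) -> dict[str, list[str]]:
    """Bucket skills into optimization phases by token count, largest first."""
    phases: dict[str, list[str]] = {
        "stub": [], "giant": [], "large": [], "medium": [], "small": []
    }
    ranked = sorted(token_counts.items(), key=lambda item: item[1], reverse=True)
    for name, tokens in ranked:
        if tokens < 50:          # redirect stubs: almost no real content
            phases["stub"].append(name)
        elif tokens > 10_000:    # assumed cutoff for the "giants"
            phases["giant"].append(name)
        elif tokens >= 5_000:
            phases["large"].append(name)
        elif tokens >= 1_000:
            phases["medium"].append(name)
        else:
            phases["small"].append(name)
    return phases


# Hypothetical skill names, for illustration only.
plan = plan_phases({
    "java-review": 25_841,
    "content-gen": 7_200,
    "helper": 2_400,
    "util": 312,
    "redirect": 24,
})
print(plan)
```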

Phase 1: Killing the Stubs

Problem: Three skills were redirect stubs — they pointed to other skills and had fewer than 50 tokens of actual content. No routing logic, no value.

Action: Deleted them.

Result: −73 tokens. Barely registers numerically, but this is the "boring is good" work. A clean directory is easier to maintain, and stubs confuse future maintainers.

Phase 2: The Giants

Problem: Six SDK sample review skills (Java, Python, Go, .NET, TypeScript, Rust) had identical structure: 15–16 detailed rule sections + full code examples inline. Total: 140K tokens (34% of budget). Agents loaded everything every time, even when they only needed one language's rules.

Technique: Reference Extraction. Move verbose rules and examples to references/ files. Keep SKILL.md slim — just routing info, a quick checklist, and blockers. Agents load the overview immediately, fetch detailed rules on demand.

Before:

java-sdk-review/SKILL.md (25,841 tokens)
├── Routing info (2K)
├── Error handling rules (8K + full examples)
├── Concurrency rules (7K + full examples)
├── Async patterns (6K + full examples)
└── ... 12 more sections ...

After:

java-sdk-review/SKILL.md (1,541 tokens)
├── Routing: detect Java SDK samples
├── Quick checklist
│ ├── Error handling: caught, logged, meaningful messages
│ ├── Concurrency: thread-safe, no race conditions
│ ├── Async patterns: proper callback/future chaining
│ └── ... 5 more items
└── Reference files in references/java/ (loaded on demand)

Two-tier architecture. Same content, loaded smarter.

Execution: Ran all six in parallel:

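| Skill | Before | After | Reduction |
| --- | --- | --- | --- |
| Java | 25,841 | 1,541 | 94% |
| Python | 23,921 | 1,083 | 95% |
| Go | 24,355 | 1,815 | 93% |
| .NET | 23,355 | 1,378 | 94% |
| TypeScript | 21,543 | 1,525 | 93% |
| Rust | 21,303 | 1,643 | 92% |
| Total | 140,318 | 8,985 | ~131K saved |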

Trade-off: Agents now navigate a two-tier structure (SKILL.md → references/) instead of one flat file. Discoverability costs something. But these skills are used frequently enough that agents will learn the pattern. Zero content was removed — every rule and example is still there, just reference-extracted.
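To make the mechanics concrete, here is a rough sketch of reference extraction as a text transform. It is not the tooling I used; it assumes each section of a SKILL.md starts with a level-2 markdown heading (`## `), and the function name, the `keep` set, and the file naming scheme are all hypothetical.

```python
import re


def extract_references(skill_md: str, keep: set[str]) -> tuple[str, dict[str, str]]:
    """Keep sections named in `keep` inline; move every other section to a reference file."""
    parts = re.split(r"(?m)^## ", skill_md)
    preamble, sections = parts[0], parts[1:]
    slim = [preamble.rstrip()]
    references: dict[str, str] = {}
    for section in sections:
        title, _, body = section.partition("\n")
        title = title.strip()
        if title in keep:
            slim.append(f"## {title}\n{body.rstrip()}")
            continue
        # Build a file name from the heading, e.g. "Error handling rules" -> error-handling-rules.md
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        filename = f"references/{slug}.md"
        references[filename] = f"# {title}\n{body.strip()}\n"
        # Leave a one-line pointer so the content stays discoverable.
        slim.append(f"## {title}\nSee {filename} (load on demand).")
    return "\n\n".join(slim) + "\n", references


original = (
    "# Java SDK review\nIntro.\n\n"
    "## Quick checklist\n- Errors caught\n\n"
    "## Error handling rules\nLong rules...\nMore.\n"
)
slim_md, reference_files = extract_references(original, keep={"Quick checklist"})
```

Nothing is deleted: the long section moves into `reference_files`, and the slim file keeps a pointer to it.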

Phase 2 complete: SDK skills before/after showing a 92–95% reduction per language

Phase 3: Large Skills

Problem: 14 more skills in the 5K–10K range had the same structure: verbose sections that could be extracted. Examples: azure-mcp-content-generation, dina-reskill, context-diagnostics.

Action: Applied the same reference extraction pattern in 4 parallel batches.

Result: −68,084 tokens (76% reduction)

Running total:

Phase 1 (stubs):   −73 tokens
Phase 2 (giants): −131,333 tokens
Phase 3 (large): −68,084 tokens
────────────────────────────────
Subtotal saved: ~199,490 tokens (48% of starting budget)
Remaining: ~214,101 tokens

At this point, the curve was clear. Phases 4–5 (medium and small skills) would yield diminishing returns per unit effort — but having proven the techniques in Phases 1–3, we already knew how to apply them efficiently at scale.
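The running total is simple arithmetic, and it is worth scripting so the numbers stay honest as phases land. A small sketch using the figures above; the function name is hypothetical.

```python
def summarize_savings(starting_tokens: int, saved_by_phase: dict[str, int]) -> dict[str, float]:
    """Return total saved, tokens remaining, and percent of the starting budget saved."""
    saved = sum(saved_by_phase.values())
    return {
        "saved": saved,
        "remaining": starting_tokens - saved,
        "percent_saved": round(100 * saved / starting_tokens, 1),
    }


summary = summarize_savings(
    413_591,
    {"Phase 1 (stubs)": 73, "Phase 2 (giants)": 131_333, "Phase 3 (large)": 68_084},
)
print(summary)
```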

The PR and the Review

Problem: The PR touched 106 files, with 18,571 deletions and 12,176 insertions. Before shipping, we needed to verify:

  • Structural integrity (paths, syntax, references valid)
  • Quality didn't regress (waza_quality scores)
  • Routing logic still precise
  • Didn't over-trim skills below usefulness

Action: Ran four automated review passes:

  1. Structural integrity check — passed
  2. Waza quality verification — passed with notes
  3. Trigger precision validation — passed with notes
  4. Adversarial over-trimming check — caught 2 real issues

Issues found and fixed:

  1. Reference file with broken relative path (in Phase 2)
  2. Skill trimmed below ~800 tokens (lost routing context entirely)
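The broken relative path is the kind of issue a small script can catch before review. This is a sketch rather than the check we actually ran: it assumes references are written as markdown links, and the function name and file paths are hypothetical.

```python
import posixpath
import re

# Captures the target of a markdown link, stopping at ")", "#", or whitespace.
LINK_TARGET = re.compile(r"\]\(([^)#\s]+)")


def find_broken_references(skill_path: str, skill_md: str, existing_files: set[str]) -> list[str]:
    """Return relative link targets in a SKILL.md that do not resolve to a known file."""
    base_dir = posixpath.dirname(skill_path)
    broken = []
    for target in LINK_TARGET.findall(skill_md):
        if "://" in target:  # external URL, not a file reference
            continue
        resolved = posixpath.normpath(posixpath.join(base_dir, target))
        if resolved not in existing_files:
            broken.append(target)
    return broken


broken = find_broken_references(
    "java-sdk-review/SKILL.md",
    "[errors](references/java/errors.md) [async](references/java/asnyc.md) [site](https://example.com)",
    {"java-sdk-review/references/java/errors.md"},
)
print(broken)
```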

Second pass: ✅ SHIP

Pull request showing 65% Copilot skills token reduction across 106 files

Key finding: Don't reduce a SKILL.md below ~800 tokens for standalone skills. Below that threshold, you lose enough routing context that agents can't determine when or how to use the skill. Exception: Skills with strong internal routing logic (like the unified SDK skill at 469 tokens) can go lower because their dispatch logic compensates.

The ~800-token floor is a practical boundary, discovered through testing.
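If you want the floor enforced rather than remembered, a guard like the following could run alongside the token count. It is a sketch: the exemption set for skills with internal dispatch is my assumption about how you would track the exception, and `tiny-skill` is a made-up name.

```python
TOKEN_FLOOR = 800  # practical boundary from this post, not a hard rule


def below_floor(skill_tokens: dict[str, int], has_dispatch: set[str]) -> list[str]:
    """List standalone skills whose SKILL.md fell under the floor."""
    return sorted(
        name
        for name, tokens in skill_tokens.items()
        if tokens < TOKEN_FLOOR and name not in has_dispatch
    )


flagged = below_floor(
    {"azure-sdk-sample-review": 469, "dina-reskill": 1_163, "tiny-skill": 640},
    has_dispatch={"azure-sdk-sample-review"},
)
print(flagged)
```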

Phases 4–6: The Curve Flattens

After Phase 3, ~214K tokens remained. Phases 4–6 brought that down to 143K — another ~70K saved by applying the same patterns (checklist compression, reference extraction, deduplication) at smaller scale:

| Phase | Skills | Technique | Savings |
| --- | --- | --- | --- |
| 4. Medium | 60 skills (1K–5K each) | Checklist compression, dedup | ~45K |
| 5. Small | 20 skills (under 1K each) | Light trimming | ~10K |
| 6. Reference audit | Large reference files | Consolidation | ~15K |

The per-skill ROI drops in later phases, but having proven the techniques in Phases 1–3, the work was mechanical — same patterns, smaller targets. The current 143K total is sustainable for the usage pattern.

Final Numbers

Before:  413,591 tokens (117 skills, 136 files)
After: 143,354 tokens (114 skills, 130 files)
Saved: 270,237 tokens (65.3% reduction)

This reflects the main optimization PR. The workbench is cleaner — every tool is still there, but they're in labeled drawers instead of piled on the surface. The Bonus Round consolidation happened in a separate session and is described next.

Bonus Round: From Shared References to Unified Skills

After the optimization PR shipped, I ran waza_quality on the six SDK skills and noticed: isolation violation. Each skill had its own SKILL.md, its own routing, its own boilerplate duplicated across six files. That's a pattern violation — not atomic, not clean.

So I rethought it. Three options existed:

  1. Accept the low score: good enough for a domain-specific exception
  2. Inline shared content: copy the 14 shared reference files into each skill and multiply the maintenance burden
  3. Single skill with language dispatch — collapse all 7 (6 SDK languages + 1 quickstart) into one unified skill with language auto-detection

I went with option 3. Not because it was obvious, but because it matched the actual use case: agents almost never review samples in all languages simultaneously. They review one language based on the codebase. A single skill with smart routing was actually more correct than the multi-skill pretense.

Result: azure-sdk-sample-review/ — one skill, 469 tokens in SKILL.md, language auto-detection via prompt analysis. Structure:

azure-sdk-sample-review/
├── SKILL.md (469 tokens) — routing + dispatch logic
├── evals/ (7 tasks, 100% passing)
├── references/
│ ├── shared/ (14 files: generic best practices)
│ ├── dotnet/, go/, java/, python/, rust/, typescript/
│ └── quickstart/

Before: 86 files across 6 language folders — Java, Python, Go, .NET, TypeScript, Rust — each with its own copy of the same 14 generic reference files. 45% pure duplication. Update one? Remember to update five more.

After: One shared/ folder holds the 14 generic files. Each language folder keeps only what's actually unique to that SDK. One update, one place. Six duplicated routing SKILL.md files collapsed to a single dispatch mechanism. Waza compliance achieved — no more isolation violations. All 7 behavioral evals running at 100%.
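Finding candidates for a shared/ folder can be automated by grouping files with identical content. A minimal sketch, assuming byte-for-byte duplicates (near-duplicates would need a fuzzier comparison); the function name and file names are hypothetical.

```python
import hashlib
from collections import defaultdict


def find_shared_files(files: dict[str, str]) -> dict[str, list[str]]:
    """Group file paths by content hash and return only groups with more than one path."""
    by_digest: dict[str, list[str]] = defaultdict(list)
    for path, content in files.items():
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        by_digest[digest].append(path)
    return {digest: sorted(paths) for digest, paths in by_digest.items() if len(paths) > 1}


groups = find_shared_files({
    "java/naming.md": "Use clear names.",
    "python/naming.md": "Use clear names.",
    "java/futures.md": "Chain futures.",
})
print(list(groups.values()))
```

Each group is one file that could live once under shared/ instead of once per language.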

This is the evolution: reference extraction → shared references → unified skill with internal routing. Each step felt right at the time. Looking back, the final architecture is simpler and more correct. Wish I'd seen it from the start. (Work captured in a follow-up PR.)

Dogfooding: The Reskill Skill

I captured the optimization pipeline as a skill — dina-reskill — documenting the 8-pattern workflow (reference extraction, checklist compression, example pruning, and so on).

Then I ran it on itself, because apparently I can't leave well enough alone:

SKILL.md:  2,085 → 1,163 tokens (44% reduction)
Total: 5,401 → 4,288 tokens (21% reduction)

Three review passes: two approvals, one note flagged and fixed. The skill practices what it documents. The SDK skills themselves evolved further after this (described in Bonus Round) — from six separate skills down to a single unified skill. So while dina-reskill captures the second-pass improvements here, the SDK consolidation shows how those patterns continue to evolve as you live with them.

What Actually Worked: The Patterns

If you're building a skills optimization workflow, here are the patterns ranked by impact:

1. Reference Extraction

Principle: Move detailed rules, code examples, and verbose explanations into references/ files. The SKILL.md becomes a slim routing layer — overview, quick checklist, blocker list. Agents load references on demand.

When to use: For any skill over 5K tokens, this should be your first move. Start here, not somewhere else.

Example: The Java SDK skill went from 25,841 tokens (inline rules + examples) to 1,541 tokens (routing + checklist) by extracting ~24K into references/java/.

2. Checklist Compression

Principle: Turn paragraph-style guidance into concise checklists. Same information, fraction of the tokens.

Example:

  • Before: "When reviewing error handling, ensure that all errors are properly caught, logged with appropriate context, and returned with meaningful messages to the caller"
  • After: "✅ Errors: caught, logged with context, meaningful messages"
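For a quick sense of the saving in that example, whitespace-separated word counts work as a crude proxy. Words are not tokens, so treat the ratio as directional only.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words as a rough stand-in for tokens."""
    return len(text.split())


before = (
    "When reviewing error handling, ensure that all errors are properly caught, "
    "logged with appropriate context, and returned with meaningful messages to the caller"
)
after = "✅ Errors: caught, logged with context, meaningful messages"

reduction = 1 - word_count(after) / word_count(before)
print(f"{word_count(before)} words -> {word_count(after)} words ({reduction:.0%} fewer)")
```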

3. Example Pruning

Principle: One good example per pattern. If your skill has 3 examples of the same concept, keep the clearest one and move the others to references.

4. Shared References → Unified Skill Routing

Principle: If multiple skills share common guidance, the first instinct is to extract it once and link. That works, but it's often a stepping stone to something better: collapsing near-identical skills into one skill with internal dispatch logic.

When this works: When you have N near-identical skills differing only by one dimension (language, framework, etc.), a unified skill with auto-detection is cleaner than N separate skills with shared references.

Trade-off: One SKILL.md, one set of evals, one routing boundary. Zero isolation violations. But your SKILL.md becomes more complex.
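Inside a unified skill, dispatch is expressed as instructions in SKILL.md rather than code, but the logic is easy to picture as a function. The sketch below is an illustration only: it detects the language from file extensions mentioned in the prompt, which is my simplification of "prompt analysis", and the mapping and function names are hypothetical. The folder names match the structure shown earlier.

```python
import re
from typing import Optional

# Assumed mapping from file extension to the reference folder for that language.
EXTENSION_TO_LANGUAGE = {
    "java": "java",
    "py": "python",
    "go": "go",
    "cs": "dotnet",
    "ts": "typescript",
    "rs": "rust",
}
EXTENSION_PATTERN = re.compile(r"\.(java|py|go|cs|ts|rs)\b")


def detect_language(prompt: str) -> Optional[str]:
    """Return the language implied by the first recognized file extension, if any."""
    match = EXTENSION_PATTERN.search(prompt.lower())
    return EXTENSION_TO_LANGUAGE[match.group(1)] if match else None


def references_to_load(prompt: str) -> list[str]:
    """Always load shared guidance; add one language folder when it can be detected."""
    folders = ["references/shared/"]
    language = detect_language(prompt)
    if language is not None:
        folders.append(f"references/{language}/")
    return folders


print(references_to_load("Review samples/BlobUpload.java for error handling"))
```

The point of the design is that only one language folder is ever loaded per review, which is how the single skill stays cheap.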

5. Stub Elimination

Principle: If a skill just redirects to another skill, delete it. The router doesn't need a placeholder, and stubs confuse future agents trying to decide what to use.

Honest Lessons: How I Should Have Run This

The work happened over 8 user messages and 2 hours. Here's what went sideways and what would have prevented it:

| What Happened | What Would Have Been Better |
| --- | --- |
| SDK dedup discovered late (turn 6–8) | Mention "deduplicate shared content" upfront as a known phase |
| Asking about PR + review + results separately | Bundle deliverables: "PR, team review, results file" in one request |
| Phases 4–5 required separate prompts | Front-load scope: "all phases including medium skills" keeps momentum |

The pattern that would have worked: put three things in the first message:

  1. The technique or tool (waza_tokens, reference extraction, etc.)
  2. Full scope with known edge cases (all 6 phases, ~800-token floor, etc.)
  3. All deliverables you want at the end

Front-load scope, technique, and deliverables in one message. The AI doesn't lose patience — you do. Every "keep going" prompt is a planning failure you're paying for at execution prices.

The Setup

For reference, here's what I was running:

Where to Go From Here

To determine if your own skills directory needs this treatment: run waza_tokens count and see the total. If it's over 100K tokens, you have meaningful room to optimize. If you have skills over 5K tokens, reference extraction is almost always worth it.

Everyone's skill architecture is different — the interesting work is figuring out which patterns actually fit your setup. If you try these and discover something that works or something that breaks, I'd be curious to hear what you found.

The same workbench, now organized — tools on pegboard, labeled drawers, clean surface

Same workshop. Same tools. Better organized. That's what 270K tokens of optimization looks like.

Main optimization session: May 11, 2026. 8 user messages, ~2 hours, 270K tokens saved. The Bonus Round consolidation happened in a follow-up session.