
Running AI at Work: A Field Guide to Cost, Craft, and Guardrails

June 28, 2026 · 15 min read

For the engineers building with AI every day — and the leaders setting the guardrails around them.

Take a few minutes to think about where we are. We're past the demo phase. The exciting question used to be "can a model even do this?" — and now it's the much less glamorous "how do we run this every day, at a price we can actually justify, without handing an autonomous process the keys to everything?"

That question really has three threads tangled together, and it helps to pull them apart up front:

The money — don't pay twice, and use the right tool for the job.
The craft — move from one-shot prompts to repeatable, testable processes.
The guardrails — least privilege, real observability, and accountability for things that aren't people.

There's one idea underneath all three: the graduation handoff. Almost everything here describes a single moment — a skill or agent growing up from "you're watching it work, turn by turn" into "it runs in production on its own." Before that moment, you are the scaffolding that earns the trust. After it, your standard production stack — telemetry, observability, authorization — becomes the permanent structure that keeps it. Keep that handoff in the back of your mind. It's the spine of everything below.

That watching phase—where you're present for each turn—is the subject of what comes next. It's not forever, but it's the cost of earning trust.

A woman with pink hair stands at a rocky coastline, watching and thinking

Part 1 — The money: stop paying twice

Pay for tokens once

Here's the thing about tokens: they're wonderful for exploring and terrible as a permanent runtime. Once you really understand a workflow, push it toward determinism — scripts, apps, tests, CI — so you pay the model once to figure it out, not on every single run.

I want to be straight with you, though: determinism isn't a free lunch, it's a trade. Scripts rot. APIs change, schemas drift, dependencies break. You're swapping a per-run token cost for a smaller, recurring maintenance cost — not erasing the cost.

So when do you actually harden something? It comes down to stakes. If a workflow is quick and clearly defined, move it to determinism early. If the consequences are bigger, give it a middle ground: let it mature with AI first, and as individual parts settle, move those parts into hardened, deterministic form — within the real-world limits of security, maintenance, and interoperability. The two failure modes to avoid: hardening a moving target, and paying frontier prices for something that's already settled.

Use the right tool for the job — on a cost gradient

"Use AI for everything" isn't a strategy. The real principle is the boring, durable one you already know: right tool for the job. And the tools sit on a gradient — a classic deterministic tool, then a small or local model, then a frontier model. Pick the cheapest rung that actually does the work.

Choosing the right one saves money, but more importantly, it saves cognitive load. You're not wrestling with overkill; you're matching capability to need.

A woman in a forest selecting the right tool from many options

Spelling and grammar checking is a nice illustration, precisely because it usually lands on the deterministic rung. Mature tools handle it cheaply and predictably, so you'd never reach for a frontier model; it's the same instinct as "pay once." The interesting rung is the one above deterministic but below frontier, and that's where small or local models genuinely earn their keep: classification, routing, redaction, embeddings. That's work that needs more flexibility than a fixed tool can give but doesn't need state-of-the-art reasoning.

A concrete example I like: redacting PII from your logs before they're stored. A regex can't reliably catch a name or address it's never seen. A frontier model is overkill on every log line. A small local model is right in the sweet spot — flexible enough to generalize, cheap enough to run on every write, and local so the sensitive data never leaves your boundary.

Here's how to think about it systematically:

Cost gradient diagram

Turn that gradient into a system: model routing

Don't leave "good enough vs. state of the art" as a gut feeling. Systematize it with model routing, or cascades: try the cheap model first, and escalate to the frontier one only when confidence is low. Same instinct as above — but now it's a measurable, tiered spend you can actually reason about instead of a vibe.
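To make the cascade idea concrete, here's a minimal sketch. Everything in it is an assumption for illustration: the `Model` interface that returns an answer plus a confidence score, the `0.8` threshold, and the two toy models standing in for real ones. Real providers don't hand you a calibrated confidence for free, so treat that part as the piece you'd have to build and measure.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical model interface: takes a prompt, returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Cascade:
    cheap: Model
    frontier: Model
    threshold: float = 0.8  # assumed cut-off; tune it against your own evals
    cheap_calls: int = 0
    frontier_calls: int = 0

    def route(self, prompt: str) -> str:
        """Try the cheap model first; escalate only when its confidence is low."""
        self.cheap_calls += 1
        answer, confidence = self.cheap(prompt)
        if confidence >= self.threshold:
            return answer
        self.frontier_calls += 1
        answer, _ = self.frontier(prompt)
        return answer


# Stand-in models so the sketch runs without any provider SDK.
def toy_cheap(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are easy and long ones are not.
    return ("cheap:" + prompt, 0.95 if len(prompt) < 20 else 0.40)


def toy_frontier(prompt: str) -> Tuple[str, float]:
    return ("frontier:" + prompt, 0.99)


cascade = Cascade(cheap=toy_cheap, frontier=toy_frontier)
easy = cascade.route("tag this ticket")
hard = cascade.route("summarize this forty-page incident report")
print(easy, "|", hard, "|", cascade.cheap_calls, cascade.frontier_calls)
```

The two counters are the point: the ratio of frontier calls to cheap calls is the number you can budget against and watch over time.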

Don't pay for compute you don't need — and remember local buys more than savings

If you don't need cloud or remote access, a small or local model lets you skip the ongoing API bill. But please don't fall for "local is free"; it's a myth. You're trading a pay-as-you-go bill for a big upfront hardware purchase plus a new set of ongoing costs: electricity, wear-and-tear, and the engineer-hours to run and patch the thing. For low or spiky volume, cloud is often both cheaper and more secure than a box you babysit.

So find the real break-even: volume, latency, and data-residency or compliance constraints — not cost alone. And notice the two things local buys you beyond money. First, privacy and residency — the data never leaves your boundary, which is honestly the real reason to go local more often than cost is. Second, a chance to work in an emerging space. Local model management is still young — tools like Ollama, LM Studio, and Docker's Model Runner have made a real start, but enterprise basics are still thin: per-user access control, audit logging, versioning, and cost tracking. If you like building tools, there's real room here.
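A rough way to put a number on the cost side of that break-even is sketched below. Every figure is made up for illustration, and the model is deliberately crude: it assumes the local box has a fixed monthly cost and no per-request cost, and it ignores latency, residency, and compliance, which often decide the question before cost does.

```python
def local_monthly_cost(hardware: float, amortize_months: int,
                       power: float, ops_hours: float, hourly_rate: float) -> float:
    """Fixed monthly cost of a box you own: amortized hardware + power + people time."""
    return hardware / amortize_months + power + ops_hours * hourly_rate


def break_even_requests(local_monthly: float, cloud_price_per_request: float) -> float:
    """Monthly request volume above which the local box beats pay-as-you-go."""
    return local_monthly / cloud_price_per_request


# Illustrative numbers only; plug in your own.
local = local_monthly_cost(hardware=7_200, amortize_months=36, power=40,
                           ops_hours=4, hourly_rate=90)
threshold = break_even_requests(local, cloud_price_per_request=0.002)
print(f"local: ${local:,.0f}/month, break-even: {threshold:,.0f} requests/month")
```

Notice how much of the local cost in this example is engineer time rather than hardware. That's the part "local is free" tends to leave out.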

Treat AI spend like any other tech spend

After all this talk of cost, let me be clear about the goal: it isn't to fret over every dollar. It's the opposite. AI spend deserves the same deliberate budgeting you give every other part of your stack — it's real, it's recurring, and it should be a line item you own on purpose, tracked per team and per workflow.

Part 2 — The craft: from prompt to process

AI builds the prototype; you harden what matters

One of the most durable patterns here is an old one wearing new clothes. AI builds the prototype fast; once it works, you decide what becomes a repeatable, testable, secure process.

What's the artifact of that transition? Today it's a skill, an agent, or some other markdown file — but the format is incidental. What matters is the information that persists: a durable, reviewable source of truth that outlives whatever wrapper happens to hold it.

The work is real and tangible—you're shaping raw materials into something refined and usable:

A woman with pink hair shaping and crafting materials at a table with plants

And here's a bonus that's easy to miss: the hardening boundary is also the line between your two test regimes. What you've hardened gets a deterministic unit test. What stays probabilistic gets an eval — golden sets, LLM-as-judge, acceptance bands. Deciding what to harden is deciding how each piece gets tested. The test surface didn't shrink when AI showed up; it grew to cover both. Good news: the frameworks for this already exist, so you don't have to invent them.
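Here's a small sketch of the two regimes side by side. The function names, the golden set, the keyword-based stand-in classifier, and the `0.75` acceptance floor are all hypothetical; in practice the probabilistic side would call a model and you'd likely use an existing eval framework rather than a hand-rolled loop.

```python
def normalize_sku(raw: str) -> str:
    """Hardened step: pure string logic, so an exact-match unit test applies."""
    return raw.strip().upper().replace(" ", "-")


def test_normalize_sku() -> None:
    # Deterministic regime: one input, one exact expected output.
    assert normalize_sku("  ab 123 ") == "AB-123"


# Probabilistic regime: a golden set of (input, expected label) pairs.
GOLDEN_SET = [
    ("refund my order", "billing"),
    ("card was charged twice", "billing"),
    ("invoice is wrong", "billing"),
    ("app crashes on login", "bug"),
    ("please add dark mode", "feature"),
]


def toy_classifier(text: str) -> str:
    """Stand-in for a model call; imperfect on purpose."""
    if "refund" in text or "charged" in text:
        return "billing"
    if "crash" in text:
        return "bug"
    return "feature"


def pass_rate(classify, golden_set) -> float:
    hits = sum(1 for text, expected in golden_set if classify(text) == expected)
    return hits / len(golden_set)


def within_band(rate: float, floor: float) -> bool:
    """An eval passes when the score stays inside the acceptance band."""
    return rate >= floor


test_normalize_sku()
rate = pass_rate(toy_classifier, GOLDEN_SET)
print(rate, within_band(rate, floor=0.75))
```

The unit test fails on any mismatch; the eval tolerates one miss out of five and fails only when the rate drops below the band. That difference is the hardening boundary in code form.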

Move from one-shot prompts to repeatable processes — and give them an owner

Wrap your deterministic scripts in AI skills that lean toward repeatable processes anyone can pick up. But reuse without ownership just turns into shadow IT — handy, but unowned and ungoverned. So give every reusable thing a home.

A frame I borrowed from a governance thinker in the company helps here: think federal, state, and local, where you, the individual contributor, are local. Federal assets are organization-wide. State builds on federal. Local builds on both. The pipeline updates assets as they change upstream — a local copy can be refreshed from its non-local source instead of drifting out of date. The point is that every reusable unit has an owning tier and an update path. That's the difference between a library and a junk drawer.

Chain and gate skills into workflows — but only what you can watch

You can absolutely chain skills into full workflows — just respect the math, because reliability compounds downward. Five steps at 95% each lands you around 77% end to end. So chain only what you can observe and gate, and keep your chains short.
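The arithmetic is worth running yourself. This sketch assumes each step succeeds or fails independently of the others, which real chains only approximate; the helper names are mine.

```python
import math


def chain_reliability(step_rates) -> float:
    """End-to-end success rate when every step must succeed (independence assumed)."""
    return math.prod(step_rates)


def max_steps(per_step: float, target: float) -> int:
    """Longest chain whose end-to-end reliability stays at or above the target."""
    return math.floor(math.log(target) / math.log(per_step))


five_steps = chain_reliability([0.95] * 5)
print(f"{five_steps:.1%}")      # about 77%
print(max_steps(0.95, 0.90))    # steps you can chain and still hold 90%
```

At 95% per step, holding 90% end to end leaves room for only two steps, which is the case for short chains and for gates that catch failures between them.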

What's a "gate"? Any defined checkpoint, whether a human in the loop or an AI, as long as it's explicit. The simplest version: look at the output or log of the last skill; if it has everything the next skill needs and shows no failures, move on. And a gentle reminder: more agents don't automatically produce better results.
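The simplest gate described above fits in a few lines. This is a sketch under assumptions: `StepResult` and its fields are a hypothetical shape for whatever your skills actually emit, and "has everything the next skill needs" is reduced to a check on required keys.

```python
from dataclasses import dataclass, field
from typing import Tuple


@dataclass
class StepResult:
    name: str
    output: dict
    errors: list = field(default_factory=list)


def gate(result: StepResult, required_keys: set) -> Tuple[bool, str]:
    """Pass only if the last step reported no failures and produced what the next step needs."""
    if result.errors:
        return False, f"{result.name} reported failures: {result.errors}"
    missing = required_keys - set(result.output)
    if missing:
        return False, f"{result.name} is missing {sorted(missing)}"
    return True, "ok"


good = StepResult("extract", {"customer_id": 42, "rows": 120})
incomplete = StepResult("extract", {"rows": 120})
failed = StepResult("extract", {"customer_id": 42, "rows": 0}, errors=["timeout"])

needs = {"customer_id", "rows"}
print(gate(good, needs))
print(gate(incomplete, needs))
print(gate(failed, needs))
```

Returning a reason alongside the verdict matters: whoever reviews a blocked chain, human or AI, needs to know why it stopped.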

A team of specialists beats one do-it-all generalist

There's a lot of research showing that teams of specialists outperform teams of generalists, and the same holds for AI agents. If you need a team, a set of focused specialist agents will beat one general-purpose agent copied over and over to fill every seat.

The usual objection to agent teams is that things get lost in the handoff: when one agent passes work to the next, the second agent doesn't know what the first one already figured out. That objection is fading fast. Many AI platforms now give agents shared memory (Squad does this for me today), so the specialists all read from and write to the same memory. Nothing has to be re-explained or lost at each step; the context is simply there for whoever needs it next.

So what's actually left to weigh? Mostly cost. A team of agents costs more than a single agent, because each one does its own thinking and runs up its own bill. That's the real trade-off — not lost context — and it's usually worth paying when each specialist does its part better than a generalist would.

The only time to reach for a single agent is when a team would be overkill: a small, simple job where spinning up specialists adds cost and coordination for no real gain. That's a simplicity call, not a limit on what specialists can do.

Capture tribal knowledge — turn repeatable work into skills

Here's a simple test: if you had to hand your work to someone else before going on vacation, and you'd explain a process step by step, that process is probably a skill. The moment you find yourself writing "first do this, then do that, then check for this" — you're describing something a skill can hold. Capture it.

Not everything passes the test. Codify the parts that are repetitive and stable; leave the genuinely judgment-heavy or one-off work to a person. Repetition is the tell — when you catch yourself doing the same thing the same way again and again, that's the signal.

Part 3 — The guardrails: things that aren't people

Least privilege for agents — and the third thing that's easy to miss

You already put real effort into deciding what access your people should have. Do the same for agents — don't let them roam your systems with your full authorization. And notice there are three distinct ideas here, not two:

Identity — who the agent is. It'll likely look a lot like an app or service principal (Entra Agent ID and similar platforms point that way).
Authorization — what it's allowed to do. Scope it as carefully as you'd scope a person.
Accountability — who answers when it causes harm.

With an employee, all three exist: a badge, access grants, and a liable person with intent, a contract, and consequences. With an agent, even identity is messier than it looks. Either the agent borrows a real person's identity — so every action it takes shows up as that human doing it — or it runs as a service principal that owns resources outright, with no person attached at all. Neither option cleanly separates the three ideas, and accountability has no home in either. When a correctly-scoped agent still deletes prod, leaks a secret, or does something destructive a human would've paused on, who's liable? The agent has no intent. The IC who launched it didn't author the step. The team that built the skill didn't run it. The vendor's model chose the action.

This is what separation of concerns looks like in practice—distance and clarity between who's watching and who's doing:

People watching and considering at a rocky shore, separated by distance and intention

The principle is simple: scope each agent to the smallest set of permissions it actually needs. Nothing more. Think of it as giving someone only what they need to do the job, not the keys to everything. There's one more wrinkle worth knowing about: an agent can be tricked into misusing access it legitimately has. A well-documented pattern, usually called prompt injection, is when text the agent reads (a web page, a file, an email) contains hidden instructions, and the agent treats them as if you'd asked for them. The access was scoped correctly; the agent just used it on the wrong instructions. Weigh that risk when you decide how much an agent should be allowed to do on its own.
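As a sketch of what scoping looks like at the point where an agent tries to act, here's a plain allow-list with an audit trail. The agent name and the action strings are hypothetical, and a real deployment would lean on your identity platform's authorization rather than an in-process check like this one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentScope:
    agent_id: str
    allowed_actions: frozenset


def is_allowed(scope: AgentScope, action: str) -> bool:
    """Deny by default: anything not explicitly granted is refused."""
    return action in scope.allowed_actions


def run_action(scope: AgentScope, action: str, audit: list) -> bool:
    """Record every attempt, allowed or not, so there is a trail to answer for."""
    allowed = is_allowed(scope, action)
    audit.append((scope.agent_id, action, "allowed" if allowed else "denied"))
    return allowed


triage_bot = AgentScope("triage-bot", frozenset({"issues:read", "issues:label"}))
audit_log = []

run_action(triage_bot, "issues:label", audit_log)
# An action the agent was never granted, for example one requested by injected text.
run_action(triage_bot, "repo:delete", audit_log)
print(audit_log)
```

Note what this does and doesn't buy you. The allow-list caps how much damage injected instructions can do; it does nothing about an injected instruction that stays inside the granted scope, which is why the scope itself should be small.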

Log the output your gates depend on

"Log everything" needs a sharper edge, because prompt-and-response logs are a huge new sensitive-data surface. So let me be specific about what I actually mean: not the prompts and responses, but the core output of each step — the data flowing through your scripts.

That's the logging that makes chained gates effective. A gate can only ask "did the last step produce what the next one needs, with no failures?" if that output is captured. It's also what tells you when something went wrong and how much downstream work it affected. Without those logs, you can't even see how far the damage spread. And that's really the point: before anything runs on its own in production, you map out how far a failure could spread and build up trust in the process over a good stretch of running it with a person watching.
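A minimal sketch of step-output logging, assuming JSON lines as the format and an in-memory buffer standing in for a real log sink. The step names and record fields are made up; the thing to notice is that prompts and responses never appear in the record.

```python
import io
import json


def log_step(sink, step: str, output: dict, errors=()) -> None:
    """Append one JSON line with a step's core output (no prompts, no responses)."""
    record = {"step": step, "output": output, "errors": list(errors)}
    sink.write(json.dumps(record) + "\n")


def read_log(sink) -> list:
    return [json.loads(line) for line in sink.getvalue().splitlines() if line]


def blast_radius(records: list) -> list:
    """The first failing step and every step that ran after it."""
    for index, record in enumerate(records):
        if record["errors"]:
            return [r["step"] for r in records[index:]]
    return []


sink = io.StringIO()  # stand-in for a log file or log pipeline
log_step(sink, "extract", {"rows": 120})
log_step(sink, "transform", {"rows": 97}, errors=["23 rows failed schema check"])
log_step(sink, "load", {"rows_written": 97})

records = read_log(sink)
print(blast_radius(records))  # ['transform', 'load']
```

Here the log shows that `load` ran on output from a step that had already failed, which is exactly the spread you can't see without the record.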

AI can't be a black box until it earns the right to be one

You're going to have to understand what's happening at levels you used to happily ignore. The usual car analogy — turn the key, drop it in gear, hit the gas, never think about the engine — actually cuts the other way right now. Cars earned that abstraction by working reliably for a century. AI hasn't earned it yet.

You need to see inside the box until it consistently succeeds. There's a whole range from "barely works" to "works every single time," and watching turn by turn is the discipline for the immature end of it. The watching doesn't disappear as the system matures — it changes form. When a skill or agent graduates from "watch every turn" to "production-automated," your bespoke attention gets handed off to the standard production stack: telemetry, observability, authorization. Manual vigilance is the scaffolding you use to earn trust; mature observability is the permanent structure that keeps it. That's the graduation handoff again — the very same moment that hardening, scoped authorization, and step-output logging each describe from their own corner.

Where this leaves us

Read straight through, these aren't a dozen scattered tips — they're one lifecycle. A capability starts life as an expensive, non-deterministic, hand-watched experiment, and if it proves out, it graduates into something cheap, deterministic, governed, and observable. The money tells you which rung to run it on. The craft tells you how to harden and test it. The guardrails tell you what it can touch and who answers when it goes wrong. And the handoff — your scaffolding giving way to standard ops — is where all three meet.

That journey is real work—intentional, deliberate, with a clear view of where you're building:

A woman at a mountain cabin workspace, designing and developing work with view and intention

And this is how it unfolds end to end—from experiment to automation, from manual scaffolding to permanent structure:

Graduation lifecycle diagram

So that's the real work in front of teams and organizations right now. It isn't adopting AI. It's graduating it. If you're already wrestling with any of this, I'd genuinely love to hear what's working for you and what isn't — that's how all of us find the real value faster.

Optimizing Copilot Skills: 65% Token Reduction Across 117 Skills

May 11, 2026 · 13 min read

Watercolor illustration of a craftsperson's workbench being tidied and organized

Optimizing skills felt less like deleting content and more like reorganizing a workshop — same tools, better drawers.

I'd been adding to the .copilot/skills/ directory for a while without taking inventory. Every feature or domain onboarding meant a new skill — sometimes three. The assumption was obvious: more skills = more consistency. For the first few dozen, that was true.

Here's what's weird: I had no actual count.

When I finally looked: 413,591 tokens across 136 skill and reference files (117 distinct skills). Just measuring it revealed the bloat:

6 SDK sample review skills: 140K tokens (34% of total budget)
Dead redirect stubs: consuming tokens for no routing purpose
Duplicated prose: same guidance repeated across language variants

Not a disaster, but the kind of creeping growth that happens when you build fast and don't audit.

Why this matters: Skills load on demand, so optimizing them doesn't free your active context window. But faster agent spawns and cheaper skill loading? That's a different lever, and I wanted to pull it.

The patterns I found — reference extraction, checklist compression, shared references — work with any tool. I used GitHub Copilot CLI with Squad orchestration to run them in parallel, but you could apply them manually in any editor. The techniques are the point, not the tooling.

Measuring Token Usage with microsoft/waza

The first move: measure. Waza is a skill quality toolkit, and waza_tokens count does exactly that — scans your skills directory and gives you sorted token usage. No guessing. Here's the breakdown:

$ waza_tokens count .copilot/skills/
┌─────────────────────────────────┬────────┐
│ Skill                           │ Tokens │
├─────────────────────────────────┼────────┤
│ data-plus-ai-sdk-java-sample... │ 25,841 │
│ data-plus-ai-sdk-python-samp... │ 23,921 │
│ ...                             │        │
│ dina-small-utility              │    312 │
├─────────────────────────────────┼────────┤
│ Total: 117 skills               │413,591 │
└─────────────────────────────────┴────────┘

25K tokens for a single skill. That's the starting point. Waza has other tools too — waza_tokens suggest for optimization ideas, waza_quality to verify post-changes, and waza_dev --copilot for frontmatter — but for this work, count was the diagnostic tool.

Planning the Work

The strategy: decompose into phases, ordered by savings potential. Clear the big ones first; small wins come after.

Phase	Target	Est. Savings
1. Kill stubs	3 empty redirect skills	~73 tokens
2. Refactor giants	6 SDK review skills (140K!)	~120K tokens
3. Optimize large	14 skills (5K–10K each)	~30–50K tokens
4. Optimize medium	60 skills (1K–5K each)	~10–20K tokens
5. Trim small	20 skills (under 1K each)	minimal
6. Audit references	Large reference files	~10–15K tokens

Why this order matters: I started with stubs not because they saved much, but because they reduced noise before the real work. Phases 2–3 capture the bulk of savings. Phases 4–5 are diminishing returns per skill, but we completed them efficiently by applying the same patterns we'd already proven earlier.
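If you want to sort your own inventory into the same phases, something like this sketch would do it. The size thresholds come from the table above, except the 10K cut-off for "giants", which is my assumption (the post identifies giants by name, not by size). Skill names starting with `example-` are hypothetical, and the stub flag has to come from inspecting the skill, not from its token count.

```python
def phase_for(tokens: int, is_redirect_stub: bool = False) -> str:
    """Map one skill to a phase from the plan."""
    if is_redirect_stub:
        return "1. Kill stubs"
    if tokens > 10_000:  # assumed cut-off for "giants"
        return "2. Refactor giants"
    if tokens >= 5_000:
        return "3. Optimize large"
    if tokens >= 1_000:
        return "4. Optimize medium"
    return "5. Trim small"


def plan(inventory) -> dict:
    """Group skill names by phase, biggest skills first within each phase."""
    grouped = {}
    for name, tokens, is_stub in sorted(inventory, key=lambda row: -row[1]):
        grouped.setdefault(phase_for(tokens, is_stub), []).append(name)
    return dict(sorted(grouped.items()))


# (name, token count, is_redirect_stub); counts would come from your token tool.
inventory = [
    ("example-medium", 2_400, False),
    ("example-redirect", 24, True),
    ("example-java-review", 25_841, False),
    ("dina-small-utility", 312, False),
    ("example-large", 7_200, False),
]

for phase, names in plan(inventory).items():
    print(phase, names)
```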

Phase 1: Killing the Stubs

Problem: Three skills were redirect stubs — they pointed to other skills and had fewer than 50 tokens of actual content. No routing logic, no value.

Action: Deleted them.

Result: −73 tokens. Barely registers numerically, but this is the "boring is good" work. A clean directory is easier to maintain, and stubs confuse future maintainers.

Phase 2: The Giants

Problem: Six SDK sample review skills (Java, Python, Go, .NET, TypeScript, Rust) had identical structure: 15–16 detailed rule sections + full code examples inline. Total: 140K tokens (34% of budget). Agents loaded everything every time, even when they only needed one language's rules.

Technique: Reference Extraction. Move verbose rules and examples to references/ files. Keep SKILL.md slim — just routing info, a quick checklist, and blockers. Agents load the overview immediately, fetch detailed rules on demand.

Before:

java-sdk-review/SKILL.md (25,841 tokens)
├── Routing info (2K)
├── Error handling rules (8K + full examples)
├── Concurrency rules (7K + full examples)
├── Async patterns (6K + full examples)
└── ... 12 more sections ...

After:

java-sdk-review/SKILL.md (1,541 tokens)
├── Routing: detect Java SDK samples
├── Quick checklist
│   ├── Error handling: caught, logged, meaningful messages
│   ├── Concurrency: thread-safe, no race conditions
│   ├── Async patterns: proper callback/future chaining
│   └── ... 5 more items
└── Reference files in references/java/ (loaded on demand)

Two-tier architecture. Same content, loaded smarter.

Execution: Ran all six in parallel:

Skill	Before	After	Reduction
Java	25,841	1,541	94%
Python	23,921	1,083	95%
Go	24,355	1,815	93%
.NET	23,355	1,378	94%
TypeScript	21,543	1,525	93%
Rust	21,303	1,643	92%
Total	140,318	8,985	94% (~131K saved)
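The percentages in the table can be rechecked from the raw counts. This is just the table's own numbers run through the reduction formula; the helper name is mine.

```python
# (before, after) token counts per language, copied from the table.
ROWS = {
    "Java": (25_841, 1_541),
    "Python": (23_921, 1_083),
    "Go": (24_355, 1_815),
    ".NET": (23_355, 1_378),
    "TypeScript": (21_543, 1_525),
    "Rust": (21_303, 1_643),
}


def reduction_pct(before: int, after: int) -> int:
    """Percentage of tokens removed, rounded to a whole percent."""
    return round(100 * (before - after) / before)


per_language = {lang: reduction_pct(b, a) for lang, (b, a) in ROWS.items()}
total_before = sum(b for b, _ in ROWS.values())
total_after = sum(a for _, a in ROWS.values())

print(per_language)
print(total_before, total_after, total_before - total_after)
```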

Trade-off: Agents now navigate a two-tier structure (SKILL.md → references/) instead of one flat file. Discoverability costs something. But these skills are used frequently enough that agents will learn the pattern. Zero content was removed — every rule and example is still there, just reference-extracted.

Phase 2 complete: SDK skills before/after showing 92–95% reduction per language

Phase 3: Large Skills

Problem: 14 more skills in the 5K–10K range had the same structure: verbose sections that could be extracted. Examples: azure-mcp-content-generation, dina-reskill, context-diagnostics.

Action: Applied the same reference extraction pattern in 4 parallel batches.

Result: −68,084 tokens (76% reduction)

Running total:

Phase 1 (stubs):   −73 tokens
Phase 2 (giants):  −131,333 tokens
Phase 3 (large):   −68,084 tokens
────────────────────────────────
Subtotal saved:    ~199,490 tokens (48% of starting budget)
Remaining:         ~214,101 tokens

At this point, the curve was clear. Phases 4–5 (medium and small skills) would yield diminishing returns per unit effort — but having proven the techniques in Phases 1–3, we already knew how to apply them efficiently at scale.

The PR and the Review

Problem: The PR touched 106 files, with 18,571 deletions and 12,176 insertions. Before shipping, we needed to verify:

Structural integrity (paths, syntax, references valid)
Quality didn't regress (waza_quality scores)
Routing logic still precise
Didn't over-trim skills below usefulness

Action: Ran four automated review passes:

Structural integrity check — passed
Waza quality verification — passed with notes
Trigger precision validation — passed with notes
Adversarial over-trimming check — caught 2 real issues

Issues found and fixed:

Reference file with broken relative path (in Phase 2)
Skill trimmed below ~800 tokens (lost routing context entirely)

Second pass: ✅ SHIP

Pull request showing 65% Copilot skills token reduction across 106 files

Key finding: Don't reduce a SKILL.md below ~800 tokens for standalone skills. Below that threshold, you lose enough routing context that agents can't determine when or how to use the skill. Exception: Skills with strong internal routing logic (like the unified SDK skill at 469 tokens) can go lower because their dispatch logic compensates.

The ~800-token floor is a practical boundary, discovered through testing.
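A floor like that is easy to turn into a pre-merge check. In this sketch the token counts are supplied by hand (in practice they'd come from your token-counting tool), the `has_dispatch` flag is a hypothetical way to mark skills with internal routing, and the `example-` skills are invented.

```python
FLOOR = 800  # the practical boundary from the review, not a hard rule


def below_floor(skills: dict, floor: int = FLOOR) -> list:
    """Names of standalone skills under the floor; internally routed skills are exempt."""
    return sorted(
        name
        for name, (tokens, has_dispatch) in skills.items()
        if tokens < floor and not has_dispatch
    )


# name -> (SKILL.md token count, has internal dispatch logic)
skills = {
    "azure-sdk-sample-review": (469, True),   # the unified skill, exempt via dispatch
    "example-standalone": (640, False),       # hypothetical: trimmed too far
    "example-healthy": (1_163, False),        # hypothetical: comfortably above
}

flagged = below_floor(skills)
print(flagged)  # ['example-standalone']
```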

Phases 4–6: The Curve Flattens

After Phase 3, ~214K tokens remained. Phases 4–6 brought that down to 143K — another ~70K saved by applying the same patterns (checklist compression, reference extraction, deduplication) at smaller scale:

Phase	Skills	Technique	Savings
4. Medium	60 skills (1K–5K each)	Checklist compression, dedup	~45K
5. Small	20 skills (under 1K each)	Light trimming	~10K
6. Reference audit	Large reference files	Consolidation	~15K

The per-skill ROI drops in later phases, but having proven the techniques in Phases 1–3, the work was mechanical — same patterns, smaller targets. The current 143K total is sustainable for the usage pattern.

Final Numbers

Before:  413,591 tokens (117 skills, 136 files)
After:   143,354 tokens (114 skills, 130 files)
Saved:   270,237 tokens (65.3% reduction)

This reflects the main optimization PR. The workbench is cleaner — every tool is still there, but they're in labeled drawers instead of piled on the surface. The Bonus Round consolidation happened in a separate session and is described next.
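For anyone who wants to reconcile the phase totals against the final numbers, the arithmetic is short. All figures below are the ones reported in this post; only the variable names are mine.

```python
START = 413_591
FINAL = 143_354

# Exact savings reported for the first three phases.
PHASES_1_TO_3 = {"stubs": 73, "giants": 131_333, "large": 68_084}

saved_total = START - FINAL
saved_early = sum(PHASES_1_TO_3.values())
saved_late = saved_total - saved_early  # what phases 4-6 must account for
reduction = round(100 * saved_total / START, 1)

print(saved_total, saved_early, saved_late, reduction)
```

The remainder for Phases 4–6 comes out just over 70K, which lines up with the roughly 45K, 10K, and 15K estimates in the table above.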

Bonus Round: From Shared References to Unified Skills

After the optimization PR shipped, I ran waza_quality on the six SDK skills and noticed: isolation violation. Each skill had its own SKILL.md, its own routing, its own boilerplate duplicated across six files. That's a pattern violation — not atomic, not clean.

So I rethought it. Three options existed:

1. Accept the low score: good enough for a domain-specific exception
2. Inline shared content: copy the 14 shared reference files into each skill, multiplying the maintenance burden
3. Single skill with language dispatch: collapse all 7 (6 SDK languages + 1 quickstart) into one unified skill with language auto-detection

I went with option 3. Not because it was obvious, but because it matched the actual use case: agents almost never review samples in all languages simultaneously. They review one language based on the codebase. A single skill with smart routing was actually more correct than the multi-skill pretense.

Result: azure-sdk-sample-review/ — one skill, 469 tokens in SKILL.md, language auto-detection via prompt analysis. Structure:

azure-sdk-sample-review/
├── SKILL.md (469 tokens) — routing + dispatch logic
├── evals/ (7 tasks, 100% passing)
├── references/
│   ├── shared/ (14 files: generic best practices)
│   ├── dotnet/, go/, java/, python/, rust/, typescript/
│   └── quickstart/
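To illustrate the dispatch idea, here's a sketch of one way language detection could pick which reference folders to load. This is not how the actual skill does it: the real dispatch lives in SKILL.md and works through prompt analysis, whereas this version just looks for a file extension in the prompt. The extension map is my assumption; the folder names match the tree above.

```python
import re

# Assumed mapping from file extension to the language folders in the tree.
EXTENSION_TO_FOLDER = {
    "cs": "dotnet",
    "go": "go",
    "java": "java",
    "py": "python",
    "rs": "rust",
    "ts": "typescript",
}
_EXTENSION = re.compile(r"\.(cs|go|java|py|rs|ts)\b")


def reference_folders(prompt: str) -> list:
    """Always load shared guidance; add one language folder if a language is detectable."""
    folders = ["references/shared"]
    match = _EXTENSION.search(prompt)
    if match:
        folders.append("references/" + EXTENSION_TO_FOLDER[match.group(1)])
    return folders


print(reference_folders("Review samples/upload_blob.py for error handling"))
print(reference_folders("Review this sample"))
```

Whatever the detection mechanism, the shape is the same: one shared layer that always loads, plus exactly one language-specific layer chosen at dispatch time.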

Before: 86 files across 6 language folders — Java, Python, Go, .NET, TypeScript, Rust — each with its own copy of the same 14 generic reference files. 45% pure duplication. Update one? Remember to update five more.

After: One shared/ folder holds the 14 generic files. Each language folder keeps only what's actually unique to that SDK. One update, one place. Six duplicated routing SKILL.md files collapsed to a single dispatch mechanism. Waza compliance achieved — no more isolation violations. All 7 behavioral evals running at 100%.

This is the evolution: reference extraction → shared references → unified skill with internal routing. Each step felt right at the time. Looking back, the final architecture is simpler and more correct. Wish I'd seen it from the start. (Work captured in a follow-up PR.)

Dogfooding: The Reskill Skill

I captured the optimization pipeline as a skill — dina-reskill — documenting the 8-pattern workflow (reference extraction, checklist compression, example pruning, and so on).

Then I ran it on itself, because apparently I can't leave well enough alone:

SKILL.md:  2,085 → 1,163 tokens (44% reduction)
Total:     5,401 → 4,288 tokens (21% reduction)

Three review passes: two approvals, one note flagged and fixed. The skill practices what it documents. The SDK skills themselves evolved further after this (described in Bonus Round) — from six separate skills down to a single unified skill. So while dina-reskill captures the second-pass improvements here, the SDK consolidation shows how those patterns continue to evolve as you live with them.

What Actually Worked: The Patterns

If you're building a skills optimization workflow, here are the patterns ranked by impact:

1. Reference Extraction

Principle: Move detailed rules, code examples, and verbose explanations into references/ files. The SKILL.md becomes a slim routing layer — overview, quick checklist, blocker list. Agents load references on demand.

When to use: For any skill over 5K tokens, this should be your first move. Start here, not somewhere else.

Example: The Java SDK skill went from 25,841 tokens (inline rules + examples) to 1,541 tokens (routing + checklist) by extracting ~24K into references/java/.

2. Checklist Compression

Principle: Turn paragraph-style guidance into concise checklists. Same information, fraction of the tokens.

Example:

Before: "When reviewing error handling, ensure that all errors are properly caught, logged with appropriate context, and returned with meaningful messages to the caller"
After: "✅ Errors: caught, logged with context, meaningful messages"

3. Example Pruning

Principle: One good example per pattern. If your skill has 3 examples of the same concept, keep the clearest one and move the others to references.

4. Shared References → Unified Skill Routing

Principle: If multiple skills share common guidance, the first instinct is to extract it once and link. That works, but it's often a stepping stone to something better: collapsing near-identical skills into one skill with internal dispatch logic.

When this works: When you have N near-identical skills differing only by one dimension (language, framework, etc.), a unified skill with auto-detection is cleaner than N separate skills with shared references.

Trade-off: One SKILL.md, one set of evals, one routing boundary. Zero isolation violations. But your SKILL.md becomes more complex.

5. Stub Elimination

Principle: If a skill just redirects to another skill, delete it. The router doesn't need a placeholder, and stubs confuse future agents trying to decide what to use.

Honest Lessons: How I Should Have Run This

The work happened over 8 user messages and 2 hours. Here's what went sideways and what would have prevented it:

What Happened	What Would Have Been Better
SDK dedup discovered late (turn 6–8)	Mention "deduplicate shared content" upfront as a known phase
Asking about PR + review + results separately	Bundle deliverables: "PR, team review, results file" in one request
Phases 4–5 required separate prompts	Front-load scope: "all phases including medium skills" keeps momentum

The pattern that would have worked is to front-load three things in the first message:

The technique or tool (waza_tokens, reference extraction, etc.)
Full scope with known edge cases (all 6 phases, ~800-token floor, etc.)
All deliverables you want at the end

Front-load scope, technique, and deliverables in one message. The AI doesn't lose patience — you do. Every "keep going" prompt is a planning failure you're paying for at execution prices.

The Setup

For reference, here's what I was running:

GitHub Copilot CLI v1.0.40
Squad v0.9.4-insider.1 for multi-agent orchestration
microsoft/waza for skill quality analysis
Model: Claude Opus 4.6 with 200K context window

Where to Go From Here

To determine if your own skills directory needs this treatment: run waza_tokens count and see the total. If it's over 100K tokens, you have meaningful room to optimize. If you have skills over 5K tokens, reference extraction is almost always worth it.

Everyone's skill architecture is different — the interesting work is figuring out which patterns actually fit your setup. If you try these and discover something that works or something that breaks, I'd be curious to hear what you found.

The same workbench, now organized — tools on pegboard, labeled drawers, clean surface

Same workshop. Same tools. Better organized. That's what 270K tokens of optimization looks like.

Main optimization session: May 11, 2026. 8 user messages, ~2 hours, 270K tokens saved. The Bonus Round consolidation happened in a follow-up session.
