
Running AI at Work: A Field Guide to Cost, Craft, and Guardrails

· 15 min read

For the engineers building with AI every day — and the leaders setting the guardrails around them.

Take a few minutes to think about where we are. We're past the demo phase. The exciting question used to be "can a model even do this?" — and now it's the much less glamorous "how do we run this every day, at a price we can actually justify, without handing an autonomous process the keys to everything?"

That question really has three threads tangled together, and it helps to pull them apart up front:

  • The money — don't pay twice, and use the right tool for the job.
  • The craft — move from one-shot prompts to repeatable, testable processes.
  • The guardrails — least privilege, real observability, and accountability for things that aren't people.

There's one idea underneath all three: the graduation handoff. Almost everything here describes a single moment — a skill or agent growing up from "you're watching it work, turn by turn" into "it runs in production on its own." Before that moment, you are the scaffolding that earns the trust. After it, your standard production stack — telemetry, observability, authorization — becomes the permanent structure that keeps it. Keep that handoff in the back of your mind. It's the spine of everything below.

That watching phase—where you're present for each turn—is the subject of what comes next. It's not forever, but it's the cost of earning trust.

A woman with pink hair stands at a rocky coastline, watching and thinking

Part 1 — The money: stop paying twice

Pay for tokens once

Here's the thing about tokens: they're wonderful for exploring and terrible as a permanent runtime. Once you really understand a workflow, push it toward determinism — scripts, apps, tests, CI — so you pay the model once to figure it out, not on every single run.

I want to be straight with you, though: determinism isn't a free lunch, it's a trade. Scripts rot. APIs change, schemas drift, dependencies break. You're swapping a per-run token cost for a smaller, recurring maintenance cost — not erasing the cost.

So when do you actually harden something? It comes down to stakes. If a workflow is quick and clearly defined, move it to determinism early. If the consequences are bigger, give it a middle ground: let it mature with AI first, and as individual parts settle, move those parts into hardened, deterministic form — within the real-world limits of security, maintenance, and interoperability. The two failure modes to avoid: hardening a moving target, and paying frontier prices for something that's already settled.

Use the right tool for the job — on a cost gradient

"Use AI for everything" isn't a strategy. The real principle is the boring, durable one you already know: right tool for the job. And the tools sit on a gradient — a classic deterministic tool, then a small or local model, then a frontier model. Pick the cheapest rung that actually does the work.

Choosing the right one saves money, but more importantly, it saves cognitive load. You're not wrestling with overkill; you're matching capability to need.

A woman in a forest selecting the right tool from many options

Spelling and grammar checking is a nice illustration, precisely because it usually lands on the deterministic rung. Mature tools handle it cheaply and predictably, so you'd never reach for a frontier model; it's the same instinct as "pay once." The interesting rung is the one above deterministic but below frontier, and that's where small or local models genuinely earn their keep: classification, routing, redaction, embeddings. That is work that needs more flexibility than a fixed tool can give but doesn't need state-of-the-art reasoning.

A concrete example I like: redacting PII from your logs before they're stored. A regex can't reliably catch a name or address it's never seen. A frontier model is overkill on every log line. A small local model is right in the sweet spot — flexible enough to generalize, cheap enough to run on every write, and local so the sensitive data never leaves your boundary.

Here's how to think about it systematically:

Cost gradient diagram

Turn that gradient into a system: model routing

Don't leave "good enough vs. state of the art" as a gut feeling. Systematize it with model routing, or cascades: try the cheap model first, and escalate to the frontier one only when confidence is low. Same instinct as above — but now it's a measurable, tiered spend you can actually reason about instead of a vibe.
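To make the cascade idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Answer` type, the idea that each model reports a usable confidence score, the 0.8 threshold, and the two stand-in "models" (plain functions, not real model calls). In practice the confidence signal is the hard part and has to be validated against your own data.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Answer:
    text: str
    confidence: float  # 0.0 to 1.0; how you obtain this is an assumption


# A "model" is any callable from prompt to Answer (hypothetical interface).
Model = Callable[[str], Answer]


def cascade(prompt: str, tiers: List[Tuple[str, Model]],
            threshold: float = 0.8) -> Tuple[str, Answer]:
    """Try tiers cheapest-first and stop at the first confident answer.

    The last tier acts as the fallback: its answer is returned even if it
    does not clear the threshold.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    for name, model in tiers:
        answer = model(prompt)
        if answer.confidence >= threshold:
            break
    return name, answer


# Stand-in models so the sketch runs without any API or local runtime.
def small_model(prompt: str) -> Answer:
    # Pretend the small model is only confident on short, simple prompts.
    return Answer("small: " + prompt, 0.9 if len(prompt) < 40 else 0.4)


def frontier_model(prompt: str) -> Answer:
    return Answer("frontier: " + prompt, 0.95)


TIERS = [("small", small_model), ("frontier", frontier_model)]

used, _ = cascade("route this ticket", TIERS)
print(used)  # small
```

Logging which tier answered each request is what turns this from a vibe into a number: the share of traffic that escalates is the part of the bill you can actually reason about.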

Don't pay for compute you don't need — and remember local buys more than savings

If you don't need cloud or remote access, a small or local model lets you skip the ongoing API bill. But please don't fall for "local is free." It's a myth. You're trading a pay-as-you-go bill for a big upfront purchase plus a new set of ongoing costs: hardware, electricity, wear and tear, and the engineer-hours to run and patch the thing. For low or spiky volume, cloud is often both cheaper and more secure than a box you babysit.

So find the real break-even: volume, latency, and data-residency or compliance constraints — not cost alone. And notice the two things local buys you beyond money. First, privacy and residency — the data never leaves your boundary, which is honestly the real reason to go local more often than cost is. Second, a chance to work in an emerging space. Local model management is still young — tools like Ollama, LM Studio, and Docker's Model Runner have made a real start, but enterprise basics are still thin: per-user access control, audit logging, versioning, and cost tracking. If you like building tools, there's real room here.
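The cost half of that break-even can be sketched in a few lines of Python. The function name and every number below are made up for illustration; plug in your own hardware quote, power and staffing estimates, and cloud bill. It deliberately covers money only, so latency, residency, and compliance still have to be weighed separately.

```python
import math


def break_even_months(upfront_cost: float,
                      local_monthly_cost: float,
                      cloud_monthly_cost: float) -> float:
    """Months until a local setup has paid back its upfront cost.

    local_monthly_cost should include power, maintenance, and engineer time.
    Returns math.inf when local never catches up with cloud.
    """
    monthly_saving = cloud_monthly_cost - local_monthly_cost
    if monthly_saving <= 0:
        return math.inf
    return upfront_cost / monthly_saving


# Hypothetical steady, high-volume workload.
print(break_even_months(6000, 200, 700))  # 12.0

# Hypothetical low or spiky workload: the cloud bill is smaller than
# the cost of keeping the box running, so local never pays back.
print(break_even_months(6000, 200, 150))  # inf
```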

Treat AI spend like any other tech spend

After all this talk of cost, let me be clear about the goal: it isn't to fret over every dollar. It's the opposite. AI spend deserves the same deliberate budgeting you give every other part of your stack — it's real, it's recurring, and it should be a line item you own on purpose, tracked per team and per workflow.

Part 2 — The craft: from prompt to process

AI builds the prototype; you harden what matters

One of the most durable patterns here is an old one wearing new clothes. AI builds the prototype fast; once it works, you decide what becomes a repeatable, testable, secure process.

What's the artifact of that transition? Today it's a skill, an agent, or some other markdown file — but the format is incidental. What matters is the information that persists: a durable, reviewable source of truth that outlives whatever wrapper happens to hold it.

The work is real and tangible—you're shaping raw materials into something refined and usable:

A woman with pink hair shaping and crafting materials at a table with plants

And here's a bonus that's easy to miss: the hardening boundary is also the line between your two test regimes. What you've hardened gets a deterministic unit test. What stays probabilistic gets an eval — golden sets, LLM-as-judge, acceptance bands. Deciding what to harden is deciding how each piece gets tested. The test surface didn't shrink when AI showed up; it grew to cover both. Good news: the frameworks for this already exist, so you don't have to invent them.
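Here is a small Python sketch of the two regimes side by side. The function names, the golden set, and the 75% acceptance band are all hypothetical, and the "classifier" is a keyword stand-in for whatever probabilistic step you would actually be evaluating.

```python
from typing import Callable, List, Tuple


def normalize_ticket_id(raw: str) -> str:
    """Hardened, deterministic step: the same input always gives the same output."""
    return raw.strip().upper().replace(" ", "-")


def pass_rate(classify: Callable[[str], str],
              golden_set: List[Tuple[str, str]]) -> float:
    """Fraction of golden examples where the probabilistic step matches the label."""
    if not golden_set:
        raise ValueError("golden set must not be empty")
    hits = sum(1 for text, expected in golden_set if classify(text) == expected)
    return hits / len(golden_set)


def fake_classifier(text: str) -> str:
    """Stand-in for a model call; it only recognizes the word 'invoice'."""
    return "billing" if "invoice" in text.lower() else "support"


GOLDEN_SET = [
    ("Invoice 1042 is wrong", "billing"),
    ("I was charged twice", "billing"),  # the stand-in gets this one wrong
    ("App crashes on login", "support"),
    ("Please resend my invoice", "billing"),
]

# Regime 1: unit test. Exact equality, no tolerance.
assert normalize_ticket_id(" ab 123 ") == "AB-123"

# Regime 2: eval. Some misses are expected, so check an acceptance band.
rate = pass_rate(fake_classifier, GOLDEN_SET)
assert rate >= 0.75, f"pass rate {rate:.0%} fell below the acceptance band"
print(f"{rate:.0%}")  # 75%
```

The unit test fails on any deviation; the eval fails only when quality drops below a band you chose on purpose. Moving a step across the hardening boundary means moving it from the second style of check to the first.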

Move from one-shot prompts to repeatable processes — and give them an owner

Wrap your deterministic scripts in AI skills that lean toward repeatable processes anyone can pick up. But reuse without ownership just turns into shadow IT — handy, but unowned and ungoverned. So give every reusable thing a home.

A frame I borrowed from a governance thinker in the company helps here: think federal, state, and local, where you, the individual contributor, are local. Federal assets are organization-wide. State builds on federal. Local builds on both. The pipeline updates assets as they change upstream — a local copy can be refreshed from its non-local source instead of drifting out of date. The point is that every reusable unit has an owning tier and an update path. That's the difference between a library and a junk drawer.

Chain and gate skills into workflows — but only what you can watch

You can absolutely chain skills into full workflows — just respect the math, because reliability compounds downward. Five steps at 95% each lands you around 77% end to end. So chain only what you can observe and gate, and keep your chains short.
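The arithmetic behind that figure is just multiplication, assuming each step succeeds or fails independently of the others. A short Python sketch (the function names are mine, and the 90% target is only an example):

```python
import math


def chain_reliability(step_success_rates):
    """End-to-end success rate, assuming steps fail independently."""
    return math.prod(step_success_rates)


def max_steps(per_step_rate, target_rate):
    """Longest chain of equally reliable steps that still meets the target."""
    if not (0 < per_step_rate < 1 and 0 < target_rate <= 1):
        raise ValueError("per_step_rate must be in (0, 1), target_rate in (0, 1]")
    return math.floor(math.log(target_rate) / math.log(per_step_rate))


five_steps = chain_reliability([0.95] * 5)
print(f"{five_steps:.1%}")    # 77.4%
print(max_steps(0.95, 0.90))  # 2
```

Read the second number as a budget: at 95% per step, anything longer than two ungated steps already drops below 90% end to end, which is why a gate between steps matters more than a clever chain.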

What's a "gate"? Any defined checkpoint — a human in the loop or an AI, as long as it's explicit. The simplest version: look at the output or log of the last skill; if it has everything the next skill needs and shows no failures, move on. And a gentle reminder: more agents doesn't automatically produce better results.
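That simplest gate can be sketched as a small Python function. The shape of the step output (a dict with a `failures` list) and the key names are assumptions for illustration; the point is only that the check is explicit and returns reasons rather than a bare yes or no.

```python
from typing import Any, Dict, Iterable, List


def gate(step_output: Dict[str, Any], required_keys: Iterable[str]) -> List[str]:
    """Return the reasons to stop; an empty list means the chain may proceed."""
    problems = []
    for key in required_keys:
        # Treat absent and empty values the same: the next step cannot use them.
        if step_output.get(key) in (None, "", [], {}):
            problems.append(f"missing or empty: {key}")
    for failure in step_output.get("failures", []):
        problems.append(f"upstream failure: {failure}")
    return problems


# Hypothetical outputs from a previous skill.
good = {"summary": "3 files updated", "files": ["a.md"], "failures": []}
bad = {"summary": "", "failures": ["lint step timed out"]}

print(gate(good, ["summary", "files"]))  # []
print(gate(bad, ["summary", "files"]))
# ['missing or empty: summary', 'missing or empty: files',
#  'upstream failure: lint step timed out']
```

Whether a person or another agent reads the returned list, the checkpoint is the same, and the list itself is worth logging.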

A team of specialists beats one do-it-all generalist

There's a lot of research showing that teams of specialists outperform teams of generalists, and the same holds for AI agents. If you need a team, a set of focused specialist agents will beat one general-purpose agent copied over and over to fill every seat.

The usual objection to agent teams is that things get lost in the handoff: when one agent passes work to the next, the second agent doesn't know what the first one already figured out. That objection is fading fast. Most AI platforms now give agents shared memory (Squad does this for me today), so the specialists all read from and write to the same memory. Nothing has to be re-explained or lost at each step; the context is simply there for whoever needs it next.

So what's actually left to weigh? Mostly cost. A team of agents costs more than a single agent, because each one does its own thinking and runs up its own bill. That's the real trade-off — not lost context — and it's usually worth paying when each specialist does its part better than a generalist would.

The only time to reach for a single agent is when a team would be overkill: a small, simple job where spinning up specialists adds cost and coordination for no real gain. That's a simplicity call, not a limit on what specialists can do.

Capture tribal knowledge — turn repeatable work into skills

Here's a simple test: if you had to hand your work to someone else before going on vacation, and you'd explain a process step by step, that process is probably a skill. The moment you find yourself writing "first do this, then do that, then check for this" — you're describing something a skill can hold. Capture it.

Not everything passes the test. Codify the parts that are repetitive and stable; leave the genuinely judgment-heavy or one-off work to a person. Repetition is the tell — when you catch yourself doing the same thing the same way again and again, that's the signal.

Part 3 — The guardrails: things that aren't people

Least privilege for agents — and the third thing that's easy to miss

You already put real effort into deciding what access your people should have. Do the same for agents — don't let them roam your systems with your full authorization. And notice there are three distinct ideas here, not two:

  • Identity: who the agent is. It'll likely look a lot like an app or service principal (Entra Agent ID and similar platforms point that way).
  • Authorization: what it's allowed to do. Scope it as carefully as you'd scope a person.
  • Accountability: who answers when it causes harm.

With an employee, all three exist: a badge, access grants, and a liable person with intent, a contract, and consequences. With an agent, even identity is messier than it looks. Either the agent borrows a real person's identity — so every action it takes shows up as that human doing it — or it runs as a service principal that owns resources outright, with no person attached at all. Neither option cleanly separates the three ideas, and accountability has no home in either. When a correctly-scoped agent still deletes prod, leaks a secret, or does something destructive a human would've paused on, who's liable? The agent has no intent. The IC who launched it didn't author the step. The team that built the skill didn't run it. The vendor's model chose the action.

This is what separation of concerns looks like in practice—distance and clarity between who's watching and who's doing:

People watching and considering at a rocky shore, separated by distance and intention

The principle is simple: scope each agent to the smallest set of permissions it actually needs. Nothing more. Think of it as giving someone only what they need to do the job, not the keys to everything. There's one more wrinkle worth knowing about: an agent can be tricked into misusing access it legitimately has. A well-documented pattern (often called prompt injection) is when text the agent reads (a web page, a file, an email) contains hidden instructions, and the agent treats them as if you'd asked for them. The access was scoped correctly; the agent just used it on the wrong instructions. Weigh that risk when you decide how much an agent should be allowed to do on its own.
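As a rough illustration of deny-by-default scoping, here is a Python sketch. The `AgentScope` class, the "action:resource" grant format, and the agent and resource names are all hypothetical; a real deployment would express this in whatever identity and authorization system you already run.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class AgentScope:
    agent_id: str
    grants: FrozenSet[str]  # explicit "action:resource" pairs, nothing implied

    def permits(self, action: str, resource: str) -> bool:
        return f"{action}:{resource}" in self.grants


def perform(scope: AgentScope, action: str, resource: str) -> str:
    """Deny by default: anything not explicitly granted raises."""
    if not scope.permits(action, resource):
        raise PermissionError(
            f"{scope.agent_id} is not granted {action}:{resource}")
    return f"{scope.agent_id} did {action} on {resource}"


# Hypothetical agent that only needs to read raw logs and write redacted ones.
LOG_REDACTOR = AgentScope(
    agent_id="log-redactor",
    grants=frozenset({"read:raw-logs", "write:redacted-logs"}),
)

print(perform(LOG_REDACTOR, "read", "raw-logs"))  # log-redactor did read on raw-logs

try:
    perform(LOG_REDACTOR, "delete", "prod-db")
except PermissionError as err:
    print(err)  # log-redactor is not granted delete:prod-db
```

Note what this does not do: a check like this limits what an injected instruction can reach, but it cannot tell a legitimate request from an injected one inside the granted scope.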

Log the output your gates depend on

"Log everything" needs a sharper edge, because prompt-and-response logs are a huge new sensitive-data surface. So let me be specific about what I actually mean: not the prompts and responses, but the core output of each step — the data flowing through your scripts.

That's the logging that makes chained gates effective. A gate can only ask "did the last step produce what the next one needs, with no failures?" if that output is captured. It's also what tells you when something went wrong and how much was affected. Without those logs, you can't even see how far the damage spread. And that's really the point: before anything runs on its own in production, you map out how far a failure could spread and build up trust in the process over a good stretch of running it with a person watching.

AI won't stay a black box — until it earns the right to be one

You're going to have to understand what's happening at levels you used to happily ignore. The usual car analogy — turn the key, drop it in gear, hit the gas, never think about the engine — actually cuts the other way right now. Cars earned that abstraction by working reliably for a century. AI hasn't earned it yet.

You need to see inside the box until it consistently succeeds. There's a whole range from "barely works" to "works every single time," and watching turn by turn is the discipline for the immature end of it. The watching doesn't disappear as the system matures — it changes form. When a skill or agent graduates from "watch every turn" to "production-automated," your bespoke attention gets handed off to the standard production stack: telemetry, observability, authorization. Manual vigilance is the scaffolding you use to earn trust; mature observability is the permanent structure that keeps it. That's the graduation handoff again — the very same moment that hardening, scoped authorization, and step-output logging each describe from their own corner.

Where this leaves us

Read straight through, these aren't a dozen scattered tips — they're one lifecycle. A capability starts life as an expensive, non-deterministic, hand-watched experiment, and if it proves out, it graduates into something cheap, deterministic, governed, and observable. The money tells you which rung to run it on. The craft tells you how to harden and test it. The guardrails tell you what it can touch and who answers when it goes wrong. And the handoff — your scaffolding giving way to standard ops — is where all three meet.

That journey is real work—intentional, deliberate, with a clear view of where you're building:

A woman at a mountain cabin workspace, designing and developing work with view and intention

And this is how it unfolds end to end—from experiment to automation, from manual scaffolding to permanent structure:

Graduation lifecycle diagram

So that's the real work in front of teams and organizations right now. It isn't adopting AI. It's graduating it. If you're already wrestling with any of this, I'd genuinely love to hear what's working for you and what isn't — that's how all of us find the real value faster.

Why Squad?

· 5 min read

If you already use GitHub Copilot, the real question is simple: why would Squad feel different enough to change your workflow?

The first time I used Squad, I got a result that was noticeably better than what I got from a single Copilot chat. It was the same task and basically the same intent, but the coverage was better.

That was the moment it clicked for me: Squad is a multi-agent orchestrator. Instead of asking one AI generalist to do everything, I can route the same task through specialists with clear roles.

This post came from a coworker asking me, "Why Squad?"

Watercolor illustration of a sunlit woodworking workshop with a team working on projects.

Why Squad feels different from Copilot

Before Squad, my workflow was simple: open Copilot and ask one agent for everything. That works well for plenty of tasks, but it is still one perspective.

When I ran Squad on the same docs task, the output came back with fewer blind spots. It was not just more polished. It covered angles the generalist missed.

That pushed me to design my own team. I started with obvious roles like builder, reviewer, and debugger, then added project-specific specialists with strong opinions about quality. I now run with about 30 agents.

I am directing now, not doing

With Squad, my job shifts from doing the task to directing the process.

I set the plan, the constraints, and the definition of done. Squad coordinates which agents should build, review, challenge assumptions, and test. Sometimes I direct at a high level with goals and standards. Sometimes I assign specific work.

I ask Squad to make better decisions with less intervention from me.

Why build and review should be separate teams

For me, the biggest quality jump came from separating build and review.

At first, I only wanted better output than one generalist could produce. Later, I built a dedicated review team where each reviewer had a different job:

  • One reviewer hunts for security risks.
  • One checks factual correctness.
  • One stress-tests logic and edge cases.
  • One reviews performance tradeoffs.

I do not want eight approvals. I want eight different objections.

I also want those objections resolved through agent collaboration, not by pulling me into every pass. Just like a real PR review, reviewers should send issues back to build, build should improve the work, and the review should run again until everyone agrees the bar is met.

Watercolor illustration of a team examining the same cabinet from different angles

A single generalist reviewing its own output gives me consistency with itself. That is not the same as complete and correct.

For this process, done means reviewer consensus after iterative fixes. If reviewers still disagree, the PR is not done.

"Did it finish?" is the wrong question

Before Squad, I asked:

  • Did it work?
  • What needs to change?
  • Is it done?

With Squad, the better question is: How can this be more consistently correct?

Who else should review this? Which assumptions need pressure testing? What would a performance engineer challenge? What would a security reviewer reject?

That changes my role. I am not just relieved that something exists. I am trying to improve the system that produces it.

Try this once with the same task

If you are curious, run a simple experiment:

  1. Pick a task you already gave to one Copilot agent.
  2. Run it again with a build team.
  3. Then run a separate review team on that output.
  4. Compare what the review team catches.

That comparison is why Squad is interesting to me. I care less about getting one answer and more about running a process that catches weak spots before I ever see the final output. That is the core difference in Squad vs. Copilot for me.

Management becomes the core skill

Once you have a team of agents, management becomes the multiplier.

I think in terms of plans, process, and quality gates. Projects need clear definitions and constant refinement. Squad builds to the plan, reviews against the plan, and makes quality visible.

Each project moves from manual execution toward a repeatable process. For me, the next step is a project-level agent that manages implementation of that plan by treating skills as atomic process definitions. It can chain those skills together, enforce quality gates between phases, and produce reports that show what was supposed to happen, what actually happened, and whether the completion criteria were met.

Summary

Squad changed how I work. What mattered was not getting a slightly better answer from one agent. What mattered was getting a repeatable multi-agent process. Separate build and review teams, explicit skills, quality gates, and shared reports create a clearer path to "done" with less guesswork. Copilot helps me generate; Squad helps me run the work.


Copilot Configuration Files: What They Are, Where They Live, and When to Use Them

· 18 min read

When I work with repos that use Copilot, I keep running into the same issue: people have these files scattered across .github/, but I'm not sure what each one does, why they exist separately, or how they're supposed to work together.

I'll see .github/copilot-instructions.md and .github/agents/ in the same repo, then a prompts/ folder. The files exist, but my mental model doesn't.

This guide is my attempt to create that mental model. I want to know what each file is for, where real repos keep them, and how they are used.

Three matryoshka nesting dolls opened and displayed from largest to smallest — representing the governance layers of instructions, agents, and skills

At first glance, these configuration files look related, but how they nest and relate to each other isn't obvious until you see the full picture.

Quick Reference: The Six File Families

| # | File | Lives at | Purpose | When to use |
|---|------|----------|---------|-------------|
| 1 | Copilot instructions | .github/copilot-instructions.md | Repository-wide governance (always-on context) | "All Copilot work in this repo should follow these rules" |
| 2 | Path-specific instructions | .github/instructions/*.instructions.md | Scoped rules for specific directories or file patterns | "My monorepo has areas with different conventions" |
| 3 | Agent definitions | .github/agents/ | Specialist persona with its own scope, tools, and constraints | "This task needs a reviewer or specialist with tighter boundaries" |
| 4 | Skills | .github/skills/{name}/SKILL.md | Reusable workflow package with instructions, resources, and optional scripts | "This is a repeatable procedure Copilot should know how to do" |
| 5 | Prompt files / prompt docs | Official prompt files: .github/prompts/*.prompt.md; repo-local docs often live in .github/prompts/*.md | Reusable prompt template or task-specific reference context | "I need either a reusable prompt in the IDE or a deeper reference doc for a specific task" |
| 6 | Workflows | .github/workflows/ | GitHub Actions automation and remote triggers | "I need this to run on a label, schedule, dispatch, or PR event" |

Configuration files structure across the six .github families

This structure diagram turns the six file families into one navigable map so you can see which pieces are baseline, specialized, reusable, or event-driven.

Stylized .github home showing where instructions, agents, skills, prompts, and workflows live

The folder-house metaphor emphasizes that these files live together, but each room has a different job.

Those are the most commonly confused files. There are also AGENTS.md (a cross-tool agent instructions convention rather than a Copilot-specific file), hooks, and MCP configs.

The Copilot Instructions File: Governance Layer

/.github/copilot-instructions.md is the ambient context that sits behind Copilot work in a repo. It's not something I invoke manually. It's the baseline.

What it contains:

  • Coding standards and conventions
  • Architecture guidelines
  • Documentation expectations
  • Review requirements
  • What "done" looks like
  • Security constraints
  • Naming conventions

Real examples:

Microsoft MCP Copilot Instructions — a focused baseline file. The heading is Coding Instructions for GitHub Copilot, not "Comprehensive Overview," and it goes straight into always-apply rules, build expectations, and PR guidance.

Azure SDK for JavaScript Copilot Instructions — a denser governance file. The Azure SDK guidance gets specific about API design, implementation choices, and when to explain a deviation from the house style. In my experience, this kind of baseline makes Copilot more likely to produce code aligned with review expectations.

When you use it in Copilot:

  • Open Copilot chat in the repo
  • Ask it to "write a PR description"
  • Copilot includes the repo instructions automatically

How long should it be? Short enough that a human can still scan it. MCP's file is compact. Azure's is more opinionated. If the instructions file starts reading like a runbook instead of repo policy, I move the task-specific detail somewhere else.

Key rule: Instructions describe outcomes and standards. They are the repo's rules of engagement, not the step-by-step workflow.

Path-Specific Instructions: Scoped Rules

I missed this feature entirely until I was reviewing a monorepo that had three different instruction files and I couldn't figure out why Copilot kept changing tone between the frontend and backend code. Turns out, Copilot supports path-specific instruction files in .github/instructions/ — and they're additive with the repo-wide file.

Format: {name}.instructions.md files inside .github/instructions/

Example structure:

.github/
  copilot-instructions.md        # repo-wide (always-on)
  instructions/
    frontend.instructions.md     # applies to /frontend
    backend.instructions.md      # applies to /backend
    tests.instructions.md        # applies to test files

Each file uses YAML frontmatter with an applyTo key to declare which paths it covers:

---
applyTo: "src/frontend/**"
---
Use React functional components with TypeScript.
Prefer CSS modules over inline styles.

Key behavior: When I'm working in a file that matches a path-specific instruction, Copilot combines both the repo-wide instructions and the path-specific ones. It doesn't replace — it layers. This is different from how I initially assumed it worked.
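A toy model of that layering, in Python, may help it stick. This is only a way to picture the behavior, not how Copilot is implemented: the glob matching here uses Python's `fnmatchcase`, which is an assumption and may differ from how `applyTo` patterns are actually evaluated, and the instruction strings and paths are examples.

```python
from fnmatch import fnmatchcase

# Stand-in for .github/copilot-instructions.md (always applies).
REPO_WIDE = "Follow the repo coding standards."

# Stand-ins for .github/instructions/*.instructions.md, keyed by applyTo pattern.
PATH_SPECIFIC = {
    "src/frontend/**": "Use React functional components with TypeScript.",
    "src/backend/**": "Validate every request payload.",
}


def instructions_for(path: str) -> list:
    """Repo-wide rules always apply; matching path rules are added on top."""
    layers = [REPO_WIDE]
    for pattern, text in PATH_SPECIFIC.items():
        if fnmatchcase(path, pattern):
            layers.append(text)
    return layers


print(instructions_for("src/frontend/components/Button.tsx"))
# ['Follow the repo coding standards.',
#  'Use React functional components with TypeScript.']

print(instructions_for("README.md"))
# ['Follow the repo coding standards.']
```

The thing to notice is that the repo-wide entry is never dropped from the list: path-specific files add to it rather than override it.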

When to use path-specific instructions instead of the repo-wide file:

  • Your monorepo has distinct language/framework areas with different conventions
  • Test files need different guidance than production code
  • API code has stricter rules than internal scripts

When NOT to use them:

  • Rules that apply everywhere (keep those in copilot-instructions.md)
  • Task-specific procedures (those belong in skills)
  • One-off guidance (just say it in chat)

Official docs: Adding path-specific custom instructions


Agent Definitions: Specialist Persona

.github/agents/ is where I expect to find specialist agents — focused reviewers or operators with their own scope, tool boundaries, and output format.

What an agent file contains:

  • What the agent specializes in
  • Which tools it can access
  • Task-specific constraints
  • Expected output format
  • When to use it over the default assistant

Real examples:

Azure SDK: archie.agent.md — architecture review specialist. The file defines the reviewer's purpose, allowed tools, scope, and output format. It also references ../prompts/architecture-review-guidelines.md to load its review guidelines.

Azure SDK: scribe.agent.md — documentation review specialist. Same pattern, different charter: README, CHANGELOG, snippets, samples, and TSDoc.

How to invoke it: The exact path depends on the surface. GitHub's customization cheat sheet is the clearest map I found: select the custom agent in the UI, use /agent in Copilot CLI, or reference the agent naturally in chat.

How it differs from instructions:

  • Instructions: Always-on, repo-wide governance
  • Agent: Specialized, on-demand, with its own boundaries

I don't need agents for every repo. A lot of projects are fine with instructions alone. But once I have a recurring task like architecture review or docs review, an agent gives that work a clear owner.


Skills: Reusable Procedures

A skill is a folder with a SKILL.md file plus whatever supporting templates, scripts, or resources the workflow needs. GitHub's docs describe skills as packages that Copilot can load when they're relevant to a task. That detail matters. A skill is not just a long prompt. It's a reusable procedure.

What a skill contains:

  • When to use the skill
  • What it does
  • Input requirements
  • Output expectations
  • Links to deeper references
  • Templates or helper assets
  • Scripts or automation, if the workflow needs them

Real examples:

OpenAI Agents Python: code-change-verification — a concrete verification skill. It tells Copilot when to run the verification stack and points to scripts that execute the checks.

Vercel Next.js: authoring-skills — a nice example of a skill that explains what belongs in a skill versus in AGENTS.md.

How it gets used:

  • In surfaces that support skills, Copilot may auto-load it when the task matches
  • You can ask for that workflow explicitly
  • A custom agent can rely on it as part of a larger procedure

Where they live:

  • .github/skills/ — the GitHub Copilot canonical path
  • .agents/skills/ — the cross-agent open standard path (used by OpenAI, Vercel, and others)

Skill vs. Agent vs. Instructions — the mental model:

GitHub's documentation describes what each feature does, but doesn't frame the relationship as explicitly as I'd like. Here's my mental model: Instructions set the rules. Agents decide. Skills execute. This is my interpretation, not GitHub's official framing, but it helps me remember which tool to reach for.


Skills vs. Prompts: The Critical Difference

This is the distinction that tripped me up most, because people use the word "prompt" to mean two different things.

| Aspect | Skill | Prompt file / prompt doc |
|--------|-------|--------------------------|
| What it is | Agent-skill package with instructions, resources, and sometimes scripts | Either an official .prompt.md template or a repo-local markdown doc used as reference context |
| How it gets used | In supported surfaces, Copilot may auto-load it when relevant, or you ask for it explicitly | Official prompt files are typically user-selected or referenced in supported IDEs; repo-local docs can also be linked from agents or workflows |
| What's inside | Steps, commands, templates, helper assets | Examples, patterns, checklists, input variables, edge cases |
| CLI support | Yes | Official .prompt.md prompt files are IDE-focused, not Copilot CLI |

The caveat that matters: Official GitHub prompt files use .prompt.md and are an IDE feature. Azure's .github/prompts/*.md files are different. They're repo-local reference docs that agents or workflows load, not the product feature GitHub documents as prompt files.

That means my old rule of thumb was too neat. Prompts are not always passive reference, and skills are not always the only place with step-by-step guidance. The better distinction is this: skills are packaged workflows Copilot can use as skills; prompt files and prompt docs are reusable context.

Decision tree for choosing between skills, official prompts, prompt docs, or no file

The decision tree compresses the distinction into one question sequence: reusable workflow first, reusable prompt second, reference doc third.


Prompts: Task-Specific Expertise

.github/prompts/ is where many repos store task-specific context. Sometimes those files are official GitHub prompt files ending in .prompt.md. Sometimes, and this is what Azure JS does, they are plain markdown docs that reviewers load as guidance.

What a prompt file or prompt doc contains:

  • Patterns and examples
  • Detailed guidance for a specific task
  • Decision rules
  • Known edge cases
  • Sometimes input placeholders if it's an official prompt file

Real examples:

Azure SDK: architecture-review-guidelines.md — a repo-local reference doc. It defines the review scope and checklist for architecture review, and Archie points to it directly.

GitHub gh-aw: create-agentic-workflow.md — another repo-local guide. It explains the markdown workflow structure used by gh-aw. Useful context, yes. An official GitHub prompt-file feature, no.

How agents use them: Agents or workflows can read these docs at runtime. In Azure JS, the architecture review guidance also points back to .github/copilot-instructions.md, which is a nice example of the layers staying connected instead of duplicating each other.

Where they live:

  • Official prompt files: .github/prompts/*.prompt.md
  • Repo-local reference docs: often .github/prompts/*.md
  • Organized by task: architecture review, dependency review, release prep, and so on
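The naming split above is mechanical enough to check in code. Here is a minimal sketch, assuming the path and suffix are the only signals you have. `classify_prompt_path` is a hypothetical helper for illustration, not part of any GitHub tooling.

```python
from pathlib import PurePosixPath


def classify_prompt_path(path: str) -> str:
    """Classify a repo-relative path using the naming conventions described above."""
    p = PurePosixPath(path)
    # Assumption for this sketch: only files under .github/prompts/ are considered.
    if p.parts[:2] != (".github", "prompts"):
        return "not a prompt location"
    # Official prompt files carry the double suffix .prompt.md.
    if p.name.endswith(".prompt.md"):
        return "official prompt file"
    # Any other markdown file in the folder is treated as a repo-local doc.
    if p.suffix == ".md":
        return "repo-local reference doc"
    return "other"
```

The order of the checks matters: `.prompt.md` has to be tested before the plain `.md` suffix, otherwise every official prompt file would be reported as a reference doc.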

Why separate files? Because the detail belongs close to the task, not smeared across every Copilot interaction. When instructions become hard to scan, moving task-specific context into separate files keeps the baseline clean.


Workflows: Automation Triggers

.github/workflows/ is standard GitHub Actions. I include it here because people often blame Copilot for behavior that is really coming from a workflow trigger.

What a workflow does:

  • Listens for triggers such as push, label, schedule, or manual dispatch
  • Runs remote automation on GitHub
  • Starts a reviewer or other automation path
  • Posts status, comments, or checks back to the PR

Real example:

Azure SDK for JavaScript reviewer workflows — not one big controller that fans everything out. Each reviewer has its own workflow file. Archie and Dexter both trigger on pull_request_target with types: [labeled], and Dash says plainly that its job is to review a PR for performance regressions.

How it connects to agents and skills: The workflow is the remote trigger. In Azure JS, each reviewer workflow posts its own findings back to the PR. In other repos, a workflow might also call a custom agent or load a skill, but that's not the part I can prove from this example.
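To make the "separate workflows, no central controller" shape concrete, here is a small Python model of the trigger logic. It is a sketch under assumptions: the label names are placeholders I made up, and matching on a specific label is modeled here as a simple string comparison rather than copied from the real workflow files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewerWorkflow:
    name: str
    trigger_label: str  # hypothetical label name; check the real workflow file

    def should_run(self, event: str, action: str, label: str) -> bool:
        # Mirrors pull_request_target with types: [labeled].
        # Opening a PR alone never starts the run.
        return (
            event == "pull_request_target"
            and action == "labeled"
            and label == self.trigger_label
        )


# Placeholder labels, not the labels Azure JS actually uses.
WORKFLOWS = [
    ReviewerWorkflow("archie", "review:architecture"),
    ReviewerWorkflow("dexter", "review:dependencies"),
]


def workflows_for(event: str, action: str, label: str = "") -> list[str]:
    """Each workflow decides independently whether the event applies to it."""
    return [w.name for w in WORKFLOWS if w.should_run(event, action, label)]
```

Under this model, `workflows_for("pull_request_target", "opened")` returns an empty list, and adding one label starts at most the reviewer that owns that label.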


How They Wire Together: A Real Workflow

Here's the version of the Azure SDK flow I can actually defend with links.

1. A label triggers the run. In Azure JS, the review workflows listen for pull_request_target with types: [labeled], not for PR creation. You can see that in archie.md and dexter.md.

2. Each reviewer has its own workflow. Archie, Dash, Dexter, and Scribe are separate files under .github/workflows/. They can run independently when their trigger label appears. This is not one workflow spawning four reviewers in parallel.

3. The workflow itself states the review job. Dash literally says it reviews the PR for performance regressions and anti-patterns. Archie and Dexter follow the same pattern for architecture and dependency review.

4. Agent files and prompt docs are the reusable layer. Separately from the workflow trigger, archie.agent.md defines the architecture reviewer as a reusable custom agent and references ../prompts/architecture-review-guidelines.md. That prompt doc then points back to the repo instructions in .github/copilot-instructions.md.

5. Findings land on the PR from each workflow. Archie and Scribe both declare create-pull-request-review-comment and submit-pull-request-review in safe-outputs, with run messages for the PR review lifecycle. So the accurate mental model is: each reviewer workflow posts its own findings. There isn't a separate "collector" workflow in this example.

6. Skills are adjacent, not proven by this example. Skills still matter to the overall ecosystem, but Azure's review workflows are not the example I would use to claim that agents are invoking skills behind the scenes. If I want to show skills, I use an actual SKILL.md repo.


Using Copilot with These Files: Chat and CLI

In Copilot Chat (Browser or VS Code)

Scenario 1: You're in a repo and you open Copilot chat.

You: write a PR description

Copilot reads .github/copilot-instructions.md and picks up things like:

  • PR title format
  • Required sections
  • Review expectations

Result: the draft is much more likely to match the repo's stored rules.

Scenario 2: You want the specialist reviewer.

You: use the archie agent to review my API changes

Depending on the surface, you might select archie in the agent picker, use /agent in Copilot CLI, or reference it naturally in chat. The important part is the same: the agent file gives that review task its own scope and constraints.

In Copilot CLI

Scenario 1: You're working locally and you want a code review from the terminal.

copilot
/review focus on bugs and security issues in my current changes

GitHub documents /review as the CLI code review path in its Copilot CLI docs and agentic code review guide. It replaces the old gh copilot pr-review mental model I was carrying around.

Scenario 2: You want to trigger the remote GitHub Actions reviewer from your terminal.

gh workflow run archie.md -f item_number=42

That command is a remote workflow dispatch through GitHub CLI, not local Copilot behavior. Azure's reviewer workflows expose workflow_dispatch with an item_number input in archie.md, so gh can start the run on GitHub.
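If you dispatch reviewers from a script instead of typing the command, it helps to build the argument list in one place. This is a minimal sketch that only assembles and prints the command from the example above; `build_dispatch_command` is a hypothetical helper, and the workflow name and `item_number` input are taken from this article, not verified against your repo.

```python
import shlex


def build_dispatch_command(workflow: str, item_number: int) -> list[str]:
    """Build the argv for a remote workflow dispatch through GitHub CLI."""
    if item_number <= 0:
        raise ValueError("item_number must be a positive PR or issue number")
    # -f passes a string input to the workflow_dispatch trigger.
    return ["gh", "workflow", "run", workflow, "-f", f"item_number={item_number}"]


cmd = build_dispatch_command("archie.md", 42)
# Print rather than execute, so the sketch runs without gh installed.
print(shlex.join(cmd))  # prints: gh workflow run archie.md -f item_number=42
```

To actually start the run, pass `cmd` to `subprocess.run(cmd, check=True)` from a checkout where `gh` is authenticated. Keeping the command as a list avoids shell quoting problems if a workflow name ever contains spaces.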


Decision Table: Which File for What?

Decision tree for placing work in instructions, agents, skills, prompts, or workflows

This placement tree turns the file choice into a routing problem: match the task shape, then drop it in the right folder.

| I need to... | Put it in... | Why |
| --- | --- | --- |
| Set coding standards | Copilot instructions | Always-on, repo-wide governance |
| Require testing notes on all PRs | Copilot instructions | It's a rule, not a workflow |
| Automate PR review on labels | Workflows + reviewer definitions | Labels are a GitHub Actions trigger, and the reviewer needs its own scope |
| Store detailed architecture patterns | Prompt docs | Too bulky for baseline instructions; reviewers can load them when needed |
| Make code generation or verification repeatable | Skills | Copilot can load the packaged workflow when the task matches |
| Handle a one-time task | Chat | Some work doesn't need a file at all |

Minimal Repo Setup

If I'm starting fresh and I don't want to overengineer this, I keep it boring.

Minimal viable setup:

  • .github/copilot-instructions.md (recommended baseline)
  • No agents yet unless I have a recurring specialist task
  • No skills yet unless I keep repeating the same workflow
  • No prompt docs yet unless the task-specific context is making the instructions file noisy

As the repo grows:

  • Add agents when a task needs its own tool restrictions or review charter
  • Add skills when a workflow deserves packaging
  • Add prompt docs when a reviewer or task needs deeper context than the baseline file should carry

Start minimal, then add layers only when each layer removes confusion.
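One way to keep yourself honest about "add layers only when needed" is to count what each layer currently holds. The sketch below does that with the standard library. The instructions, prompts, and workflows paths come from this article; the agents and skills locations are assumptions on my part, so adjust the patterns to your repo. `audit_layers` is a hypothetical helper.

```python
from pathlib import Path

# Layer name -> glob pattern relative to the repo root.
# The agents and skills patterns are assumptions; adjust them to your layout.
LAYERS = {
    "instructions": ".github/copilot-instructions.md",
    "agents": ".github/agents/*.agent.md",
    "skills": ".github/skills/*/SKILL.md",
    "prompts": ".github/prompts/*.md",
    "workflows": ".github/workflows/*",
}


def audit_layers(repo_root: str) -> dict[str, int]:
    """Count the files present for each layer so empty or crowded layers stand out."""
    root = Path(repo_root)
    return {
        layer: sum(1 for p in root.glob(pattern) if p.is_file())
        for layer, pattern in LAYERS.items()
    }
```

A fresh repo that follows the minimal setup would report one instructions file and zero for everything else. A layer with many files and no clear owner is a hint that it was added before it removed any confusion.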


How to Find Examples in Real Repos

Here are the examples I would study first.

Azure SDK for JavaScript — the clearest end-to-end example I found

Microsoft MCP — a readable baseline instructions file

GitHub gh-aw — a good example of repo-local workflow guidance that looks prompt-like

Vercel Next.js — useful for skill design patterns


The One-Sentence Rule

Woman viewing four landscape paintings on a wall — the same mountain at sunrise, noon, rain, and sunset — representing the config files seen from different angles

If I'm stuck deciding where something belongs, this is still the fastest check I know:

  • If it reads like a rule, put it in instructions.
  • If it reads like a reusable workflow, put it in a skill.
  • If it reads like task-specific context, put it in a prompt doc or prompt file.
  • If it has to run on GitHub events, put it in a workflow.