Prompts as playbooks: how infrastructure teams codify operational knowledge


Every infrastructure engineer I know runs prompts in their head. They just don't call them that.

You're deploying a new service to production. Without thinking about it, you follow a known sequence to ensure a service will run without impacting existing production workloads. You've done this dozens of times. You don't consult documentation. The checklist lives in your muscle memory, refined by every deployment that went sideways at 2am.

Now think about what happens when you leave. Or when a new engineer joins. Or when the on-call rotation hits someone who's never touched that particular stack. The knowledge doesn't transfer. It sits in chat threads, in wiki pages that were accurate six months ago, and in the heads of others who are currently asleep.

This is the runbook problem. Most infrastructure teams have accepted it as a cost of doing business. I don't think they need to.

The runbook is a document, the playbook is executable

Traditional runbooks outline procedures for common operations. The problem isn't always that they're wrong or out of date (though they frequently are); it's that they can't enforce anything. A runbook says "verify the VPC peering connection is active," but it can't check whether you actually did it. It can't stop you from skipping ahead when you're tired or overconfident.

The insight that changed how I work is that the gap between a runbook and a prompt is minimal. A runbook says "check the pod logs for OOMKilled events". A prompt says the same thing, except now Cline actually checks the logs, parses the output, and tells you what it found. The step becomes executable instead of theoretical.

What makes a prompt a playbook

Not every prompt is a playbook. Most prompts are one-off requests: "fix this error" or "why is this pod crashlooping". Those are fine for ad-hoc work, but they're ephemeral. They disappear when the conversation ends.

A playbook is a prompt that's been hardened. Specifically, it's been tested against real infrastructure, not just theorized about. It's version-controlled alongside the code it operates on. It's composable with other playbooks for multi-step operations. And critically, it's executable by someone who didn't write it.

The mechanism for this in Cline is .clinerules: markdown files that live in your project's .clinerules/ directory. When toggled on, their content gets appended directly to Cline's system prompt. They're not passive documentation; they actively shape how Cline reasons about your infrastructure. And since they're just files in your repo, they go through the same PR review, version control, and CI processes as everything else.

The progression from prompt to playbook follows the same pattern as writing traditional playbooks or automations. You solve a problem with a one-off prompt. A week later you solve the same problem again and realize you're repeating yourself, so you extract the pattern into a .clinerules file. Your teammate uses it and finds an edge case you missed. They update the rule. Now the team's collective operational knowledge lives in version-controlled markdown instead of someone's head.

Starting simple with Terraform module review

The easiest place to start is codifying the review criteria you already apply mentally. Here's a .clinerules file I use for Terraform work:

# terraform-review.md

When reviewing or generating Terraform code, enforce these standards:

Variable naming: use snake_case. Resource names must include the environment
prefix (e.g., prod_, staging_). No hardcoded values for regions, zones, or
machine types; these must be variables with sensible defaults.

Provider constraints: every module must pin provider versions with ~> operator.
No >= without an upper bound. Check that required_providers blocks exist and
are current.

State safety: flag any resource that uses lifecycle { prevent_destroy = false }
on stateful resources (databases, persistent disks, encryption keys). These
should default to prevent_destroy = true unless there's an explicit comment
explaining why.

Security patterns: no inline IAM policies. Use google_project_iam_member with
custom roles over primitive roles. Flag any use of roles/editor or roles/owner.
Workload Identity over JSON key files, always. If you see a
google_service_account_key resource, flag it and suggest Workload Identity
instead.

Cost awareness: flag n1-standard machine types (suggest n2 or e2 equivalents).
Flag any persistent disk over 500GB without a comment justifying the size.
Check for committed use discount eligibility on long-running instances.

This isn't sophisticated; it's a checklist similar to the one I used to run in my head. The difference is that every engineer on the team now applies the same criteria. Cline catches things humans skip when they're in a hurry, and the rules evolve as we learn. That knowledge becomes permanent and codified, so it can be applied to future changes without being overlooked.
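These rules also don't require AI to be checkable. As a rough illustration of what the provider-constraint rule above actually enforces, here's a standalone sketch that scans HCL text for unbounded `>=` version constraints (the regex is simplistic and hypothetical; real tooling would use a proper HCL parser):

```python
import re

def check_provider_constraints(hcl: str) -> list[str]:
    """Flag provider version constraints that use >= without an upper bound.

    A simplistic sketch of the 'provider constraints' rule; it only
    pattern-matches lines like: version = ">= 4.0"
    """
    findings = []
    for match in re.finditer(r'version\s*=\s*"([^"]+)"', hcl):
        constraint = match.group(1)
        # ~> pins a range; >= alone is unbounded unless paired with < something
        if ">=" in constraint and "<" not in constraint:
            findings.append(f'unbounded constraint: "{constraint}" (prefer ~>)')
    return findings
```

Encoding the same rule in a .clinerules file means Cline applies it during review and generation, not just in CI, but the intent is identical.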

Incident triage as a playbook

Incident response is where tribal knowledge costs the most. The engineer who knows everything about a production system can't help when they're on vacation and that system goes down.

Here's a sample incident triage playbook that encodes that kind of knowledge:

# incident-triage.md

When triaging a production incident, follow this sequence. Do not skip steps
or jump to remediation before completing diagnosis.

## Step 1: scope the blast radius
Identify which services are affected. Check the ingress controller logs, not
just the application logs. Determine whether the issue is isolated to a single
pod, a node, a namespace, or cluster-wide. Report the blast radius before
proceeding.

## Step 2: establish a timeline
Find the first anomalous signal. Check monitoring dashboards for the affected
service, then work backward. Common first signals in our stack: connection
pool exhaustion shows up as elevated /healthz latency 3-5 minutes before
user-facing errors. OOMKilled events precede pod restarts. Certificate
expiration warnings appear in ingress controller logs 30 days before failure
but are often ignored.

## Step 3: identify the change
Check the deployment history for the affected namespace. Run kubectl rollout
history. Cross-reference with recent PRs merged to the relevant environment
branch. Check if any Terraform applies ran in the last 24 hours.
Infrastructure changes and application deploys are equally likely culprits.

## Step 4: propose remediation
Based on the diagnosis, propose a specific remediation. For each proposed
action, state what it will change, what the rollback path is, and what
verification confirms success. Do not propose "restart the pod" without first
understanding why it crashed.

## Step 5: dry run before applying
If the remediation involves infrastructure changes (Terraform, Helm, kubectl
apply), show the plan/diff output first. Never apply directly. If the
remediation involves a rollback, verify the target revision is known-good.

This is the knowledge that a senior engineer carries implicitly. Written down as a runbook, it gets skimmed. As a .clinerules playbook, it becomes the actual flow Cline follows when you say "help me triage this outage." Cline checks the logs in the right order, establishes the timeline before jumping to conclusions, and proposes remediation with rollback paths because the playbook requires it.

Workflows and hooks for multi-step operations

Simple .clinerules files work well for always-on guidance, but some infrastructure operations need more structure: a defined sequence of steps, explicit approval gates, and precise control over which commands run. This is where Cline's Workflows and Hooks come in.

A workflow is a markdown file that lives in .clinerules/workflows/ and gets invoked on demand by typing /workflow-name.md in the chat. Unlike rules (which are always active), workflows run only when you call them. They combine natural language instructions with XML tool syntax for precise control, so you can mix high-level reasoning steps with exact commands that must run verbatim. Here's what an incident response workflow looks like:

# incident-response.md

Guided incident response workflow. Do not skip steps or jump to remediation
before completing diagnosis.

## Step 1: Detect and scope the incident
Check the ingress controller logs and application logs to identify affected
services. Determine whether the issue is isolated to a single pod, node,
namespace, or cluster-wide.
<execute_command>
<command>kubectl get events --sort-by=.lastTimestamp -A | head -50</command>
</execute_command>
Report the blast radius before proceeding.

## Step 2: Snapshot context
Capture the current state of the affected resources so we have a baseline
for comparison after remediation.
<execute_command>
<command>kubectl get pods -A -o wide | grep -v Running</command>
</execute_command>

## Step 3: Establish timeline and propose action
Find the first anomalous signal. Cross-reference with deployment history
and recent Terraform applies. Based on the diagnosis, propose a specific
remediation. For each proposed action, state what it will change, what the
rollback path is, and what verification confirms success.

## Step 4: Approval gate
<ask_followup_question>
<question>Here is the proposed remediation. Proceed with dry run?</question>
<options>["Yes, run dry run", "Modify the plan", "Abort"]</options>
</ask_followup_question>

## Step 5: Dry run
Execute the remediation in dry-run mode. Show the plan/diff output.
Never apply directly without review.

## Step 6: Apply and verify
After approval, apply the change and verify recovery. Confirm the affected
services are healthy before marking the incident resolved.

Type /incident-response.md and Cline walks through each step in sequence, pausing for your approval at each stage. The <ask_followup_question> tool creates an explicit approval gate; Cline won't proceed until you choose. The <execute_command> blocks ensure specific diagnostic commands run exactly as written rather than being interpreted loosely.

For programmatic enforcement that goes beyond what a workflow can express, Cline has Hooks: executable scripts that run automatically at key lifecycle moments (before a tool executes, when a task starts, after a command completes). A PreToolUse hook could block kubectl apply unless a corresponding kubectl diff ran first, or prevent Terraform from applying outside a change window. Hooks return JSON to either allow or cancel the operation, so guardrails are enforced by code rather than instructions.
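To make the kubectl example concrete: a PreToolUse hook of that kind might look roughly like the sketch below. The stdin/stdout JSON shape and the marker-file convention are both hypothetical; consult the Cline hooks documentation for the real schema.

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse hook: block `kubectl apply` unless a recent
`kubectl diff` ran first. Assumes (hypothetically) that the hook receives
the pending command as JSON on stdin and replies with a JSON decision."""
import json
import os
import sys
import time

# Hypothetical marker file, touched by a companion hook whenever `kubectl diff` runs.
MARKER = "/tmp/last-kubectl-diff"
MAX_AGE = 15 * 60  # a diff older than 15 minutes doesn't count

def decide(event: dict) -> dict:
    command = event.get("command", "")
    if "kubectl apply" not in command:
        return {"allow": True}  # not an apply; let it through
    try:
        age = time.time() - os.path.getmtime(MARKER)
    except OSError:
        return {"allow": False, "reason": "Run `kubectl diff` before `kubectl apply`."}
    if age > MAX_AGE:
        return {"allow": False, "reason": "Last `kubectl diff` is stale; re-run it."}
    return {"allow": True}

if __name__ == "__main__":
    print(json.dumps(decide(json.load(sys.stdin))))
```

The point is that the guardrail is a few dozen lines of code, checked in next to the playbooks it enforces.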

Workflows and hooks complement each other. Together they cover the same ground as a structured pipeline (detection, diagnosis, approval, execution, verification), all version-controlled alongside your infrastructure code.

The decisions are the product

There's a deeper point here that goes beyond tooling. When you build a .clinerules playbook for code review, you're encoding decisions that took years to accumulate. Which CIDR ranges don't conflict across environments? What machine types balance cost and performance for your workload? Which IAM patterns are secure enough for production? What does your team's deployment flow actually look like when it works?

These decisions are the hard part, the code that implements them is comparatively straightforward. A junior engineer, or Cline, can apply a well-specified code change correctly, but knowing which module to use, what the variables should be, and what the security implications are? That takes experience. 

Playbooks are how you package that experience so it compounds instead of evaporating. Quality playbooks lead to faster onboarding, fewer incidents, and more confidence in the automation. It's the same logic behind infrastructure-as-code, applied one layer up to the operational knowledge that tells you how and when to use the infrastructure code.

Getting started

If you're an infrastructure engineer thinking about this, here's where I'd begin.

Pick one task you repeat at a regular interval: deployments, cert rotations, log analysis, capacity reviews, anything with a repeatable structure. Write the steps you follow as a .clinerules markdown file. Don't overthink it; write it like you're explaining the task to a competent engineer who's never seen your stack. Drop it in your project's .clinerules/ directory, toggle it on in Cline, and run through the task. You'll immediately see where the playbook is too vague (Cline will ask clarifying questions) and where it's too rigid (Cline will do exactly what you said even when the situation calls for something different). Refine it. Commit the updated version.

Then share it. The Cline Community Prompts repository exists specifically for this. If you've built playbooks for Terraform, Kubernetes, CI/CD pipelines, observability, incident response, or any other infrastructure domain, the community benefits from seeing how you've structured them.

Infrastructure knowledge shouldn't be tribal. It should be version-controlled, peer-reviewed, and executable. That's all a playbook is.

Share your infrastructure playbooks in the cline/prompts repo, or discuss what's working for your team on Discord and Reddit. If you want to dig deeper into .clinerules and workflows, start with the Cline documentation.