🛠️ Infrastructure, SRE & Incident Operations

Production reliability workflows for Kubernetes, incidents, observability, backups, deploy safety, infrastructure drift, alerts, and runbook-driven debugging.

Live page: https://agentskillexchange.com/industry-skills/#infrastructure-sre-incident-operations
Homepage access: Curated Collections on https://agentskillexchange.com/

Who this is for

SREs, platform engineers, DevOps leads, and infrastructure operators responsible for production reliability.
Teams that need Kubernetes, observability, backup, deployment, and incident response workflows gathered on one operating shelf.

Jobs covered

Investigate production incidents across Kubernetes, logs, cloud signals, and observability backends.
Review and test infrastructure changes before a deploy, Helm release, Terraform change, or cluster manifest causes an outage.
Run backup, restore, retention, and alert routing workflows with operator-visible evidence.
Keep cluster operations, dashboards, and runbooks within reach during on-call shifts.
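
As a rough illustration of the "review infrastructure changes" job, the sketch below reads the JSON form of a Terraform plan (as produced by `terraform show -json <planfile>`) and separates changes that delete or replace resources from the rest. It is not taken from any skill in this collection. The function name `summarize_plan`, the `RISKY_ACTIONS` set, and the sample resource addresses are hypothetical; only the `resource_changes[].change.actions` layout comes from Terraform's JSON plan format.

```python
import json

# Assumption: any plan action set containing "delete" (including a
# delete/create replacement) should be reviewed by a human before merge.
RISKY_ACTIONS = {"delete"}


def summarize_plan(plan_json: str) -> dict:
    """Split planned resource changes into 'risky' and 'safe' buckets."""
    plan = json.loads(plan_json)
    summary = {"risky": [], "safe": []}
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # Skip entries that do not change real infrastructure.
        if actions in (["no-op"], ["read"]):
            continue
        bucket = "risky" if RISKY_ACTIONS & set(actions) else "safe"
        summary[bucket].append((rc["address"], "/".join(actions)))
    return summary


# Hypothetical sample shaped like `terraform show -json` output.
SAMPLE_PLAN = json.dumps({
    "resource_changes": [
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
    ]
})

if __name__ == "__main__":
    for bucket, items in summarize_plan(SAMPLE_PLAN).items():
        for address, actions in items:
            print(f"{bucket:5} {actions:13} {address}")
```

A reviewer could attach the "risky" bucket to a pull request as the review artifact, leaving the decision to merge with a person.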

Workflow Stacks

Kubernetes incident triage: Open cluster context → tail affected pod logs → inspect crash signals → correlate observability → capture runbook notes
Safe infrastructure change: Review Terraform or Helm diff → lint automation → run cluster tests → sync GitOps target → watch rollout health
Backup and recovery readiness: Schedule backups → verify retention → test restore path → route failure alerts → record evidence
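
The Kubernetes incident triage stack above can be sketched as an ordered list of read-only commands. This is a minimal sketch, not the behavior of any listed skill: `TriageTarget`, `triage_commands`, and `render` are hypothetical names, and the context, namespace, and label values are placeholders. The code only builds and prints the command lines (using common `kubectl` and `stern` flags), so an operator can review them before running anything.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageTarget:
    """Where to look: all values are supplied by the responder."""
    context: str       # kubeconfig context of the affected cluster
    namespace: str     # namespace of the affected workload
    selector: str      # label selector, e.g. "app=api"
    since: str = "15m"  # how far back to tail logs


def triage_commands(t: TriageTarget) -> list:
    """Return read-only commands in triage order, as argv lists."""
    kubectl = ["kubectl", "--context", t.context, "-n", t.namespace]
    return [
        # 1. Pod status and restart counts for the affected label.
        kubectl + ["get", "pods", "-l", t.selector, "-o", "wide"],
        # 2. Tail logs from every matching pod with stern.
        ["stern", "--context", t.context, "-n", t.namespace,
         "-l", t.selector, "--since", t.since],
        # 3. Recent events, oldest first, to spot crash signals.
        kubectl + ["get", "events", "--sort-by=.lastTimestamp"],
    ]


def render(commands: list) -> str:
    """Render argv lists as shell-safe lines for runbook notes."""
    return "\n".join(shlex.join(cmd) for cmd in commands)


if __name__ == "__main__":
    target = TriageTarget(context="prod", namespace="payments",
                          selector="app=api")
    print(render(triage_commands(target)))
```

Keeping the commands as data rather than executing them directly makes it easy to paste the exact steps into runbook notes at the end of the triage.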

Recommended Picks

Skill	What it does here	Persona	Install	Stars
Deploy Kubernetes-native agents with kagent	Adds a Kubernetes-native agent control plane for cluster automation experiments that still need operator boundaries.	Platform engineer / SRE lead	High	2.9k
Control Kubernetes infrastructure through natural-language MCP workflows	Lets operators inspect and act on Kubernetes resources through MCP instead of brittle one-off kubectl prompts.	SRE / platform operator	High	898
Run Kubernetes cluster operations through MCP	Covers cluster-level MCP operations for routine checks, triage, and controlled remediation.	Kubernetes operator	High	1.4k
Operate Kubernetes and OpenShift clusters through MCP	Extends the cluster-ops workflow to Kubernetes and OpenShift environments where enterprise platform teams need consistent runbooks.	Platform engineer / OpenShift admin	High	1.6k
Investigate production incidents across Kubernetes and cloud signals with HolmesGPT	Pulls Kubernetes and cloud signals into an incident investigation loop instead of leaving responders to correlate dashboards manually.	Incident commander / SRE	Medium	2.3k
Tail multi-pod Kubernetes logs by label during incidents with Stern	Gives responders a fast multi-pod log tailing step during label-scoped outages.	On-call engineer	Low	4.6k
Kubernetes Pod Crash Diagnostics	Turns CrashLoopBackOff and pod failure evidence into a focused diagnosis packet for responders.	Kubernetes support engineer	Medium	121.7k
K9s Kubernetes Terminal Dashboard	Provides a fast terminal cockpit for cluster inspection during normal operations and incidents.	SRE / cluster operator	Low	33.2k
Polaris Kubernetes Best Practices Validator	Checks Kubernetes workload readiness and policy hygiene before manifest mistakes reach production.	Platform reliability reviewer	Low	3.4k
Lint Ansible playbooks and roles before automation breaks in prod with ansible-lint	Prevents configuration automation failures by linting playbooks before they run against infrastructure.	DevOps engineer	Low	3.9k
ArgoCD GitOps Sync Automator	Automates GitOps sync checks and actions for teams running ArgoCD-backed production deploys.	Release engineer / platform operator	Medium	22.5k
Helm Chart Diff & Upgrade Manager	Makes Helm upgrades reviewable by surfacing chart diffs before cluster changes land.	Kubernetes release engineer	Medium	29.7k
Plan and apply many Helm releases from one declarative state before cluster changes drift out of sync with Helmfile	Coordinates many Helm releases from a single desired state so platform drift is visible before apply.	Platform engineer	High	5.1k
Run declarative Kubernetes test suites against clusters before operator or manifest changes merge with KUTTL	Runs declarative Kubernetes tests before operator or manifest changes become incident fuel.	Platform QA / SRE	Medium	804
Terraform Plan Diff Analyzer	Turns Terraform plan output into an infrastructure-review artifact before risky changes merge.	Infrastructure reviewer	Medium	48.1k
Terraform Drift Detector	Detects drift between declared Terraform state and real infrastructure before incidents or surprise costs appear.	Cloud platform engineer	Medium	48.1k
Orchestrate database backup, restore, retention, and failure-notification runbooks through Databasement	Covers backup, restore, retention, and failure notification as an auditable runbook rather than an ad hoc database chore.	Database operator / SRE	Medium	315
Schedule and retain cross-database backups from one self-hosted control plane with Databasus	Centralizes scheduled backups and retention across databases for smaller platform teams.	Infrastructure operator	Medium	6.6k
Restic Fast Encrypted Backup Program	Adds encrypted file and repository backup workflows for production recovery planning.	SRE / backup owner	Low	32.9k
Netdata Real-Time Infrastructure Monitoring and Alerting	Gives infrastructure teams immediate host and service telemetry for alerting and triage.	SRE / infra operator	Medium	78.4k
SigNoz Open-Source Observability Platform	Provides logs, metrics, and traces for production reliability investigations without a proprietary-only observability stack.	Observability engineer	High	26.5k
OpenObserve Cloud-Native Observability Platform for Logs Metrics and Traces	Adds a cloud-native observability backend for logs, metrics, traces, and incident search.	Observability platform owner	High	18.5k
Convert SMTP-only alerts into routed notification deliveries with Mailrise	Routes legacy SMTP-only alerts into modern notification channels so incidents reach the right responders.	On-call tooling owner	Low	1.5k
Datadog Integration Connector	Connects Datadog signals into agent workflows for incident context, monitor review, and dashboard triage.	SRE / observability engineer	Medium	791
Trace and debug agent runs with AgentOps	Adds run-level tracing for agent systems that SRE teams need to debug after failures or noisy automation.	SRE / agent platform operator	Medium	5.6k
Trace LLM and agent workflows with OpenLLMetry	Connects LLM and agent traces to OpenTelemetry-style incident investigation workflows.	Observability engineer / platform SRE	Medium	7.2k
Evaluate and monitor LLM workflows with Agenta	Adds monitoring and evaluation coverage for production LLM workflows that need regression signals during incidents.	AI platform SRE / reliability owner	Medium	4.2k

Editorial Notes

This collection crosses Runbooks, Monitoring, CI/CD, Code Quality, Developer Tools, and Integrations because incident work rarely stays inside one category.
Kubernetes agent picks are listed only where they represent a distinct operator workflow that stronger, security-reviewed alternatives do not cover.
Avoid autonomous remediation by default; the collection is built around reviewable signals, controlled runbooks, and operator handoff.
Keep this persona-based for SRE and platform operations; do not turn it into a generic Monitoring or CI/CD category mirror.
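
One way to picture the "no autonomous remediation by default" note is a runbook runner that executes read-only steps freely but stops at the first mutating step unless an operator has approved it by name. This is an illustrative sketch only; `RunbookStep`, `RunbookResult`, `run_runbook`, and the example step names are hypothetical and do not describe how any skill in the collection is implemented.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RunbookStep:
    name: str
    action: Callable[[], str]  # returns evidence text for the record
    mutating: bool = False     # True if the step changes production state


@dataclass
class RunbookResult:
    executed: list = field(default_factory=list)
    awaiting_approval: list = field(default_factory=list)


def run_runbook(steps, approved=frozenset()) -> RunbookResult:
    """Run steps in order, pausing at any unapproved mutating step."""
    result = RunbookResult()
    for step in steps:
        if step.mutating and step.name not in approved:
            # Hand off to the operator; later steps may depend on this one.
            result.awaiting_approval.append(step.name)
            break
        result.executed.append((step.name, step.action()))
    return result


# Hypothetical runbook: the lambdas stand in for real checks and actions.
EXAMPLE_STEPS = [
    RunbookStep("collect-logs", lambda: "logs captured"),
    RunbookStep("restart-deployment", lambda: "restarted", mutating=True),
    RunbookStep("verify-health", lambda: "healthy"),
]

if __name__ == "__main__":
    first = run_runbook(EXAMPLE_STEPS)
    print("executed:", first.executed)
    print("awaiting approval:", first.awaiting_approval)
    second = run_runbook(EXAMPLE_STEPS, approved={"restart-deployment"})
    print("after approval:", [name for name, _ in second.executed])
```

The evidence strings returned by each step give the operator a record of what ran before the handoff point.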

Adjacent Collections

← Back to industry collections
