Skip to content

Latest commit

 

History

History
73 lines (57 loc) · 9.87 KB

File metadata and controls

73 lines (57 loc) · 9.87 KB

🛠️ Infrastructure, SRE & Incident Operations

Production reliability workflows for Kubernetes, incidents, observability, backups, deploy safety, infrastructure drift, alerts, and runbook-driven debugging.

Who this is for

  • SREs, platform engineers, DevOps leads, and infrastructure operators responsible for production reliability.
  • Teams that need Kubernetes, observability, backup, deployment, and incident response workflows joined into one operating shelf.

Jobs covered

  • Investigate production incidents across Kubernetes, logs, cloud signals, and observability backends.
  • Review and test infrastructure changes before deploys, Helm releases, Terraform diffs, or cluster manifests cause outages.
  • Run backup, restore, retention, and alert routing workflows with operator-visible evidence.
  • Keep cluster operations, dashboards, and runbooks close enough for on-call use.

Workflow Stacks

  • Kubernetes incident triage: Open cluster context → tail affected pod logs → inspect crash signals → correlate observability → capture runbook notes
  • Safe infrastructure change: Review Terraform or Helm diff → lint automation → run cluster tests → sync GitOps target → watch rollout health
  • Backup and recovery readiness: Schedule backups → verify retention → test restore path → route failure alerts → record evidence

Recommended Picks

Skill What it does here Persona Install Stars
Deploy Kubernetes-native agents with kagent Adds a Kubernetes-native agent control plane for cluster automation experiments that still need operator boundaries. Platform engineer / SRE lead High 2.9k
Control Kubernetes infrastructure through natural-language MCP workflows Lets operators inspect and act on Kubernetes resources through MCP instead of brittle one-off kubectl prompts. SRE / platform operator High 898
Run Kubernetes cluster operations through MCP Covers cluster-level MCP operations for routine checks, triage, and controlled remediation. Kubernetes operator High 1.4k
Operate Kubernetes and OpenShift clusters through MCP Extends the cluster-ops workflow to Kubernetes and OpenShift environments where enterprise platform teams need consistent runbooks. Platform engineer / OpenShift admin High 1.6k
Investigate production incidents across Kubernetes and cloud signals with HolmesGPT Pulls Kubernetes and cloud signals into an incident investigation loop instead of leaving responders to correlate dashboards manually. Incident commander / SRE Medium 2.3k
Tail multi-pod Kubernetes logs by label during incidents with Stern Gives responders a fast multi-pod log tailing step during label-scoped outages. On-call engineer Low 4.6k
Kubernetes Pod Crash Diagnostics Turns CrashLoopBackOff and pod failure evidence into a focused diagnosis packet for responders. Kubernetes support engineer Medium 121.7k
K9s Kubernetes Terminal Dashboard Provides a fast terminal cockpit for cluster inspection during normal operations and incidents. SRE / cluster operator Low 33.2k
Polaris Kubernetes Best Practices Validator Checks Kubernetes workload readiness and policy hygiene before manifest mistakes reach production. Platform reliability reviewer Low 3.4k
Lint Ansible playbooks and roles before automation breaks in prod with ansible-lint Prevents configuration automation failures by linting playbooks before they run against infrastructure. DevOps engineer Low 3.9k
ArgoCD GitOps Sync Automator Automates GitOps sync checks and actions for teams running ArgoCD-backed production deploys. Release engineer / platform operator Medium 22.5k
Helm Chart Diff & Upgrade Manager Makes Helm upgrades reviewable by surfacing chart diffs before cluster changes land. Kubernetes release engineer Medium 29.7k
Plan and apply many Helm releases from one declarative state before cluster changes drift out of sync with Helmfile Coordinates many Helm releases from a single desired state so platform drift is visible before apply. Platform engineer High 5.1k
Run declarative Kubernetes test suites against clusters before operator or manifest changes merge with KUTTL Runs declarative Kubernetes tests before operator or manifest changes become incident fuel. Platform QA / SRE Medium 804
Terraform Plan Diff Analyzer Turns Terraform plan output into an infrastructure-review artifact before risky changes merge. Infrastructure reviewer Medium 48.1k
Terraform Drift Detector Detects drift between declared Terraform state and real infrastructure before incidents or surprise costs appear. Cloud platform engineer Medium 48.1k
Orchestrate database backup, restore, retention, and failure-notification runbooks through Databasement Covers backup, restore, retention, and failure notification as an auditable runbook rather than an ad hoc database chore. Database operator / SRE Medium 315
Schedule and retain cross-database backups from one self-hosted control plane with Databasus Centralizes scheduled backups and retention across databases for smaller platform teams. Infrastructure operator Medium 6.6k
Restic Fast Encrypted Backup Program Adds encrypted file and repository backup workflows for production recovery planning. SRE / backup owner Low 32.9k
Netdata Real-Time Infrastructure Monitoring and Alerting Gives infrastructure teams immediate host and service telemetry for alerting and triage. SRE / infra operator Medium 78.4k
SigNoz Open-Source Observability Platform Provides logs, metrics, and traces for production reliability investigations without a proprietary-only observability stack. Observability engineer High 26.5k
OpenObserve Cloud-Native Observability Platform for Logs Metrics and Traces Adds a cloud-native observability backend for logs, metrics, traces, and incident search. Observability platform owner High 18.5k
Convert SMTP-only alerts into routed notification deliveries with Mailrise Routes legacy SMTP-only alerts into modern notification channels so incidents reach the right responders. On-call tooling owner Low 1.5k
Datadog Integration Connector Connects Datadog signals into agent workflows for incident context, monitor review, and dashboard triage. SRE / observability engineer Medium 791
Trace and debug agent runs with AgentOps Adds run-level tracing for agent systems that SRE teams need to debug after failures or noisy automation. SRE / agent platform operator Medium 5.6k
Trace LLM and agent workflows with OpenLLMetry Connects LLM and agent traces to OpenTelemetry-style incident investigation workflows. Observability engineer / platform SRE Medium 7.2k
Evaluate and monitor LLM workflows with Agenta Adds monitoring and evaluation coverage for production LLM workflows that need regression signals during incidents. AI platform SRE / reliability owner Medium 4.2k

Editorial Notes

  • This collection crosses Runbooks, Monitoring, CI/CD, Code Quality, Developer Tools, and Integrations because incident work rarely stays inside one category.
  • Listed Kubernetes agent picks are included only where they represent a distinct operator workflow not covered by stronger security-reviewed alternatives.
  • Avoid autonomous remediation by default; the collection is built around reviewable signals, controlled runbooks, and operator handoff.
  • Keep this persona-based for SRE and platform operations; do not turn it into a generic Monitoring or CI/CD category mirror.

Adjacent Collections


← Back to industry collections