Perform root cause analysis for production errors.
Design procedures for system troubleshooting and maintenance.
Requirements:
Experience with container technologies (such as Docker, Kubernetes, Swarm).
Design and implement monitoring and alerting strategies to support SLAs.
Build a continuous deployment environment.
Implement monitoring and alerting tools, develop and review KPIs, and identify issues, errors, inconsistencies, and anomalies to ensure system health. Work with lead engineers to plan and scale services as necessary.