Pros
This role gave me hands-on experience coordinating operational readiness for production systems across multiple regions. I regularly worked with SRE, security, and platform teams to prepare incident response plans, validate monitoring coverage, and review on-call escalations. The position allowed me to identify recurring failure patterns and propose changes to deployment checklists and change management workflows. I also saw firsthand how systems behave under load and during outages, which helped me build a practical understanding of infrastructure reliability.
Cons
A major challenge was the reactive nature of the work. Many issues were escalated only after customer impact had already occurred, leaving limited time for proper analysis or prevention. Documentation was often outdated, and ownership of legacy systems was unclear, which slowed down resolution. I was frequently pulled into parallel incidents, which made it difficult to follow through on post-incident action items and reduced the long-term effectiveness of improvements.