I figure every SRE / Systems / DevOps / Infrastructure Engineer brings down a production system at some point or other. So did I, and on a Monday morning at 11:00 am, no less.
I think this might actually be the biggest highlight of my career so far. I’ve solved countless problems in scalability, infrastructure design and security, but this incident makes me happier than any other.
I am not going to write this as a postmortem, because that’s what everyone does. I’d rather write it as a story that might be helpful to someone who stumbles on it while browsing and wants a little context, even if they don’t have a detailed technical background.
I picked up a ticket on my board to move secrets from the codebase to Vault. Vault is a tool that keeps secrets secret. With no secrets left in the codebase, the platform is a bit more secure.
Everything went pretty well: I figured out how Vault was interacting with the deployed containers, tested my changes on dev and stg on a Friday, and rolled them out to prd. The hardest part of this ordeal was learning Node.js syntax, since the service I was editing was the monolith frontend of our platform.
Everything seemed to be going okay until someone asked in a platform problems Slack channel whether we were aware that the platform was down. I knew this was me.
You see, you really don’t want to be the guy who is 6 months into the job and brings down production. The issue was that I had forgotten to add the secrets to the prd environment in Vault. I added them, redeployed the containers, and everything started working.
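In hindsight, a small pre-deploy check could have caught this before anyone noticed. Here is a minimal sketch of the idea in Python, not what we actually run: it compares a list of secret names the service needs against what each environment has. The secret names, the `secrets_by_env` dictionary and both function names are made up for illustration; in real life the per-environment values would be read from Vault rather than hard-coded.

```python
def find_missing_secrets(required, available):
    """Return the required secret names that are absent or empty in `available`."""
    return sorted(name for name in required if not available.get(name))


def check_environments(required, secrets_by_env):
    """Map each environment to its missing secrets, skipping complete ones."""
    report = {}
    for env, secrets in secrets_by_env.items():
        missing = find_missing_secrets(required, secrets)
        if missing:
            report[env] = missing
    return report


# Hypothetical secret names the service expects at startup.
REQUIRED_SECRETS = ["DB_PASSWORD", "SESSION_KEY"]

# Hypothetical snapshot of what each environment holds.
# In practice this would be fetched from Vault, not written by hand.
secrets_by_env = {
    "dev": {"DB_PASSWORD": "dev-pass", "SESSION_KEY": "dev-key"},
    "stg": {"DB_PASSWORD": "stg-pass", "SESSION_KEY": "stg-key"},
    "prd": {},  # the mistake from this story: nothing was added for prd
}

if __name__ == "__main__":
    report = check_environments(REQUIRED_SECRETS, secrets_by_env)
    for env, missing in report.items():
        print(f"{env}: missing {', '.join(missing)}")
    # Exit non-zero so a deploy pipeline would stop here instead of shipping.
    raise SystemExit(1 if report else 0)
```

Run as a step before the rollout, a check like this fails loudly when prd is missing something that dev and stg already have.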
A lead infra engineer once told me that the biggest learning curves engineers ever have come during outages. Well said, sir, well said.
An outage will teach you things about your system that 6 months on the job won’t.