SRE: What I think happened to the Robinhood trading app when it went down

I was waiting for Robinhood to post a postmortem, but they haven’t, and they have never been transparent enough to publish one. In their sector a public postmortem can cost you customers, so posting one would be shooting themselves in the foot. And the post by the founders is the most bizarre thing I’ve ever read as an SRE/System Administrator.

There are a few assumptions that I want to get out of the way.

1] It wasn’t a leap year error.

Don’t believe anything you read on the internet. And if you do, try to get to the bottom of it before sharing it blindly.

2] They did not get hacked.

There are a few websites that report when SSNs and bank accounts from a breach show up on the dark web. I did not see anything there or in the news, so I’m assuming they were not hacked. This kind of stuff spreads fast over Tor.

3] They weren’t DDoS’d either. Let me explain.

A DDoS (distributed denial of service) attack is targeted. Unless some guy or a group of hackers really knew that stocks were going to shoot up on that very day, they would not have done it. I earlier thought it could have been a DDoS, but they host most of their infra on AWS and I’m sure AWS can help out in this case. Also, a DDoS against a trading platform would be much more complex than one against an ordinary website or application. Whenever there is a DDoS attack, the first priority is to make sure the system does not go down. This is only possible in two ways: scrub the illegitimate traffic, or scale to the point where your infra can take the incoming hit. Neither of these was possible in the case of Robinhood, or at least not needed.

AWS has its own DDoS protection in the form of Route 53 DNS and AWS Shield. Not only that, they have numerous load balancers sitting in front of all their servers, so most attacks are stopped way before they can hit the actual servers. A company hosting its own servers would use on-call or on-demand DDoS protection services like Prolexic or Cloudflare. What happens is that all traffic is routed through the provider’s servers, where bad requests are filtered out and only good traffic is allowed to hit the origin servers. Also, if and when there is a need to scale, they start redirecting traffic and load sharing (not balancing, because that is a terrible idea): https://divyendra.com/blog/load-balancing-vs-load-sharing/1534/ .
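To make the “filter the bad requests, let the good traffic through” idea a bit more concrete, here is a minimal sketch of one common building block: a per-client token bucket. This is only an illustration, not how Prolexic, Cloudflare, or AWS Shield work internally. The class name `TokenBucketFilter` and the rate and capacity numbers are made up for the example.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucketFilter:
    """Per-client token bucket (hypothetical example).

    Each client may burst up to `capacity` requests and is refilled
    at `rate` tokens per second.
    """

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        # client id -> (tokens left, timestamp of last update)
        self._state: Dict[str, Tuple[float, float]] = {}

    def allow(self, client: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        tokens, last = self._state.get(client, (self.capacity, now))
        # Refill for the elapsed time, never above capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[client] = (tokens - 1.0, now)
            return True  # request passes through to the origin
        self._state[client] = (tokens, now)
        return False  # request is dropped at the edge
```

A client that hammers the endpoint runs out of tokens and gets dropped, while a different client is unaffected. Real scrubbing services combine many more signals than a per-client rate, so treat this as the simplest possible version of the idea.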

There are also multiple ways to blackhole and sinkhole traffic in the event of a DDoS. The bad side of this is that you don’t really know whether good traffic will end up in them too. Imagine a Robinhood user placing a put option and his request going into a sinkhole. The number of lawsuits RH would face would be enormous and could shut the shop down the next day.
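Here is a tiny sketch of why blackholing is a blunt instrument. It assumes the blackhole is applied to a whole network prefix, which is one common way of doing it. The prefix and addresses below are hypothetical, taken from the ranges reserved for documentation.

```python
import ipaddress

# Hypothetical blackholed prefix (a documentation-only address range).
BLACKHOLED = [ipaddress.ip_network("203.0.113.0/24")]


def is_dropped(src_ip: str) -> bool:
    """Return True if traffic from src_ip falls into a blackholed prefix."""
    addr = ipaddress.ip_address(src_ip)
    return any(addr in net for net in BLACKHOLED)


attacker = "203.0.113.7"      # the source we actually wanted to drop
legit_user = "203.0.113.42"   # a real customer who shares the prefix
other_user = "198.51.100.10"  # a customer on an unrelated network

for ip in (attacker, legit_user, other_user):
    print(ip, "dropped" if is_dropped(ip) else "allowed")
```

The legitimate user in the same prefix gets dropped along with the attacker, and nothing in the rule can tell them apart. For a broker, that dropped packet could be somebody's order.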

GitHub faced a 1.35 Tbps DDoS attack in 2018, carried out by abusing exposed memcached servers for amplification, and Prolexic came out as their hero that day. It is much easier to do this than you think. Again, nobody noticed because nobody LOOKED. So no, RH did not get DDoS’d.

4] So what exactly must have happened?

There is a saying in engineering: “Engineering is never simple. If it seems simple, you are probably doing it wrong.” There is also the famous quote from John Gall: “A complex system that works is invariably found to have evolved from a simple system that worked. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.”

RH seems to have faced a domino effect, something they did not see coming. Honestly, even I would not have seen this coming if my company got hit like this overnight.

Reasons: the sheer number of orders placed pre-market on that very day + the sheer number of limit orders placed by users + the number of users that tried to log in that morning (this got worse because users started sharing that the platform was down, and even more users tried to log in) + trading systems are complex. I know this because a few of my friends work at trading companies and hedge funds: the last thing they care about is security, and the first things they care about are speed, reliability, and efficiency. The transactions and orders need to be executed in micro/nanoseconds, or as fast as possible.

They do this by talking to other brokerage and trading firms. Most firms use GPUs for processing, with a lot of FPGA programming and yada yada. I don’t know this part in much detail, so I have 0 clue there. The number of transactions going on that morning was just too much for the servers to handle, and they kept going down. Their load balancers wouldn’t have been able to handle it + their NICs would have just borked + they use and have built a lot of internal tools. The problem with internal tooling is that when shit hits the fan, only the engineers who built it can fix it. And when everything is hitting the fan, the last thing you want to be fixing is your internal tools. (Keep in mind there are multiple levels of virtualization even on AWS: AWS hardware > AWS hypervisor > EC2 > Docker/Kubernetes or whatever people use. I am not really sure how they have it set up, so I can’t give more context.)
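The login pile-on from the reasons above (users retrying because the platform is down) can be sketched with a toy model. It assumes every request that does not get served is retried on the next tick, which is a simplification, and the numbers are invented. It is not based on any real Robinhood traffic data.

```python
from typing import List


def simulate_retry_storm(new_per_tick: int, capacity: int, ticks: int) -> List[int]:
    """Return the offered load (new requests plus retries) at each tick.

    Toy assumption: every request that is not served is retried
    on the next tick, on top of the usual new requests.
    """
    offered = []
    backlog = 0  # requests that failed last tick and are retrying now
    for _ in range(ticks):
        load = new_per_tick + backlog
        offered.append(load)
        # Whatever exceeds capacity fails and comes back next tick.
        backlog = max(0, load - capacity)
    return offered


# 20% over capacity: the offered load keeps climbing tick after tick.
print(simulate_retry_storm(new_per_tick=120, capacity=100, ticks=5))
# Under capacity: the load stays flat.
print(simulate_retry_storm(new_per_tick=80, capacity=100, ticks=5))
```

Even a modest overload never drains in this model, because the retries stack on top of the new arrivals. That is the domino effect in miniature.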

And the first question that is usually thrown is, why don’t you scale?!!!

You see, you can only scale a system that is up. If the system is not really up even for a few minutes, it cannot, in theory, “scale”. Hence you need your scaling logic to sit somewhere separate from your actual infrastructure. If your entire infrastructure is thrashing under the amount of incoming traffic and the zillion transactions that are happening, you can forget about scaling such a system. Fundamentally, you cannot scale a system that is not really working. Why not allow a few transactions to go through? Why not queue? Why not back off?
That is definitely not how high-frequency trading works, folks.
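On the point about keeping the scaling logic away from the infrastructure it scales, here is a minimal sketch of a decision function that could run on a separate box. Everything in it is an assumption for illustration: the names `FleetSample` and `desired_instances`, the 60% CPU target, and the “double what is left when there is no signal” rule. I have no idea what RH actually runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FleetSample:
    healthy_instances: int
    # Average CPU as an integer percentage; None when the fleet is too
    # unhealthy to report any metrics at all.
    avg_cpu_pct: Optional[int]


def desired_instances(sample: FleetSample, min_size: int = 2,
                      max_size: int = 50, target_cpu_pct: int = 60) -> int:
    """Decide the fleet size from outside the fleet (hypothetical policy)."""
    if sample.avg_cpu_pct is None or sample.healthy_instances == 0:
        # No usable signal from the fleet. Because this code runs outside
        # the fleet, it can still decide to add capacity: double what is left.
        wanted = sample.healthy_instances * 2
    else:
        # Size the fleet so average CPU lands near the target (ceiling division).
        load = sample.healthy_instances * sample.avg_cpu_pct
        wanted = -(-load // target_cpu_pct)
    return max(min_size, min(max_size, wanted))


print(desired_instances(FleetSample(healthy_instances=10, avg_cpu_pct=90)))
print(desired_instances(FleetSample(healthy_instances=10, avg_cpu_pct=None)))
```

The important bit is the first branch: a scaler that lives on the same hosts that are thrashing never gets to run it. Even then, new capacity takes time to boot, which is why an external scaler helps but does not save you mid-meltdown.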

https://www.investopedia.com/terms/h/high-frequency-trading.asp

It’s like you and your neighbor ordering something from Amazon, and you having to wait even though you ordered the same exact thing for the same exact price, on the same exact day, at the same exact time. Nobody would like that.
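The waiting problem in that analogy can be shown with a few lines. This is a toy single-worker FIFO queue with a fixed, made-up service time per order. Real order routing is far more involved, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def fifo_completion_times(arrivals: List[Tuple[str, float]],
                          service_time: float) -> Dict[str, float]:
    """Completion time per order for one worker serving in arrival order.

    `arrivals` is a list of (order_id, arrival_time) sorted by arrival time.
    """
    done = {}
    free_at = 0.0  # time at which the worker becomes free
    for order_id, arrived in arrivals:
        start = max(arrived, free_at)  # wait if the worker is still busy
        free_at = start + service_time
        done[order_id] = free_at
    return done


# Two identical orders arriving at the same instant.
times = fifo_completion_times([("you", 0.0), ("neighbor", 0.0)], service_time=0.5)
print(times)
```

Both orders arrived at the same moment, but the second one finishes a full service time later. In a market that moves while you wait, that gap is a different price.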

What were the DevOps and SRE teams doing?

RH’s core infra was built by two DevOps engineers: https://aws.amazon.com/solutions/case-studies/robinhood/. I don’t know much about what they were doing, but they were trying their best to handle the situation and we probably shouldn’t be making jokes about them. I’m sure they did the best they could.

Why not hire more DevOps engineers? Because that is not how DevOps works. If you hear someone at a product company of 1000 people say “we have 30 DevOps engineers”, that person is completely out of his mind and definitely does not know what he is doing. You need 3-4 experienced people, max, who know the ins and outs of the system, and if they are good enough, that is all that is needed. Throwing engineers at a problem never solves the problem. Compass and GoDaddy have “teams” of DevOps engineers, and we all know how they are performing.

Coming back: Robinhood dropped the ball on the worst possible day in the history of the markets to drop the ball. This should never have happened, and it happened. They recovered only “after-hours”, which is kinda self-explanatory, and they must have gotten ready to take the hit, only to go down again the next day. Honestly, I saw that coming.

They do have a bug bounty program, which I think is cool, and they don’t cheap out on stuff: https://robinhood.engineering/bug-bounty-790a2f1a3223 https://robinhood.engineering/

So is all this 100% true? No. I tried to gather as much info as I could from as many people as I could and make an educated guess.

No hacker group came forward to claim a DDoS or database hack.
No one at RH is whistleblowing about a leap year error.
The message from the founders.
0 talk on the dark web about something bad going on.

The reason I think the notice given by the founders is bs is that they said “a failure at DNS level”. Which DNS level? No clue. Yes, their infra failed, but they certainly seem to be hiding a lot rather than coming forward, and that is probably for a good reason.
They are not in a position to come clean and expect the dust to settle, assuming they have an IPO coming soon.

Thank you for coming to my TED talk.