Incident summary — Mainnet 29/12/2019
The full article was originally published by Jakub Cech on Medium. Read the full article here.
This post serves as a summary of the IOTA Mainnet incident that occurred on December 29, 2019.
The incident began at 2:50 AM UTC on December 29 and was fully resolved by 2:50 AM UTC on December 30. It halted value transfer processing on the IOTA Mainnet.
The incident was caused by an edge case in transaction structuring. An unusual set of transactions, which may have been constructed as an attack, disrupted ledger state calculation. When the nodes could not calculate a consistent ledger state, they fell back to rejecting milestones as a safety mechanism. Because of this and other safeguards implemented in the node software, individual funds were never in danger during the incident.
No individual change in the node software, or any other components of the network, led to this event. It occurred due to the absence of transaction processing logic for an unusual set of transactions. The node software did not gracefully process the transactions, so any additional milestones were not accepted and the Coordinator ceased to issue further milestones.
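The safety mechanism described above can be pictured with a minimal sketch. It is not IRI code: the function name `apply_milestone`, the dictionary-based ledger, and the two consistency checks (value is neither created nor destroyed, and no balance goes negative) are assumptions chosen for illustration only.

```python
from typing import Dict, Optional


def apply_milestone(balances: Dict[str, int], diff: Dict[str, int]) -> Optional[Dict[str, int]]:
    """Return the updated balances, or None if the milestone must be rejected.

    Hypothetical consistency rules (assumptions, not taken from IRI):
    the balance changes must sum to zero and no address may go negative.
    """
    if sum(diff.values()) != 0:
        return None  # value would be created or destroyed: reject
    updated = dict(balances)  # work on a copy so a rejection changes nothing
    for address, delta in diff.items():
        updated[address] = updated.get(address, 0) + delta
        if updated[address] < 0:
            return None  # overspend: reject
    return updated


ledger = {"A": 10, "B": 10}
accepted = apply_milestone(ledger, {"C": 20, "A": -10, "B": -10})
rejected = apply_milestone(ledger, {"C": 10, "A": -10, "B": -10})
```

In this sketch a rejected milestone leaves the existing ledger untouched, which mirrors why halting confirmations kept funds safe even though the network stopped processing value transfers.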
From 2:50 AM UTC on December 29, 2019, until 2:50 AM UTC on December 30, users were unable to confirm transactions on the IOTA Mainnet network. Operators of the IOTA Reference Implementation (IRI) node experienced an issue whereby their node was unable to process the last milestone issued by the network Coordinator. The beta-phase Hornet node implementation was not affected by this issue.
December 29, 2019
- 02:50 UTC — The Coordinator issues milestone #1293082. The milestone is not accepted by Mainnet nodes. The Coordinator ceases issuing further milestones.
- 02:55 UTC — A pager alert for Mainnet confirmation, tied to our infrastructure and a PagerDuty service, fails to fire and notify our team. The failure is under investigation.
- 07:35 UTC — DevOps team begins investigating the issue of the Coordinator not issuing milestones.
- 08:30 UTC — The IRI team joins the investigation.
- 08:50 UTC — The DevOps and IRI teams investigate the state of the nodes and the machine logs to assess the situation.
- 09:00 UTC — The SecOps team joins the investigation.
- 09:15 UTC — Two possible causes of the incident are identified: a tip selection algorithm error or a ledger state calculation error.
- 09:30 UTC — We begin debugging and building tools for extracting further log data to assist in identifying the root cause.
- 10:30 UTC — The Hornet team joins the investigation. At this point, we believe that comparing how Hornet and IRI nodes process transactions in the last issued milestones will allow us to identify the root cause of the issue.
- 16:30 UTC — The Hornet and IRI teams identify the transaction bundle causing the incident.
- 17:20 UTC — We identify the root cause of the incident.
- 18:08 UTC — We settle on a remediation that involves patching the IOTA Reference Implementation (IRI) to fix the issue.
- 18:55 UTC — A first implementation of the fix is tested by the IRI team. The first iteration does not allow the nodes to fully recover using their milestone repair mechanism. Further changes and testing are needed.
- 23:52 UTC — After initial testing, we start reviewing the branch containing the fix.
December 30, 2019
- 00:39 UTC — We create pull request 1699 containing a fix for the IRI ledger service implementation.
- 01:22 UTC — The pull request is approved.
- 01:25 UTC — We start the IRI release process, including the DockerHub release.
- 01:28 UTC — We start upgrading internal IRI nodes.
- 01:58 UTC — A new version of IRI that fixes the issue is released.
- 02:32 UTC — The Coordinator service resumes.
- 02:33 UTC — The Coordinator successfully issues milestone #1293082.
- 02:43 UTC — We begin spamming internal nodes to test stability.
- 02:45 UTC — Grafana dashboards report a healthy confirmation rate and stable milestone issuance rate.
- 02:45 UTC — At this point, a large portion of external nodes have already upgraded using an automated upgrade service and have resumed operation.
- 02:51 UTC — We stop spamming internal nodes.
- 02:55 UTC — We announce the situation resolved. Mainnet has resumed operation.
- 02:55 UTC — We announce a new version of IRI on Twitter and Discord.
The IOTA Reference Implementation (IRI) did not handle an edge case where transactions are shared between multiple distinct bundles. Once IRI marked a transaction as “already accounted for” in one bundle, it was ignored in the next bundle. This led to a corrupt ledger state from which the node was unable to recover.
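The failure mode can be illustrated with a deliberately simplified toy model. This is not IRI code and not the fix from the pull request: the `Tx` type, the `ledger_diff` function, and the flag `skip_seen_across_bundles` are hypothetical names, and real IOTA bundles carry far more structure. The sketch only shows how a "seen" marker that outlives a single bundle causes a shared transaction to be dropped from the second bundle, so the resulting balance changes no longer sum to zero.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Tx(NamedTuple):
    tx_hash: str
    address: str
    value: int  # positive = deposit, negative = spend


def ledger_diff(bundles: List[List[Tx]], skip_seen_across_bundles: bool) -> Dict[str, int]:
    """Sum balance changes over all bundles.

    With skip_seen_across_bundles=True, a transaction counted in one bundle
    is ignored in every later bundle (the behaviour described above).
    With False, the marker is reset for each bundle.
    """
    diff: Dict[str, int] = defaultdict(int)
    seen = set()
    for bundle in bundles:
        if not skip_seen_across_bundles:
            seen = set()  # forget marks left by earlier bundles
        for tx in bundle:
            if tx.tx_hash in seen:
                continue  # "already accounted for"
            seen.add(tx.tx_hash)
            diff[tx.address] += tx.value
    return dict(diff)


# Two distinct bundles that share transaction T1.
shared = Tx("T1", "X", 10)
bundle_a = [shared, Tx("T2", "Y", -10)]
bundle_b = [shared, Tx("T3", "Z", -10)]

buggy = ledger_diff([bundle_a, bundle_b], skip_seen_across_bundles=True)
per_bundle = ledger_diff([bundle_a, bundle_b], skip_seen_across_bundles=False)

print(buggy, sum(buggy.values()))            # changes do not balance
print(per_bundle, sum(per_bundle.values()))  # changes balance
```

In the toy model the skipped deposit leaves 10 units spent but never credited, which is the kind of inconsistent ledger state a node cannot safely accept. Resetting the marker per bundle is only one way to make the toy balance; it is shown for contrast and should not be read as the change made in IRI.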
- We need to ensure that the failure of our PagerDuty alerts does not recur.
- It is highly beneficial to involve teams from different node implementations as soon as possible. This speeds up debugging issues across the implementations.
- Our incident management process proved to be successful in identifying and fixing the root cause of the situation.
Alternative variants of this edge case have been considered and accommodated.
There are no known previous incidents with the same root cause. An incident with similar symptoms occurred on Devnet on August 26, but the two incidents did not share a root cause.
This incident has helped us to refine our response protocols. We would like to thank our community for their support throughout the difficult day. A huge thank you also goes to our engineering team, and the Hornet team, who worked late into the night until the issue was fully resolved on a Sunday during the Holidays and on their time off. Thanks to their effort, we were able to resolve the incident in a reasonable timeframe.