Outgoing Atlassian CTO Sri Viswanath has said that the firm will implement a “soft delete” policy across all systems as one of several measures to avoid a repeat of the devastating outage that took several cloud services offline and took about two weeks to resolve.
According to Viswanath, the disruption was caused by a maintenance script that rapidly deleted 883 sites, representing 775 customers. Because their sites had been erased, customers were unable to file support tickets as usual, and Atlassian was unable to reach affected customers promptly.
After reviewing the incident, Atlassian said it had taken a number of immediate steps to avoid similar scenarios in the future. These include blocking the deletion of customer data and metadata that has not first gone through a soft-delete process. In addition, all new processes that involve deletion will be tested first on Atlassian’s own sites to validate the approach, and customers will then be moved through the same process progressively.
“Deletion of an entire site should be disallowed,” Viswanath wrote in a blog post, “and soft-delete should require multi-level controls to prevent error.”
“We will implement a soft delete policy to prevent external scripts or systems from deleting customer data in a production environment. Our Soft delete policy will allow for sufficient data retention so that data recovery can be performed quickly and safely, and the data will be deleted from the production environment only after a retention period has expired.”
Atlassian added that any activity that soft-deletes data must also have a validated rollback procedure.
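The policy described above can be illustrated with a minimal Python sketch. It is not Atlassian's implementation: the `SiteStore` class, its method names, and the 30-day `RETENTION` value are all hypothetical, chosen only to show the three elements the company names, namely marking data as deleted rather than erasing it, a rollback path, and permanent removal only after a retention period has expired.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional

# Assumed retention period; the article does not give a figure.
RETENTION = timedelta(days=30)


@dataclass
class Site:
    site_id: str
    deleted_at: Optional[datetime] = None  # None means the site is live


@dataclass
class SiteStore:
    sites: Dict[str, Site] = field(default_factory=dict)

    def add(self, site_id: str) -> None:
        self.sites[site_id] = Site(site_id)

    def soft_delete(self, site_id: str, now: datetime) -> None:
        # Mark the site as deleted but keep its data in place.
        self.sites[site_id].deleted_at = now

    def restore(self, site_id: str) -> None:
        # Rollback: clear the marker so the site is live again.
        self.sites[site_id].deleted_at = None

    def is_live(self, site_id: str) -> bool:
        site = self.sites.get(site_id)
        return site is not None and site.deleted_at is None

    def purge_expired(self, now: datetime) -> List[str]:
        # Permanently remove only sites whose retention period has elapsed.
        expired = [
            s.site_id
            for s in self.sites.values()
            if s.deleted_at is not None and now - s.deleted_at >= RETENTION
        ]
        for site_id in expired:
            del self.sites[site_id]
        return expired
```

In this sketch, a site that is soft-deleted by mistake can be brought back with `restore` at any point before `purge_expired` is run with a timestamp past the retention window, which is the recovery margin the policy is meant to provide.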
Atlassian also stated that it will accelerate its disaster recovery program so that restoration can be automated for multi-site, multi-product deletion events affecting a broader group of customers, and that the process will be tested and updated on a regular basis to reduce recovery time.
According to Viswanath, Atlassian will also revise its large-scale incident management process and run a simulation exercise. It will improve its backup of key customer contacts and retrofit support tooling so that customers without a valid site URL or Atlassian ID can still contact technical support directly.
Atlassian said it will invest in a unified, account-based escalation system and workflows that allow multiple objects, such as tickets and tasks, to be stored beneath a single customer account object. It will also revisit its incident communication playbook and set up an escalation management function that is consistent for customers across all geographies.
On April 5, Atlassian announced the outage on its Status Page. It took until April 18 for the company to restore service to all affected customers.