Platform Issues
Incident Report for Crazy Ant Labs
Postmortem

The issues on AWS affected our product as well as Heroku and even Intercom, our main communication channel with our customers.

In Cron To Go, new issues arose after the AWS issues were fixed: Cron To Go was executing jobs but still received errors from Heroku APIs, so it retried execution. The retries grew the job queue beyond the workers' capacity and caused jobs to run off schedule. We purged the queue, and job execution is back to normal.
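To illustrate the failure mode described above, here is a minimal simulation sketch. It is not Cron To Go's actual implementation: the function name, its parameters, and the idea of capping retries with `max_attempts` are all assumptions made for illustration. It shows how a queue can outgrow worker capacity when every failed job is retried without limit, and how a retry cap keeps the backlog bounded.

```python
from collections import deque


def simulate_queue(ticks, arrivals_per_tick, worker_capacity,
                   api_failing_until, max_attempts=None):
    """Return the queue depth after each tick (hypothetical model).

    Each tick, `arrivals_per_tick` new jobs are enqueued and workers
    process up to `worker_capacity` jobs. While tick < api_failing_until,
    every processed job fails and is re-enqueued, unless it has already
    been attempted `max_attempts` times (None means retry forever).
    """
    queue = deque()  # each entry is the number of attempts made so far
    depths = []
    for tick in range(ticks):
        for _ in range(arrivals_per_tick):
            queue.append(0)
        for _ in range(min(worker_capacity, len(queue))):
            attempts = queue.popleft() + 1
            failed = tick < api_failing_until
            if failed and (max_attempts is None or attempts < max_attempts):
                queue.append(attempts)  # retry later
        depths.append(len(queue))
    return depths


# Assumed numbers: 3 new jobs per tick, 10 worker slots, API failing for all 10 ticks.
unbounded = simulate_queue(10, 3, 10, api_failing_until=10)
capped = simulate_queue(10, 3, 10, api_failing_until=10, max_attempts=3)
print("unbounded retries:", unbounded)
print("capped retries:   ", capped)
```

With unbounded retries the backlog in this model keeps growing for as long as the upstream API fails, so jobs keep running late even after the API recovers. With a cap, the backlog levels off below worker capacity.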

Note that jobs that were meant to run during the incident may not have executed at all, so you may want to run them manually.

Posted Dec 08, 2021 - 08:37 UTC

Resolved
This incident has been resolved.
Posted Dec 08, 2021 - 02:06 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 07, 2021 - 23:33 UTC
Update
We are continuing to monitor for any further issues.
Posted Dec 07, 2021 - 23:21 UTC
Update
The AWS team is still working on getting their services fully recovered. We use API Gateway, DynamoDB and EventBridge, which are still experiencing impact. We will keep monitoring these services' recovery rates and update accordingly.
Posted Dec 07, 2021 - 23:12 UTC
Monitoring
AWS have mitigated the underlying issue that caused some network devices in the US-EAST-1 Region to be impaired. All services are now independently working through service-by-service recovery. We will keep monitoring the recovery rates and keep you posted.
Posted Dec 07, 2021 - 22:54 UTC
Update
We are continuing to work on a fix for this issue.
Posted Dec 07, 2021 - 22:38 UTC
Update
AWS have identified the root cause of this issue to be an impairment of several network devices. They have executed a mitigation which is showing significant recovery in the US-EAST-1 Region. We still do not have an ETA for full recovery at this time.
Posted Dec 07, 2021 - 22:28 UTC
Identified
Our engineers have detected that Heroku and our upstream provider (AWS) are experiencing elevated error rates in the US region. This may impact opening our add-on dashboards, communicating with the API, webhooks, authentication and more. We continue to work closely with them on a full path to mitigation.
Posted Dec 07, 2021 - 20:45 UTC
Investigating
We are investigating availability issues for a significant portion of our platform due to an issue with our upstream provider (AWS) in the us-east-1 region.
Posted Dec 07, 2021 - 17:10 UTC
This incident affected: SFTP To Go (Web Interface, API Requests, Webhooks, Heroku), Cron To Go (Web Interface, API Requests, Webhooks, Heroku), Activity To Go (Web Interface, Notifications, API Requests, Webhooks, Amazon S3, Heroku), and Mailer To Go (Web Interface, API Requests, Heroku).