Increased response times and error rates
Incident Report for Greenhouse
Postmortem

WHAT HAPPENED?

A Greenhouse Recruiting update released on September 9, 2020 at 16:45 UTC led to 19 minutes of increased error rates for Greenhouse Recruiting customers, from 17:02 to 17:15 UTC and from 18:00 to 18:06 UTC. The incident was fully resolved on September 9, 2020 at 23:51 UTC.

TIMELINE

September 9, 2020

17:02 UTC: Increased errors started to occur for Greenhouse Recruiting

17:15 UTC: Greenhouse Recruiting performance was restored

17:38 UTC: Greenhouse Predicts was temporarily disabled to improve stability

18:00 UTC: Increased errors began again

18:06 UTC: Greenhouse Recruiting performance was restored

18:48 UTC: Interview Stats was temporarily disabled to improve stability

23:01 UTC: Interview Stats was restored

23:51 UTC: Incident was fully resolved

September 11, 2020

16:12 UTC: Greenhouse Predicts was restored

WHAT WAS THE EFFECT?

Some requests to the Greenhouse Recruiting application failed or took a long time to complete. A portion of customers experienced increased error rates for up to 19 non-contiguous minutes, from 17:02 to 17:15 UTC and from 18:00 to 18:06 UTC.

WHO WAS AFFECTED?

Greenhouse Recruiting customers.

WHAT WAS THE CAUSE?

Beginning at 16:45 UTC, the release of simplified interviewing permissions increased database load, which in turn slowed requests beyond what our capacity could handle. This was exacerbated by timeouts connecting to our internal Greenhouse Predicts service. Because of the high load, servers were unable to respond to health checks and were automatically restarted, which lowered overall capacity.

Beta tests were conducted for this roll-out, but they did not reveal any performance issues.
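
As an illustration of why timeouts to an internal service can hold up unrelated requests, the following is a minimal sketch of bounding the time a request spends waiting on an optional dependency. It is not Greenhouse's implementation: `call_with_deadline`, `slow_prediction`, `fast_prediction`, and the timeout values are all hypothetical names and numbers chosen for the example.

```python
import concurrent.futures
import time

# Shared worker pool for calls to the optional dependency (size is an assumption).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_deadline(fn, timeout_s, fallback):
    """Run fn(), waiting at most timeout_s seconds.

    Returns fallback if the call times out or raises, so the caller
    can continue without the optional result.
    """
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        future.cancel()  # best effort: a call that is already running is not interrupted
        return fallback
    except Exception:
        return fallback


def slow_prediction():
    # Hypothetical stand-in for a dependency that is responding slowly.
    time.sleep(0.5)
    return {"score": 0.9}


def fast_prediction():
    # Hypothetical stand-in for a healthy dependency.
    return {"score": 0.9}


if __name__ == "__main__":
    print(call_with_deadline(slow_prediction, 0.05, None))  # falls back after 50 ms
    print(call_with_deadline(fast_prediction, 1.0, None))
```

The wrapper only limits how long the caller waits. The worker thread keeps running until the underlying call returns, so in practice a timeout set on the client making the call is also needed.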

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

  • We are adjusting the way we check the health of our servers to improve site performance when we reach the high end of our capacity. We will increase the duration of health check timeouts and the maximum number of allowed repeated failures, to avoid restarting servers that are busy fulfilling requests.
  • We will be auditing the duration of timeouts for requests to internal and external services, such as Greenhouse Predicts, to keep them from impacting Greenhouse Recruiting performance.
  • We improved the performance of database queries related to simplified interviewing permissions as part of resolving this incident. We will be following up to improve the performance of these queries further.
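
The effect of raising the allowed number of repeated health check failures can be sketched as follows. This is a simplified model under stated assumptions, not the actual health check configuration: `HealthTracker`, `restarts_triggered`, the thresholds, and the sample results are hypothetical, and "repeated" is interpreted here as consecutive failures.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks consecutive failed health checks for one server (hypothetical model)."""

    max_consecutive_failures: int  # failures in a row that trigger a restart
    consecutive_failures: int = 0

    def record(self, check_passed: bool) -> bool:
        """Record one health check result; return True if a restart is due."""
        if check_passed:
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.max_consecutive_failures


def restarts_triggered(results, max_consecutive_failures):
    """Count the restarts a sequence of health check results would cause."""
    tracker = HealthTracker(max_consecutive_failures)
    restarts = 0
    for passed in results:
        if tracker.record(passed):
            restarts += 1
            tracker.consecutive_failures = 0  # the restarted server starts clean
    return restarts


if __name__ == "__main__":
    # A busy server that misses some checks but recovers on its own.
    busy = [True, False, False, True, False, False, False, True]
    print(restarts_triggered(busy, 2))  # strict threshold
    print(restarts_triggered(busy, 4))  # more tolerant threshold
```

In this model, the same sequence of missed checks restarts the server twice with a threshold of 2 and not at all with a threshold of 4. The trade-off is that a server that has genuinely failed takes longer to be replaced.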

We apologize for the inconvenience this incident has caused. We take the reliability of our application seriously and are actively working to prevent similar incidents from occurring in the future. If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new.

Posted Sep 18, 2020 - 18:02 UTC

Resolved
This incident has been resolved.
Posted Sep 09, 2020 - 23:51 UTC
Update
We've released another change to improve overall site performance.
The counts of "Last 7" and "Next 7" interviews in the scheduling flow are back to normal.
Posted Sep 09, 2020 - 23:14 UTC
Monitoring
We're releasing a fix to improve overall site performance.
As part of this fix, the counts of "Last 7" and "Next 7" interviews in the scheduling flow will temporarily display 0 for all users.
Posted Sep 09, 2020 - 21:02 UTC
Identified
We've identified an issue causing degraded performance and are releasing a fix to improve performance.
Posted Sep 09, 2020 - 19:00 UTC
Monitoring
We've identified an issue with Greenhouse Predicts that caused a spike in errors and response times for all requests. We've temporarily disabled Greenhouse Predicts and are continuing to monitor overall response times and error rates.
Posted Sep 09, 2020 - 17:56 UTC
Investigating
We're currently investigating long response times and increased error rates.
Posted Sep 09, 2020 - 17:21 UTC
This incident affected: Greenhouse Recruiting (Silo 1).