WHAT HAPPENED?
At 3:36pm ET on 1/23/2019, we made a change to Greenhouse Recruiting that included an update to our primary database. The database update degraded performance, and as a result we began serving errors at an increased rate.
Around 4:40pm ET, we applied a fix for the performance issues and by 4:44pm ET error rates for Greenhouse Recruiting had returned to normal.
WHAT WAS THE EFFECT?
For the duration of the incident, a subset of users received errors when trying to access Greenhouse Recruiting.
A subset of Harvest API write requests and Partner API requests made during the incident also returned errors at an increased rate.
Outgoing webhooks in Greenhouse Recruiting may have been delayed during this period.
WHO WAS AFFECTED?
Most customers who access Greenhouse Recruiting through https://app.greenhouse.io either encountered increased errors or were unable to access the application at all.
Customers who access Greenhouse Recruiting through https://app2.greenhouse.io were unaffected by this incident.
About 50% of customers who access Greenhouse Recruiting through a company subdomain (e.g. https://mycompany.greenhouse.io) either encountered increased errors or were unable to access the application at all.
Candidates applying through Greenhouse Job Boards or the Job Board API were unaffected. Greenhouse Onboarding and Harvest API read requests were also unaffected.
WHAT WAS THE CAUSE?
The incident was caused by a change we made to a table in our database. Databases have multiple ways of fetching the same result and use a query planner to pick the most efficient one. After making the change, the query planner began issuing inefficient query plans for any queries against the modified table. The modified table is frequently queried by our application, so the bad query plans resulted in query timeouts and increased database utilization.
After we isolated the root cause, we updated the statistics for that table, and the query planner began producing more efficient query plans. Soon after, site performance returned to normal.
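As an illustration only, the sketch below shows what inspecting a query plan and refreshing table statistics can look like. It uses SQLite through Python's standard library as a stand-in, because the database engine involved is not named here. The `applications` table, its columns, the index, and all helper names are hypothetical and are not taken from the Greenhouse schema or tooling.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the planner's chosen steps for a query as a list of strings."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable description is the last column of each row.
    return [row[-1] for row in rows]


def has_statistics(conn, table):
    """True if SQLite has stored planner statistics for the given table."""
    stat_table = conn.execute(
        "SELECT 1 FROM sqlite_master "
        "WHERE type = 'table' AND name = 'sqlite_stat1'"
    ).fetchone()
    if stat_table is None:
        # ANALYZE has never been run, so no statistics table exists yet.
        return False
    count = conn.execute(
        "SELECT COUNT(*) FROM sqlite_stat1 WHERE tbl = ?", (table,)
    ).fetchone()[0]
    return count > 0


def refresh_statistics(conn, table):
    """Recompute planner statistics for one table (SQLite's ANALYZE)."""
    # Table names cannot be bound as parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    conn.execute(f"ANALYZE {table}")


def build_example_db():
    """Create an in-memory database with a hypothetical, frequently queried table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE applications ("
        "id INTEGER PRIMARY KEY, status TEXT, job_id INTEGER)"
    )
    conn.execute("CREATE INDEX idx_applications_job ON applications (job_id)")
    conn.executemany(
        "INSERT INTO applications (status, job_id) VALUES (?, ?)",
        [("hired" if i % 10 == 0 else "active", i % 50) for i in range(1000)],
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = build_example_db()
    sql = "SELECT * FROM applications WHERE job_id = ?"
    print("statistics present:", has_statistics(conn, "applications"))
    print("plan before:", query_plan(conn, sql, (7,)))
    refresh_statistics(conn, "applications")
    print("statistics present:", has_statistics(conn, "applications"))
    print("plan after:", query_plan(conn, sql, (7,)))
```

In this small example the plan may well be identical before and after the refresh. The sketch only demonstrates the mechanics of checking a plan and updating statistics, not the degradation itself. Other database engines expose the same idea through their own commands and plan output.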
WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?
We will review our database change management process to ensure that similar incidents cannot recur.
We take the reliability of our software very seriously, and are committed to making changes to prevent similar issues from occurring again. Please accept our apologies for any inconvenience caused. If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new