Increased Error Rates
Incident Report for Greenhouse
Postmortem

WHAT HAPPENED?

Beginning at 11:06am ET on 5/1/2019, Greenhouse Recruiting began returning an elevated rate of errors to some users.

At 11:13am, we applied a partial fix and error rates began to drop, and by 11:16am we had fixed the underlying cause. Error rates then returned to normal for all users.

WHAT WAS THE EFFECT?

For the duration of this incident, users intermittently received error pages while accessing Greenhouse Recruiting.

WHO WAS AFFECTED?

Customers who access Greenhouse Recruiting through https://app.greenhouse.io encountered slower-than-usual page load times and increased errors.

Customers who access Greenhouse Recruiting through https://app2.greenhouse.io were unaffected by this incident.

About 50% of customers who access Greenhouse Recruiting through a company subdomain (e.g. https://mycompany.greenhouse.io) either encountered increased errors or were unable to access the application at all.

Customers making write requests through the Harvest API encountered increased errors.

Candidates applying through Greenhouse Job Boards or the Job Board API were unaffected. Greenhouse Onboarding, the Onboarding API, and Harvest API read requests were also unaffected.

WHAT WAS THE CAUSE?

Just before and during the incident, we saw a significant increase in traffic on a Harvest API write endpoint. The increased traffic unexpectedly put our primary database under heavy load, which slowed page load times. Eventually, page loads became slow enough that requests began to time out.

After we noticed the increased database load and traced it back to the Harvest API, we manually terminated the load originating in Harvest as a stopgap until we could disable the API keys responsible for it. Once those API keys were disabled, site performance returned to normal.

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

We have conservatively reduced the rate limit for the Harvest API endpoint that caused the increased database load. We will evaluate that endpoint more thoroughly to determine an appropriate permanent rate limit.
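
For readers unfamiliar with per-endpoint rate limiting, the general idea can be sketched as a token bucket kept per API key. This is an illustrative sketch only, not a description of how the Harvest API enforces its limits; the class names, parameters, and numbers below are all hypothetical.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity          # start with a full bucket
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill in proportion to the time elapsed since the last call.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class EndpointRateLimiter:
    """Hypothetical limiter for one endpoint: one bucket per API key."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[api_key] = bucket
        return bucket.allow()


if __name__ == "__main__":
    # Assumed numbers for illustration: 1 request/second, bursts of 2.
    limiter = EndpointRateLimiter(rate=1.0, capacity=2.0)
    for i in range(3):
        status = "accepted" if limiter.allow("example-key") else "rejected"
        print(f"request {i + 1}: {status}")
```

In a setup like this, a rejected request would typically receive an HTTP 429 (Too Many Requests) response before it reaches the database. Lowering `rate` or `capacity` for a single expensive write endpoint limits how much load any one key can generate, without affecting other endpoints.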

We apologize for the inconvenience this incident has caused. If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new.

Posted May 02, 2019 - 12:25 EDT

Resolved
This incident has been resolved. We will post a detailed postmortem within the next day.
Posted May 01, 2019 - 11:47 EDT
Monitoring
We have taken corrective actions and error rates in both Greenhouse Recruiting and the Harvest API have returned to normal. We are monitoring the application to confirm that the issue is fully resolved.
Posted May 01, 2019 - 11:27 EDT
Investigating
Our internal monitoring is reporting increased errors from Greenhouse Recruiting and the Harvest API. We are currently investigating.
Posted May 01, 2019 - 11:15 EDT
This incident affected: Greenhouse Harvest API (Silo 1) and Greenhouse Recruiting (Silo 1).