Greenhouse Recruiting unavailable

Incident Report for Greenhouse

Postmortem

WHAT HAPPENED?

Beginning at 6:19pm UTC (2:19pm EDT) on August 7, 2019, Greenhouse Recruiting served errors at an increased rate for many users.

By 6:34pm UTC, Greenhouse Recruiting was entirely down for that same group of users.

By 7:10pm UTC, Greenhouse Recruiting error rates had returned to normal for all users.

WHAT WAS THE EFFECT?

For the duration of this incident, many users consistently received error pages while accessing Greenhouse Recruiting.

WHO WAS AFFECTED?

Customers who access Greenhouse Recruiting through https://app.greenhouse.io encountered increased errors and downtime.

Customers who access Greenhouse Recruiting through https://app2.greenhouse.io were unaffected by this incident.

About 50% of customers who access Greenhouse Recruiting through a company subdomain (e.g. https://mycompany.greenhouse.io) either encountered increased errors or were unable to access the application at all.

Customers making write requests through the Harvest API encountered increased errors.

Candidates applying through Greenhouse Job Boards or the Job Board API were unaffected. Greenhouse Onboarding, the Onboarding API, and Harvest API read requests were also unaffected.

WHAT WAS THE CAUSE?

At 6:17pm UTC, we encountered a bug in our database, PostgreSQL, that caused connections, locks, and load to spike. Query execution times quickly increased and normal queries began timing out, so our web application began serving more errors to users.

At 6:19pm UTC, we were notified of increased errors by our internal monitoring and began investigating. We noticed the increased locks in our database. We attempted to kill the problematic connections, but were unable to do so. We then attempted to reboot our database. Both of those operations failed because PostgreSQL was unable to kill some connections, another effect of the bug that initiated the incident. We contacted our hosting provider, but while they were investigating, the database finally rebooted on its own.
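
For readers unfamiliar with this kind of investigation, the following is a minimal sketch of how lock-waiting sessions might be listed and targeted for termination in PostgreSQL. It is not the tooling Greenhouse used. The pg_stat_activity view and the pg_terminate_backend function are standard PostgreSQL, but the Session type, the helper names, and the 60-second threshold are hypothetical and chosen only for illustration. As described above, termination requests like these did not succeed during this incident.

```python
from dataclasses import dataclass
from typing import List

# Query an operator might run to list sessions waiting on locks.
# Assumes PostgreSQL 9.6 or later, where wait_event_type is available.
LOCK_WAIT_QUERY = """
SELECT pid, wait_event_type,
       EXTRACT(EPOCH FROM now() - query_start) AS running_seconds
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
ORDER BY query_start;
"""


@dataclass
class Session:
    """One row from the query above (hypothetical container type)."""
    pid: int
    wait_event_type: str      # '' when the session is not waiting
    running_seconds: float    # seconds since the current query started


def pids_to_terminate(sessions: List[Session], max_seconds: float) -> List[int]:
    """Return pids of sessions that have waited on a lock longer than max_seconds."""
    return [
        s.pid
        for s in sessions
        if s.wait_event_type == "Lock" and s.running_seconds > max_seconds
    ]


def terminate_statements(pids: List[int]) -> List[str]:
    """Build one pg_terminate_backend call per pid (the statements are not executed here)."""
    return [f"SELECT pg_terminate_backend({int(pid)});" for pid in pids]


# Example with made-up rows; the 60-second threshold is an arbitrary assumption.
sample = [
    Session(pid=101, wait_event_type="Lock", running_seconds=120.0),
    Session(pid=102, wait_event_type="", running_seconds=500.0),
    Session(pid=103, wait_event_type="Lock", running_seconds=5.0),
]
statements = terminate_statements(pids_to_terminate(sample, max_seconds=60))
```

The sketch only builds the statements and leaves execution to the operator. pg_terminate_backend signals the backend process to exit, so a backend that cannot act on that signal, as happened here, will remain connected.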

The database reboot completed at 7:06pm UTC, and by 7:10pm UTC our error rates had returned to normal.

WHAT ARE WE DOING TO PREVENT THIS FROM OCCURRING AGAIN?

We have confirmed that the underlying cause of this incident was a bug in PostgreSQL. We will upgrade our primary databases to a version that includes the fix for the bug. In the meantime, we are disabling a database feature that allows this bug to occur.

We are also working with our hosting provider to determine whether there are actions we could take to return our database to a functioning state more quickly if a similar failure occurs in the future.

We apologize for the inconvenience this incident has caused. We take the reliability of our application seriously and are actively working to prevent incidents like this one from occurring in the future. If you have any questions or concerns, please reach out via: https://support.greenhouse.io/hc/en-us/requests/new.

Posted Aug 09, 2019 - 19:02 UTC

Resolved

This incident has been resolved. We will provide a full postmortem when we have concluded our internal investigation.

We apologize for the inconvenience this has caused you and your teams.

Posted Aug 07, 2019 - 20:05 UTC

Monitoring

Greenhouse Recruiting is now available for all customers. We will continue monitoring for any additional disruption.

Posted Aug 07, 2019 - 19:10 UTC

Update

Our team is continuing to investigate and will post updates shortly.

Posted Aug 07, 2019 - 19:04 UTC

Update

Greenhouse Recruiting is currently unavailable for some customers.

Posted Aug 07, 2019 - 18:34 UTC

Investigating

We are investigating an increase in error rates in Greenhouse Recruiting.

Posted Aug 07, 2019 - 18:24 UTC

This incident affected: Greenhouse Recruiting (Silo 1).