Over the past couple of weeks, Planning Center’s performance on Sunday mornings has been unacceptable by your standards and by ours. As a parent who has stood in long check-in lines, and as a former church staff member who has experienced the pain of a slow check-in process firsthand, I feel your pain.
We let you down, and we’re not taking it lightly. Our entire organization is focused on fixing this problem and regaining your trust.
I want to be as open and transparent as possible with you. At a very high level, the database that powers Planning Center has been getting bogged down under the new workload of all our churches starting their fall season. We worked with a database consultant to verify our configuration, rewrote parts of our application to reduce their reliance on the database, and even moved the database to a more powerful server. In the end, none of these measures relieved the pressure.
After much research and investigation we have decided to move our three biggest applications to three separate database servers. So Check-Ins, Services, and People now have separate database servers, each one as powerful as the one that used to power the databases for all our apps. This will give us more than enough headroom so that the issues we have been experiencing will no longer affect us. We will also continue examining every part of our apps to make sure they are finely tuned.
If you would like more details, our operations team has included more information on exactly what happened over the last couple of weeks, the actions we have taken, and the mistakes we have made.
We would like to offer a one-month credit to all of our Check-Ins customers who experienced this issue. I know that an account credit pales in comparison to the weight of the problem, but I want to communicate how serious this situation is to me. If this issue affected you and you are an Organization Administrator, you can receive your credit here.
Again, I would like to reiterate that we are not brushing this off, and that we will make this right. I believe we have the best team to solve these issues and deliver the performance our customers deserve. If you have any questions or concerns that have not been addressed here, please let us know and we will do what we can to help.
Thank you for using Planning Center.
P.S. Our Help Center has an article with ideas for checking people in when you don't have access to Check-Ins for any reason, even if that's just because of your personal internet connection.
Two Sundays ago (August 20th)
The short version is that there was a bug in some code that made our databases work too hard. Once that code was reverted, service recovered quickly. For the full rundown, you can read last week’s report.
Through a complete analysis following the incident, we learned two things:
- There were some inefficient search queries in Check-Ins that needed to be addressed.
- If we had more resource headroom on our MySQL database server hardware, we would have been in a better place to handle the bad code.
As soon as these issues were identified, work began.
To address the first point, we decided that Check-Ins should utilize something called Elasticsearch, which is a type of database purpose-built for searching large datasets. We recently deployed Elasticsearch to power the search bar in Planning Center People and the new people page in Services, so we had grown comfortable with the technology and knew it was up to the job. On Monday morning, two developers dropped the projects they were working on and spent their week implementing Elasticsearch in Check-Ins. On Thursday afternoon we were able to deploy the new code to production.
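To give a sense of what this kind of migration looks like, here is a minimal sketch of the sort of name search that moves well from SQL `LIKE` queries onto Elasticsearch. The index and field names are hypothetical for illustration, not Planning Center's actual schema.

```python
# Illustrative sketch only — index and field names ("people",
# "first_name", "last_name") are assumptions, not the real schema.

def name_search_query(term, max_results=25):
    """Build an Elasticsearch query body that matches people by name,
    tolerating partial matches and minor misspellings."""
    return {
        "size": max_results,
        "query": {
            "multi_match": {
                "query": term,
                "fields": ["first_name", "last_name"],
                "fuzziness": "AUTO",  # tolerate small typos in names
            }
        },
    }

# With the official Python client, the body would be sent along the
# lines of: es.search(index="people", body=name_search_query("jon smith"))
```

The win over SQL `LIKE '%term%'` queries is that Elasticsearch answers these from an inverted index built for text matching, so the relational database never sees the search traffic at all.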
This was a huge success. The graph below shows the average response time to search for someone by name from a check-in station (shorter is better):
The vertical blue line in the center of the graph indicates the moment Elasticsearch was deployed. Not only were the searches returning faster than before, but all of that load was completely removed from our database server. Great success!
As for the second takeaway, we needed to upgrade our database server. Late Tuesday night we took the site down for scheduled maintenance while we migrated our database to a much larger server.
Initially results looked very promising. The server had tons of headroom, and the average response time for most of our apps was the fastest it has ever been.
This graph shows the number of queries that take more than 2 seconds for the database to perform. On the right side you can see that these slow queries fell to practically zero once the new hardware was deployed. We were ecstatic, and with both of these changes in place we felt very confident about going into another weekend.
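A graph like this is typically fed by MySQL's slow query log. As an illustration (these are standard MySQL options, not necessarily our exact production settings), the relevant `my.cnf` fragment for capturing queries over 2 seconds looks like this:

```ini
# my.cnf — log any query that takes longer than 2 seconds.
# Values shown are illustrative, not our production configuration.
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql/slow.log
long_query_time     = 2
```

Counting entries in that log over time produces exactly the kind of slow-query chart described above.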
Last Sunday (August 27th)
Another member of our ops team and I were online early Sunday morning to monitor our infrastructure. At first, everything looked incredible. Our response times were the fastest they’ve ever been. Even while our database server was performing 50,000 requests per second (that’s over 3 million per minute!), it was hardly breaking a sweat. Then suddenly at 6:50am (all times Pacific) we started seeing degraded performance across our apps, and response times skyrocketed.
Our first instinct was to look at the new database server. Its CPU, IO utilization, and memory usage were well within its limits, so we kept looking.
We looked for slow queries clogging things up, checked our network connection to Elasticsearch and other services, and checked for queueing in our load balancer and web servers; none of it turned up anything. Lists in Planning Center People have caused high database load in the past, so as a precaution we disabled them. We then fired up more web servers, hoping that would help, but it was largely ineffective.
By 7:30am we were running out of options. Even though the database looked fine to us, we called our MySQL specialist to have a look. He modified some configuration settings in MySQL, which seemed to help. By 8:30am our servers were back to handling requests appropriately, but by then the brunt of our Sunday morning check-in rush had passed. It was impossible to know whether the improvement came from the new configuration settings or from the decreased overall traffic.
At 8:45am our systems had recovered, but we were still nervous so we kept lists disabled for a while longer. They were re-enabled at about 10:00am.
We left Sunday morning knowing that none of us would sleep well until we knew exactly what the problem was so that it could be fixed.
Alongside our MySQL consultants, we performed a deep dive into our logs to analyze what happened, and came to a surprising conclusion.
Up until now, we’ve always had success scaling our MySQL server by simply putting it on bigger hardware. Unfortunately, after a certain threshold of hardware size is reached, MySQL stops scaling linearly without being fine-tuned for the type of work it does. When we moved our database to a new server with so many more processing cores, we found those limits, and our database became incapable of handling the load. It actually became less efficient than it was with fewer cores.
We learned a few lessons last week, and we’re acting on those right now.
First, the work to move more queries off of our database and onto Elasticsearch was a huge win. The same two developers are now moving even more searches from MySQL onto Elasticsearch.
Second, any time we see performance degradation in the future, we will immediately place a call to our specialists to get them looking, whether initial analysis indicates a database problem or not.
Third, we can’t keep MySQL on the server it’s on now. In order to get MySQL to perform efficiently on so many cores, its settings need to be fine-tuned for the type of work that the database typically does. But since we have seven very different apps sharing this database, there is no “typical.”
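To make the "fine-tuned for the workload" point concrete, here is an illustrative `my.cnf` fragment showing the kinds of knobs that matter on a large multi-core host. The values are examples only, not our production settings, and the right numbers depend entirely on the workload each database serves.

```ini
# Illustrative only — example InnoDB knobs that must be tuned per
# workload on large multi-core servers. Values are NOT our settings.
[mysqld]
innodb_buffer_pool_size      = 96G   # sized to the working set in RAM
innodb_buffer_pool_instances = 16    # reduce mutex contention across cores
innodb_io_capacity           = 2000  # match the storage's real IOPS
innodb_thread_concurrency    = 64    # cap concurrent threads inside InnoDB
```

With seven very different apps sharing one database, a value that is right for one workload (say, Check-Ins' burst of Sunday-morning reads) can be wrong for another, which is why a single shared configuration stops working at this scale.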
So, last night during scheduled maintenance we brought three new database servers online, all of which are the same size and running the same configuration that we’ve had success with for the past 8 months. One server is dedicated to running Services, another is dedicated to Check-Ins, and the third is shared by People and the other apps. Since each of these servers is based on the same hardware that all of Planning Center was running on just weeks ago, there is tons of headroom in each. (Side note: technically we haven’t moved from one server to three; we’ve actually moved from four servers to twelve. Each “server” is really a set of four: a primary, a secondary, a backup server, and another backup server running on a time delay. It’s a complicated setup that keeps your data safe from inadvertent loss and helps us sleep better at night.)
Now that we are back to running on a proven hardware configuration with one-third of the traffic on each server, we are confident we will be able to support our growth for a long time to come. As our normal Tuesday morning traffic starts to ramp up, we’re watching the graphs and the results are promising. We’re back to having some of the best performance we’ve ever had.
Next, we’ve begun investigating strategies for stress testing our infrastructure. There is much work to do before we are able to simulate the full complexity of our normal Sunday morning traffic, but initial work has been promising and has already revealed some previously undiscovered bottlenecks.
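The core idea of a stress test can be sketched very simply: replay many concurrent requests and summarize the latency distribution. This is a minimal illustration of the approach, not our actual tooling; the handler below is a stub where a real test would hit a staging copy of the production stack.

```python
# Minimal load-test sketch. `fake_check_in` is a stand-in for a real
# HTTP request to a check-in endpoint; everything else is generic.
import random
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def fake_check_in(station_id):
    """Stand-in for a real request; returns the observed latency."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.005))  # simulated service time
    return time.perf_counter() - start

def run_load_test(requests=200, concurrency=20):
    """Issue `requests` calls across `concurrency` workers and
    report mean and 95th-percentile latency in seconds."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(fake_check_in, range(requests)))
    return {
        "requests": len(latencies),
        "mean_s": statistics.mean(latencies),
        "p95_s": latencies[int(0.95 * len(latencies)) - 1],
    }
```

The hard part, as noted above, is not the harness but making the simulated traffic match the shape of a real Sunday morning: bursts of check-ins just before each service hour rather than a steady stream.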
Lastly, we will be adding proactive monitoring of all actions in our apps to detect and alert on any regressions caused by newly deployed code. We’ll also work to detect and alert on areas in our applications that slow down over weeks or months, so the developers and operations teams can work together to make improvements before these regressions become a real problem.
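The regression-alerting idea reduces to a simple comparison: for each action, compare recent response times against a historical baseline and flag slowdowns beyond a threshold. The sketch below illustrates the concept only; the threshold and the notion of "baseline" here are assumptions, not our actual alerting rules.

```python
# Conceptual sketch of deploy-regression detection. The 1.5x threshold
# is an illustrative assumption, not a real alerting rule.
import statistics

def detect_regression(baseline_ms, recent_ms, threshold=1.5):
    """Return True if the mean of recent response times exceeds the
    baseline mean by more than `threshold` (1.5 = 50% slower)."""
    baseline_mean = statistics.mean(baseline_ms)
    recent_mean = statistics.mean(recent_ms)
    return recent_mean > baseline_mean * threshold
```

Run per action on a schedule, this catches sudden regressions from a bad deploy; comparing week-over-week baselines instead catches the slow creep that builds up over months.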
Over the last 18 months we’ve seen an amazing influx of new churches relying on Check-Ins. To be specific, 18 months ago we would see about 200,000 kids checked in on a Sunday morning. Three weeks ago, that number was just shy of 400,000. It’s humbling to see our work help so many churches, and help keep so many kids safe.
Keeping up with that growth is a challenge, and until now we’ve taken it in stride. Over that same time frame we’ve made substantial changes to our teams and processes: our operations team has quadrupled in size (from one to four), we’ve migrated from owning physical servers to the much more scalable Amazon Web Services, and we’ve had hardly a hiccup on Sunday mornings.
I am incredibly proud of the work our teams have done so far and I have absolute confidence in their ability to continue navigating our ever-growing demands. We all make mistakes, but we are learning from them and are committed to doing everything we can to ensure your weekends go off without a hitch.
Thank you for your support.
Update — September 5th, 2017
Last Sunday went off without a hitch on our end. In fact, our average response times for most requests were the fastest they've ever been.
The charts below are a nice visual indicator of how check-in stations performed on Sunday. The top graph shows how long it took for a station to load, and the bottom graph shows how frequently stations were reloaded. Combined, you can see that response times remained constant as throughput increased, which is exactly what we want. Times are in Pacific Time.
(Also, the throughput chart is a neat way to visualize the aggregate busiest times for checking children in across the country — big spikes just before every hour with smaller bumps just before the half hours.)
Again, thank you for your patience and for sticking with us.
If you did experience any issues last Sunday (or any other day!), please get in touch with our friendly support team and they'll help you get the issue sorted out.