Make small, data-driven performance improvements

Wil Hall

We often don’t notice performance issues until they become, well, issues.

Furthermore, it may only become imperative to address them — or so we’re told — when they start affecting our end users.

Performance issues are a form of technical debt. They become issues because we neglect the root cause until it affects our application in a way we can no longer ignore, until it is critical for us to fix it right now.

This is the worst-case scenario, because we’re fixing the issue under time pressure. Just like we should practice design-driven development and never build a feature until we know why the user cares, we shouldn’t rush into fixing a performance issue until we fully understand the cause and effect.

In order to know how to fix our application, we have to measure it.

When and what to measure

The tempting answer when faced with deciding what to measure about your application is everything. After all, why wouldn’t you want more insight into how your application is performing?

But just because you have a lot of metrics doesn’t mean you’re measuring the right things. And more metrics means more data to sort through when trying to identify a performance issue.

Our metrics — just like our tests — should provide clarity around critical paths in our application. When we add additional metrics, we should always do so in order to prove or disprove our assumptions about how the application is performing.

Ideally we have some general performance metrics being collected before we see an issue, so that we can compare the performance metrics while the issue is occurring to their historical values.

When working on a new application, start small. It’s important to get ahead of performance issues, but it’s better to add metrics and monitoring incrementally as you need them.

Whether your application is new or already established, consider starting with the following:

Response time (internal)

Use an application performance monitoring solution such as AppSignal to see how your application is performing, particularly when it comes to mean response time and where that time is being spent.
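
To get a feel for what an APM is measuring, here is a minimal sketch of in-application response-time measurement written as a Python WSGI middleware. The wrapped `app` and the logger configuration are assumptions for illustration; a tool like AppSignal instruments this for you and also attributes the time to individual layers of the stack.

```python
import logging
import time

logger = logging.getLogger("performance")


class ResponseTimeMiddleware:
    """Wraps a WSGI app and logs how long each request takes."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        try:
            return self.app(environ, start_response)
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            # An APM would also break this time down by database calls,
            # view rendering, external services, and so on.
            logger.info(
                "%s %s took %.1fms",
                environ.get("REQUEST_METHOD"),
                environ.get("PATH_INFO"),
                elapsed_ms,
            )
```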

Response time (external)

An external monitoring tool such as UptimeRobot can’t replace an application performance monitoring solution, but it may provide a more accurate representation of the performance end-users are experiencing. Many external monitoring solutions also offer additional features such as alerts for uptime SLA violations, reminders about TLS certificate expiration, and more.
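
As a rough sketch of what an external check does, the script below measures response time from the outside, including the DNS, TLS, and network latency that internal measurements miss. It assumes the `requests` library and a hypothetical `/health` endpoint; hosted tools such as UptimeRobot run checks like this from multiple regions and alert you when they fail.

```python
import time

import requests


def check(url: str, timeout: float = 5.0) -> None:
    start = time.monotonic()
    try:
        response = requests.get(url, timeout=timeout)
        elapsed_ms = (time.monotonic() - start) * 1000
        print(f"{url} -> {response.status_code} in {elapsed_ms:.0f}ms")
    except requests.RequestException as error:
        print(f"{url} -> FAILED ({error})")


if __name__ == "__main__":
    # Hypothetical health-check endpoint
    check("https://example.com/health")
```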

Database performance

Application performance is directly affected by database performance, and while application monitoring can often identify when the database is an issue, having dedicated tools to monitor your database servers can help pinpoint the reason for the issue. Visual monitoring tools such as pgDash for Postgres or RedisMonitor for Redis can be a great way to analyze your database performance. You may still need to utilize some advanced techniques for trickier problems, but these monitoring tools will help you know where to start looking.
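
If you want a starting point before reaching for a dedicated tool, the sketch below pulls slow-query statistics directly from Postgres. It assumes the `pg_stat_statements` extension is enabled, the `psycopg2` driver is installed, and Postgres 13 or newer (older versions name the columns `mean_time` and `total_time`); the connection string is a placeholder.

```python
import psycopg2

SLOW_QUERIES = """
    SELECT query, calls, mean_exec_time, total_exec_time
    FROM pg_stat_statements
    ORDER BY mean_exec_time DESC
    LIMIT 10;
"""

# Placeholder connection string
with psycopg2.connect("postgresql://localhost/myapp_production") as conn:
    with conn.cursor() as cur:
        cur.execute(SLOW_QUERIES)
        for query, calls, mean_ms, total_ms in cur.fetchall():
            print(f"{mean_ms:8.1f}ms avg  {calls:6d} calls  {query[:60]}")
```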

Start small

When considering the above, utilize the tools that are already available to you, and then supplement as needed. For example, if you’re using Heroku, a lot of application dyno metrics are available out of the box, and some of the aforementioned monitoring tools are available as Heroku add-ons.

Measure twice, patch once, and repeat

When we first notice a performance smell — before it becomes a performance issue — we should take the time to consider what monitoring we can add to gain insight into the problem.

  1. Start with a problem statement: Under load, database queries are taking longer.
  2. Identify the information you need to narrow down the issue: We need to be able to monitor query performance while the application is under load to determine which database queries are problematic.
  3. Augment the existing monitoring: By introducing an application monitoring solution, we can see what queries are executed in production when a request takes longer than 1 second.
  4. Identify the issue: After analyzing the queries made on requests with response times greater than 1 second, we identified that some_table in our Postgres database was missing several indexes.
  5. Fix the issue: Make an informed fix, leaving a paper trail showing how the issue was identified and why the particular solution was chosen, as sketched below.
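
As a concrete, hypothetical illustration of steps 4 and 5, the sketch below uses `psycopg2` to confirm that queries against some_table are doing sequential scans and then adds the missing index. The `user_id` column is invented for the example; in a real application the fix would live in a migration whose commit message and pull request form the paper trail.

```python
import psycopg2

conn = psycopg2.connect("postgresql://localhost/myapp_production")  # placeholder
conn.autocommit = True  # CREATE INDEX CONCURRENTLY cannot run inside a transaction

with conn.cursor() as cur:
    # Step 4: confirm the slow query plan; a "Seq Scan" node here means
    # Postgres is reading the whole table to answer the query.
    cur.execute(
        "EXPLAIN ANALYZE SELECT * FROM some_table WHERE user_id = %s", (42,)
    )
    for (line,) in cur.fetchall():
        print(line)

    # Step 5: make the informed fix, building the index without locking out writes.
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS "
        "index_some_table_on_user_id ON some_table (user_id)"
    )
```

Creating the index concurrently avoids blocking writes to the table while it builds, which matters if you’re applying the fix while the application is still under load.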

Getting ahead of performance issues starts with reviewing application metrics often, identifying anomalies, and dedicating the time to understanding them.

Work iteratively, adding monitoring as you need it to ensure that when you review your application performance, the data you have is relevant and reliable.

By taking the time to do this proactively, we can avoid having to do it under time pressure.

This process should be a shared responsibility of the entire team. Just as we need to find the time to refactor, we need to find the time to monitor our application. The best way to accomplish this is to redefine what it means to be “done” with a new feature, and work as a team to make performance considerations part of our day-to-day development cycle.