I've got a theory: The Scientific Method applied to web site performance

By John Locke on July 13, 2014

What can you do about this page being so slow? That's a question we've been asked by half a dozen customers in the past 6 months, and as it turns out, we can do quite a lot.

One of my long-standing complaints about Drupal is that it's a resource hog. That's an issue we can generally help by throwing lots of hardware and caching systems at the problem -- but that's not the kind of performance issue these clients were having.

For one client, a crowdfunding site, it was a page that reported how much money needed to be paid out to the charities. For another client, it was a roster of soccer players that had become unusable right at the start of their season. For another, it was an athletic consent form, which was returning a server error. For yet another, they couldn't edit the display settings for a particular content type unless we bumped the memory limit up over 1/2 GB.

We're just now working with a new client who has memcached set up, and lots of other caching -- and has bumped the memory limit up to a whopping 2GB. (Our typical servers run at 1/16th of that, or 128MB, per request.)

Most of these come from sites built by other developers, but we were responsible for a couple of them.

The question is, when challenged with this type of issue, how do you resolve it? Here's my approach:

First, observe and collect information. What page(s) exhibit the problem? Does the same problem exist on a copy of the same site in a different environment? Is the server running out of RAM? Is there a slow query in the database? Does it matter if you're logged in? (Most of the problems we've troubleshot recently were on private administrative pages.)

Next, develop a hypothesis. With experience you develop a nose for this kind of thing -- I often feel like I can sniff out the general area causing the problem with just a bit of investigation. A database query missing an index. A database query getting called inside a loop. A poorly designed model. A recursive function run amok.
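
To make the "query getting called inside a loop" smell concrete, here is a minimal sketch. It uses Python with an in-memory SQLite database as a stand-in for Drupal and MariaDB, and the table and function names (players, teams, roster_with_loop, roster_with_join) are made up for illustration, not taken from any client site. It simply counts how many statements each approach sends to the database.

```python
import sqlite3

def build_db(n_players=200, n_teams=10):
    """Create a throwaway in-memory database with hypothetical roster tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE teams (id INTEGER PRIMARY KEY, name TEXT)")
    db.execute(
        "CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT, team_id INTEGER)"
    )
    db.executemany(
        "INSERT INTO teams VALUES (?, ?)",
        [(t, f"Team {t}") for t in range(n_teams)],
    )
    db.executemany(
        "INSERT INTO players VALUES (?, ?, ?)",
        [(p, f"Player {p}", p % n_teams) for p in range(n_players)],
    )
    return db

def roster_with_loop(db):
    """Anti-pattern: one query for the list, plus one more query per row."""
    queries = 1
    roster = []
    rows = db.execute("SELECT name, team_id FROM players ORDER BY id").fetchall()
    for name, team_id in rows:
        team = db.execute(
            "SELECT name FROM teams WHERE id = ?", (team_id,)
        ).fetchone()[0]
        queries += 1
        roster.append((name, team))
    return roster, queries

def roster_with_join(db):
    """Same result from a single statement that joins the two tables."""
    roster = db.execute(
        "SELECT p.name, t.name FROM players p "
        "JOIN teams t ON t.id = p.team_id ORDER BY p.id"
    ).fetchall()
    return roster, 1

if __name__ == "__main__":
    db = build_db()
    print("loop queries:", roster_with_loop(db)[1])
    print("join queries:", roster_with_join(db)[1])
```

The loop version issues one query per player on top of the initial one, so the query count grows with the size of the roster, while the join version stays at one. On a real site the per-query overhead is much higher than with in-memory SQLite, which is why this pattern tends to show up only once the data grows.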

Test the hypothesis. Break out the profiler, the debugger, the query logging, the query plan explainer, and the memory monitors, and look for evidence that supports or eliminates your hypothesis. If it's eliminated, or the evidence is inconclusive, go back to the previous step and come up with a new hypothesis. If it appears to be confirmed, keep testing and dialing in until you reach the source of the problem.
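
As an illustration of testing a "missing index" hypothesis with a query plan, here is a small sketch. It assumes Python's built-in SQLite rather than MariaDB, so it uses SQLite's EXPLAIN QUERY PLAN, whose output looks different from MariaDB's EXPLAIN. The donations table and the index name are hypothetical.

```python
import sqlite3

def query_plan(db, sql, params=()):
    """Return the human-readable plan details SQLite reports for a query."""
    rows = db.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the description of that plan step.
    return [row[-1] for row in rows]

# Hypothetical table: 5000 donations spread across 50 charities.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE donations (id INTEGER PRIMARY KEY, charity_id INTEGER, amount REAL)"
)
db.executemany(
    "INSERT INTO donations (charity_id, amount) VALUES (?, ?)",
    [(i % 50, 10.0) for i in range(5000)],
)

sql = "SELECT SUM(amount) FROM donations WHERE charity_id = ?"

# Hypothesis: the query is slow because charity_id has no index.
before = query_plan(db, sql, (7,))

# Candidate fix: add the index, then look at the plan again.
db.execute("CREATE INDEX idx_donations_charity ON donations (charity_id)")
after = query_plan(db, sql, (7,))

if __name__ == "__main__":
    print("before:", before)  # exact wording varies by SQLite version
    print("after: ", after)
```

If the hypothesis holds, the first plan reports a scan of the whole table and the second reports a search using the new index. If the plan already used an index before the change, the hypothesis is eliminated and it's time for a new one.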

Use tools, and measure results. Here's one way you can tell how competent a shop is, at least at a technical level -- do they make regular use of tools to help them arrive quickly at a result? Here are a few of our favorite tools to get the job done:

Devel module, dpm function -- Most Drupal developers know to use this, or at least a print_r or var_dump to see what the value of a variable is at a particular part of the code.
XDebug -- a tool that should be in any PHP developer's toolbox, this combined with an IDE lets you step through code, set breakpoints, figure out why something doesn't have the value you expect it to have. Debugging is a bit more setup, but far more powerful than printing variable values.
MariaDB Slow Query log -- we run MariaDB on most of our servers, and with either that or MySQL you can turn on the "slow query log" in my.cnf (slow_query_log). Then you'll get a log of any individual queries that take longer than a specified number of seconds (long_query_time) and, if you enable it, any queries that don't use an index (log_queries_not_using_indexes).
EXPLAIN SELECT -- this database statement reveals how the database query planner will join the tables in any particular query you want to optimize. Crucial things to look for include joins that generate a temporary table, joins that have to examine a high number of rows, and joins that don't have an index listed.
XHProf -- now we get to code profiling. This PHP extension has built-in support in the Drupal devel module -- when enabled, it generates a bunch of linked HTML reports you can browse that show how much time and RAM is spent on each function call, as well as its parent and child functions. With some experience, browsing a profile run lets you quickly identify where the bottleneck in your code is, and compare different runs.
Web analytics -- When performance issues are not consistent, they tend to have an environmental cause. This might be spikes in traffic, Denial of Service attacks, or simply not enough capacity with the current hardware without more tuning. Analytics can shed light on the peak times of day, the peak days of the week, and other regular behavior of your audience.
Server performance monitoring -- Amazon has monitoring tools that can take snapshots of things like amount of RAM in use, swap used, number of connections, etc. for EC2 servers. We run our own monitoring server that does the same thing for our servers and any customers on our server maintenance plan. This can give us insight into performance-related events that analytics may not capture -- what else was going on on the server at the time?
Server log analysis -- this is most useful for identifying sources of attack. There are great tools like Splunk for doing this, but we generally just do some quick log parsing with Awk to get sorted lists of the top visitors in a particular time period. If it's not Google's spiders, and it's not our customers' offices, we can guess at competitors or possibly malicious attackers, and block them at the server level.
Page load measurements -- there are several browser extensions that can give you a measurement of how long it takes to get a page from your server, how long it takes to render that page, how long it takes to run any scripts, etc. The Chrome Inspector and Firebug are two crucial tools here that can give you very specific numbers for page load times. These times vary a lot, especially between "cold" and "warm" visits, so you're looking for representative numbers to give you the ballpark, not an exact figure -- too many other network factors can alter your results. This Chrome extension is also very handy.
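
The "top visitors" log parsing mentioned above is usually a one-line Awk job for us, but the same idea can be sketched in Python. This version assumes an access log in the common log format, where the client address is the first whitespace-separated field. The sample lines and addresses are invented for illustration, and filtering by time period is left out to keep it short.

```python
from collections import Counter

def top_visitors(log_lines, limit=10):
    """Count requests per client address, assuming it is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts.most_common(limit)

# Hypothetical sample, using reserved documentation address ranges.
SAMPLE_LOG = [
    '203.0.113.9 - - [13/Jul/2014:10:00:01 +0000] "GET /roster HTTP/1.1" 200 5120',
    '198.51.100.4 - - [13/Jul/2014:10:00:02 +0000] "GET / HTTP/1.1" 200 2048',
    '203.0.113.9 - - [13/Jul/2014:10:00:03 +0000] "GET /roster HTTP/1.1" 200 5120',
    '192.0.2.77 - - [13/Jul/2014:10:00:04 +0000] "GET /about HTTP/1.1" 200 1024',
    '203.0.113.9 - - [13/Jul/2014:10:00:05 +0000] "GET /roster HTTP/1.1" 200 5120',
    '198.51.100.4 - - [13/Jul/2014:10:00:06 +0000] "GET /contact HTTP/1.1" 200 900',
]

if __name__ == "__main__":
    for address, hits in top_visitors(SAMPLE_LOG, limit=3):
        print(hits, address)
```

To run it against a real log you would pass an open file object instead of the sample list, since iterating over a file yields one line at a time.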

Once you have identified the problem, and worked out a solution, measure the improvement in performance. Sharing what you've found, and the improvements gained, can help prevent your team from making the same mistakes again and help you deliver better quality. Science advances through publication of results, peer review of the methodology, and other groups replicating the same (or equivalent) results. Open source software improves the same way.
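
One simple way to put a number on the improvement is to time the old and new code paths several times each and compare the medians, since any single run can be skewed by a cold cache or other noise. The sketch below assumes Python, and the helper names (median_seconds, improvement_percent) and the two report functions are hypothetical stand-ins for a real before and after.

```python
import statistics
import time

def median_seconds(func, runs=5):
    """Call func several times and return the median wall-clock duration."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

def improvement_percent(before, after):
    """Percentage reduction in time going from before to after."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0

def report_before():
    """Hypothetical slow version: repeats the same lookup work for every row."""
    return [sorted(range(500)).index(i % 500) for i in range(2000)]

def report_after():
    """Hypothetical fixed version: does the shared work once, then reuses it."""
    positions = {value: pos for pos, value in enumerate(sorted(range(500)))}
    return [positions[i % 500] for i in range(2000)]

if __name__ == "__main__":
    before = median_seconds(report_before)
    after = median_seconds(report_after)
    print(f"before: {before:.4f}s  after: {after:.4f}s")
    print(f"improvement: {improvement_percent(before, after):.1f}%")
```

The exact timings will differ from machine to machine, so the percentages are only meaningful when both measurements come from the same environment.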

We've hit, and resolved, quite a few big performance issues. While we're not necessarily going to share all the details -- non-disclosure rules prevent us from naming several of our clients -- we will be regularly writing posts about specific issues and how we've resolved them.

Stay tuned, and please share your favorite tools and techniques for resolving performance challenges in the comments!

Comments

mobile app performance testing

Just found this brilliant validation of sound 'test & learn' methodology. The process you've described is what I've been expounding / practising for years (little did I know it was based in the Theory of Science). We can say it is a complex branch of science.
