Continuing the series, the next item on the list is the mistake I see most often: putting slow code in loops, loading things that don't need to be loaded, and making simple requests expensive.
In terms of processing time, it's expensive to open a database connection. It's expensive to connect to another computer. It's expensive to load up a big framework to respond to a single request. It's relatively cheap to retrieve a pre-constructed page out of a cache.
The single biggest mistake I see that kills performance in code is putting database calls inside a loop. One code project we picked up had display code that showed the results of a search. First, it did a search to identify all the matching rows in the database. Then it looped through that result set, grabbing the rest of the data for each individual row, one query at a time. Then it cut down this set to the page size, discarding all that data it had loaded up. If the search yielded over a thousand results, it took over a minute to run! All of this data could be loaded with a single smarter database query--and doing so made the same search practically instantaneous.
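Here's a minimal sketch of that anti-pattern and the fix, using PDO with an in-memory SQLite database so it runs standalone; the `tasks` table and its columns are made up for illustration:

```php
<?php
// Set up a throwaway database with 100 "matching" rows.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, status TEXT)');
for ($i = 1; $i <= 100; $i++) {
    $db->exec("INSERT INTO tasks (title, status) VALUES ('Task $i', 'open')");
}

// Slow: find the matching ids first, then run one query per row
// (the classic N+1 pattern), discarding most of it later for paging.
$ids = $db->query("SELECT id FROM tasks WHERE status = 'open'")
          ->fetchAll(PDO::FETCH_COLUMN);
$slow = [];
foreach ($ids as $id) {
    $stmt = $db->prepare('SELECT * FROM tasks WHERE id = ?');
    $stmt->execute([$id]);
    $slow[] = $stmt->fetch(PDO::FETCH_ASSOC);
}

// Fast: one query that filters, sorts, and pages in the database,
// returning only the rows we'll actually display.
$fast = $db->query(
    "SELECT * FROM tasks WHERE status = 'open' ORDER BY id LIMIT 30 OFFSET 0"
)->fetchAll(PDO::FETCH_ASSOC);

echo count($slow), " rows via N+1, ", count($fast), " rows via one query\n";
```

The fast version does the same job in a single round trip, and the database only ever hands back the page being shown.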
This type of performance penalty is the main reason I don't care much for frameworks--they often trade performance for programmer convenience. That's fine while your site is small, but it means a lot more optimizing work down the road if your site takes off. And while good frameworks can map result sets onto objects efficiently, it usually takes some learning to make the framework do this--which means programmers are better off learning how to do all of the work themselves before adopting a framework, so they understand how to avoid these problems.
So here are some principles I use to make PHP applications speedy from day one.
Get as much data from each database query as you possibly can--but not much more
Unless a database table regularly contains a large blob we rarely need, go ahead and load the entire row when creating the corresponding object. For example, in a project management tool, you would give my code a task id and get back a task object with all of its properties pre-loaded from the database. You can still call getter methods for individual properties, but none of them triggers another call to the database.
When retrieving arrays of task items, I usually provide a static search method that runs a single query to get all of the data for all matching rows, constructs each task object, and passes it the already-retrieved data so there are no further database calls--request the first 30 matching tasks, and the system still makes only a single query.
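A hedged sketch of that static-search pattern, again with PDO and in-memory SQLite; the `Task` class and column names are illustrative, not from any particular project:

```php
<?php
class Task {
    private array $data;

    public function __construct(array $row) {
        // The row arrives pre-fetched; no database call happens here.
        $this->data = $row;
    }

    public function getTitle(): string {
        return $this->data['title']; // getter reads memory, not the database
    }

    /** One query for all matching rows; each object gets its data handed in. */
    public static function search(PDO $db, string $status, int $limit): array {
        $stmt = $db->prepare('SELECT * FROM tasks WHERE status = :s ORDER BY id LIMIT :n');
        $stmt->bindValue(':s', $status);
        $stmt->bindValue(':n', $limit, PDO::PARAM_INT);
        $stmt->execute();
        $tasks = [];
        foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
            $tasks[] = new Task($row); // pass the already-retrieved data
        }
        return $tasks;
    }
}

$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE tasks (id INTEGER PRIMARY KEY, title TEXT, status TEXT)');
$db->exec("INSERT INTO tasks (title, status) VALUES ('Write docs', 'open'), ('Ship it', 'done')");

$open = Task::search($db, 'open', 30);
echo $open[0]->getTitle(), "\n";
```

However many objects come back, the database sees exactly one query.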
Doing a database query is expensive, but making a sophisticated query doesn't add much to that as long as the database is properly indexed. When you know you have to do one, wring as much data as you can from each query. Use JOINs and database functions to do as much work as you possibly can in a single query.
I'm not that big a fan of stored procedures, mainly because I haven't learned how to manage them effectively across deployment instances. Make a change to a code base, and all you have to do to get it elsewhere is commit it to the repository and update your working copies. Make a change to a stored Postgres function, and you need to manually replace the function using psql or some other tool. Still, a stored function can be a way to offload more processing to the database, possibly gaining some performance in the process.
In general, I think of the database as being in a separate silo from the business logic. The requests between these silos are what's expensive--the processing on one side or the other is less so. Minimize the number of times you cross that boundary, and your application will be faster. As a side benefit, when your traffic outgrows what a single server can handle and your database calls actually go to a different server, you won't need to rewrite your application.
Avoid repeating yourself
Cutting and pasting while programming is a bad thing. Stepping through my own code with a debugger often reveals areas where I do the same thing twice: I loop through an array in one method to calculate some value, then loop through the same array somewhere else to perform some other operation. While loops can be fast, if you're manipulating large objects or arrays you still want to minimize them wherever you can. Sorting is expensive--wherever possible, let the database pre-sort your results for you. Look for opportunities to make work you're doing in one part of your application do double duty and handle a task you're doing elsewhere.
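A tiny illustration of collapsing two passes over the same array into one; the numbers are arbitrary:

```php
<?php
$orders = [12.50, 3.99, 20.00, 7.25];

// Two separate loops: one computes the total, one finds the maximum.
$total = 0.0;
foreach ($orders as $o) { $total += $o; }
$max = 0.0;
foreach ($orders as $o) { if ($o > $max) { $max = $o; } }

// One combined pass does both jobs in a single walk over the data.
$total2 = 0.0;
$max2 = 0.0;
foreach ($orders as $o) {
    $total2 += $o;
    if ($o > $max2) { $max2 = $o; }
}

echo "total=$total2 max=$max2\n";
```

With four elements it makes no measurable difference; with a large array of heavy objects, halving the passes adds up.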
Writing code is a lot like writing anything else--it takes time to distill down to the essence. Early drafts can be much wordier than later drafts. If you have the time to go back and consolidate the areas of work, you'll get a small performance benefit out of this.
Out of this list, this item is the least important. Try to consolidate as much as you can the first time through your code, but caching will far more than make up the difference. These are the slight improvements to save for future revisions--but if you see an obvious opportunity to combine and simplify code, take it.
Use Lazy-Loading wherever it makes sense
If your application needs to hit the database on every single request, go ahead and open a database connection early. If on some requests your application just returns static data, save a tiny bit of processing and skip the database connection. On a few projects, I've written code that connects to multiple databases, so I've written a simple stub class that maintains a singleton database connection object. In every method that connects to a database, it calls the static method that returns the database connection object, creating and establishing it if it doesn't already exist.
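A minimal sketch of that lazy singleton-connection stub; the class name and DSN are made up (SQLite here so the sketch runs standalone), and a multi-database version would keep one such object per connection:

```php
<?php
class Db {
    private static ?PDO $conn = null;

    public static function get(): PDO {
        // Nothing is opened until the first caller actually asks for it.
        if (self::$conn === null) {
            self::$conn = new PDO('sqlite::memory:');
            self::$conn->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
        }
        return self::$conn;
    }
}

// A request that never calls Db::get() never pays for a connection;
// repeated calls all reuse the same PDO object.
$a = Db::get();
$b = Db::get();
var_dump($a === $b);
```

Every data-access method calls `Db::get()` instead of holding its own connection, so static-only requests skip the connect entirely.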
We program extensively with Smarty, and in some projects use Smarty's caching system. When used with a lazy-loading design, it's extremely effective at speeding up page views. In our "standard" architecture, we have a controller stub that the browser requests. This stub examines the request, identifies the view and the data objects to load, and sometimes creates controller objects to handle specific requests. However, if you're using a caching system, you need to check for a cached version before doing any of this processing. Either check the cache at the top of your controller, or move your controller itself into a file that's loaded by a Smarty template. By having the template load the controller and decide what to do next, that processing never happens if Smarty retrieves the cached template instead.
Now that we program a lot with Ajax, we no longer automatically create a Smarty object for every request--first we check whether we're returning HTML, XML, JSON, or something else, and only create the Smarty object for particular types of views.
These are examples of how we use lazy loading to avoid loading large chunks of code or establish database connections we never use.
Plan early on for caching
When you first launch your application, you probably don't need caching because you're not getting that much traffic. Some applications only run in private networks and never need to do any caching. But if you're building a Facebook application or expecting huge amounts of traffic someday, create strategies for caching early on.
As I mentioned earlier, Smarty does this extremely well. You need to provide a way to uniquely identify an item in the cache, and Smarty will do it for you. Just make sure you check for the cached version before doing a lot of extra processing.
Without Smarty, it's relatively easy to use output buffering to capture the output of your code and store it somewhere for later retrieval.
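A minimal file-based page cache built on output buffering, along those lines; the cache path, key scheme, and timeout are arbitrary choices for the sketch:

```php
<?php
function cached_page(string $key, int $ttl, callable $render): string {
    $file = sys_get_temp_dir() . '/cache_' . md5($key) . '.html';

    // Serve the stored copy if it's still fresh--no rendering work at all.
    if (is_file($file) && (time() - filemtime($file)) < $ttl) {
        return file_get_contents($file);
    }

    // Otherwise capture the expensive rendering and store it for next time.
    ob_start();
    $render();
    $html = ob_get_clean();
    file_put_contents($file, $html);
    return $html;
}

$page = cached_page('home', 300, function () {
    echo '<h1>Hello</h1>'; // stands in for slow page generation
});
echo $page, "\n";
```

The important part is the order of operations: check the cache before doing any rendering, exactly as with Smarty's cached templates.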
Many projects designed for traffic, including Drupal and Joomla, have simple switches you can turn on to take advantage of caching. After caching as much HTML as possible, the problem turns into more of a system administration project--in our experience, installing an opcode cache like eAccelerator can help your server handle 30-40% more traffic. These systems essentially compile your PHP and cache the result for more speed.
The next level of caching, for truly large sites, is a system like memcached, which distributes a cache across multiple servers--so at that scale, the problem starts involving developers again. PHP provides a memcache module you can use to store and retrieve your pages in memcached. When your site outgrows what can run on two servers, it's time to have your system administrators set up a memcached cluster and rewrite your application to use it.
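The lookup logic is the same whatever the backend, so this sketch writes it against a tiny interface; `fetch_page()` and the stand-in page builder are hypothetical, and in production the `Cache` implementation would wrap PHP's memcache extension (e.g. connecting to a `cache1:11211` server) rather than the in-memory stub used here to keep the example self-contained:

```php
<?php
interface Cache {
    public function get(string $key);
    public function set(string $key, $value, int $ttl): void;
}

// Check the cache first; only build (and store) the page on a miss.
function fetch_page(Cache $cache, string $key): string {
    $page = $cache->get($key);
    if ($page === false || $page === null) {
        $page = '<html>expensive page for ' . $key . '</html>'; // stand-in
        $cache->set($key, $page, 300);
    }
    return $page;
}

// In-memory stub standing in for a real memcached-backed implementation.
class ArrayCache implements Cache {
    private array $store = [];
    public function get(string $key) { return $this->store[$key] ?? null; }
    public function set(string $key, $value, int $ttl): void {
        $this->store[$key] = $value;
    }
}

$cache  = new ArrayCache();
$first  = fetch_page($cache, '/home'); // miss: builds and stores the page
$second = fetch_page($cache, '/home'); // hit: served straight from the cache
echo ($first === $second) ? "hit\n" : "miss\n";
```

Coding against the interface also makes the later move to a memcached cluster a backend swap rather than an application rewrite.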
Avoid over-engineering your application
I inherited another project gone awry that had started with some really huge, complicated framework that seemed half-done. Most projects we're called in to complete involve spaghetti code, mixed logic and presentation, and no clear architecture. This one, in contrast, was over-engineered for the problem. To figure out how the code worked, I ran it through a debugger. To get to my main class for a particular object, it ran through a series of no less than 8 inherited classes. And worse, some utility methods were copied between child classes, instead of being put once higher in the class hierarchy. I saw clear reasons for having 3 layers of inheritance in this application. Not 8.
Since then I've seen a few cases where developers create more inherited classes because it seems like the "correct" thing to do, not because there's any practical value in it. I rarely see the need for more than 3 levels of object inheritance, and never more than 4 (at least in a web application). When your application needs to open 20 files just to respond to a simple AJAX data request, that's over-engineered. When you create an elaborate class structure just to avoid a simple function, that's over-engineered.
There's a scale here, from non-engineered spaghetti code to rigid, sophisticated frameworks. I suspect that most people without formal training start with spaghetti code and gradually learn how to create more structured code--while computer science majors start out with over-engineered structures and eventually loosen up in the real world after running their code through some profilers and realizing they don't need all that complexity for a simple problem. Everyone over time, at least anyone with a knack for this stuff, ends up somewhere in the middle, with enough architecture to do the job--and little more. There's definitely some variation here as a matter of taste, but there are measurable problems with either extreme.
I further suspect that Rails might be so popular now because a lot of web developers out there with no formal training are suddenly seeing the benefits of structured code and smart frameworks.
Keep in mind how expensive each operation is
Some actions take a while to complete. In our experience, the most expensive actions involve connecting to another server, especially one not in the same data center. Keep these costs in mind while coding, and don't perform them unless necessary. For very expensive operations, especially when you need to do a bunch at once, consider forking a process with a call to the shell, or move the work to a maintenance routine called from a cron job. Here's my rough ranking, from most expensive to least:
- curl to connect to another server
- Other functions used to connect to remote servers: fopen, file, etc.
- domxml, SimpleXml on very large XML documents
- Sending mail to multiple recipients
- Sorting on large arrays
- Database connection to remote server
- domxml, SimpleXML on medium-sized documents
- Recursive functions
- Individual database queries
- domxml, SimpleXML
- Creating complex objects
- Loading large files
- XML event-based parsers
- Retrieving cached files
- Loops on small arrays
- Lookups in hashes stored in memory, retrieving constants
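For the items near the top of that list, here's a sketch of pushing an expensive batch (say, mailing many recipients) out of the request, as suggested above. The job-file scheme is hypothetical: the request just writes out the work and shells out a detached command (the trailing `&` means we don't wait), or a cron-driven maintenance script could pick the file up later. A harmless `echo` stands in for the real worker command so the sketch runs anywhere:

```php
<?php
// Record the work to be done instead of doing it inside the request.
$jobFile = tempnam(sys_get_temp_dir(), 'job_');
file_put_contents($jobFile, json_encode(
    ['to' => ['a@example.com', 'b@example.com']]
));

// Fire-and-forget: in production this might be `php send_newsletter.php`,
// detached with "&" so the page returns immediately. Here, `echo` is a
// stand-in for that worker command.
$cmd = 'echo ' . escapeshellarg($jobFile) . ' > /dev/null 2>&1 &';
exec($cmd);

echo "queued\n"; // the request finishes fast; the heavy work happens elsewhere
```

Either way, the user never waits on the slow operation, and failures in the batch don't take the page down with them.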
Do you have any other tips for writing fast PHP code? Please add a comment below...