Programming is an exercise in understanding a problem. To program effectively, you need to fully understand, in intricate detail, the problem your program is solving. Sometimes as a programmer you don't fully understand the problem until you've wrestled with it a few times in code.
Most experienced programmers will tell you that when creating a large program, you almost always have to scrap your work at least once. At some point, you find that you've programmed your way into a dead end, and that you can't get where you're trying to go without starting over. This is part of the process of understanding the problem. Usually, once you've made this leap, you can visualize the whole thing laid out before you, and the next go-around leads to a useful, functioning program. Not only that, but the next go-around has a much higher percentage of clear, understandable code.
Clarity in code is a sign of the maturity of the application. It's also a sign of requirements that haven't changed from the original. Inevitably, in the real world, code accumulates hairy sections to deal with changing requirements, accreting moss, dirt, and all sorts of cruft as the real world steps in to make things messy. The more clear, organized, well-defined, and well-documented a code base is, the longer it will last in the real world before needing a major revision.
If you see a project that seems completely transparent, easy to figure out, and easy to change, you're probably looking at code that has been through some serious revision, and has been recently refactored to reflect the problem it's trying to solve. As long as the fundamental assumptions of the design do not change, clean code is easy to enhance, extend, and otherwise adjust to meet new requirements. That lasts until it gets hairy again and it's time to start over.
Clean code is elegant. Clean code is flexible. Clean code is related to powerful code, but code can be powerful without being clean.
Here are some principles we use to develop or identify clean code.
Use a good overall architecture for your application.
Like many other software companies, we use a Model-View-Controller architecture for most of our projects. The Model defines the problem space, what data needs to be stored, and how it's broken down. The View is the human interface, the presentation of the software to the user. The Controller connects the model to the view, and often enforces authorization rules and the interface to other systems.
In our applications, the model is almost always object-oriented. We build up classes of objects that correspond to what we're modeling. We like using template systems like Smarty for the view, so our designers and front-end coders can change the presentation without affecting core business logic. Our controllers are a mix of objects and functional code, whatever seems most appropriate for the overall system.
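As a rough illustration of that split, here is a minimal sketch in plain PHP. The Task class and both functions are hypothetical, and a plain function stands in for a Smarty template so the example runs without any library installed.

```php
<?php
// Model: defines the problem space. Task is a hypothetical example class.
class Task
{
    public $title;
    public $done;

    public function __construct(string $title, bool $done = false)
    {
        $this->title = $title;
        $this->done = $done;
    }
}

// View: presentation only. A plain function stands in for a template.
function renderTaskList(array $tasks): string
{
    $html = "<ul>\n";
    foreach ($tasks as $task) {
        $label = htmlspecialchars($task->title);
        $status = $task->done ? 'done' : 'open';
        $html .= "  <li class=\"{$status}\">{$label}</li>\n";
    }
    return $html . "</ul>\n";
}

// Controller: connects the model to the view and enforces a simple
// authorization rule before anything is rendered.
function taskListController(bool $loggedIn, array $tasks): string
{
    if (!$loggedIn) {
        return "Please log in.\n";
    }
    return renderTaskList($tasks);
}

$tasks = [new Task('Write docs', true), new Task('Fix <bug>')];
echo taskListController(true, $tasks);
```

A designer could rewrite renderTaskList() freely without touching Task or the login check, which is the point of keeping the three roles apart.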
Normalize data as much as practical.
In database terms, normalization is the process of identifying all the properties of all the objects that have a one-to-one relationship to each other, that fit cleanly in the same database table. For example, a contact has only one first name and one last name, one father, and one mother (at least in the biological sense), but might have more than one email address, mailing address, and phone number. When modeling this data structure, you might decide to have one contact table that allows for 3 email addresses. Or you might have a separate email address table that allows any number of email addresses associated with a contact. If you were going to fully normalize this data, you would have separate email address tables, phone number tables, and physical address tables. But is this really practical? Does your particular system need to track all the email addresses of a user, or is one (or two) enough? If you can limit it to one email address, it might make a fine unique identifier for your system, if you know your users don't share email addresses.
But if you're going to track three contacts for a company, why not normalize this into a separate table, and remove the arbitrary limitation? I shudder when I see fields named "email1, email2, email3, email4."
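To make the difference concrete, here is a small sketch that splits a flat row with numbered email columns into one contact row plus any number of rows for a separate email table. The column names, the contact_email table name, and the normalizeContactRow() helper are all assumptions made up for this illustration.

```php
<?php
// A denormalized row, as it might come out of a contact table with
// numbered email columns. Column names are hypothetical.
$flatRow = [
    'id' => 7,
    'first_name' => 'Ada',
    'email1' => 'ada@example.com',
    'email2' => '',
    'email3' => 'ada@work.example',
];

/**
 * Split a flat contact row into one contact row plus zero or more
 * rows destined for a separate contact_email table.
 */
function normalizeContactRow(array $row): array
{
    $contact = [];
    $emails = [];
    foreach ($row as $column => $value) {
        if (preg_match('/^email\d+$/', $column)) {
            // Empty slots simply produce no row in the email table.
            if ($value !== '' && $value !== null) {
                $emails[] = ['contact_id' => $row['id'], 'email' => $value];
            }
        } else {
            $contact[$column] = $value;
        }
    }
    return ['contact' => $contact, 'contact_email' => $emails];
}

$tables = normalizeContactRow($flatRow);
echo count($tables['contact_email']), " email rows\n"; // 2 email rows
```

The normalized shape has no built-in limit: a contact with ten addresses just has ten rows, and a contact with none has no rows.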
Each database table should be owned by a single class.
If you have a contact table, you should probably have a contact class to manage it. While other classes may query this table in a join, those classes should be getting only specific fields from the table. Only the contact class should write to the contact table, and in most cases, all requests for any contact details should go through the contact class. The rest of your application should talk to a contact object, rather than the underlying data, except when you're trying to optimize for speed.
The main benefit of this approach is that you can more easily change the structure of your database tables with minimal impact to your application. If you decide that you really do need more than one email address for a contact, you can do most of the heavy lifting in the contact class, and only need to make small changes to the template to show the new data. The other parts of your application should be unaffected, because they simply request the default email address from your contact object--which is smart enough to know that's now coming from a different table.
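Here is a minimal sketch of that idea. To keep it runnable without a database, a private static array stands in for the email table; the class layout, the is_default flag, and the method names are assumptions for illustration, not a description of any particular code base.

```php
<?php
class Contact
{
    // Stand-in for a contact_email table: contact id => list of rows.
    // In a real application this would be a database table.
    private static $emailTable = [
        7 => [
            ['email' => 'ada@work.example', 'is_default' => false],
            ['email' => 'ada@example.com', 'is_default' => true],
        ],
    ];

    private $id;

    public function __construct(int $id)
    {
        $this->id = $id;
    }

    // The rest of the application calls this and never learns whether
    // the address lives in a column or in a separate table.
    public function getDefaultEmail(): ?string
    {
        foreach (self::$emailTable[$this->id] ?? [] as $row) {
            if ($row['is_default']) {
                return $row['email'];
            }
        }
        return null;
    }

    // Only this class writes to the storage it owns.
    public function addEmail(string $email, bool $isDefault = false): void
    {
        if ($isDefault) {
            foreach (self::$emailTable[$this->id] ?? [] as $i => $row) {
                self::$emailTable[$this->id][$i]['is_default'] = false;
            }
        }
        self::$emailTable[$this->id][] = ['email' => $email, 'is_default' => $isDefault];
    }
}

$contact = new Contact(7);
echo $contact->getDefaultEmail(), "\n"; // ada@example.com
```

Because callers only see getDefaultEmail(), the storage behind it can change shape without any edits outside the class.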
If you really need to do sophisticated table joins to make your application fast, consider setting up a query builder structure. We sometimes set up static methods on a class that modify the different parts of a query to add the desired fields and do the appropriate joins.
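One way such a query builder might look is sketched below. The array layout of the query, the addToQuery() method name, and the table and column names are all hypothetical; the point is only that each class contributes the fields and joins for the table it owns.

```php
<?php
class Project
{
    // Adds the project fields and the join needed to reach them.
    public static function addToQuery(array $query): array
    {
        $query['fields'][] = 'project.name AS project_name';
        $query['joins'][] = 'JOIN project ON project.id = task.project_id';
        return $query;
    }
}

class Client
{
    // Assumes the project table has already been joined.
    public static function addToQuery(array $query): array
    {
        $query['fields'][] = 'client.name AS client_name';
        $query['joins'][] = 'JOIN client ON client.id = project.client_id';
        return $query;
    }
}

// Assemble the collected parts into a single SELECT statement.
function buildSelect(array $query): string
{
    $sql = 'SELECT ' . implode(', ', $query['fields'])
         . ' FROM ' . $query['from'];
    foreach ($query['joins'] as $join) {
        $sql .= ' ' . $join;
    }
    return $sql;
}

$query = ['fields' => ['task.title'], 'from' => 'task', 'joins' => []];
$query = Project::addToQuery($query);
$query = Client::addToQuery($query);
echo buildSelect($query), "\n";
```

Knowledge of each table's columns stays inside the class that owns the table, even though the final query spans three tables.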
Define who is responsible for what.
I'm not talking about people here--I'm talking about classes, files, and functions. Just like classes in the model own particular database tables, you should define which part of the application is responsible for each of the major concerns of the application: authentication, authorization, state, the structure of the URL, form handling, initialization, etc. Each one of these concerns should be owned by a particular part of the application. We usually leave this "meta" stuff about the system in the controller, often with included files dedicated to particular features. We usually build helper methods into base classes inherited by all of our data objects in the model, specifically for state and authorization.
Authentication, verifying that a user is who they say they are, should be consistent across your application. You usually have people log in with a username and password. The problem is, because the Web is stateless, you need to verify that you're still talking with the same user on every single request. To do this, you either use HTTP authentication, which passes the same credentials with each request, or you give the browser a token that you match up with a session on the server. Your web application needs to verify the session or credentials on every single request that does anything you don't want the Internet at large to be able to do.
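The token approach can be sketched roughly as follows. A plain array stands in for server-side session storage, and the function names, the array keys, and the one-hour lifetime are assumptions for illustration; a real application would more likely lean on PHP's own session handling or a database table.

```php
<?php
// Issue a hard-to-guess token at login and remember who it belongs to.
function issueToken(array &$sessions, string $username): string
{
    $token = bin2hex(random_bytes(16)); // 32 hex characters
    $sessions[$token] = ['user' => $username, 'expires' => time() + 3600];
    return $token;
}

// Run this at the top of every request that is not meant to be public.
// Returns the username, or null if the token is missing, unknown, or expired.
function authenticate(array $sessions, ?string $token): ?string
{
    if ($token === null || !isset($sessions[$token])) {
        return null;
    }
    if ($sessions[$token]['expires'] < time()) {
        return null;
    }
    return $sessions[$token]['user'];
}

$sessions = [];
$token = issueToken($sessions, 'ada');
var_dump(authenticate($sessions, $token));   // string(3) "ada"
var_dump(authenticate($sessions, 'forged')); // NULL
```

Keeping the check in one function means there is a single place to look when you need to confirm that every protected request is actually verified.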
Authorization, granting access to particular objects and methods for particular users, can be a bit more complicated. There are several different models for authorization: simple ownership, group ownership, user levels, and full-fledged access control lists. Authorization can either be handled by the controller or by the model itself. If the code is clear, it should be apparent where authorization is handled, and how it may be changed.
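As a small sketch, here are two of those models combined: simple ownership plus user levels. The level constants, the array keys, and the canEdit() function are hypothetical names chosen for the example.

```php
<?php
const LEVEL_USER = 1;
const LEVEL_ADMIN = 9;

// One function answers the question, so it is apparent where the rule
// lives and how to change it.
function canEdit(array $user, array $object): bool
{
    if ($user['level'] >= LEVEL_ADMIN) {
        return true; // user level: admins may edit anything
    }
    return $object['owner_id'] === $user['id']; // simple ownership
}

$ada  = ['id' => 1, 'level' => LEVEL_USER];
$bob  = ['id' => 2, 'level' => LEVEL_USER];
$root = ['id' => 3, 'level' => LEVEL_ADMIN];
$task = ['owner_id' => 1];

var_dump(canEdit($ada, $task));  // bool(true)
var_dump(canEdit($bob, $task));  // bool(false)
var_dump(canEdit($root, $task)); // bool(true)
```

Whether a check like this is called from the controller or from a base class in the model matters less than having exactly one obvious place for it.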
Small Pieces Loosely Joined.
Even more than powerful programming, clear programming means breaking things up into manageable, understandable chunks. Each class in the model should correspond to an object in the real world you're modeling. Typical methods on the classes in our models are between 5 and 25 lines of PHP code. Some reach 30 or 40 lines, and only the really ugly ones reach 100 lines. If a method is getting that long, it can probably be broken into several smaller helper methods that make the main method more readable. If these helper methods can be reused by other methods, well, you're killing two birds with one stone. More often than not, this level of refactoring distills the essence of the problem down into components that make your code more powerful.
Most of the long methods in our code seem to be related to form processing, parsing different parameters to insert or update data across multiple database tables. Through a combination of setting up property maps inside the object, clever getter and setter methods, and utility methods that iterate across relevant properties, these long methods can be cut down to a few calls that make the method much more portable, more resilient to bad data, and more easily overridden from subclasses, too.
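A property map of this kind might look something like the sketch below. The TodoItem class, its field names, and the cleaner methods are all invented for illustration; the idea is that one short loop replaces a long run of per-field parsing code.

```php
<?php
class TodoItem
{
    public $title = '';
    public $priority = 0;
    public $due = null;

    // Form field name => [property it fills, method that cleans the value].
    // Subclasses can override this map to accept more fields.
    protected static $propertyMap = [
        'title'    => ['title', 'cleanString'],
        'priority' => ['priority', 'cleanInt'],
        'due_date' => ['due', 'cleanDate'],
    ];

    // Copy only the mapped fields from the form, cleaning each one.
    // Fields that are not in the map are ignored.
    public function setFromForm(array $params): void
    {
        foreach (static::$propertyMap as $field => [$property, $cleaner]) {
            if (array_key_exists($field, $params)) {
                $this->$property = $this->$cleaner($params[$field]);
            }
        }
    }

    protected function cleanString($value): string
    {
        return trim((string) $value);
    }

    protected function cleanInt($value): int
    {
        return (int) $value;
    }

    // Returns a Y-m-d date, or null when the input cannot be parsed.
    protected function cleanDate($value): ?string
    {
        $time = strtotime((string) $value);
        return $time === false ? null : date('Y-m-d', $time);
    }
}

$item = new TodoItem();
$item->setFromForm(['title' => '  Ship it ', 'priority' => '3', 'is_admin' => '1']);
echo $item->title, ' / ', $item->priority, "\n"; // Ship it / 3
```

The unmapped is_admin field never reaches the object, which is one way this style ends up more resilient to bad data.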
Create effective documentation.
I'm just starting to get into the habit of writing JavaDoc/PHPDoc-style comments that document each function and method. I'm a longtime user of the Komodo IDE from ActiveState, and as you type a function call, it kindly shows the comment that immediately precedes that function's definition in a tooltip while you provide parameters. Being able to see what parameters your method is expecting, what it returns, and any gotchas about using it, without opening the file containing the class, saves a lot of time during development. Those kinds of comments I consider to be required.
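For example, a PHPDoc-style comment covering the parameters, the return value, and a gotcha could look like this. The tasksDueBy() function and the shape of its input are made up for the illustration.

```php
<?php
/**
 * Find the tasks due on or before a given date.
 *
 * Gotcha: tasks without a due date are never returned.
 *
 * @param array  $tasks List of arrays with 'title' and 'due' keys,
 *                      where 'due' is a Y-m-d string or null.
 * @param string $date  Cutoff date in Y-m-d format.
 *
 * @return array The matching tasks, in their original order.
 */
function tasksDueBy(array $tasks, string $date): array
{
    return array_values(array_filter($tasks, function ($task) use ($date) {
        // Y-m-d strings sort correctly as plain text.
        return $task['due'] !== null && strcmp($task['due'], $date) <= 0;
    }));
}

$tasks = [
    ['title' => 'Invoice', 'due' => '2024-03-01'],
    ['title' => 'Someday', 'due' => null],
    ['title' => 'Release', 'due' => '2024-04-15'],
];
echo count(tasksDueBy($tasks, '2024-03-31')), "\n"; // 1
```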
On the other hand, a comment that states the obvious is a waste of space. Comment anything unusual or unexpected. For example, if I assign a variable in an "if" expression, I'll put a comment that I meant to assign it, that it's not just missing the extra =.
if ($a = $b->value) // intentionally assigns value to $a; skips section if value is falsy (false, 0, "", null)
Related to inline code comments, use descriptive variable names, and consistent placeholders. I use $i, $j, $k for loops, $ar for generic arrays in helper functions, $obj for an unknown object, $t for a global Smarty template object. Otherwise I'm referring to $task, $oldtask, $project, $user, and $todotomorrow.
For complex projects, inline comments are not enough. You need a solid architectural document that illustrates objects and their relationships, workflow, and how to customize. Diagrams are good.
Finally, clear code is tidy code. While PHP isn't as picky about tabs and whitespace as Python, properly nested code blocks promote readability, help keep your code valid, and give you a quick indication of how deep you are inside a function.
Clear code invites customization, enhancement, and further development. Clear code is maintainable, and a sign that an application can likely be kept up-to-date for quite a while to come. Clear code takes more time to develop, but usually indicates a better understanding of the problem. Clear code is more portable, more reusable for other purposes, and more powerful.