HTTP – Your platform is the web, so learn how it works. The deeper details of the HTTP protocol can be absolutely invaluable for any PHP developer to know. This will also help you understand proxies and caches in depth.

FastCGI – Most PHP deployments are on mod_php, but FastCGI can be a great alternative interface. It’s also universal, and will let you run many languages with only a single front-end service.

C – PHP is written in C, so it’s fundamental to writing extensions. By learning C well, you can not just write extensions, but also find and debug problems in the PHP scripting engine and existing extensions.

Network Programming – If you ever need to do something complicated with PHP, such as off-loading background processes, network programming is an extremely useful skillset. A basic understanding of sockets and daemons is already a huge benefit.

Your Operating System – I originally wanted to write ‘UNIX’ here. However, I realize that some of you deploy on Windows systems. I am not touching that argument with a ten-foot pole. Whatever your operating system is, learn it in depth. You’ll be surprised at what you can do.

A C10k nerd tries node.js

November 26, 2009

I’ve spent a lot of time as of late thinking about C10K-type problems. Obviously, we all know there are faster ways to do certain things than PHP. In this context, I mean simple performance, of course. I don’t consider development time because I won’t waste it with overly verbose stacks – but I digress.

In any case, I spent some time looking at node.js lately. node.js is not a client-side Javascript library; it is an addition to Google’s V8, which is an extremely fast Javascript engine. node.js describes itself as “Evented I/O for V8 Javascript”, and that is exactly what it is. As an example, here is a simple HTTP server written for node.js:

var sys = require('sys');
var multipart = require('multipart');
var http = require('http');

var handler_404 = function(req,res) {
  res.rtext(404,'four-oh-fail');
}

var handler_index = function(req,res) {
  res.rtext(200,'Oh, hai!');
}

var router = function(req,res) {

  res.easyResponse = function(rescode,restype,resdata) {
    res.sendHeader(rescode,{'Content-Type': restype});
    res.sendBody(resdata);
    res.finish();
  }     

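  // Pass the caller's status code through (it was previously hard-coded to 200,
  // so the 404 handler answered with a 200).
  res.rtext = function(code,text) { this.easyResponse(code,'text/plain',text); }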

  switch(req.uri.path) {
    case '/':
      handler_index(req,res);
      break;
    default:
      handler_404(req,res);
      break;
  }
}

http.createServer(router).listen(8000);
sys.puts('Server running at http://127.0.0.1:8000/');

As you can see, it’s extremely simple to write something moderately complex. node.js is also incredibly fast. My little 256 MB virtual machine for testing spat out over 3,000 requests per second for the above, after a little bit of tuning. I really like node.js so far. That being said, there are still a lot of things missing. For example, processing POST form variables is a no-go, although you can decode multipart forms. There are no drivers for any database that I am aware of at this time, et cetera, et cetera. In other words, it’s still pretty rough, but it’s getting there.
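The listing above uses the node.js API as it stood in late 2009. For anyone trying it on a current Node.js release, here is a rough equivalent, assuming the built-in `http` module's `writeHead`, `end`, and `req.url`. The `sendText` helper is a name made up for this sketch, and the listen call is left as a comment so the snippet can be loaded without binding a port.

```javascript
const http = require('http');

// Send a plain-text response with the given status code.
// (sendText is a hypothetical helper, not part of the http module.)
function sendText(res, code, text) {
  res.writeHead(code, { 'Content-Type': 'text/plain' });
  res.end(text);
}

// Same routing as the original example: '/' gets a greeting, anything else a 404.
function router(req, res) {
  // req.url includes the query string, so strip it before matching.
  const path = req.url.split('?')[0];
  switch (path) {
    case '/':
      sendText(res, 200, 'Oh, hai!');
      break;
    default:
      sendText(res, 404, 'four-oh-fail');
      break;
  }
}

// Creating the server does not bind a port; call server.listen(8000) to start it.
const server = http.createServer(router);
```

Keeping the status code as an explicit argument to the helper avoids the kind of slip where every response goes out as a 200.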

I also looked at Google’s Go. I like Go, and I think that if I were writing something complicated, I’d probably prefer it over node.js. However, node.js’s immediacy and low barrier to entry make it extremely attractive and I am really looking forward to seeing it grow.

Why Your PHP Site is Slow

November 19, 2009

I’m going to take a wild guess and say that your PHP site is hosted on an Apache 2 web server and that you’re using mod_php for your PHP request handling. That’s cool. That is a solid, dependable stack. I’m going to guess, though, that it went down the last time you were slashdotted. Let me show you how to fix that without having lolcats upgrade your RAM.

Let’s take a look at a page view to see what’s going on with your server. I am going to throw out a random server configuration for us to examine this. Let’s say you’re sitting on some sort of Pentium 4 with a gig of RAM, which is a reasonable expectation for a moderately-aged ‘personal’ web server. Let’s also say that you’re really kind of boring in your requirements, and you really just host your blog, which sits at myawesomeblog.tld. I am also going to assume you use WordPress, because it’s quite popular.

A user visits your website, and their web browser issues the following requests:

- /2008/01/01/how-to-train-puppies-with-php/ which is routed to index.php, which is WordPress.
- /wp-content/themes/puppies/style.css which is your style sheet.
- /wp-content/themes/puppies/jquery-1.3.2.js which is a Javascript library you need for the item below.
- /wp-content/themes/puppies/mycode.js which is some fancy javascript you use to make things be shiny.
- a ton of images in /wp-content/themes/puppies, which are support images for your layout. Let’s say you are loading 7 of them.

The WordPress page, plus the style sheet, the two javascript files, and seven images makes a total of eleven requests in order to satisfy the user’s page view. It almost certainly adds up to a relatively marginal amount of bandwidth, but it is quite a number of requests. Many sites have more; I have seen as many as 114 on poorly built sites, and I am sure there are worse out there still. But, wait, you ask, why are these requests bad? I am glad you asked!

So, as it would turn out, PHP is not what we call “thread-safe”. This is not necessarily a bad thing, and we won’t discuss the technical reasons, but it basically means that PHP doesn’t run in a multi-threaded model. Instead, each concurrent request is served by its own process. On UNIX systems, we accomplish this on Apache with a pre-forking model. That is, when you start Apache, it spawns a certain number of worker processes. These worker processes wait for the main process to receive a request and then forward it to them. If it needs more worker processes, due to there being more visitors, it spawns more worker processes, and so on. This is somewhat simplified, and the concurrency limits and other settings are adjustable.

That sounds pretty smart. Why can this be bad? Well, the thing to consider is that a forked process is a complete copy of the original. As such, the workers are pretty large. On an example installation I have in another window, each Apache worker process is 33 megabytes. When this process loads mod_php and loads a bunch of PHP files, it will consume some more memory, so let’s make it a flat 50 megabytes for the sake of example. Requests for static resources, like image and CSS files, are going to consume a bit more than 33 megabytes, but less than 50. Let’s assume 50 megabytes though, since we are looking for a capacity maximum.

As such, if your server has 1024 megabytes of memory, and we estimate that 200 megabytes are being used by the operating system and other processes, that leaves us 824 megabytes for Apache processes. 824 divided by 50 leaves us 16 and change. That means you can only safely process 16 simultaneous requests, otherwise you risk running out of memory, and hitting virtual memory, which means your disks start thrashing, and your request performance will fall through the floor. This will cause user requests to start queueing up, which means your load is going to go up, and everyone has a really bad time. The stock option for the maximum number of simultaneous Apache worker processes in most packages is 150. 150! That will sink the server 9 times over!

Each of our blog users needs 11 requests to satisfy their page view. As per the HTTP specification, they are limited to 2 simultaneous connections per host, so their requests will only arrive 2 at a time. However, with 16 workers, this means you are limited to 8 simultaneous users at any given point in time. That is not a lot, and it’s certainly a lot less than your hardware is capable of.
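To make the back-of-the-envelope numbers easy to redo for your own server, here is a small sketch of the same arithmetic. All the figures are the rough estimates from the example (1024 MB of RAM, 200 MB reserved, 50 MB per worker, 2 connections per browser), and the `capacity` function is just a name made up for this illustration.

```javascript
// Figures taken from the example above; all of them are rough estimates.
const totalMemoryMb = 1024;   // RAM in the example server
const reservedMb = 200;       // operating system and other processes
const workerMb = 50;          // assumed size of one Apache + mod_php worker
const connectionsPerUser = 2; // simultaneous connections per browser
const stockMaxWorkers = 150;  // stock worker limit quoted in the post

// Estimate how many workers fit in memory and how many users that serves at once.
function capacity(totalMb, reserved, perWorkerMb, connsPerUser) {
  const availableMb = totalMb - reserved;
  const maxWorkers = Math.floor(availableMb / perWorkerMb);
  const maxUsers = Math.floor(maxWorkers / connsPerUser);
  return { availableMb, maxWorkers, maxUsers };
}

const result = capacity(totalMemoryMb, reservedMb, workerMb, connectionsPerUser);
console.log(result); // { availableMb: 824, maxWorkers: 16, maxUsers: 8 }

// How far the stock limit overshoots what the memory can actually hold.
console.log((stockMaxWorkers / result.maxWorkers).toFixed(1)); // 9.4
```

Plug in the worker size you actually measure on your own box; the 50 megabytes here is deliberately pessimistic.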

How do we fix this? Well, for starters, you can move all your static resources (your CSS and Javascript files, your images, etc.) to a CDN. This is pretty easy these days. For example, you can upload them to Rackspace’s Cloud Files, which gives you instant CDN abilities for, at the time of this writing, 22 cents a gigabyte, if I remember correctly. All you really need to do is edit your theme to point your CSS and Javascript files to a different location.

Some people may not want to get a CDN account for their blog or such, even if it is very affordable. You can also, effectively, create your own one-node CDN by replacing the performance problem: Apache. There are many web servers that work well with PHP; one of them is nginx. Nginx works differently than Apache in that it does not spawn a process for every request. Rather, to explain it in a simplified way, nginx jumps back and forth between connections to serve requests, and only maintains the one process. This is an incredibly fast model, but we noted earlier that PHP needs to have a process per request. So how does that work? Well, there is an interface called FastCGI, which grew largely out of discontent with the CGI model. This interface allows us to create pools of PHP worker processes, much the same way that Apache created pools of Apache worker processes. The key point here is that PHP FastCGI workers have much lower memory requirements than an Apache worker process. On average, I find them to be 1/10th as large, with some PHP code loaded. The way an nginx + PHP configuration works is basically this:

- A request comes in to nginx.
- If the request is a simple static file, nginx serves it without further commentary, and does so incredibly quickly and with very little resource usage, especially in comparison to Apache.
- If the request is for a PHP page, nginx connects to a pool of PHP worker processes, grabs one, and has it process the PHP request.
- nginx takes the output from the PHP worker and serves the request.
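The decision in those steps can be sketched in a few lines. This is only a toy model of the idea, not how nginx is configured or implemented: the extension list, the `dispatch` function, and the image file name are all assumptions made for the illustration.

```javascript
// Extensions treated as static files in this sketch (an assumption, not nginx's rule).
const STATIC_EXTENSIONS = ['.css', '.js', '.png', '.jpg', '.gif', '.ico'];

// True when the path (ignoring any query string) ends in a static extension.
function isStaticFile(url) {
  const pathname = url.split('?')[0];
  return STATIC_EXTENSIONS.some((ext) => pathname.endsWith(ext));
}

// Decide who handles the request: the front-end itself, or a PHP worker from the pool.
function dispatch(url) {
  return isStaticFile(url) ? 'static' : 'php-worker';
}

// A shortened version of the page view from the blog example.
const pageView = [
  '/2008/01/01/how-to-train-puppies-with-php/',
  '/wp-content/themes/puppies/style.css',
  '/wp-content/themes/puppies/mycode.js',
  '/wp-content/themes/puppies/header.png', // hypothetical image name
];

// Count how many requests land on each side.
const counts = { static: 0, 'php-worker': 0 };
for (const url of pageView) {
  counts[dispatch(url)] += 1;
}
console.log(counts); // { static: 3, 'php-worker': 1 }
```

Only one request in the sample ever reaches a PHP worker; everything else is handled by the cheap side of the setup.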

Because the PHP workers are much smaller than the Apache workers, you can have the very fast nginx server process all your static materials, and simultaneously increase your capacity to serve PHP requests by as much as 10 times, depending on what your exact memory requirements end up being.

If you don’t want to do that, you can also make sure your static files are being served with a far-future Expires header, so that they are retained in the browser cache, and the client doesn’t come back with conditional requests on every page load to check that those files are still identical. If you don’t know what I am talking about, google it; it’s a big topic, but it’s worth reading. Once we’ve removed all these extraneous requests, you are down to 1 request per page view. That, in our model of a page that requires 11 resources, is a 90.9% reduction in capacity use, since 10 of the 11 requests go away.
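As a rough illustration of what “far-future” means in practice, here is a sketch that builds the two caching headers for a static file. The `farFutureHeaders` function and the one-year lifetime are assumptions for the example; how you attach the headers depends on your web server.

```javascript
// Build response headers that let browsers keep a static file for `days` days.
// (farFutureHeaders is a hypothetical helper written for this example.)
function farFutureHeaders(now, days) {
  const seconds = days * 24 * 60 * 60;
  const expires = new Date(now.getTime() + seconds * 1000);
  return {
    // max-age is the lifetime in seconds, relative to when the response is received.
    'Cache-Control': 'public, max-age=' + seconds,
    // Expires is an absolute HTTP date; toUTCString() produces that format.
    'Expires': expires.toUTCString(),
  };
}

// One year from the date of this post.
const headers = farFutureHeaders(new Date(Date.UTC(2009, 10, 19)), 365);
console.log(headers['Cache-Control']); // public, max-age=31536000
console.log(headers['Expires']);       // Fri, 19 Nov 2010 00:00:00 GMT
```

The catch with long lifetimes is that you have to rename a file (or add a version to its URL) when it changes, otherwise returning visitors keep the old copy.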

So, that’s why your PHP app is slow, and that is my two cents. I’ll address specific questions in comments if they are asked.

As of late, there is a bit of an uproar in the PHP community about sexism in IT, in general. I’m not a woman, not a feminist, so I won’t pretend to understand all the issues, but the primary issue, as I have perceived it, is as such: Information Technology has a significant RTFM (Read The F…ine Manual) mentality. This is detrimental to both IT community growth in general, and especially detrimental to achieving a population distribution more heavily slanted towards women. I can most certainly understand why an RTFM mentality would be isolating towards women. At the risk of being stoned, I will put forth that women probably don’t respond as neutrally to the very conflict-driven, do-or-die responses that are common in IT communities. I will concede that absolutely. Additionally, I will also concede that a brisk attitude towards newcomers in general is probably less adept at recruitment than a friendly approach. This is clear.

However, both arguments are completely irrelevant because IT needs that attitude in order to survive. IT fields, and in particular software development, are meritocratic, science-oriented disciplines. The software development discipline, in general, has outputs whose value and quality are exponentially proportional to the value and quality of the inputs. To present an example, let me choose phpbb2.

phpbb2 was/is an immensely popular bulletin board package, written in PHP, which initially was absolute crap. Yes, it worked nicely, but it was riddled with security holes and performance problems that nobody really noticed until it got popular. At one point, one security vulnerability was so bad that it forced several hosting providers to intervene and come up with ways of blocking its exploitation, which ranged from very smart Apache rulesets, to just plain disabling any host that ran the software. This was a very bad time. Some might argue that this doesn’t matter, since it was “just a bulletin board”, but it matters a lot. A lot of these bulletin boards ran on machines where the phpbb2 installation was used as an entry point to further penetrate the system. Other accounts on those machines had important data, or credit card numbers, or any number of things. So, yes, this was very important.

A simple statement summarizes this problem: If you write open source software, although in many cases you may not be legally responsible for the results, you are at least morally responsible for what happens when people start using this software. This is a simple projection of engineering ethics. The problem with the “everybody can do it” mentality is that, in fact, not everyone can do it. Not everyone can develop quality software. The software may be pretty and popular, but that doesn’t make it good. Your engineering mistakes in building that software can cost others a lot of money and/or time, and if you think that your mistakes will only be limited to individual servers, you are quite wrong. These days, cross-site scripting vulnerabilities and cross-domain problems wreak havoc across the web. I’m not trying to discourage anyone from learning to program. I encourage that. However, I do want to discourage the mentality that a lot of open source projects have these days; that crowd-sourcing is better. It’s often not. You will get a lot more software quality out of 10 experienced engineers than you will out of 10,000 rank amateurs. I think that defends my point adequately, and I don’t want to tangent too far, because this ties back into the larger topic of sexism.

The way the software development community defends itself against the above-mentioned problems of, without meaning offense to anyone, incompetence, is attitude. This industry, and the community as well, are very much Darwinian. If you aren’t productive, you’re in the way, and you’ll be eliminated. This leads to brisk responses, impatience, and subversion – and it’s 100% fine that way. Any given group, and I am not confining my statement to just women, that finds this to be a problem, really has to just learn to deal with it. I realize that sounds harsh, but the truth is that we as a community need that Darwinian mechanism unless we want to turn into an incompetent joke. If you are an amateur, please, please, please, don’t waste the time of much more experienced people by asking questions about things that you can easily look up. It’s a waste of time, and it bleeds productivity. Please, please read a few books and find your legs. Write a few things, make a lot of mistakes. Learn from those mistakes. Once you’ve established the proper vocabulary and base knowledge for you and us to communicate intelligently in this field, you can be certain no one is going to turn you away. We’re just really busy people.

All that being said, I don’t care what sex you are, what color you are, how tall you are, how you smell (since you’re probably not sitting next to me), or what your first language is. I really could not care less. However, things are the way they are for a reason – because this system works.

On Making it Better

November 7, 2009

If you develop software for a living, you’re plagued ceaselessly by thoughts about how to make the software you build better. Sometimes, this results in feature creep. One will often try to stuff more into an application in an effort to make it ‘better’, to make it ‘do more’. After some years of experience, you realize that ‘more’ is not the same thing as ‘better’, and you stop doing that.

Once you move past that, you start trying to make your software ‘better’ in other ways. Sometimes, this means that you start to over-think the problems that your software is trying to solve. By this time you understand that it becomes exponentially more difficult to change your software as it becomes more specialized, so you start to prepare for eventualities. You write code ‘safely’. In popular programming culture, this usually leads to someone coming up with some new paradigm that will end hunger and cure cancer, such as “object-oriented programming”, or “modular programming”, or lists of acronyms such as RAD, SCRUM, TDD, etc. Mind you, none of these are bad ideas, but in the context of the argument I am making, they are ways of looking at your code to make it ‘better’, by securing its malleability in the future.

The problem is that, for most of us, finding the right balance between making it “better” and making it work is something we never really master. While you are planning and re-factoring code, you aren’t actually getting anything done in terms of man-hours or deadlines. You’re just making it “better”. This is not a bad thing. That’s called computer science, and thinking is encouraged. However, I think the people who pay us may occasionally have very different ideas, and hence, we need that balance.

I pride myself on trying to stay very pragmatic in my approach to developing software for clients. I pride myself on the fact that I usually get the balance of “right or done” well adjusted, or at least close to it. However, I am horribly, horribly guilty of failing in that balance on this blog. I’ve started at least four separate code bases for my blog – and to what end? Because I wanted to make it better. I’ve decided that that is a complete waste of time, as I should have done in the beginning. So, I am starting over; one more time. Completely fresh, and hosted by someone else so that I can’t be tempted to touch it.

So, fresh start for me. Good. Now, a fresh start for you. What are you working on that should be done by now? Maybe it’s time to be a bit more pragmatic.