Monday, September 10, 2007

Creating search engine friendly URLs in PHP

One of the major reasons for using a server-side language such as PHP is the ability to generate dynamic content. Often this will lead to single scripts that produce their content based on the input parameters (that is, the variables in the URL).

This article covers various techniques for representing these parameters in the URL in a clean and “friendly” manner, and then shows how to read them back in your script.

If you’re not quite sure what I mean, take the following example. This website (phpRiot) stores each of its articles in a database table called articles. Now, we could have built the site so all articles were referenced by their article ID, such as:

  • http://www.phpriot.com/article.php?article_id=1234

However, this isn’t necessarily the best way to do it. Firstly, if people have read a large number of articles on the web site, then their URL history will contain a whole bunch of different IDs, so they won’t be able to go directly back to an article without either bookmarking it or visiting the home page and finding the link.

More importantly though, this is wasting valuable data that the search engines can use in indexing your site. phpRiot has been built in such a way that each article is accessed using a more human-readable URL. For example, a previous article on this site is accessed with the following URL:

  • http://www.phpriot.com/d/articles/php/application-design/multi-step-wizards/index.html

This article will go over how to read URLs like this and map them back to the data in your database. There are several methods that can be used with PHP, so we will cover each of these and discuss their pros and cons.

Some of the ideas (such as the manual parsing of URLs) apply to more than one method, and we will point this out as we go.

Apache’s mod_rewrite

The first method we will look at is the mod_rewrite module that comes with Apache. This module works by matching the requested URL against a set of predefined rules, and then passing in the data to the specified script in the format you determine.

Let’s say that we have a script called news.php in the root directory of the web site (so you could access it via http://www.example.com/news.php). This script is responsible for outputting a single news article, as chosen by the news_id parameter passed in the URL.

So if you were trying to access the news article with an ID of 63, you would use http://www.example.com/news.php?news_id=63.

Instead though, we want to make this a bit fancier, so rather than passing an ID as a parameter in the URL, we want to access articles using http://www.example.com/news/63.html. There’s no particular reason for having it like this – it’s just for our example.

Anyway, we can make this happen with mod_rewrite with a very simple rule, either in the web server config (httpd.conf), or in a .htaccess file in the web site directory.

The contents would look like this:

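Highlight: Plain

RewriteEngine on
# The optional leading slash lets the rule work in both httpd.conf
# (where the path starts with /) and .htaccess (where it does not)
RewriteRule ^/?news/([0-9]+)\.html$ /news.php?news_id=$1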

Using the above regular expression, we match all requests whose path starts with news/, followed by a number and then .html. Anything captured in brackets is stored in a backreference such as $1 or $2 (we only have one set of brackets, so only $1 is set here).

We then use the $1 parameter in the destination URL. Now inside the news.php script, we just access the news_id parameter as we would have if we called the script in the original way. That is:

Highlight: PHP

<?php
$news_id = $_GET['news_id'];
?>
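
To get a feel for what the rule does to a request path, a similar pattern can be tried out in plain PHP. This is only a sketch: rewriteNewsUrl() is a hypothetical helper written for this illustration, and it mimics just this one rule (anchored at both ends) rather than Apache’s actual matching.

```php
<?php
// Hypothetical helper that mimics the single rewrite rule from this section.
// It is an illustration only; on a real site Apache does this work.
function rewriteNewsUrl(string $path): string
{
    // /news/<digits>.html becomes /news.php?news_id=<digits>
    // '!' is used as the delimiter so the slashes need no escaping.
    $rewritten = preg_replace(
        '!^/news/([0-9]+)\.html$!',
        '/news.php?news_id=$1',
        $path,
        1,
        $count
    );

    // If the pattern did not match, the path is returned unchanged.
    return $count > 0 ? $rewritten : $path;
}

echo rewriteNewsUrl('/news/63.html'), "\n";
echo rewriteNewsUrl('/about.html'), "\n";
```

The first call should print the news.php form of the URL, while the second path does not match and is passed through untouched.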

Extra URL parameters

Sometimes you want to pass extra URL parameters to the script. Going back to our example above, perhaps the news.php script accepts an extra parameter called ‘print’, which displays a printer-friendly version of the article (technically you should be using CSS for this, but that doesn’t matter for this example).

So you would normally access the printer friendly version of article 63 using http://www.example.com/news.php?news_id=63&print=1.

Using our rewritten URL, we want to access the article using http://www.example.com/news/63.html?print=1. However, the rule we created above will simply discard the print parameter, because the query string in the rule’s target replaces the one in the request. To pass it on to the news script, we use the internal Apache variable %{QUERY_STRING} in the target of the rule, appending it to news_id with an ampersand. (mod_rewrite’s QSA flag is another way of achieving this.)

Highlight: Plain

RewriteEngine on
RewriteRule ^/?news/([0-9]+)\.html$ /news.php?news_id=$1&%{QUERY_STRING}

So now, we can access both parameters through $_GET.

Highlight: PHP

<?php
$news_id = $_GET['news_id'];
$printVersion = isset($_GET['print']);
?>
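
Values arriving through $_GET are plain strings supplied by the visitor, so it can be worth normalising them before use. The sketch below shows one possible way of doing that. readNewsParams() is a hypothetical helper, and treating any ‘print’ parameter that is present as “on” simply follows the isset() check used above.

```php
<?php
// Hypothetical helper: turn the raw query parameters into typed values.
function readNewsParams(array $get): array
{
    // The script can still be requested directly, bypassing the rewrite
    // rule, so only accept a string made up entirely of digits.
    // Anything else becomes 0.
    $newsId = 0;
    if (isset($get['news_id'])
        && is_string($get['news_id'])
        && ctype_digit($get['news_id'])) {
        $newsId = (int) $get['news_id'];
    }

    // The mere presence of "print" counts, whatever its value.
    $printVersion = isset($get['print']);

    return ['news_id' => $newsId, 'print' => $printVersion];
}

// parse_str() fills an array from a query string, much as PHP fills $_GET
parse_str('news_id=63&print=1', $query);
var_dump(readNewsParams($query));
```

In news.php itself you would pass $_GET to the helper instead of the parsed test string.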

So that’s all there is to using mod_rewrite. This is a very powerful and complex module, and it is easy to get into trouble using it. Sometimes it can be hard to get your patterns to match correctly, or you accidentally create recursive rules. There are several debug settings you can use to track down such problems, such as the RewriteLog and RewriteLogLevel directives (replaced in Apache 2.4 by the rewrite trace levels of the LogLevel directive).


Using the Apache ForceType directive

An alternative to using mod_rewrite is to use the ForceType directive instead. This allows files without a .php extension to be executed as PHP scripts. Normally web servers are configured so PHP scripts must finish with .php, so other non-PHP files (such as .html files) don’t have to be processed by PHP.

Going back to our example in the mod_rewrite section, instead of having a script called news.php in the root directory, our script would just be called ‘news’. So it would be accessed using http://www.example.com/news.

Using the following in our httpd.conf or a .htaccess, this ‘news’ file will be processed on the server as a PHP file.

Highlight: Plain

<Files news>
ForceType application/x-httpd-php
</Files>

Now, when we access our article using http://www.example.com/news/63.html, our news script is accessed directly, and we must parse out the “/63.html” part. This is stored in the server variable PATH_INFO.

Highlight: PHP

<?php
echo $_SERVER['PATH_INFO'];
// outputs '/63.html'
?>

So now we can use regular expressions to extract the number 63 from this string. There are other techniques you will find useful for extracting data too, such as PHP’s explode() function. For example, if you explode this string on ‘/’, then all parts of the path will be stored in an array (there’s only one part in this example though, so it’s not worth doing). Anyway, back to regular expressions.

Here is a regular expression (compatible with preg_match()) that looks for a string consisting of exactly a slash, then a number, followed by .html. It stores the matches in an array, from which we extract the article ID.

Highlight: PHP

<?php
$path = isset($_SERVER['PATH_INFO']) ? $_SERVER['PATH_INFO'] : '';

// $matches[0] will store the entire matched string, while $matches[1]
// stores the string matched in the first set of brackets. We want it
// to be an int, so we simply cast it. If the path doesn't match at
// all, we fall back to 0.
if (preg_match('!^/(\d+)\.html$!', $path, $matches))
    $news_id = (int) $matches[1];
else
    $news_id = 0;
?>

Normally we’d use / as the regex delimiter, but since we’re matching a slash, it’s tidier to use something different (in this case ‘!’). Additionally, we’re matching one or more (+) digits (\d), and we must escape the . since we’re matching a literal period (. normally means “any character”).

So that’s all there is to it. Now you can use the $news_id accordingly in that script. Of course, if the path is in an incorrect format, $news_id comes out as 0, so it is still safe to plug into your database query, even though no such article exists.
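
For paths with more than one segment, the explode() approach mentioned earlier becomes more attractive. Here is a rough sketch, using a longer path of the kind phpRiot uses. splitPathInfo() is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: split a PATH_INFO style string into its segments.
function splitPathInfo(string $pathInfo): array
{
    // trim() removes the leading (and any trailing) slash so that
    // explode() does not produce empty elements at the ends.
    $trimmed = trim($pathInfo, '/');

    // explode() on an empty string would return [''], so handle it here.
    return $trimmed === '' ? [] : explode('/', $trimmed);
}

print_r(splitPathInfo('/php/application-design/multi-step-wizards/index.html'));
```

Each element of the returned array can then be validated on its own, for example by checking the first segment against a list of known categories.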

Using a custom 404 handler

This is probably the most complicated method of achieving this result; however, it is also the simplest to expand upon and is more powerful than the others.

By taking advantage of Apache’s custom 404 handler, you can have a single controlling script that decides how all requests are handled. Of course, this applies only to requests that do not match an existing file. For example, if you have images on your web site, you can still access them in exactly the same way as before: the image file exists, hence the 404 handler will not be used.

Additionally, by taking advantage of PHP’s header() function, you can output, say, a ‘200 OK’ header rather than a ‘404 File Not Found’ header, so from the end user’s point of view, they have no idea the page wasn’t really found.

An example of where this would be used

Taking this idea further, you probably wouldn’t bother implementing a system like this for just a news handling engine as in our examples, but rather, on a larger site that has a lot more content.

For example, take a look at the following URL: http://www.phpriot.com/d/articles/php/index.html. phpRiot is actually using the ‘ForceType’ method with a PHP script called ‘d’ that handles requests, but we could have implemented it using this method.

Suppose we wanted our URL to look like this: http://www.phpriot.com/articles/php/index.html. Instead of creating this path on our web server for each and every article, we would use the 404 handler to parse out the article path like we currently do with our ‘d’ file.

Implementing the 404 handler

We’re not going to implement the example listed above as it involves other complexities not relevant to this article, so instead, we’ll implement our news article example. We’re also going to add in scope to handle other requests (other than news) and also for outputting error pages.

The first thing to do is to set up the 404 handler. This can be done either in a .htaccess or in the httpd.conf.

Highlight: Plain

ErrorDocument 404 /handler.php

This means that all requests that don’t match an existing file are passed to the handler.php script in our web root.

So in this script, we need to parse out the request. You can find the original request in the REDIRECT_URL server variable.

Highlight: PHP

<?php
$request = $_SERVER['REDIRECT_URL'];

// explode on / to find all the different request parts
$parts = explode('/', $request);

// flag to determine whether or not we've found content
$found = false;
$output = null;

// the first element will be empty, so we get rid of it
array_shift($parts);

// now we determine the type of content
switch ($parts[0]) {
    case 'news':
        // use a very similar regex to our previous example, and only
        // look up the article if the file name part actually matches
        if (isset($parts[1]) && preg_match('!^(\d+)\.html$!', $parts[1], $matches)) {
            $news_id = (int) $matches[1];

            // this function doesn't really exist, but if it
            // did it would return the news content if the article
            // was found, or return null if not
            $output = getNewsArticle($news_id);

            if ($output !== null)
                $found = true;
        }

        break;

    case 'articles':
        // here we would implement a handler to display an article,
        // say if they accessed http://www.example.com/articles/1234.html

        break;

    default:
        // unknown type of content, so $found stays false
}

if ($found) {
    // output a header to say the content exists, otherwise a 404 will be sent
    header('HTTP/1.1 200 OK');
    echo $output;
}
else {
    // no content was found. this should be automatically sent by the
    // server anyway, but we'll specify it just in case
    header('HTTP/1.0 404 Not Found');
    echo 'File not found';
}
?>

Obviously this script is slightly crude, but hopefully in its simplicity you can see how powerful this method can be and what possibilities it can open.
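
One way to make the handler easier to extend is to separate working out what was requested from producing the output. The following is a sketch under that assumption, not how phpRiot does it. resolveRequest() and the array it returns are made up for illustration, and it assumes every content type uses the same number-dot-html naming.

```php
<?php
// Hypothetical helper: work out which kind of content a URL refers to.
// Returns ['type' => string, 'id' => int], or null if nothing matches.
function resolveRequest(string $url): ?array
{
    // '/news/63.html' becomes ['news', '63.html']
    $parts = explode('/', trim($url, '/'));

    // Only two-part URLs such as /news/63.html are handled in this sketch
    if (count($parts) !== 2) {
        return null;
    }

    // Content types this sketch knows about
    $types = ['news', 'articles'];

    if (in_array($parts[0], $types, true)
        && preg_match('!^(\d+)\.html$!', $parts[1], $matches)) {
        return ['type' => $parts[0], 'id' => (int) $matches[1]];
    }

    return null;
}

var_dump(resolveRequest('/news/63.html'));
var_dump(resolveRequest('/images/logo.png'));
```

The handler script could then switch on the returned type to fetch the content, and send the 404 header whenever null comes back.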

