
Caching in PHP using the filesystem, APC and Memcached

By Evert on 2006-11-01 02:14:18

Caching is very important and really pays off in big internet applications. When you cache the data you're fetching from the database, the load on your servers can often be reduced enormously.


One way of caching is simply storing the results of your database queries in files. Opening a file and unserializing its contents is often a lot faster than running an expensive SELECT query with multiple joins.


Here's a simple file-based caching engine.


<?php

// Our class
class FileCache {

    // This is the function you store information with
    function store($key, $data, $ttl) {

        // Opening the file
        $h = fopen($this->getFileName($key), 'w');
        if (!$h) throw new Exception('Could not write to cache');

        // Serializing along with the TTL
        $data = serialize(array(time() + $ttl, $data));
        if (fwrite($h, $data) === false) {
            throw new Exception('Could not write to cache');
        }
        fclose($h);

    }

    // General function to find the filename for a certain key
    private function getFileName($key) {

        return '/tmp/s_cache' . md5($key);

    }

    // The function to fetch data; returns false on failure
    function fetch($key) {

        $filename = $this->getFileName($key);
        if (!file_exists($filename) || !is_readable($filename)) return false;

        $data = file_get_contents($filename);
        $data = @unserialize($data);

        if (!$data) {

            // Unlinking the file when unserializing failed
            unlink($filename);
            return false;

        }

        // Checking if the data has expired
        if (time() > $data[0]) {

            // Unlinking
            unlink($filename);
            return false;

        }

        return $data[1];

    }

}

?>

Key strategies


All data is identified by a key. Your keys have to be unique system-wide, so it is a good idea to namespace them. My personal preference is to name the key after the class that stores the data, combined with, for example, an id.


Example


Say your user-management class is called My_Auth, and all users are identified by an id. A sample key for cached user data would then be "My_Auth:users:1234", where '1234' is the user id.
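
A minimal sketch of this convention (the helper name and where it lives are just an illustration, not part of the class above):


<?php

// Hypothetical helper: builds a namespaced, system-wide unique cache key for a user
function userCacheKey($userId) {
    return 'My_Auth:users:' . $userId; // e.g. "My_Auth:users:1234"
}

?>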


Some reasoning behind this code


An earlier version of this class read the file in chunks of 4096 bytes, because this is often the default block size on Linux filesystems and reading in that size (or a multiple of it) is generally the fastest. Much later I found out that file_get_contents() is actually faster, which is what the class above uses.


Lots of file-based caching engines actually don't specify the TTL (the time it takes before the cache expires) at the time of storing data in the cache, but while fetching it from the cache. This has one big advantage: you can check whether a file is still valid before actually opening it, using the last modified time (filemtime()).


The reason I did not go with this approach is that most non-file-based cache systems do specify the TTL when storing the data, and as you will see later in the article, we want to keep things compatible. Another advantage of storing the TTL alongside the data is that we can create a cleanup script later that deletes expired cache files.
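
To make that last point concrete, such a cleanup script could look roughly like this (a sketch only, not part of the original class; it assumes the '/tmp/s_cache' prefix used by getFileName()):


<?php

// Hypothetical cleanup script: removes cache files whose stored expiry time has passed.
foreach (glob('/tmp/s_cache*') as $filename) {

    $data = @unserialize(file_get_contents($filename));

    // Delete files that are corrupt or expired
    if (!$data || time() > $data[0]) {
        unlink($filename);
    }

}

?>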


Usage of this class


The number one place in web applications where caching is a good idea is database queries. MySQL and other databases usually have a built-in query cache, but it is far from optimal, mainly because they have no awareness of the logic of your application (and they shouldn't have), and the cache is usually flushed whenever there is an update on a table. Here is a sample function that fetches user data and caches the result for 10 minutes.


<?php

// Constructing our cache engine
$cache = new FileCache();

function getUsers() {

    global $cache;

    // A somewhat unique key
    $key = 'getUsers:selectAll';

    // Check if the data is not in the cache already
    if (!$data = $cache->fetch($key)) {

        // There was no cached version; we are fetching fresh data
        // (assuming there is a database connection)
        $result = mysql_query("SELECT * FROM users");
        $data = array();

        // Fetching all the data and putting it in an array
        while ($row = mysql_fetch_assoc($result)) { $data[] = $row; }

        // Storing the data in the cache for 10 minutes
        $cache->store($key, $data, 600);

    }
    return $data;

}

$users = getUsers();

?>

The reason I picked the mysql_ set of functions here is that most readers will probably know them. Personally I prefer PDO or another abstraction library. This example assumes there is already a database connection and a users table.
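
For comparison, the same function written with PDO could look roughly like the sketch below (the $pdo connection and the way both objects are passed in are assumptions for illustration):


<?php

function getUsersPdo(PDO $pdo, FileCache $cache) {

    $key = 'getUsers:selectAll';

    // Check the cache first
    if (!$data = $cache->fetch($key)) {

        // No cached version; fetch fresh data from the database
        $stmt = $pdo->query('SELECT * FROM users');
        $data = $stmt->fetchAll(PDO::FETCH_ASSOC);

        // Store the result in the cache for 10 minutes
        $cache->store($key, $data, 600);

    }
    return $data;

}

?>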


Problems with the library


The first problem is simple: the class will only work on Linux and other Unix-like systems, because it uses the /tmp folder. Luckily we can use the php.ini setting 'session.save_path' to pick a writable directory instead.


<?php

    private function getFileName($key) {

        return ini_get('session.save_path') . '/s_cache' . md5($key);

    }

?>

The next problem is a little more complex. When one of our cache files is being read and at the same time written to by another process, you can get really unusual results. Caching bugs are hard to find because they only occur in very specific circumstances; you might never see this issue happen yourself, but somewhere out there a user will.


PHP can lock files with flock(). flock() operates on an open file handle (opened by fopen()) and either locks a file for reading (shared lock: everybody can read the file) or for writing (exclusive lock: everybody waits until the writing is done and the lock is released). Because file_get_contents() is the most efficient way to read, and we can only use flock() on file handles, we'll use a combination of both.


The updated store and fetch methods look like this:


<?php

// This is the function you store information with
function store($key, $data, $ttl) {

    // Opening the file in read/write mode ('a+' so it isn't truncated before we hold the lock)
    $h = fopen($this->getFileName($key), 'a+');
    if (!$h) throw new Exception('Could not write to cache');

    flock($h, LOCK_EX); // exclusive lock, will get released when the file is closed

    fseek($h, 0); // go to the beginning of the file

    // Truncate the file
    ftruncate($h, 0);

    // Serializing along with the TTL
    $data = serialize(array(time() + $ttl, $data));
    if (fwrite($h, $data) === false) {
        throw new Exception('Could not write to cache');
    }
    fclose($h);

}

function fetch($key) {

    $filename = $this->getFileName($key);
    if (!file_exists($filename)) return false;

    $h = fopen($filename, 'r');
    if (!$h) return false;

    // Getting a shared lock
    flock($h, LOCK_SH);

    $data = file_get_contents($filename);
    fclose($h);

    $data = @unserialize($data);
    if (!$data) {

        // If unserializing somehow didn't work out, we'll delete the file
        unlink($filename);
        return false;

    }

    if (time() > $data[0]) {

        // Unlinking when the file was expired
        unlink($filename);
        return false;

    }

    return $data[1];

}

?>

Well, that actually wasn't too hard: only a few new lines. The next issue we're facing is updating data. When somebody updates, say, a page in the CMS, they usually expect the corresponding page to update instantly. In those cases you can update the data using store(), but sometimes it is simply more convenient to flush the cache. So we need a delete method.


<?php

    function delete($key) {

        $filename = $this->getFileName($key);
        if (file_exists($filename)) {
            return unlink($filename);
        } else {
            return false;
        }

    }

?>

Abstracting the code


This cache class is pretty straightforward: the only methods in there are store, fetch and delete. We can easily abstract that into the following base class. I'm also giving it a proper prefix (I tend to prefix everything with Sabre; name yours whatever you want). A good reason to prefix all your classes is that they will never collide with other class names when you need to include other code. The PEAR project made a mistake by naming one of its classes 'Date'; by doing that and refusing to change it, they prevented an internal PHP date class from being named Date.


<?php

abstract class Sabre_Cache_Abstract {

    abstract function fetch($key);
    abstract function store($key, $data, $ttl);
    abstract function delete($key);

}

?>

The resulting FileCache (which I'll rename to Filesystem) is:


<?php

class Sabre_Cache_Filesystem extends Sabre_Cache_Abstract {

    // This is the function you store information with
    function store($key, $data, $ttl) {

        // Opening the file in read/write mode
        $h = fopen($this->getFileName($key), 'a+');
        if (!$h) throw new Exception('Could not write to cache');

        flock($h, LOCK_EX); // exclusive lock, will get released when the file is closed

        fseek($h, 0); // go to the start of the file

        // Truncate the file
        ftruncate($h, 0);

        // Serializing along with the TTL
        $data = serialize(array(time() + $ttl, $data));
        if (fwrite($h, $data) === false) {
            throw new Exception('Could not write to cache');
        }
        fclose($h);

    }

    // The function to fetch data; returns false on failure
    function fetch($key) {

        $filename = $this->getFileName($key);
        if (!file_exists($filename)) return false;

        $h = fopen($filename, 'r');
        if (!$h) return false;

        // Getting a shared lock
        flock($h, LOCK_SH);

        $data = file_get_contents($filename);
        fclose($h);

        $data = @unserialize($data);
        if (!$data) {

            // If unserializing somehow didn't work out, we'll delete the file
            unlink($filename);
            return false;

        }

        if (time() > $data[0]) {

            // Unlinking when the file was expired
            unlink($filename);
            return false;

        }

        return $data[1];

    }

    function delete($key) {

        $filename = $this->getFileName($key);
        if (file_exists($filename)) {
            return unlink($filename);
        } else {
            return false;
        }

    }

    private function getFileName($key) {

        return ini_get('session.save_path') . '/s_cache' . md5($key);

    }

}

?>

There you go: a complete, properly object-oriented, file-based caching class. I hope I explained things well.


Memory based caching through APC


If files aren't fast enough for you, and you have memory to spare, memory-based caching might be the solution. Obviously, storing and retrieving data from memory is a lot faster. The APC extension not only does opcode caching (it speeds up your PHP scripts by caching the compiled script), but it also provides a simple mechanism for storing data in shared memory.


Using shared memory through APC is extremely simple; I'm not even going to explain it, the code should tell you enough.


<?php

class Sabre_Cache_APC extends Sabre_Cache_Abstract {

    function fetch($key) {
        return apc_fetch($key);
    }

    function store($key, $data, $ttl) {
        return apc_store($key, $data, $ttl);
    }

    function delete($key) {
        return apc_delete($key);
    }

}

?>

My personal problem with APC was that it tended to break my code, so if you want to use it, give it a test run first. I have to admit that I haven't checked again since they fixed 'my' bug. Now that this bug is fixed, APC is great for single-server applications and for data that is read very often.


Memcached


Problems start when you are dealing with more than one webserver. Since there is no shared cache between the servers, situations can occur where data is updated on one server and it takes a while before the other servers are up to date. It can be really useful to give your data a very high TTL and simply replace or delete the cached copy whenever there is an actual update, but when you are dealing with multiple webservers this scheme is simply not possible with the previous caching methods.
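
To make that pattern concrete, an update path could look roughly like the sketch below. This is only an illustration: the updateUser() function, the PDO connection and the key format are assumptions, and the pattern only works across webservers once the cache backend is shared (as with Memcached, introduced next).


<?php

// Hypothetical update function: writes to the database, then invalidates the
// cached copy so the next fetch (on any webserver) gets fresh data. The normal
// TTL on the cached entry can be very high, because updates flush it explicitly.
function updateUser(Sabre_Cache_Abstract $cache, PDO $db, $userId, $newName) {

    $stmt = $db->prepare('UPDATE users SET name = ? WHERE id = ?');
    $stmt->execute(array($newName, $userId));

    $cache->delete('My_Auth:users:' . $userId);

}

?>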


Introducing Memcached. Memcached is a cache server originally developed by the LiveJournal people and now used by sites like Digg, Facebook, Slashdot and Wikipedia.


How it works



  • Memcached consists of a server part and a client part. The server is a standalone program that runs on your servers, and the client is in this case a PHP extension.

  • If you have 3 webservers that all run Memcached, every webserver connects to all 3 memcached servers; the 3 memcached servers are all in the same 'pool'.

  • Each cache server only contains part of the cache; the cache is not replicated between the memcached servers.

  • To find the server where a cache entry is stored (or should be stored), a so-called hashing algorithm is used. This way the 'right' server is always picked (see the sketch after this list).

  • Every memcached server has a memory limit and will never consume more memory than that. If the limit is exceeded, older cache entries are automatically thrown out, whether their TTL has expired or not.

  • This means it cannot be used as a place to simply store data; that is what the database is for. Don't confuse the purpose of the two!

  • Memcached runs fastest (like many other applications) on a Linux 2.6 kernel.

  • By default, memcached is completely open. Be sure to have a firewall in place to lock out outside IPs, because this can be a huge security risk.
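
The hashing step is conceptually similar to the sketch below. The real client's distribution algorithm is more sophisticated, so treat this purely as an illustration of why every webserver ends up asking the same server for the same key.


<?php

// Illustration only: hash the key and take the remainder modulo the number of
// servers in the pool to decide which server holds (or should hold) the entry.
function pickServer($key, array $servers) {

    $index = abs(crc32($key)) % count($servers);
    return $servers[$index];

}

// Every webserver using the same server list picks the same server for this key.
echo pickServer('My_Auth:users:1234', array('www1', 'www2', 'www3'));

?>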


Installing


If you are on Debian/Ubuntu, installing is easy:


apt-get install memcached

You are stuck with a specific version though; Debian tends to be slow with updates. Other distributions might also have a pre-built package for you. In any other case you might need to download Memcached from its site and compile it with the usual:


./configure

make

make install

There's probably a README in the package with better instructions.


After installation, you need the PECL extension. All you need to do for that (usually) is:


pecl install Memcache

You also need the zlib development library. On Debian, you can get this by entering:


apt-get install zlib1g-dev

However, 99% of the time the automatic PECL installation fails for me. Here are the alternative installation instructions:


pecl download Memcache

tar xfvz Memcache-2.1.0.tgz # the version number might differ

cd Memcache-2.1.0

phpize

./configure

make

make install

Don't forget to enable the extension in php.ini by adding the line extension=memcache.so and restarting the webserver.


The good stuff


Once the Memcached server is installed and running, and PHP is running with the Memcache extension, you're all set. Here's the Memcached class:


<?php

class Sabre_Cache_MemCache extends Sabre_Cache_Abstract {

    // Memcache object
    public $connection;

    function __construct() {

        $this->connection = new Memcache;

    }

    function store($key, $data, $ttl) {

        return $this->connection->set($key, $data, 0, $ttl);

    }

    function fetch($key) {

        return $this->connection->get($key);

    }

    function delete($key) {

        return $this->connection->delete($key);

    }

    function addServer($host, $port = 11211, $weight = 10) {

        $this->connection->addServer($host, $port, true, $weight);

    }

}

?>

Now, the only thing you have to do in order to use this class is add servers. Add servers consistently! Meaning that every webserver should add the exact same memcache servers, so the keys will be distributed in the same way from every webserver.


If a server has double the memory available for memcached, you can double the weight. The chance that data will be stored on that specific server will also be doubled.


Example


<?php

$cache = new Sabre_Cache_MemCache();
$cache->addServer('www1');
$cache->addServer('www2', 11211, 20); // this server has double the memory, and gets double the weight
$cache->addServer('www3', 11211);

// Store some data in the cache for 10 minutes
$cache->store('my_key', 'foobar', 600);

// Get it out of the cache again
echo($cache->fetch('my_key'));

?>

Some final tips



  • Be sure to check out the docs for Memcache and APC and try to determine what's right for you.

  • Caching can help everywhere SQL queries are done. You'd be surprised how big the difference can be in terms of speed.

  • In some cases you might want the cross-server abilities of Memcached, but you don't want to use up your memory or have your items automatically flushed out. Wikipedia ran into this problem and traded fast memory caching for virtually unlimited file-based caching by creating a memcached-compatible engine called Tugela Cache. You can still use the PECL Memcache client with it, so switching should be pretty easy. I don't have experience with it or know how stable it is.

  • If you have different requirements for different parts of your cache, you can always consider using the different backends side by side (see the sketch below).
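
A minimal sketch of that last idea, using the classes from this article (the split between 'hot' session data and a bulky report is just an example, as are the keys and TTLs):


<?php

// Hot, frequently-read data goes to Memcached...
$fastCache = new Sabre_Cache_MemCache();
$fastCache->addServer('www1');
$fastCache->store('session:1234', array('user_id' => 1234), 300);

// ...while large, rarely-changing data goes to the filesystem cache.
$bigCache = new Sabre_Cache_Filesystem();
$bigCache->store('report:monthly', array_fill(0, 1000, 'row data'), 86400);

?>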
