Saturday, January 08, 2011

PHP Parsing, Searching an RSS Feed and a Cron Job

Kayaked for over an hour. Going out at 9:30 am would've been impossible in the summer. These days, it's so chilly out over the water that the sun is a non-issue. I went all the way to the shoal marker at the entrance where I had the summer encounter with the shark.

My custom search tool has two major parts: a JavaScript client component and a PHP server component. This post describes the PHP server code. It is built on the CodeIgniter framework, so it has three parts: a Model, a View, and a Controller.

The Controller's job is to receive the request from the Browser, then call the Model and View as needed. The code for the controller is listed below. It has a constructor function Search(), which just loads some CodeIgniter helper libraries, then loads in our Model, called rssfeed_model.

The second function in the controller, index(), handles the GET request from the Browser. It first calls the model to load in the RSS feed (described more below), then it reads in some URL parameters: q, title, and callback. The q parameter contains the search string, title holds the title we want to display for the search, and callback tells us whether or not this is a JSONP GET request.

Once the parameters are read, the search is performed by calling the model's search() method. The returned results are then sorted so that the ones with the most relevance will appear first in the displayed results. Finally, we call the view to actually send the results. We call either the "search_view" or the "search_json_view", depending upon whether or not the "callback" parameter was set to anything in the URL.

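class Search extends Controller {
  // Constructor: load the CodeIgniter helpers (not shown), then our model
  function Search() {
    parent::Controller();
    $this->load->model('rssfeed_model');
  }
  // Handle the GET request from the Browser
  function index() {
    // (The calls that load the feed and sort the results are not shown here)
    $q = $this->input->get("q");
    $this->data['q'] = $q;
    $this->data['title'] = $this->input->get("title");
    $callback = $this->input->get("callback");
    $this->data['callback'] = $callback;
    // Perform the search
    $this->data['results'] = $this->rssfeed_model->search($q,!empty($debug));
    if (empty($callback)) {
      $this->load->view('search_view', $this->data);
    } else {
      $this->load->view('search_json_view', $this->data);
    }
  }
}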
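
The sorting step does not appear in the controller listing. Here is a minimal sketch of how such a sort could look, assuming each result is an array with a "count" key, which is the shape search() returns. The function name compare_by_count and the sample data are hypothetical, made up for illustration.

```php
<?php
// Hypothetical comparison callback: results with a higher "count" come first.
function compare_by_count($a, $b) {
    if ($a['count'] == $b['count']) {
        return 0;
    }
    return ($a['count'] > $b['count']) ? -1 : 1;
}

// Made-up sample results in the same shape that search() returns.
$results = array(
    array("title" => "Kayaking", "link" => "http://example.com/a", "count" => 3),
    array("title" => "Cron Jobs", "link" => "http://example.com/b", "count" => 2001),
    array("title" => "CSS Tips", "link" => "http://example.com/c", "count" => 40),
);

// Sort in place, most relevant first.
usort($results, 'compare_by_count');

foreach ($results as $result) {
    echo $result['count'] . "  " . $result['title'] . "\n";
}
```

Because title and tag matches are weighted by 1000, a descending sort on "count" puts those posts ahead of posts that only mention the term in the body.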

Next we look at the rssfeed_model. It has two methods: loadfeed() and search(). loadfeed() is pretty simple: it just calls PHP's simplexml_load_file() to load in the RSS feed data. RSS is an XML-based format, so this loads the entire blog content into a data structure. The search() function then searches through that data structure looking for matches with the search phrase.

The search function uses XPath queries to pull things out of the RSS XML. One slightly tricky part of this is that each call to PHP's xpath() function must be preceded by a call to registerXPathNamespace(). This is because the RSS feed declares a namespace, and XPath won't match the feed's elements unless the query uses a prefix bound to that namespace. The search uses regular expressions to match search terms against the blog content. This was discussed in a previous post.
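
To see the namespace issue in isolation, here is a small standalone sketch. It assumes the feed is in Atom format and uses the standard Atom namespace URI; the tiny feed is made up for illustration, and the 'atm' prefix is just a label we choose.

```php
<?php
// Made-up, minimal Atom document; a real blog feed is much larger.
$atom = '<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>First post</title></entry>
  <entry><title>Second post</title></entry>
</feed>';

$xml = simplexml_load_string($atom);

// Without a prefix, XPath only matches elements that are in no namespace,
// so this query finds nothing in a namespaced feed.
$unprefixed = $xml->xpath('//entry');

// Bind a prefix of our choosing to the feed's namespace, then query with it.
$xml->registerXPathNamespace('atm', 'http://www.w3.org/2005/Atom');
$entries = $xml->xpath('//atm:entry');

echo count($unprefixed) . " entries found without the prefix\n";
echo count($entries) . " entries found with the prefix\n";
echo "First title: " . (string)$entries[0]->title . "\n";
```

The prefix only has to match between the registerXPathNamespace() call and the query; it does not need to appear anywhere in the feed itself.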

class rssfeed_model extends Model {
  private $xml;

  function rssfeed_model() {
    parent::Model();
  }

  // Load the RSS feed into a SimpleXML data structure
  function loadfeed($url='') {
    $this->xml = simplexml_load_file($url);
  }

  // Return the matching posts, each with a title, a link and a relevance count
  function search($q,$debug=false) {
    if (empty($q)) {
      return array();
    }
    $q = trim(strtolower($q));
    // A query wrapped in quotes is one phrase; otherwise split it into terms
    $fc = substr($q,0,1);
    $lc = substr($q,-1);
    if ((($fc=="'") || ($fc=='"')) && (($lc=="'") || ($lc=='"'))) {
      $qarray = array(substr($q,1,-1));
    } else {
      $qarray = explode(" ",$q);
    }
    $r = array();
    // The second argument is the feed's XML namespace URI
    $this->xml->registerXPathNamespace('atm', '');
    $posts = $this->xml->xpath('//atm:entry');
    foreach ($posts as $k=>$post) {
      $count = 0;
      $href = '';
      foreach ($qarray as $q) {
        $termcount = 0;
        if (empty($q)) continue;
        // Match the term as a whole word, optionally followed by a plural "s";
        // preg_quote() keeps characters like "+" or "/" from breaking the pattern
        $pattern = '/([^a-z]|^)'.preg_quote($q,'/').'([^a-z]|$|(s[^a-z]))/';
        // strip_html_tags() is not a PHP built-in; it is assumed to come from a loaded helper
        $content = strip_html_tags(strtolower($post->content));
        $title = strip_tags(strtolower($post->title));
        // Matches in the title count 1000 times more than matches in the body
        $termcount += 1000*preg_match_all($pattern,$title,$matches);
        $termcount += preg_match_all($pattern,$content,$matches);
        // Matches against the post's tags get the same 1000x weight
        $post->registerXPathNamespace('atm', '');
        $categories = $post->xpath('atm:category');
        foreach ($categories as $category) {
          $tag = strtolower($category['term']);
          $termcount += 1000*preg_match_all($pattern,$tag,$matches);
        }
        // Find the link to the post itself
        $post->registerXPathNamespace('atm', '');
        $links = $post->xpath('atm:link[@rel="alternate"]');
        foreach ($links as $link) {
          $href = $link['href'];
        }
        $count += $termcount;
        if ($debug && ($termcount>0)) {
          echo "Term: $q scored $termcount for post: '".(string)$post->title."'<br>\n";
        }
      }
      if ($count > 0) {
        $r[] = array("title"=>(string)$post->title,"link"=>(string)$href,"count"=>$count);
      }
    }
    return $r;
  }
}
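
To make the scoring easier to follow, here is a standalone sketch of the same matching idea applied to one made-up post. The helper name count_term and the sample title and content are hypothetical; the pattern and the 1000x title weight are taken from the search() method, with preg_quote() added so that special characters in a term cannot break the regular expression.

```php
<?php
// Hypothetical helper: count whole-word (or simple plural) matches of $term in $text.
function count_term($term, $text) {
    $pattern = '/([^a-z]|^)' . preg_quote($term, '/') . '([^a-z]|$|(s[^a-z]))/';
    return preg_match_all($pattern, strtolower($text), $matches);
}

// Made-up sample post.
$title   = "Cron Job Setup";
$content = "The cron job runs hourly; crons are handy, microns are not.";

// Title matches are weighted 1000x, body matches count once each.
$score = 1000 * count_term("cron", $title) + count_term("cron", $content);
echo $score . "\n";
```

Here the title contributes 1000 and the body contributes 2 ("cron" and "crons"), for a score of 1002. The word "microns" does not count, because the pattern requires a non-letter (or the start of the text) before the term.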

Now we turn to the View. The search_json_view builds a JSON structured string, placing it all in the $r variable, and echo'ing it out to the Browser at the end.
$r = $callback.'(';
  $r .= '{ "q": "'.str_replace('"',"'",$q).'",';
  $r .= '"title":"'.$title.'",';
  $r .= '"result": [';
  $items = array();
  foreach ($results as $result) {
    $items[] = '{ "count":"'.$result['count'].'", "title":"'.
          str_replace('"','',$result['title']).'", "link":"'.$result['link'].'" }';
  }
  // Join with commas so the array does not end with a trailing comma
  $r .= implode(",\n", $items);
  $r .= ']}';
$r .= ')';
echo $r;
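
Building the string by hand works for simple titles, but quotes and backslashes in the data need care. As an alternative sketch (not the code this tool uses), PHP 5.2 and later provide json_encode(), which takes care of the escaping. The function name build_jsonp and the sample values below are hypothetical.

```php
<?php
// Hypothetical helper: wrap the search results in a JSONP callback call.
function build_jsonp($callback, $q, $title, $results) {
    $payload = array(
        'q'      => $q,
        'title'  => $title,
        'result' => array_values($results),
    );
    // json_encode() escapes quotes and other special characters for us.
    return $callback . '(' . json_encode($payload) . ')';
}

// Made-up sample data in the same shape that search() returns.
$sample = array(
    array('count' => 1002, 'title' => 'PHP "Parsing"', 'link' => 'http://example.com/post'),
);

$out = build_jsonp('handleResults', 'cron "job"', 'Search', $sample);
echo $out . "\n";
```

One difference from the hand-built version: integer counts come out as JSON numbers rather than quoted strings, so the client code would need to expect that.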


One thing you might have noticed is that the search currently loads in the RSS data from a local file on the server. I didn't want every single search to re-download the whole RSS feed, so I just store the feed in that local file. But that raises the question of how to keep the results up to date. I do that with a cron job that runs every hour on the server. Here's how I set that up.

First, I created a file containing a command that downloads the RSS feed and puts it in a file. I called the file searchCron:

File searchCron:
curl > blogsearch/cssbakery.rss
This uses the popular "curl" command to retrieve the URL.

Next, I need to schedule this command to run every hour. So I create another file, which I called mycron, to tell the cron scheduler how to run it:
File mycron:
    20 * * * * searchCron

The first five fields specify when to run the command: minute, hour, day of the month, month, and day of the week. In my case I have it run at 20 minutes after the hour, every hour, every day. If I wanted to just run it once a day, say at 10:30am every morning, I'd change this file as follows:

30 10 * * * searchCron
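
To make the field layout concrete, here is a small PHP sketch that splits a crontab line into its five time fields plus the command. The helper name parse_cron_line is hypothetical, and it assumes the simple five-field form used here with no comments or environment lines.

```php
<?php
// Hypothetical helper: split one crontab line into its time fields and the command.
function parse_cron_line($line) {
    // Limit of 6 keeps any spaces inside the command itself intact.
    $parts = preg_split('/\s+/', trim($line), 6);
    return array(
        'minute'       => $parts[0],
        'hour'         => $parts[1],
        'day_of_month' => $parts[2],
        'month'        => $parts[3],
        'day_of_week'  => $parts[4],
        'command'      => $parts[5],
    );
}

$hourly = parse_cron_line("    20 * * * * searchCron");
$daily  = parse_cron_line("30 10 * * * searchCron");

echo "Hourly job runs at minute " . $hourly['minute'] . " of hour " . $hourly['hour'] . "\n";
echo "Daily job runs at " . $daily['hour'] . ":" . $daily['minute'] . "\n";
```

A "*" in a field means "every value", so the hourly line has a fixed minute and wildcards everywhere else.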
Now to actually schedule the job, I run the crontab command on the server:
crontab mycron
And that should get the cron job scheduled... However, I did run into a problem. For some reason my cron job wasn't working. If you run the command "crontab -l" it will list all your cron jobs. It showed that mine was scheduled just fine, but still the RSS feed file wasn't being updated at all!

My next step was to look in the system log:
tail /var/www/syslog
In there I saw this entry:
Jan  8 17:20:01 /USR/SBIN/CRON[12525]: (web) CMD (searchCron^M)
The thing that got my attention was the "^M" at the end of the command. As it turned out, I had edited the "mycron" file on Windows and uploaded it. That left Windows carriage return/line feed line endings in the file, and from the Linux point of view the extra carriage return character (shown as ^M) became part of the command name, which caused the failure. So, I re-created the file on Linux (using vi) and that solved the problem. The cron job now runs every hour.
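
Since the rest of this tool is PHP, here is a small sketch of how the same problem could be detected and repaired in PHP instead of re-creating the file by hand. The helper names are hypothetical, and the sample string stands in for the contents of the uploaded file.

```php
<?php
// Hypothetical helper: report whether the text contains carriage returns.
function has_carriage_returns($text) {
    return strpos($text, "\r") !== false;
}

// Hypothetical helper: convert Windows line endings (CR LF) to Unix ones (LF).
function to_unix_line_endings($text) {
    return str_replace("\r\n", "\n", $text);
}

// Stand-in for a crontab file that was edited on Windows.
$uploaded = "20 * * * * searchCron\r\n";
$fixed    = to_unix_line_endings($uploaded);

var_dump(has_carriage_returns($uploaded));
var_dump(has_carriage_returns($fixed));
```

In practice the text would come from file_get_contents() and be written back with file_put_contents() before running crontab on the file.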


Bill said...

dude, I've been trying out your examples. You are damn good! Getting other people's examples to work is really a pain in the ass.
