HtmlSource - a new DBO driver for CakePHP

OK, OK - I’ve been slacking on this blog again, but I will save that for another post, where I will announce some major changes I have been thinking about lately. For today, I’d like to introduce a new DBO source driver, HtmlSource, which is fully functional but still lacks some of the features I have planned for it.

So what’s an HTML DBO driver, you ask?

Simply put, it’s a way to treat any HTML page like a database and be able to retrieve (scrape) certain parts using an SQL-like command:

SELECT href, title FROM a WHERE class="submit"
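
To make the semantics concrete, here is a rough sketch of what that query asks for, written with PHP's bundled DOM extension. It only illustrates the expected result set and is not how HtmlSource works internally; the selectFromHtml() helper and the sample markup are made up for this example.

```php
<?php
/**
 * Return the requested attributes of every <$tag> element whose attributes
 * match $conditions exactly. Hypothetical helper, not part of HtmlSource.
 */
function selectFromHtml($html, $tag, array $fields, array $conditions = array())
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = array();
    foreach ($doc->getElementsByTagName($tag) as $node) {
        foreach ($conditions as $name => $value) {
            if ($node->getAttribute($name) !== $value) {
                continue 2; // a condition failed, skip this element
            }
        }
        $row = array();
        foreach ($fields as $field) {
            $row[$field] = $node->getAttribute($field);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Sample markup, invented for the example.
$html = '<html><body>'
      . '<a class="submit" href="/save" title="Save">Save</a>'
      . '<a class="cancel" href="/back" title="Back">Back</a>'
      . '</body></html>';

// Roughly: SELECT href, title FROM a WHERE class="submit"
$rows = selectFromHtml($html, 'a', array('href', 'title'), array('class' => 'submit'));
print_r($rows);
```

With the sample markup above, only the first link matches, so $rows holds a single row with its href and title. Writing this by hand for every page is exactly the kind of repetition the SQL-like syntax is meant to hide.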

I first stumbled on the idea when I discovered the HtmlSql class by Jonas John. Apart from it being a third-party class, I disliked a couple of other things, like having to use ‘$class’ instead of ‘class’ in the WHERE clause, its exclusive reliance on eval() and regular expressions, etc.

But why a DBO driver?

My main goal was to be able to spend the time after it is released on adding more services instead of updating the ones already available. With that in mind, here are some of the reasons I identified; feel free to (dis)agree in the comments:

  • Contrary to what one would expect in 2007, 90% of the sites/services scraped do not offer any web services to facilitate the task.

  • Scraping 100+ different sites/pages quickly becomes a hassle to maintain, mainly because none of those services will tell you when their HTML has changed so that you can update your regular expressions accordingly.

  • PHP’s native DOM objects don’t cut it, at least not for mass scraping. They may not be as bad as custom regular expressions, but they are still a daunting maintenance burden.

The code

So without further ado, and with apologies for the lack of complete instructions so far, here is the code I wanted to share with you today. Suggestions and criticism are welcome.

In your /app/models/datasources/dbo directory, create the dbo_html.php file, and in the /app/ directory, create the curl.php one.

Example

So as not to leave you completely unaware of how to use it, here is a tiny example that should help you figure it all out - and maybe help with it?

Create a new model HtmlModel - this acts here as a dynamic model:

class HtmlModel extends AppModel
{
   var $useDbConfig = 'html';
   /**
    * Not used for the moment, but usually, the returned HTML model should
    * always have an 'html' element/table
    *
    * @var string
    * @access public
    */
   var $useTable = 'html';
   /**
    * Overwrites the Model::setSource method that breaks with a MissingTable error when
    * the fetched url doesn't return any HTML
    *
    * @param string
    * @todo add a filter to send more useful errors when no default 'html'
    */
   function setSource($tableName){}
   /**
    * Sets a different datasource (table/element)
    *
    * @param string
    * @return void
    * @access public
    */
   function switchTable($table)
   {
      parent::__construct(false, $table);
   }
}

In your controller you can now easily query the HTML page like any other model by doing the following:

class ExampleController extends AppController
{
   var $uses = array('AnyOtherModel'); //can also be null or not defined

   function getLinks($url)
   {
      $this->setDataSource($url);
      loadModel('HtmlModel');
      //use it like this:
      $this->HtmlModel->setSource('a'); //to assign a virtual table to read from
      $this->HtmlModel->findAll(array('class' => 'submit'), 'href');
      //or
      $this->HtmlModel->query('SELECT href FROM a WHERE class="test"');

      //continue by defining your view or more processing here.....
   }

   function setDataSource($url)
   {
      $config = array(
         'driver' => 'html',
         'host' => $url,
         'username' => null,
         'password' => null,
         'database' => '',
         'dom' => true,
      );
      $this->ds['html'] = ConnectionManager::create('html', $config);
   }
}

8 Responses to “HtmlSource - a new DBO driver for CakePHP”

  1. links for 2007-12-08 « Richard@Home Says:

    […] Loud Baking » Blog Archive » HtmlSource - a new DBO driver for CakePHP Looks like a very interesting concept. I’m looking forward to seeing it developed further. (tags: cakephp html sql) Posted by Richard@Home Filed in 15 […]

  2. Daniel Hofstetter Says:

    Wouldn’t it make sense to define a default table name like “html”? That way nobody would have to call setSource() before using findAll() or query().

  3. Jad Says:

    @Daniel: Yes, you’re right, but I haven’t yet implemented any way of detecting and returning an error for a 404 page or the like (which most of the time does not contain any HTML tag), so I opted to keep it that way for now. I did already include the default $useTable attribute to pave the way for future upgrades.

  4. daniel Says:

    I like it! now the screen scraping will be much easier :)

    thx, regards daniel

  5. Andruu Says:

    Would it be possible to upload the files again? I am getting a 404. Thanks

  6. Chris Says:

    I second Andruu’s comment… links are dead. Can you reupload the files?

  7. Josh Says:

    Love it!!! Can you re-post the files??

  8. Jad Says:

    Hey guys,

    Thanks for dropping by. I don’t have time right now to look for the files on my old PC’s hard drive or to check what broke the links, but I am planning to post an update to this post this week or, if I don’t get time for that, at least get those links fixed for sure.

    Stay tuned and thanks again for letting me know!
