HtmlSource - a new DBO driver for CakePHP
OK, OK - I’ve been slacking on this blog again, but I’ll save that for another post, where I will announce some major changes I have been thinking about lately. For today, I’d like to introduce a new DBO source driver: HtmlSource. It is fully functional but still lacks some of the features I have planned for it.
So what’s an HTML DBO driver, you ask?
Simply put, it’s a way to treat any HTML page like a database and retrieve (scrape) certain parts of it using an SQL-like command:
SELECT href, title FROM a WHERE class="submit"
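To make the idea concrete, here is a minimal sketch of what such a query boils down to, written against PHP’s native DOM extension. It is not how HtmlSource is implemented; the selectFromHtml() helper and the sample markup are made up purely for illustration, and the helper does no escaping of the tag, attribute names or values.

```php
<?php
// Hypothetical helper, NOT part of HtmlSource: roughly what
// SELECT <fields> FROM <tag> WHERE <attribute>="<value>" means in DOM terms.
function selectFromHtml($html, $tag, array $where, array $fields)
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Build an XPath such as //a[@class="submit"] from the WHERE conditions.
    // Note: this is an exact match on the attribute value, with no escaping.
    $xpath = '//' . $tag;
    foreach ($where as $attribute => $value) {
        $xpath .= '[@' . $attribute . '="' . $value . '"]';
    }

    $rows = array();
    $finder = new DOMXPath($doc);
    foreach ($finder->query($xpath) as $node) {
        $row = array();
        foreach ($fields as $field) {
            // getAttribute() returns an empty string for a missing attribute.
            $row[$field] = $node->getAttribute($field);
        }
        $rows[] = $row;
    }
    return $rows;
}

// Made-up sample page with one matching and one non-matching link.
$html = '<html><body>'
      . '<a class="submit" href="/save" title="Save">Save</a>'
      . '<a class="cancel" href="/back" title="Back">Back</a>'
      . '</body></html>';

// SELECT href, title FROM a WHERE class="submit"
print_r(selectFromHtml($html, 'a', array('class' => 'submit'), array('href', 'title')));
```

Only the first link matches here, so the result holds a single row with its href and title. Writing and maintaining this kind of code for every page is exactly the chore the SQL-like syntax is meant to hide.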
I first stumbled on the idea when I discovered the HtmlSql class by Jonas John. Apart from it being a third-party class, I disliked a couple of other things about it, such as having to write ‘$class’ instead of ‘class’ in the WHERE clause and its exclusive reliance on eval() and regular expressions.
But why a DBO driver?
My main goal was to be able to spend my time, once it is released, on adding more services instead of fixing the ones already available. With that in mind, here are some of my reasons; feel free to (dis)agree in the comments:
Contrary to what one might expect in 2007, 90% of the sites/services scraped do not offer any web service to make the task easier.
Scraping 100+ different sites/pages quickly becomes a hassle to maintain, mainly because none of those services will tell you when their HTML has changed so that you can update your regular expressions accordingly.
PHP’s native DOM objects don’t cut it, at least not for mass scraping. They may not be as bad as custom regular expressions, but they are still daunting to maintain.
The code
So without further ado, and with apologies for the lack of complete instructions for now, here is the code I wanted to share with you today. Suggestions and criticism are welcome.
In your /app/models/datasources/dbo directory, create the dbo_html.php file, and in the /app/ directory, create the curl.php one.
Example
So as not to leave you completely in the dark about how to use it, here is a tiny example that should help you figure it all out - and maybe encourage you to help with it?
Create a new model, HtmlModel, which acts here as a dynamic model:
class HtmlModel extends AppModel
{
    var $useDbConfig = 'html';

    /**
     * Not used for the moment, but usually, the returned HTML model should
     * always have an 'html' element/table
     *
     * @var string
     * @access public
     */
    var $useTable = 'html';

    /**
     * Overwrites the Model::setSource method that breaks with a MissingTable error when
     * the fetched url doesn't return any HTML
     *
     * @param string
     * @todo add a filter to send more useful errors when no default 'html'
     */
    function setSource($tableName) {}

    /**
     * Sets a different datasource (table/element)
     *
     * @param string
     * @return void
     * @access public
     */
    function switchTable($table)
    {
        parent::__construct(false, $table);
    }
}
In your controller, you can now query the HTML page like any other model by doing the following:
class ExampleController extends AppController
{
    var $uses = array('AnyOtherModel'); // can also be null or not defined

    function getLinks($url)
    {
        // Register the 'html' datasource for this URL before the model is created
        $this->setDataSource($url);

        // loadModel() only loads the class file, so instantiate the model explicitly
        loadModel('HtmlModel');
        $this->HtmlModel = new HtmlModel();

        // use it like this:
        $this->HtmlModel->setSource('a'); // to assign a virtual table to read from
        $links = $this->HtmlModel->findAll(array('class' => 'submit'), 'href');
        // or
        $links = $this->HtmlModel->query('SELECT href FROM a WHERE class="test"');

        // continue by defining your view or more processing here.....
    }

    function setDataSource($url)
    {
        $config = array(
            'driver'   => 'html',
            'host'     => $url,
            'username' => null,
            'password' => null,
            'database' => '',
            'dom'      => true,
        );
        $this->ds['html'] = ConnectionManager::create('html', $config);
    }
}
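If the page you scrape never changes, the same settings could presumably live in app/config/database.php instead of being built at runtime. This is an untested sketch: it assumes the driver reads a static configuration the same way it reads the one passed to ConnectionManager::create(), and http://example.com/ is only a placeholder. Your existing $default entry would stay as it is.

```php
<?php
// app/config/database.php (sketch only; assumes HtmlSource accepts a static config)
class DATABASE_CONFIG
{
    // Picked up by the model through $useDbConfig = 'html'.
    var $html = array(
        'driver'   => 'html',
        'host'     => 'http://example.com/', // placeholder: URL of the page to scrape
        'username' => null,
        'password' => null,
        'database' => '',
        'dom'      => true,
    );
}
```

The runtime version shown in the controller remains the better fit when the URL comes from the request, as in getLinks($url).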
December 9th, 2007 at 1:57 pm
Wouldn’t it make sense to define a default table name like “html”? That way, nobody would have to call setSource() before using findAll() or query().
December 10th, 2007 at 3:15 am
@Daniel: Yes, you’re right. But since I haven’t yet implemented any way of detecting and returning an error for a 404 page or the like (which most of the time does not contain any HTML tag), I opted to keep it that way for now. I did already include the default $useTable attribute to pave the way for future upgrades.
May 26th, 2008 at 7:18 pm
I like it! Now screen scraping will be much easier.
Thx, regards, Daniel
September 28th, 2008 at 2:35 am
Would it be possible to upload the files again? I am getting a 404. Thanks
October 21st, 2008 at 12:53 am
I second Andruu’s comment… links are dead. Can you reupload the files?
October 21st, 2008 at 1:17 am
Love it!!! Can you re-post the files??
October 21st, 2008 at 5:32 am
Hey guys,
Thanks for dropping by. I don’t have time right now to look at the files on my old PC HD nor to check what broke the links but I am planning on posting an update to this post this week or, at least, if I don’t get time for that, get those links fixed for sure.
Stay tuned and thanks again for letting me know!