CakePHP URL validation bug fix and enhancement
Update (25/10/2007): The bug appears to have been fixed in the latest pre-beta release. It looks like this was the correct regex.
When I started parsing the millions of scraped Google pages, I came across a bug in the URL validation. To fix it, I overrode the url validation method in app_model.php:
function url($check)
{
    $validation = new Validation();
    $validation->check = $check;
    $validation->regex = '/^((https?|ftps?|file|news|gopher):\/\/)?' // protocol
        . '('
        . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' // IP, e.g. 199.194.52.184
        . '|' // IP or domain
        . '([0-9a-z]{1}[0-9a-z\-]*\.)*' // subdomain(s), e.g. www.
        . '([0-9a-z]{1}[0-9a-z\-]{0,56})\.' // domain
        . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})' // tld
        . '(:[0-9]{1,4})?' // port
        . ')'
        . '('
        . '\/?|' // ending slash
        . '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]' // path (the single quote must be escaped)
        . ')$/i';
    return $validation->_check();
}
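To see what the pattern accepts, here is a minimal standalone sketch that applies the same regex with plain preg_match() instead of CakePHP's Validation class, with the escapes written out in full. The function name matches_url_pattern() is made up for this illustration and is not part of CakePHP.

```php
<?php
// Hypothetical helper: runs the URL pattern from the post through preg_match().
function matches_url_pattern($url)
{
    $regex = '/^((https?|ftps?|file|news|gopher):\/\/)?' // protocol
        . '('
        . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' // IP
        . '|' // IP or domain
        . '([0-9a-z]{1}[0-9a-z\-]*\.)*' // subdomain(s)
        . '([0-9a-z]{1}[0-9a-z\-]{0,56})\.' // domain
        . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})' // tld
        . '(:[0-9]{1,4})?' // port
        . ')'
        . '('
        . '\/?|' // ending slash
        . '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]' // path
        . ')$/i';
    return preg_match($regex, $url) === 1;
}

var_dump(matches_url_pattern('http://www.example.com/'));               // bool(true)
var_dump(matches_url_pattern('www.example.com'));                       // bool(true), scheme is optional
var_dump(matches_url_pattern('https://199.194.52.184/index.html?x=1')); // bool(true), numeric host
var_dump(matches_url_pattern('http://www.example.com:8080/path'));      // bool(true), port allowed
var_dump(matches_url_pattern('http://example'));                        // bool(false), no tld
var_dump(matches_url_pattern('mailto:someone@example.com'));            // bool(false), unknown scheme
```

Because the pattern sits between / delimiters inside a single-quoted PHP string, every literal slash is written as \/ and the single quote in the path character class as \'.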
Advanced URL validation
The more I looked at the different kinds of URLs the app needs to validate, the more I realized that this regex, as complete as it is, wouldn’t do the job for every URL validation. I decided to break it apart and build a more advanced URL validation that accepts options controlling which scheme(s), numeric hosts, subdomain(s), port(s) and path(s) to allow.
/**
 * Validates a URL.
 *
 * @param string $check string to check
 * @param array $options options for allowing different URL parts
 * @return boolean true if $check matches the assembled pattern
 */
function url($check, $options = array())
{
    if (!is_array($options))
    {
        trigger_error(__('(Model::url) Parameter options should be an array', true), E_USER_WARNING);
        $options = array();
    }
    $default = array(
        'scheme' => 'https?|ftps?|file|news|gopher',
        'host' => '([0-9]{1,3}\.){3}[0-9]{1,3}',
        'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
        'port' => '(:[0-9]{1,4})?',
        'path' => '[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]',
    );
    $options = am($default, $options);
    // true means "use the default pattern" for that part
    foreach ($default as $key => $pattern)
    {
        if ($options[$key] === true)
        {
            $options[$key] = $pattern;
        }
    }
    // Each part is added only if its option is a non-empty pattern (false skips it)
    $regex = '/^';
    if ($options['scheme'])
    {
        $regex .= '((' . $options['scheme'] . '):\/\/)?';
    }
    $regex .= '(';
    if ($options['host'])
    {
        $regex .= $options['host'] . '|';
    }
    if ($options['subdomain'])
    {
        $regex .= $options['subdomain'];
    }
    // domain and tld are always required
    $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
    if ($options['port'])
    {
        $regex .= $options['port'];
    }
    $regex .= ')(\/?';
    if ($options['path'])
    {
        $regex .= '|\/' . $options['path'];
    }
    $regex .= ')$/i';
    $validation = new Validation();
    $validation->check = $check;
    $validation->regex = $regex;
    return $validation->_check();
}
Options
scheme: Optional, mixed. Regular expression of the schemes (protocols) that will be allowed. If set to false, strings containing a scheme will be invalidated. If true or unset, all schemes (http, https, ftp, ftps, file, news, gopher) will be allowed.
host: Optional, mixed. If set to false, no numerical hosts allowed. If true or unset, all numerical hosts (IPs) will be allowed.
subdomain: Optional, mixed. If set to false, no subdomains will be allowed. If true or unset, an unlimited number of subdomains is allowed.
port: Optional, mixed. If set to false, a port will not be allowed in the string. If true or unset, all ports are accepted.
path: Optional, mixed. If set to false, strings can’t contain folders, a file or a query. If true or unset, all are allowed.
Note: All options can take a custom regex instead of a boolean.
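To make the effect of each option concrete, here is a minimal standalone sketch of the same regex assembly without CakePHP. The helper names build_url_regex() and url_is_valid() are made up for this illustration, PHP's array_merge() stands in for Cake's am(), and preg_match() stands in for the Validation class.

```php
<?php
// Hypothetical helper: assembles the URL pattern from the options described above.
function build_url_regex($options = array())
{
    $default = array(
        'scheme'    => 'https?|ftps?|file|news|gopher',
        'host'      => '([0-9]{1,3}\.){3}[0-9]{1,3}',
        'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
        'port'      => '(:[0-9]{1,4})?',
        'path'      => '[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]',
    );
    $options = array_merge($default, $options);
    foreach ($default as $key => $pattern) {
        if ($options[$key] === true) {
            $options[$key] = $pattern; // true means "use the default pattern"
        }
    }

    $regex = '/^';
    if ($options['scheme']) {
        $regex .= '((' . $options['scheme'] . '):\/\/)?';
    }
    $regex .= '(';
    if ($options['host']) {
        $regex .= $options['host'] . '|';
    }
    if ($options['subdomain']) {
        $regex .= $options['subdomain'];
    }
    // domain and tld are always required
    $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
    if ($options['port']) {
        $regex .= $options['port'];
    }
    $regex .= ')(\/?';
    if ($options['path']) {
        $regex .= '|\/' . $options['path'];
    }
    return $regex . ')$/i';
}

// Hypothetical helper: true if $url matches the pattern built from $options.
function url_is_valid($url, $options = array())
{
    return preg_match(build_url_regex($options), $url) === 1;
}

var_dump(url_is_valid('http://www.example.com/about'));                         // bool(true)
var_dump(url_is_valid('http://www.example.com/about', array('path' => false))); // bool(false)
var_dump(url_is_valid('http://example.com', array('scheme' => false)));         // bool(false)
var_dump(url_is_valid('192.168.0.1', array('host' => false)));                  // bool(false)
var_dump(url_is_valid('http://example.com:8080', array('port' => false)));      // bool(false)
var_dump(url_is_valid('ftp://example.com', array('scheme' => 'https?')));       // bool(false)
```

Setting an option to false simply leaves that part out of the pattern, so any string containing it no longer matches. A custom regex replaces the default fragment for that part only.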
Examples
In my models I can now set different URL validation rules:
var $validate = array(
    'website1' => array(
        'url' => array(
            'allowEmpty' => false,
            'required' => true,
            'rule' => 'url',
            'message' => 'Ad dest_url should be a valid link',
        ),
    ),
    'website2' => array(
        'url' => array(
            'allowEmpty' => false,
            'required' => true,
            'rule' => array('url', array('subdomain' => 'www\.')),
            'message' => 'Ad dest_url should be a valid link',
        ),
    ),
    'ftp' => array(
        'url' => array(
            'allowEmpty' => false,
            'required' => true,
            'rule' => array('url', array('scheme' => 'https?|ftps?')),
            'message' => 'Ad dest_url should be a valid link',
        ),
    ),
);
I submitted it as an enhancement; if the core developers find it useful, maybe they will add it.


Seems like the regex is broken, as it is giving me errors when I perform the validation.
[code]Parse error: syntax error, unexpected '@', expecting ')' in /app/app_model.php on line 58[/code]
The actual line in my file is: [code]'path' => '[w-.,'@?^=%&:;/~+#]*[w-@?^=%&/~+#]',[/code]
Let me know if you can fix it, would love to use it!
@adam: sorry for the delay and thanks for dropping by.
This regex was already implemented in the latest pre-beta release of CakePHP - you don’t need to manually do it anymore.
Do you use the pre-beta version of CakePHP on production sites?
I’m using the latest stable (1.1.18.5850), and it doesn’t look like it supports any of your code here, does it?
@Elliot: The fix was implemented in the pre-beta version. I mention that at the top of the post in the ‘Update’ section - maybe you missed that.
@Jad - Elliot commented on using pre-beta code on a production site. Do you think it’s a good idea? It tends to not be as well tested as it could be
@James: at the pace Cake is being developed, I personally think it’s better to stick with just one version (preferably a stable one) until the next stable is released. Not long ago, the 1.2 stable was released, so pre-beta is already old news