CakePHP URL validation bug fix and enhancement

Posted by Jad on September 23, 2007

Update (25/10/2007): The bug appears to have been fixed in the latest pre-beta release. Looks like this was the correct regex.

When I started parsing the millions of Google pages I had scraped, I came across a bug in CakePHP's URL validation. To fix it, I overrode the url validation method in app_model.php:

   function url($check)
   {
      $validation =& new Validation;
      $validation->check = $check;
      $validation->regex = '/^((https?|ftps?|file|news|gopher):\/\/)?'  //protocol
                            . '('
                              . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' //ip 199.194.52.184
                              . '|' //ip or domain
                              . '([0-9a-z]{1}[0-9a-z\-]*\.)*' //subdomain(s) www.
                              . '([0-9a-z]{1}[0-9a-z\-]{0,56})\.' //domain
                              . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})' //tld
                              . '(:[0-9]{1,4})?' //port
                           . ')'
                           . '('
                              . '\/?|' //ending-slash
                              . '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]' //path
                           . ')$/i';
      return $validation->_check();
   }
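To try the pattern outside of CakePHP, here is a minimal standalone sketch that feeds the same regex to preg_match(). The function name looks_like_url() is made up for this example, and the sample URLs are only illustrations; it is not how the framework itself runs the check.

```php
<?php
// Standalone sketch: the pattern from the url() override above, run through
// preg_match() instead of CakePHP's Validation class.
// looks_like_url() is a hypothetical name used only for this example.
function looks_like_url($check)
{
    $regex = '/^((https?|ftps?|file|news|gopher):\/\/)?'   // protocol
           . '('
           .   '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' // ip
           .   '|'                                          // ip or domain
           .   '([0-9a-z]{1}[0-9a-z\-]*\.)*'                // subdomain(s)
           .   '([0-9a-z]{1}[0-9a-z\-]{0,56})\.'            // domain
           .   '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})'          // tld
           .   '(:[0-9]{1,4})?'                             // port
           . ')'
           . '('
           .   '\/?|'                                       // ending slash
           .   '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]' // path
           . ')$/i';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    return preg_match($regex, $check) === 1;
}

var_dump(looks_like_url('http://www.example.com/index.php?id=1')); // bool(true)
var_dump(looks_like_url('199.194.52.184'));                        // bool(true)
var_dump(looks_like_url('http://example.com:8080/'));              // bool(true)
var_dump(looks_like_url('http://example'));                        // bool(false)
var_dump(looks_like_url('not a url'));                             // bool(false)
```

One thing this makes easy to see: the port part accepts at most four digits, so a URL such as http://example.com:65535 does not validate with this pattern.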

Advanced URL validation

The more I looked at the different kinds of URLs the app needs to validate, the more I realized that this regex, as complete as it is, wouldn’t do the job for every URL validation. I decided to break it apart and write a more advanced url validation that accepts options controlling which parts are allowed: numeric host or not, scheme(s), port(s), subdomain(s) and path(s).

   /**
    * Validates a URL.
    *
    * @param   string   $check    string to check
    * @param   array    $options  options for allowing different url parts
    * @return  boolean  true if the string matches the assembled regex
    */
   function url($check, $options = array())
   {
      if (!is_array($options))
      {
         trigger_error(__('(Model::url) Parameter options should be an array', true), E_USER_WARNING);
         $options = array();
      }

      $default = array(
                     'scheme' => 'https?|ftps?|file|news|gopher',
                     'host' => '([0-9]{1,3}\.){3}[0-9]{1,3}',
                     'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
                     'port' => '(:[0-9]{1,4})?',
                     'path' => '[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]',
                     );
      $options = am($default, $options);

      // an option set to true means "use the default pattern"
      foreach ($default as $key => $pattern)
      {
         if ($options[$key] === true)
         {
            $options[$key] = $pattern;
         }
      }

      // each part is only added when its option is a regex string (false skips it)
      $regex = '/^';
      if (is_string($options['scheme']))
      {
         $regex .= '((' . $options['scheme'] . '):\/\/)?';
      }
      $regex .= '(';
      if (is_string($options['host']))
      {
         $regex .= $options['host'] . '|';
      }
      if (is_string($options['subdomain']))
      {
         $regex .= $options['subdomain'];
      }
      $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
      if (is_string($options['port']))
      {
         $regex .= $options['port'];
      }
      $regex .= ')(\/?';
      if (is_string($options['path']))
      {
         $regex .= '|\/' . $options['path'];
      }
      $regex .= ')$/i';

      $validation =& new Validation;

      $validation->check = $check;
      $validation->regex = $regex;
      return $validation->_check();
   }

Options

scheme: Optional, mixed. Regular expression of the schemes (protocols) that will be allowed. If set to false, strings containing a scheme will be invalidated. If true or unset, all schemes (http, https, ftp, ftps, file, news, gopher) will be allowed.

host: Optional, mixed. If set to false, numerical hosts (IPs) are not allowed. If true or unset, all numerical hosts will be allowed.

subdomain: Optional, mixed. If set to false, no subdomains will be allowed. If true or unset, any number of subdomains is allowed.

port: Optional, mixed. If set to false, a port will not be allowed in the string. If true or unset, all ports are accepted.

path: Optional, mixed. If set to false, strings can’t contain folders, a file or a query. If true or unset, all are allowed.

Note: All options can take a custom regex instead of a boolean.
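
To see what the options do to the final pattern, here is a rough standalone sketch that assembles the regex the same way but without CakePHP. The names build_url_regex() and url_matches() are invented for this example, and array_merge() stands in for Cake's am() helper; treat it as an illustration under those assumptions rather than the framework's code.

```php
<?php
// Hypothetical helper: assembles the pattern like the url() method does,
// using array_merge() in place of CakePHP's am().
function build_url_regex($options = array())
{
    $default = array(
        'scheme'    => 'https?|ftps?|file|news|gopher',
        'host'      => '([0-9]{1,3}\.){3}[0-9]{1,3}',
        'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
        'port'      => '(:[0-9]{1,4})?',
        'path'      => '[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]',
    );
    $options = array_merge($default, $options);

    // true means "use the default pattern" for that part
    foreach ($default as $key => $pattern) {
        if ($options[$key] === true) {
            $options[$key] = $pattern;
        }
    }

    // a part is only added when its option is a regex string; false skips it
    $regex = '/^';
    if (is_string($options['scheme'])) {
        $regex .= '((' . $options['scheme'] . '):\/\/)?';
    }
    $regex .= '(';
    if (is_string($options['host'])) {
        $regex .= $options['host'] . '|';
    }
    if (is_string($options['subdomain'])) {
        $regex .= $options['subdomain'];
    }
    $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
    if (is_string($options['port'])) {
        $regex .= $options['port'];
    }
    $regex .= ')(\/?';
    if (is_string($options['path'])) {
        $regex .= '|\/' . $options['path'];
    }
    return $regex . ')$/i';
}

// Hypothetical wrapper so the examples below read like the validation rule
function url_matches($check, $options = array())
{
    return preg_match(build_url_regex($options), $check) === 1;
}

var_dump(url_matches('http://example.com:8080/'));                             // bool(true)
var_dump(url_matches('http://example.com:8080/', array('port' => false)));     // bool(false)
var_dump(url_matches('ftp://example.com', array('scheme' => 'https?')));       // bool(false)
var_dump(url_matches('http://blog.example.com', array('subdomain' => false))); // bool(false)
var_dump(url_matches('http://example.com/a/b', array('path' => false)));       // bool(false)
var_dump(url_matches('192.168.0.1', array('host' => false)));                  // bool(false)
```

Each false simply leaves that piece out of the regex, so a string containing it no longer matches, while a custom string replaces the default piece.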

Examples

In my models I can now set different URL validation rules:

   var $validate = array(
                        'website1' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   'rule' => 'url',
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        'website2' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   // custom regex: only an optional "www." subdomain
                                                   'rule' => array('url', array('subdomain' => '(www\.)?')),
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        'ftp' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   // restrict the allowed schemes
                                                   'rule' => array('url', array('scheme' => 'https?|ftps?')),
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        );

I submitted it as an enhancement; if the core developers find it useful, maybe they will add it.

Comments

  1. Adam D Sat, 24 Nov 2007 09:48:14 EST

    Seems like the regEx is broken, as it is giving me errors when I perform the validation.

    [code]Parse error: syntax error, unexpected '@', expecting ')' in /app/app_model.php on line 58[/code]

    The actual line in my file is: [code]'path' => '[w-.,'@?^=%&:;/~+#]*[w-@?^=%&/~+#]',[/code]

    Let me know if you can fix it, would love to use it!

  2. Jad Fri, 30 Nov 2007 13:40:28 EST

    @adam: sorry for the delay and thanks for dropping by.

    This regex was already implemented in the latest pre-beta release of CakePHP - you don’t need to manually do it anymore.

  3. Elliot Sun, 09 Dec 2007 04:02:37 EST

    Do you use the pre-beta version of CakePHP on production sites?

    I’m using the latest stable (1.1.18.5850), and it doesn’t look like it supports any of your code here, does it?

  4. Jad Mon, 10 Dec 2007 03:18:09 EST

    @Elliot: The fix was implemented in the pre-beta version. I mention that at the top of the post in the ‘Update’ section - maybe you missed that ;)

  5. James Sat, 23 Feb 2008 23:49:27 EST

    @Jad - Elliot commented on using pre-beta code on a production site. Do you think it’s a good idea? It tends to not be as well tested as it could be ;)

  6. Jad Mon, 25 Feb 2008 20:20:53 EST

    @James: at the pace Cake is being developed, I personally think it’s better to stick with just one version (preferably a stable one) until the next stable is released. Not long ago, the 1.2 stable was released, so pre-beta is already old news ;)