CakePHP URL validation bug fix and enhancement

Update (25/10/2007): The bug appears to have been fixed in the latest pre-beta release, so it looks like this was the correct regex.

When I started parsing the millions of Google pages I had scraped, I came across a bug in the URL validation. To fix it, I overrode the url validation method in app_model.php:

   function url($check)
   {
      $validation = new Validation();
      $validation->check = $check;
      $validation->regex = '/^((https?|ftps?|file|news|gopher):\/\/)?'  //protocol
                            . '('
                              . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' //ip 199.194.52.184
                              . '|' //ip or domain
                              . '([0-9a-z]{1}[0-9a-z\-]*\.)*' //subdomain(s) www.
                              . '([0-9a-z]{1}[0-9a-z\-]{0,56})\.' //domain
                              . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})' //tld
                              . '(:[0-9]{1,4})?' //port
                           . ')'
                           . '('
                              . '\/?|' //ending-slash
                              . '\/[\w\-\.,\'@?^=%&:;\/~\+#]*[\w\-\@?^=%&\/~\+#]' //path
                           . ')$/i';
      return $validation->_check();
   }
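
To try the pattern outside of CakePHP, here is a minimal standalone sketch. It assumes nothing beyond PHP's built-in preg_match and writes out the backslash escapes explicitly; the helper name looksLikeUrl and the sample URLs are made up for illustration and are not part of CakePHP.

```php
<?php
// Hypothetical helper: applies the pattern directly with preg_match
// instead of going through CakePHP's Validation class.
function looksLikeUrl($check)
{
    $regex = '/^((https?|ftps?|file|news|gopher):\/\/)?'   // protocol
           . '('
           . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}'
           . '(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)'  // ip
           . '|'
           . '([0-9a-z]{1}[0-9a-z\-]*\.)*'                 // subdomain(s)
           . '([0-9a-z]{1}[0-9a-z\-]{0,56})\.'             // domain
           . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})'           // tld
           . '(:[0-9]{1,4})?'                              // port
           . ')'
           . '('
           . '\/?|'                                        // ending slash
           . '\/[\w\-\.,\'@?^=%&:;\/~\+#]*[\w\-\@?^=%&\/~\+#]' // path
           . ')$/i';

    // preg_match returns 1 on a match, 0 on no match, false on error
    return preg_match($regex, $check) === 1;
}

var_dump(looksLikeUrl('http://www.example.com/path/page.html?x=1')); // bool(true)
var_dump(looksLikeUrl('192.168.1.1'));                               // bool(true)
var_dump(looksLikeUrl('not a url'));                                 // bool(false)
```

Note that the scheme is optional in this pattern, so a bare host such as example.com:8080 is accepted as well.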

Advanced URL validation

The more I looked at the different kinds of URLs the app needs to validate, the more I realized that this regex, as complete as it is, wouldn’t do the job for every URL validation. I decided to break it apart and build a more advanced URL validation that accepts options controlling whether a numeric host is allowed and which scheme(s), port(s), subdomain(s) and path(s) to allow.

   /**
    * Validates a URL.
    *
    * @param   string   $check     string to check
    * @param   array    $options   options for allowing different URL parts
    * @return  boolean  true if the string matches the assembled pattern
    */
   function url($check, $options = array())
   {
      if (!is_array($options))
      {
         trigger_error(__('(Model::url) Parameter options should be an array', true), E_USER_WARNING);
         $options = array();
      }

      $default = array(
                     'scheme' => 'https?|ftps?|file|news|gopher',
                     'host' => '([0-9]{1,3}\.){3}[0-9]{1,3}',
                     'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
                     'port' => '(:[0-9]{1,4})?',
                     'path' => '[\w\-\.,\'@?^=%&:;\/~\+#]*[\w\-\@?^=%&\/~\+#]',
                     );
      $options = am($default, $options);

      // true means "use the default pattern"; without this, true would be
      // concatenated into the regex as the string "1"
      foreach ($default as $part => $pattern)
      {
         if ($options[$part] === true)
         {
            $options[$part] = $pattern;
         }
      }

      $regex = '/^';
      if (is_string($options['scheme']) && $options['scheme'] !== '')
      {
         $regex .= '((' . $options['scheme'] . '):\/\/)?';
      }
      $regex .= '(';
      if (is_string($options['host']) && $options['host'] !== '')
      {
         $regex .= $options['host'] . '|';
      }
      if (is_string($options['subdomain']) && $options['subdomain'] !== '')
      {
         $regex .= $options['subdomain'];
      }
      // domain and tld are always required
      $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
      if (is_string($options['port']) && $options['port'] !== '')
      {
         $regex .= $options['port'];
      }
      $regex .= ')(\/?';
      if (is_string($options['path']) && $options['path'] !== '')
      {
         $regex .= '|\/' . $options['path'];
      }
      $regex .= ')$/i';

      $validation = new Validation();
      $validation->check = $check;
      $validation->regex = $regex;
      return $validation->_check();
   }

Options

scheme: Optional, mixed. Regular expression matching the schemes (protocols) to allow. If set to false, strings containing a scheme are rejected. If true or unset, the default schemes (http, https, ftp, ftps, file, news, gopher) are allowed.

host: Optional, mixed. If set to false, numerical hosts (IPs) are not allowed. If true or unset, all numerical hosts are allowed.

subdomain: Optional, mixed. If set to false, no subdomains are allowed. If true or unset, any number of subdomains is allowed.

port: Optional, mixed. If set to false, a port is not allowed in the string. If true or unset, all ports are accepted.

path: Optional, mixed. If set to false, strings can’t contain folders, a file, or a query. If true or unset, all are allowed.

Note: Every option can take a custom regex instead of a boolean.
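
To see how the options change what gets accepted, here is a rough standalone sketch of the same assembly logic. It is not the model method itself: buildUrlRegex and urlMatches are hypothetical names, array_merge stands in for CakePHP's am(), and preg_match stands in for the Validation class. It also assumes that true should fall back to the default pattern, as the option descriptions say.

```php
<?php
// Hypothetical helper: assembles the URL pattern from the option set.
function buildUrlRegex($options = array())
{
    $default = array(
        'scheme'    => 'https?|ftps?|file|news|gopher',
        'host'      => '([0-9]{1,3}\.){3}[0-9]{1,3}',
        'subdomain' => '([0-9a-z]{1}[0-9a-z\-]*\.)*',
        'port'      => '(:[0-9]{1,4})?',
        'path'      => '[\w\-\.,\'@?^=%&:;\/~\+#]*[\w\-\@?^=%&\/~\+#]',
    );
    $options = array_merge($default, $options);

    // true => default pattern, false (or any non-string) => part disabled
    foreach ($default as $part => $pattern) {
        if ($options[$part] === true) {
            $options[$part] = $pattern;
        } elseif (!is_string($options[$part])) {
            $options[$part] = '';
        }
    }

    $regex = '/^';
    if ($options['scheme'] !== '') {
        $regex .= '((' . $options['scheme'] . '):\/\/)?';
    }
    $regex .= '(';
    if ($options['host'] !== '') {
        $regex .= $options['host'] . '|';
    }
    $regex .= $options['subdomain'];
    // domain and tld are always required
    $regex .= '([0-9a-z]{1}[0-9a-z\-]{0,56})\.([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})';
    $regex .= $options['port'];
    $regex .= ')(\/?';
    if ($options['path'] !== '') {
        $regex .= '|\/' . $options['path'];
    }
    return $regex . ')$/i';
}

// Hypothetical helper: true when $check matches the assembled pattern.
function urlMatches($check, $options = array())
{
    return preg_match(buildUrlRegex($options), $check) === 1;
}

var_dump(urlMatches('http://www.example.com/page'));                      // bool(true)
var_dump(urlMatches('http://example.com', array('scheme' => false)));     // bool(false)
var_dump(urlMatches('192.168.1.1', array('host' => false)));              // bool(false)
var_dump(urlMatches('example.com/page', array('path' => false)));         // bool(false)
var_dump(urlMatches('mail.example.com', array('subdomain' => 'www\.')));  // bool(false)
```

Because each option is spliced into the pattern as-is, a custom value has to be a valid regex fragment with its own escaping, which is why the subdomain above is written as 'www\.' and not 'www.'.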

Examples

In my models I can now set different URL validation rules:

   var $validate = array(
                        'website1' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   'rule' => 'url',
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        'website2' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   'rule' => array('url', array('subdomain' => 'www\.')),
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        'ftp' => array(
                                       'url' => array(
                                                   'allowEmpty' => false,
                                                   'required' => true,
                                                   'rule' => array('url', array('scheme' => 'https?|ftps?')),
                                                   'message' => 'Ad dest_url should be a valid link',
                                                   ),
                                       ),
                        );

I submitted it as an enhancement; if the core developers find it useful, maybe they will add it.

6 Responses to “CakePHP URL validation bug fix and enhancement”

  1. Adam D Says:

    Seems like the regEx is broken, as it is giving me errors when I perform the validation.

    [code]Parse error: syntax error, unexpected '@', expecting ')' in /app/app_model.php on line 58[/code]

    The actual line in my file is: [code]'path' => '[w-.,'@?^=%&:;/~+#]*[w-@?^=%&/~+#]',[/code]

    Let me know if you can fix it, would love to use it!

  2. Jad Says:

    @adam: sorry for the delay and thanks for dropping by.

    This regex was already implemented in the latest pre-beta release of CakePHP - you don’t need to manually do it anymore.

  3. Elliot Says:

    Do you use the pre-beta version of CakePHP on production sites?

    I’m using the latest stable (1.1.18.5850), and it doesn’t look like it supports any of your code here, does it?

  4. Jad Says:

    @Elliot: The fix was implemented in the pre-beta version. I mention that at the top of the post in the ‘Update’ section - maybe you missed that ;)

  5. James Says:

    @Jad - Elliot commented on using pre-beta code on a production site. Do you think it’s a good idea? It tends to not be as well tested as it could be ;)

  6. Jad Says:

    @James: at the pace Cake is being developed, I personally think it’s better to stick with just one version (preferably a stable one) until the next stable is released. Not long ago, the 1.2 stable was released, so pre-beta is already old news ;)
