CakePHP URL validation bug fix and enhancement
Update: 25/10/2007 Bug appears to have been fixed in the latest pre-beta release. Looks like this was the correct regex.
When I started parsing the millions of Google pages I had scraped, I came across a bug in the URL validation. To fix it, I overrode the url validation method in app_model.php:
function url($check)
{
    $validation =& new Validation;
    $validation->check = $check;
    $validation->regex = '/^((https?|ftps?|file|news|gopher):\/\/)?' //protocol
        . '('
        . '(?:(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)\.){3}(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)' //ip 199.194.52.184
        . '|' //ip or domain
        . '([0-9a-z]{1}[0-9a-z-]*\.)*' //subdomain(s) www.
        . '([0-9a-z]{1}[0-9a-z-]{0,56})\.' //domain
        . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})' //tld
        . '(:[0-9]{1,4})?' //port
        . ')'
        . '('
        . '\/?|' //ending-slash
        . '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]' //path
        . ')$/i';
    return $validation->check();
}
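To try the pattern outside CakePHP, a minimal sketch with plain preg_match() could look like the following. The helper name isValidUrl() is hypothetical, and the pattern is a reconstruction of the regex above with its backslashes and quantifiers written out, so treat it as an approximation rather than the framework's own code.

```php
<?php
// Hypothetical helper (not part of CakePHP): checks a URL against a
// pattern modelled on the regex above.
function isValidUrl($url)
{
    // One IPv4 octet, 0-255.
    $octet = '(?:25[0-5]|2[0-4]\d|(?:(?:1\d)?|[1-9]?)\d)';

    $regex = '/^((https?|ftps?|file|news|gopher):\/\/)?'  // protocol
        . '('
        . '(?:' . $octet . '\.){3}' . $octet              // ip, e.g. 199.194.52.184
        . '|'                                             // ip or domain
        . '([0-9a-z][0-9a-z-]*\.)*'                       // subdomain(s), e.g. www.
        . '([0-9a-z][0-9a-z-]{0,56})\.'                   // domain
        . '([a-z]{2,6}|[a-z]{2}\.[a-z]{2,6})'             // tld
        . '(:[0-9]{1,4})?'                                // port
        . ')'
        . '('
        . '\/?|'                                          // ending slash
        . '\/[\w\-.,\'@?^=%&:;\/~+#]*[\w\-@?^=%&\/~+#]'   // path
        . ')$/i';

    return preg_match($regex, $url) === 1;
}

var_dump(isValidUrl('http://www.example.com/path/to/page.html?x=1'));
var_dump(isValidUrl('not a url'));
```

Keeping the check in a plain function like this makes it easy to run a list of known-good and known-bad URLs through the pattern before wiring it into the model.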
Continue reading…
CakePHP’s advanced model fields validation
After checking different blogs and tutorials, the bakery, API and IRC channel, it was obvious that some kind of documentation for the validation methods available in the Model was necessary. I can’t say that I will fulfill this mission but I’ll at least share what I came up with for future reference.
Here is the ‘User’ model I will be using in my example:
class User extends AppModel
{
    var $name = 'User'; //optional
    var $validate = array(
        'username' => array(
            array(
                'allowEmpty' => false,
                'required' => true,
                'rule' => 'alphaNumeric',
                'message' => 'Username should only contain alpha-numeric characters.',
            ),
            array(
                'rule' => array('between', 3, 10),
                'message' => 'Username should be between 3 and 10 characters long.',
            ),
            array(
                'rule' => 'isUnique',
                'message' => 'Username is already in use.',
            ),
        ),
        'passwd' => array(
            'alphaNumeric' => array(
                'allowEmpty' => false,
                'required' => true,
                'rule' => 'alphaNumeric',
                'message' => 'Password should only contain alpha-numeric characters.',
            ),
            'validLength' => array(
                'rule' => array('between', 3, 10),
                'message' => 'Password should be between 3 and 10 characters long.',
            ),
        ),
        'website' => array(
            array(
                'rule' => 'url',
                'on' => 'update',
                'message' => 'Invalid URL.',
            ),
        ),
        'agree_tos' => array(
            array(
                'allowEmpty' => false,
                'required' => true,
                'on' => 'create',
            ),
        ),
    );
}
That’s a lot of validation rules, I know - I just wanted to try covering the multiple ways of using the Model->validates() method.
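To make the idea of several rules per field more concrete, here is a small standalone sketch that walks a rule list shaped like the 'username' entry above. The functions ruleAlphaNumeric(), ruleBetween() and validateFields() are hypothetical stand-ins written for this example; they are not CakePHP's implementation, and stopping at the first failing rule is a simplification chosen here, not a statement about how the framework orders or short-circuits rules.

```php
<?php
// Stand-ins for two of the rules used above; not CakePHP's own code.
function ruleAlphaNumeric($value)
{
    return preg_match('/^[a-z0-9]+$/i', $value) === 1;
}

function ruleBetween($value, $min, $max)
{
    $length = strlen($value);
    return $length >= $min && $length <= $max;
}

// Checks each field against its rules and keeps the message of the
// first rule that fails for that field.
function validateFields(array $data, array $rules)
{
    $errors = array();
    foreach ($rules as $field => $fieldRules) {
        $value = isset($data[$field]) ? $data[$field] : '';
        foreach ($fieldRules as $rule) {
            // 'alphaNumeric' or array('between', 3, 10)
            $args = (array) $rule['rule'];
            $name = 'rule' . ucfirst(array_shift($args));
            array_unshift($args, $value);
            if (!call_user_func_array($name, $args)) {
                $errors[$field] = $rule['message'];
                break;
            }
        }
    }
    return $errors;
}

$rules = array(
    'username' => array(
        array(
            'rule' => 'alphaNumeric',
            'message' => 'Username should only contain alpha-numeric characters.',
        ),
        array(
            'rule' => array('between', 3, 10),
            'message' => 'Username should be between 3 and 10 characters long.',
        ),
    ),
);

$errors = validateFields(array('username' => 'jo'), $rules);
print_r($errors);
```

Here 'jo' passes the alpha-numeric rule but is too short, so only the length message ends up in $errors.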
Continue reading…
Domain TLD Parser
Parsing URLs in PHP isn’t perfect. Don’t get me wrong here, parse_url() does the job when it comes to breaking the URL into logical parts, but it has no option to split the host into domain name, TLD and sub-domain(s). Most probably that’s because new TLDs come out from time to time and the developers want to avoid updating that same function with every new TLD release.
To overcome this limitation, and because I needed some way of extracting the domain, sub-domain and TLD out of each given URL, I came up with the following class: Domain TLD Parser
It parses hosts with all kinds of different TLDs, even the country-specific ones like ‘.co.za’, ‘.ne.jp’ or ‘.ltd.uk’. Here is an example:
<?php
$url = $_SERVER['HTTP_REFERER'];
include('/path/to/domaintldparser.class.php');
$domain = new DomainTldParser;
echo '<pre>';
print_r($domain->parse($url));
echo '</pre>';
?>
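The general idea of splitting a host can be sketched without the class. The example below is not the Domain TLD Parser itself: splitHost() is a hypothetical function, and its list of multi-part TLDs is deliberately tiny and only there for illustration, whereas a real parser needs a complete, maintained list.

```php
<?php
// Hypothetical sketch: splits a host into sub-domain, domain and TLD.
// $multiPartTlds is an illustrative, incomplete list.
function splitHost($host, array $multiPartTlds = array('co.za', 'ne.jp', 'ltd.uk'))
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    if ($count < 2) {
        return null; // a single label has no TLD to split off
    }

    // Treat the last two labels as the TLD when they are in the list.
    $lastTwo = $labels[$count - 2] . '.' . $labels[$count - 1];
    $tldLabels = ($count >= 3 && in_array($lastTwo, $multiPartTlds, true)) ? 2 : 1;

    return array(
        'subdomain' => implode('.', array_slice($labels, 0, $count - $tldLabels - 1)),
        'domain'    => $labels[$count - $tldLabels - 1],
        'tld'       => implode('.', array_slice($labels, -$tldLabels)),
    );
}

// parse_url() extracts the host; splitHost() breaks it down further.
$host = parse_url('http://www.example.co.za/some/page', PHP_URL_HOST);
$parts = splitHost($host);
print_r($parts);
```

The lookup list is the part that needs maintenance over time, which is probably why PHP itself leaves this job to userland code.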
DocBlock and svn:keywords
In the application’s coding conventions (which I am almost done writing), I took the time to elaborate on code documentation and the use of phpDocumentor. Among the things discussed is the DocBlock, the header template of each PHP file, which looks something like this:
/**
 * Short description for file.
 *
 * Long description for file
 *
 * PHP 5
 *
 * Copyright (c) 2007, Company Name
 * Street address
 * City, State, Zip
 *
 *
 * @filesource $HeadURL$
 * @copyright Copyright (c) 2007, Company Name
 * @link http://www.companywebsite.com CompanyName
 * @package #### PACKAGE NAME ####
 * @subpackage #### SUBPACKAGE NAME ####
 * @since #.#.# //Correct version number as needed
 * @version $Revision$
 * @author Your Name
 * @modifiedby $LastChangedBy$
 * @lastmodified $Date$
 */
Now you might be asking yourself what all those $HeadURL$, $Revision$, etc. are. Those are ‘keywords’ for Subversion which can be dynamically updated on every commit. By default, Subversion doesn’t substitute those keywords, but you can easily enable that directly from the shell using:
$ svn propset --recursive svn:keywords 'HeadURL Revision LastChangedBy Date' /path/to/repo
Or, in case you are using TortoiseSVN, from the right-click menu of your repository’s folder: TortoiseSVN > Properties > Add. You can then enter ‘svn:keywords’ in the ‘property name’ field and ‘HeadURL Revision LastChangedBy Date’ in the ‘property value’ field. Don’t forget to check ‘apply property recursively’; otherwise, make sure you are setting the property on a file, not a directory.
From the svn propset help shell command:
The svn:keywords, svn:executable, svn:eol-style, svn:mime-type and svn:needs-lock properties cannot be set on a directory. A non-recursive attempt will fail, and a recursive attempt will set the property only on the file children of the directory.
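Once the property is set, Subversion expands a keyword such as $Revision$ in the working copy into a form like '$Revision: 144 $'. If you ever need that value inside PHP (for a version string, say), a small sketch like the one below can pull it out. The function name svnKeywordValue() and the sample values are hypothetical.

```php
<?php
// Hypothetical helper: extracts the value from an expanded Subversion
// keyword such as '$Revision: 144 $'. Returns null when the keyword has
// not been expanded yet (e.g. '$Revision$').
function svnKeywordValue($keyword)
{
    if (preg_match('/^\$\w+:\s*(.*?)\s*\$$/', $keyword, $matches)) {
        return $matches[1];
    }
    return null;
}

// Sample values for illustration only.
var_dump(svnKeywordValue('$Revision: 144 $'));
var_dump(svnKeywordValue('$Revision$'));
```

Single quotes matter here: in a double-quoted PHP string the dollar signs could be read as variables.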
SELECT DISTINCT in CakePHP
Even though CakePHP’s model already includes many of the database query functions, I found that SELECT DISTINCT was missing. Ok, I know that you can always do it using either Model->query('SELECT DISTINCT c1, c2') or Model->findAll(null, 'DISTINCT c1, c2'), but that would be like saying use Model->query() instead of Model->findAll().
The cool thing in CakePHP is that you can add your own functions to use in your app on top of the ones that come bundled in the core. For the Model, you just create an ‘app_model.php’ file that you place in your app’s main folder. The empty file should look like this:
/**
* Custom AppModel that adds functionality to the core Model
*/
class AppModel extends Model
{
    //empty
}
Now inside your new AppModel class, add the following function:
/**
 * Returns a resultset array with DISTINCT fields from database matching given conditions.
 *
 * @param mixed $conditions SQL conditions as a string or as an array('field' =>'value',...)
 * @param mixed $fields Either a single string of a field name, or an array of field names
 * @return array Array of records
 */
function findDistinct($conditions = null, $fields = null)
{
    $db =& ConnectionManager::getDataSource($this->useDbConfig);
    $str = 'DISTINCT ';
    if (!is_array($fields))
    {
        $str .= '`' . $fields . '`';
    }
    else
    {
        foreach ($fields as $field)
        {
            $str .= '`' . $field . '`, ';
        }
        $str = substr($str, 0, -2);
    }
    $queryData = array(
        'conditions' => $conditions,
        'fields' => $str,
    );
    $data = $db->read($this, $queryData, false);
    return $data;
}
You can now use Model->findDistinct('c1') or Model->findDistinct(array('c1', 'c2', 'c3')) to retrieve DISTINCT column values.
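The field-string part of the function can be checked in isolation, without a database. The sketch below is a standalone variant of that step; distinctFields() is a hypothetical name, and stripping stray backticks from field names is an extra precaution added here that is not in the method above.

```php
<?php
// Hypothetical helper: builds the "DISTINCT `c1`, `c2`" fragment from a
// single field name or an array of field names.
function distinctFields($fields)
{
    $quoted = array();
    foreach ((array) $fields as $field) {
        // Drop backticks inside the name so the quoting cannot be broken.
        $quoted[] = '`' . str_replace('`', '', $field) . '`';
    }
    return 'DISTINCT ' . implode(', ', $quoted);
}

echo distinctFields('c1'), "\n";
echo distinctFields(array('c1', 'c2', 'c3')), "\n";
```

Casting to an array lets one loop handle both the string and the array case, and implode() removes the need to trim a trailing comma.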
Hoping you find it useful.
nusoap class or SOAP extension?
Back when the web app was to be hosted on the VPS (using PHP4), I had started coding some of the scrapers and parsers to retrieve data from different affiliate networks. After moving to the new servers and setting them up with the latest stable versions of PHP and MySQL, it was time to clean up, optimize and document my code. But no: before I could start doing that, while I was still testing the code to refresh my memory on the different methods available within the DtSoap class, I started getting errors!
nusoap
My code required nusoap to communicate with the networks’ WSDL. However, just like nusoap, PHP5 includes a class named ‘soapclient’, which caused the conflict. The solution was simple: rename the nusoap class (and its constructor) to ‘nusoapclient’, then change my code to call the new name instead of the old one. That was the lazy guy in me reasoning. As soon as the geek kicked in, you guessed it, or maybe not: I decided to update the code and use the built-in soap extension available in PHP5. Continue reading…
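For reference, a call through PHP5's built-in extension can be sketched roughly as follows. This is not the DtSoap class: callService() is a hypothetical wrapper, and the WSDL URL, method name and parameters in the commented usage are placeholders, not a real affiliate network API.

```php
<?php
// Hypothetical wrapper around PHP's built-in SoapClient.
// $wsdlUrl, $method and $params are supplied by the caller.
function callService($wsdlUrl, $method, array $params)
{
    if (!class_exists('SoapClient')) {
        throw new RuntimeException('The PHP soap extension is not loaded.');
    }
    // 'exceptions' => true makes SOAP errors surface as SoapFault exceptions.
    $client = new SoapClient($wsdlUrl, array('exceptions' => true));
    return $client->__soapCall($method, array($params));
}

// Usage sketch (placeholder values, needs network access and a real WSDL):
// try {
//     $report = callService('http://api.example.com/service?wsdl', 'getReport', array('day' => '2007-10-25'));
// } catch (SoapFault $fault) {
//     echo 'SOAP error: ', $fault->getMessage();
// }
```

Because the class ships with PHP, there is nothing to rename and no third-party file to include, which removes the name clash altogether.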
Data Manipulation 101
One of the application’s core functionalities is monitoring changes on different sources. Some of them have some kind of web service available while others don’t. By monitoring I mean getting, validating, storing and comparing the data over time. For that, different data manipulations are required:
- fetching
- scraping
- parsing
- cleansing
- storing
- mining
The sandbox environment I am using:
And now, on with the show. Next up, Data Scraping.
Configuring CakePHP for easy deployment
Whenever you are developing an application, you are normally using at least 2 environments: development and production. Depending on how big your app and database are, deployment may become a long list of to-dos in order to have everything set up correctly (database changes, code merges, Apache configuration, etc.). When you want to do things the right way, you usually have a 3rd environment, staging. Yes, another almost identical to-do list.
Now imagine you are developing in a team with each member working on a different part of the application in individual branches on the repository. New features are only merged with the team’s branch (development) after they are tested individually. So now, you have to deploy from individual to development, then to staging before finally pushing updates to production.
This time-consuming process is definitely not the best solution, not to mention that it is bound to break with any mistake made while reconfiguring everything. Today, we will automate CakePHP’s configuration for easy deployment. Continue reading…
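One common way to approach this kind of automation is to pick the settings from the host name the code is running on. The sketch below only illustrates that general idea under stated assumptions and is not necessarily the method used in the full post; the function names, host names and database settings are all hypothetical.

```php
<?php
// Hypothetical mapping from host name to environment name.
function detectEnvironment($host)
{
    $map = array(
        'www.example.com'     => 'production',
        'staging.example.com' => 'staging',
    );
    $host = strtolower($host);
    // Anything unknown (localhost, a developer's machine) is development.
    return isset($map[$host]) ? $map[$host] : 'development';
}

// Hypothetical per-environment database settings.
function databaseConfigFor($environment)
{
    $configs = array(
        'development' => array('host' => 'localhost', 'database' => 'app_dev'),
        'staging'     => array('host' => 'localhost', 'database' => 'app_staging'),
        'production'  => array('host' => 'db.example.com', 'database' => 'app'),
    );
    return $configs[$environment];
}

// SERVER_NAME is not set on the command line, so fall back to localhost.
$host = isset($_SERVER['SERVER_NAME']) ? $_SERVER['SERVER_NAME'] : 'localhost';
$config = databaseConfigFor(detectEnvironment($host));
print_r($config);
```

With something like this in place, the same code base can move from development to staging to production without anyone editing a configuration file by hand.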
Detailed documentation - to the rescue!
Having worked by myself on the vast majority of my coding projects, I never realized the necessity of having lots of documentation. Good documentation takes time to write and time is something I happen to always be short on. I am not only talking about code commenting here, but rather all kinds of documentation: coding conventions, database structure, choice of configuration, etc.
When this project was started, I only had one other person involved in the coding part and it was mostly for some outside classes we needed. With the growing code needs (all the new features, etc.), I believed it would be wise to add a new person to the team. Given my past experience with site/script development, which I had never planned to hand over to other coders, I was expecting the worst when it came to explaining everything the web application is about: features, users, structure, etc. It definitely always sounds much simpler when you are the author/creator, and when you are also the end-user, things become ridiculously easy to understand and put together - which is unfortunately not the case for someone who has never heard of or used anything similar before. Continue reading…
Hardware, connectivity and load estimation
When developing a web application, you can’t only think of software (programming language, operating system, etc.), you also have to think of hardware, connectivity and load management.
In one of my early posts, I said that I would leave the web app on a VPS - I rather quickly understood that I was totally wrong to think it could handle the load, which is related not only to the number of users or the app’s success but also to the amount of data tracked, the system’s core functionality.
As a matter of fact, adding more features certainly added more value but it definitely added more overhead too. So why don’t I just scratch those new features out, at least for now? After all, one of the books I believe to be right on many points, Getting Real (by 37Signals), repeats that over and over again. While I can’t deny that much of the advice found in that book will prove to be very beneficial, sometimes you need to put aside a couple of general rules and follow your instinct.
And that’s what I decided to do.
Some details
The application’s business logic is pretty extensive and processes data all day long, no matter if users are online or not. It also communicates with different child servers all over the web. The database will start with millions of entries and over 30 tables, growing by millions of new records every day.
Sketching the virtual network
Choice of hardware
The application server needs to cope with heavy-duty tasks, as fast as possible. The database server doesn’t only need storage space but also processing speed. And finally, the web server needs to handle multiple concurrent connections while serving HTML. After comparing multiple not-so-solid benchmarks, reading on Web Hosting Talk and exchanging a couple of emails with some ISPs, I finally made my choice on the hardware settings illustrated in green above.
Choosing your ISP
With all the hosting reviews and comparison sites available, you’d think that the task is easy. Let me tell you it’s not. While some will strongly encourage you to look for a provider in your city (and for good reasons), I opted to go for pricing, staff experience and a perfect reliability/support track record. I started researching a bunch; some of the names that stuck: rackspace.com, theplanet.com, 365main.com, cari.net, softlayer.com. I won’t go into too much detail about how I finally made my choice because I am no expert, but let’s say that rackspace.com is over-priced (and I am being nice) and theplanet.com seems to have lost its touch by growing too big, otherwise it would have done a great job.
SoftLayer is the one I decided to go with. Established in 2005, with a management team that evolved together for many years, it looked pretty solid to me. Their staff, pricing structure, website and echo in the market just reinforced that feeling.
UPDATE: Looks like I eliminated 365Main just in time! Less than 10 days after I made my choice, 365Main suffered a power outage which left BIG customers (craigslist, technorati, sixapart, adbrite, yelp, redenvelope and others) offline for several hours. Sources: outage at 365 Main’s San Francisco datacenter; 365 Main datacenter power outage - Six Apart Technorati Craigslist; San Francisco Power Outage - A Case Study in Downtime
Estimating your needs
For this web application, the counters just don’t start at 0. The basic servers’ configuration takes that into account but, like for any other application, the actual load inflicted by concurrent users’ requests can’t really be estimated or benchmarked in advance.
Not so long ago, only 2 solutions would have been available: start with just enough and be ready to upgrade fast, or start strong enough to handle the first couple of months while you discover your averages.
Today, there is a third solution: Amazon EC2 and S3 services. Pay-as-you-go for storage and processor usage: what better way to avoid seeing your application crash under heavy load, or your bank account drained by hefty initial costs that turned out to be unnecessary?
Conclusion
I am no expert in that field to start with but I feel confident that I made the right choices. Time will tell I guess. Agree, disagree?

