Data Manipulation 101

Posted by Jad on July 30, 2007

One of the application’s core functionalities is monitoring changes on different sources. Some of these sources offer web services, while others don’t. By monitoring I mean getting, validating, storing and comparing the data over time. For that, different data manipulations are required:

  • fetching
  • scraping
  • parsing
  • cleansing
  • storing
  • mining
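To make the "comparing the data over time" part a bit more concrete, here is a minimal sketch of the cleanse-then-compare idea. It is written in Python purely for illustration (the application itself runs on PHP), and the function names cleanse, fingerprint and has_changed are hypothetical, not taken from the actual code base. It assumes that a source counts as "changed" only when its text differs after whitespace is normalized.

```python
import hashlib
import re


def cleanse(raw):
    """Collapse runs of whitespace and trim, so cosmetic changes are ignored."""
    return re.sub(r"\s+", " ", raw).strip()


def fingerprint(raw):
    """Return a SHA-1 hex digest of the cleansed text (hypothetical helper)."""
    return hashlib.sha1(cleanse(raw).encode("utf-8")).hexdigest()


def has_changed(previous_raw, current_raw):
    """Compare two snapshots of the same source by their fingerprints."""
    return fingerprint(previous_raw) != fingerprint(current_raw)
```

Storing only the fingerprint of the previous snapshot is one possible design choice: it keeps the comparison cheap, at the cost of not knowing what changed, only that something did.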

In this series of posts, I will share different resources, code snippets, benchmarks and techniques related to one or more of the above. Maybe, if I get enough time, I can cover them all and group them in one easy-to-understand chapter: data manipulation 101.

The sandbox environment I am using:

- PHP 5.2.3
- libcurl
- dom
- MySQL 5.1

And now, on with the show. Next up, Data Scraping.

Configuring CakePHP for easy deployment

Posted by Jad on July 29, 2007

Whenever you are developing an application, you are normally using at least 2 environments: development and production. Depending on how big your app and database are, deployment may become a long list of to-dos in order to have everything set up correctly (database changes, code merges, Apache configuration, etc.). When you want to do things the right way, you usually have a 3rd environment, staging. Yes, another almost identical to-do list.

Now imagine you are developing in a team with each member working on a different part of the application in individual branches on the repository. New features are only merged with the team’s branch (development) after they are tested individually. So now, you have to deploy from individual to development, then to staging before finally pushing updates to production.

This time-consuming process is definitely not the best solution, not to mention that it is bound to break with any mistake made while reconfiguring everything. Today, we will automate CakePHP’s configuration for easy deployment.
Continue reading…

Detailed documentation - to the rescue!

Posted by Jad on July 22, 2007

Having worked by myself on the vast majority of my coding projects, I never realized the necessity of having lots of documentation. Good documentation takes time to write and time is something I happen to always be short on. I am not only talking about code commenting here, but rather all kinds of documentation: coding conventions, database structure, choice of configuration, etc.

When this project was started, I only had one other person involved in the coding part and it was mostly for some outside classes we needed. With the growing code needs (all the new features, etc.), I believed it would be wise to add a new person to the team. Given my past experience with site/script development, which I had never planned to hand over to other coders, I was expecting the worst when it came to explaining everything the web application is about: features, users, structure, etc. It definitely always sounds much simpler when you are the author/creator, and when you are also the end-user, things become ridiculously easy to understand and put together - which is unfortunately not the case for someone who has never heard of or used anything similar before. Continue reading…

Hardware, connectivity and load estimation

Posted by Jad on July 18, 2007

When developing a web application, you can’t think only of software (programming language, operating system, etc.); you also have to think of hardware, connectivity and load management.

In one of my early posts, I said that I would leave the web app on a VPS. I rather quickly understood that I was totally wrong to think it could handle the load, which depends not only on the number of users or the app’s success but also on the amount of data tracked, the system’s core functionality.

As a matter of fact, adding more features certainly added more value, but it definitely added more overhead too. So why don’t I just scratch those new features, at least for now? After all, Getting Real (by 37signals), a book I believe to be right on many points, repeats that advice over and over again. I can’t deny that much of the advice in that book will prove very beneficial, but sometimes you need to put aside a couple of general rules and follow your instinct.

And that’s what I decided to do.

Some details

The application’s business logic is pretty extensive and processes data all day long, no matter if users are online or not. It also communicates with different child servers all over the web.

The database will start with millions of entries across more than 30 tables and grow by millions of new records every day.
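A quick back-of-the-envelope calculation helps turn "millions of records every day" into a storage figure. The sketch below is in Python for illustration only; the function names and every number passed to them (starting rows, daily growth, average row size) are hypothetical placeholders, not measurements from this application. It also ignores index and log overhead, which can add significantly to the real footprint.

```python
def projected_rows(initial_rows, rows_per_day, days):
    """Total row count after a number of days of linear growth."""
    return initial_rows + rows_per_day * days


def projected_size_gb(rows, avg_row_bytes):
    """Raw data size in GiB, ignoring indexes and other overhead."""
    return rows * avg_row_bytes / 1024 ** 3


# Hypothetical inputs: 5 million starting rows, 2 million new rows per day,
# 200 bytes per row on average, projected over 90 days.
rows = projected_rows(5_000_000, 2_000_000, 90)
size_gb = projected_size_gb(rows, 200)
print(rows, round(size_gb, 1))
```

Plugging in real averages once they are known should give a much better idea of how quickly the database server’s disks will fill up.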

Sketching the virtual network

[Figure: development phase, servers setup]

Choice of hardware

The application server needs to cope with heavy-duty tasks, as fast as possible. The database server doesn’t only need storage space but also processing speed. And finally, the web server needs to handle multiple concurrent connections while serving HTML.

After comparing multiple not-so-solid benchmarks, reading Web Hosting Talk and exchanging a couple of emails with some ISPs, I finally settled on the hardware settings illustrated in green above.

Choosing your ISP

With all the hosting reviews and comparison sites available, you’d think that the task is easy. Let me tell you it’s not. While some will strongly encourage you to look for a provider in your city (and for good reasons), I opted to go for pricing, staff experience and a perfect reliability and support track record.

I researched a bunch of providers. Some of the names that stuck: rackspace.com, theplanet.com, 365main.com, cari.net, softlayer.com. I won’t go into too much detail about how I finally made my choice because I am no expert, but let’s say that rackspace.com is over-priced (and I am being nice) and theplanet.com seems to have lost its touch by growing too big; otherwise it would have done a great job.

SoftLayer is the one I decided to go with. Established in 2005, with a management team that evolved together for many years, it looked pretty solid to me. Their staff, pricing structure, website and reputation in the market just reinforced that feeling.

UPDATE: Looks like I eliminated 365 Main just in time! Less than 10 days after I had made my choice, 365 Main suffered a power outage that left BIG customers (Craigslist, Technorati, Six Apart, AdBrite, Yelp, RedEnvelope and others) offline for several hours.
Sources:
outage at 365 Main’s San Francisco datacenter
365 Main datacenter power outage - Six Apart Technorati Craigslist
San Franciscon Power Outage - A Case Study in Downtime

Estimating your needs

For this web application, the counters don’t start at 0. The basic server configuration takes that into account but, as with any other application, the actual load inflicted by concurrent users’ requests can’t really be estimated or benchmarked in advance.

Not so long ago, only 2 solutions would have been available: start with just enough and be ready to upgrade fast, or start strong enough to handle the first couple of months while you discover your averages.

Today, there is a third solution: Amazon’s EC2 and S3 services. With pay-as-you-go pricing for storage and processor usage, what better way to avoid seeing your application crash under heavy load or your bank account drained by hefty initial costs that turned out to be unnecessary?

Conclusion

I am no expert in this field to start with, but I feel confident that I made the right choices. Time will tell, I guess.

Agree, disagree?

Delays - very often, uncontrollable

Posted by Jad on July 17, 2007

In my last post, I said I would start posting more frequently after being absent for a while and here I am, 7 days later, with no posts to show.

I haven’t been procrastinating, nor have I been making more changes to the application’s plan, no - I was recovering from an unexpected surgery!

Aside from affecting the project’s timeline, this hospital stay gave me time to take a couple steps back from the development/planning side of things. Time I spent refreshing my memory with books like Building Scalable Web Sites (by Cal Henderson) and Prioritizing Web Usability (by Jakob Nielsen and Hoa Loranger).

A couple of the things that got set up, approved or coded during my absence:

  1. New company (legal documents, bank account, etc.)
  2. A new Google AdWords API account (GAA) - never heard back from them concerning the old account’s inquiry.
  3. A new Yahoo! Search Marketing API (YSMA) sandbox account.
  4. 75% of the data cleansing classes we need.

Lesson of the day:

Early in the development process, having a detailed plan (even if you know that delays will occur), combined with taking action on simple third-party requests (placing orders, registering accounts, etc.), helps keep things rolling when you encounter unexpected events or delays.

Breaking the silence

Posted by Jad on July 11, 2007

After almost 3 weeks without posting, it was time I got back to it before I lose interest. Very often, after starting something for no cold hard cash, I say to myself: “Why did you ever think of starting it in the first place!” Many times the answer (or the lack of answer) to that leads to dropping it - not in this case.

Before I get back to work, here’s a roundup on what’s been happening:

- Alex* came for a week to discuss the features, marketing strategy and JV opportunities (oh, and yes, for the staff reading this, also for a little vacation to Alexandria!). We had a great time and came up with many new ideas, some requiring a look back at the use cases, database design and initial server setup - but no code.

- I then had my cousin here for a couple of days - again, great time but no code.

- All that time, I haven’t stopped doing research on the different aspects of the web application, which I will share with you in the coming days. Some of these are the APIs it will use, the hardware resources it will require, libraries and classes it can benefit from, etc.

So yeah, the good news is that you haven’t really missed much. The bad news is that the launch date has obviously been affected, and the same applies to the schedule.

I’ll leave it at that for the moment, more updates real soon.