How to Scale IT Infrastructure Without Leaving Stretch Marks

As it grows, a successful Web site or application generates more and more traffic. Unless the infrastructure can be scaled to match, increasingly slow performance will discourage users from coming back, and a promising application will have failed.

Raising the ante is the fact that many new applications are produced by startup ventures, which cannot afford enough infrastructure to cover future needs. They have no alternative but to grow their infrastructure as their needs and, hopefully, their revenue grow. This can be done successfully, but there are a number of factors to consider.

Basically, the application (and, in fact, the design process behind it) must be scalable, as must the business processes involved with the application. Monitoring and redundancy are also important concerns. Keeping these factors balanced can lead to cost-managed scaling and, with it, profitability.

Scalable Applications

Web applications are often built with an emphasis on new features and rapid development, with little consideration for long-term growth. However, what if use balloons by a factor of 10, or 100? We often see applications that reach a certain point and then hit a wall. They simply can’t scale any further without being redesigned and rebuilt.

The main problem lies in the difference between scaling up and scaling out. Scaling up means moving the application — unchanged — to a more powerful machine. Scaling up is easier, and buying hardware is usually cheaper and faster than rearchitecting the application. However, when quad-processor Intel or AMD machines aren’t fast enough, the next step is big iron Sun boxes. That is a hugely expensive move, and on the Web it’s rarely done anymore.

Instead, Web applications typically scale out, meaning that the workload is divided among multiple, inexpensive, commodity machines. However, to scale out, the software must be designed so its functions can be spread among multiple machines.

For instance, it should be possible to subdivide the application’s database, alphabetically or geographically, so that each section can be handled by a different server. This division cannot happen automatically — it must be handled by the architects.
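As a rough illustration of the alphabetical split described above, the following Python sketch routes each user name to one of three database servers. The letter ranges, the server names (db-1, db-2, db-3) and the shard_for function are all hypothetical, chosen only to show the idea; a real application would pick boundaries based on how its own data is distributed.

```python
import bisect

# Hypothetical shard layout: (first letter of the range, server name).
# The server names are placeholders, not real hosts.
SHARDS = [("a", "db-1"), ("h", "db-2"), ("p", "db-3")]
_STARTS = [start for start, _ in SHARDS]


def shard_for(username: str) -> str:
    """Return the server that owns the alphabetical range containing username."""
    key = username.strip().lower()[:1]
    if not key or key < "a" or key > "z":
        # Names that do not start with a letter a-z go to the first shard.
        return SHARDS[0][1]
    # Find the last range whose starting letter is <= the key.
    index = bisect.bisect_right(_STARTS, key) - 1
    return SHARDS[index][1]
```

With this layout, shard_for("alice") returns "db-1", shard_for("henry") returns "db-2" and shard_for("zoe") returns "db-3". The point is that the routing rule lives in the application's design, which is why it has to be planned by the architects rather than bolted on later.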

Scalable Design Processes

Speaking of architects, it is important that they not rely solely on their own resources. Re-inventing the wheel must be avoided.

In my experience, technical people often don’t like to ask for help in the early stages of a project, but that is the best time to ask for it — rather than during the crisis that may come later. Architects should always look for someone who has already done what they are trying to do, especially as that person probably wrote down the answer. In fact, a simple Google search might produce the architecture for your next application.

Whatever you do, don’t be proud — it’s always better to learn from past experience than to completely invent the process yourself.

Development resources are almost always scarce, and deciding how much to expend on the features that the users see, versus how much to expend on the architecture that the users do not see (but which would make the application scalable) is always difficult. There is no one answer, but the question needs to be the subject of some deliberation. Weigh the importance of every new feature against the scalability and performance impact it will have on your application.

Scalable Business Processes

Meanwhile, as the application grows and becomes more important, the software behind it is not the only thing that needs to scale. Basically, all business processes associated with it may need to be scaled. With startups, it is common to find a lack of formal, documented business processes, and the absence of any change management process can be especially problematic. There needs to be a process to manage changes to the infrastructure, the software and the content.

Frankly, the most likely time for a problem to emerge is while introducing changes. Not only should there be planning for each step involved in the change, but there should be planning for a rollback to the previous version if the change does not work. Otherwise, you could find yourself trying to figure out what to do while in the middle of a crisis. While you’re trying to get the site back to normal, you’ll be losing revenue, as well as the trust of your customers.
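One way to make the rollback plan concrete is to treat every change as three steps decided in advance: apply, verify and roll back. The Python sketch below assumes each step can be expressed as a function; the name apply_change and its parameters are illustrative, not part of any particular deployment tool.

```python
from typing import Callable


def apply_change(apply: Callable[[], None],
                 verify: Callable[[], bool],
                 rollback: Callable[[], None]) -> bool:
    """Apply a change, check it, and roll back if anything goes wrong.

    Returns True if the change was kept, False if it was rolled back.
    """
    try:
        apply()
        if verify():
            return True
    except Exception:
        # Fall through to the rollback; a real process would also log the error.
        pass
    rollback()
    return False


# Usage with a stand-in for a real deployment: a version number in a dict.
state = {"version": 1}
kept = apply_change(
    apply=lambda: state.update(version=2),
    verify=lambda: False,               # pretend the health check failed
    rollback=lambda: state.update(version=1),
)
```

After this runs, kept is False and state["version"] is back to 1. Writing the rollback step before the change is made is what keeps the decision out of the middle of a crisis.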

Going from one to two servers, for instance, may not sound like a big deal, but you’ll have to add load balancers and also change the cabling, networking and software configurations. If anything goes wrong, the site could go down.

Keep Communication Consistent

Obviously, the potential for problems is much greater if you are scaling from 10 up to 20 servers. There could be multiple people involved in the upgrade, and something is sure to go wrong if all those people don’t know exactly what they’re supposed to be doing. To ensure that they do, the process should be documented. Additionally, when the process is completed, the outcome should be analyzed to refine the process, and future documentation should be changed to reflect what was learned.

Furthermore, without documented processes, individuals can become single points of failure. If only one person is knowledgeable about a site’s functions and is on vacation when a crisis erupts, resolving that crisis may be nearly impossible.

Of course, for a startup that’s essentially doing something new every day, relying on documented processes is not always an option. At the very least, make sure you’re communicating regularly and in a common forum. Hold a weekly meeting with everyone involved in managing your application, and talk through recent and planned events and changes.

Monitor Closely

If you are monitoring usage closely, you can respond before your traffic outstrips your infrastructure. Monitoring application usage should be done regularly, and tracked over time. You should test the entire user experience from end to end, and get a baseline of what kind of performance is adequate and of what impact adding resources would have.
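A baseline is only useful if current measurements are compared against it. The Python sketch below is one minimal way to do that, assuming you already collect response times in milliseconds; the function names, the 95th-percentile choice and the 25 percent tolerance are assumptions for illustration, not recommendations from any monitoring product.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def exceeds_baseline(samples_ms, baseline_ms, tolerance=1.25, pct=95):
    """True if the pct-th percentile response time is above tolerance x baseline."""
    return percentile(samples_ms, pct) > baseline_ms * tolerance
```

For example, exceeds_baseline([100, 200, 300, 400], baseline_ms=300) is True, because the slowest request (400 ms) is above the 375 ms threshold. Running a check like this on a schedule, and keeping the results, gives you the trend over time rather than a single snapshot.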

Mature organizations often mimic their site in a test environment and simulate traffic loads to see what the response will be, so they can plan what infrastructure they’ll need. Few startups do that — but many second-time entrepreneurs do. Essentially, if you can locate the bottlenecks in your application before you encounter them in production, you may be able to avoid them altogether.

Unfortunately, infrastructure problems can arise that no amount of monitoring and planning can help overcome. For instance, a mention of your service on a popular blog could trigger a flood of visitors that swamps your infrastructure for a few days. Or you may be in a seasonal business with a period of peak use followed by an extended lull in traffic. In either case, gearing up to handle peak use will leave you with idle infrastructure the rest of the time. Consider all your options for rapid deployment of additional capacity and have a contingency plan for unforeseen demand.

Careful of Redundancy

Redundancy is another big issue with startups, since many cannot afford to build it into their infrastructure on day one. In the beginning, they’re most likely running everything on one server, creating a single massive point of failure. Unfortunately, it is only a matter of time before something fails, giving the site a black eye. In fact, redundancy is rarely added until a site has experienced its first major failure.

Adding full redundancy can add significant cost to the infrastructure, since there is additional equipment and added complexity that must be managed. Unfortunately, there is no one-size-fits-all solution for redundancy. Instead, there must be an ongoing effort to identify the points of failure that pose the highest potential risk, and steps must be taken to mitigate them. Make sure you consider redundancy as part of the architectural design even if you don’t have the budget for the redundant infrastructure on day one.

In the end, successful scaling can mean successful cost management. By acquiring only the infrastructure that is actually needed, when it is needed, costs can be reduced so that — eventually — the enterprise can enjoy the benefits of economies of scale. In other words, when traffic and revenue increase by a factor of 10, perhaps the cost of the infrastructure will increase by only a factor of five.
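The arithmetic behind that last example can be checked in a few lines of Python. The function name unit_cost_ratio is made up for this sketch; it simply divides the growth in cost by the growth in traffic to show how the cost of serving each unit of traffic changes.

```python
def unit_cost_ratio(traffic_factor: float, cost_factor: float) -> float:
    """Cost per unit of traffic after growth, relative to before growth."""
    if traffic_factor <= 0:
        raise ValueError("traffic_factor must be positive")
    return cost_factor / traffic_factor


# The example above: traffic grows 10x while infrastructure cost grows 5x.
print(unit_cost_ratio(10, 5))  # 0.5, so each unit of traffic costs half as much
```

A ratio below 1 means the infrastructure is getting cheaper per visit as the site grows, which is the economy of scale described above. A ratio of 1 or more would mean costs are growing at least as fast as traffic.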


John Engates is chief technology officer for Rackspace Managed Hosting.

