Product growth: Scaling the SaaS platform [004.1]

I had just rejoined the startup and the product launch was a week away… wow. I had started work on the visual voicemail product before I left a year earlier, and now it was about to launch. The interesting thing was that I had shown the demo version of the product to Carriers in the USA and Europe during my first stint, to a lukewarm response. But then the iPhone launched in 2007 with visual voicemail and we started getting calls from carriers.

The existing SaaS platform was serving Voice SMS subscribers in the field, but its daily traffic was pretty small compared to the expected voicemail traffic. So the platform was scaled up in readiness for the voicemail traffic, based on the Carrier forecast of 400k subscribers over 6 months. The devices were gobbled up by subscribers and the platform was hit with significantly higher traffic even before the official launch. Adoption was 200k subscribers in 2 weeks, 400k subscribers in 6 weeks and 800k in 6 months. Neither our team nor the Carrier team was ready for such high traffic. So the first 6 weeks were hell, with the system going down daily and the team scrambling for more hardware while engineering updated the software to fix new issues caused by the new traffic pattern. The team got the chaos under control within 6 weeks, but we learnt a lot about losing voicemail messages and notifications, about a failing Oracle DB, and about never running real-time queries on a high-transaction system.

We finally realized the challenge with the architecture design: we had to scale individual services separately, such as the SMS handlers, web service handlers, SMTP voicemail message handlers, etc. Each of the components was reaching its own breaking point as the traffic increased, and we were on the run upgrading hardware and software to keep the show running. It was a great learning and bonding experience for the team, late nights and all. The architecture was painful to scale, and deploying it in different data centers would have taken a lot of engineering work and still ended up clunky.

So the team got together very quickly and figured out an island-based architecture design, which scaled linearly and allowed us to deploy multiple islands to support increasing traffic. Basically, each box ran a virtual machine with all the services inside it, multi-threaded, on a box with any number of CPUs. We benchmarked traffic on the various box configurations to determine the tps (transactions per second) rating of each. Thereafter, it was just a matter of adding boxes linearly, limited by the Oracle DB box, which was shared by every server box in an island. It took a year to re-architect and deploy the new system, but it worked like a charm. Of course, we had to figure out inter-island communication.
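To make the sizing logic concrete, here is a minimal sketch of how the island arithmetic might look. The function names, the 50 tps per-box rating and the assumption that the shared DB caps an island at 300 tps are all hypothetical, chosen only for illustration; they are not the actual figures or tooling we used.

```python
import math


def boxes_needed(target_tps: float, box_tps: float, db_tps_limit: float) -> int:
    """Server boxes required in one island to reach target_tps.

    box_tps is the benchmarked rating of a single box (hypothetical value below).
    db_tps_limit is what the shared DB box can sustain (also hypothetical).
    """
    if target_tps > db_tps_limit:
        # Adding boxes no longer helps once the shared DB is the bottleneck.
        raise ValueError("target exceeds the shared DB limit; deploy another island")
    return math.ceil(target_tps / box_tps)


def islands_needed(total_tps: float, island_tps: float) -> int:
    """Islands required to carry total_tps when each island is rated at island_tps."""
    return math.ceil(total_tps / island_tps)


if __name__ == "__main__":
    # Assumed: 50 tps per box, DB good for 300 tps per island.
    print(boxes_needed(target_tps=300, box_tps=50, db_tps_limit=300))
    print(islands_needed(total_tps=1200, island_tps=300))
```

The point of the sketch is the shape of the scaling: within an island you add boxes until the shared DB becomes the limit, and beyond that you add whole islands.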

We deployed 4 islands of 300 tps each over 6 years to support 15M+ active subscribers in the Carrier network. The infrastructure handled between 10M and 15M voicemails per day. Our SLA was to deliver the voicemail audio file to the subscriber’s device within 60 seconds at all times. In addition, we accepted an uptime SLA of 99.997%, which allows only about 1.3 minutes of downtime in a 30-day month. The penalty for missing the downtime SLA was loss of revenue for that month, and we never lost revenue due to downtime.
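For a sense of scale, the downtime budget implied by an uptime percentage and the average daily load can be worked out directly. This is a rough sketch assuming a 30-day month and perfectly even traffic; the helper names are hypothetical.

```python
def downtime_budget_minutes(uptime_pct: float, days: float = 30) -> float:
    """Minutes of allowed downtime over `days` for a given uptime percentage."""
    return (1 - uptime_pct / 100) * days * 24 * 60


def average_per_second(events_per_day: float) -> float:
    """Average event rate per second, assuming traffic is spread evenly over the day."""
    return events_per_day / 86_400


if __name__ == "__main__":
    print(f"{downtime_budget_minutes(99.997):.2f} minutes of downtime per 30-day month")
    print(f"{average_per_second(10_000_000):.0f} to "
          f"{average_per_second(15_000_000):.0f} voicemails per second on average")
    print(f"{4 * 300} tps of rated capacity across 4 islands")
```

The average rate is not directly comparable to the tps rating: a single voicemail can trigger several transactions (deposit, notification, download), and real traffic is peaky rather than flat, so the rated capacity has to sit well above the daily average.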

The business model was cost per subscriber per month, because the voicemail app was offered free to subscribers. With adoption running so far ahead of forecast, the Carrier customer's yearly cost of supporting visual voicemail was suddenly looking way beyond anything forecast. We realized this within the first month, and the effort began to figure out how to deliver a premium voicemail-to-text service using automated transcription.

Scaling the SaaS platform took a couple of years in parallel with the product growth, but I mention it first because most companies falter in setting up the appropriate test lab equipment for load testing to support and certify a scaling SaaS platform. This issue has the potential to break your business by consuming all your resources in operating the platform while you lose sight of the product roadmap.

Next up… impact of startup acquisition on the product.

Previous blog in product growth series: How we grew product revenue to $50M+ per year for our customers? [004.0]
