A new architecture for a new year

In early January 2018, we completed a huge refactoring project, Project Spectre, designed to modernise the architecture behind CasperVend and CasperLet. We kicked the project off in June 2017 in response to an increasing number of outages. Today, our systems are running faster and more reliably than ever before - what follows is our (technical) story.

CasperVend today is, by orders of magnitude, the biggest vending system in Second Life. We support tens of thousands of merchants and run millions of connected in-world entities.

It hasn't always been that way, though. We started from humble beginnings. When CasperVend first launched, HippoVend was king, and we were running on a crappy shared hosting package. Over time, we've scaled to meet the ever-increasing demands of our customers, first moving to virtual servers, and later to dedicated hardware. As we've scaled, our architecture has tried to evolve too, but making large, sweeping changes to a live, established system is very difficult.

2017 was a difficult year for us. We started experiencing some large outages, the longest of which lasted 12 hours. Some of these outages were caused by our architecture not keeping up with demand. While our average uptime throughout the year was still a respectable 99.66%, that still works out to roughly 30 hours of downtime over the year, and the outages were obviously unacceptable. Such extended periods of downtime cause headaches for the merchants who put their trust in us.

The outages we experienced were caused by a variety of issues, including:

  • Our storage architecture being overloaded
  • DDoS attacks
  • Fiber cuts
  • Server hardware failures
  • Datacentre power failures
  • Network failures

So, in June 2017, we launched Project Spectre and began the gargantuan task of refactoring our systems with as little impact as possible. We had originally hoped to launch CasperLet 2.0 and CasperVend 3.0 as part of this refactor, but as the situation became more urgent, the goalposts moved and we had to do something about the current system.

First, we tackled our storage cluster. Between 2012 and late 2017 we were using NAS storage and php_dio file locking to synchronise our vendor and rental unit data between servers. In late 2017 we appeared to reach the limits of what NFS could handle: kernel panics became more frequent, and with them came an increasing number of short outages. We started migrating our data to a MongoDB replica set. This was no small task - there was nearly 1TB of data to move, and the php_dio file locking was used extensively throughout our framework. In early January 2018, we moved the last component to our new MongoDB storage cluster. The new cluster has performed flawlessly, and can easily be scaled up as needed.
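
To give a flavour of what the new storage layer looks like from the application side, here's a rough sketch of a replica-set-aware write using the official MongoDB PHP library. The hostnames, database and collection names are made up for the example, not our actual configuration:

```php
<?php
// Sketch only: hostnames, database and collection names are illustrative,
// not our real production configuration.
require 'vendor/autoload.php';

// Connect to a three-member replica set. The driver discovers the current
// primary on its own and re-routes writes if the primary changes.
$client = new MongoDB\Client(
    'mongodb://db1.internal,db2.internal,db3.internal/?replicaSet=rs0',
    [
        'w'              => 'majority',          // writes acknowledged by a majority of members
        'readPreference' => 'primaryPreferred',  // reads can fall back to a secondary
    ]
);

$vendors = $client->selectCollection('caspervend', 'vendors');

// What used to mean taking a php_dio lock on a shared NFS file is now a
// single atomic document update.
$vendorKey = '00000000-0000-0000-0000-000000000000'; // example vendor UUID
$vendors->updateOne(
    ['vendor_key' => $vendorKey],
    ['$set' => ['last_ping' => new MongoDB\BSON\UTCDateTime()]],
    ['upsert' => true]
);
```

Because the driver tracks the replica set topology itself, losing a database node means a brief election on the MongoDB side rather than a stale file lock and a kernel panic.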

But what about the other causes? The longest (12-hour) outage was caused by a sustained DDoS attack on our systems. As part of Project Spectre, we migrated our systems to new, incredibly fast NVMe servers. But we didn't just move machines: we separated our public-facing websites from the in-world components. Our in-world systems are now 100% isolated from the rest of the world, and are only accessible from our in-world scripts. This makes a direct DDoS attack on those components effectively impossible.
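
To be clear, the isolation itself happens at the network level: the in-world endpoints simply aren't reachable from the public internet. Purely as an illustration, here's a hypothetical sketch of the kind of application-level check that can sit on top of that, using the X-SecondLife-* headers that Second Life's llHTTPRequest attaches to every request it sends (the function name and response handling are invented for this example):

```php
<?php
// Hypothetical sketch. The real isolation is at the network level; this only
// shows an application-level sanity check layered on top of it.

function isInworldRequest(): bool
{
    // A request without these headers definitely didn't come from an in-world
    // script. (Headers can be forged, so this complements, rather than
    // replaces, keeping the endpoint off the public internet.)
    return isset($_SERVER['HTTP_X_SECONDLIFE_OBJECT_KEY'],
                 $_SERVER['HTTP_X_SECONDLIFE_OWNER_KEY'],
                 $_SERVER['HTTP_X_SECONDLIFE_SHARD']);
}

if (!isInworldRequest()) {
    http_response_code(403);
    exit('Forbidden');
}

// ...handle the in-world request as normal...
```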

Our datacentre partner OVH has been hard at work eliminating the possibility of outages caused by fiber cuts, too. Back in 2015, there was only a single main trunk leading to the relatively new datacentre in Beauharnois, Canada. That line was cut twice in 2015, giving us downtime in May and November. Since then, OVH has installed a new redundant northern route and two southern routes. It's incredibly unlikely that all four routes would be cut at the same time, so we consider this risk effectively eliminated.

Server hardware failures are also being addressed. Every part of our in-world framework is now doubled up on duplicate hardware: we've got two application servers, two database servers, and two web servers. If one of them fails, traffic is automatically directed to its counterpart.
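
To illustrate the idea (the hostnames and the client-side approach here are just for the sketch; in practice this kind of failover usually lives in a load balancer or in DNS rather than in application code), a primary/secondary failover can be as simple as:

```php
<?php
// Sketch only: "try the primary, and if it's unreachable, retry the secondary".

function fetchWithFailover(string $path): ?string
{
    $hosts = ['app1.internal', 'app2.internal']; // primary first, then secondary

    foreach ($hosts as $host) {
        $ch = curl_init("https://{$host}{$path}");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 3); // give up quickly so failover is fast
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);

        $body   = curl_exec($ch);
        $failed = ($body === false);
        curl_close($ch);

        if (!$failed) {
            return $body; // first host that answers wins
        }
    }

    return null; // both hosts are down
}
```

On the database side, the MongoDB replica set gives us the same behaviour for free: if the primary disappears, the remaining members elect a new one and the driver reconnects on its own.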

There could still be outages caused by power and network connectivity issues within the datacentre. These are pretty much unavoidable. However, they don't happen often, and they tend to be fixed very quickly: we've only been impacted by one power failure in the eight years we've been operating, and that was resolved within 29 minutes.

It's been a hard slog, but I am finally very happy that we are able to provide a rock-solid service from here on out. Now I can focus on the next generations of CasperVend and CasperLet, as well as many more exciting projects to come.

Much love!

~Casper