Research and development is a key part of providing constantly improving solutions to our clients and partners. As part of our programme, we are aiming to provide greater insight into the work being undertaken in the R&D group and its outcomes. In this article, we cover the topic of caching and the result of a recent project to improve its performance characteristics.
Content Delivery Networks such as Akamai and CloudFront are another form of caching, and save the origin servers (those that actually host the site) from having to repeatedly deliver the same content to multiple different clients. EstarOnline uses various forms of caching to save the database from being queried for the same data repeatedly.
In short, caching allows systems to scale to levels of performance they would not otherwise be able to achieve. This accommodates increased loads from sales, marketing events or attempted 'denial of service' attacks, without the need to invest in significant over-provisioning of server resources to meet the demands of these relatively rare events. An often-used analogy is that of road traffic, and the question "why would you build a 4-lane super-highway just for holiday traffic, when for 99% of the time, a two-lane road is sufficient?"
One of the keys to "good" caching is determining the best balance of cache duration against the need for changes made to the data (for example, stock levels and product availability) to be reflected in the output. EstarOnline has built-in support for internally caching the results of expensive database queries, and this caching is used to achieve the highly responsive and scalable online storefronts demanded by our clients.
Under certain very high-load conditions (usually generated by sales activity) we noticed that system responsiveness could become somewhat unstable, with periods of good performance interspersed with periods of poor performance, despite having extensive caching in place. This was observed on our monitoring screens as a "saw-tooth" load profile, even though traffic was reasonably consistent over the same period, which indicated that the load spikes were internally generated. We have a lot of activity "behind the scenes", with integrations, payment batches and maintenance tasks all running on various schedules, and these can cause spikes when several schedules end up synchronising on a certain time. However, we were able to eliminate them as causes of the peaks we observed. Further investigation traced the peaks back to the primary internal caches expiring at fixed intervals, which often coincided with each other, causing a "spike" in load as the data for multiple items needed to be refreshed from the database at about the same time.
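The effect of fixed expiry intervals can be illustrated with a small simulation. This is a minimal sketch rather than the platform's actual code: the function name, the item count and the 300-second duration are all assumptions chosen for illustration, and it assumes every item is first cached at the same moment and is requested again as soon as it expires.

```python
from collections import Counter


def fixed_ttl_refreshes(item_count, ttl_seconds, horizon_seconds):
    """Count database refreshes per second when every item shares one fixed TTL.

    Illustrative assumptions: all items are first cached at t=0 and are
    re-requested (and therefore refreshed) the moment they expire.
    """
    refreshes = Counter()
    for _ in range(item_count):
        expiry = ttl_seconds
        while expiry < horizon_seconds:
            refreshes[expiry] += 1   # one database query at this second
            expiry += ttl_seconds    # re-cached for the same fixed duration
    return refreshes


# Hypothetical figures: 500 cached items, 5-minute cache, 20-minute window.
schedule = fixed_ttl_refreshes(item_count=500, ttl_seconds=300, horizon_seconds=1200)
for second, count in sorted(schedule.items()):
    print(f"t={second}s: {count} refreshes")
```

Under these assumptions every refresh lands on the same few seconds and the database is idle in between, which is the kind of "saw-tooth" shape described above.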
Our R&D group embarked on a research project to determine what could be done to reduce the severity of these spikes and thus improve the overall system behaviour under high loads.
The outcome of this was a research paper which led to the discovery that by varying the duration of the cache by a randomised amount around the desired duration, we attained a dramatically different load profile, which eliminated the extreme peaks and allowed for more consistent performance. Given the very clear results, we have incorporated the findings of this research into the caching system within the platform.
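The idea of a randomised cache duration can be sketched as follows. This is an illustration under stated assumptions, not the platform's implementation: the function names, the uniform distribution, the 20% spread and the simulation parameters are all hypothetical, since the article does not specify how the randomised amount is chosen.

```python
import random
from collections import Counter


def jittered_ttl(base_seconds, spread=0.2, rng=None):
    """Return a cache duration drawn uniformly from base_seconds +/- spread.

    `spread` is a fraction of the base duration (0.2 means +/- 20%).
    Both the uniform distribution and the default are assumptions.
    """
    if not 0.0 <= spread < 1.0:
        raise ValueError("spread must be in the range [0, 1)")
    if rng is None:
        rng = random
    return base_seconds * (1.0 + rng.uniform(-spread, spread))


def peak_refreshes(item_count, base_seconds, spread, horizon_seconds, seed=0):
    """Simulate expiries and return the busiest one-second refresh count.

    Assumes all items are first cached at t=0 and refreshed on expiry,
    with a fresh randomised duration drawn at every refresh.
    """
    rng = random.Random(seed)
    per_second = Counter()
    for _ in range(item_count):
        t = jittered_ttl(base_seconds, spread, rng)
        while t < horizon_seconds:
            per_second[int(t)] += 1
            t += jittered_ttl(base_seconds, spread, rng)
    return max(per_second.values(), default=0)


# Hypothetical comparison: 500 items, 5-minute base duration, 20-minute window.
print("fixed duration peak:     ", peak_refreshes(500, 300, 0.0, 1200))
print("randomised duration peak:", peak_refreshes(500, 300, 0.2, 1200))
```

With a spread of zero every item expires in the same second, so the peak equals the number of items. With a non-zero spread the same number of refreshes is distributed over a window of seconds, which lowers the peak. The right spread is a trade-off: a wider one flattens the load further but makes the effective cache duration less predictable.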
This change was rolled out to all clients in Weekly Platform Update 239 (deployed late July).
We have already seen 'real world' performance improvements during the high traffic spikes caused by recent client promotional activity.