Microsoft advertises Front Door for a variety of use cases. It’s the place where all client traffic enters. It serves as a CDN with some fairly sophisticated routing capability and DDoS protection. It supplies a good deal of power over incoming traffic and it’s great for what it does.
“Powerful” and “robust” are generally one-way synonyms for “complex” in operations, and Front Door is no exception. It offers power at the price of complexity and it’s not always easy to make sense out of what it should do and where it should do it.
Caching, in particular, presents a number of opportunities to increase global performance and a similar number of opportunities to create baffling problems if it’s configured incorrectly for the task at hand.
In light of that, I thought it might be useful to take a closer look at what we’ve seen cause caching trouble and how we re-configured Front Door to avoid it.
Cache Expiration
We generally want cache expiration set relative to the rate of change in the cached content. We don’t just want our cache to serve up cached content; we need it to serve up current content. If we’re making changes to that content every day, a three-day cache period means our users can be as much as three days behind our latest release.
Accordingly, absent a compelling reason to do otherwise, we don’t generally enable caching in non-production environments at all. The entire point of a development environment is that it changes very rapidly and very frequently. Developers need to be able to see the results of a change to Dev in real time, or as close to it as possible.
QA has to be sure they’re testing the current code, not cached code from days before. UAT tends to be a bit more stable over short periods, since it’ll only update after QA has signed off (ideally), and there’s certainly an arguable case for a short cache there.
Prod changes less over a given period of time and it’s also where users see the performance gain from the cache. This is where Front Door shines.
Default Cache Behavior
“If the Cache-Control header isn’t present on the response from the origin, by default Front Door randomly determines a cache duration between one and three days.”
I am not wild about the word “random” in operations. I understand the decision here and why it’s implemented the way it is. I still want to know down to the second when my content is expected to expire. Part of that is the control obsession common to administrators and operators.
Most of it is how little fun it is to tell developers they get to wait 0-72 hours to learn if their most recent change actually fixed the problem they were addressing. Being able to tell those developers it’ll be 70 hours before the change shows up is better but not a lot better.
We’ve got a couple of options on how to never have that conversation. Ideally, dev isn’t behind the CDN and this issue never gets a chance to manifest.
If we can’t avoid that, for whatever reason, we can set the cache control headers on the site pages to reflect the cache duration we like. That works and Front Door does respect that header. It’ll also necessarily vary by environment and controlling for environmental variation isn’t what we want developers to do.
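To make the origin-side option concrete, here is a minimal sketch of what the application would have to carry around. The environment names and durations are assumptions for illustration, not values from any real deployment, and `cache_control_header` is a hypothetical helper rather than part of any framework.

```python
# Hypothetical per-environment cache durations, in seconds.
# These numbers are illustrative assumptions only.
CACHE_SECONDS_BY_ENV = {
    "dev": 0,        # never cache: developers need to see changes immediately
    "qa": 0,         # never cache: QA must test current code
    "uat": 300,      # short cache: UAT is a bit more stable
    "prod": 86400,   # one day: tune to the release cycle
}


def cache_control_header(environment: str) -> str:
    """Build the Cache-Control value the origin would send for an environment."""
    max_age = CACHE_SECONDS_BY_ENV[environment]
    if max_age == 0:
        # A no-store response tells caches not to keep a copy at all.
        return "no-store"
    return f"public, max-age={max_age}"


print(cache_control_header("dev"))   # no-store
print(cache_control_header("prod"))  # public, max-age=86400
```

Note that the application now has to know which environment it is running in just to answer a question about the CDN, which is exactly the kind of coupling discussed next.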
We want developers writing code; that’s why we have them. We want DevOps handling the post-development work that keeps everything neat and tidy in the finished product. The whole function of DevOps, from the larger view, is reducing friction and making things run fast, stable, safe, and secure.
Developers being made to jump through hoops to fit into the infrastructure is the biggest part of why DevOps exists in the first place. It is a persistent source of stress in places where development and operations are walled off from each other.
Accordingly, we control this from the operations side wherever that’s feasible and Front Door makes it not only feasible but (relatively) easy. The model Front Door uses to manage this is a rule set – loosely, a set of conditions to test for and what to do with a request that matches or does not match those conditions. Specifically, it affords us three options for what to do with cached origin content when a request meets a given condition.
- Honor Origin – this is the default behavior of Front Door. If expiration is set on the origin, we respect it. If not, we cache it for a randomly chosen 1-3 days.
- Override Always – this is how we avoid making developers decide on content expiration. We set this value and ignore whatever is or isn’t present in the origin directives. If we need requests matching “/dev” to cache for 5 seconds, we Override Always and sleep peacefully knowing they’ll always cache for 5 seconds.
- Override if Origin Missing – Just what it says. If the content has a directive explaining how long it’d like to be cached, we agree and cache it accordingly. If it doesn’t, we fall back to our override value instead of the 1-3 day default.
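The three options above can be sketched as a small decision function. This is a model of the behavior as described here, not Front Door's implementation or API; the enum, the function name, and the parameters are all hypothetical.

```python
import random
from enum import Enum
from typing import Optional

DAY_SECONDS = 24 * 60 * 60


class CacheBehavior(Enum):
    # Hypothetical labels for the three options described above.
    HONOR_ORIGIN = "honor_origin"
    OVERRIDE_ALWAYS = "override_always"
    OVERRIDE_IF_ORIGIN_MISSING = "override_if_origin_missing"


def effective_ttl_seconds(
    behavior: CacheBehavior,
    origin_max_age: Optional[int],
    override_ttl: Optional[int] = None,
) -> int:
    """Return how long a response would be cached under each option.

    origin_max_age is the duration from the origin's cache directive,
    or None when the origin sent no directive at all.
    """
    if behavior is CacheBehavior.OVERRIDE_ALWAYS:
        if override_ttl is None:
            raise ValueError("Override Always needs an override value")
        return override_ttl  # origin directives are ignored entirely

    if origin_max_age is not None:
        return origin_max_age  # both remaining options respect the origin

    if behavior is CacheBehavior.OVERRIDE_IF_ORIGIN_MISSING:
        if override_ttl is None:
            raise ValueError("Override if Origin Missing needs an override value")
        return override_ttl

    # Honor Origin with no directive: the random 1-3 day default.
    return random.randint(DAY_SECONDS, 3 * DAY_SECONDS)
```

For the “/dev” case above, `effective_ttl_seconds(CacheBehavior.OVERRIDE_ALWAYS, 3600, 5)` returns 5 regardless of what the origin asked for.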
Production
Overrides help us tame Front Door against environments where caching is an active liability but ultimately Front Door is there for a reason and Prod is that reason.
If we’re relying on caching for performance, it doesn’t do us any good at all to just turn it off for the one environment users can see. Since it has to be enabled and working to show any benefit, we’ll have to figure out how to set cache expiration for production.
Ideally we’d refresh cache for any changed files on any release to Prod and leave the others alone but Front Door only checks for cache expiration. It doesn’t know that you changed the site map, updated the graphics on the Marketing page, and corrected a typo in a bio under the Team page. It just knows this content is expired and it needs to go get an update.
So, knowing that, we can set a rule for some pattern that only matches Prod and override the default cache expiration in favor of whatever makes sense for our release cycle.
Absent any external factors, it’s not terribly difficult to write a deployment pipeline that purges Front Door’s cached copy of the origin content after every production deployment. That at least ensures users are offered the updated content as soon as a release lands, rather than whenever the old copy happens to expire.
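As a rough sketch of that pipeline step, the snippet below builds and runs a purge through the Azure CLI. It assumes a Standard/Premium profile and the `az afd endpoint purge` command; the resource names are placeholders and the helper functions are hypothetical, so check the flags against the CLI version you actually have installed.

```python
import subprocess
from typing import List, Sequence


def build_purge_command(
    resource_group: str,
    profile_name: str,
    endpoint_name: str,
    content_paths: Sequence[str] = ("/*",),
) -> List[str]:
    """Assemble the Azure CLI call that purges cached content for an endpoint."""
    return [
        "az", "afd", "endpoint", "purge",
        "--resource-group", resource_group,
        "--profile-name", profile_name,
        "--endpoint-name", endpoint_name,
        # "/*" purges everything; narrower paths purge less.
        "--content-paths", *content_paths,
    ]


def purge_after_deploy(resource_group: str, profile_name: str, endpoint_name: str) -> None:
    """Run the purge; raises CalledProcessError if the CLI reports a failure."""
    command = build_purge_command(resource_group, profile_name, endpoint_name)
    subprocess.run(command, check=True)


# Placeholder names, shown only to illustrate the resulting command line.
print(" ".join(build_purge_command("my-rg", "my-frontdoor", "my-endpoint")))
```

Purging everything with `/*` is the blunt option. If the pipeline knows which paths a release touched, passing only those paths keeps the rest of the cache warm.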
Full rule documentation is available at https://learn.microsoft.com/en-us/azure/frontdoor/standard-premium/how-to-configure-rule-set.
Caching Error Pages
If Front Door reaches into its origin with a request, and that request results in an error page (transient database connection problem, transient network problem, actual corner-case code problem that evaded detection in QA/UAT), it will happily add that error page to its cache and serve it up for 1-3 days or however long your override period is. No one wants that.
We’d rather not have errors at all, but since that can’t be avoided, we don’t want them sticking around one to three days after they’ve been resolved.
It’s not all Front Door’s fault. We asked it to make a request from its origin and cache the result. We didn’t ask it to read the response and make decisions about whether or not to cache it. We can teach it to behave using another rule, though.
Set a rule that looks for a predefined error code or set thereof in the response, then use Override Always or Override if Origin Missing to turn 1-3 days into 5-10 minutes. That keeps a DDoS attack from flooding the origin with known-bad requests but also doesn’t leave an error in place days after it’s been corrected.
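The intent of that rule can be modeled in a few lines. This is only a sketch of the decision we want made, not Front Door rule syntax; the function name, the five-minute value, and the choice to treat every 4xx/5xx status as an error are assumptions for illustration.

```python
ERROR_TTL_SECONDS = 5 * 60  # assumed short cache period for error responses


def ttl_for_response(
    status_code: int,
    normal_ttl: int,
    error_ttl: int = ERROR_TTL_SECONDS,
) -> int:
    """Pick a cache duration based on whether the origin returned an error."""
    if status_code >= 400:
        # Cache errors briefly: long enough to shield the origin from a
        # flood of known-bad requests, short enough to clear after a fix.
        # Never extend the TTL beyond what normal content would get.
        return min(normal_ttl, error_ttl)
    return normal_ttl


print(ttl_for_response(200, 86400))  # 86400
print(ttl_for_response(503, 86400))  # 300
```

Taking the minimum of the two values keeps an environment with a very short normal cache period, such as the 5-second “/dev” example, from accidentally caching errors longer than it caches good content.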