Caching Cookie headers

“I can’t cache this piece of content because it has a Cookie header”, says any web developer/sysadmin following the so-called “caching best practices”.

Well, it turns out that nowadays pretty much every piece of content comes with at least one cookie header, and so does every single user. Does that mean it’s impossible to cache anything?
Absolutely NOT. We can cache anything we want to; we only need to pay some extra attention, as we don’t want to deliver content to the wrong users.

We have two types of cookie headers:

  • tracking cookies: these are used to track the user’s journey on a website or across the Internet. Most of the time they don’t carry private user information, because they are mostly used to build personalized content for each consumer based on the pages browsed.
  • cookies carrying private user information: these, instead, contain a lot of private information nobody wants to leak. For example, they can be associated with credit card details (you don’t want to leak those).

The good news is that with Varnish, using VCL, you can cache both of them. Logically, the second case deserves an extra check.

Here’s the VCL:

Caching tracking cookies with Varnish

sub vcl_recv {
    # stash the cookies so the built-in vcl_recv does not bypass the cache
    set req.http.Cookie-Backup = req.http.Cookie;
    unset req.http.Cookie;
}

sub vcl_hash {
    if (req.http.Cookie-Backup) {
        # restore the cookies before the lookup if any
        set req.http.Cookie = req.http.Cookie-Backup;
        unset req.http.Cookie-Backup;
    }
}

 

By default Varnish won’t cache content if a Cookie or a Set-Cookie header is present. That’s because it’s better to be safe than sorry: Varnish won’t take responsibility for caching when cookies are present, but you can still instruct it to do so.
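
For reference, this default comes from the built-in VCL that Varnish appends to your own. The excerpt below is a simplified paraphrase of the relevant checks as found in Varnish 4.x and later. The exact text varies between versions and other built-in checks are left out, so treat it as a sketch and compare it with the builtin.vcl shipped with your installation.

```vcl
# Simplified excerpt of the built-in VCL (Varnish 4.x and later).
# Not a complete file: other built-in checks are omitted.

sub vcl_recv {
    if (req.http.Authorization || req.http.Cookie) {
        # A request carrying a Cookie header is not looked up in cache.
        return (pass);
    }
    return (hash);
}

sub vcl_backend_response {
    if (beresp.ttl <= 0s || beresp.http.Set-Cookie) {
        # A response carrying Set-Cookie is marked uncacheable
        # for the next two minutes.
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}
```

Your own vcl_recv runs first and falls through to the built-in one unless it returns, so removing the Cookie header before that point is enough to avoid the pass.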

In vcl_recv, when Varnish receives the request, we save the original Cookie header in a new Cookie-Backup header and unset the original. This lets Varnish build a hash key and look up an existing object in cache; without unsetting the Cookie header the cache would be bypassed by default and nothing would be cached, hence the little trick.
Once we are sure Varnish will either create a fresh hash key for a new object or look up an already cached piece of content, we can restore the Cookie header in vcl_hash, so the backend still receives it on a miss. Note that the snippet above only restores the header; it does not call hash_data() on it, so all users share the same cached object, which is what you want for tracking cookies.

Caching a private cookie with Varnish

sub vcl_hash {
    if (req.http.Cookie && req.url ~ "/user-cart") {
        # add Cookie in hash
        hash_data(req.http.Cookie);
    }
}

This second example follows on from the previous one and adds the Cookie header to the hash key, both when a new object is inserted into the cache and when an existing object is looked up.

The main difference is the extra check: we make sure the requested URL belongs to the user cart of an e-commerce website, which is where the payment happens and where you really can’t afford to mix users up.
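
The two snippets can be combined into one file. What follows is a minimal sketch, assuming Varnish 4.0+ syntax and a hypothetical backend on 127.0.0.1:8080, not a drop-in configuration. The cookie is hidden from the built-in vcl_recv, restored in vcl_hash, and added to the hash key only for the cart URLs.

```vcl
vcl 4.0;

# Hypothetical backend; adjust host and port to your setup.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.http.Cookie) {
        # Hide the cookie so the built-in vcl_recv does not return (pass).
        set req.http.Cookie-Backup = req.http.Cookie;
        unset req.http.Cookie;
    }
}

sub vcl_hash {
    if (req.http.Cookie-Backup) {
        # Restore the cookie so the backend still receives it on a miss.
        set req.http.Cookie = req.http.Cookie-Backup;
        unset req.http.Cookie-Backup;
    }
    if (req.http.Cookie && req.url ~ "/user-cart") {
        # Private pages: one cached object per distinct Cookie value.
        hash_data(req.http.Cookie);
    }
    # No return here: the built-in vcl_hash still adds the URL and the host.
}
```

Keep in mind that a backend response carrying Set-Cookie is still treated as uncacheable by the built-in VCL, so this only covers the request side.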

Butterfly effect and tech

Can a butterfly flapping its wings in Australia cause a tornado in Florida?
Yes. Or no. We can’t say it with certainty, but it could happen.

You have probably heard this question already. It’s known as the “Butterfly Effect” and, in chaos theory, it describes sensitive dependence on initial conditions: a small change can result in large differences in a later state.
The term was coined by Edward Lorenz, who discovered the effect while running studies on his weather model. He ran the same experiment twice: in one run the input data was rounded (from six decimal places to three), while in the other it was not. In a system of 12 equations you would expect such a small change to have no influence on the final result. That was not the case: a very small change in the initial conditions had created a significantly different outcome.

For sure this is something we can prove in any mathematical system, but does it affect our everyday life as well?
Maybe.

In 1907 Thomas W. Lawson wrote “Friday, the Thirteenth”. The U.S. economy now loses about 900 million USD on that day, because people are scared to go on vacation, go to work, go shopping or do any other activity; a large number of people just stay home, scared of doing anything. The stock market shows average gains of just 0.2% or less on that day.
It was supposed to be nothing but a novel, right? It turned out to be something way bigger than that.

The butterfly effect is widespread in IT and tech as well. We have had a lot of wrong predictions which could have come true, but then a butterfly came along and suddenly they all became irrelevant.
In 1943 Thomas Watson, president of IBM, reportedly said “I think there is a world market for maybe five computers.” Do I have to add anything here? I don’t think so.

Watson aside, we have had many other wrong predictions, so… does it really make sense to forecast the future of tech and/or even make plans?

I have my own idea on this, and it is obviously related to content delivery and web performance.
Just to spill some beans: I don’t 100% believe in this new Edge Computing technology. It is probably going to happen, but it won’t be as big as the cloud revolution. And no, nobody needs a more distributed architecture.
I will talk about this in more detail at the upcoming LDNWebPerf in London on the 7th of March 2018.

 


About CDNs: is a Nordic CDN the same as a southern one?

“In theory, there is no difference between theory and practice. But, in practice, there is.” 

CDN: we all know what a Content Delivery Network is, and maybe also how it works. At least in theory. It’s easy:

It’s a network, therefore made of cables, servers and other hardware parts, which delivers some sort of content. Just to share a real-life example: the Olympic Games are recorded (that’s the content) and delivered to you via internet, live TV, OTT or any other possible device (and that’s the “delivery”), while the infrastructure that does the delivering is the network. Easy, right?

In theory, again, a CDN could be built anywhere. I mean, we went to the moon and we now have a Tesla approaching Mars and the sun; if we can do that, we can definitely shuffle content back and forth all over the world easily. Nope.
It turns out that sometimes we can build a CDN easily and other times it requires more strategy, but in the end we will be successful, don’t worry about it.

Now, let’s dive into the technical details.
First of all, my experience is quite vertical: I can share tips and best practices for anything concerning CDNs and caching software, but you should use other sources as well to get a complete overview of both software and hardware.

What do you need to build a CDN?

  1. Content: that’s on your plate. It can be static, dynamic, live streaming, VoD, OTT. It’s really up to you to come up with the best strategy together with your web developers and packagers.
  2. An ISP: again, pick the one that best fits your needs.
  3. PoPs (Points of Presence), possibly as close as possible to your audience. And that’s where I can help.

A PoP, if you think about it, is nothing but a caching node. You want to be close to your end users to reduce latency, and you also want to have a cache in place to avoid paying content generation costs and overwhelming your network.
A caching node, or PoP, consists of software running on a server. The server can be either bare metal or a cloud instance (AWS has more than 30 different machine types available; if you need any help picking the right one, reach out, because I’ve spent quite some time going through the specs of each of them).

Now, let’s assume you have everything you need and you “only” need to assemble the pieces. It won’t make any difference if you try to build a CDN in Norway, in Italy or in Nevada, right?

YOU WISH.

I mean, theoretically it works the same way in each of the three places, but then it happens that:

  • In Norway the temperature might drop below -30 degrees Celsius (-22 Fahrenheit) and your cables freeze == no signal.
  • In Italy you don’t even have fiber cables, because every time we try to dig holes in the ground to extend our infrastructure we discover some sort of Roman building and we can’t move forward with the work == no proper infrastructure.
  • In Nevada I don’t even know if it makes sense to have cables in the desert (we won’t cover this use case).

Hardware has limitations; software doesn’t, and it can help you achieve high performance even in harsh conditions.

Two examples.

A CDN in Norway:
The audience is most likely located in Oslo and its surroundings, and fiber cables reach pretty much every corner of the country, hence setting up a CDN is going to be straightforward and fully optimised:

There will be one (or more) origin servers in a data center in Oslo. In the same data center there will also be a caching layer shielding the origin server, because content is the most important thing for your business. Slow content is fine, no content means no users.
Other than that, we will have one or two PoPs, and each PoP will have at least two caching servers running in high availability to reduce the strain on the backends and to replicate content among the caches.
That’s it. It will work.
How do I know? I’ve lived there and I’ve worked on such projects.
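
A setup like the one above could be expressed on a PoP caching node roughly as follows. This is only a sketch under stated assumptions: the hostnames, the probe settings and the one-hour grace period are hypothetical, and it uses the fallback director from the directors VMOD that ships with Varnish 4.0 and later. The PoP prefers the shield cache in Oslo and talks to the origin directly only if the shield is unhealthy.

```vcl
vcl 4.0;

import directors;

# Hypothetical health check, shared by both upstreams.
probe healthcheck {
    .url = "/";
    .interval = 5s;
    .timeout = 2s;
    .window = 5;
    .threshold = 3;
}

# Hypothetical hostnames: the shield cache and the origin, both in Oslo.
backend shield {
    .host = "shield.oslo.example.com";
    .port = "80";
    .probe = healthcheck;
}

backend origin {
    .host = "origin.oslo.example.com";
    .port = "80";
    .probe = healthcheck;
}

sub vcl_init {
    # Prefer the shield; use the origin only if the shield is unhealthy.
    new upstream = directors.fallback();
    upstream.add_backend(shield);
    upstream.add_backend(origin);
}

sub vcl_recv {
    set req.backend_hint = upstream.backend();
}

sub vcl_backend_response {
    # Keep objects for an extra hour past their TTL, so a stale copy
    # can be served while a fresh one is being fetched.
    set beresp.grace = 1h;
}
```

High availability between the two caching servers of a PoP is usually handled in front of Varnish (load balancer or DNS), so it is not shown here.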

A CDN in Italy:
The audience is more distributed and much larger, fiber cables don’t exist everywhere and network availability is really not the best.

There will be one (or more) origin servers in a data center in either Milan or Turin, where it’s advisable to add a caching layer which acts as a shield, to avoid killing the web server(s) when spikes happen (to give you some proportions, there are more than 60 million people in Italy and about 5 million in Norway). Now, we want to have other PoPs in various Italian regions, and here the best approach would be to create a hybrid Content Delivery Network where you rely both on commercial CDN providers and on your own infrastructure (where possible). You own your architecture, and this gives you huge control over content, optimisation and user personalisation; do not underestimate this. By combining private and commercial CDNs you can reach every corner of the Italian market.

Why this? I’m currently living in Italy and I’ve had the chance to investigate possible ways to tackle such challenges.

If you have any questions just shoot them and I’ll follow-up.

Pornhub, GDPR and your right to be forgotten when you j**k off

It’s that time of the year again: in early January Pornhub released an AMAZING 2017 statistics report.

Pornhub is a huge porn platform. To give you an idea, in 2017 they streamed 3,732 petabytes of data, which makes for 7,101 GB per minute and 118 GB per second.
That is enough data to fill the storage of all of the world’s iPhones currently in use.
This should give you a good idea of how humongous Pornhub is.

If we add to that the fact that the porn market is extremely competitive (you want to access the content NOW, you don’t want to wait for it), we can easily say that Pornhub, from a technical point of view, is a little jewel.

Going through the report, it’s clear they serve content EVERYWHERE in the world, meaning they have to rely on CDNs to fulfil their audience’s requests.
I’d expect their infrastructure to be multilayered, with several origin servers distributed across the continents, caching layers and a CDN layer. Given that they mostly serve VoD and possibly OTT streams, they must have a lot of storage and bandwidth available, and amazing caching solutions.

I also curled a random video from pornhub.com:

[Screenshot: response headers returned by curl]

Those response headers are PERFECT. Nothing is really revealed about the way they do streaming, the servers they use or any other technology, which makes it way more difficult to attack them.
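
I don’t know how Pornhub achieves this. If you have Varnish in front of your own site, though, one possible way to get similarly quiet response headers is to strip the revealing ones in vcl_deliver. A minimal sketch, assuming Varnish 4.0+ syntax and a hypothetical backend on 127.0.0.1:8080; the list of headers is only an example:

```vcl
vcl 4.0;

# Hypothetical backend; adjust host and port to your setup.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_deliver {
    # Drop headers that advertise the software stack to the client.
    unset resp.http.Server;
    unset resp.http.X-Powered-By;
    unset resp.http.Via;
    unset resp.http.X-Varnish;
}
```

Removing X-Varnish also makes debugging harder, so you may want to keep it for requests coming from your own network.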

</end of my technical praise for Pornhub>

Now, to the scariest part.

“Thanks to the anonymized data provided by Google Analytics, Pornhub’s statisticians are able to build an accurate picture of the demographic makeup of our visitors including their gender, age and even interests.”

That’s what we read in the report.
But, is it really anonymous?

A few things they know about each of their users:

  • Age
  • Gender
  • What you want to watch
  • How long you browse around their website
  • The device you use
  • The operating system of the device
  • The web browser you use
  • The exact date and time of your visit(s)

And, of course, your IP address.
So, again, is this anonymous? NO.
They might use this data to try to sell you something, or they might collect it and sell it to someone else, who will (most likely) eventually try to sell you something. After all, when it’s on the web and it’s free, you are the product, and that is now widely known.

With the new GDPR kicking in on May the 25th, you can stop being the product and instead ask them (or, for that matter, anyone else who has collected some of your data) to delete all your records. You will have the right to be forgotten. And, again, you can refuse to let them collect any sort of personal data that can somehow pinpoint you.

Well, it will be extremely interesting to see how gigantic companies adapt to these new regulations. Pornhub is just one of the biggest and possibly the safest, but Facebook and Google, among others, must also be mentioned: those two gigantic corporations, which make a living out of users’ information, will have to reshape the whole way analytics are currently done.
It will be thrilling to see how user personalisation will be done when nobody is allowed to collect people’s specific details.

 

Monkigras 2018

It’s been a couple of days since the end of Monkigras*. This year’s theme was “Sustaining Craft”.
As usual the Monk crew did a STUNNING job. This conference is probably one of my favourites and, if I could, I would buy a ticket for Monkigras 2019 RIGHT NOW. (James, you should consider selling early early bird tickets.)
The speakers and the topics discussed were simply mind-blowing, and the relaxed atmosphere among the attendees is exactly what I look for in this kind of event. Needless to say, all of this is framed by good food and beer.

Anyhow, this year the spectrum of speakers was incredibly wide: we had the chance to listen to people working for 100% tech companies and people working in the fashion industry, by way of a political activist. Very diverse voices, which somehow managed to highlight how important it is to be sustainable in anything we do/hack/work on.

There are several summaries of the conference out there, one I’d like to point to is this one.
To be honest I haven’t taken any notes as I was completely fascinated by the speakers and didn’t want to lose a single word.

Also, Monkigras is one of the few conferences (I know of) which truly focus on diversity and inclusion.

Please consider attending and speaking next year.

And congrats once more to the organizers.

 

* Monkigras is an AMAZING conference held annually in London where interesting and unique topics are discussed by a very diverse set of international speakers.
(https://monkigras.com/)