Automattic's Omnivorous CDN

Started by dubsartur, June 09, 2022, 02:24:23 AM


dubsartur

Most of us are familiar with content delivery networks: instead of serving all requests for a file from one central server, you scatter copies across different servers in different locations and fetch the closest one.  It helps giant sites that serve lots of photos, audio, and video reduce lag time, keeps any one server from being overwhelmed, and keeps the file available even if the original server becomes inaccessible.

But did you know that for any URL of a media file, like this picture of Sir David Attenborough https://ichef.bbci.co.uk/news/976/cpsprodpb/125EA/production/_125324257_hi076596136.jpg, you can upload it to Automattic's CDN just by surfing to the matching URL on any of their image hosts (i0.wp.com, i1.wp.com, or i2.wp.com; a sketch of the pattern is at the end of this post)?

And once you do that, it will be distributed around their global network of servers and stay on them indefinitely.  For more information and live updates see this blog post.
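
For anyone who wants to see the pattern, here is a minimal sketch in Python.  It assumes the i0.wp.com / i1.wp.com / i2.wp.com hosts mentioned later in this thread, and that the source URL minus its scheme is simply appended to the host; that is my reading of the behaviour, not anything Automattic documents, and the function name is just for illustration.

    from urllib.parse import urlsplit

    def cdn_candidates(image_url):
        """Build candidate wp.com CDN URLs for a source image (pattern assumed, not documented)."""
        parts = urlsplit(image_url)
        stripped = parts.netloc + parts.path          # drop the scheme, keep host + path
        if parts.query:
            stripped += "?" + parts.query
        return [f"https://i{n}.wp.com/{stripped}" for n in range(3)]

    # Example with the Attenborough photo above: requesting any of these is,
    # per the post, enough to make the CDN fetch and keep a copy.
    for url in cdn_candidates("https://ichef.bbci.co.uk/news/976/cpsprodpb/125EA/production/_125324257_hi076596136.jpg"):
        print(url)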

Tusky

That's interesting.

Most good (or "non-creepy") CDN providers will have a TTL on cached files, and also will not copy things unless you have asked them to!

dubsartur

#2
Quote from: Tusky on June 09, 2022, 05:25:26 AM
That's interesting.

Most good (or "non-creepy") CDN providers will have a TTL on cached files, and also will not copy things unless you have asked them to!
And the only way to see whether they have a copy of something is to ping a URL, which causes them to make a copy.  That is why the DWaves blog thought its uploads were being handed to those hosts as soon as they were added to a blog: as soon as a file was uploaded, it could be found at i0.wp.com, i1.wp.com, and i2.wp.com.  It's a catch-22.

By "indefinitely" I mean that it is at their discretion how long files live on their CDN, and outsiders don't know the policy.

Edit: someone has checked, and they do not respect robots.txt.  A domain can reject all crawlers and Automattic will still copy files from it for anyone who pings their domains.
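
For contrast, a well-behaved fetcher checks robots.txt before copying anything.  Here is a minimal sketch of that check using Python's standard library; the user-agent string is just a placeholder.

    from urllib import robotparser
    from urllib.parse import urlsplit

    def allowed_to_fetch(url, user_agent="ExampleImageProxy"):
        """Return True only if the site's robots.txt permits this user agent to fetch the URL."""
        parts = urlsplit(url)
        rp = robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()                      # download and parse robots.txt
        return rp.can_fetch(user_agent, url)

    # A site whose robots.txt says "User-agent: *" / "Disallow: /" returns False here,
    # yet according to the check above the CDN copies its files anyway.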

dubsartur

#3
It seems that many corporate social media sites and big web services will add media from third-party sites to their CDN for anyone who pings it.  I am still struggling to find a legitimate reason for this.

https://external-content.duckduckgo.com/iu/?u=https://pfp.easrng.net/&f=1
https://www.gravatar.com/avatar/1234567890abcdefedcba09876543210?s=100&r=g&d=https://pfp.easrng.net/

Obviously if you upload an image to FB they process it and save it, and if your profile on a site has third-party files in it there are advantages to caching them locally so every page visit does not ping that third-party site.  But why on earth set up the system to scrape any random file that anyone tells it to scrape?  And why not be open about this, in the way that respectable bots have a page clearly explaining what they do, who has access to the information collected, and how to block them?  Or publicly state how long files will stay on the system after the last time they are called?
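
For what it's worth, the two example URLs above seem to follow a simple pattern; here is a rough sketch of how they could be built.  The parameter meanings are inferred from the URLs themselves, not from any official documentation, and the function names are just for illustration.

    from urllib.parse import quote

    def ddg_proxy_url(image_url):
        # external-content.duckduckgo.com appears to take the source URL in the "u" parameter;
        # the "f=1" flag is copied from the example above, meaning unknown.
        return "https://external-content.duckduckgo.com/iu/?u=" + quote(image_url, safe="") + "&f=1"

    def gravatar_url_with_fallback(email_hash, fallback_url, size=100):
        # Gravatar's "d" parameter supplies a fallback image URL for hashes with no avatar.
        return (f"https://www.gravatar.com/avatar/{email_hash}"
                f"?s={size}&r=g&d=" + quote(fallback_url, safe=""))

    print(ddg_proxy_url("https://pfp.easrng.net/"))
    print(gravatar_url_with_fallback("1234567890abcdefedcba09876543210", "https://pfp.easrng.net/"))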

If I copy a file and upload it to a website, there is normally accountability (i.e. I have to create some kind of account, and they have a clear policy for what I can post and under what circumstances it will be taken down).  Everyone knows that if you let random IP addresses add files to your site, it will get filled up with goatse.cx and memes and HeRbAl ViAgRa ads.

Tusky

Quote from: dubsartur on June 14, 2022, 06:33:16 AM
why on earth set up the system to scrape any random file that anyone tells it to scrape? 

From your examples, I can see why DuckDuckGo would need to do this, since it's a search engine. Serving the copy from its own origin lets it show a cached version of a web page or image in search results without triggering certificate or mixed-content warnings.

Quote from: dubsartur on June 14, 2022, 06:33:16 AM
Everyone knows that if you let random IP addresses add files to your site, it will get filled up with goatse.cx and memes and HeRbAl ViAgRa ads.

They would only be able to add the first instance of the file, which would then be cached. I'm not sure what the point would be of a bot that posted a nefarious image in one place and then pinged a site that used this third-party caching so that the image would be stored there as well.

If it's just a question of privacy: if you have uploaded an image to the internet and wish to view it there, then it will need to be publicly accessible in some manner. You can add measures to make an image more private, and Facebook and sites like that should have some. Like this photo of mine:

https://www.facebook.com/photo.php?fbid=10154283422853552&set=pb.562178551.-2207520000..&type=3

But it needs to be accessible on the internet somewhere, otherwise I could not view it.

https://scontent-lhr8-1.xx.fbcdn.net/v/t31.18172-8/13422273_10154283422853552_7241542756728080312_o.jpg?_nc_cat=106&ccb=1-7&_nc_sid=cdbe9c&_nc_ohc=YmLDa7R4LIQAX_fIHFg&_nc_ht=scontent-lhr8-1.xx&oh=00_AT-NkMzxZv5BTxKWWLSnxGgSZr9BgXCuJkuYKcmjj_pi0w&oe=62CEB04F

The above uses Facebook's CDN and some hashing as well, so I don't think this one works with the examples above.
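
To make that concrete, here is a small sketch that just lists the query parameters on the fbcdn URL above. I am guessing, not stating as fact, that values like oh and oe are per-request tokens, which would be why the plain "prefix another host" trick fails here.

    from urllib.parse import urlsplit, parse_qs

    fb_url = ("https://scontent-lhr8-1.xx.fbcdn.net/v/t31.18172-8/"
              "13422273_10154283422853552_7241542756728080312_o.jpg"
              "?_nc_cat=106&ccb=1-7&_nc_sid=cdbe9c&_nc_ohc=YmLDa7R4LIQAX_fIHFg"
              "&_nc_ht=scontent-lhr8-1.xx&oh=00_AT-NkMzxZv5BTxKWWLSnxGgSZr9BgXCuJkuYKcmjj_pi0w&oe=62CEB04F")

    # List the query parameters; several of them (oh, oe, _nc_*) look like per-request tokens.
    for key, values in parse_qs(urlsplit(fb_url).query).items():
        print(key, "=", values[0])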