Message ID | q3gayrsulu424e2qr5eg7zfs2rgy5ucluuw73o2pjcxmehvvmp@qxy723fyda3x (mailing list archive) |
---|---|
State | New |
Series | Forbidden requests for kernel.org/releases.json |
On Thu, Apr 10, 2025 at 10:05:28AM +0200, Daniel Gomez wrote: > We've started encountering "HTTP Error 403: Forbidden" errors in kdevops > when querying https://www.kernel.org/releases.json from our CI environments/ > deployments. We're using a Python script with the urllib library to fetch the > latest kernel release information [1]. > > As a temporary workaround [2], we are testing a User-Agent header to mimic a > browser request. To solve this properly, we have the following questions: > > * What is the recommended approach for automated tools to access > kernel.org/releases.json? > > * Are there any rate limits we should be aware of? > > * Would it be possible to serve releases.json from a CDN-backed subdomain to > reduce load on the main site? We could mirror the subdomain directory and > enable pointers for our datacenter network. +1 I have a script which tests whether a lore link: URL I'm adding to patches, is correct. I.e. whether https://lore.kernel.org/r/<Message-ID> can be read. What would be the suggested thing to do in such cases? Thx.
On Thu, Apr 10, 2025 at 12:30:37PM +0100, Borislav Petkov wrote: > On Thu, Apr 10, 2025 at 10:05:28AM +0200, Daniel Gomez wrote: > > We've started encountering "HTTP Error 403: Forbidden" errors in kdevops > > when querying https://www.kernel.org/releases.json from our CI environments/ > > deployments. We're using a Python script with the urllib library to fetch the > > latest kernel release information [1]. > > > > As a temporary workaround [2], we are testing a User-Agent header to mimic a > > browser request. To solve this properly, we have the following questions: > > > > * What is the recommended approach for automated tools to access > > kernel.org/releases.json? > > > > * Are there any rate limits we should be aware of? > > > > * Would it be possible to serve releases.json from a CDN-backed subdomain to > > reduce load on the main site? We could mirror the subdomain directory and > > enable pointers for our datacenter network. > > +1 > > I have a script which tests whether a lore link: URL I'm adding to patches, is > correct. I.e. whether > > https://lore.kernel.org/r/<Message-ID> > > can be read. > > What would be the suggested thing to do in such cases? FYI, I read this thread [1] recently where b4 was also failing on that type of URL. Quoting the explanation: "The anubis bot protection that I put in place yesterday required remapping some of the mountpoints, such as the legacy /r/. Internally, b4 has been using /all/ instead of /r/ for a while, but people who had b4.midmask set to a URL with /r/ in it experienced problems. Fixing the /r/ mount fixed the problem." "what is the canonical URL we should use for Link: tags? https://lore.kernel.org/r or /all?" "You can continue to use /r/ in URLs, or just omit /r/ entirely. Don't use /all/." [1] https://fosstodon.org/@brauner@mastodon.social/114281631562783020 > > Thx. > > -- > Regards/Gruss, > Boris. > > https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Apr 10, 2025 at 10:05:28AM +0200, Daniel Gomez wrote: > We've started encountering "HTTP Error 403: Forbidden" errors in kdevops > when querying https://www.kernel.org/releases.json from our CI environments/ > deployments. We're using a Python script with the urllib library to fetch the > latest kernel release information [1]. Yes, I'm trying to deal with bots who don't identify themselves. We're seeing many requests per second from user-agents like "python-requests/x.x" or "Python-urllib/x.x" or "Java/x.x" etc, and it's impossible for us to tell good bots from bad bots if they don't identify themselves properly. > * What is the recommended approach for automated tools to access > kernel.org/releases.json? Set your user-agent to something like: "kdevops-ci/{version} (contact@address.here)" -K
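For illustration, a minimal sketch of how a urllib-based script like the one described above could send such a user-agent. The tool name, version, and contact address are placeholders, and the helper names (build_user_agent, build_request, fetch_releases) are made up for this example; they are not taken from kdevops.

```python
import json
import urllib.request

RELEASES_URL = "https://www.kernel.org/releases.json"


def build_user_agent(tool: str, version: str, contact: str) -> str:
    """Return a User-Agent of the form 'tool/version (contact)'."""
    return f"{tool}/{version} ({contact})"


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that identifies the client instead of
    sending the default 'Python-urllib/x.x' user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_releases(user_agent: str, timeout: float = 10.0) -> dict:
    """Fetch and decode releases.json using the given user-agent."""
    req = build_request(RELEASES_URL, user_agent)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Placeholder identity; replace with your tool's real name and contact.
    ua = build_user_agent("kdevops-ci", "0.0", "contact@address.here")
    data = fetch_releases(ua)
    print(len(data["releases"]), "releases listed")
```

The only change compared to a bare urlopen() call is the explicit Request object carrying the header, so it should be easy to retrofit into an existing script.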
On Thu, Apr 10, 2025 at 08:45:30AM -0400, Konstantin Ryabitsev wrote:
> Set your user-agent to something like:
>
> "kdevops-ci/{version} (contact@address.here)"
>

Hi Konstantin,

Is this something you'd like continuous integration tools use in
generation? I'm using the git CLI called out from go co, but I could
do something like

    export GIT_HTTP_USER_AGENT="gce-xfstests-20250411/3-g42bcd9aa tytso@thunk.org"

(Where the version would be a lightly edited version of $(git
describe) from the xfstests-bld repository.)

The other alternative that I've tried is to replace git.kernel.org
with kernel.googlesource.com as the git mirror, is supposed to be only
a few minutes behind git.kernel.org and presumably is closer to a GCE
VM from a network perspective.

Do you have any advice or preference about the advisability of these
approaches?

- Ted
On Fri, Apr 11, 2025 at 10:18:00AM -0400, Theodore Ts'o wrote: > Hi Konstantin, > > Is this something you'd like continuous integration tools use in > generation? Yes, I think that in general it's just a good form for (well-behaved) bots to report something other than the default library name and version in their user-agent. When dealing with distributed crawler bots, a user-agent string is often the only thing we have to rely on when blocking access, so if your bot is indistinguishable from a hostile bot, you will be caught in the carnage. > I'm using the git CLI called out from go co, but I could > do something like > > export GIT_HTTP_USER_AGENT="gce-xfstests-20250411/3-g42bcd9aa tytso@thunk.org" For actual git requests it's fine if it just has git's default user-agent. Obviously, we are not going to start blocking that. :) > The other alternative that I've tried is to replace git.kernel.org > with kernel.googlesource.com as the git mirror, is supposed to be only > a few minutes behind git.kernel.org and presumably is closer to a GCE > VM from a network perspective. I'm fine with that as well -- just as long as you keep in mind that it can go away at any time the way many Google things sometimes do. I'm also considering running stable/next/mainline forks on several major forges as mirror-only repos that are updated immediately after each push, so people can use them as an alternative to googlesource. -K
On Fri, 2025-04-11 at 11:25 -0400, Konstantin Ryabitsev wrote:
> On Fri, Apr 11, 2025 at 10:18:00AM -0400, Theodore Ts'o wrote:
[...]
> > The other alternative that I've tried is to replace git.kernel.org
> > with kernel.googlesource.com as the git mirror, is supposed to be
> > only a few minutes behind git.kernel.org and presumably is closer
> > to a GCE VM from a network perspective.
>
> I'm fine with that as well -- just as long as you keep in mind that
> it can go away at any time the way many Google things sometimes do.
> I'm also considering running stable/next/mainline forks on several
> major forges as mirror-only repos that are updated immediately after
> each push, so people can use them as an alternative to googlesource.

Just on this point, the load from AI bots is presumably mostly
emanating from various public clouds that provide AI services. It does
seem to me that those clouds having mirror repositories (even if they
aren't public) that their AI training would use would help to lower the
AI bot load on kernel.org and provide faster training to the cloud that
did this (win/win). Should kernel.org have an official program to
facilitate this?

Regards,

James
On Fri, Apr 11, 2025 at 12:48:45PM -0400, James Bottomley wrote: > On Fri, 2025-04-11 at 11:25 -0400, Konstantin Ryabitsev wrote: > > On Fri, Apr 11, 2025 at 10:18:00AM -0400, Theodore Ts'o wrote: > [...] > > > The other alternative that I've tried is to replace git.kernel.org > > > with kernel.googlesource.com as the git mirror, is supposed to be > > > only a few minutes behind git.kernel.org and presumably is closer > > > to a GCE VM from a network perspective. > > > > I'm fine with that as well -- just as long as you keep in mind that > > it can go away at any time the way many Google things sometimes do. > > I'm also considering running stable/next/mainline forks on several > > major forges as mirror-only repos that are updated immediately after > > each push, so people can use them as an alternative to googlesource. > > Just on this point, the load from AI bots is presumably mostly > emanating from various public clouds that provide AI services. It does > seem to me that those clouds having mirror repositories (even if they > aren't public) that their AI training would use would help to lower the > AI bot load on kernel.org and provide faster training to the cloud that > did this (win/win). Should kernel.org have an official program to > facilitate this? Do we want, as a community, to facilitate GPL violations ?
On Fri, Apr 11, 2025 at 12:48:45PM -0400, James Bottomley wrote: > > I'm fine with that as well -- just as long as you keep in mind that > > it can go away at any time the way many Google things sometimes do. > > I'm also considering running stable/next/mainline forks on several > > major forges as mirror-only repos that are updated immediately after > > each push, so people can use them as an alternative to googlesource. > > Just on this point, the load from AI bots is presumably mostly > emanating from various public clouds that provide AI services. It does > seem to me that those clouds having mirror repositories (even if they > aren't public) that their AI training would use would help to lower the > AI bot load on kernel.org and provide faster training to the cloud that > did this (win/win). Should kernel.org have an official program to > facilitate this? We already do make it very easy to mirror everything we have. You can set up full replicas of git.kernel.org and lore.kernel.org that are updated within seconds -- and I know of companies who maintain such replicas for their internal needs. However, I don't think that will have any measurable impact on LLM learning bots, because it's less effort for such outfits to just buy residential DDoS bot farms and throw them at the internet as fast and as hard as they can. -K
On Fri, Apr 11, 2025 at 11:25:36AM -0400, Konstantin Ryabitsev wrote:
>
> For actual git requests it's fine if it just has git's default user-agent.
> Obviously, we are not going to start blocking that. :)

A while back we did get blocked once or twice; I assume because of
some IP address or IP range rate limit? The following day, after the
Kernel Compilation Service (KCS) VM had been shut down and restarted
with a new IP address, we had no trouble getting the new linux-next
branch. Would having a different user-agent help in that case?

> I'm fine with that as well -- just as long as you keep in mind that it can go
> away at any time the way many Google things sometimes do. I'm also considering
> running stable/next/mainline forks on several major forges as mirror-only
> repos that are updated immediately after each push, so people can use them as
> an alternative to googlesource.

What I might do is to have my system silently rewrite git.kernel.org
to one or more mirrors, with an automatic fallback if a particular
mirror disappears. That does carry a risk if the mirror sticks around
but stops updating. I suspect that's less likely to happen, and
presumably we can either (a) have some kind of heuristic for those
branches which are known to be regularly updated, or (b) rely on a
human to notice that particular failure case.

- Ted
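A rough sketch of the rewrite-with-fallback idea described above, not anyone's actual implementation. It assumes that each mirror serves repositories under the same path layout as git.kernel.org, which has to be verified per mirror. The mirror host mirror.example.org and the helper names are hypothetical. Detecting a mirror that has stopped updating is not handled here.

```python
import subprocess
from typing import Callable, Iterable, List, Optional

UPSTREAM_PREFIX = "https://git.kernel.org"


def candidate_urls(url: str, mirror_prefixes: Iterable[str]) -> List[str]:
    """Return mirror URLs first, then the original URL as the last resort.

    URLs that do not point at git.kernel.org are returned unchanged.
    """
    if not url.startswith(UPSTREAM_PREFIX):
        return [url]
    suffix = url[len(UPSTREAM_PREFIX):]
    rewritten = [prefix.rstrip("/") + suffix for prefix in mirror_prefixes]
    return rewritten + [url]


def first_working(urls: Iterable[str],
                  probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL accepted by probe(), or None if all fail."""
    for url in urls:
        if probe(url):
            return url
    return None


def git_probe(url: str, timeout: float = 30.0) -> bool:
    """Check reachability by asking the remote for its HEAD ref."""
    try:
        result = subprocess.run(
            ["git", "ls-remote", "--exit-code", url, "HEAD"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    repo = ("https://git.kernel.org/pub/scm/linux/kernel/git/"
            "next/linux-next.git")
    # Hypothetical mirror; substitute mirrors you actually trust.
    urls = candidate_urls(repo, ["https://mirror.example.org"])
    print(first_working(urls, git_probe))
```

Keeping the probe as a separate callable means the selection logic can be exercised without network access, and a staleness heuristic could later be folded into the probe.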
On Fri, 2025-04-11 at 13:00 -0400, Konstantin Ryabitsev wrote: > On Fri, Apr 11, 2025 at 12:48:45PM -0400, James Bottomley wrote: > > > > > > I'm fine with that as well -- just as long as you keep in mind that it can go away at any time the way many Google things sometimes do. I'm also considering running stable/next/mainline forks on several major forges as mirror-only repos that are updated immediately after each push, so people can use them as an alternative to googlesource. > > > > > > Just on this point, the load from AI bots is presumably mostly > > emanating from various public clouds that provide AI services. It does seem to me that those clouds having mirror repositories (even if they aren't public) that their AI training would use would help to lower the AI bot load on kernel.org and provide faster training to the cloud that did this (win/win). Should kernel.org have an official program to facilitate this? > > > We already do make it very easy to mirror everything we have. You can set up full replicas of git.kernel.org and lore.kernel.org that are updated within seconds -- and I know of companies who maintain such replicas for their internal needs. OK, where's the URL describing this? in case I happened to know a major cloud provider who might be interested ... > However, I don't think that will have any measurable impact on LLM learning bots, because it's less effort for such outfits to just buy > residential DDoS bot farms and throw them at the internet as fast and as hard as they can. Well, carrot and stick: if you're busy locking out AI crawlers because of DDoS farms, then even cloud based AI crawlers get caught, so it acts as an incentive to cloud providers to set this up to attract business. Regards, James
On Thu, Apr 10, 2025 at 08:45:30AM -0400, Konstantin Ryabitsev wrote: > On Thu, Apr 10, 2025 at 10:05:28AM +0200, Daniel Gomez wrote: > > We've started encountering "HTTP Error 403: Forbidden" errors in kdevops > > when querying https://www.kernel.org/releases.json from our CI environments/ > > deployments. We're using a Python script with the urllib library to fetch the > > latest kernel release information [1]. > > Yes, I'm trying to deal with bots who don't identify themselves. We're seeing > many requests per second from user-agents like "python-requests/x.x" or > "Python-urllib/x.x" or "Java/x.x" etc, and it's impossible for us to tell good > bots from bad bots if they don't identify themselves properly. > > > * What is the recommended approach for automated tools to access > > kernel.org/releases.json? > > Set your user-agent to something like: > > "kdevops-ci/{version} (contact@address.here)" Would it help things if you just delayed response to *every* query by a couple of seconds? While I love it that lore.kernel.org answers my queries more or less instantly, my user experience wouldn't suffer significantly if I had to wait for a short (to humans) but long (to bots) time. -Tony
James Bottomley <James.Bottomley@HansenPartnership.com> writes: > On Fri, 2025-04-11 at 11:25 -0400, Konstantin Ryabitsev wrote: >> On Fri, Apr 11, 2025 at 10:18:00AM -0400, Theodore Ts'o wrote: > [...] >> > The other alternative that I've tried is to replace git.kernel.org >> > with kernel.googlesource.com as the git mirror, is supposed to be >> > only a few minutes behind git.kernel.org and presumably is closer >> > to a GCE VM from a network perspective. >> >> I'm fine with that as well -- just as long as you keep in mind that >> it can go away at any time the way many Google things sometimes do. >> I'm also considering running stable/next/mainline forks on several >> major forges as mirror-only repos that are updated immediately after >> each push, so people can use them as an alternative to googlesource. > > Just on this point, the load from AI bots is presumably mostly > emanating from various public clouds that provide AI services. Have a look at Bright Data - they claim 100M+ *residential* IPs for scraping. They seem to operate a VPN service for "free", to use it you just have to allow them to use your connection for this kind of stuff. jon
On Fri, Apr 11, 2025 at 01:13:10PM -0400, James Bottomley wrote: > > We already do make it very easy to mirror everything we have. You can > set up full replicas of git.kernel.org and lore.kernel.org that are > updated within seconds -- and I know of companies who maintain such > replicas for their internal needs. > > OK, where's the URL describing this? in case I happened to know a major > cloud provider who might be interested ... I'll see if I can update our docs. I've published a few things in the past as blog posts, but there isn't a consolidated document. -K
On Thu, Apr 10, 2025 at 12:30:37PM +0200, Borislav Petkov wrote:
> +1
>
> I have a script which tests whether a lore link: URL I'm adding to patches, is
> correct. I.e. whether
>
> https://lore.kernel.org/r/<Message-ID>
>
> can be read.
>
> What would be the suggested thing to do in such cases?

You can continue doing it -- this is a lightweight operation. Just use a HEAD
request instead of a GET request. E.g.:

    [[ $(curl -o/dev/null -sIw '%{response_code}' https://lore.kernel.org/all/20250411201912.2872-1-annie.li@oracle.co/) -gt 200 ]] && echo "invalid" || echo "valid"

-K
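For scripts written in Python, the same check could be sketched with the standard library's http.client, which, like curl without -L, does not follow redirects. The user-agent string and the function names are placeholders for this example. As in the shell test above, any status above 200 (including a 301 or 302 redirect) is reported as invalid.

```python
import http.client
from urllib.parse import urlsplit

# Placeholder identity; use your own tool name and contact address.
USER_AGENT = "example-link-checker/0.1 (you@example.org)"


def status_is_valid(status: int) -> bool:
    """Mirror the shell test: a status greater than 200 is invalid."""
    return status <= 200


def head_status(url: str, timeout: float = 10.0) -> int:
    """Send one HEAD request and return the HTTP status code.

    Redirects are not followed, so the caller sees the first response.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": USER_AGENT})
        return conn.getresponse().status
    finally:
        conn.close()


if __name__ == "__main__":
    url = "https://lore.kernel.org/all/20250411201912.2872-1-annie.li@oracle.co/"
    print("valid" if status_is_valid(head_status(url)) else "invalid")
```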
On Fri, Apr 11, 2025 at 01:00:24PM -0400, Konstantin Ryabitsev wrote: > On Fri, Apr 11, 2025 at 12:48:45PM -0400, James Bottomley wrote: > > > I'm fine with that as well -- just as long as you keep in mind that > > > it can go away at any time the way many Google things sometimes do. > > > I'm also considering running stable/next/mainline forks on several > > > major forges as mirror-only repos that are updated immediately after > > > each push, so people can use them as an alternative to googlesource. > > > > Just on this point, the load from AI bots is presumably mostly > > emanating from various public clouds that provide AI services. It does > > seem to me that those clouds having mirror repositories (even if they > > aren't public) that their AI training would use would help to lower the > > AI bot load on kernel.org and provide faster training to the cloud that > > did this (win/win). Should kernel.org have an official program to > > facilitate this? > > We already do make it very easy to mirror everything we have. You can set up > full replicas of git.kernel.org and lore.kernel.org that are updated within > seconds -- and I know of companies who maintain such replicas for their > internal needs. Does *that* setup also leverage CDN? My gathering is that kernel.org would need to opt-in for that so perhaps it cannot scale well. What I'm talking about, say, we have good CI citizen infrastructure that not only does it not want to DDOS kernel.org, but *also* wants to leverage its own good-mirror citizens, *but* without having to require changes to userspace to use these mirrors, do we have a solution for this? 
Seems debian.org uses Varnish HTTP caching layers and not sure if that
may do it:

curl -I http://deb.debian.org/debian/
HTTP/1.1 200 OK
Connection: keep-alive
Content-Length: 6123
Server: Apache
X-Content-Type-Options: nosniff
X-Frame-Options: sameorigin
Referrer-Policy: no-referrer
X-Xss-Protection: 1
Permissions-Policy: interest-cohort=()
X-Clacks-Overhead: GNU Terry Pratchett
Content-Type: text/html;charset=UTF-8
Via: 1.1 varnish, 1.1 varnish
Accept-Ranges: bytes
Age: 0
Date: Fri, 11 Apr 2025 20:51:20 GMT
X-Served-By: cache-ams21082-AMS, cache-sjc1000099-SJC
X-Cache: HIT, MISS
X-Cache-Hits: 2, 0
X-Timer: S1744404680.387211,VS0,VE144
Vary: Accept-Encoding

  Luis
On Fri, Apr 11, 2025 at 01:56:12PM -0700, Luis Chamberlain wrote:
> > We already do make it very easy to mirror everything we have. You can set up
> > full replicas of git.kernel.org and lore.kernel.org that are updated within
> > seconds -- and I know of companies who maintain such replicas for their
> > internal needs.
>
> Does *that* setup also leverage CDN? My gathering is that kernel.org
> would need to opt-in for that so perhaps it cannot scale well.

No, we can be completely unaware of it if it's for your in-house needs.

> What I'm talking about, say, we have good CI citizen infrastructure that
> not only does it not want to DDOS kernel.org, but *also* wants to
> leverage its own good-mirror citizens, *but* without having to require
> changes to userspace to use these mirrors, do we have a solution for
> this?

You can use git's insteadOf magic to make all git requests go to your local
mirror. E.g. by putting this into /etc/gitconfig on your CI nodes:

    [url "git://your.local.mirror.url"]
        insteadOf = git://git.kernel.org
        insteadOf = https://git.kernel.org

This way you can quickly swap between using upstream and using a local mirror
without modifying your scripts.

-K
On Fri, Apr 11, 2025 at 05:04:36PM -0400, Konstantin Ryabitsev wrote: > On Fri, Apr 11, 2025 at 01:56:12PM -0700, Luis Chamberlain wrote: > > > We already do make it very easy to mirror everything we have. You can set up > > > full replicas of git.kernel.org and lore.kernel.org that are updated within > > > seconds -- and I know of companies who maintain such replicas for their > > > internal needs. > > > > Does *that* setup also leverage CDN? My gathering is that kernel.org > > would need to opt-in for that so perhaps it cannot scale well. > > No, we can be completely unaware of it if it's for your in-house needs. > > > What I'm talking about, say, we have good CI citizen infrastructure that > > not only does it not want to DDOS kernel.org, but *also* wants to > > leverage its own good-mirror citizens, *but* without having to require > > changes to userspace to use these mirrors, do we have a solution for > > this? > > You can use git's insteadOf magic to make all git requests go to your local > mirror. E.g. by putting this into /etc/gitconfig on your CI nodes: > > [url "git://your.local.mirror.url"] > insteadOf = git://git.kernel.org > insteadOf = https://git.kernel.org Beautiful, thanks! And... do we know if most cloud providers mirror kernel.org? If so then CIs that want to leverage cloud can use the trick above for each cloud solution. > This way you can quickly swap between using upstream and using a local mirror > without modifying your scripts. That's hella cool. Luis
On Fri, Apr 11, 2025 at 04:38:07PM -0400, Konstantin Ryabitsev wrote: > On Fri, Apr 11, 2025 at 01:13:10PM -0400, James Bottomley wrote: > > > We already do make it very easy to mirror everything we have. You can > > set up full replicas of git.kernel.org and lore.kernel.org that are > > updated within seconds -- and I know of companies who maintain such > > replicas for their internal needs. > > > > OK, where's the URL describing this? in case I happened to know a major > > cloud provider who might be interested ... > > I'll see if I can update our docs. I've published a few things in the past as > blog posts, but there isn't a consolidated document. So is this not completely up-to-date? https://www.kernel.org/mirroring-kernelorg-repositories.html - Ted
On Fri, Apr 11, 2025 at 04:54:25PM -0400, Konstantin Ryabitsev wrote:
> You can continue doing it -- this is a lightweight operation. Just use a HEAD
> request instead of a GET request.

Ok, here's what I have now:

    headers = { "User-Agent": "Boris patch massager script vp.py (bp@alien8.de)" }
    get = requests.head(link_url, headers=headers)
    print(get.headers)

and for that done on the URL:

https://lore.kernel.org/20250414150951.5345-1-bp@kernel.org

it returns 302 with the Location header redirecting to the same thing but in
the /all/ range.

    {'Server': 'nginx', 'Date': 'Tue, 15 Apr 2025 08:52:28 GMT',
     'Content-Type': 'text/plain', 'Content-Length': '79',
     'Connection': 'keep-alive', 'Age': '0',
     'Location': 'http://lore.kernel.org/all/20250414150951.5345-1-bp@kernel.org/',
     'Via': '1.1 varnish (Varnish/6.6)', 'X-Varnish': '17023135',
     'X-Frame-Options': 'DENY', 'X-Content-Type-Options': 'nosniff',
     'X-XSS-Protection': '1; mode=block',
     'Strict-Transport-Security': 'max-age=15768001',
     'Content-Security-Policy': "default-src 'self'; worker-src 'self' blob:; style-src 'self' 'unsafe-inline'; img-src https:"}

Now, if I query the URL in the /all/ range, it gives 301:

    {'Server': 'nginx', 'Date': 'Tue, 15 Apr 2025 08:56:39 GMT',
     'Content-Type': 'text/plain', 'Content-Length': '79',
     'Connection': 'keep-alive', 'Age': '0',
     'Location': 'http://lore.kernel.org/all/20250414150951.5345-1-bp@kernel.org/',
     'Via': '1.1 varnish (Varnish/6.6)', 'X-Varnish': '9754343',
     'X-Frame-Options': 'DENY', 'X-Content-Type-Options': 'nosniff',
     'X-XSS-Protection': '1; mode=block',
     'Strict-Transport-Security': 'max-age=15768001',
     'Content-Security-Policy': "default-src 'self'; worker-src 'self' blob:; style-src 'self' 'unsafe-inline'; img-src https:"}

giving me the unencrypted http:// Location and if I do that it gives me 301
again to the *encrypted* URL:

    {'Server': 'nginx', 'Date': 'Tue, 15 Apr 2025 08:57:15 GMT',
     'Content-Type': 'text/html', 'Content-Length': '162',
     'Connection': 'keep-alive',
     'Location': 'https://lore.kernel.org/all/20250414150951.5345-1-bp@kernel.org'}

LOL.

So, what would be the best and the lowest overhead thing to use?

All I know is that people requested the https:// variant in Links in the past
so we probably should keep doing that.

Thx.
diff --git a/scripts/generate_refs.py b/scripts/generate_refs.py
index 5171414..41011cf 100755
--- a/scripts/generate_refs.py
+++ b/scripts/generate_refs.py
@@ -302,7 +302,9 @@ def kreleases(args) -> None:
     reflist = []
 
     if _check_connection("kernel.org", 80):
-        with urllib.request.urlopen("https://www.kernel.org/releases.json") as url:
+        _url = "https://www.kernel.org/releases.json"
+        req = urllib.request.Request(_url, headers={"User-Agent": "Mozilla/5.0"})
+        with urllib.request.urlopen(req) as url:
             data = json.load(url)
 
             for release in data["releases"]: