Spyke
lemmy.world

When a firm outright admits to bypassing or trying to bypass measures taken to keep them out, you'd think that would be a slam dunk case of unauthorized access under the CFAA with felony enhancements.

255
lemmy.world

Fuck that. I don't need prosecutors and the courts to rule that accessing publicly available information in a way that the website owner doesn't want is literally a crime. That logic would extend to ad blockers and editing HTML/js in an "inspect element" tag.

104
lemmy.world

That logic would not extend to ad blockers, as the point of concern is gaining unauthorized access to a computer system or asset. Blocking ads would not be considered gaining unauthorized access to anything. In fact it would be the opposite of that.

60
lemmy.world

gaining unauthorized access to a computer system

And my point is that defining "unauthorized" to include visitors using unauthorized tools/methods to access a publicly visible resource would be a policy disaster.

If I put a banner on my site that says "by visiting my site you agree not to modify the scripts or ads displayed on the site," does that make my visit with an ad blocker "unauthorized" under the CFAA? I think the answer should obviously be "no," and that the way to define "authorization" is whether the website puts up some kind of login/authentication mechanism to block or allow specific users, not to put a simple request to the visiting public to please respect the rules of the site.

To me, a robots.txt is more like a friendly request to unauthenticated visitors than it is a technical implementation of some kind of authentication mechanism.

Scraping isn't hacking. I agree with the Third Circuit and the EFF: If the website owner makes a resource available to visitors without authentication, then accessing those resources isn't a crime, even if the website owner didn't intend for site visitors to use that specific method.

18
Glitchvid
lemmy.world

When sites put challenges like Anubis or other measures to authenticate that the viewer isn't a robot, and scrapers then employ measures to thwart that authentication (via spoofing or other means) I think that's a reasonable violation of the CFAA in spirit — especially since these mass scraping activities are getting attention for the damage they are causing to site operators (another factor in the CFAA, and one that would promote this to felony activity.)

The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

19
ubergeek
lemmy.today

The fact is these laws are already on the books, we may as well utilize them to shut down this objectively harmful activity AI scrapers are doing.

Silly plebe! Those laws are there to target the working class, not to be used against corporations. See: Copyright.

10
lemmy.world

Nah, that would also mean using Newpipe, YoutubeDL, Revanced, and Tachiyomi would be a crime, and it would only take the re-introduction of WEI to extend that criminalization to the rest of the web ecosystem. It would be extremely shortsighted and foolish of me to cheer on the criminalization of user spoofing and browser automation because of this.

8
Glitchvid
lemmy.world

Do you think DoS/DDoS activities should be criminal?

If you're a site operator and the mass AI scraping is genuinely causing operational problems (not hard to imagine, I've seen what it does to my hosted repositories pages) should there be recourse? Especially if you're actively trying to prevent that activity (revoking consent in cookies, authorization captchas).

In general I think the idea of "your right to swing your fists ends at my face" applies reasonably well here — these AI scraping companies are giving lots of admins bloody noses and need to be held accountable.

I really am amenable to arguments wrt the right to an open web, but look at how many sites are hiding behind CF and other portals, or outright becoming hostile to any scraping at all; we're already seeing the rapid death of the ideal because of these malicious scrapers, and we should be using all available recourse to stop this bleeding.

5

DoS attacks are already a crime, so of course the need for some kind of solution is clear. But any proposal that gatekeeps the internet and restricts the freedoms with which the user can interact with it is no solution at all. To me, the openness of the web shouldn't be something that people just consider, or are amenable to. It should be the foundation that every reasonable proposal treats as a first principle.

3

That same logic is how Aaron Swartz was cornered into suicide for scraping JSTOR, something widely agreed to be a bad idea by a wide range of lawspeople including SCOTUS in its 2021 decision Van Buren v. US that struck this interpretation off the books.

4
lemmy.world

If I put a banner on my site that says "by visiting my site you agree not to modify the scripts or ads displayed on the site," does that make my visit with an ad blocker "unauthorized" under the CFAA?

How would you “authorize” a user to access assets served by your systems based on what they do with them after they've accessed them? That doesn’t logically follow so no, that would not make an ad blocker unauthorized under the CFAA. Especially because you’re not actually taking any steps to deny these people access either.

AI scrapers, on the other hand, are a type of user that you’re not authorizing to begin with, and if you’re using Cloudflare’s bot protection you’re putting into place a system to deny them access. To purposefully circumvent that access would be considered unauthorized.

3

That doesn’t logically follow so no, that would not make an ad blocker unauthorized under the CFAA.

The CFAA also criminalizes "exceeding authorized access" in every place it criminalizes accessing without authorization. My position is that mere permission (in a colloquial sense, not necessarily technical IT permissions) isn't enough to define authorization. Social expectations and even contractual restrictions shouldn't be enough to define "authorization" in this criminal statute.

To purposefully circumvent that access would be considered unauthorized.

Even as a normal non-bot user who sees the cloudflare landing page because they're on a VPN or happen to share an IP address with someone who was abusing the network? No, circumventing those gatekeeping functions is no different than circumventing a paywall on a newspaper website by deleting cookies or something. Or using a VPN or relay to get around rate limiting.

The idea of criminalizing scrapers or scripts would be a policy disaster.

4
lemmy.world

Site owners currently do and should have the freedom to decide who is and is not allowed to access the data, and to decide what purpose it gets used for. Idgaf if you think scraping is malicious or not, it is and should be illegal to violate clear and obvious barriers against them at the cost of the owners and unsanctioned profit of the scrapers off of the work of the site owners.

1
lemmy.world

to decide what purpose it gets used for

Yeah, fuck everything about that. If I'm a site visitor I should be able to do what I want with the data you send me. If I bypass your ads, or use your words to write a newspaper article that you don't like, tough shit. Publishing information is choosing not to control what happens to the information after it leaves your control.

Don't like it? Make me sign an NDA. And even then, violating an NDA isn't a crime, much less a felony punishable by years of prison time.

Interpreting the CFAA to cover scraping is absurd and draconian.

0
lemmy.world

If you want anybody and everyone to be able to use everything you post for any purpose, right on, good for you, but don't try to force your morality on others who rely on their writing, programming, and artworks to make a living to survive.

1

I'm gonna continue to use ad blockers and yt-dlp, and if you think I'm a criminal for doing so, I'm gonna say you don't understand either technology or criminal law.

0
cm0002
piefed.world

You say, just as news breaks that the top German court has overturned a decision that declared "ad blocking isn't piracy"

6
lemmy.world

Unauthorized access into a computer system and “Piracy” are two very different things.

6
cm0002
piefed.world

Please instruct me on how I go to the timeline where the legal system always makes decisions based on logic, reasoning, evidence and fairness and not...the opposite...of all those things

You have a lot of trust placed in the courts to actually do the right thing

3
lemmy.world

I’m not saying courts couldn’t pass a new law saying whatever they want. But the laws we have today would not allow for ad blocking to be considered unauthorized access. Not under the CFAA as mentioned.

I said “The logic would not extend to that” not that a legal system could not act illogically.

3

The original comment reply to you was all about how the legal system would act, that's the primary concern. All it would take is a Trump loyalist judge, a Trump leaning appeals court and the right-wing Supreme Court and boom suddenly the CFAA covers a whole lot more than what was "logical"

1
Demdaru
lemmy.world

Ehhhh, you are gaining access to content on the assumption that you are going to interact with ads and thus bring revenue to the person and/or company producing said content. If you block ads, you remove authorisation brought to you by ads.

-2

That doesn’t make any logical sense. You can’t tie legal authorization to an unsaid implicit assumption, especially when that is in turn based on what you do with the content you’ve retrieved from a system after you’ve accessed and retrieved it.

When you access a system, are you authorized to do so, or aren’t you? If you are, that authorization can’t be retroactively revoked. If that were the case, you could be arrested for having used a computer at a job, once you’ve quit. Because even though you were authorized to use it and your corporate network while you worked there, now that you’ve quit and are no longer authorized that would apply retroactively back to when you DID work there.

4

Careful, by that logic even not looking at ads positioned at the bottom of the page (or otherwise not visible without scrolling) would remove the authorisation brought to you by ads.

1
kibiz0r
midwest.social

They already prosecute people under the unauthorized access provision. They just don’t prosecute rich people under it.

28

They prosecuted and convicted a guy under the CFAA for figuring out the URL schema for an AT&T website designed to be accessed by the iPad when it first launched, and then just visiting that site by trying every URL in a script. And then his lawyer (the foremost expert on the CFAA) got his conviction overturned:

https://www.eff.org/cases/us-v-auernheimer

We have to maintain that fight, to make sure that the legal system doesn't criminalize normal computer tinkering, like using scripts or even browser settings in ways that site owners don't approve of.

14
jve
lemmy.world

Right? Isn’t this a textbook DMCA violation, too?

9

for us, not for them. wait until they argue in court that actually it's us at fault and we need to provide access or else

4
lemmy.dbzer0.com

It's difficult to be a shittier company than OpenAI, but Perplexity seems to be trying hard.

245
BigFig
lemmy.world

Step 1, SOMEHOW find a more punchable face than Altman

74
pyre
lemmy.world

yeah. still not worth dealing with fucking cloudflare. fuck cloudflare.

30
lemmy.world

That would be terrible for a lot of people as they are the only company providing such services that doesn't charge for traffic.

6
Int32
lemmy.dbzer0.com

They can use web.archive.org as a CDN (I do that to cloudflare websites). But honestly, cloudflare or not, the internet is broken.

2
turmoil
feddit.org

Using archive.org as a CDN at the scale of Cloudflare would be an immediate death sentence for archive.org.

3
Int32
lemmy.dbzer0.com

just take a snapshot of your website... then make all links to your website link to that snapshot, and turn your server off.

0
oppy1984
lemdro.id

I'm out of the loop, what's wrong with cloud flare?

2
ubergeek
lemmy.today

Centralization, mostly, but also their hands-off approach to most fascist content.

7

They kind of have to be hands off or risk losing safe harbor protections.

5

I get the centralization concerns, but I would think that's on the consumer since there are other options. As for the fascist content, as another commenter said, they could risk their safe harbor if they started regulating content that they weren't legally required to regulate.

Just my thoughts.

3
feddit.org

Perplexity argues that a platform’s inability to differentiate between helpful AI assistants and harmful bots causes misclassification of legitimate web traffic.

So, I assume Perplexity uses appropriate identifiable user-agent headers, to allow hosters to decide whether to serve them one way or another?
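To illustrate what "deciding whether to serve them" can look like on the hoster's side, here is a minimal Python sketch. The agent tokens (`examplebot`, `exampleassistant`) and the policy actions are hypothetical placeholders, not any real crawler's declared name or any real product's configuration.

```python
# Hypothetical policy table: agent tokens and actions are illustrative
# assumptions, not the declared names of any real crawler.
AGENT_POLICY = {
    "examplebot": "block",        # an indexing crawler the hoster rejects
    "exampleassistant": "allow",  # a user-triggered agent the hoster accepts
}


def decide(user_agent: str, default: str = "allow") -> str:
    """Return the action for a request based on its User-Agent header."""
    ua = user_agent.lower()
    for token, action in AGENT_POLICY.items():
        if token in ua:
            return action
    # Anything that does not match a known token gets the default action.
    return default
```

A check like this only works when the client identifies itself honestly. A client that sends a generic browser string falls through to the default, which is exactly the scenario the comment above is pointing at.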

106
lime!
feddit.nu

yeah it's almost like there was already a system for this in place

39
lime!
feddit.nu

i really wish we wouldn't do those. feels too reddity.

but thanks.

1
ubergeek
lemmy.today

And I'm assuming if the robots.txt states their User-Agent isn't allowed to crawl, it obeys it, right? :P

11
Kissaki
feddit.org

No, as per the article, their argumentation is that they are not web crawlers generating an index, they are user-action-triggered agents working live for the user.

4

Except, it's not a live user hitting 10 sites all at the same time, trying to crawl the entire site... Live users cannot do that.

That said, if my robots.txt forbids them from hitting my site, as a proxy, they obey that, right?
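For reference, "obeying robots.txt" is something the client has to do voluntarily before each fetch. Below is a minimal sketch using Python's standard `urllib.robotparser`; the agent name `ExampleAssistant` and the rules are made up for illustration and say nothing about how any particular company's agent actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: "ExampleAssistant" is a hypothetical agent name.
ROBOTS_TXT = """\
User-agent: ExampleAssistant
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Nothing enforces the result: a client that never runs a check like this, or that presents a different agent name, is not stopped by the file itself.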

3

It's not up to the hoster to decide whom to serve content to. The web is intended to be user agent agnostic.

-1
europe.pub

Uh, are they admitting they are trying to circumvent technological protections set up to restrict access to a system?

Isn’t that a literal computer crime?

73
utopiah
lemmy.world

puts on evil hat CloudFlare should DRM their protection then DMCA Perplexity and other US based "AI" companies to oblivion. Side effect, might break the Internet.

16

As far as security is concerned, their w's are pretty common tbh. It's just the whole centralization issue.

54

That’s the entire point, dipshit. I wish we got one of the cool techno dystopias rather than this boring corporate idiot one.

58

I'm still holding out for Stephen Hawking to mail out Demon Summoning programs.

15
lemmy.world

You'd think that a competent technology company, with their own AI would be able to figure out a way to spoof Cloudflare's checks. I'd still think that.

51
[deleted]
lemmy.world

Or find a more efficient way to manage data, since their current approach is basically DDOSing the internet for training data and also for responding to user interactions.

69
flux
lemmy.ml

This is not about training data, though.

Perplexity argues that Cloudflare is mischaracterizing AI Assistants as web crawlers, saying that they should not be subject to the same restrictions since they are user-initiated assistants.

Personally I think that claim is a decent one: user-initiated request should not be subject to robot limitations, and are not the source of DDOS attack to web sites.

I think the solution is quite clear, though: either make use of the user identity to waltz through the blocks, or even make use of the user's browser to do it. Once a captcha appears, let the user solve it.

Though technically making all this happen flawlessly is quite a big task.

-3
[deleted]
lemmy.world

Personally I think that claim is a decent one: user-initiated request should not be subject to robot limitations, and are not the source of DDOS attack to web sites.

They are one of the sources!

The AI scraping when a user enters a prompt is DDOSing sites in addition to the scraping for training data that is DDOSing sites. These shitty companies are repeatedly slamming the same sites over and over again in the least efficient way because they are not using the scraped data from training when they process a user prompt that does a web search.

Scraping once extensively and scraping a bit less but far more frequently have similar impacts.

1
flux

When a user enters a prompt, the backend may retrieve a handful of pages to serve that prompt. It won't retrieve all the pages of a site. Hardly different from a user using a search engine and clicking the 5 topmost links into tabs. If that is not a DoS attack, then an agent doing the same isn't a DDoS attack.

Constructing the training material in the first place is a different matter, but if you're asking about fresh events or new APIs, the training data just doesn't cut it. The training, and subsequently the material retrieval, was done a long time ago.

1

see, but they're not competent. further, they don't care. most of these ai companies are snake oil. they're selling you a solution that doesn't meaningfully solve a problem. their main way of surviving is saying "this is what it can do now, just imagine what it can do if you invest money in my company."

they're scammers, the lot of them, running ponzi schemes with our money. if the planet dies for it, that's no concern of theirs. ponzi schemes require the schemer to have no long term plan, just a line of credit that they can keep drawing from until they skip town before the tax collector comes

32
lemmy.today

Good. I went through my CF panel, and blocked some of those "AI Assistants" that by default were open, including Perplexity's.

47
lemmybefree.net

I don't like cloudflare but it's nice that they allow people to stop AI scraping if they want to

38
tempest
lemmy.ca

CloudFlare has become an Internet protection racket and I'm not happy about it.

28
Laser
feddit.org

It's been this from the very beginning. But they don't fit the definition of a protection racket as they're not the ones attacking you if you don't pay up. So they're more like a security company that has no competitors due to the needed investment to operate.

21
A1kmm
lemmy.amxl.com

Cloudflare are notorious for shielding cybercrime sites. You can't even complain about abuse of Cloudflare about them, they'll just forward on your abuse complaint to the likely dodgy host of the cybercrime site. They don't even have a channel to complain to them about network abuse of their DNS services.

So they certainly are an enabler of the cybercriminals they purport to protect people from.

4

Any internet service provider needs to be completely neutral. Not only in their actions, but also in their liability.
Same goes for other services like payment processors.
If companies that provide content-agnostic services are allowed to police the content, that opens the door to really nasty stuff.

You can't chop everyone's arms to stop a few people from stealing.

If they think their services are being used in a reprehensible manner, what they need to do is alert the authorities, not act like vigilantes.

3

If they acted differently, they'd probably be liable for illegal activity that they proxy for (this is for example relevant for the DMCA safe harbor).

Anyhow, when on their abuse page, I have an option for "Registrar", which is used for "DNS abuse", among others.

1
Electricd
lemmybefree.net

they're good at protecting websites but damn, having a company being MITM feels so wrong

4

They’re not. They’re using this as an excuse to become paid gatekeepers of the internet as we know it. All that’s happening is that Cloudflare is using this to maneuver into a position where they can say “nice traffic you’ve got there - would be a shame if something happened to it”.

AI companies are crap.

What Cloudflare is doing here is also crap.

And we’re cheering it on.

9

This is why companies like Perplexity and OpenAI are creating browsers.

32

I set up a WAF for my company's publicly facing developer portal to block out bot traffic from assholes like these guys. It reduced bot traffic to the site by something like - I kid you not - 99.999%.

Fucking data vultures.

22
lemmy.world

Can someone with more knowledge shine a bit more light on this whole situation? I'm out of the loop on the technical details

20

AI crawlers tend to overwhelm websites by doing the least efficient scraping of data possible, basically DDOSing a huge portion of the internet. Perplexity already scraped the net for training data and is now hammering it inefficiently for searches.

Cloudflare is just trying to keep the bots from overwhelming everything.

56
lemmy.ca

Cloudflare runs as a CDN/cache/gateway service in front of a ton of websites. Their service is to help protect against DDOS and malicious traffic.

A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

This is a response to that from Perplexity who run an AI search company. I don’t actually know how their service works, but they were specifically called out in the announcement and Cloudflare accused them of “stealth scraping” and ignoring robots.txt and other things.

33
lemmy.world

A few weeks ago cloudflare announced they were going to block AI crawling (good, in my opinion). However they also added a paid service that these AI crawlers can use, so it actually becomes a revenue source for them.

I think it's also worth pointing out that all of the big AI companies are currently burning through cash at an absolutely astonishing rate, and none of them are anywhere close to being profitable. So pay-walling the data they use is probably gonna be pretty painful for their already-tortured bottom line (good).

32
Dogiedog64
lemmy.world

It's more than simply astonishing, it's mind-blowingly bonkers how much money they have to burn to see ANY amount of return. You think a normal company is bad, blowing a few thousand bucks on materials, equipment, and labor per day in order to make a few bucks revenue (not profit)? AI companies have to blow HUNDREDS OF BILLIONS on massive data center complexes in order to train their bots, and then the energy cost and water cost of running them adds a couple more million a day. ALL so they can make negative hundreds of dollars on every prompt you can dream of.

The ONLY reason AI firms are still a thing in the current tech tree is because Techbros everywhere have convinced the uberwealthy VC firms that AGI is RIGHT AROUND THE CORNER, and will save them SO much money on labor and efficiency that it'll all be worth it in permanent, pure, infinite profit. If that sounds like too much of a pipe dream to be realistic, congratulations, you're a sane and rational human being.

16

It’s more than simply astonishing, it’s mind-blowingly bonkers how much money they have to burn to see ANY amount of return

See, that's the trick, and it's used by LOADS of startups:

You don't actually have to see a return... You just have to have a good story showing there MAY be a GIANT return. The founders collect enormous salaries (Funded by VC dollars, not their own), they burn through the money to create more illusion, then ask for more, then burn through that, foretelling of the coming days when the money is just coming!

Meanwhile, just before it's "projected" to become insanely profitable, they sell out to someone, walk away with a giant check, and the product evaporates.

7

they already said they weren't profitable, they are trying to stay on life support till the VC funds run out.

0

they don't outright block ai crawlers. they added some new tools and options for managing or blocking ai bot traffic which the cloudflare customer can choose to use or to not use.

im running a free educational resource and i let the crawlers hit my site all they want because its useful knowledge unavailable anywhere else and it's served to them from cloudflare's free tier cache. i just don't know why they have to read it ten thousand times a day.

6

But the website owner can still choose to continue blocking them right? Without using additional stuff like Anubis that is.

4
BetaDoggo_
lemmy.world

Perplexity (an "AI search engine" company with 500 million in funding) can't bypass cloudflare's anti-bot checks. For each search Perplexity scrapes the top results and summarizes them for the user. Cloudflare intentionally blocks perplexity's scrapers because they ignore robots.txt and mimic real users to get around cloudflare's blocking features. Perplexity argues that their scraping is acceptable because it's user initiated.

Personally I think cloudflare is in the right here. The scraped sites get 0 revenue from Perplexity searches (unless the user decides to go through the sources section and click the links) and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.
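As a rough idea of what caching scraped pages would mean in practice, here is a generic time-to-live cache sketch in Python. The class name `PageCache`, the 300-second default, and the injected `fetch` callable are all assumptions made for illustration; this is not a description of how Perplexity's system is built.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Reuse a fetched page for ttl_seconds instead of re-requesting it."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # hypothetical function that downloads a URL
        self._ttl = ttl_seconds  # how long a stored copy counts as fresh
        self._clock = clock      # injectable so the cache can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, url: str) -> str:
        now = self._clock()
        hit = self._store.get(url)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: the origin site sees no request
        body = self._fetch(url)
        self._store[url] = (now, body)
        return body
```

The trade-off is freshness against load: a short TTL keeps answers current, while repeated prompts about the same page within that window hit the origin site only once.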

22
lemmy.world

…and Perplexity's scraping is unnecessarily traffic intensive since they don't cache the scraped data.

That seems almost maliciously stupid. We need to train a new model. Hey, where’d the data go? Oh well, let’s just go scrape it all again. Wait, did we already scrape this site? No idea, let’s scrape it again just to be sure.

7

It's worth giving the article a read. It seems that they're not using the data for training, but for real-time results.

1

They do it this way in case the data changed, similar to how a person would be viewing the current site. The training was for the basic understanding, the real time scraping is to account for changes.

It is also horribly inefficient and works like a small scale DDOS attack.

-1
rdri
lemmy.world

First we complain that AI steals and trains on our data. Then we complain when it doesn't train. Cool.

-2
ubergeek
lemmy.today

I think it boils down to "consent" and "remuneration".

I run a website, that I do not consent to being accessed for LLMs. However, should LLMs use my content, I should be compensated for such use.

So, these LLM startups ignore both consent, and the idea of remuneration.

Most of these concepts have already been figured out for the purpose of law, if we consider websites much akin to real estate: then the typical trespass laws, compensatory usage, and hell, even eminent domain apply if needed (i.e., a city government can "take over" the boosted post feature to make sure alerts get pushed as widely and quickly as possible).

1
rdri
lemmy.world

That all sounds very vague to me, and I don't expect it to be captured properly by law any time soon. Being accessed for LLM? What does it mean for you and how is it different from being accessed by a user? Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything who accesses that data?

Is it okay for a person to access your site? Is it okay for a script written by that person to fetch data every day automatically? Would it be okay for a user to dump a page of your site with a headless browser? Would it be okay to let an LLM take a look at it to extract info required by a user? Have you heard about changedetection.io project? If some of these sound unfair to you, you might want to put a DRM on your data or something.

Would you expect a compensation from me after reading your comment?

1
ubergeek
lemmy.today

That all sounds very vague to me, and I don’t expect it to be captured properly by law any time soon.

It already has been captured, properly in law, in most places. We can use the US as an example: Both intellectual property and real property have laws already that cover these very items.

What does it mean for you and how is it different from being accessed by a user?

Well, does a user burn up gigawatts of power, to access my site every time? That's a huge difference.

Imagine you host a weather forecast. If that information is public, what kind of compensation do you expect from anyone or anything who accesses that data?

Depends on the terms of service I set for that service.

Is it okay for a person to access your site?

Sure!

Is it okay for a script written by that person to fetch data every day automatically?

Sure! As long as it doesn't cause problems for me, the creator and hoster of said content.

Would it be okay for a user to dump a page of your site with a headless browser?

See above. Both power usage and causing problems for me.

Would it be okay to let an LLM take a look at it to extract info required by a user?

No. I said, I do not want my content and services to be used by and for LLMs.

Have you heard about changedetection.io project?

I have now. And should a user want to use that service, that service, which charges 8.99/month for it needs to pay me a portion of that, or risk having their service blocked.

There's no need to use it, as I already provide RSS feeds for my content. Use the RSS feed, if you want updates.

If some of these sound unfair to you, you might want to put a DRM on your data or something.

Or, I can just block them, via a service like Cloud Flare. Which I do.

Would you expect a compensation from me after reading your comment?

None. Unless you're wanting to access it via an LLM. Then I want compensation for the profit driven access to my content.

1

Both intellectual property and real property have laws already that cover these very items.

And it causes a lot of trouble to many people and pains me specifically. Information should not be gated or owned in a way that would make it illegal for anyone to access it under proper conditions. License expiration causing digital work to die out, DRM causing software to break, idiotic license owners not providing appropriate service, etc.

Well, does a user burn up gigawatts of power, to access my site every time?

Doing a GET request doesn't do that.

As long as it doesn't cause problems for me, the creator and hoster of said content.

What kind of problems would that be?

Both power usage and causing problems for me.

?? How? And what?

do not want my content and services to be used by and for LLMs.

You have to agree that at one point "be used by LLM" would not be different from "be used by a user".

which charges 8.99/month

It's self-hosted and free.

Use the RSS feed, if you want updates.

How does that prohibit usage and processing of your info? That sounds like "I won't be providing any comments on Lemmy website, if you want my opinion you can mail me at [email protected]"

I can just block them, via a service like Cloud Flare. Which I do.

That will never block all of them. Your info will be used without your consent and you will not feel troubled from it. So you might not feel troubled if more things do the same.

None. Unless you're wanting to access it via an LLM. Then I want compensation for the profit driven access to my content.

What if I use my locally hosted LLM? Anyway, the point is, selling text can't work well, and you're going to spend much more resources on collecting and summarizing data about how your text was used and how others benefited from it, in order to get compensation, than it's worth.

Also, it might be the case that some information is actually worthless when compared to a service provided by things like LLM, even though they use that worthless information in the process.

I'm all for killing off LLMs, btw. Concerns of site makers who think they are being damaged by things like Perplexity are nothing compared to what LLMs do to the world. Maybe laws should instead make it illegal to waste energy. Before energy becomes the main currency.

1
lemmy.world

they can't get their ai to check a box that says "I am not a robot"? I'd think that'd be a first year comp sci student level task. And robots.txt files were basically always voluntary compliance anyway.

12
Dr. Moose
lemmy.world

Cloudflare actually fully fingerprints your browser and even sells that data. That's your IP, TLS, operating system, full browser environment, installed extensions, GPU capabilities etc. It's all tracked before the box even shows up, in fact the box is there to give the runtime more time to fingerprint you.

17

Yeah and the worst part is it doesn't fucking work for the one thing it's supposed to do.

The only thing it does is stop the stupidest low effort scrapers and forces the good ones to use a browser.

6

You're not wrong, but it also lets more than 99.8% of bot traffic through on text challenges. It's like the TSA of website security. It's mostly there to keep the user busy while Cloudflare places itself as a man in the middle of your encrypted connection to a third party. The only difference between Cloudflare and a malicious attacker is Cloudflare's stated intention not to be evil. With that and 3 dollars I can buy myself a single hard shell taco from Taco Bell.

1

Here comes the ridiculous offer to buy Google chrome with money they don't have: easy delicious scraping directly from the user source

10
lemmy.world

I can’t get over their CEO that looks like a nine year old. Not sure what it is about him

9

I think it's the beard, it makes his cheeks look puffed up a bit. His whole expression kinda looks like a grouchy toddler.

1
lemmy.ml

No I'm telling Perplexity, they can just buy their obstacle

People who use the things you have described, for free are themselves the products being sold
this is implied in the price

2
jqubed
lemmy.world

I think in Cloudflare’s case the free tier website owners are more an example of just giving the users a limited product in hopes of enticing them to upgrade to the paid product with more features and better performance. Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot, but I don’t think that really gives the same ROI like Facebook or other services extract from their “free” services where the users are the actual product.

3

Cloudflare might get some benefit in the ability to track end-users across more websites as part of their efforts to determine who is a real human versus a potentially-malicious bot

It lets them get a very wide base to test products against, which in and of itself, is a huge benefit. They can test out far more edge-cases than anyone else in the industry at the moment.

3

It's a spectrum, and Cloudflare has snuffed out or gobbled up nearly everyone they need to before the end of the honeymoon phase.

0
lemmy.world

Is there some simply deployable PHP honeytrap for AI crawlers?

8

Used to make tarpits with reverse proxies. Accept the connection and then hold the responses until a few seconds before the default TCP timeout. Doesn't eat many resources as long as you have enough TCP connections and can reuse them effectively.
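A rough sketch of that tarpit idea, written as a standalone `asyncio` server rather than a reverse proxy config (that swap is mine, not the commenter's). The 30 second client timeout, the 3 second margin, the port and the function names are all assumptions for illustration; real crawlers use whatever timeout they like.

```python
import asyncio


def stall_seconds(client_timeout: float, margin: float = 3.0) -> float:
    """How long to hold a response: just short of the client's timeout."""
    return max(0.0, client_timeout - margin)


async def handle(reader: asyncio.StreamReader,
                 writer: asyncio.StreamWriter,
                 delay: float) -> None:
    """Accept the request, sit on it, then answer with an empty page."""
    await reader.readline()      # read the request line, ignore the rest
    await asyncio.sleep(delay)   # the tarpit: an idle coroutine is cheap for us
    writer.write(
        b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
    )
    await writer.drain()
    writer.close()
    await writer.wait_closed()


async def serve(host: str = "127.0.0.1", port: int = 8080,
                client_timeout: float = 30.0) -> None:
    """Run the tarpit until cancelled. The timeout value is an assumption."""
    delay = stall_seconds(client_timeout)

    async def on_connect(reader, writer):
        await handle(reader, writer, delay)

    server = await asyncio.start_server(on_connect, host, port)
    async with server:
        await server.serve_forever()
```

Start it with `asyncio.run(serve())`. In practice you'd only route suspected crawler traffic here, since it stalls every connection it accepts.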

2

You could probably route all requests to your site from them back at themselves, so they DDoS themselves, and on top of it, cost them more because their endpoint needs to process things via their LLM.

1

next step: cloudflare sends hit squads to blow up the source of these slimy data grabber attacks

3

I don't see a problem here. Maybe Perplexity should consider the reasons WHY Cloudflare have a firewall...?

2
lemmybefree.net

They do have a point though. It would be great to let per-prompt searches go through, but not mass scraping

I believe a lot of websites don't want either though

2
threeganzi
sh.itjust.works

Does it not need to be scraped to be indexed, assuming it’s semi-typical RAG stuff?

2

I assume their script does some search engine stuff like querying Google or Bing and then "scrapes" the links it lands on

Some selenium stuff

1

I really hope Cloudflare doesn't eventually evolve into a shitty ass company, so far I like them very much, and all this massive L for AI only improves my opinion on them.

1
lemmy.world

I've developed my own agent for assisting me with researching a topic I'm passionate about, and I ran into the exact same barrier: Cloudflare intercepts my request and is clearly checking if I'm a human using a web browser. (For my network requests, I've defined my own user agent.)

So I use that as a signal that the website doesn't want automated tools scraping their data. That's fine with me: my agent just tells me that there might be interesting content on the site and gives me a deep link. I can extract the data and carry on my research on my own.
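A minimal sketch of that "treat the challenge as a no" behaviour, assuming the agent already has the response status and headers in hand. The function names and return shape are hypothetical. The `cf-mitigated: challenge` header is, as far as I know, how Cloudflare marks challenge responses, but treat that and the status-code fallback as assumptions to verify.

```python
from typing import Dict, Mapping


def looks_like_challenge(status: int, headers: Mapping[str, str]) -> bool:
    """Heuristic: is this a bot challenge rather than real content?"""
    lowered = {k.lower(): v.lower() for k, v in headers.items()}
    # Assumption: Cloudflare tags challenge pages with this header.
    if lowered.get("cf-mitigated") == "challenge":
        return True
    # Fallback guess: a block-ish status served by Cloudflare.
    return status in (403, 429, 503) and "cloudflare" in lowered.get("server", "")


def triage(url: str, status: int, headers: Mapping[str, str]) -> Dict[str, str]:
    """Back off and hand the human a deep link instead of bypassing the check."""
    if looks_like_challenge(status, headers):
        return {"action": "defer_to_human", "deep_link": url}
    return {"action": "extract", "deep_link": url}


if __name__ == "__main__":
    print(triage("https://example.com/a", 403, {"CF-Mitigated": "challenge"}))
    print(triage("https://example.com/b", 200, {"Server": "cloudflare"}))
```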

I completely understand where Perplexity is coming from, but at scale, implementations like Perplexity's are awful for the web.

(Edited for clarity)

-2
lemmy.world

I hate to break it to you but not only does Cloudflare do this sort of thing, but so does Akamai, AWS, and virtually every other CDN provider out there. And far from being awful, it’s actually protecting the web.

We use Akamai where I work, and they inform us in real time when a request comes from a bot, and they further classify it as one of a dozen or so bots (search engine crawlers, analytics bots, advertising bots, social networks, AI bots, etc). It also informs us if it’s somebody impersonating a well known bot like Google, etc. So we can easily allow search engines to crawl our site while blocking AI bots, bots impersonating Google, and so on.
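For anyone wondering what acting on that kind of classification looks like, here's a toy policy sketch. This is not Akamai's API: the `BotSignal` shape, the category strings and the "challenge anything unknown" default are all assumptions I made up to illustrate the allow-search, block-AI, block-impersonators rule described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BotSignal:
    """Hypothetical per-request verdict handed over by a CDN."""
    is_bot: bool
    category: str = ""           # e.g. "search_engine", "ai", "analytics"
    impersonating: bool = False  # claims to be a well-known bot but isn't

ALLOWED_CATEGORIES = {"search_engine"}
BLOCKED_CATEGORIES = {"ai"}


def decide(signal: BotSignal) -> str:
    """Return "allow", "block" or "challenge" for one request."""
    if not signal.is_bot:
        return "allow"
    if signal.impersonating:
        return "block"  # a fake Googlebot is blocked whatever it claims to be
    if signal.category in ALLOWED_CATEGORIES:
        return "allow"
    if signal.category in BLOCKED_CATEGORIES:
        return "block"
    return "challenge"  # assumption: unknown bot categories get a challenge


if __name__ == "__main__":
    print(decide(BotSignal(is_bot=True, category="search_engine")))
    print(decide(BotSignal(is_bot=True, category="search_engine", impersonating=True)))
    print(decide(BotSignal(is_bot=True, category="ai")))
```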

7

By "things like this are awful for the web," I meant that automation through AI is awful for the web. It takes away from the original content creators without any attribution and hits their bottom line.

My story was supposed to be one about responsible AI, but somehow I screwed that up in my summary.

4

Ooh, that's tough, sweetheart. If the owners of those servers want you to visit, they'll just choose another WAF than CF's.

All zero of them.

-2
lemmy.world

The amount of people just reacting to the headline in the comments on these kinds of articles is always surprising.

Your browser acts as an agent too, you don’t manually visit every script link, image source and CSS file. Everyone has experienced how annoying it is to have your browser be targeted by Cloudflare.

There’s a pretty major difference between a human user loading a page and having it summarized and a bot that is scraping 1500 pages/second.

Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted. They exist to provide their clients with services, including bot mitigation. But a user initiated operation isn’t the same as a bot.

Which is the point of the article and the article’s title.

It isn’t clear why OP had to alter the headline to bait the anti-ai crowd.

-6
[deleted]
lemmy.world

But a user initiated operation isn’t the same as a bot.

Oh fuck off with that AI company propaganda.

The AI companies already overwhelmed sites to get training data and are repeating their shitty scraping practices when users interact with their AI. It's the same fucking thing.

Web crawlers for search engines don't scrape pages every time a user searches like AI does. Both web crawlers and scrapers are bots, and how a human initiates their operation, scheduled or not, doesn't matter as much as the fact that they do things very differently and only one of the two respects robots.txt.

13
FauxLiving
lemmy.world

There’s no difference in server load between a user looking at a page and a user using an AI tool to summarize the page.

The AI companies already overwhelmed sites to get training data and are repeating their shitty scraping practices when users interact with their AI. It’s the same fucking thing.

You either didn’t read the article or are deliberately making bad faith arguments. The entire point of the article is that the traffic that they’re referring to is initiated by a user, just like when you type an address into your browser’s address bar.

This traffic, initiated by a user, creates the same server load as that same user loading the page in a browser.

Yes, mass scraping of web pages creates a bunch of server load. This was the case before AI was even a thing.

This situation is like Cloudflare presenting a captcha in order to load each individual image, CSS or JavaScript asset into a web browser because bot traffic pretends to be a browser.

I don’t think it’s too hard to understand that a bot pretending to be a browser and a human operated browser are two completely different things and classifying them as the same (and captchaing them) would be a classification error.

This is exactly the same kind of error. Even if you personally believe that users using AI tools should be blocked, not everyone has the same opinion. If Cloudflare can’t distinguish between bot requests and human requests then their customers can’t opt out and allow their users to use AI tools even if they want to.

-6
[deleted]
lemmy.world

There is no difference between emptying a glass of water and draining a swimming pool either, if you ignore the total volume of water.

1
FauxLiving
lemmy.world

I, too, can make any argument sound silly if I want to argue in bad faith.

A user cannot physically generate as much traffic as a bot.

Just like a glass of water cannot physically contain as much water as a swimming pool, so pretending the two are equal is ignorant in both cases.

-1
[deleted]
lemmy.world

A user cannot physically generate as much traffic as a bot.

You are so close to getting it!

2

The AI doesn't just do a web search and display a page, it grabs the search results and scrapes multiple pages far faster than a person could.

It doesn't matter whether a human initiated it when the load on the website is far, far higher and more intrusive in a shorter period of time with AI compared to a human doing a web search and reading the content themselves.

2

There’s no difference in server load between a user looking at a page and a user using an AI tool to summarize the page.

There is, in scale.

1

I think part of the issue is that it does act more like a search engine crawler than a traditional user. A lot of sites rely on real human traffic for revenue (serving ads, requests to sign up for Patreon, using affiliate links, etc) that gets bypassed by these bots. Hell in some cases the people running the sites are just looking for interaction. So while there is a spike in traffic, and potentially cost, the people running these sites aren't getting the benefit of that traffic.

Basically these have the same issues as the summaries that Google does in their search results but, potentially, have much larger impact on the host's bandwidth

5
FauxLiving
lemmy.world

It isn’t opt in.

You can block all bot page scraping, and also block user initiated AI tools or you can block no traffic.

There isn’t an option to block bot page scraping but allow user initiated AI tools.

Because, as the article points out, Cloudflare is not able to distinguish between the two

0
ubergeek
lemmy.today

That's not true, I just viewed my panel in CF, and Perplexity is an optional block, which by default is off.

2
FauxLiving
lemmy.world

They must be A/B testing a new feature then, it’s not on mine

1

Log into your dashboard, click "AI Audit", and you'll see the toggles.

3

There’s a pretty significant difference in request rate. A tool trying to search and summarize will hit a search engine once, and each website maybe 5 times (if every search engine link points to the site).

A bot trying to scrape content from a website can generate thousands or tens of thousands of requests per second.
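That rate gap is something you can measure per client. Here's a minimal sliding-window sketch; the one second window, the threshold of 5 requests and the client ids are assumptions picked to mirror the numbers above, not anything Cloudflare is known to use.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class RateClassifier:
    """Label each client by how many requests it sent in the last window."""

    def __init__(self, window_s: float = 1.0, max_requests: int = 5) -> None:
        self.window_s = window_s
        self.max_requests = max_requests
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def observe(self, client_id: str, now: float) -> str:
        """Record one request at time `now` (seconds) and classify the client."""
        hits = self._hits[client_id]
        hits.append(now)
        # Drop requests that have fallen out of the window.
        while hits and now - hits[0] > self.window_s:
            hits.popleft()
        return "bulk" if len(hits) > self.max_requests else "interactive"


if __name__ == "__main__":
    clf = RateClassifier()
    # A summarizer tool: a handful of requests, then done.
    print([clf.observe("tool", i * 0.1) for i in range(5)][-1])
    # A scraper: a thousand requests inside one second.
    print([clf.observe("scraper", i * 0.001) for i in range(1000)][-1])
```

Note this only separates the two behaviours by volume. It says nothing about who is behind a request, which is the part the thread is arguing about.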

1

Cheering for Cloudflare to be the arbiter of what technologies are allowed is incredibly short sighted

Except, they don't. It's a toggle, available to users, and by default, allows Perplexity's scraping.

2
kbin.earth

In a better timeline, we wouldn't need to cheer the victory of one megacorporation over another, they would both be the losers. But also people are still capable of holding two thoughts simultaneously.

For instance, we'd all be happy to see Apple lose the Epic Games lawsuit and be forced out of their monopoly on app stores on iOS. But those same people are aware it would allow Epic to continue being a disgusting company.

bait the anti-ai crowd

Oh I see lol

1

What does any of that have to do with the fact that Cloudflare isn’t able to classify traffic in order to distinguish between human user generated traffic and mass scraping bot traffic?

If they’re incapable of distinguishing the two, then their customers are having legitimate user requests blocked by Cloudflare with no ability to opt out.

Oh I see lol

Yeah, I think people who’re unable to think rationally about a problem because they made up their mind before knowing any of the details are intellectually lazy.

0
unpossum
sh.itjust.works

Thank you for trying to fight the irrational anti-AI brainrot on lemmy! It’s probably a lost cause, but your efforts are appreciated :)

-2
lemmy.world

It's insane that anyone would side with Cloudflare here. To this day I can't visit many websites like nexusmods just because I run Firefox on Linux. The Cloudflare Turnstile just refreshes infinitely and has been doing so for months now.

Cloudflare is the biggest cancer on the web, fucking burn it.

-8
Dremor
lemmy.world

Linux and Firefox here. No problem at all with Cloudflare, despite having more or less as many privacy-preserving add-ons as possible. I even spoof my user agent to the latest Firefox ESR on Linux.

Something may be wrong with your setup.

22

I suspect a lot of it comes down to your ISP. Like the original commenter, I also frequently can't pass the Cloudflare Turnstile when on Wi-Fi, although refreshing the page a few times usually gets me through. Worst case, I switch to my phone's hotspot, where I pass much more consistently. It's super annoying and, combined with their recent DNS outage, has totally ruined any respect I had for Cloudflare.

Interesting video on the subject: https://youtu.be/SasXJwyKkMI

4
Dr. Moose
lemmy.world

That's not how it works. CF uses thousands of variables to estimate a trust score and block people, so just because it works for you doesn't mean it works.

0
Dremor
lemmy.world

Same goes the other way. It's not because it doesn't work for you that it should go away.

That technology has its uses, and Cloudflare is probably aware that there are still some false positives, and is probably working on them as we write.

The decision is the website owner's to make, weighing the advantage of filtering out a majority of bots against the disadvantage of losing some legitimate traffic to false positives. If you get a Cloudflare challenge, chances are the owner decided that the former vastly outweighs the latter.

Now there are some self-hosted alternatives, like Anubis, but business clients prefer SaaS like Cloudflare to having to maintain their own software. Once again it is their choice and their liberty to do so.

4
Dr. Moose
lemmy.world

lmao imagine shilling for corporate Cloudflare like this. Also false positive vs false negative are fundamentally not equal.

Cloudflare is probably aware that there are still some false positive, and probably is working on it as we write.

The main issue with Cloudflare is that it's mostly bullshit. It does not report any stats to the admins on how many users were rejected or any false positive rates, and happily puts everyone under the "evil bot" umbrella. So people from low trust score environments like Linux or IPs from poorer countries are at a significant disadvantage and left without a voice.

I'm literally a security dev working with Cloudflare anti-bot myself (not by choice). It's a useful tool for corporate but a really fucking bad one for the health of the web, much worse than any LLM agent or crawler, period.

0

So people from low trust score environments like Linux

Linux user here, Cloudflare hasn't blocked access to a single page for me unless I use a VPN, which then can trigger it.

1

Ah, the good old "you don't agree with me so you must be shilling for X" argument. I suppose you are shilling for the bots then, am I right?

0
dodos
lemmy.world

I'm on Linux with Firefox and have never had that issue before (particularly nexusmods which I use regularly). Something else is probably wrong with your setup.

14
Dr. Moose
lemmy.world

"Wrong with my setup" - thats not how internet works.

I'm based in Southeast Asia and often work on the road, so IP rating is probably the final straw in my fingerprint score.

Either way this should in no way be acceptable.

-6

That is exactly how the internet works. That's always how the internet has worked.

1

It happened to me before, until I did a Google search. It was my VPN's web protection. It was too "overprotective".

Check your security settings, antivirus and VPN

1
lemmy.ca

I actually agree with them

This feels like cloudflare trying to collect rent from both sides instead of doing what’s best for the website owners.

There is a problem with AI crawlers, but these technologies are essentially doing a search, fetching several pages, scanning/summarizing them, then presenting the findings to the user.

I don’t really think that’s wrong, it’s just a faster version of rummaging through the SEO shit you do when you Google something.

(I’ve never used perplexity, I do use Kagi’s ki assistant for similar search. It runs 3 searches and scans the top results and then provides citations)

-23
drspod
lemmy.ml

What’s best for the website owners is to have people actually visit and interact with their website. Blocking AI tools is consistent with that.

36
lemmy.ca

For a lot of AI search I actually end up reading the pages, so I don’t know how much this stops that

-10
AstralPath
lemmy.ca

You're the outlier, I promise. People are literally forfeiting their brains in favor of an LLM transplant these days.

16
Pennomi
lemmy.world

On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing. Not saying that AI is the right move, but I can understand not wanting to visit the actual page any more.

3

On the flip side, most websites are so ad-ridden these days a reader mode or other summary tool is almost required for normal browsing.

Firefox with uBlock Origin works perfectly fine and pages load faster without the ads!

8
kbin.earth

Maybe I missed something, but uBlock still works fine for me, even on mobile. And running a Pi-hole, while not trivial, also takes care of some ad traffic. Firefox comes with a reader mode (a feature I really like even with the adblockers!).

So why do people not want to visit pages anymore, if all these tools already existed?

5

Most people aren’t technical enough to install an ad blocker, believe it or not.

2

I put uBlock Origin, or another adblocker, on all my browsers, including phone ones and forks.

1
r00ty
kbin.life

Well. Try running a web server and you'll find quite quickly that you get hit quick and hard by AI crawlers that do not respect server operators. Unlike web crawlers of old, these will hit a site over and over with sometimes 100s, even 1000s of requests per second to strip mine all the content they can find, as quickly as possible.

When you try to block them by user agent, they start faking real client user agents.

When you block the AS numbers involved, traffic starts to go down. But there's still a large number of non-organic requests, coming from, well, frankly everywhere. Cellular networks in Brazil, cable internet in the USA, other non-business subscribers in other countries around the world.

How do I know they're not organic? Turn on cloudflare managed challenge and they all go away.

So, personally that's my biggest beef against them. Yes ripping off data without permission is bad already, but this level of trying to bypass any clear sign we do not want you is far worse.

24

Yeah that’s fair, and I do agree with Cloudflare stamping out that behaviour.

What I’m trying to say is there are cases where AI agents act for the user in what the traditional user agent role of browsers would be.

ETA: That doesn’t excuse things like not having a search index to prevent mass scale access, this would be near 1-1 access patterns per user, which would be infrequent/spaced out

3
FauxLiving
lemmy.world

The point of the article is that there is a difference between a bot which is just scraping data as fast as possible and a user requesting information for their own use.

Cloudflare doesn’t distinguish these things. It would be like Cloudflare blocking your browser because it was automatically fetching JavaScript from multiple sources in order to render the page you navigated to.

I’m sure you can recognize how annoyed you would be with Cloudflare if you had to enter 4 captchas in order to load a single web page or, as here, have your page fail to load some elements that you requested because Cloudflare thinks fetching JavaScript or pre caching links is the same as web crawler activity.

-5
r00ty
kbin.life

Yes, but my point is I cannot tell the difference. If they can convince cloudflare they deserve special treatment and exemption then they can probably get it.

I would argue there being a difference "depends" though. There's two problems I see. They are only potentially not guilty of one.

The first problem is, that AI crawlers are a true DDoS and this is I think the main reason most (including myself) do not want them. They cause performance issues by essentially speed running collecting every unique piece of data from your site. If they're dynamic as the article says then they are potentially not doing this. I cannot say for sure here.

The second problem is, many sites are monetized from advert revenue or otherwise motivated by actual organic traffic. In this case, I would bet some money that this company is taking the data from these sites, not providing ad revenue or organic traffic and serving it to the querying user with their own ads included. In which case, this is also very very bad.

So, their beef is only potentially partially valid. Like I say, if they can convince cloudflare, and people like me to add exceptions for them, then great. So far though, I'm not convinced. AI scrapers have a bad reputation in general, and it's deserved. They need to do a LOT to escape that stigma.

5
FauxLiving
lemmy.world

This isn’t about AI crawlers. This is about users using AI tools.

There’s a massive difference in server load between a user summarizing one page from your site and a bot trying to hit every page simultaneously.

The second problem is, many sites are monetized from advert revenue or otherwise motivated by actual organic traffic.

Should Cloudflare block users who use ad block extensions in their browser now?

The point of the article is that Cloudflare is blocking legitimate traffic, created by individual humans, by classifying that traffic as bot traffic.

Bot traffic is blocked because it creates outsized server load, this is something that user created traffic doesn’t do.

People use Cloudflare to protect their sites against bot traffic so that human users can access the site without it being DDoS'd by bot traffic. By classifying user generated traffic and scraper generated traffic as the same thing, Cloudflare is incorrectly classifying traffic and blocking human users from accessing websites.

Websites are not able to opt out of this classification scheme. If they want to use Cloudflare for bot protection then they have to also agree that users using AI tools cannot access their sites even if the website owner wants to allow it. Cloudflare is blocking legitimate traffic and not allowing their customers to opt out of this scheme.

It should be pretty easy to understand how a website owner would be upset if their users couldn’t access their website.

-1
r00ty
kbin.life

And their "AI tool" looks just like the hundreds of AI scraping bots. And I've already said the answer is easy. They need to differentiate themselves enough to convince cloudflare to make an exception for them.

Until then, they're "just another AI company scraping data"

3
FauxLiving
lemmy.world

Well, Cloudflare is adding, to the control panel, the ability to whitelist Perplexity and other AI sources (default: on).

Looks like they differentiated themselves enough.

2

That option is only likely to be for paid accounts. The freebie users like me have to make our own anti bot WAF rules. Or, as I do, just toss every page I expect a user to be using via managed challenge. Adding exceptions uses up precious space in those rules which I've used to put in exceptions for genuine instance to instance traffic.

But I am glad they were able to convince cloudflare. Good for them.

1

Cloudflare doesn’t distinguish these things

It does.

You just set a user agent like "AI bot request initiated by user" and the website owners will decide for themselves whether to allow your traffic or not.

If your bot pretends to not be a bot, it should be blocked.

Edit: Btw, OpenAI does this.
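A small sketch of what "the website owners will decide for themselves" could look like on the server side, assuming the bot honestly labels itself. The agent tokens `ExampleAI-User` and `ExampleAI-Crawler`, the policy table and the allow-by-default choice are hypothetical; this only works for bots that don't pretend to be browsers, which is the whole point above.

```python
from typing import Mapping

# Hypothetical site-owner policy: declared agent token -> action.
SITE_POLICY = {
    "ExampleAI-User": "allow",     # fetch triggered by a person's prompt
    "ExampleAI-Crawler": "block",  # bulk crawling for an index or training
}


def decide_by_user_agent(user_agent: str,
                         policy: Mapping[str, str] = SITE_POLICY,
                         default: str = "allow") -> str:
    """Match declared agent tokens case-insensitively; unknown agents get `default`."""
    ua = user_agent.lower()
    for token, action in policy.items():
        if token.lower() in ua:
            return action
    return default


if __name__ == "__main__":
    print(decide_by_user_agent("Mozilla/5.0 (compatible; ExampleAI-User/1.0)"))
    print(decide_by_user_agent("Mozilla/5.0 (compatible; ExampleAI-Crawler/1.0)"))
    print(decide_by_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```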

1
kopasz7
sh.itjust.works

Search engines have been going relatively fine for decades now. But the crawlers from AI companies basically DDoS hosts in comparison, sending so many requests in such a short interval. They also crawl dynamic links that are expensive to render compared to a static page, and ignore the robots.txt entirely or even use it to discover unlinked pages.

Servers have finite resources, especially self-hosted sites, while AI companies have disproportionately more at their disposal, easily grinding other systems to a halt by overwhelming them with requests.

24
pr06lefs
lemmy.ml

If a neighborhood is beset by roving bands of thieves, sooner or later strangers will be greeted by a shotgun rather than an invitation to tea, regardless of their intentions. Them's the breaks. Bots are going to take a hit now and their operators are just going to have to deal with it. Sucks when people don't play nice, but this is what you get.

8
FauxLiving
lemmy.world

I’m sure people that are attempting to drive to their house in a new vehicle wouldn’t appreciate being riddled with bullets because the neighborhood watch makes no attempt to distinguish between thieves and homeowners.

-4
pr06lefs
lemmy.ml

So sad for them. Try not living in a war zone?

2
FauxLiving
lemmy.world

It isn’t a war zone, it’s a gated community where the guards have suddenly decided that any vehicle made after 2020 is full of thieves.

They didn’t bother to consult the residents or give them the ability to opt out of having their dinner guests murdered for driving a vehicle the security guards don’t like.

-4
pr06lefs
lemmy.ml

So you're a cloudflare customer and you wish they would let the perplexity traffic multiplier through to your website? You can leave cloudflare any time you want.

3
FauxLiving
lemmy.world

🙄You’re an Internet user and you don’t like AI so you can leave the Internet anytime you want.

That’s not a good argument, what about the users who want to block mass scraping but want to make their content available to users who are using these tools? Cloudflare exists because it allows legitimate traffic, that websites want, and blocks mass scraping which the sites don’t want.

If they’re not able to distinguish mass scraping traffic from user created traffic then they’re blocking legitimate users that some website owners want.

-3

Yes, your "leave the internet any time you want" strawman is not a good argument.

If allowing perplexity while blocking the bad guys is so easy why not find a service that does that for you?

2