Abstract
The open web is not disappearing because publishing has become impossible; it is disappearing because discovery is being absorbed into vendor specific information environments. Google is the central case because of its dominance in web search, but the pattern is broader: crawlers, indexes, operating systems, browsers, assistants, DNS resolvers, VPNs, advertising systems, and policy processes are converging into private gardens that present themselves as the web. Findability inside these gardens depends less on public availability than on compatibility with their measurement, monetization, legal, and editorial machinery. Once LLMs train on and retrieve through those filtered layers, exclusion no longer affects only search traffic; it shapes the corpus from which future answers are generated.
The point is not to catalogue individual removals, demotions, or misclassifications. A catalogue would turn the argument into a list of symptoms. The concern here is the generic behavioral shift observed over decades: discovery moves from open publication toward measurable participation in infrastructure controlled by a small number of intermediaries.
The excluded layer is what is often called the small web: personal sites, independent archives, hobbyist documentation, technical notes, volunteer projects, and other noncommercial knowledge that can be public without being easily measured.
Googlecentrismus
This article’s Googlecentrismus is intentional because the risk is asymmetrical. Google does not merely operate one search interface among many; StatCounter reports a 90% global search engine share for Google in April 2026 [1], and Google Search Central describes Search [2] as a fully automated system whose crawlers regularly explore the web, discover pages, and add them to a large index. Exclusion from Google is therefore not a failure to appear in one directory. It is exclusion from the dominant public discovery layer.
Google is not unique in kind. A short list of other engines also tries to maintain broad web discovery infrastructure: Bing3, Yandex4, Baidu5, and regional engines with substantial domestic presence, such as Naver and Seznam6. They crawl, index, rank, filter, and govern documents through similar classes of mechanisms, although at smaller global scale or within narrower linguistic, national, or commercial markets. The argument keeps this Googlecentrismus because Google is where these pressures are largest, not because the pattern belongs only to Google.
The recent rise of alternative search engines does not remove this structure. One family is front end, metasearch, or hybrid search: DuckDuckGo7 largely sources traditional links and images from Bing while combining them with its own crawler and specialised sources; Kagi combines its own indexes, anonymised calls to major providers, specialised engines, and manually shaped ranking tools8; Brave Search9 uses its own index for web search results.
The pattern also appears when the direction is reversed. Brave began from the browser side and added Brave Search10 in 2021. DuckDuckGo ties search to its own browser and subscription VPN11. Kagi has Orion12, a privacy focused browser. Better incentives do not change the structural convergence: search, browser, and network surfaces move into one information environment.
These systems can improve incentives, privacy, and ranking taste, but they remain constrained by the same scarcity: there are only a few large web indexes to draw from.
A more institutional response is replication of the index layer itself: OpenWebSearch.eu13 proposes an open European web data infrastructure for search, research, and LLM applications; this is useful as a counterweight to Google and Bing, but it does not dissolve the index problem. It produces another major index, more public and bureaucratic than commercial alternatives, but still governed through crawl policy, inclusion criteria, institutional priorities, and operational limits.
The other family is independent crawling with smaller, deliberately scoped, or distributed indexes. Yep maintains its own web index through Ahrefs infrastructure14; Wiby15 explicitly says it is not meant to index the entire web and prefers human seeded discovery; Marginalia16 is an independent open source search engine that foregrounds obscure, noncommercial, and small web discovery; Teclis17 combines its own crawl with Kagi Small Web and Marginalia results for noncommercial web search; smallweb.cc18 is a manually curated collection of indie websites with search, submission, and auxiliary crawlers for metadata and discovery; YaCy19 decentralizes crawling, indexing, and search across peer operated nodes.
These engines matter because they preserve alternative discovery routes. They do not yet change the fact that exclusion at Google scale still defines practical invisibility for most readers.
Auditability fails at the same boundary. A minimal black box audit of Google inclusion can be formed from domains derived from the pinned list [20]: about 36'000 site: queries in the ordinary interface. Direct automation of that interface runs into Google’s anti bot boundary because Google treats searches sent by programs, automated services, search scrapers, and rank checking software as automated traffic [21]. The official alternatives do not reproduce ordinary Google Web Search. Search Console’s URL Inspection API exposes URL level index data only for properties a user manages, and its quota is enforced per Search Console property [22]. Custom Search JSON API is closed to new customers, existing users must move off it by January 1, 2027, and it gives 100 free queries per day, with paid use at $5 per 1'000 additional queries up to 10'000 queries per day [23]. At this size, the free tier is not an audit path, since 36'000 queries at 100 per day would take 360 days; paid use still has to be batched, and the API queries a Programmable Search Engine whose whole web mode is explicitly limited to a subset of the Google Web Search corpus and can differ from Google.com site: results [23]. This does not prove exclusion, but it shows that a trivial inclusion audit over one curated small web corpus already requires payment, batching, grandfathered API access, or owner verified access, and still returns a proxy. The public index is not public in the audit sense.
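The arithmetic behind that claim can be checked in a few lines. The sketch below uses only the figures quoted above (36'000 queries, 100 free queries per day, $5 per 1'000 additional queries, a cap of 10'000 per day). The function names are illustrative, and two points are assumptions of this sketch rather than statements about Google's billing rules: that the free allowance applies on each day of a paid run, and that paid queries are billed pro rata.

```python
import math

QUERIES = 36_000        # site: queries, one per domain (figure from the text)
FREE_PER_DAY = 100      # free queries per day (figure from the text)
PRICE_PER_1000 = 5.0    # USD per 1,000 additional queries (figure from the text)
DAILY_CAP = 10_000      # maximum queries per day (figure from the text)


def free_tier_days(queries: int) -> int:
    """Days needed when only the free daily allowance is used."""
    return math.ceil(queries / FREE_PER_DAY)


def paid_plan(queries: int) -> tuple[int, float]:
    """Days and USD cost when each day is run up to the daily cap."""
    days = math.ceil(queries / DAILY_CAP)
    remaining = queries
    billable = 0
    for _ in range(days):
        today = min(remaining, DAILY_CAP)
        # Assumption: only queries beyond the daily free allowance are billed.
        billable += max(today - FREE_PER_DAY, 0)
        remaining -= today
    # Assumption: billing is prorated per query, not rounded up to blocks of 1,000.
    return days, billable * PRICE_PER_1000 / 1000


print("free tier days:", free_tier_days(QUERIES))
days, cost = paid_plan(QUERIES)
print("paid run:", days, "days, about", cost, "USD")
```

Under these assumptions the free tier needs 360 days, while a paid run needs four days and about $178. The sum is modest; the constraint is that this route exists only behind billing, batching, and grandfathered access, and still measures a proxy corpus.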
The hard problem for small web search is not coverage alone. At the signal level, a human niche website and an automated LLM generated site full of nonsense can look uncomfortably similar: small traffic, few backlinks, no institutional domain, irregular publication, weak metadata, and little external authority. That is already the optimistic comparison: the LLM plus link farm system is stronger along the measures large engines can see. It can produce thousands of pages, buy or revive domains, stage cross links, target queries systematically, and update at machine speed, while the small site merely exists with weak external proof of its own human value. Large search engines tend to resolve this ambiguity through popularity and authority proxies, which can bury both together.
The cleanup evidence comes in layers. Google itself publishes policy and update records, not a public deindexing census: in March 2024 [24] it made scaled content abuse, expired domain abuse, and site reputation abuse explicit policy targets, with nonappearance in results as an explicit sanction for violations. Bevendorff et al. [25] give the empirical bridge from SEO spam to this cleanup model. In a year long study of 7'392 product review queries, they found that review spam sites were often deindexed or penalized after ranker updates, that Google’s updates had measurable but mostly short lived effects, and that search engines appear to lose the SEO spam cat and mouse game. Public counts come from third party measurement. In March 2024, Originality.ai [26] estimated 1'446 manual actions among about 79'000 content sites, while Search Engine Journal [26] reported Ian Nuttall’s 837 fully deindexed sites among 49'345 monitored sites. In May 2025, Indexing Insight [27] reported that 25% of 2 million monitored pages were actively removed from Google’s index, with affected sites losing 15%-75% of indexed pages. The 2026 evidence is weaker but points in the same direction: Google’s public dashboard [28] confirms a March spam update and a March to April core update, Search Engine Roundtable [28] records community reports of higher deindexing since early April, and Search Engine Land’s SE Ranking data [28] found 24.1% of top 10 URLs falling out of the top 100 after March 2026, compared with 14.7% after the December 2025 core update.
LLM text detection does not repair this failure. It fails on both sides: OpenAI withdrew its own classifier after reporting only 26% true positives for LLM written text and 9% false positives for human written text29; Liang et al.30 found that several detectors misclassified nonnative English human writing as LLM generated at high rates. The edited case is worse, because the object being classified ceases to be clean. OpenAI31 warned that edited LLM text can evade classifiers; Sadasivan et al.32 showed that recursive paraphrasing sharply reduces detection rates, and Krishna et al.32 showed that paraphrasing can drop DetectGPT accuracy from 70.3% to 4.6% at a 1% false positive rate. Once a human rewrites, cuts, corrects, or extends model output, the detector is asked to infer production history from surface regularities, while search needs to know whether the page is useful, truthful, and worth indexing.
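The quoted classifier rates become more telling once a base rate is attached. The sketch below applies Bayes' rule to the 26% true positive and 9% false positive rates reported by OpenAI. The shares of LLM written pages in the corpus are hypothetical values chosen for illustration, not measurements, and the function name is illustrative.

```python
TRUE_POSITIVE_RATE = 0.26   # LLM written text correctly flagged (figure from the text)
FALSE_POSITIVE_RATE = 0.09  # human written text wrongly flagged (figure from the text)


def flagged_precision(tpr: float, fpr: float, llm_share: float) -> float:
    """Share of flagged pages that really are LLM written, by Bayes' rule."""
    true_flags = tpr * llm_share
    false_flags = fpr * (1.0 - llm_share)
    return true_flags / (true_flags + false_flags)


# Hypothetical shares of LLM written pages in a crawled corpus.
for share in (0.01, 0.10, 0.50):
    precision = flagged_precision(TRUE_POSITIVE_RATE, FALSE_POSITIVE_RATE, share)
    print(f"LLM share {share:.0%}: {precision:.1%} of flagged pages are LLM written")
```

With a hypothetical 10% share of LLM written pages, only about 24% of flagged pages would actually be LLM written, so roughly three of four flags would land on human writing. A detector with these rates would penalize the small web faster than it would clean it.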
Small web engines avoid this contest by changing the admission rule. They do not try to outmeasure synthetic authority at Google scale; they fall back to curation, submission, and small enough review surfaces. smallweb.cc’s submission page18 accepts personal sites by email under qualitative moderation rules; Kagi’s public small web list33 accepts personal feeds under rules that exclude automated, LLM generated, and spam content; Marginalia’s submission repository34 adds sites before crawl cycles; Wiby’s submission page35 accepts individual pages under broad qualitative rules and states that, in most cases, only the submitted page will be crawled.
This is valuable, but it is a different model: public and community aided selection at small scale, not a general ranking system that can safely distinguish rare human work from synthetic filler across the whole web.
The Index as Measurement System
A search engine appears to be a catalogue: crawlers discover documents, ranking systems estimate relevance, and users receive an ordered list of results. That image is too simple, and not only for the modern web. If telemetry means measurement of attention rather than only hidden client event logging, Google’s original ranking breakthrough was already a telemetry system. PageRank36 replaced the purely lexical ambition of classic TF/IDF style retrieval with a social measurement claim: the public link graph contains evidence of importance that page text alone does not. A hyperlink cost attention, reputation, placement, and maintenance, so it could function as a weak public trace of human editorial judgment rather than merely as a syntactic feature of HTML.
In that narrow sense, PageRank is the last genuine conceptual breakthrough in the argument here: not because later ranking systems are technically trivial, but because they mostly elaborate the same insight under harsher adversarial conditions. The problem is to find measurable traces that are expensive to fake, then adapt as those traces become commercialized, gamed, or exhausted.
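The core of that measurement claim fits in a short power iteration. The sketch below is a textbook simplification rather than Google's production ranking; the four page graph is hypothetical, and the damping factor of 0.85 is the value commonly cited from the original PageRank paper.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85,
             iterations: int = 50) -> dict[str, float]:
    """Power iteration over a link graph given as page -> outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Pages without outgoing links spread their rank evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: "notes" links out, but no page links back to it.
graph = {
    "hub": ["shop", "news"],
    "shop": ["hub"],
    "news": ["hub", "shop"],
    "notes": ["hub"],
}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page:6s} {ranks[page]:.4f}")
```

In this toy graph the page called notes receives no links, so its score stays at the floor of (1 - 0.85) / 4 = 0.0375 whatever it contains. The score measures who pointed at the page, not what the page says.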
SEO is the ambiguous industrial response to that discovery. In its benign form, it makes documents crawlable, legible, and descriptively linked37; in its adversarial form, it manufactures the signals that a ranking system treats as public judgment. Once links become ranking capital, links can be bought, exchanged, hidden, automated, or laundered through expired domains and reputation surfaces; Google’s spam policies38 are the institutional residue of this arms race.
The present surveillance problem follows from that history rather than from a sudden break. The novelty is not measurement itself; it is the migration of decisive ranking evidence from public links to private infrastructure. Search ranking does not stop being a measurement problem when link signals become adversarial. It moves toward signals with higher fraud cost, many of which live inside browsers, accounts, advertising systems, devices, and delivery networks. A page is no longer evaluated only by its text, links, structure, and reputation; it is evaluated through observed user behavior, page experience data, engagement patterns, freshness signals, and the larger commercial surface around the site.
This changes the political economy of visibility. A document can exist, be publicly reachable, contain original expertise, and still fail to become visible if it does not participate in the measurement layer. The failure is not necessarily a manual act of suppression. It can be an ordinary consequence of a system in which the absence of telemetry is difficult to distinguish from the absence of value.
The central problem is therefore not that Google has built a crude content filter. The problem is subtler: Google has built an information environment in which the most measurable web becomes the most findable web, and the most findable web becomes the basis for future measurement. Once search, advertising, browser telemetry, and LLM grounding reinforce one another, the index ceases to represent the open web as such. It represents the portion of the web compatible with Google’s instruments.
Surveillance Compatibility
The first wall is surveillance compatibility. Modern ranking systems draw value from behavioral observation: clicks, long clicks, return behavior, navigational paths, popularity signals, and field data from real browsers. Public reporting from the Google search antitrust record and the 2024 Content API Warehouse leak describes systems such as NavBoost and Chrome derived fields39 that track user interaction and site level visibility.
The Chrome UX Report shows the same structure in a more public form. CrUX is a dataset of real world Chrome user experience data. Google states that it is used by Google Search to inform the page experience ranking factor, and its methodology excludes origins and pages that do not meet a sufficient popularity threshold40. A site cannot submit itself into this dataset by being technically correct, carefully written, or socially useful. It must first be observed often enough by eligible Chrome users.
The client layer is not limited to the browser. In Google’s case, Android and the surrounding Google services create another operating system level telemetry surface. Android usage and diagnostics can report how the device is used and how it works, including app use frequency and network connection quality41; Web & App Activity41 can save activity from Google services and, when enabled, include Chrome history and activity from sites, apps, and devices that use Google services.
The surveillance surface is broader than browser telemetry. CDNs, shared JavaScript and CSS libraries, hosted fonts, analytics endpoints, performance probes, centralized DNS resolvers, and VPN operators all observe fragments of the user’s path before any site local tracker is considered.
PageSpeed Insights42 reinforces this infrastructure model by combining CrUX field data with Lighthouse diagnostics, while Lighthouse42 flags scripts and stylesheets that block first paint and points developers toward delivery, caching, deferral, and dependency changes. The technical advice is often sound as performance engineering; the political consequence is that speed optimization normalizes dependence on shared infrastructure whose operators can observe traffic patterns.
This is network level surveillance rather than page level tracking. The vendor does not need JavaScript inside the document: control over resolution, delivery, or tunneling can reveal domains, requested assets, destination addresses, timing, and traffic volume. DNS sees names; CDNs see the requests they terminate or serve; VPNs see destinations and flow metadata. Even when page bodies remain encrypted, these traces can be enough to reconstruct what users actually read at site, topic, or document granularity.
DNS is the clearest example because the resolver does not need the page
body to learn the shape of browsing. A public resolver such as 1.1.1.1
is in a position to observe client address metadata and the names being
resolved; encrypted DNS protects the query from the local network but
can centralize the same information at the resolver operator.
The privacy concern is not theoretical: the earlier DNS over TLS article43 framed the problem as a shift of surveillance from the local provider to public DNS operators rather than as a complete removal of metadata leakage.
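What the resolver sees can be made concrete with the DNS wire format itself. The sketch below builds a minimal query as laid out in RFC 1035 and reads the name back out. The query ID and the example.org name are placeholders, and the function names are illustrative. Encrypted transports such as DNS over TLS wrap this same message, so the resolver that terminates the connection still decodes the name.

```python
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Encode a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), one question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def read_qname(message: bytes) -> str:
    """Recover the queried name, as any party that decodes the message can."""
    labels = []
    pos = 12  # the DNS header is always 12 bytes long
    while message[pos] != 0:
        length = message[pos]
        labels.append(message[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    return ".".join(labels)


query = build_query("example.org")
print(len(query), "bytes on the wire; resolver reads:", read_qname(query))
```

Nothing from the page body is involved: the name, the client address on the connection, and the timing are already enough for the resolver operator to log a browsing trace.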
OS integrated privacy relays and VPNs make the problem sharper because they can turn routing into credentialed routing. Apple’s iCloud Private Relay is not a general VPN, but it is sold as an easy privacy layer inside Apple operating systems; Apple tells site operators that Private Relay44 validates the client as an Apple device, validates that the customer has a valid iCloud+ subscription, and presents a coarse location through relay IP addresses. VPN by Google45 is similarly built into supported Pixel devices and depends on account and device eligibility. In parallel, Google Play Integrity exposes app, device, and account verdicts46, while Apple App Attest certifies that a key belongs to a valid app instance47. The user sees a privacy switch; the site may see traffic with claims backed by the vendor.
The same pattern can extend from device legitimacy to age and identity claims. Apple already exposes Wallet based age and identity verification for apps that need to prove a person’s age or identity48. Google Wallet’s digital credentials flow can give apps and websites cryptographically signed identity and age attributes, including threshold claims such as 16+, 18+, or 21+, depending on the relevant content and jurisdiction49. The extension is small: a relay or VPN built into an operating system and vendor account can be expanded beyond a privacy pipe. It can become a channel through which a visited site receives a vendor backed statement that the user is old enough, is not a minor in the relevant jurisdiction, or is located where the vendor says the user is located. The visited site can trust that statement because it is attached to the same account, device, platform, and network stack that already mediates access.
LLM mediation adds another surveillance surface. Search engines, browsers, and privacy products increasingly include chatbots, page summarizers, answer boxes, and browser assistants. The issue is not limited to the largest vendors: DuckDuckGo offers Duck.ai50, Brave offers Leo50, and Kagi offers Assistant and Summarize50. Even when such services are designed with privacy constraints, the interaction changes what can be observed. A vendor can learn the query, the URL or text sent for summarization, the page selected for explanation, and the follow up questions that reveal what the reader considered important.
The conflict over ad blockers intensifies this sorting. Chrome’s Manifest V3 migration removed webRequestBlocking for most extensions and pushed blocking extensions toward declarativeNetRequest [51]. Google’s extension timeline records Manifest V2 being disabled for ordinary users in 2025 [52].
For technically savvy audiences, the practical response is migration toward privacy oriented Chromium forks, such as Ungoogled Chromium53, or toward the Firefox ecosystem and its own forks.
The distinction barely matters for ranking telemetry: both paths make browsing less visible to Chrome derived datasets at precisely the same time that the sites these readers value are often small, noncommercial, and unique. A technical niche can therefore become doubly invisible: its publishers do not instrument users, and its readers leave the browser whose telemetry feeds the measurement layer.
This produces the privacy paradox in its shortest form: every choice can be reasonable, yet their combination creates a structural absence in a ranking system that treats observed behavior as evidence of usefulness.
Monetization Compatibility
The second wall is monetization compatibility. The commercial web is rich in signals: advertising impressions, analytics events, search engine optimization campaigns, sponsored distribution, public relations, social amplification, affiliate incentives, and institutional backlinks. These signals are not merely decorative. They create the surrounding evidence by which relevance, authority, popularity, and demand become legible to search systems.
The small web is structurally weaker on this axis. Independent research, hobbyist expertise, small technical archives, dissenting analysis, and volunteer documentation often lack budgets for SEO, publicity, paid discovery, or continuous content operations. Their value is concentrated in the document itself, not in the commercial machinery around it.
This distinction matters because SEO is not one thing. Making a document crawlable, linking it coherently, and using the words by which readers search for it are ordinary legibility work. Buying ranking credit, manufacturing backlink networks, hiding links, or exploiting inherited domain reputation is adversarial signal production. The commercial web can pay for both legitimate legibility and illicit manufacture; the small web often does neither.
Google’s market position makes this more than an ordinary ranking imperfection. United States federal courts have found unlawful monopoly conduct in Google search and search advertising54, and in Google’s open web advertising technology markets54.
The legal findings matter because the same company controls the dominant search engine, the dominant browser, major advertising infrastructure, a major mobile platform, YouTube, and Gemini. When the measurement layer and the monetization layer sit inside the same corporate structure, the visible web tends to converge with the profitable web.
This does not require a simple claim that using Google advertising directly buys organic ranking. The mechanism is broader. Sites built for commercial capture usually generate the surrounding evidence that search systems can read: attention, links, returning users, structured marketing data, and institutional recognition. Sites built for independence generate less of that evidence. The ranking system can remain formally neutral while the signal economy is not neutral at all.
State Removal and Private Editorial Power
The third wall is state removal. Governments can ask Google to remove content from search, YouTube, Blogger, Play, and other services. Google publishes a Transparency Report55 for these requests, which makes this the most visible form of exclusion.
Visibility, however, does not make the mechanism harmless. A search index with global reach must decide how to handle local illegality, national security claims, defamation complaints, hate speech laws, election rules, copyright demands, and court orders. Each decision shifts the practical boundary of accessible knowledge.
The fourth wall is private editorial power. Unlike government removal, private intervention is much harder to observe. Public reporting56 has documented manual adjustments, blacklists, and special handling for sensitive search surfaces.
Some interventions may be justified as spam control, fraud prevention, election integrity, child safety, copyright compliance, or protection against coordinated manipulation. The problem is not that every intervention is illegitimate. The problem is that the public cannot see the criteria, the complainants, the internal deliberation, or the failed appeals.
This asymmetry creates a hierarchy of audibility. States have formal channels. Large corporations have counsel, lobbyists, and platform contacts. Major publishers can create reputational pressure. Organized activists can create public controversy. Independent authors, small scale researchers, and privacy preserving communities usually have none of these channels. They can be excluded by policy, by silence, or by the ordinary indifference of a platform too large to answer them.
The Convergence
These mechanisms do not operate as a checklist. They compound into a Catch-22: a page needs measurable traffic to earn visibility, but it needs visibility to produce the measurable traffic that ranking systems treat as evidence. Privacy preserving and noncommercial publication starts at the disadvantaged end of that loop: weak telemetry, weak commercial trace, and little institutional leverage.
This is why the garden is difficult to perceive from the inside. A user who searches Google sees results, snippets, knowledge panels, maps, videos, and LLM summaries. The experience feels abundant. Missing material has no visual form. A search result page does not show the documents that were never indexed, the sites below telemetry thresholds, the pages removed after complaints, or the knowledge sources that never became visible enough to train future systems. Absence is rendered as completion. At that point the classic cyberpunk intuition returns without its stylish costume: no rain, chrome, or neon lights; only dashboards, API quotas, policy queues, app store rules, telemetry surfaces, and search results that simply stop returning a page.
LLM Amplification
The stakes change once search becomes an input to LLMs. LLMs are trained on web documents, and systems grounded by search use live search results to attach current information to generated answers. Google’s Gemini 1.5 technical report states that its pretraining data includes “web documents”57, and Google Cloud documentation describes grounding Gemini responses with Google Search57. The corpus boundary is not merely abstract: recent work extracted substantial portions of copyrighted books from production LLMs, including near verbatim extraction in some configurations58, which makes the LLM look less like a lossy summary engine than an ideal plagiarismus apparatus.
This is not merely dependence on inherited search indexes. LLM companies build their own crawlers around the same crawl everything ambition at web scale: discover as much public web as policy, robots rules, and economics permit; classify pages; decide which sources may be used to train models; decide which sources may ground answers; then expose the filtered corpus through a conversational interface.
OpenAI’s crawler documentation is a clean example because it defines
separate robots for search visibility and training data discovery:
OAI-SearchBot is used for ChatGPT search, while GPTBot is used for
content that may enter foundation model training59. OpenAI’s ChatGPT
Atlas60 shows the same direction from the browser side: a Chromium
based browser with ChatGPT built in, where browsed content can be used
for model training when the user enables the relevant training controls,
while pages that opt out of GPTBot are excluded from that training
path.
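From the publisher's side, that split can be expressed in robots.txt. The sketch below uses Python's standard urllib.robotparser to evaluate a hypothetical policy that refuses GPTBot while admitting OAI-SearchBot. The policy and the example.org URL are illustrative, and the parser only shows how the rules read, not how any crawler actually behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: stay out of training data, remain visible in ChatGPT search.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/notes/index.html"  # placeholder URL
training_allowed = parser.can_fetch("GPTBot", url)
search_allowed = parser.can_fetch("OAI-SearchBot", url)

print("GPTBot may fetch:", training_allowed)
print("OAI-SearchBot may fetch:", search_allowed)
```

The two user agents give the publisher two switches, but both belong to the same vendor: the site can decline training and accept search only on the terms that vendor's crawlers define and choose to honor.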
The search index and the training corpus are not identical, but search is a major discovery, filtering, and grounding layer. Systematic exclusions in search therefore become easier to reproduce inside LLM systems.
The effect is larger than one company’s model. Search engines have long
served as discovery infrastructure for crawlers, researchers, ranking
tools, and downstream information systems. When Google reduces access to
deeper result sets, as happened with the widely reported removal or
disablement of the num=100 search parameter61 in September 2025,
the long tail becomes harder to inspect at scale. Systems that depend on
Google Search as a discovery layer inherit the shape of Google’s
visibility boundary.
For LLMs, the consequence is not only that some pages receive less traffic. The deeper consequence is epistemic homogenization. If models learn from and retrieve through a web filtered by surveillance compatibility, commercial signal production, state pressure, and private editorial judgment, then model outputs will overrepresent the same surfaces. The LLM mediated information layer will appear broad while narrowing toward mainstream, institutionally legible, commercially amplified sources. This is one route through which model programming bias62 becomes infrastructural rather than merely conversational.
This matters most for knowledge that is rare rather than popular: primary source reconstruction, independent technical research, unfashionable legal analysis, small language documentation, local history, dissenting institutional critique, and other work that may be valuable precisely because it is not produced inside the machinery of attention.
The Walled Garden
Google is the main case in this article, but the garden form is per vendor. Each vendor combines its own discovery system, client software, network surface, assistant layer, monetization model, and policy process into a distinct information bubble. Google’s version has four walls: surveillance compatibility, monetization compatibility, state removal, and private editorial power.
The first rewards sites and audiences that can be observed, including through operating system telemetry, browser telemetry, DNS, CDN, VPN, browser assistant, and LLM summarization surfaces. The second rewards the commercial signal economy. The third reflects formal legal and political pressure. The fourth reflects opaque platform judgment and private influence.
Protocol exits do not automatically escape this form. gemini:// [63], A12 Web [64], GNUnet [65], and secushare [66] are useful examples because they answer the same failure in the nerdy register: make the medium smaller, more explicit, and less available to the modern web’s ambient extraction machinery. gemini:// does this by defining a deliberately simple TLS based protocol and text/gemini format for capsules rather than browser applications. A12 goes further by treating discovery, link semantics, identity, signed application packages, traffic ownership, and directory mediated rendezvous as runtime properties rather than artifacts recovered later by a crawler. GNUnet moves the same instinct down into a secure distributed application stack, while secushare uses GNUnet as the substrate for a distributed social graph. These attempts clarify the problem, but they do not abolish it. gemini:// becomes a small protocol island; A12 attempts a more explicit web of trust; GNUnet and secushare move trust, naming, routing, and social relations into a peer to peer protocol environment. The garden boundary changes shape, but does not disappear.
Each boundary mechanism can be defended in isolation. User behavior can improve relevance; commercial signals can correlate with legitimacy; state removal can enforce law; editorial intervention can reduce abuse; protocol minimalism can reduce ambient extraction. The danger lies in their combination under monopoly conditions or inside protocol islands. Together they transform the open web into a filtered environment whose boundaries are invisible to the ordinary user.
Afterwords
The old promise of the web was that publication and findability could be separated from institutional permission. A person, small group, or independent organization could publish knowledge, and a search engine could make it discoverable to anyone who needed it. That promise did not require perfect equality of attention, but it required a meaningful path from public availability to public discovery.
The surveilled web breaks that path. Visibility increasingly belongs to content that produces behavioral telemetry, commercial traces, legal safety, and editorial acceptability. Content outside those conditions may remain online while becoming absent67 from the practical information layer. Once LLMs are trained on and grounded through that layer, the exclusion no longer affects only today’s search traffic. It shapes the knowledge that future systems can retrieve, summarize, and treat as real.
The walled garden is therefore not merely a search problem. It is a problem of epistemic infrastructure. A web ordered by surveillance, monetization, state pressure, and opaque editorial judgment cannot truthfully present itself as the open web. It is a curated garden with private walls.
The newer pressure is legal and operational, although its final shape will depend on litigation, enforcement choices, and legislative rework. A site that must prove age, detect bots, prevent fraud, or respect jurisdictional duties has a reason to prefer platform certified traffic. Some laws already make individual sites liable for age gates: Louisiana68, Texas69, and Alabama70 require age verification for sites carrying a substantial portion of material harmful to minors or distributing sexual material harmful to minors, while Utah’s approach71 treats a user as accessing a site from Utah when the user is actually located there, even through a VPN, proxy, or other location masking technique.
Other laws point toward moving age state into platform infrastructure. Utah’s app store law72 routes age category and parental consent through app stores; California’s AB 104373 puts age bracket signals into operating system account setup and real time developer APIs; Brazil’s ECA Digital74 requires app stores and terminal operating systems to assess age or age range and expose age signals through secure APIs; Germany’s JMStV § 1275 requires operating system youth protection settings that constrain browsers, app stores, and app availability, while JMStV § 476 keeps pornography behind closed adult user groups. These examples are illustrative rather than exhaustive. The exact duty varies, and some rules may be narrowed or overturned, but the pressure points in one direction: platform authenticated access.
The result is not a future formal ban on the open web. It is a present legal and operational drift in which ordinary services learn that uncertified traffic is expensive to handle. Banks, government portals, airlines, booking sites, ticketing systems, marketplaces, and other high liability services can reduce legal, fraud, bot, and abuse risk by accepting claims from browsers, operating systems, wallets, relays, app stores, or attestation channels. The Internet of gardens therefore emerges without a single gate closing: important services remain nominally public while practical access increasingly depends on platform certification.
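The cost asymmetry described above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the `Request` type, the `admission_steps` function, the set of claim sources (copied from the list in the paragraph above), and the friction steps are hypothetical names, not the behavior of any real service or attestation API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical claim sources, mirroring the list in the paragraph above.
CERTIFYING_SOURCES = {
    "browser", "operating_system", "wallet",
    "relay", "app_store", "attestation",
}


@dataclass(frozen=True)
class Request:
    claim_source: Optional[str] = None  # which platform vouched for the client, if any
    claim_valid: bool = False           # assumed outcome of verifying that claim


def admission_steps(req: Request) -> List[str]:
    """Return the extra checks an illustrative high liability service imposes."""
    if req.claim_source in CERTIFYING_SOURCES and req.claim_valid:
        # Certified traffic: the platform has already absorbed the verification cost.
        return []
    # Uncertified traffic: the service carries the verification cost itself.
    return ["captcha", "rate_limit", "manual_age_check"]
```

In this sketch no request is ever refused. The service stays nominally public, and the only difference between the two paths is how much friction the uncertified visitor has to absorb.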
The Munich I Regional Court gave the same problem a liability form on May 28, 202677. In a dispute over Google’s AI overview (Übersicht mit KI), the court held under the German Civil Code (Bürgerliches Gesetzbuch, BGB) in conjunction with the Basic Law (Grundgesetz, GG) that the contested display was not merely a display of search results (bloße Anzeige von Suchergebnissen), but Google’s own attributable content (eigener, ihr zurechenbarer Inhalt), because Google presented results in its own words and according to its own structure, created statements beyond the linked third party pages, and was liable as a direct interferer (unmittelbare Störerin) under the protection of the general personality right or corporate personality right (des allgemeinen Persönlichkeitsrechts bzw. des Unternehmenspersönlichkeitsrechts). The court also recorded Google’s own defensive premise: users should not blindly trust AI generated information (den mit KI generierten Informationen nicht blind vertraut werden dürfe). This is the contradiction: Google presents the overview as an answer, but defends it as a statement users must verify elsewhere.
StatCounter, Search Engine Market Share Worldwide, reported Google at 90.04% worldwide search engine share for April 2026. ↩︎
Google Search Central, In-depth guide to how Google Search works, describes Search as an automated crawler based system that discovers pages, downloads content, indexes it, and serves results. ↩︎
Microsoft Support, How Bing delivers search results, describes Bing crawling the web, building an index, and ranking results. ↩︎
Yandex Webmaster, How does Yandex search work?, describes crawling, indexing, database construction, and result generation. ↩︎
Baidu Help Center, About Baiduspider, describes Baiduspider as the crawler that creates Baidu’s index. ↩︎
The Korea Times, Naver remains dominant player in Korea’s search market, April 16, 2026, reports InternetTrend data putting Naver at 63.8% of South Korea’s domestic web search market in March 2026, ahead of Google at 28.7%; Seznam.cz, History of web search, describes Seznam.cz as a Czech company founded in 1996 that developed its own search engine in 2005. ↩︎
DuckDuckGo, Where do DuckDuckGo search results come from?, describes its use of Bing for traditional links and images, plus DuckDuckBot, internal indexes, and specialised sources. ↩︎
Kagi, Search Sources, describes its own indexes, external provider calls, specialised engines, and small web initiatives. ↩︎
Brave, Brave Search removes last remnant of Bing from search results page, describes Brave Search as using its own index for web search results. ↩︎
Brave, Brave Search beta now available in Brave browser, June 22, 2021, describes Brave Search as available in Brave Browser and built on an independent index. ↩︎
DuckDuckGo, Does DuckDuckGo make a browser?, describes the DuckDuckGo Browser for Mac, Windows, iOS, and Android; DuckDuckGo, What is DuckDuckGo VPN?, describes its subscription VPN and browser based management. ↩︎
Kagi, Orion Browser by Kagi, describes Orion as a privacy focused WebKit browser with zero telemetry. ↩︎
OpenWebSearch.eu, Welcome, describes the EU funded OpenWebSearch.eu project, the Open Web Index, and OWS as open European web search and web data infrastructure; the project site links to openwebindex.eu for the Open Web Index, which is currently available for research and development purposes. ↩︎
Ahrefs, Where do you get the data from?, states that Yep maintains its index using a separate crawler called YepBot. ↩︎
Wiby, Build your own Search Engine, states that Wiby is not meant to index the entire web and prefers human submissions. ↩︎
Marginalia Search, About Marginalia Search, describes itself as an independent open source Internet search engine focused on discovery for the free and independent web. ↩︎
Teclis, Teclis - Non-commercial Web Search, describes itself as surfacing the less known, creative, self expressive web, using its own crawl together with Kagi Small Web and Marginalia results; Teclis, Send feedback, accepts link suggestions to authentic, useful pages rather than whole websites, and says suggestions are crawled every few months. ↩︎
smallweb.cc, About smallweb, describes itself as a guide to independent websites, documents custom crawlers for screenshots, metadata, periodic revisits, and discovery; smallweb.cc, Submit a site, accepts personal websites, blogs, portfolios, and similar submissions by email under qualitative moderation rules. ↩︎ ↩︎
YaCy, Home and FAQ, describes YaCy as free software for a decentralized peer to peer web search engine with no central server, no search request storage, and a shared index; the FAQ describes a YaCy peer as providing web indexing services to other peers, crawling linked documents at configured crawl depth, and using peers to answer search queries. ↩︎
smallweb.txt at commit e4df141675bf276f318055c69a54badf73e63d72 is a pinned snapshot of the public feed list; GitHub renders the file as about 36'000 lines. ↩︎
Google Search Help, Resolve Google Search’s “Unusual traffic from your computer network” message, says automated traffic to Google Search can trigger a reCAPTCHA or block, and lists robots, computer programs, automated services, search scrapers, and software that sends searches to find ranking position as automated traffic. ↩︎
Google Search Central, Welcoming the new Search Console URL Inspection API, says the API gives URL level data for properties managed in Search Console and that URL Inspection quota is enforced per Search Console property; Google Search Console API, Usage Limits, lists URL Inspection quota as 2'000 queries per day and 600 queries per minute per site. ↩︎
Google for Developers, Custom Search JSON API, says the API retrieves results from a Programmable Search Engine, is closed to new customers, requires existing customers to transition by January 1, 2027, and gives 100 free queries per day with paid requests at $5 per 1'000 queries up to 10'000 queries per day; Google Programmable Search Engine Help, Programmable Search Engine vs Google.com, says whole web Programmable Search Engines are limited to a subset of the total Google Web Search corpus and can differ from Google.com site: results. ↩︎ ↩︎
Google Search Central, What web creators should know about our March 2024 core update and new spam policies, announced scaled content abuse, expired domain abuse, and site reputation abuse as new spam policies, and states that sites violating spam policies may rank lower or not appear in results; Google, New ways we’re tackling spammy, low-quality content on Search, later said the rollout produced 45% less low-quality, unoriginal content in search results. ↩︎
Janek Bevendorff, Matti Wiegmann, Martin Potthast, and Benno Stein, Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines, in Advances in Information Retrieval, ECIR 2024, monitored Google, Bing, and DuckDuckGo for one year on 7'392 product review queries. The paper reports that review spam sites are often deindexed or penalized quickly after ranker updates, that Google’s updates have noticeable but mostly short lived effects, and that the line between benign content and content or link farms becomes increasingly blurry in the wake of LLM generation. ↩︎
Originality.ai, Can Google Detect and Does it Penalize AI Content, reported 1'446 manual actions among about 79'000 checked content sites, and estimated more than 20 million monthly visits lost; Search Engine Journal, Google’s March 2024 Core Update Impact: Hundreds Of Websites Deindexed, reported Ian Nuttall’s monitored set of 49'345 sites, of which 837 were removed entirely from Google’s index in early March 2024. ↩︎ ↩︎
Indexing Insight, Google Indexing Purge: May 2025, reported that 25% of 2 million monitored pages were actively removed from Google’s index at the end of May 2025, with some sites losing 15%-75% of indexed pages. ↩︎
Google Search Status Dashboard, Ranking history, records the March 2026 spam update and March 2026 core update; Search Engine Roundtable, Google Search May Be Deindexing URLs At Higher Rates, May 1, 2026, records community reports of higher deindexing since early April and Google’s public response that nothing exceptional was visible; Search Engine Land, March 2026 Google core update more volatile than December, April 15, 2026, reports SE Ranking data for March 2026 against December 2025, including 24.1% of top 10 URLs falling out of the top 100, compared with 14.7% after the December update. ↩︎ ↩︎ ↩︎
OpenAI, New AI classifier for indicating AI-written text, said its classifier was withdrawn on July 20, 2023 because of a low rate of accuracy; OpenAI’s own evaluation reported 26% true positives for LLM written text and 9% false positives for human written text. ↩︎
Weixin Liang, Mert Yuksekgonul, Yining Mao, Eric Wu, and James Zou, GPT detectors are biased against non-native English writers, 2023, found that widely used detectors consistently misclassified nonnative English human writing as LLM generated, while native writing samples were classified accurately. ↩︎
OpenAI, New AI classifier for indicating AI-written text, warned that LLM written text can be edited to evade classifiers. ↩︎
Vinu Sankar Sadasivan, Aounon Kumar, Sriram Balasubramanian, Wenxiao Wang, and Soheil Feizi, Can AI-Generated Text be Reliably Detected?, 2023, stress tested detectors with recursive paraphrasing attacks; Kalpesh Krishna, Yixiao Song, Marzena Karpinska, John Wieting, and Mohit Iyyer, Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense, 2023, found that DIPPER paraphrasing reduced DetectGPT accuracy from 70.3% to 4.6% at a constant 1% false positive rate. ↩︎ ↩︎
Kagi, Kagi Small Web, describes smallweb.txt as containing feeds of indexed blogs and documents inclusion rules that reject automated, LLM generated, and spam content. ↩︎
Marginalia Search, Submit websites to be crawled by Marginalia Search, documents a public GitHub based site submission process. ↩︎
Wiby, Submit to the Wiby Web, describes page level submission rules and states that in most cases only the submitted page will be crawled. ↩︎
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford InfoLab technical report 1999-66, 1999; Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, 1998, contrasts Google with conventional information retrieval that relies mostly on keyword matching, and describes its use of link structure, anchor text, and PageRank quality ranking from the web link graph. ↩︎
Google Search Central, SEO Starter Guide and Link best practices for Google, describe SEO as helping search engines understand content, and links as signals for page relevance and discovery. ↩︎
Google Search Central, Spam policies for Google web search, describes link spam as creating links to or from a site primarily to manipulate search rankings, including paid links, excessive link exchanges, automated link creation, hidden links, and expired domain abuse. ↩︎
Rand Fishkin, SparkToro, An Anonymous Source Shared Thousands of Leaked Google Search API Documents with Me; Everyone in SEO Should See Them, May 27, 2024; Mike King, iPullRank, Secrets from the Google Algorithm Leak: Search’s Internal Engineering Documentation and What it Means, May 2024. These sources describe the leaked Content API Warehouse material, including NavBoost, chromeInTotal, and Chrome related fields; they are evidence of collection and system design, not a public statement of exact ranking weights. ↩︎
Google Chrome Developers, Overview of CrUX and CrUX methodology, describe CrUX as field data from Chrome users, state that Google Search uses it for the page experience ranking factor, and note that origins and pages below the popularity threshold are not included. ↩︎
Google Android Help, Share usage & diagnostics information with Google, describes Android usage and diagnostics data, including app use frequency and network connection quality; Google Account Help, Find & control your Web & App Activity, describes Web & App Activity and the optional inclusion of Chrome history and activity from sites, apps, and devices that use Google services. ↩︎ ↩︎
Google for Developers, About PageSpeed Insights, describes PSI as using CrUX real user data and Lighthouse lab data; Chrome for Developers, Eliminate render-blocking resources, describes Lighthouse flagging render blocking scripts and stylesheets, and recommending inlining, deferring, or removing resources. ↩︎ ↩︎
On centralized DNS privacy concerns, see Enforcing DNS-over-TLS on Local DNS Resolver with Random Upstream, especially the “Privacy Concerns” section, which cites Geoff Huston and Bert Hubert on the privacy consequences of browser and application DNS centralization. ↩︎
Apple Developer, Prepare your network or web server for iCloud Private Relay, describes Private Relay as validating that the client is an Apple device, that the customer has a valid iCloud+ subscription, and that relay IP addresses represent coarse location. ↩︎
Google Pixel Phone Help, Connect to VPN by Google on your Pixel device, documents Pixel VPN account and device eligibility requirements. ↩︎
Android Developers, Integrity verdicts, documents Play Integrity verdicts for app, device, and account state. ↩︎
Apple Developer, Establishing your app’s integrity, describes App Attest as certifying that a key belongs to a valid instance of an app. ↩︎
Apple Developer, Get started with the Verify with Wallet API, describes age and identity verification through IDs stored in Apple Wallet, including “Age Over N Flag” and age in years as available request data. ↩︎
Google for Developers, Verify with Google Wallet, describes online requests for verifiable proof of identity and age from Google Wallet or another compliant wallet; the Online Acceptance of Digital Credentials guide uses age_over_18 as an example credential request, not as a universal age threshold. ↩︎
DuckDuckGo, Duck.ai, describes private conversations with third party LLM chat models and text summarization; Brave, Brave Leo, describes a browser assistant that can summarize pages, translate, analyze text, and chat with a tab; Kagi, Kagi Assistant and Kagi Summarize, describe LLM assistant and summarization surfaces. ↩︎ ↩︎ ↩︎
Google Chrome Developers, chrome.webRequest, states that Manifest V3 no longer makes webRequestBlocking available to most extensions and points developers toward declarativeNetRequest. ↩︎
Google Chrome Developers, Manifest V2 support timeline, states that Manifest V2 was disabled for all Chrome channels with Chrome 138 on July 24, 2025 and ceases to function for users upgrading to Chrome 139 and later. ↩︎
The Ungoogled Chromium project describes itself as Chromium without dependency on Google web services and as a drop in Chromium replacement with privacy, control, and transparency changes; see ungoogled-software/ungoogled-chromium. ↩︎
The U.S. Department of Justice, Department of Justice Wins Significant Remedies Against Google, September 2, 2025, summarizes the search monopoly remedies after the August 2024 liability ruling; the Department also reported its ad tech victory in Department of Justice Prevails in Landmark Antitrust Case Against Google, April 17, 2025. ↩︎ ↩︎
Google, Government requests to remove content, reports formal state requests and Google’s responses across product areas and jurisdictions. ↩︎
The Wall Street Journal, How Google Interferes With Its Search Algorithms and Changes Your Results, November 15, 2019; Mike Wacker, Google’s Manual Interventions in Search Results, describes internal blacklists and manual intervention mechanisms using public reporting and leaked internal material. ↩︎
Gemini Team, Gemini 1.5: Unlocking multimodal understanding across millions of tokens, arXiv:2403.05530, describes web documents as part of the pretraining data mixture; Google Cloud, Grounding with Google Search, documents the use of Google Search as a grounding source for Gemini responses. ↩︎ ↩︎
Ahmed Ahmed, A. Feder Cooper, Sanmi Koyejo, and Percy Liang, Extracting books from production language models, arXiv:2601.02671, 2026, evaluated Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro, and Grok 3, and found that substantial copyrighted book text could be extracted from production LLMs despite model and system safeguards. ↩︎
OpenAI, Overview of OpenAI Crawlers, describes OAI-SearchBot as controlling appearance in ChatGPT search results and GPTBot as crawling content that may be used for foundation model training. ↩︎
OpenAI, Introducing ChatGPT Atlas, describes ChatGPT Atlas as a browser with ChatGPT built in; OpenAI Help Center, Setting up the Atlas browser, describes Atlas as a Mac browser built on Chromium; OpenAI Help Center, ChatGPT Atlas - Data Controls and Privacy, describes model training controls for browsed content and the GPTBot opt out. ↩︎
Search Engine Journal, Google Modifies Search Results Parameter, Affecting SEO Tools, September 15, 2025; Botify, What Google’s Removal of num=100 Means for Your Brand, October 15, 2025, describe the disruption caused by Google’s removal or disablement of the 100 results per page parameter. ↩︎
On model programming bias, see Abliterated Large Language Models Treat Users as Capable Adults, especially the final discussion, where the term names decision pressure introduced by hidden training choices, filtering rules, post training interventions, and deployment time controls. ↩︎
Project Gemini, FAQ and Protocol Specification, describe Gemini as a conservative client server request response protocol built on URIs, MIME media types, and TLS, with text/gemini as its native hypertext format and capsule as the usual term for a Gemini server. ↩︎
Arcan, Arcan-A12: Weaving a Different Web, January 26, 2026, describes A12 Web as a protocol and runtime design where document browsing becomes a signed package compilation step, directory servers provide discovery, NAT traversal, relaying, shared state, application hosting, and match making, and links can be bidirectional, authenticated, revocable, typed, rediscoverable, and presence aware. ↩︎
GNUnet, About GNUnet, describes GNUnet as a peer to peer framework using peer identities, link encryption, and application level protocols; GNUnet, GNUnet, describes itself as a network protocol stack for building secure, distributed, and privacy preserving applications. ↩︎
secushare, SECUSHARE, describes secushare as a research project that employs GNUnet for end to end encryption and anonymizing mesh routing, and applies PSYC on top to create a distributed social graph; secushare, Protocol, describes PSYC on top of GNUnet cryptographic routing. ↩︎
The practical trigger for writing this version was mundane: about a month before publication, this site disappeared from Google’s index entirely. That incident is not used here as proof; it is only a local example of the larger visibility problem. ↩︎
Louisiana Legislature, La. R.S. 51:2121, requires reasonable age verification for commercial websites with a substantial portion of material harmful to minors. ↩︎
Texas Legislature, H.B. 1181, Chapter 129B, requires reasonable age verification for commercial websites whose content is more than one third sexual material harmful to minors. ↩︎
Alabama Legislature, H.B. 164, requires reasonable age verification for adult websites, applications, and digital or virtual platforms distributing sexual material harmful to minors. ↩︎
Utah Legislature, S.B. 73, Online Age Verification Amendments, treats access as Utah access when the individual is actually located in Utah, regardless of VPN, proxy, or other location masking. ↩︎
Utah Legislature, S.B. 142, App Store Accountability Act, requires app store providers to verify age categories and provide age category and parental consent status to developers. ↩︎
California Legislature, AB 1043, Digital Age Assurance Act, requires operating system providers, beginning January 1, 2027, to collect birth date, age, or both at account setup and provide age bracket signals to developers. ↩︎
Brazil’s Lei nº 15.211/2025, ECA Digital, requires reliable age verification for legally restricted content, bars self declaration for such access, and requires app stores and terminal operating systems to assess age or age range and provide age signals through secure APIs. ↩︎
Germany’s JMStV § 12 requires operating systems commonly used by minors to provide a Jugendschutzvorrichtung; active age settings must be respected by browsers, app distribution, and app use. ↩︎
Germany’s JMStV § 4 permits pornographic telemedia only when access is limited to adults in a closed user group. ↩︎
Landgericht München I, Final judgment of 28 May 2026 (Endurteil v. 28.05.2026), 26 O 869/26, especially Rn. 19 and 30-47. Rn. 43 records Google’s oral argument that users know AI generated information should not be blindly trusted, and rejects that premise as a defence against liability for a self contained statement. ↩︎