Published 2026-05-07 · 11 min read

How to Make Your Website AI-Agent Readable in 2026 (llms.txt, MCP Cards, Structured Data)

You ask Perplexity a question about your niche industry. It gives a clean, well-sourced answer, citing three of your competitors. Your site, which has a definitive guide on the exact topic, is nowhere to be seen. You try again with ChatGPT, then Claude. Same result. It feels like being invisible.

This isn't a failure of traditional SEO. Your rankings on Google might be fine. This is a new problem: your website isn't "agent-readable." The large language models (LLMs) that power these AI agents are increasingly the first stop for users seeking information. If they can't parse, understand, and trust your content, you don't exist in this new ecosystem. Getting cited by an AI is becoming the new "page one" ranking.

This guide isn't about "using AI for SEO" fluff. It's a technical, practical manual for founders and operators who manage their own websites. We'll cover the specific file formats, server configurations, and data structures that AI crawlers from OpenAI, Anthropic, Google, and others are looking for right now. This is how you get your data out of your website and into their answers.

Why Agent-Readiness Is the New SEO

For two decades, SEO was about signaling relevance to algorithms like Google's PageRank. Now, we must also signal authority and structure to language models. The goal is different. Instead of just a click, you're aiming to become a citable source in a generated answer. This is a higher bar.

If you check your server logs today, you'll likely find that known AI crawlers (GPTBot, ClaudeBot, PerplexityBot, and others) already account for a small but growing slice of requests. For many sites this is in the 1-3% range, and it is expected to climb. This is the data-gathering phase: the models are actively ingesting the web to train future versions. Being accessible now means you're part of that foundational knowledge.
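
One rough way to get your own number is to count crawler hits in your access log. This is a minimal check, assuming an nginx log at the default path; adjust the path and the list of user agents for your setup.

# AI-crawler requests vs. total requests in the current access log
grep -cE "GPTBot|ClaudeBot|PerplexityBot|CCBot" /var/log/nginx/access.log
wc -l < /var/log/nginx/access.log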

Traditional SEO focuses on user intent leading to a click. Agent-readiness focuses on machine-readable data that allows an AI to satisfy user intent directly, with your site as a trusted source. The two are not mutually exclusive, but they require different tactics. A keyword-optimized blog post is great for Google Search. A well-structured page with clear JSON-LD, a permissive robots.txt, and maybe even an `llms.txt` file is what gets you cited by an AI agent.

The `llms.txt` Specification: A User Manual for Your Site

The `llms.txt` file is a proposed convention, not yet a formal standard, for giving AI models instructions about your site. Think of it as a `robots.txt` for usage policy rather than crawl access: it tells models how they are permitted to use your content in their training and output.

What It Is and Where to Put It

An `llms.txt` file is a plain text file served from the root of your domain. The full path should be `https://yourdomain.com/llms.txt`.

The file uses a simple `field: value` format. The key fields currently proposed are:

  • User-Agent: Specifies which bot the rules apply to. A `*` applies to all bots. You can also target specific bots like `ClaudeBot`.
  • Allow: Specifies directories or pages that are explicitly permitted for use in training generative models.
  • Disallow: Specifies directories or pages that are forbidden from being used for training.
  • Allow-Citing: A proposed field to explicitly permit the model to cite your content.

A Practical `llms.txt` Example

Here’s a configuration that allows all bots to use most of the site for training, disallows a private `/members/` area, and explicitly allows citing from the `/articles/` directory.


# Default policy for all LLM agents
User-Agent: *
Disallow: /members/
Disallow: /private-data/
# Explicitly allow citing from our public articles
Allow-Citing: /articles/

# Specific rules for ClaudeBot, if needed
User-Agent: ClaudeBot
Allow: /
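
Once the file is live, confirm it is actually being served rather than swallowed by a redirect or error page. A quick check, assuming the root-path location described above (the same `curl` approach used in the verification section later):

curl -I https://yourdomain.com/llms.txt

A `200` response, ideally with a `text/plain` content type, is what you want to see.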

Pros and Cons of `llms.txt`

  • Pro: It provides a clear, machine-readable way to state your usage terms. This is much better than burying it in a human-readable "Terms of Service" page that no crawler will ever parse.
  • Pro: It's forward-looking. Adopting it now signals that you're an engaged, technically savvy publisher.
  • Con: It's still a proposal. There is no guarantee all major AI companies will honor it. OpenAI, for example, currently relies on `robots.txt`. It's a bet on a future standard.
  • Con: It adds another configuration file to maintain, though for most small sites a simple, permissive file is a set-and-forget task.

JSON-LD: Spoon-Feeding Structured Data to Machines

If you want an AI to understand the *meaning* of your content, you need to tell it what it's looking at. Is this page a product, an article, or a how-to guide? JSON-LD is a way to embed this structured data directly in your HTML, using the vocabulary from Schema.org.

AI agents, especially those focused on shopping or step-by-step instructions, actively look for this data. It's the difference between them trying to guess your product's price and you telling them directly: `"price": "240"`. You should add the JSON-LD script tag within the `<head>` or `<body>` of your HTML. For most platforms (like WordPress with a plugin), this is handled for you once configured.

Key Schemas AI Agents Actually Use

Don't try to implement every schema. Focus on the ones that map to your content and are most valuable to AI agents.

  • Article: Essential for any blog post or publication. It clearly defines the author, publication date, headline, and body. This helps agents attribute content correctly.
    
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Article",
      "headline": "How to Make Your Website AI-Agent Readable",
      "author": {
        "@type": "Organization",
        "name": "GuardLabs"
      },
      "datePublished": "2024-05-21"
    }
    </script>
            
  • Product: If you sell anything, this is non-negotiable. It allows agents to pull product names, descriptions, pricing, availability, and reviews into comparison models. This is how you show up in "what's the best tool for X" queries. Our own Website Care plan could be marked up this way.
    
    <script type="application/ld+json">
    {
      "@context": "https://schema.org",
      "@type": "Product",
      "name": "Website Care Plan",
      "image": "https://guardlabs.online/images/care-icon.png",
      "description": "Annual website maintenance and support.",
      "offers": {
        "@type": "Offer",
        "priceCurrency": "USD",
        "price": "240.00"
      }
    }
    </script>
            
  • FAQPage: If you have a FAQ, mark it up. AI agents love FAQs because they are pre-packaged question-answer pairs, which makes it trivial for them to use your content to answer a user's question directly. (A minimal example follows this list.)
  • HowTo: For step-by-step guides, this schema is perfect. It breaks down the process into discrete steps, which an agent can then re-format and present to a user.
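
Here is a minimal FAQPage sketch in the same style as the examples above. The question and answer text are placeholders; use the real copy from your FAQ page and keep it identical to what visitors see.

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What does a website care plan include?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Annual maintenance, security monitoring, and priority support."
      }
    }
  ]
}
</script>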

The main limitation of JSON-LD is that it's only as good as the data you provide. If your schema is incomplete or inaccurate (e.g., the price on the page doesn't match the `price` in the JSON-LD), it can confuse bots or cause them to distrust your site.

MCP Cards: A Business Card for Your Server

The Machine-readable Citable Page (MCP) protocol is a newer, more experimental concept. The idea is simple: what if, alongside your human-readable webpage, you provided a simple, structured JSON file that contained all the key citable information? This is an MCP "card."

An AI agent could fetch `https://yourdomain.com/my-article.mcp.json` to get the core facts of your article without having to parse HTML, ads, and navigation menus. This makes their job easier and your data cleaner.

When and How to Publish an MCP Card

You don't need an MCP card for every page. It's most useful for data-rich, citable content like reports, product pages, or reference guides.

To implement it, you create a static JSON file that follows the MCP spec and host it at a predictable URL. A common convention is to append `.mcp.json` to the original URL. You then link to it from your HTML page using a `<link>` tag in the `<head>`:

<link rel="alternate" type="application/mcp+json" href="https://yourdomain.com/path/to/page.mcp.json">

A simple MCP card for an article might look like this:


{
  "spec_version": "1.0",
  "title": "How to Make Your Website AI-Agent Readable",
  "url": "https://guardlabs.online/articles/agent-readable-website",
  "author": "GuardLabs",
  "publication_date": "2024-05-21",
  "summary": "A technical guide on using llms.txt, JSON-LD, and MCP cards to make websites understandable to AI agents.",
  "key_points": [
    "AI crawlers represent a growing source of traffic and influence.",
    "llms.txt is a proposed standard for declaring usage rights.",
    "JSON-LD provides essential structured data for context.",
    "robots.txt remains the primary tool for crawl access control."
  ]
}

The main drawback is its novelty. As of this writing, no major AI agent has publicly committed to using MCP cards. Implementing one is a forward-looking bet on a potential standard: a low-effort, high-potential-reward move for technically inclined site owners.

`robots.txt` for AI: The Doorman for Your Data

The `robots.txt` file is your most direct and widely respected tool for controlling which bots can access your site. All major AI companies have introduced specific crawlers and, for now, they respect `robots.txt` directives.

Your choice is simple: allow or disallow. If you want to be cited, you must allow them. Disallowing a bot is a surefire way to be excluded from its model's knowledge base.

A Reference Table of Common AI Bots

Here are the user agents for the most common AI crawlers and what they do. You can use these in your `robots.txt` file to set permissions.

  • GPTBot (OpenAI): Crawls web data to improve future ChatGPT models.
  • ClaudeBot (Anthropic): Used for training Claude models.
  • PerplexityBot (Perplexity AI): Crawls the web to find answers for Perplexity's conversational search engine.
  • Google-Extended (Google): Not a separate crawler but a robots.txt token that controls whether content Google fetches can be used to improve Gemini. Opting out here does not affect Google Search.
  • CCBot (Common Crawl): Run by a non-profit that crawls and archives the web. Its data is widely used to train many open-source and commercial LLMs.

All of these currently state that they honor `robots.txt` directives.

Example `robots.txt` for AI Readiness

A sensible default for most businesses is to allow these bots. If you don't have a `robots.txt` file, create one in the root of your domain. Here is a permissive example:


User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

# You might want to disallow CCBot if you are concerned about
# your content being in a public dataset forever.
User-agent: CCBot
Disallow: /

# Keep your existing rules for other bots
User-agent: *
Disallow: /admin
Disallow: /private/

The only real "con" to allowing these bots is that they use bandwidth. However, their crawl rate is typically low and shouldn't impact performance for most sites. The bigger risk is being left out by disallowing them.

How to Verify: Are the Bots Actually Reading You?

How do you know if any of this is working? You can't just ask ChatGPT "did you read my site?" Instead, you need to test from the agent's perspective.

  1. Check Server Logs: This is the ground truth. Filter your server's access logs for the user agents listed in the table above (e.g., `grep "GPTBot" /var/log/nginx/access.log`). If you see entries with a `200 OK` status code, you know they are successfully crawling your pages. If you see `403 Forbidden` or `503 Service Unavailable`, you have a problem.
  2. Use `curl` to Impersonate a Bot: You can simulate a request from an AI crawler using the command-line tool `curl`. This is great for debugging firewall or CDN issues.

    curl -A "GPTBot" -I https://yourdomain.com/my-article

    The `-A` flag sets the User-Agent string. The `-I` flag just fetches the headers. If you get a `HTTP/2 200` response, the bot can access your site. If you get a `403` or are presented with a CAPTCHA, your security settings are blocking it.

  3. Prompt Engineering for Citation: After you've confirmed the bots are crawling your site and you've given them a few weeks to ingest the data, you can test for citation. The trick is to ask a question where your site is a uniquely authoritative source. Don't ask "what is a website care plan?" Ask something specific that only your content answers well, like: "According to guardlabs.online, what is included in their Website Care plan?" This forces the model to check its specific knowledge of your domain.

Common Mistakes That Make You Invisible to AI

Many well-intentioned sites accidentally block AI agents or make their content impossible to parse.

  • Overzealous Cloudflare Rules: The "Bot Fight Mode" and the more aggressive "Super Bot Fight Mode" settings in Cloudflare are notorious for blocking legitimate AI crawlers. They see a non-human user agent and present a JavaScript challenge that the bot cannot solve. You must go into your Cloudflare settings and specifically allow the user agents for `GPTBot`, `ClaudeBot`, etc. Cloudflare's "AI Audit" feature can help identify and allow these bots.
  • Content Behind Paywalls or Login Walls: An AI crawler is an unauthenticated user. If your definitive guide is behind a hard paywall or requires a login, the bot will only see the login page. It cannot index what it cannot see. If you run a membership site, consider having public, citable summaries or abstracts.
  • Missing Canonical URLs: If you have the same content accessible at multiple URLs (e.g., with and without `www`, or with tracking parameters), you must use the `rel="canonical"` link tag to tell all bots which URL is the master version (see the one-line example after this list). Without it, AI models might treat your content as duplicate or low-quality.
  • Relying on Images or Video for Key Info: LLMs primarily read text. If your product's price, specs, or key features are only available in an image or a video, the AI crawler will miss them. All critical information should exist as plain HTML text on the page.
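
The canonical tag itself is a single line in the `<head>`. The URL below is only an illustration; point it at whichever version of the page you treat as the master:

<link rel="canonical" href="https://yourdomain.com/my-article">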

Making your site agent-readable isn't a one-time fix; it's a new layer of web maintenance. It requires a shift in thinking from just pleasing human visitors and search engine spiders to also accommodating machine learning models. The sites that do this work now will become the trusted, citable sources for the next generation of search and information discovery.

If you've gone through this guide and feel it's more than you want to manage yourself, this is the kind of deep-dive technical audit we perform. Our Agent-Ready Site audit is a full readiness scan that covers everything mentioned here, from `robots.txt` configuration to JSON-LD validation and firewall rules, to ensure your site is positioned to be a source of truth for AI agents.

Want your site cited by ChatGPT and Claude, not skipped?

GuardLabs Agent-Ready audit scans for llms.txt, MCP cards, JSON-LD coverage, robots.txt for 6 AI crawlers, and gives you a prioritized fix list. From $79. See sample report →
