This is a practical, implementation-driven guide for determining whether a robots.txt file is likely to be parsed correctly by modern crawlers.
There is no single authoritative modern specification for robots.txt.
In practice, crawler behavior (especially Google’s) defines what “well-formed” means.
This guide is designed for bookmarklets, linters, and lightweight validators that prioritize real-world parsing behavior over theoretical grammar.
This guide answers one question:
Does this `robots.txt` appear well-formed enough to be interpreted correctly by common crawler parsers?
It does not attempt to:
- Enforce a strict grammar
- Predict indexing behavior
- Enforce SEO best practices
- Guarantee crawler outcomes
A robots.txt file is considered present if:
- It is accessible at `/robots.txt`
- The HTTP response is `200 OK`, or `401` / `403` (both still treated as valid by major crawlers)
Notes:
- A `404` means the file does not exist, not that it is malformed
- `Content-Type: text/plain` is preferred but not required
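As a sketch of this presence check, assuming a fetch-capable runtime (a browser or Node 18+); the function name and return values are illustrative, not part of any standard:

```ts
// Sketch of the presence check described above.
async function checkPresence(origin: string): Promise<"present" | "absent" | "unknown"> {
  const response = await fetch(new URL("/robots.txt", origin), { redirect: "follow" });
  if (response.status === 200) return "present"; // 200 OK
  if (response.status === 401 || response.status === 403) {
    return "present"; // still treated as valid by major crawlers
  }
  if (response.status === 404) return "absent"; // file does not exist, not malformed
  return "unknown"; // 5xx and anything else: inconclusive
}
```

Note that in a bookmarklet, cross-origin fetches are subject to CORS, so a check like this is most reliable when run against the current page's own origin.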
Basic expectations:
- Plain text (not HTML, JSON, or binary)
- UTF-8 or ASCII encoding
- Line-based content
- No obvious binary signatures
Strong signals of a likely malformed file:
- HTML pages (themes, error templates, CMS fallbacks)
- Minified JS, JSON, or binary output
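A heuristic content check along these lines might look like the following sketch; the 512-byte window and the signature list are illustrative choices, not documented crawler behavior:

```ts
// Heuristic content sniffing for the signals listed above.
function looksLikeRobotsText(body: string, contentType: string | null): boolean {
  if (contentType && /text\/html/i.test(contentType)) return false; // served as HTML
  if (body.includes("\u0000")) return false; // NUL byte: obvious binary signature
  const head = body.slice(0, 512).trimStart();
  if (/^(<!doctype html|<html|<head|<body|<\?xml)/i.test(head)) return false; // HTML/XML page
  if (/^[\[{]/.test(head)) return false; // JSON payload
  return true;
}
```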
Each non-empty, non-comment line is parsed independently.
Valid line structure:
```
field-name ":" optional-whitespace value
```
Rules:
- Field names are case-insensitive
- Lines without a colon (`:`) are ignored
- Unknown field names are ignored, not errors
- Inline comments are allowed using `#`
There is no global syntax error state. One malformed line does not invalidate the file.
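A minimal line parser implementing these rules might look like this sketch (the function name and return shape are illustrative):

```ts
// Line-level parsing per the rules above: strip inline `#` comments,
// ignore colon-less lines, and lower-case the field name.
function parseLine(raw: string): { field: string; value: string } | null {
  const line = raw.split("#")[0].trim(); // a comment may start anywhere on the line
  if (line === "") return null;          // blank or comment-only line
  const colon = line.indexOf(":");
  if (colon === -1) return null;         // no colon: ignored, not an error
  return {
    field: line.slice(0, colon).trim().toLowerCase(), // field names are case-insensitive
    value: line.slice(colon + 1).trim(),              // optional whitespace around the value
  };
}
```

Because a malformed line simply returns `null` and the caller keeps iterating, this mirrors the absence of a global syntax error state.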
- Lines starting with `#` are comments
- Comments may appear inline after directives
- Comment-only lines are ignored
Rules are evaluated in groups.
A group is defined as:
```
User-agent: <value>
<directive>: <value>
<directive>: <value>
```
Behavior:
- A group begins with one or more `User-agent` lines
- All subsequent directives apply to that group
- A new `User-agent` starts a new group
- Empty lines are allowed and commonly used as separators
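A sketch of group assembly, building on the `parseLine` sketch above; the flag-based grouping of consecutive `User-agent` lines is one reasonable interpretation of the behavior described:

```ts
interface Rule { field: string; value: string }
interface Group { userAgents: string[]; rules: Rule[] }

// Assemble groups per the behavior above; rules seen before the first
// User-agent are collected as orphans rather than discarded silently.
function parseGroups(lines: string[]): { groups: Group[]; orphaned: Rule[] } {
  const groups: Group[] = [];
  const orphaned: Rule[] = [];
  let current: Group | null = null;
  let collectingAgents = false; // true while consecutive User-agent lines accumulate

  for (const raw of lines) {
    const parsed = parseLine(raw);
    if (!parsed) continue; // blank, comment-only, and colon-less lines never break a group
    if (parsed.field === "user-agent") {
      if (!current || !collectingAgents) {
        current = { userAgents: [], rules: [] }; // a new User-agent starts a new group
        groups.push(current);
      }
      current.userAgents.push(parsed.value);
      collectingAgents = true;
    } else if (parsed.field === "sitemap") {
      collectingAgents = false; // Sitemap is global; a real tool would record it separately
    } else {
      collectingAgents = false;
      if (current) current.rules.push(parsed);
      else orphaned.push(parsed); // rule before the first User-agent
    }
  }
  return { groups, orphaned };
}
```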
The following directives are widely recognized and safe to parse:
- `User-agent`
- `Disallow`
- `Allow`
- `Crawl-delay` (ignored by Google, used by some crawlers)
- `Sitemap` (global; not group-scoped)
Unknown directives must be ignored, not treated as errors.
A file is generally considered well-formed if:
- At least one `User-agent` directive exists
- `Allow` / `Disallow` rules appear after a `User-agent`
- No rules are orphaned before the first `User-agent`
Violations of these rules are likely to cause mis-parsing.
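These checks compose into a small predicate, sketched here on top of `parseGroups` above:

```ts
// Well-formedness predicate per the rules above (illustrative name).
function isLikelyWellFormed(lines: string[]): boolean {
  const { groups, orphaned } = parseGroups(lines);
  return groups.length > 0 && orphaned.length === 0; // >=1 User-agent group, no orphaned rules
}
```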
The following conditions do not make a file malformed, but may be surfaced as warnings:
- Missing `Sitemap` directive
- Relative (non-absolute) sitemap URLs
- Unreachable sitemap URLs
- Missing global `User-agent: *` group
- Unknown or legacy directives
These are best-practice or robustness signals, not parsing failures.
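A sketch of warning collection, reusing `parseLine` from above; the message strings are illustrative, and sitemap reachability is omitted because it requires network requests:

```ts
// Warning-level checks; none of these make the file malformed.
function collectWarnings(lines: string[]): string[] {
  const warnings: string[] = [];
  const parsed = lines
    .map(parseLine)
    .filter((p): p is { field: string; value: string } => p !== null);

  const sitemaps = parsed.filter((p) => p.field === "sitemap");
  if (sitemaps.length === 0) warnings.push("missing Sitemap directive");
  for (const s of sitemaps) {
    if (!/^https?:\/\//i.test(s.value)) warnings.push(`relative sitemap URL: ${s.value}`);
  }
  if (!parsed.some((p) => p.field === "user-agent" && p.value === "*")) {
    warnings.push("missing global User-agent: * group");
  }
  const known = new Set(["user-agent", "disallow", "allow", "crawl-delay", "sitemap"]);
  if (parsed.some((p) => !known.has(p.field))) {
    warnings.push("unknown or legacy directives present");
  }
  return warnings;
}
```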
Do not treat the following as malformed:
- Duplicate `User-agent` entries
- Multiple groups for the same agent
- Mixed casing
- Trailing whitespace
- Empty `Disallow:` (means allow all)
- Legacy or ignored directives (`Noindex`, etc.)
Modern crawlers tolerate all of the above.
Recommended output categories for tools:

Well Formed:
- Parsable
- At least one valid group
- No orphaned rules

Well Formed with Warnings:
- Parsable
- Missing recommended signals (e.g. sitemap)
- Contains ignored or legacy directives

Likely Malformed:
- HTML or binary content
- No parsable directives
- `Allow` / `Disallow` before any `User-agent`
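Combining the earlier sketches, a classifier might map a fetched body onto these categories. Treating a file that has directives but no `User-agent` group as likely malformed is a judgment call, consistent with the well-formedness rules above:

```ts
type Verdict = "well-formed" | "warnings" | "likely-malformed";

// End-to-end classification built from the earlier sketches.
function classify(body: string, contentType: string | null): Verdict {
  if (!looksLikeRobotsText(body, contentType)) return "likely-malformed"; // HTML or binary
  const lines = body.split(/\r\n|\r|\n/);
  const { groups, orphaned } = parseGroups(lines);
  if (!lines.some((l) => parseLine(l) !== null)) return "likely-malformed"; // no parsable directives
  if (orphaned.length > 0) return "likely-malformed"; // rules before any User-agent
  if (groups.length === 0) return "likely-malformed"; // judgment call: no User-agent group
  return collectWarnings(lines).length > 0 ? "warnings" : "well-formed";
}
```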
When presenting results to users:
“robots.txt does not have a strict modern specification. This check verifies whether the file appears well-formed according to common crawler parsers, including Google’s.”
This framing is accurate, defensible, and avoids false precision.
- `robots.txt` is permissive by design
- Modern behavior is defined by crawler implementations
- Structural heuristics outperform strict validation
- Bookmarklet-level tools should prioritize parsing safety
Well Formed:
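An illustrative example (hypothetical paths and domain) that satisfies every structural rule above:

```
# Global group with rules, blank-line separators, and a global Sitemap
User-agent: *
Disallow: /admin/
Allow: /

User-agent: examplebot
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml
```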
Likely Malformed:
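An illustrative example showing orphaned rules, one of the strongest malformation signals (paths are hypothetical):

```
# Rules appear before any User-agent, so they are orphaned
Disallow: /private/
Allow: /public/

User-agent: *
```

The other common likely-malformed case is an HTML error template or CMS fallback page served at `/robots.txt` instead of plain text.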