What is robots.txt?

Definition

robots.txt is a text file located in the root directory of a website that tells search engine crawlers (bots) which pages they can crawl and which pages they should not crawl. This file follows a standard called the Robots Exclusion Protocol or Robots Exclusion Standard.

The robots.txt file acts like a traffic controller for your website. It's the first file that search engine bots check when they visit a website, and through it, they understand the site owner's crawling policy. For example, you can specify areas you don't want to appear in search results, such as admin pages, duplicate content, or test pages.

An important point is that robots.txt is a "request," not a "command." While most legitimate search engine bots (Google, Naver, Bing, etc.) respect the rules in this file, malicious bots or scrapers can ignore them. Therefore, robots.txt alone is not sufficient to protect sensitive information, and proper access control or encryption measures are necessary.
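
Because compliance happens on the crawler's side, here is a minimal sketch of how a well-behaved bot might consult robots.txt before requesting a page, using Python's standard urllib.robotparser module (the bot name "MyBot" and the rules themselves are made up for illustration). A bot that never performs a check like this is simply not bound by the file.

from urllib import robotparser

# Rules a site might publish (illustrative only).
rules = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)  # in practice: rp.set_url("https://example.com/robots.txt"); rp.read()

print(rp.can_fetch("MyBot", "/admin/settings"))  # False - crawling is disallowed
print(rp.can_fetch("MyBot", "/blog/post-1"))     # True  - crawling is allowed
print(rp.crawl_delay("MyBot"))                   # 10   - requested seconds between requests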

Features

  • Crawl Budget Optimization: By blocking crawling of unimportant pages, you can encourage search engines to allocate more resources to pages that really matter.
  • Preventing Duplicate Content: You can prevent SEO issues by blocking pages with similar content or duplicate URLs generated by various parameters.
  • Specifying Sitemap Location: You can specify the location of the sitemap within the robots.txt file so that search engines can easily find it (a short discovery sketch follows this list).
  • Simple and Standardized Format: It can be easily written and modified with a text editor without special technical knowledge.
  • Quick to Apply: Simply uploading the file puts the new rules in place; crawlers apply them the next time they fetch robots.txt (most search engines cache the file, so allow up to a day or so for changes to be picked up).

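The sitemap line mentioned above is machine-readable: crawlers and tools can pull it straight out of robots.txt. Here is a small sketch using Python's urllib.robotparser (the domain is a placeholder; site_maps() requires Python 3.8+):

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
""".splitlines())

print(rp.site_maps())  # ['https://example.com/sitemap.xml']
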
How to Use

Here's how to effectively write and manage a robots.txt file.

Step 1: Understand the Basic Structure
A robots.txt file is built from "User-agent" and "Disallow"/"Allow" directives: User-agent specifies which bot a group of rules applies to, Disallow lists paths that must not be crawled, and Allow lists paths that may be crawled (see Example 1 below).

Step 2: Identify Areas to Block
Identify the areas of your website that should not be exposed to search engines. These typically include admin pages (/admin), pages containing personal information, duplicate content, test pages, search result pages, and shopping cart or checkout pages.

Step 3: Write the robots.txt File
Write the file in a plain text editor. Rules that apply to all bots start with "User-agent: *", and you can add separate rule groups for specific bots.

Step 4: Upload to the Root Directory
Upload the finished robots.txt file to the root directory of your website. It must be reachable at https://yoursite.com/robots.txt.
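
If you want a quick programmatic check that the file is really reachable at the root, a small Python sketch like the following can help (the domain is a placeholder for your own):

from urllib import request

# Fetch the live file and confirm the server returns it successfully.
with request.urlopen("https://yoursite.com/robots.txt", timeout=10) as resp:
    print(resp.status)                  # expect 200
    print(resp.read().decode("utf-8"))  # the rules crawlers will actually see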

Step 5: Test
Use the robots.txt report in Google Search Console (which replaced the older robots.txt Tester) to verify that the file parses correctly and that the intended URLs are blocked or allowed as expected.
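
Alongside the Search Console report, you can sanity-check the rules yourself with Python's urllib.robotparser; the file path, bot name, and expected results below are illustrative. Note that urllib.robotparser follows the original standard and may not understand wildcard rules, so rely on Search Console for those.

from urllib import robotparser

# Load the rules from a local copy of the file (adjust the path as needed).
rp = robotparser.RobotFileParser()
with open("robots.txt", encoding="utf-8") as f:
    rp.parse(f.read().splitlines())

# URLs we expect to be blocked (False) or allowed (True).
checks = {
    "/admin/dashboard": False,
    "/products/red-shoes": True,
}

for path, expected in checks.items():
    result = rp.can_fetch("Googlebot", path)
    status = "OK" if result == expected else "MISMATCH"
    print(f"{status}: {path} -> can_fetch={result}")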

Step 6: Regular Review
Update the robots.txt file whenever the website structure changes, and review it regularly to make sure important pages have not been blocked by accident.

Examples

Example 1: Basic robots.txt Structure

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /temp/
Allow: /

Sitemap: https://zero-coke.com/sitemap.xml

This is the most basic robots.txt file. It instructs all search engine bots not to crawl the admin, private, and temp directories, and allows everything else. The sitemap location is specified at the end.

Example 2: Rules for Specific Bots

# Google bot rules
User-agent: Googlebot
Disallow: /search/
Disallow: /cart/
Allow: /

# Bing bot rules
User-agent: Bingbot
Disallow: /admin/
Allow: /

# Image search bot
User-agent: Googlebot-Image
Disallow: /private-images/
Allow: /

# Block bad bots
User-agent: BadBot
Disallow: /

# All other bots
User-agent: *
Crawl-delay: 10
Disallow: /admin/

Sitemap: https://zero-coke.com/sitemap.xml
Sitemap: https://zero-coke.com/sitemap-images.xml

This example shows how to apply different rules for different bots. Comments (#) are used to improve readability.

Example 3: Using Wildcards

User-agent: *
# Block all PDF files
Disallow: /*.pdf$

# Block URLs with specific parameters
Disallow: /*?sort=
Disallow: /*?filter=

# Block files with specific extensions
Disallow: /*.php$
Disallow: /*.inc$

# But allow specific directories
Allow: /public/*.pdf$

Sitemap: https://zero-coke.com/sitemap.xml

You can create more sophisticated rules using wildcards (*) and path end specifiers ($).

Example 4: E-commerce Site robots.txt

User-agent: *
# Block user account-related pages
Disallow: /account/
Disallow: /login/
Disallow: /register/
Disallow: /checkout/
Disallow: /cart/

# Prevent duplicate content - sorting and filter parameters
Disallow: /*?sort=
Disallow: /*?page=
Disallow: /*?filter=

# Search result pages
Disallow: /search?

# Admin area
Disallow: /admin/

# Allow product pages (important!)
Allow: /products/

# Set crawl delay (server load management)
Crawl-delay: 5

Sitemap: https://zero-coke.com/sitemap.xml
Sitemap: https://zero-coke.com/sitemap-products.xml
Sitemap: https://zero-coke.com/sitemap-categories.xml

This is a comprehensive robots.txt example that can be used for actual e-commerce sites.

Pros and Cons

Pros

  • Improved Crawling Efficiency: It helps search engines not waste time and resources crawling unnecessary pages, allowing them to focus more on important content. This is essential for managing crawl budgets efficiently, especially for large-scale websites.

  • Reduced Server Load: You can reduce the server load caused by aggressive crawling. The Crawl-delay directive lets you ask bots to wait between requests and protect server resources, though support varies: Bing honors it, while Google ignores Crawl-delay and manages its crawl rate separately.

  • Simple Implementation: It can be easily implemented with just one text file without complex programming knowledge, and modifications can be made immediately. No separate database or server configuration changes are required.

Cons

  • Not a Security Measure: robots.txt is merely a recommendation and has no enforcement power. Malicious bots or hackers can ignore this file, so it's not suitable for protecting sensitive information. In fact, specifying blocked paths in robots.txt can result in telling attackers the location of hidden pages.

  • Serious Impact from Mistakes: If the robots.txt file is written incorrectly, you could accidentally block the entire website or exclude important pages from search results. Just one wrong entry like "Disallow: /" can make your entire site disappear from search engines.

  • Not Immediately Effective: Even if you update the robots.txt file, search engines may not reflect it immediately. To remove already indexed pages, robots.txt alone is not enough, and you need to separately request URL removal in Google Search Console.

FAQ

Q: Can I delete already indexed pages with robots.txt? A: No, robots.txt only blocks new crawling and does not delete already indexed pages. In fact, blocking with robots.txt prevents search engines from recrawling the page, so they can't check updated information (e.g., noindex tag). To remove already indexed pages, you should first add a noindex meta tag to the page, let search engines check it, and then block it in robots.txt. Alternatively, you can use Google Search Console's URL removal tool.

Q: What happens if there's no robots.txt file? A: Even without a robots.txt file, the website works normally, and search engines assume all pages can be crawled. In other words, everything is allowed. For small websites or sites that want all pages to appear in search results, this isn't a problem, but if there are areas that need to be blocked, you must create a robots.txt file.
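
You can see this "everything allowed" default from a client's point of view with Python's urllib.robotparser: parsing an empty rule set, which is effectively what a missing or empty robots.txt amounts to, allows every URL.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([])                                # no rules at all
print(rp.can_fetch("AnyBot", "/any/page"))  # True - nothing is blocked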

Q: What's the difference between Disallow and noindex? A: Disallow (robots.txt) blocks search engines from crawling a page, but if the page is linked from elsewhere, it can still appear in search results (with only the title and URL, no content). On the other hand, noindex (meta tag) allows crawling but instructs that the page should never be displayed in search results. To completely remove a page from search results, you should use the noindex meta tag, not robots.txt blocking.

Q: How do wildcards (*) work? A: A wildcard (*) matches zero or more of any character. For example, "Disallow: /admin*" blocks every path starting with /admin, such as /admin, /admin/, /admin/users, and /administrator. "$" marks the end of a URL, so "Disallow: /*.pdf$" blocks all URLs ending in .pdf. However, not all crawlers support wildcards: major search engines (Google, Bing, etc.) honor them, but some older or simpler bots may ignore them.
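
As a rough illustration of that matching behavior, the sketch below translates a robots.txt-style pattern into a regular expression. It is not an official parser, and Python's built-in urllib.robotparser, for instance, may not honor these wildcard extensions.

import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Turn a robots.txt-style pattern into an equivalent regex (sketch only)."""
    anchored = pattern.endswith("$")  # "$" means "end of URL"
    body = pattern[:-1] if anchored else pattern
    parts = (re.escape(part) for part in body.split("*"))  # "*" matches anything
    return re.compile("^" + ".*".join(parts) + ("$" if anchored else ""))

rule = wildcard_to_regex("/*.pdf$")
print(bool(rule.match("/docs/report.pdf")))      # True  - ends in .pdf
print(bool(rule.match("/docs/report.pdf?v=2")))  # False - does not end in .pdf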