What Is a Robots.txt File and How to Set It

As a blogger, you have probably come across the term ‘robots.txt’ and been confused by it. I was confused too the first time I heard it. Now that I understand it, I will explain it as best I can.

What is Robots.txt

A robots.txt file is a plain-text (.txt) file placed at the root of a website. It tells search engine crawlers which pages or directories they may crawl. It works by listing rules that allow or disallow crawling of particular paths. A common use is keeping crawlers away from system pages such as administrator pages, which site owners normally do not want to show up in search results. Keep in mind that robots.txt is a request that well-behaved crawlers follow, not a security mechanism: it does not stop anyone from opening a URL directly, so sensitive pages still need proper access control.
Blogs have a robots.txt file just like other websites, but the blogging platform usually sets it up by default. For a Blogspot blog, the default robots.txt looks like this:

User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://blogURL/feeds/posts/default?orderby=UPDATED

These rules let the Google AdSense crawler crawl every page of your site, and let all other crawlers crawl every page of your blog except those whose URL starts with blogURL/search. The pages under that prefix are the label and archive pages.
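To see how these rules behave, here is a minimal sketch using Python's standard urllib.robotparser module. It assumes the AdSense group carries an empty "Disallow:" line, and the crawler name "SomeCrawler" and the two paths are made up for illustration. Python's parser applies the first matching rule in file order, which may differ from how a particular search engine resolves conflicts.

```python
from urllib.robotparser import RobotFileParser

# Blogger's default rules; the empty "Disallow:" line under the AdSense
# crawler means "nothing is disallowed" for that crawler.
DEFAULT_RULES = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://blogURL/feeds/posts/default?orderby=UPDATED
"""


def build_parser(robots_txt):
    """Parse robots.txt text from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(DEFAULT_RULES)

# Hypothetical paths, used only for illustration.
label_page = "/search/label/seo"
post_page = "/2012/11/what-is-robots-and-how-set-it.html"

print(parser.can_fetch("SomeCrawler", label_page))           # False: under /search
print(parser.can_fetch("SomeCrawler", post_page))            # True: covered by Allow: /
print(parser.can_fetch("Mediapartners-Google", label_page))  # True: AdSense may crawl everything
```

Feeding the text through parse() keeps the example offline; on a live site you would normally call set_url() and read() instead.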

How to Set Robots.txt

If you want to modify your robots.txt, be careful first of all: a mistake in the settings can make your site disappear from search engine results pages (SERPs). Have a clear purpose in mind, because this is not a small thing for your site or blog.
Now I will explain how to set it correctly. First, you need to know what each line in the robots.txt file above means.

“User-agent: Mediapartners-Google” : the rules placed under this line apply only to the Mediapartners-Google crawler. Mediapartners-Google is the Google AdSense crawler.

“Disallow:” : with no path after it, this means no pages are disallowed. In other words, the crawler may crawl all pages on the site.

“User-agent: *” : the rules placed under this line apply to all search engines/crawlers.

“Disallow: /search” : tells the selected crawlers not to crawl URLs that start with blogURL/search…

“Allow: /” : allows the crawlers to crawl all other pages.

“Sitemap: http://blogURL/feeds/posts/default?orderby=UPDATED” : this is the sitemap of your site, submitted so that crawlers can find your pages.
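The way these lines group together can be sketched in a few lines of Python. This is a simplified illustration, not a full parser: the function name group_rules is made up here, and it only handles the User-agent, Allow, Disallow and Sitemap fields discussed above.

```python
def group_rules(robots_txt):
    """Return ({user_agent: [(directive, value), ...]}, [sitemap_urls])."""
    groups = {}
    sitemaps = []
    current_agents = []
    expecting_agents = True  # consecutive User-agent lines share one group

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if not expecting_agents:
                # A User-agent line after some rules starts a new group.
                current_agents = []
                expecting_agents = True
            current_agents.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow"):
            expecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
        elif field == "sitemap":
            sitemaps.append(value)  # Sitemap lines are not tied to any group

    return groups, sitemaps


rules = """\
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap: http://blogURL/feeds/posts/default?orderby=UPDATED
"""

groups, sitemaps = group_rules(rules)
for agent, directives in groups.items():
    print(agent, directives)
print(sitemaps)
```

Each crawler ends up with its own list of rules, while the Sitemap line stands apart from the groups.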

The rules above are not the only ones you can use in robots.txt. For example, if I wanted to block this post so that no search engine crawls it, I could add this:

User-agent: *
Disallow: /2012/11/what-is-robots-and-how-set-it.html
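A Disallow rule takes only the path part of a URL, not the domain. As a small sketch, the hypothetical helper below (disallow_line_for is a name made up for this example) derives the rule from a full post URL using the standard urllib.parse module.

```python
from urllib.parse import urlsplit


def disallow_line_for(page_url):
    """Build a Disallow line from a full page URL (rules use the path only)."""
    path = urlsplit(page_url).path or "/"
    return f"Disallow: {path}"


print(disallow_line_for("http://blogURL/2012/11/what-is-robots-and-how-set-it.html"))
# Disallow: /2012/11/what-is-robots-and-how-set-it.html
```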

To block top-level folders or directories (including their contents) on your site, you can do this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /wp-admin/
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-content/cache/
Disallow: /wp-content/themes/
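If you keep the blocked directories in a list, a rule block like the one above can be generated and checked before uploading. This is a rough sketch: build_robots_txt is a made-up helper, and the two test paths are hypothetical WordPress URLs.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(blocked_dirs, user_agent="*"):
    """Return robots.txt text that disallows each directory and its contents."""
    lines = [f"User-agent: {user_agent}"]
    for directory in blocked_dirs:
        # Normalise to "/name/" so the rule covers only that directory tree.
        path = "/" + directory.strip("/") + "/"
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt([
    "cgi-bin",
    "wp-admin",
    "wp-includes",
    "wp-content/plugins",
    "wp-content/cache",
    "wp-content/themes",
])
print(robots_txt)

# Check the generated rules with the standard-library parser.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
print(checker.can_fetch("*", "/wp-admin/options.php"))          # False
print(checker.can_fetch("*", "/wp-content/uploads/photo.jpg"))  # True
```

The trailing slash matters: "/wp-admin/" blocks that directory and everything in it, but not a page such as "/wp-admin-guide.html".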

We can also block URLs that contain a specific word. For example, to keep all crawlers away from label and search URLs containing the word “people”, we can use the “*” wildcard like this:

User-agent: *
Disallow: /search/*people

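We can also block files of a specific type. The following example blocks files ending in .php, .js and .css. Here “*” matches any sequence of characters and “$” marks the end of the URL. Be aware that Google advises against blocking the CSS and JavaScript files a page needs in order to render, so use rules like these with care.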

User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.css$
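Python's built-in urllib.robotparser does not interpret the “*” and “$” wildcards, so to try out patterns like “/search/*people” or “/*.php$” you need a small matcher of your own. The sketch below is one possible approach: the function names are made up, the paths are hypothetical, and it only checks Disallow patterns without handling Allow rules or rule precedence.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regular expression.

    '*' matches any run of characters; a trailing '$' anchors the end of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where "*" appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """True if the path matches any Disallow pattern, matching from the start."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


patterns = ["/search/*people", "/*.php$", "/*.js$", "/*.css$"]

print(is_blocked("/search/label/people", patterns))  # True
print(is_blocked("/search/label/seo", patterns))     # False
print(is_blocked("/index.php", patterns))            # True
print(is_blocked("/index.php?page=2", patterns))     # False: "$" requires .php at the end
print(is_blocked("/theme/style.css", patterns))      # True
```

Without the “$”, a pattern is treated as a prefix, so “/search/*people” also matches “/search/people-skills”.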

User Agent

When setting up your robots.txt, you can choose which crawlers are allowed or disallowed by naming them in the User-agent line. To do that, you need to know the names of some well-known crawlers/spiders. Here is a list:

Google AdSense – Mediapartners-Google
Google – Googlebot
Altavista – Scooter
Lycos – Lycos_Spider_(T-Rex)
Alltheweb – FAST-WebCrawler/
Yahoo – Yahoo Slurp
MSN – Msnbot
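As a quick illustration of targeting one crawler by name, the sketch below gives Googlebot its own rule while leaving every other crawler unrestricted. The rule set and the “/drafts/” path are invented for this example, and the matching shown is that of Python's urllib.robotparser, which compares names case-insensitively and ignores anything after a “/” in the crawler's name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: only Googlebot is kept out of /drafts/.
RULES = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:
"""

agent_parser = RobotFileParser()
agent_parser.parse(RULES.splitlines())

draft = "/drafts/unfinished-post.html"

print(agent_parser.can_fetch("Googlebot", draft))      # False: named group applies
print(agent_parser.can_fetch("googlebot/2.1", draft))  # False: case and version are ignored
print(agent_parser.can_fetch("Msnbot", draft))         # True: falls back to the "*" group
```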

*Note:
-Every example above is for Blogspot, except the example that blocks folders/directories (that one is for WordPress). Each site needs different robots.txt settings because each site has different directories. The robots.txt of a WordPress blog will certainly differ from that of a Blogspot blog, but the rules of the syntax are the same.
-If your site does not use the Blogspot platform, you must create the robots.txt file yourself and place it in the root directory of your site.
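Because crawlers look for the file at the root of the host, its address can be worked out from any page URL on the site. Here is a minimal sketch with the standard urllib.parse module; robots_txt_url is a made-up helper name and www.example.com is a placeholder domain.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the robots.txt address for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/blog/2012/11/post.html?m=1"))
# https://www.example.com/robots.txt
```

A robots.txt placed in a subfolder, such as /blog/robots.txt, is not where crawlers look, so rules put there will not take effect.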

Well, that’s all. If you still have questions, don’t hesitate to ask me. Now you know what robots.txt is and how to set it for your site. I look forward to your responses and hope this was useful. Happy blogging, guys 🙂


