

The first rule of web crawling is you do not harm the website. The second rule of web crawling is you do NOT harm the website.

We're supporters of the democratization of web data, but not at the expense of the website's owners. In this post, we're sharing a few tips for our platform and Scrapy users who want polite and considerate web crawlers.

Whether you call them spiders, crawlers, or robots, let's work together to create a world of Baymaxs, WALL-Es, and R2-D2s rather than an apocalyptic wasteland of HAL 9000s, T-1000s, and Megatrons.

A polite crawler never degrades a website's performance
A polite crawler identifies its creator with contact information
A polite crawler is not a pain in the buttocks of system administrators

robots.txt

Always make sure that your crawler follows the rules defined in the website's robots.txt file. This file is usually available at the root of a website, and it describes what a crawler should or shouldn't crawl according to the Robots Exclusion Standard. Some websites even use the crawlers' user agent to specify separate rules for different web crawlers:
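Here's a sketch of what that can look like. Only the Some_Annoying_Bot user agent comes from the original example; the Disallow rules are made up for illustration:

```
# Ban one misbehaving bot from the whole site...
User-agent: Some_Annoying_Bot
Disallow: /

# ...while giving everyone else a narrower rule.
User-agent: *
Disallow: /admin/
```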

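If you're building on Scrapy, honoring robots.txt and identifying your crawler are both one-liners. A minimal sketch of a project's settings.py, with placeholder names and contact details:

```python
# settings.py -- politeness basics for a Scrapy project.

# Identify your crawler and its operator so system administrators
# can contact you instead of blocking you. Placeholder values:
USER_AGENT = 'examplecorp-crawler (+https://example.com/bot; bot@example.com)'

# Tell Scrapy's robots.txt middleware to honor allow/disallow rules.
ROBOTSTXT_OBEY = True
```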
Mission-critical to having a polite crawler is making sure your crawler doesn't hit a website too hard: when a website gets overloaded with more requests than the web server can handle, it might become unresponsive. Respect the delay that crawlers should wait between requests by following the robots.txt Crawl-Delay directive.
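With Scrapy, a fixed download delay or the AutoThrottle extension keeps your request rate in check. A sketch with illustrative values; note that Scrapy's robots.txt support covers allow/disallow rules, not Crawl-Delay (as far as we know), so carry a site's Crawl-Delay value into DOWNLOAD_DELAY yourself:

```python
# settings.py -- throttling. Values here are illustrative.

# Wait between requests to the same site; if robots.txt specifies a
# Crawl-Delay, use at least that value here.
DOWNLOAD_DELAY = 1.5  # seconds

# Keep per-domain concurrency low so you never hammer one server.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Alternatively, let AutoThrottle adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
```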
