Today we have a new toy 🙂. We have built an XML sitemap generator as a Google Chrome extension. You can download it here: https://chrome.google.com/webstore/detail/thewebminer-sitemap-gener/gdljgjdcflclcapfnoejmbpodgajkbcd?hl=en
How to use Tor to avoid CAPTCHA
These days I have published a Python 3 module that is used intensively at TheWebMiner. The module is called PyTor and it is available here: https://github.com/adibalcan/PyTor. It helps us avoid CAPTCHAs and other mechanisms that block robots which crawl websites too fast. With this module you can detect when a website has banned your IP address and easily change it (actually, this happens automatically). The module is now public, so you can use it in your own applications.
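PyTor handles this internally, but as a rough illustration of the underlying technique, here is a minimal sketch that requests a new Tor identity when a ban is detected. It uses the stem library and the requests library (with SOCKS support), not PyTor's actual API; the ban heuristic, the control-port password, and the ports (Tor's defaults) are assumptions made up for the example.

# Requires the stem package and requests[socks]; assumes a local Tor daemon
import requests
from stem import Signal
from stem.control import Controller

# Route traffic through the local Tor SOCKS proxy (default port 9050)
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def looks_banned(response):
    # Hypothetical heuristic: treat 403/429 or a CAPTCHA page as a ban
    return response.status_code in (403, 429) or "captcha" in response.text.lower()

def new_tor_identity(password="my_control_password"):
    # Ask the Tor control port (default 9051) for a new circuit, i.e. a new exit IP
    with Controller.from_port(port=9051) as controller:
        controller.authenticate(password=password)
        controller.signal(Signal.NEWNYM)

def fetch(url):
    response = requests.get(url, proxies=TOR_PROXIES, timeout=30)
    if looks_banned(response):
        new_tor_identity()  # switch to a fresh IP address
        response = requests.get(url, proxies=TOR_PROXIES, timeout=30)
    return response

print(fetch("http://example.com").status_code)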
I hope it is useful for you.
Can robots.txt protect a website from scraping?
No. Robots.txt is a formal crawling guide for web crawlers (especially for search engines).
With robots.txt you can prevent unwanted pages or sections from appearing in search engines, but it cannot stop bots from parsing those pages.
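As a concrete illustration, here is a minimal sketch using Python's standard urllib.robotparser: a polite crawler asks robots.txt whether a URL may be fetched, but nothing technically enforces the answer. The domain, path, and user-agent name are placeholders.

from urllib.robotparser import RobotFileParser

# Download and parse the site's robots.txt (placeholder domain)
parser = RobotFileParser()
parser.set_url("http://example.com/robots.txt")
parser.read()

# A polite crawler checks the rules before fetching a page...
allowed = parser.can_fetch("MyCrawler", "http://example.com/private/page.html")
print("Allowed by robots.txt:", allowed)

# ...but compliance is voluntary: robots.txt cannot stop a bot
# that simply ignores it and requests the page anyway.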
What you need to know before building a web crawler
This article is for people with technical skills who have some experience in the internet field.
A web spider, or web crawler, is a program built and used to extract data from a specific website.
Before you start coding a web crawler, you need to think about the following points:
1. What is your data source (the website URL)?
2. What is your crawling strategy?
If you get data from multiple URLs, where do you start: from an index page, or from a list of all the URLs of interest? (A short sketch after this list illustrates points 2 to 4.)
3. Common elements
Crawling is about finding common elements and extracting different data from different locations (URLs) that share the same structure, such as a div with a specific class or another HTML element.
4. Programming language
Which programming language will you use, and which libraries do you need? This is also the point where you decide whether to use a DOM parser or regular expressions to find the common elements and extract data from them.
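To tie points 2 to 4 together, here is a minimal sketch using requests and BeautifulSoup (one possible choice of libraries, not the only one): it starts from an index page, collects the URLs of interest, then extracts data from elements with the same structure. The URL, CSS classes, and field names are assumptions made up for the example.

import requests
from bs4 import BeautifulSoup

INDEX_URL = "http://example.com/products/"  # hypothetical index page

def get_item_urls(index_url):
    # Point 2: crawling strategy - start from an index page
    # and collect the links to every page of interest.
    html = requests.get(index_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("a.product-link")]  # assumed link class

def extract_item(url):
    # Point 3: common elements - every item page is assumed to keep
    # its data in the same structure (divs with known classes).
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return {
        "url": url,
        "name": soup.select_one("div.product-name").get_text(strip=True),
        "price": soup.select_one("div.product-price").get_text(strip=True),
    }

# Point 4: a DOM parser (here BeautifulSoup) is usually more robust
# than regular expressions when the HTML structure changes slightly.
for item_url in get_item_urls(INDEX_URL):
    print(extract_item(item_url))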