What you need to know before building a web crawler

This article is for readers with technical skills and some experience in the web field.

A web spider, or web crawler, is a program built and used for extracting data from a specific website.

Before you start coding a web crawler, you need to settle the following points:

1. What is your data source (website URL)?

2. What is your crawling strategy?

If you get data from multiple URLs, where do you start? It may be an index page, or a list of all the URLs of interest.
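As an illustration of the index-page strategy, here is a minimal sketch that collects the links found on an index page so they can be crawled one by one. It uses only Python's standard library. The `LinkCollector` class, the `collect_links` helper, the sample HTML and the example.com URLs are all assumptions made for the example, and a real crawler would first download the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the index page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_links(index_html, base_url):
    """Return the list of absolute URLs linked from index_html."""
    parser = LinkCollector(base_url)
    parser.feed(index_html)
    parser.close()
    return parser.links


# Hypothetical index page, hard-coded so the sketch runs without network access.
INDEX_HTML = """
<ul>
  <li><a href="/products/1">First</a></li>
  <li><a href="/products/2">Second</a></li>
  <li><a href="https://example.com/about">About</a></li>
</ul>
"""

urls_to_crawl = collect_links(INDEX_HTML, "https://example.com/")
print(urls_to_crawl)
```

The resulting list plays the role of the "list of all the URLs of interest": the crawler visits each entry in turn.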

3. Common elements

Crawling is about finding common elements and extracting different data from different locations (URLs), where the data sits in elements with the same structure, such as a div with a specific class or another HTML element.
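The idea of a common element can be sketched as follows: the same extraction rule (here, "the text of every div with class price") is applied to several pages that share one structure. This is only a rough example built on Python's standard `html.parser`. The class name `price`, the `extract_by_class` helper and the two sample pages are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collects the text of every <div> that carries a given class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0      # greater than 0 while inside a matching <div>
        self.current = []   # text pieces of the <div> being read
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested <div> inside a matching one
        elif self.class_name in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.current = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.current).strip())

    def handle_data(self, data):
        if self.depth:
            self.current.append(data)


def extract_by_class(html, class_name):
    """Return the text of each <div> whose class list contains class_name."""
    parser = ClassTextExtractor(class_name)
    parser.feed(html)
    parser.close()
    return parser.results


# Hypothetical pages with the same structure but different data.
PAGES = {
    "https://example.com/products/1":
        '<div class="title">Lamp</div><div class="price">19.99</div>',
    "https://example.com/products/2":
        '<div class="title">Desk</div><div class="price big">120.00</div>',
}

for url, html in PAGES.items():
    print(url, extract_by_class(html, "price"))
```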

4. Programming language

Which programming language can you use, and which libraries do you need? This is also the point where you decide whether to use a DOM parser or regex to find the common elements and extract data from them.

DOM versus Regex in web scraping

In the web scraping field there are two methods for filtering data, and the question is which one is best.

The correct answer is: it depends.

The first is to use a DOM (Document Object Model) parser, and the second is regex matching (regex is short for regular expression). Both have advantages and disadvantages.

DOM Parser

Advantages        Disadvantages
Simple to code    Uses more memory
                  Sensitive to bad HTML
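To make the DOM trade-off concrete, here is a small sketch using Python's standard `xml.dom.minidom`. It is a strict XML DOM parser, chosen here only because it makes the sensitivity to bad markup easy to see. Many HTML-specific parsers are more forgiving. The `titles_via_dom` helper, the `h2` tag and the `title` class are assumptions for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def titles_via_dom(markup):
    """Return the text of every <h2 class="title">, or None if parsing fails."""
    try:
        document = minidom.parseString(markup)
    except ExpatError:
        return None  # the strict parser rejects malformed markup
    titles = []
    for node in document.getElementsByTagName("h2"):
        if node.getAttribute("class") == "title" and node.firstChild is not None:
            titles.append(node.firstChild.data.strip())
    return titles


# Hypothetical snippets: the second one has a <p> that is never closed.
WELL_FORMED = '<div><h2 class="title">Lamp</h2><h2 class="title">Desk</h2></div>'
MALFORMED = '<div><h2 class="title">Lamp</h2><p>unclosed paragraph</div>'

print(titles_via_dom(WELL_FORMED))  # ['Lamp', 'Desk']
print(titles_via_dom(MALFORMED))    # None
```

The extraction itself takes only a few lines, but the whole document is loaded into a tree in memory, and a single unclosed tag is enough to make this strict parser give up.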

Regex

Advantages                 Disadvantages
Insensitive to bad HTML    Uses more CPU
                           More difficult to code
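The regex side can be sketched in a similar way with Python's standard `re` module. The pattern, the `titles_via_regex` helper and the sample markup are assumptions for illustration. The pattern only looks for `h2` elements whose class attribute is exactly `title`, written with double quotes.

```python
import re

# Matches <h2 ... class="title" ...>text</h2> and captures the text.
TITLE_RE = re.compile(
    r'<h2[^>]*\bclass="title"[^>]*>(.*?)</h2>',
    re.IGNORECASE | re.DOTALL,
)


def titles_via_regex(markup):
    """Return the text of every <h2 class="title"> found by the pattern."""
    return [match.strip() for match in TITLE_RE.findall(markup)]


# Hypothetical snippet with a <p> that is never closed.
MALFORMED = '<div><h2 class="title">Lamp</h2><p>unclosed paragraph</div>'

print(titles_via_regex(MALFORMED))  # ['Lamp']
```

The unclosed tag does not matter, because the pattern never looks at the rest of the document. The cost is that the pattern has to anticipate every variation of the target markup: as written, it would miss `<h2 class='title'>` with single quotes, which is part of why regex is harder to code correctly.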