Tag Archives: regex

Processing large text files

At TheWebMiner we often need to process large text files, and by large I mean files of a few hundred megabytes or more. Of all the text processing tools we have tested so far, we concluded that the best is Vim, or gVim (the graphical version of this famous editor, which is available for Windows).

Regular Expressions

Another useful tool in file processing is regular expressions, or, more simply, RegEx. These expressions help us find, or find and replace, pieces of text that match a certain format, all done automatically. Combining the two tools, however, raises a new problem.

How do we use Regular Expressions in Vim?

Vim has its own regex syntax, so standard regular expressions cannot be used in Vim unchanged. We have created a converter for this purpose and made it available on our site (www.thewebminer.com/regex-to-vim), and we hope it will be of help.
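To make the difference concrete, here is a minimal sketch in Python of what such a conversion involves. It is not the TheWebMiner converter, and the function name `to_vim_magic` is made up for this illustration. It relies on one fact about Vim's default "magic" mode: the characters `+ ? | ( ) {` are literal when bare and only become operators when preceded by a backslash, which is the reverse of Perl-style syntax. The sketch handles only basic grouping, alternation, and quantifiers; lazy quantifiers, non-capturing groups, lookarounds, and unusual bracket expressions are assumed not to occur.

```python
# Characters that are operators in Perl-style regex when bare, but are
# operators in Vim's default "magic" mode only when backslash-escaped.
TOGGLE = set("+?|(){")


def to_vim_magic(pattern: str) -> str:
    """Convert a small subset of Perl-style regex to Vim magic-mode syntax."""
    out = []
    i = 0
    in_class = False  # True while inside a [...] bracket expression
    while i < len(pattern):
        ch = pattern[i]
        has_next = i + 1 < len(pattern)
        if in_class:
            # Inside brackets everything is copied through unchanged.
            if ch == "\\" and has_next:
                out.append(ch + pattern[i + 1])
                i += 2
                continue
            out.append(ch)
            if ch == "]":
                in_class = False
        elif ch == "\\" and has_next:
            nxt = pattern[i + 1]
            if nxt in TOGGLE or nxt == "}":
                out.append(nxt)        # escaped literal becomes a bare literal
            else:
                out.append(ch + nxt)   # \d, \w, \s, \. and so on stay as they are
            i += 2
            continue
        elif ch == "[":
            in_class = True
            out.append(ch)
        elif ch in TOGGLE:
            out.append("\\" + ch)      # bare operator gets a backslash
        else:
            out.append(ch)
        i += 1
    return "".join(out)


print(to_vim_magic(r"(foo|bar)+\d{2,4}"))  # \(foo\|bar\)\+\d\{2,4}
```

The printed pattern can then be used in a Vim command such as `:%s/\(foo\|bar\)\+\d\{2,4}/X/g`.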

How to use regex in Vim?

We often need to process big text files (larger than 100 MB), and we discovered that the best text editor for this is Vim, or gVim (its graphical version, available for Windows). A powerful way to process text automatically is to use regular expressions (also called RegEx).
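As an illustration of automatic find and replace on a big file, here is a minimal Python sketch. The date pattern, the function name `rewrite_dates`, and the file paths are all hypothetical and chosen only for the example. The file is read line by line, so memory use stays small regardless of file size.

```python
import re

# Hypothetical pattern: dates written as DD/MM/YYYY.
DATE_RE = re.compile(r"\b(\d{2})/(\d{2})/(\d{4})\b")


def rewrite_dates(src_path: str, dst_path: str) -> int:
    """Copy src to dst, rewriting DD/MM/YYYY as YYYY-MM-DD.

    Returns the number of replacements made.
    """
    replaced = 0
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # streams the file one line at a time
            new_line, count = DATE_RE.subn(r"\3-\2-\1", line)
            dst.write(new_line)
            replaced += count
    return replaced
```

A call such as `rewrite_dates("big_input.txt", "big_output.txt")` would process the whole file in a single pass.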

Using RegEx in Vim

Vim doesn’t use the standard RegEx syntax, but we built a tool that converts standard regex to Vim regex. The tool is available here: RegEx to Vim.

We hope you find it useful.

What you need to know before building a web crawler

This article is for people with technical skills who have some experience in the internet field.

A web spider, or web crawler, is a program built and used for extracting data from a specific website.

Before you start coding a web crawler, you need to settle the following points:

1. What is your data source (website URL)?

2. What is your crawling strategy?

If you get data from multiple URLs, where do you start? Perhaps from an index page, or from a list of all the URLs of interest.

3. Common elements

Crawling is about finding common elements and extracting different data from different locations (such as URLs), where the data sits in elements with the same structure, like a div with a specific class or another HTML element.

4. Programming language

Which programming language can you use, and which libraries do you need? This is also the point at which you need to decide whether to use a DOM parser or regex to find the common elements and extract data from them.
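The four points above can be sketched in a few lines of Python using only the standard library. Everything specific here is an assumption made for illustration: the in-memory `PAGES` dictionary stands in for a real website (a real crawler would download each URL), and the class name `title` stands in for whatever common element the target pages share. Nested divs are not handled.

```python
from html.parser import HTMLParser

# Hypothetical site held in memory: URL -> HTML.
PAGES = {
    "/index": '<a href="/item/1">One</a> <a href="/item/2">Two</a>',
    "/item/1": '<div class="title">Red lamp</div><div class="ad">x</div>',
    "/item/2": '<div class="title">Blue chair</div>',
}


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (the crawling strategy)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.links.append(href)


class ClassTextCollector(HTMLParser):
    """Collects the text of every <div> with a given class (the common element)."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.capturing = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == self.css_class:
            self.capturing = True
            self.texts.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.capturing = False

    def handle_data(self, data):
        if self.capturing:
            self.texts[-1] += data


def crawl(pages, index_url, css_class):
    """Start at the index page, follow its links, extract the common element."""
    links = LinkCollector()
    links.feed(pages[index_url])
    links.close()
    results = {}
    for url in links.links:
        collector = ClassTextCollector(css_class)
        collector.feed(pages[url])
        collector.close()
        results[url] = collector.texts
    return results


print(crawl(PAGES, "/index", "title"))
# {'/item/1': ['Red lamp'], '/item/2': ['Blue chair']}
```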

DOM versus Regex in web scraping

In the web scraping field there are two methods for filtering data, and the question is: which is best?

The correct answer is: it depends.

The first method is to use a DOM (Document Object Model) parser and the second is regex matching (regex is short for regular expressions). Both have advantages and disadvantages.

DOM Parser

Advantages              | Disadvantages
Simple to code          | Uses more memory
                        | Sensitive to bad HTML


Regex

Advantages              | Disadvantages
Insensitive to bad HTML | Uses more CPU
                        | More difficult to code
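The trade-off around bad HTML can be illustrated with a small Python sketch. The markup samples and function names are invented for the example, and `xml.dom.minidom` is used only as a stand-in for a DOM parser: it is a strict XML parser, so it shows the sensitivity in its most extreme form, while dedicated HTML parsers are generally more forgiving.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical markup: one well-formed sample, one with unclosed <li> tags.
GOOD = '<ul><li class="price">10</li><li class="price">25</li></ul>'
BAD = '<ul><li class="price">10<li class="price">25</ul>'

PRICE_RE = re.compile(r'<li class="price">\s*([^<]+)')


def prices_dom(markup):
    """DOM approach: build the whole tree in memory, then query it."""
    doc = minidom.parseString(markup)
    return [
        li.firstChild.data
        for li in doc.getElementsByTagName("li")
        if li.getAttribute("class") == "price"
    ]


def prices_regex(markup):
    """Regex approach: match the text directly, ignoring document structure."""
    return [m.strip() for m in PRICE_RE.findall(markup)]


print(prices_dom(GOOD))    # ['10', '25']
print(prices_regex(GOOD))  # ['10', '25']
print(prices_regex(BAD))   # ['10', '25']
try:
    prices_dom(BAD)
except ExpatError as err:
    print("DOM parser rejected the bad markup:", err)
```

The DOM version is easier to read and to extend, but it fails on the malformed sample, while the regex version keeps working at the cost of a pattern that is tied to the exact attribute layout.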