At TheWebMiner, we often need to process large text files, and by large I mean files of a few hundred megabytes or more. Of all the text-processing tools we have tested so far, we found the best to be Vim, or gVim (the graphical version of this famous editor, available for Windows).
Another useful tool for file processing is regular expressions, or RegEx for short. These expressions let us find, or find and replace, pieces of text that match a certain format, all done automatically. Combining the two raises a new problem.
How do we use regular expressions in Vim?
Vim has its own RegEx syntax, so standard regular expressions cannot be used in Vim as they are. We have therefore created a converter for this purpose and made it available to you. You can find the converter on our site (www.thewebminer.com/regex-to-vim), and we hope it will be of help.
We often need to process big text files (larger than 100 MB), and we found that the best text editor for this is Vim, or gVim (its graphical version, available for Windows). A powerful way to process text automatically is to use regular expressions (also called RegEx).
Using RegEx in Vim
Vim doesn’t use the standard RegEx syntax, so we built a tool that converts standard RegEx to Vim RegEx. This tool is available here: RegEx to Vim.
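To give an idea of what such a conversion involves, here is a minimal sketch in Python. It is not the converter from our site, only an illustration, and the function name `pcre_to_vim_magic` is made up for this example. It relies on one rule: in Vim's default "magic" mode, the characters `+ ? | ( ) {` are literal unless prefixed with a backslash, which is the opposite of standard RegEx. Constructs such as non-capturing groups, lookarounds and lazy quantifiers have a different Vim syntax and are not handled here.

```python
def pcre_to_vim_magic(pattern: str) -> str:
    """Convert a small subset of standard regex syntax to Vim "magic" syntax.

    Assumption: only the backslash flip for + ? | ( ) { is handled.
    Character classes ([...]) are copied through unchanged.
    """
    toggle = set("+?|(){")   # operators that need a backslash in Vim
    out = []
    in_class = False         # True while inside a [...] character class
    i = 0
    n = len(pattern)
    while i < n:
        ch = pattern[i]
        if in_class:
            out.append(ch)
            if ch == "\\" and i + 1 < n:
                # Keep an escaped character inside the class as it is.
                out.append(pattern[i + 1])
                i += 1
            elif ch == "]":
                in_class = False
        elif ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in toggle:
                out.append(nxt)        # escaped literal becomes a bare literal
            else:
                out.append(ch + nxt)   # \d, \w, \s, \. and so on stay the same
            i += 1
        elif ch in toggle:
            out.append("\\" + ch)      # bare operator gets a backslash
        else:
            if ch == "[":
                in_class = True
            out.append(ch)
        i += 1
    return "".join(out)


vim_pattern = pcre_to_vim_magic(r"(\d+)-(\d+)")
```

With this sketch, the standard expression `(\d+)-(\d+)` becomes `\(\d\+\)-\(\d\+\)`, which can be used in a Vim search or in a `:s` command.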
This article is for people with technical skills who have some experience in the internet field.
A web spider, or web crawler, is a program built and used for extracting data from a specific website.
Before you start coding a web crawler, you need to know a few things about the following points:
1. What is your data source (website URL)?
2. What is your crawling strategy?
If you get data from multiple URLs, where do you start? Perhaps from an index page, or from a list of all the URLs of interest.
3. Common elements
Crawling is about finding common elements and extracting different data from different locations (such as URLs), where the data sits in elements with the same structure, such as a div with a specific class or another HTML element.
4. Programming language
Which programming language can you use for this, and which libraries do you need? This is also the point at which you need to decide whether to use a DOM parser or RegEx to find the common elements and extract data from them.
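The points above can be sketched in a few lines of Python, using only the standard library. This is a minimal illustration, not a full crawler: the sample HTML, the `example.com` address, the class names `title` and `price`, and all function names are assumptions made up for the example, and no pages are actually downloaded. It shows an index page used as the starting point, a parser-based extraction of divs with a given class, and the RegEx alternative for the same task.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical sample pages standing in for downloaded HTML.
INDEX_HTML = '<ul><li><a href="/item/1">One</a></li><li><a href="/item/2">Two</a></li></ul>'
ITEM_HTML = '<div class="product title">Blue <b>mug</b></div><div class="price">4.50</div>'


class LinkCollector(HTMLParser):
    """Collects absolute URLs from the <a href="..."> tags of an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


class ClassTextExtractor(HTMLParser):
    """Collects the text of every <div> whose class list contains target_class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # how many <div> levels deep we are inside a match
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def collect_links(index_html, base_url):
    """Crawling strategy: start from an index page and list the URLs of interest."""
    parser = LinkCollector(base_url)
    parser.feed(index_html)
    return parser.links


def extract_by_class(html, target_class):
    """Parser-based extraction of the common element."""
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    return [item.strip() for item in parser.items]


def extract_with_regex(html, target_class):
    """RegEx alternative: shorter, but matches only this exact markup shape."""
    pattern = re.compile(r'<div class="%s">(.*?)</div>' % re.escape(target_class), re.S)
    return [match.strip() for match in pattern.findall(html)]


links = collect_links(INDEX_HTML, "https://example.com/")
titles = extract_by_class(ITEM_HTML, "title")
prices = extract_with_regex(ITEM_HTML, "price")
```

The parser version copes with nested tags and with several classes on the same div, while the RegEx version breaks as soon as the attribute order or the nesting changes. That trade-off is usually what decides between the two approaches.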