We often use Excel for data processing, and today I will share an Excel cheat sheet with you.
Author Archives: Adrian Balcan
Can robots.txt protect a website from scraping?
No. Robots.txt is a formal set of crawling guidelines for web crawlers (especially search engines).
With robots.txt you can ask search engines to skip unwanted pages or sections, but it cannot stop bots from parsing those pages.
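To illustrate why robots.txt is only advisory, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent and the `example.com` URLs are made up for illustration. The check happens entirely on the crawler's side, so a bot that never runs it is not blocked by anything.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for this illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks first; an impolite one simply skips this call.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/page.html"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public/page.html"))
```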
What do you have to know before requesting web scraping services?
Before you request web scraping services, you have to know what your needs are: what data you need, how it should be structured, and where it can be found.
Step 1: Define what data you need
Data needs depend on your purpose. If you want to find new customers, you probably need contact data for the players in your industry. If you want to study your competitors, you first need to define who they are. Only after that can you select the data sources (websites, feeds, or other electronic sources) for the extraction.
In many cases, search engines like Google, Bing, Yahoo, and others are used to discover and define data sources.
Step 2: Structure of data
The data structure is directly linked to the intended use. In many cases it is a table where a row represents an entity and each cell of that row represents a property of the entity. In other cases it is a chart or another graphic representation built from data extracted from a web source.
Step 3: Number of data extractions
In many cases a one-time data extraction is all that is needed. In other cases, when you need a regular report, periodic extractions are needed.
If you have defined all of the points above, you are ready to request a quote and an amount estimate from this contact form.
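As a small sketch of the table layout described above, each Python dictionary below is one entity (a row) and each key is one property (a cell). The hotel records and field names are invented for illustration only.

```python
import csv
import io

# Invented sample entities: one dict per row, one key per property.
hotels = [
    {"name": "Example Hotel One", "city": "Sample City", "has_pool": "yes"},
    {"name": "Example Hotel Two", "city": "Sample Town", "has_pool": "no"},
]


def to_table(rows):
    """Serialize a list of dicts as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_table(hotels))
```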
What you need to know before building a web crawler
This article is for people with technical skills who have some experience in the internet field.
A web spider or web crawler is a program built and used for extracting data from a specific website.
Before you start coding a web crawler, you need some information about the following points:
1. What is your data source (website URL)?
2. What is your crawling strategy?
If you get data from multiple URLs, how do you start? Maybe from an index page, or from a list of all the URLs of interest.
3. Common elements
Crawling is about finding common elements and extracting different data from different locations (URLs), where the data is contained in elements with the same structure, such as a div with a specific class or another HTML element.
4. Programming language
Which programming language can you use, and which libraries do you need? This is also the point where you need to decide whether to use a DOM parser or regex for finding the common elements and extracting data from them.
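The points above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a complete crawler: the index page HTML, the `item` class name and the `example.com` base URL are assumptions made up for the example, and no network request is made.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Made-up index page; in a real crawler this would be downloaded from the data source.
INDEX_HTML = """
<ul>
  <li><a class="item" href="/products/1">First</a></li>
  <li><a class="item" href="/products/2">Second</a></li>
  <li><a href="/about">About</a></li>
</ul>
"""


class ItemLinkParser(HTMLParser):
    """Collect the href of every <a> element that carries a given class."""

    def __init__(self, base_url, css_class):
        super().__init__()
        self.base_url = base_url
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        # The "common element" here is an <a> tag with the expected class.
        if tag == "a" and self.css_class in classes and attributes.get("href"):
            self.links.append(urljoin(self.base_url, attributes["href"]))


def collect_item_urls(index_html, base_url, css_class="item"):
    """Return absolute URLs of all links on the index page that match css_class."""
    parser = ItemLinkParser(base_url, css_class)
    parser.feed(index_html)
    return parser.links


if __name__ == "__main__":
    for url in collect_item_urls(INDEX_HTML, "https://example.com/"):
        print(url)
```

Starting from an index page like this gives the crawler its list of URLs of interest; each of those pages would then be parsed the same way for its own common elements.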
DOM versus Regex in web scraping
In the web scraping field there are two methods for filtering data, and the question is which one is best.
The correct answer is: it depends.
The first is to use a DOM (Document Object Model) parser and the second is regex matching (regex is short for regular expressions). Both have advantages and disadvantages.
DOM Parser
Advantages | Disadvantages |
---|---|
Simple to code | Uses more memory |
 | Sensitive to bad HTML |
Regex
Advantages | Disadvantages |
---|---|
Insensitive to bad HTML | Uses more CPU |
 | More difficult to code |
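A rough sketch of the trade-off in Python, under some loud assumptions: the markup and the `price` class are invented, and the DOM side uses the strict XML parser `xml.dom.minidom` from the standard library, which rejects malformed markup outright. Many HTML-specific parsers are more forgiving, so treat this as an illustration of the "sensitive to bad HTML" point rather than a benchmark.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Invented markup: GOOD is well formed, BAD has an unclosed <span>.
GOOD = '<div><span class="price">10</span><span class="price">12</span></div>'
BAD = '<div><span class="price">10</span><span class="price">12</div>'

PRICE_RE = re.compile(r'<span class="price">([^<]*)')


def prices_dom(markup):
    """Extract prices by walking a DOM tree (raises ExpatError on malformed markup)."""
    document = minidom.parseString(markup)
    return [
        span.firstChild.data
        for span in document.getElementsByTagName("span")
        if span.getAttribute("class") == "price"
    ]


def prices_regex(markup):
    """Extract prices by pattern matching on the raw text."""
    return PRICE_RE.findall(markup)


if __name__ == "__main__":
    print(prices_dom(GOOD), prices_regex(GOOD))
    try:
        prices_dom(BAD)
    except ExpatError as error:
        print("DOM parser failed:", error)
    print(prices_regex(BAD))
```

The regex keeps working on the broken markup, but it is tied to the exact attribute order and quoting in the pattern, which is part of why it is harder to write and maintain.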
Convert from SQLite to CSV
It’s very simple:
sqlite> .mode csv
sqlite> .separator ,
sqlite> .output exported_file.csv
sqlite> select * from yourtable;
sqlite> .exit
You can use another separator. For Microsoft Excel, the default separator is “;” in many regional settings.
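The same export can also be scripted. The following is a minimal sketch using Python's standard `sqlite3` and `csv` modules. The in-memory database and its two sample rows are invented for the example, `yourtable` is the placeholder table name from the commands above, and `export_table` is a hypothetical helper.

```python
import csv
import io
import sqlite3


def export_table(connection, table, out, delimiter=","):
    """Write every row of a table to out as CSV, with a header row."""
    # A table name cannot be bound as a parameter, so quote it as an identifier.
    quoted = '"{}"'.format(table.replace('"', '""'))
    cursor = connection.execute("SELECT * FROM {}".format(quoted))
    writer = csv.writer(out, delimiter=delimiter)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Invented sample data in an in-memory database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE yourtable (id INTEGER, name TEXT)")
connection.executemany(
    "INSERT INTO yourtable VALUES (?, ?)",
    [(1, "plain"), (2, "with, comma")],
)

buffer = io.StringIO()
export_table(connection, "yourtable", buffer)

if __name__ == "__main__":
    print(buffer.getvalue())
```

The `csv` writer quotes values that contain the delimiter, so a field like `with, comma` survives the round trip. Pass `delimiter=";"` to produce a file for an Excel installation that expects semicolons.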
TheWebMiner video presentation
We have a new video presentation 🙂
How to convert .XLS files to .CSV and vice versa
XLS (Microsoft Excel spreadsheet format)
A binary format used by Microsoft Excel for storing data.
CSV (Comma Separated Values)
A file format for storing data in text files. Every value is separated from the next by a delimiter (usually , or ;). The content of a CSV file looks like the following lines:
Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
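As a small sketch, the sample content above can be read with Python's standard `csv` module by passing `;` as the delimiter. Converting the decimal comma in the Length column to a float is an extra step assumed here for illustration, and `read_semicolon_csv` is a hypothetical helper name.

```python
import csv
import io

# The sample CSV content from this post, one record per line.
SAMPLE = "Year;Make;Model;Length\n1997;Ford;E350;2,34\n2000;Mercury;Cougar;2,38\n"


def read_semicolon_csv(text):
    """Parse semicolon-delimited CSV text into a list of dicts, one per row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    rows = []
    for row in reader:
        # The sample uses a decimal comma, so swap it before converting.
        row["Length"] = float(row["Length"].replace(",", "."))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_semicolon_csv(SAMPLE):
        print(row["Year"], row["Make"], row["Model"], row["Length"])
```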
Convert XLS to CSV
This is very simple:
1. Open the XLS (Excel spreadsheet) file with Microsoft Excel
2. Choose Save as (from the File or Office menu)
3. Enter a file name
4. Select CSV (Comma delimited) (*.csv) from Save as type (below File name)
5. Press Save, then OK and Yes
That is all.
Convert CSV to XLS file
1. Open the CSV file with Microsoft Excel
2. Choose Save as (from the File or Office menu)
3. Enter a file name
4. Select Excel 97-2003 Workbook (*.xls) from Save as type (below File name)
5. Press Save
That is all.
Apple App Store apps list
You can now download, for free, the entire list of applications in the Apple App Store with details about each one (over 1,200,000 records). Link: //thewebminer.com/download
Web scraping in email marketing
Many businesses have difficulty bringing their products to market. Often their market is easy to define. In a defined market you can easily research your competitors and your customers and, what is very valuable, you can find details about every potential customer and bring them an offer.
A real example:
If your business makes cleaning solutions for swimming pools, you may be interested in the contact details (contact person, phone number, email address, website) of all hotels with swimming pools near you, or even in the entire country. With this data you can do direct marketing or email marketing (sending a service or product offer by email).