Category Archives: Data & Research

New tools

Hello everybody!

Today we proudly present a new feature of the site: a tool that can be useful not only to large companies but also to individual users reading this blog from the comfort of their homes.

For this new tool we had to redesign the tools section, so we hope you will also be fond of the new look, which is simpler and more elegant.

Now, about the tool itself: we are confident that it will be of good use to you, because its main purpose is to find the most important section, article or data in a web page. This can be a difficult task, especially on large websites or on pages filled with promotional content that is of no use to anyone. You will also see how easy it is to use: just enter the URL and hit the “I’m lucky” button, and the extractor will quote the text right in the TheWebMiner tab.

That’s all for now. We hope that you will put this new tool to work (you may find it at this link) and that it will save you a lot of work!

 

What you need to know before building a web crawler

This article is for people with technical skills who have some experience in the internet field.

A web spider, or web crawler, is a program built and used for extracting data from a specific website.

Before you start coding a web crawler, you need to settle the following points:

1. What is your data source (website URL)?

2. What is your crawling strategy?

If you get data from multiple URLs, where do you start: from an index page, or from a list of all the URLs of interest?

3. Common elements

Crawling is about finding common elements and extracting different data from different locations (such as URLs) contained in elements with the same structure, like a div with a specific class or another HTML element.

4. Programming language

Which programming language can you use, and which libraries do you need? This is also the point where you decide whether to use a DOM parser or regex to find the common elements and extract data from them.
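To make the four points more concrete, here is a minimal sketch in Python that uses only the standard library. It assumes the data source is an index page that links to the pages of interest, and that the wanted data sits in div elements sharing one class. The class names, the helper names (extract_links, extract_items, crawl) and the example URL in the usage note are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag found on an index page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the index page URL.
                self.links.append(urljoin(self.base_url, href))


class ClassTextCollector(HTMLParser):
    """Collects the text of every <div> that carries a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0  # greater than 0 while inside a matching div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            # Nested div inside a matching one: track it so we know when we leave.
            self.depth += 1
        else:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1
                self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.items[-1] += data


def extract_links(html, base_url):
    """Point 2: the crawling strategy starts from the links on an index page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def extract_items(html, wanted_class):
    """Point 3: the common element is a div with a specific class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    return [item.strip() for item in parser.items]


def crawl(index_url, wanted_class):
    """Fetch the index page, then every linked page, and collect the items."""
    results = {}
    with urlopen(index_url) as response:
        index_html = response.read().decode("utf-8", errors="replace")
    for url in extract_links(index_html, index_url):
        with urlopen(url) as response:
            page_html = response.read().decode("utf-8", errors="replace")
        results[url] = extract_items(page_html, wanted_class)
    return results
```

A call such as crawl("http://example.com/list/", "price") would return a dictionary that maps each linked URL to the texts found in its matching div elements. A real crawler would also need error handling, polite delays between requests and respect for the site's robots.txt, all of which this sketch leaves out.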

DOM versus Regex in web scraping

In the web scraping field there are two methods for filtering data, and the question is: which is best?

The correct answer is: it depends.

The first is to use a DOM (Document Object Model) parser and the second is regex matching (regex is short for regular expressions). Both of them have advantages and disadvantages.

DOM Parser

Advantages: simple to code
Disadvantages: uses more memory; sensitive to bad HTML

Regex

Advantages: insensitive to bad HTML
Disadvantages: uses more CPU; more difficult to code
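The trade-off can be seen in a small Python sketch. It uses xml.dom.minidom from the standard library as a stand-in for a strict DOM parser (dedicated HTML parsers are usually more forgiving), and a hand-written regular expression for the same job. The sample markup and the function names are assumptions made for this illustration.

```python
import re
from xml.dom import minidom
from xml.parsers.expat import ExpatError

GOOD = '<ul><li class="item">Ford</li><li class="item">Mercury</li></ul>'
BAD = '<ul><li class="item">Ford<li class="item">Mercury</ul>'  # unclosed <li> tags


def items_with_dom(markup):
    """Build the whole tree in memory, then pick the nodes we want."""
    document = minidom.parseString(markup)
    return [
        node.firstChild.data
        for node in document.getElementsByTagName("li")
        if node.getAttribute("class") == "item"
    ]


# The pattern has to spell out the exact markup, which is harder to write
# and to maintain, but it does not care whether the tags are balanced.
ITEM_PATTERN = re.compile(r'<li class="item">([^<]*)')


def items_with_regex(markup):
    """Scan the raw text for the pattern; no tree is built."""
    return ITEM_PATTERN.findall(markup)


print(items_with_dom(GOOD))
print(items_with_regex(GOOD))
print(items_with_regex(BAD))
try:
    items_with_dom(BAD)
except ExpatError as error:
    print("DOM parser rejected the bad markup:", error)
```

On the well-formed input both functions return the same two values. On the broken input the regex still finds them, while the strict parser raises an error. The regex, however, would silently miss items if the attribute order or the quoting changed, which is part of what makes it harder to code correctly.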

Hello Big Data

If you are interested in the scraping business, you have probably heard by now of a concept called Big Data. This is, as the name says, a collection of data so big and complex that it is very hard to process. Nowadays it is estimated that a typical Big Data collection is in the range of tens of exabytes (one exabyte is 10 to the power of 18 bytes), and it is estimated that by 2020 more than 18000 exabytes of data will have been created.

There are many pros and cons of Big Data because, while some organisations wouldn’t know what to do with a collection of data bigger than a few dozen terabytes, others wouldn’t consider analyzing data smaller than that. One of the major cons attributed to Big Data is that with such a large amount of data, correct sampling is very hard to do, so major errors can disrupt the analysis. On the other hand, Big Data has brought a revolution in science and, more generally, in the economy. It is enough to think that in Geneva alone, the Large Hadron Collider has more than 150 million sensors delivering data about 40 million times per second, for about 600 million collisions per second. As for the business sector, the one that we are interested in, Amazon handles queries from more than half a million third-party sellers each day, dealing with millions of back-end operations. Another example is Facebook, which has to handle more than 50 billion photos from its users.

Generally, there are 4 main characteristics of Big Data. The first and most obvious one is volume, which I have already discussed and which is growing at an exponential rate. The second is velocity, the speed of Big Data. This also grows in direct connection with the volume, because processing units are expected to become faster as the world evolves. The third is the variety of data: only 20 percent of all data is structured, and only structured data can be analyzed by traditional approaches. Structured data is directly connected with the fourth characteristic, veracity, which is essential for the whole process to produce accurate results.

To end with, I would say that even if not many people have heard of it, Big Data is already a part of our lives and has been influencing the world we live in for many years. This influence can only grow in the next decades, until everybody has heard of it and of how decisions are made through Big Data.

How to convert .XLS files to .CSV and vice versa?

XLS (Microsoft Excel spreadsheet format)

A binary format used by Microsoft Excel for storing data.

CSV (Comma Separated Values)

A file format for storing data in text files. Each value is separated from the next by a delimiter (often , or ;). The content of a CSV file looks like the following lines:

Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38
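Because the delimiter and the decimal separator vary between files, a program that reads CSV has to be told about both. The following minimal sketch, using Python's standard csv module, reads the sample above (semicolon delimiter, decimal comma) and writes it back with a comma delimiter and a decimal point. The function names are made up for this example, and the column handling is hard-coded to the sample's Year and Length columns.

```python
import csv
import io

SAMPLE = (
    "Year;Make;Model;Length\n"
    "1997;Ford;E350;2,34\n"
    "2000;Mercury;Cougar;2,38\n"
)


def read_rows(text, delimiter=";"):
    """Parse delimited text into dictionaries and convert the numeric columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = []
    for row in reader:
        row["Year"] = int(row["Year"])
        # The sample uses a decimal comma, so swap it before converting.
        row["Length"] = float(row["Length"].replace(",", "."))
        rows.append(row)
    return rows


def to_comma_csv(rows):
    """Write the rows back as comma-delimited text."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


rows = read_rows(SAMPLE)
print(rows[0]["Make"], rows[0]["Length"])
print(to_comma_csv(rows))
```

Reading from a real file works the same way: open it with open(path, newline="") and pass the file object to csv.DictReader instead of the io.StringIO wrapper.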

Convert XLS to CSV

This is very simple:

1. Open the XLS (Excel spreadsheet) file with Microsoft Excel

2. Choose Save as (from the File or Office menu)

3. Enter a file name

4. Select CSV (Comma delimited) (*.csv) from Save as type (above File name)

5. Press Save, then OK and Yes. That is all.

Convert CSV to XLS file

1. Open the CSV file with Microsoft Excel

2. Choose Save as (from the File or Office menu)

3. Enter a file name

4. Select Excel 97-2003 Workbook (*.xls) from Save as type (above File name)

5. Press Save. That is all.

 

Web scraping in email marketing

Many businesses have difficulty bringing their products to market. Often their market is easy to define. In a well-defined market you can easily research your competitors and your customers and, what is very valuable, you can find details about every potential customer and make them an offer.

A real example:

If your business sells cleaning solutions for swimming pools, you may be interested in the contact details (contact person, phone number, email address, website) of all hotels with swimming pools near you, or even in the entire country. With this data you can do direct marketing or email marketing (sending service or product offers by email).

 

Why scraping and why TheWebMiner?

If you are reading this blog you are one of two things: you are either interested in web scraping and have studied this domain for quite a while, or you are just curious about this relatively new field and want to know what it is, how it’s done and especially why. Either way it’s fine!

In case you haven’t googled it already, I can tell you that data extraction (or scraping) is a technique in which a computer program extracts data from human-readable output coming from another program (Wikipedia). Basically, it can collect all the information on a certain subject from certain places. It’s sort of the equivalent of Ctrl+F at the scale of the whole internet. It’s nothing like the search engines we currently use, because it can extract the data into a file of a given format, such as Excel, CSV (comma separated values) or any other that the buyer wants, and it extracts only the relevant data, only the values that you are interested in.

I hope that by now you understand the concept and are wondering why you would need such data. Let’s take the example of an online store, which is pretty common nowadays. The manager, just like any manager, wants his business to thrive, so he has to keep up with the other online stores. This is where web scraping comes in: it is very useful for him to have, saved as Excel files, all the competitors’ prices for certain products, if not all of them. This way he can maintain a fair pricing policy and always be ahead of his competitors by knowing all of their prices and fluctuations. Of course, the data can also be collected manually, but this is not effective because we are talking about thousands of products, each with its own page. This is only one example of a situation in which scraping is useful, but there are hundreds, and each one of them is profitable for the company.

So far I’ve talked about what scraping is and why you should be interested in it; from now on I’m going to explain why you should use thewebminer.com. First of all, it’s easy: you only have to specify what type of data you want and from where, and we’ll manage the rest. At the start of the project you will receive a price estimate, followed by a time estimate. You will be in contact with us all the time, so you can find out the state of your project at any point. The pricing policy is reasonable and depends on factors like the project size or complexity. For very big projects a discount may apply, so that the total cost stays within reason.

I believe that thewebminer.com is able to handle any kind of situation or requirement from users all over the world and, to convince you, free samples are available for any project you may have, whatever uncertainty or doubt you may feel.

AB test with Google Analytics

An AB test is a comparison between two versions of the same page (this testing method is often used in online marketing). With this type of test you can measure which version of a page produces more conversions.

Google Analytics has a powerful tool for AB testing. This tool handles all parts of an AB test, including balancing visits across page versions (it distributes visits to each page version), recording data, reporting and deciding the winner.
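To illustrate what visit balancing and conversion measurement mean, here is a minimal sketch in Python. It is not how Google Analytics works internally; it only assumes that each visitor has some stable identifier and that a conversion is recorded as true or false per visit. The function names are hypothetical.

```python
import hashlib


def assign_version(visitor_id, versions=("A", "B")):
    """Deterministically map a visitor to one page version.

    Hashing the identifier spreads visitors roughly evenly and makes sure
    a returning visitor always sees the same version.
    """
    digest = hashlib.sha256(visitor_id.encode("utf-8")).digest()
    return versions[digest[0] % len(versions)]


def conversion_rates(visits):
    """Compute conversions / visits per version.

    visits is an iterable of (version, converted) pairs.
    """
    totals = {}
    for version, converted in visits:
        seen, won = totals.get(version, (0, 0))
        totals[version] = (seen + 1, won + int(converted))
    return {version: won / seen for version, (seen, won) in totals.items()}


recorded = [("A", True), ("A", False), ("B", True), ("B", True)]
print(assign_version("visitor-42"))
print(conversion_rates(recorded))
```

With only four visits the rates say nothing reliable. A real test needs enough traffic and a statistical significance check before a winner is declared, which is the "decision" part that the tool takes care of.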

To run an AB test with Google Analytics, follow these steps:

1. Build an alternative version of your original page (the page you want to test)

2. Register the experiment in the Content/Experiment section of Google Analytics

3. Put the experiment code only in the original version of the page.

You can find more details here: https://support.google.com/analytics/answer/1745147?ref_topic=1745207&rd=1