Excel Cheat Sheet

We often use Excel in data processing domain, and today I will share with you an Excel cheat sheet.

About sentiment analysis

Hello internet,

As you probably know, we deal everyday with data scraping, which is quite challenging, but, from time to time we tend to ask ourselves what else is there, and especially, can we scrap something else other than data? The answer is yes, we can, and today I am going to talk about how opinion mining can help you.

Opinion mining, better known as Sentiment analysis deals with automatically scan of a text and establishing its nature or purpose. One of the basic tasks is to determine whether the text itself is basically good or bad, like if it relates with the subject that is mentioned in the title. This is not quite easy because of the many forms a message can take.

Also the purposes that sentiment analysis can be to analyze entries and state the feelings it express (happiness, anger, sadness). This can be done by establishing a mark from -10 to +10 to each word generally associated with an emotion. The score of each word is calculated and then the score of the whole text. Also, for this technique negations must be identified for a correct analysis.

Another research direction is the subjectivity/objectivity identification. This refers to classifying a given text as being either subjective or objective, which is also a difficult job because of many difficulties that may occur (think at a objective newspaper article with a quoted declaration of somebody). The results of the estimation are also depending of people’s definition for subjectivity.

The last and the most refined type of analysis is called feature-based sentiment analysis. This deals with individual opinions of simple users extracted from text and regarding a certain product or subject. By it, one can determine if the user is happy or not.

Open source software tools deploy machine learning, statistics, and natural language processing techniques to automate sentiment analysis on large collections of texts, including web pages, online news, internet discussion groups, online reviews, web blogs, and social media. Knowledge-based systems, instead, make use of publicly available resources to extract the semantic and affective information associated with natural language concepts.

That was all about sentiment analysis that TheWebMiner is considering to implement soon. I hope you enjoyed and you learned something useful and interesting.

Can robots.txt protect website from scraping?

No. Robots.txt it’s a formal parsing guide for web crawlers (especially for search engines).

With robots.txt you can avoid to appear in unwanted page or sections in search engines, but this can’t stop bots to parse this pages.

What you have to know before requesting web scraping services?

Before you request web scraping services you have to know what are your needs (what data you need, structure of it and where you can find this data).

Step 1: Define what data you need?

Data needs depending on purpose, if you want to find new customers you probably need contact data from players in your industry. Also if you want to study your competitors you need to define who are they. Only after that you can select data sources (websites feeds or other electronic sources) for this extraction.

In many cases for discovering and defining data sources are used search engines like Google, Bing, Yahoo, and others.

Step 2: Structure of data

Data structure it’s directly linked to usage purpose. In many cases data structure it’s a table where a row represents an entity and a cell of this row represents a property of this entity. In other cases Data structure is a a chart or another graphic representation builder with data extracted from a web source.

Step 3: Number of data extraction

In many cases is needed one time data extraction. In other cases when you need a regular report, are needed periodically extractions.

If you have defined all of above points you are ready to request a quote and an amount estimation from this contact form.

New tools

Hello everybody!

Today we proudly present our new feature of the site, a tool that can not only be useful for large companies but to individual users reading this blog from the comfort of their homes.

For this new tool we had to redesign the tool section so we also hope that you will fond the new aspect, more simple and elegant.

Now, about the tool itself, we are confident that it will make good use to you because its main purpose is to find, in a webpage the most important section/article/data, which can be a difficult task especially in large websites or on pages that are filled with promotional content that is for no use to anyone. You will also see how easy it is to use it: just entering the URL and hitting the button “i’m lucky” the extractor will quote the text right in TheWebMiner tab.

That’s all for now, we hope that you will put to work this new tool, (that you may find at this link) and that it will save you from a lot of work!

What you need to know before building a web crawler

This article it’s for persons with technical skills that are some experience in the internet field.

A web spider or a web crawler is a specific program build and used for extracting data from a specific website.

Before start coding for a web crawler you need to know some info about next points:

1 what is your data source (website URL)

2.what it’s your crawling strategy:

If you get data from multiple URLs, How can you start maybe an index page, or a list with all of interest URL

3 common elements

Crawling is about finding common elements and extract different data from different locations (as URLs) contained in elements with the same structure like a div with a specific class or another HTML element.

4 programming language

What programming language you can use for this and what libraries you need to use for this. Also this it’s the point when you need to decide if you use a DOM parser or regex for finding common element and extract data from it.

DOM versus Regex in web scraping

In web scraping field there are two methods for data filtration. and the question is what is best?

The correct answer is, depends.

First is to use a DOM (Document Object Model) parser and second is regex matching (regex is an acronym from regular expressions). Both of them has advantages and disadvantages.

DOM Parser

Advantages	Disadvantages
Simple to code	Use more memory
	Sensitive at bad HTML

Regex

Advantages	Disadvantages
Insensitive at bad HTML	Use more CPU
	more difficult to code

Hello Big Data

If you are interested in the scraping business you have probably heard by now of a concept called Big Data. This is, as the name says, a collection of data that is so big and complex that it is very hard to process. Nowadays it is estimated that a normal Big Data cell would be around tens of exabytes, meaning around 10 to the power of 18 bytes, but it is estimated that until 2020 more than 18000 exabytes of data will be created.

There are many pros and cons of Big Data because, while some organisations wouldn’t know what to do with a collection of data bigger than few dozen terabytes, others wouldn’t consider analyzing data smaller than that. Another point of view, and one of the major cons that is attributed to Big Data is the fact that with such big amount of data, a correct sampling is very hard to do, and so, major errors could interrupt the analyzing process. On the other hand, Big Data provided a revolution in science and more generalist, in economy. It is enough for us to think that only in Geneva, for the Large Hadron Collider there are more than 150 million sensors, delivering data about 40 million times per second about 600 collision per second. As for the Business sector, the one that we are interested in, we can say that Amazon , handles each day queries from more than half a million third party sellers, dealing with millions of back end operations each day. Another example is that of Facebook who has to handle each day more than 50 billion photos.

Generally, there are 4 main characteristics of Big Data: First of them, and the most obvious one is the volume, of which i have already talked and said that it’s growing at an exponential rate. The second main characteristic is the speed of Big Data. This also grows in direct connection with the volume because it is expected that as the world evolve the processing units to be faster. A third category it is considered to be the variety of data. Only 20 percent of all data is structured data, and only this can be analyzed by traditional approach. The structured data is in direct connection with the fourth characteristic, the veridicity of them, which is essential for the whole process to have accurate results.

To end with I would say that even if not many have heard of it, Big Data is already a part of our lives, influencing the world we live in for many years already. This influence can only grow in the next decades until everybody will be heard of it and how decisions are made through Big Data.

Convert from SQLite to CSV

It’s very simple:

sqlite> .mode list
sqlite> .separator ,
sqlite> .output exported_file.csv
sqlite> select * from yourtable;
sqlite> .exit

You can use other separator. For Microsoft Excel default separator is “;”.

TheWebMiner video presentation

We have a new video presentation 🙂

TheWebMiner Blog

cloud web scraping tool