Author Archives: Bogdan Balcan

Web Scraping’s 2013 Review – part 2

As promised we came back with the second part of this year’s web scraping review. Today we will focus not only on events of 2013 that regarded web scraping but also Big data and what this year meant for this concept.

First of all, we could not talked about the conferences in which data mining was involved without talking about TED conferences. This year the speakers focused on the power of data analysis to help medicine and to prevent possible crises in third world countries. Regarding data mining, everyone agreed that this is one of the best ways to obtain virtual data.

Also a study by MeriTalk  a government IT networking group, ordered by NetApp showed this year that companies are not prepared to receive the informational revolution. The survey found that state and local IT pros are struggling to keep up with data demands. Just 59% of state and local agencies are analyzing the data they collect and less than half are using it to make strategic decisions. State and local agencies estimate that they have just 46% of the data storage and access, 42% of the computing power, and 35% of the personnel they need to successfully leverage large data sets.

Some economists argue that it is often difficult to estimate the true value of new technologies, and that Big Data may already be delivering benefits that are uncounted in official economic statistics. Cat videos and television programs on Hulu, for example, produce pleasure for Web surfers — so shouldn’t economists find a way to value such intangible activity, whether or not it moves the needle of the gross domestic product?

We will end this article with some numbers about the sumptuous growth of data available on the internet.  There were 30 billion gigabytes of video, e-mails, Web transactions and business-to-business analytics in 2005. The total is expected to reach more than 20 times that figure in 2013, with off-the-charts increases to follow in the years ahead, according to researches conducted by Cisco, so as you can see we have good premises to believe that 2014 will be at least as good as 2013.


Web Scraping’s 2013 Review – part 1

Here we are, almost having ended another year and having the chance to analyze the aspects of the Web scraping market over the last twelve months. First of all i want to underline all the buzzwords on the tech field as published in the Yahoo’s year in review article . According to Yahoo, the most searched items wore

  1. iPhone (including 4, 5, 5s, 5c, and 6)
  2. Samsung (including Galaxy, S4, S3, Note)
  3. Siri
  4. iPad Cases
  5. Snapchat
  6. Google Glass
  7. Apple iPad
  8. BlackBerry Z10
  9. Cloud Computing

It’s easy to see that none of this terms regards in any way with the field of data mining, and they rather focus on the gadgets and apps industry, which is just one of the ways technology can evolve to. Regarding actual data mining industry there were a lot of talks about it in this year’s MIT’s Engaging Data 2013 Conference. One of the speakers Noam Chomsky gave an acid speech relating data extraction and its connection to the Big Data phenomena that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a series of few simple factors: 1. It’s the analysis, not the raw data, that counts. 2. A picture is worth a thousand words 3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day) 4. Use a hybrid organizational model (We’re asleep already, soon)  let’s move 5. Train employees Other interesting declaration  was given by EETimes saying, “Data science will do more for medicine in the next 10 years than biological science.” which says a lot about the volume of required extracted data.

Because we want to cover as many as possible events about data mining this article will be a two parter, so don’t forget to check our blog tomorrow when the second part of this article will come up!

K-means clustering and why?

You have certainly noticed, even if you are in this business or not, that online marketing has became a science far more complicated than it was a decade ago, and also with a exponential rate of grow. Simply because of the need to keep up with the big corporations, small online companies have become more and more interested into the simple ways of maximizing the profits. One of these ways is to use K-means clustering.

Through this, one can discover hot words that led viewers to its webpage or to analyze the behavior of people browsing through pages of a site. Many other possibilities are available for the K-means but these two are the two most common used in online marketing industry.

The term of K-means refers to the number of average values that can be associated with a certain data domain.. when used with text , k-means can provide a very good way to organize the millions of words used by the customers to describe their visits and so, for marketers to actually determine what the users meant and felt at that certain time.

Once someone did understand what his clients are trying to do, he can adapt on the individual needs of users and suggest new ways that can bring more and more income without neglecting the obviously improvement of his services.


About sentiment analysis

Hello internet,

As you probably know, we deal everyday with data scraping, which is quite challenging, but, from time to time we tend to ask ourselves what else is there, and especially, can we scrap something else other than data? The answer is yes, we can, and today I am going to talk about how opinion mining can help you.

Opinion mining, better known as Sentiment analysis deals with automatically scan of a text and establishing its nature or purpose. One of the basic tasks is to determine whether the text itself is basically good or bad, like if it relates with the subject that is mentioned in the title. This is not quite easy because of the many forms a message can take.

Also the purposes that sentiment analysis can be to analyze entries and state the feelings it express (happiness, anger, sadness). This can be done by establishing a mark from -10 to +10 to each word generally associated with an emotion. The score of each word is calculated and then the score of the whole text.  Also, for this technique negations must be identified for a correct analysis.

Another research direction is the subjectivity/objectivity identification. This refers to classifying a given text as being either subjective or objective, which is also a difficult job because of many difficulties that may occur (think at a objective newspaper article with a quoted declaration of somebody). The results of the estimation are also depending of people’s definition for subjectivity.

The last and the most refined type of analysis is called feature-based sentiment analysis. This deals with individual opinions of simple users extracted from text and regarding a certain product or subject. By it, one can determine if the user is happy or not.

Open source software tools deploy machine learning, statistics, and natural language processing techniques to automate sentiment analysis on large collections of texts, including web pages, online news, internet discussion groups, online reviews, web blogs, and social media. Knowledge-based systems, instead, make use of publicly available resources to extract the semantic and affective information associated with natural language concepts.

That was all about sentiment analysis that TheWebMiner is considering to implement soon. I hope you enjoyed and you learned something useful and interesting.

New tools

Hello everybody!

Today we proudly present our new feature of the site, a tool that can not only be useful for large companies but to individual users reading this blog from the comfort of their homes.

For this new tool we had to redesign the tool section so we also hope that you will fond the new aspect, more simple and elegant.

Now, about the tool itself, we are confident that it will make good use to you because its main purpose is to find, in a webpage the most important section/article/data, which can be a difficult task especially in large websites or on pages that are filled with promotional content that is for no use to anyone. You will also see how easy it is to use it: just entering the URL and hitting the button “i’m lucky” the extractor will quote the text right in TheWebMiner tab.

That’s all for now, we hope that you will put to work this new tool, (that you may find at this link) and that it will save you from a lot of work!


Hello Big Data

If you are interested in the scraping business you have probably heard by now of a concept called Big Data. This is, as the name says, a collection of data that is so big and complex that it is very hard to process. Nowadays it is estimated that a normal Big Data cell would be around tens of exabytes, meaning around 10 to the power of 18 bytes, but it is estimated that until 2020 more than 18000 exabytes of data will be created.

There are many pros and cons of Big Data because, while some organisations wouldn’t know what to do with a collection of data bigger than few dozen terabytes, others wouldn’t consider analyzing data smaller than that. Another point of view, and one of the major cons that is attributed to Big Data is the fact that with such big amount of data, a correct sampling is very hard to do,  and so, major errors could interrupt the analyzing process. On the other hand, Big Data provided a revolution in science and more generalist, in economy. It is enough for us to think that only in Geneva, for the Large Hadron Collider there are more than 150 million sensors, delivering data about 40 million times per second about 600 collision per second. As for the Business sector, the one that we are interested in, we can say that  Amazon , handles each day queries from more than half a million third party sellers, dealing with millions of back end operations each day. Another example is that of Facebook who has to handle each day more than 50 billion photos.

Generally, there are 4 main characteristics of Big Data: First of them, and the most obvious one is the volume, of which i have already talked and said that it’s growing at an exponential rate. The second main characteristic is the speed of Big Data. This also grows in direct connection with the volume because it is expected that as the world evolve the processing units to be faster. A third category it is considered to be the variety of data. Only 20 percent of all data is structured data, and only this can be analyzed by traditional approach. The structured data is in direct connection with the fourth characteristic, the veridicity of them, which is essential for the whole process to have accurate results.

To end with I would say that even if not many have heard of it, Big Data is already  a part of our lives, influencing the world we live in for many years already. This influence can only grow in the next decades until everybody will be heard of it and how decisions are made through Big Data.

Why scraping and why TheWebMiner?

If you read this blog you are one of two things: you are either interested in web scraping and you have studied this domain for quite a while, or you are just curious about this relatively new field of interest and want to know what it is, how it’s done and especially why. Either way it’s fine!

In case you haven’t googled already this I can tell you that data extraction (or scraping) is a technique in which a computer program extracts data from human-readable output coming from another program (wikipedia). Basically it can collect all the information on a certain subject from certain places. It’s sort of the equivalent of ctrl+f, at the scale of the whole internet. It’s nothing like the search engines that we currently use because it can extract the data in a certain file, as excel, csv (coma separated values) or any other that the buyer wants, and also extracts only the relevant data, only the values that you are interested in.

I hope now that you understand the concept and you are wondering just why would you need such data. Well let’s take the example of an online store, pretty common nowadays, and of course the manager just like any manager wants his business to thrive, so, for that he has to keep up with the other online stores. Now the web scraping takes place: it is very useful for him to have, saved as excels all the competitor’s prices of certain products if not all of them. By this he can maintain a fair pricing policy and always be ahead of his competitors by knowing all of their prices and fluctuations.  Of course the data collecting can also be done manually but this is not effective because we are talking of thousand of products each one having its own page and so on. This is only one example of situation in which scrapping is useful but there are hundreds and each one of them it’s profitable for the company.

By now I’ve talked about what it is and why you should be interested in it, from now on I’m going to explain why you should use First of all, it’s easy: you only have to specify what type of data you want and from where and we’ll manage the rest. Throughout the project you will receive first of all an approximation of price, followed by a time approximation. All the time you will be in contact with us so you can find out at any point what is the state of your project. The pricing policy is reasonable and depends on factors like the project size or complexity. For very big projects a discount may be applicable so the total cost be within reason.

Now I believe that is able to manage with any kind of situation or requirement from users all over the world and to convince you, free samples are available at any project you may have or any uncertainty or doubt.