2015 retrospective

Only now, at the end of 2015, can we appreciate the full span of the year and what we managed to accomplish in it. For us at TheWebMiner it was a full year, marked by new experiences, new connections and, most importantly, successful data extractions.

More than that, 2015 was a productive year. After running an internship with several students from the Bucharest Academy of Economic Studies, we expanded our team by one member, a devoted programmer just as passionate about this science as any of us.

Challenging users to data science

A well-known problem in the data science world is the mismatch between the people who have data and the people who know how to use it. Data scientists, for their part, complain about the difficulties of the scraping process and, more precisely, of obtaining the data in the first place. Kaggle was created to address this mismatch by mediating a connection between data and analysts.

The platform was built on this principle. It sets up competitions in which users submit solutions for diverse data sets in order to win points and, in the end, money.

On the other side, the person who uploads the data receives a number of candidate analyses of the data set and can choose the one best suited to their interests.

A very interesting case study, and a powerful demonstration of Kaggle’s capabilities, is the platform’s collaboration with NASA and the Royal Astronomical Society. The challenge was to find an algorithm for measuring the distortions in images of galaxies, which would help scientists prove the existence of dark matter. It seems that within a week of the start of the project, participants had matched the accuracy of the algorithms provided by NASA, which were the result of studies begun back in 1934 and continued up to that time. More than this, within three months a user submitted an algorithm that was more than 300% more accurate than any of the previous versions. The whole case study can be found here.

Essentially, the fun thing about Kaggle is that the winners of the competitions are folks from around the world with a knack for problem solving, and not always with degrees in mathematics. Degrees don’t matter on Kaggle; all that matters is results.



Hello Big Data

If you are interested in the scraping business, you have probably heard by now of a concept called Big Data. This is, as the name says, a collection of data so big and complex that it is very hard to process. Nowadays a typical Big Data set is estimated at around tens of exabytes, where one exabyte is 10 to the power of 18 bytes, and it is estimated that by 2020 more than 18,000 exabytes of data will have been created.
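To get a feel for these magnitudes, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, and the helper name `exabytes_to_bytes` is ours, not something taken from a library.

```python
# Assumption: SI (decimal) units, so 1 exabyte = 10**18 bytes
# and 1 zettabyte = 10**21 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_ZETTABYTE = 10**21


def exabytes_to_bytes(exabytes):
    """Convert a size in exabytes to bytes (hypothetical helper)."""
    return exabytes * BYTES_PER_EXABYTE


# The 18,000 exabytes projected for 2020, expressed in bytes.
projected = exabytes_to_bytes(18_000)
print(f"{projected:.1e} bytes")

# The same amount in zettabytes, the next unit up.
print(projected // BYTES_PER_ZETTABYTE, "zettabytes")
```

In other words, 18,000 exabytes is 1.8 × 10^22 bytes, or 18 zettabytes.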

There are many pros and cons to Big Data, and even the threshold is relative: while some organisations wouldn’t know what to do with a collection of data bigger than a few dozen terabytes, others wouldn’t consider analyzing anything smaller than that. One of the major cons attributed to Big Data is that with such a large amount of data, correct sampling is very hard to do, so major errors can disrupt the analysis. On the other hand, Big Data has brought a revolution in science and, more generally, in the economy. It is enough to think that in Geneva alone, the Large Hadron Collider has more than 150 million sensors, delivering data 40 million times per second about some 600 million collisions per second. As for the business sector, the one we are interested in, Amazon handles queries from more than half a million third-party sellers and millions of back-end operations each day. Another example is Facebook, which has to handle more than 50 billion photos from its users.
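The scale of the Large Hadron Collider figures is easier to grasp when multiplied out. The sketch below is a rough back-of-the-envelope calculation only: it assumes that every sensor reports a value on each of the 40 million readouts per second, which is a simplification, and the variable names are ours.

```python
# Figures quoted in the text above.
sensors = 150_000_000              # more than 150 million sensors
readouts_per_second = 40_000_000   # data delivered 40 million times per second

# Assumption: each sensor produces one reading per readout.
readings_per_second = sensors * readouts_per_second
print(f"{readings_per_second:.1e} readings per second")
```

Under that assumption the detectors would produce about 6 × 10^15 individual readings every second, which hints at why sampling and filtering matter so much at this scale.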

Generally, Big Data is described by four main characteristics. The first, and the most obvious, is volume, which I have already discussed and which is growing at an exponential rate. The second is velocity, the speed of Big Data. It grows in direct connection with volume, because processing units are expected to become faster as the world evolves. The third is the variety of the data. Only 20 percent of all data is structured, and only structured data can be analyzed with traditional approaches. Structured data is in direct connection with the fourth characteristic, veracity, which is essential if the whole process is to give accurate results.

To end with, I would say that even if not many people have heard of it, Big Data is already a part of our lives and has been influencing the world we live in for many years. This influence can only grow in the coming decades, until everybody has heard of Big Data and of how decisions are made through it.