Tag Archives: data mining

Elements of Statistical Learning Walkthrough


Data science can be an art: the art of identifying patterns and anticipating decisions before they are even taken, all with impressive accuracy. For our blog’s comeback I thought I should cover the more literary side of this science-art-craft and talk about some of the ground principles laid out in some of the finest books about data science.

In today’s article I will focus on a very well structured book by Trevor Hastie, Professor of Mathematical Sciences at Stanford University. The book, co-written with Robert Tibshirani and Jerome Friedman, is called The Elements of Statistical Learning: Data Mining, Inference, and Prediction. It sets out, and manages, to explain in detail how the growth of data led to the development of new tools in the field of statistics and spawned new areas such as data mining, machine learning, and bioinformatics. The book describes the important ideas in these fields in a common conceptual framework.


Although the approach is mainly statistical, the emphasis falls on concepts rather than on mathematics. Many examples are given, with an easy-to-understand use of color graphics. It is a valuable resource for statisticians and everyone interested in data mining in science or industry. The book’s coverage is broad, from supervised learning (better known as prediction) to unsupervised learning. Various topics are covered, including neural networks, support vector machines, classification trees and boosting, the last of which receives its first comprehensive treatment in a book of this kind.

All in all, I can certainly say that the presentation is not keen on mathematical aspects, and it does not provide a deep analysis of why a specific method works. Instead, it gives you some intuition about what a method is trying to do. And this is the reason why I like this book so much. Without going into the mathematical details of complicated algorithms, it summarizes all the necessary (and really important) things one needs to know. Sometimes you only understand a passage after doing a lot of research on the subject and coming back to the book. Nevertheless, the authors are great statisticians and certainly know what they are talking about!

Big Data and Data Mining Tools

Recently we tested a data mining tool that I want to write about today. It is called Datameer, and it’s a cloud app based on Hadoop, so we don’t need to install anything on our computers; all we need is the data that we want analyzed.

Step 1: Importing the data

To import any kind of data we must first select its format:


Step 2: A small configuration

Some settings concern the data format, others the way certain data types are detected. The program tries to detect each column’s type, and it is possible to add data types from a file:


Step 3: Some fine adjustments
If the program doesn’t detect the columns well, we can set them manually. A drawback of this program is that at this step we can adjust the data only by removing the records that don’t match the newly defined data type.


Step 4: Selecting the sample used for the preview


So this is all there is to do to add data into Datameer. Next, an Excel-like interface shows all the data.
Here we can find a few buttons responsible for the magic:

Column Dependency
Shows the relationships between different columns and, basically, whether one variable depends on another.

Clustering
Using this we can group similar data.
All the discovery is done by the program; we only have to specify the number of clusters that we want.
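
To give an idea of what grouping similar data into a chosen number of clusters involves, here is a minimal k-means sketch in Python. It is not how Datameer works internally (we don’t know which algorithm it uses); the sample points and the way the starting centroids are picked are assumptions made only for this illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Group 2-D points into k clusters with a plain k-means loop."""
    # Assumption: start from k evenly spaced points of the input so the
    # result is deterministic; real tools pick starting points more carefully.
    step = len(points) // k
    centroids = [points[i * step] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignments, centroids

# Made-up sample data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The only decision left to the user is k, the number of clusters, which matches what the tool asks for.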

Decision Tree
Builds a decision tree based on the data.

These are all the important functions of Datameer, but the true importance of this app lies not in the functions but in its ability to process a huge quantity of data.

The science behind an internet request

Altruism can be found in many shapes on the internet, especially on sites designed for user interaction, like blogs, forums or social networks. The giant Reddit even has a dedicated section, Random Acts of Pizza, which specializes in giving free pizza to strangers if the story they tell is worth one. It is fun, and the motto is as simple as that: “because … who doesn’t like helping out a stranger? The purpose is to have fun, eat pizza and help each other out. Together, we aim to restore faith in humanity, one slice at a time.”

This great opportunity raises an obvious question, though: what should one say to get free pizza and, furthermore, what should one say to get any kind of free stuff on the internet? A possible answer comes once again from the science of data mining. Researchers at Stanford University analyzed this intriguing problem, limiting themselves to Reddit posts.

By mining all of the section’s posts from 2010 until today and passing them through filters like sentiment analysis, politeness and, more importantly, whether they were successful or not, a pattern was established.

The resulting predictions reach up to 70% accuracy. Besides the sociological observations, like the positive results of longer posts or the negative results of very polite posts, it is interesting to look at the algorithm that made all this possible. It divides the narratives into five types, those that mention: money; a job; being a student; family; and a final group that includes mentions of friends, being drunk, celebrating and so on, which the team called “craving.”
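
As a rough illustration of sorting posts into these five narrative types, here is a small keyword-matching sketch in Python. The keyword lists are invented for this example and are not the lexicons used by the researchers; the function name is also hypothetical.

```python
import re

# Hypothetical keyword lists, invented for illustration only; the study's
# actual lexicons are not given in this article.
NARRATIVE_KEYWORDS = {
    "money": {"money", "rent", "bills", "broke", "paycheck"},
    "job": {"job", "work", "unemployed", "fired", "interview"},
    "student": {"student", "college", "university", "finals", "semester"},
    "family": {"family", "kids", "wife", "husband", "mother", "father"},
    "craving": {"friends", "drunk", "celebrate", "party", "birthday", "craving"},
}

def tag_narratives(post):
    """Return the narrative types whose keywords appear in the post."""
    words = set(re.findall(r"[a-z']+", post.lower()))
    return sorted(
        name for name, keywords in NARRATIVE_KEYWORDS.items()
        if words & keywords
    )

example = "I'm a college student, broke until my next paycheck, and finals are killing me."
tags = tag_narratives(example)
print(tags)  # ['money', 'student']
```

A post can fall into several types at once, which is why the function returns a list rather than a single label.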

This study plays a very important role in the analysis of how peers behave on the internet and opens a wide area of research for a better understanding of online consumers around the world.



TheWebMiner in French

Good day everyone, or should I rather say bonjour, because this week we launched the French version of TheWebMiner.com.

It is a certainty that the need for data increases every day in every possible direction, and we want to keep up with this trend. Although English is the language of the internet, we also want to reach users from smaller environments who might need our services, and because French is the official language in 29 countries it seemed an obvious choice.

So, from now on, along with the English version and the Romanian one (Romania being our company’s home country), a third version is available in the language menu in the upper right corner of our site.


We hope you will enjoy the experience and give us good feedback on our expansion.


Data Cleaning

If you are really into data mining, maybe you have wondered what happens to data after it is extracted: does it get delivered the way it is, or is there more?

The truth is that extraction is only one part of the process and it is followed by several others, including Data Cleaning, the subject of today’s article.

The need for such a process has always been present in scientific fields, where misleading results can induce false conclusions and defeat the initial purpose. Automation, however, arrived relatively recently, in the last two decades, when cleaning had to be applied to very large quantities of data.

For data to be considered of high quality it must fulfill a series of requirements such as:

  • Validity: the degree of correspondence with the usual business constraints. This is relatively easy to ensure by setting up specific checks such as data-type constraints, range constraints or mandatory constraints.
  • Decleansing: detecting errors and syntactically removing them for better programming.
  • Accuracy: the degree of conformity of a measure to a standard or a true value; this also requires an external set of data for comparison.
  • Completeness: the percentage to which all required measures are known.
  • Consistency: the degree to which a set of measures is equivalent across systems.
  • Uniformity: ensuring that all measurements use the same units of measure, along with some aspects of validation.
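
To make the validity and completeness requirements above a little more concrete, here is a minimal Python sketch. The records, field names and rules are all made up for the example; a real cleaning pipeline would take its constraints from the business side.

```python
# Hypothetical records and rules, made up for illustration.
records = [
    {"name": "Ana", "age": 34, "height_cm": 168.0},
    {"name": "Bob", "age": -5, "height_cm": 181.0},   # age out of range
    {"name": "", "age": 27, "height_cm": 175.0},      # mandatory field empty
    {"name": "Dan", "age": "41", "height_cm": None},  # wrong type, missing value
]

RULES = {
    "name": {"type": str, "mandatory": True},
    "age": {"type": int, "range": (0, 120)},
    "height_cm": {"type": float, "range": (30.0, 250.0)},
}

def validity_errors(record):
    """Check one record against mandatory, data-type and range constraints."""
    errors = []
    for field, rule in RULES.items():
        value = record.get(field)
        if value is None or value == "":
            if rule.get("mandatory"):
                errors.append(f"{field}: mandatory value missing")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                errors.append(f"{field}: {value} outside {low}..{high}")
    return errors

def completeness(rows, field):
    """Percentage of rows in which the field has a known value."""
    known = sum(1 for r in rows if r.get(field) not in (None, ""))
    return 100.0 * known / len(rows)

report = [validity_errors(r) for r in records]
name_completeness = completeness(records, "name")
print(report[1])          # ['age: -5 outside 0..120']
print(name_completeness)  # 75.0
```

Accuracy and consistency are not covered here, because checking them needs an external reference data set or a second system to compare against.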

This research area still has work ahead before all the challenges that optimization imposes are solved. Today, problems like error correction and the loss of information it causes, or the maintenance of cleansed data, still create serious issues. But with the advance of Big Data and the interest shown in this field by big companies such as IBM or Oracle, we can be optimistic and say that we are on the right track.

How to use regex in Vim?

We often need to process big text files (larger than 100 MB), and we discovered that the best text editor for this is Vim, or gVim (its graphical version, available on Windows). A powerful way to process text automatically is to use regular expressions (also called RegEx).

Using RegEx in Vim

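Vim doesn’t use the standard (Perl-style) RegEx syntax; it has its own flavor, in which, by default, characters such as +, ?, | and parentheses must be escaped to act as operators. We built a tool that converts standard regex to Vim regex. This tool is available here: RegEx to Vim.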
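
To show the kind of translation involved, here is a minimal Python sketch that converts a small subset of Perl-style syntax into Vim’s default (“magic”) syntax. It is not the code of our tool, and the function name is made up for this example. It only handles +, ?, |, groups, {n,m} counts and ~; bracket expressions are copied unchanged, and things like \b, lookarounds or lazy quantifiers are not handled at all.

```python
def to_vim_magic(pattern):
    """Translate a small subset of Perl-style regex into Vim "magic" syntax."""
    # Operators that Vim only recognizes when preceded by a backslash
    # ("~" is the reverse case: it is special in Vim and must be escaped).
    needs_backslash = {
        "+": r"\+", "?": r"\=", "|": r"\|",
        "(": r"\(", ")": r"\)", "{": r"\{", "~": r"\~",
    }
    # Escaped in Perl style to mean a literal; written bare in Vim.
    literal_when_escaped = set("+?|(){}")
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # "\+" is a literal plus in Perl style; Vim writes it as "+".
            # Other escapes such as "\d" or "\." are passed through as-is.
            out.append(nxt if nxt in literal_when_escaped else ch + nxt)
            i += 2
        elif ch == "[":
            # Assumption: simple bracket expressions only; copy through the
            # closing "]" unchanged (escaped "\]" inside is not handled).
            end = pattern.find("]", i + 2)
            end = len(pattern) - 1 if end == -1 else end
            out.append(pattern[i:end + 1])
            i = end + 1
        else:
            out.append(needs_backslash.get(ch, ch))
            i += 1
    return "".join(out)

print(to_vim_magic(r"(\d+)-(\w+)?"))  # \(\d\+\)-\(\w\+\)\=
print(to_vim_magic(r"a{2,3}|b\+"))    # a\{2,3}\|b+
```

The converted pattern can then be used in Vim after / or inside a :s command.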

We hope you find it useful.

Why do you need Facebook for your business?

This might be a relatively simple question but the complexity of the answer might surprise you!

First of all, if you have a business and you don’t have a Facebook page for it, well, I’m sorry to tell you that you might be among the last ones who don’t. Gone are the times when not everyone was on this social platform; now users concentrate on adding everything, from everywhere, to it. This includes businesses, places, trends, events, personalities and many other aspects of daily life, all with the purpose of simplifying our actions.

People tend to be skeptical about the success of a business page, but what they fail to understand is that even a small page with a small audience can make a difference. Research shows that merely six percent of all Facebook pages have more than ten thousand likes, and that is not a problem. Given time, the popularity of a page will grow, and more and more users will be interested in the information you provide. Another reason why people tend to avoid having a social media page for their business is that it’s hard to keep it updated all the time. Although constant posts will keep your fans happy, there is no direct correlation between posting frequency and the growth of the page.

On the other hand, there are plenty of reasons why your business needs Facebook. What it boils down to, though, is that this is a free opportunity to reach out to your audience in their preferred environment, improve your SEO rankings and visibility, and show off your business in a way that people can relate to. That being said, we want to introduce you to TheWebMiner Facebook page, where we constantly post technical news, updates about our tool or tips for our field of activity.

Web Scraping’s 2013 Review – part 2

As promised, we are back with the second part of this year’s web scraping review. Today we will focus not only on the events of 2013 that concerned web scraping but also on Big Data and what this year meant for this concept.

First of all, we could not talk about the conferences in which data mining was involved without talking about the TED conferences. This year the speakers focused on the power of data analysis to help medicine and to prevent possible crises in third world countries. Regarding data mining, everyone agreed that this is one of the best ways to obtain virtual data.

Also, a study by MeriTalk, a government IT networking group, commissioned by NetApp, showed this year that companies are not prepared for the information revolution. The survey found that state and local IT pros are struggling to keep up with data demands. Just 59% of state and local agencies are analyzing the data they collect, and less than half are using it to make strategic decisions. State and local agencies estimate that they have just 46% of the data storage and access, 42% of the computing power, and 35% of the personnel they need to successfully leverage large data sets.

Some economists argue that it is often difficult to estimate the true value of new technologies, and that Big Data may already be delivering benefits that are uncounted in official economic statistics. Cat videos and television programs on Hulu, for example, produce pleasure for Web surfers — so shouldn’t economists find a way to value such intangible activity, whether or not it moves the needle of the gross domestic product?

We will end this article with some numbers about the tremendous growth of data available on the internet. There were 30 billion gigabytes of video, e-mails, Web transactions and business-to-business analytics in 2005. The total is expected to reach more than 20 times that figure in 2013, with off-the-charts increases to follow in the years ahead, according to research conducted by Cisco. So, as you can see, we have good reason to believe that 2014 will be at least as good as 2013.


Web Scraping’s 2013 Review – part 1

Here we are, with another year almost over and a chance to analyze the Web scraping market over the last twelve months. First of all, I want to highlight the tech buzzwords published in Yahoo’s year in review article. According to Yahoo, the most searched items were:

  1. iPhone (including 4, 5, 5s, 5c, and 6)
  2. Samsung (including Galaxy, S4, S3, Note)
  3. Siri
  4. iPad Cases
  5. Snapchat
  6. Google Glass
  7. Apple iPad
  8. BlackBerry Z10
  9. Cloud Computing

It’s easy to see that none of these terms relates in any way to the field of data mining; they focus instead on the gadgets and apps industry, which is just one of the directions in which technology can evolve. As for the data mining industry itself, there was a lot of talk about it at this year’s MIT Engaging Data 2013 Conference. One of the speakers, Noam Chomsky, gave an acid speech about data extraction and its connection to the Big Data phenomenon that is also on everyone’s lips this year. He defined a good way to see if Big Data works by following a few simple factors:

  1. It’s the analysis, not the raw data, that counts.
  2. A picture is worth a thousand words.
  3. Make a big data portal (Not sure if Facebook is planning on dominating in cloud services some day).
  4. Use a hybrid organizational model (We’re asleep already, soon) let’s move
  5. Train employees.

Another interesting statement came from EETimes: “Data science will do more for medicine in the next 10 years than biological science.” This says a lot about the volume of extracted data that will be required.

Because we want to cover as many data mining events as possible, this article will be a two-parter, so don’t forget to check our blog tomorrow, when the second part will come up!

About sentiment analysis

Hello internet,

As you probably know, we deal every day with data scraping, which is quite challenging, but from time to time we ask ourselves what else is out there and, especially, whether we can scrape something other than data. The answer is yes, we can, and today I am going to talk about how opinion mining can help you.

Opinion mining, better known as sentiment analysis, deals with automatically scanning a text and establishing its nature or purpose. One of the basic tasks is to determine whether the text itself is basically positive or negative, for instance in how it relates to the subject mentioned in the title. This is not easy because of the many forms a message can take.

Sentiment analysis can also be used to analyze entries and state the feelings they express (happiness, anger, sadness). This can be done by assigning a score from -10 to +10 to each word generally associated with an emotion. The score of each word is calculated, and then the score of the whole text. For this technique, negations must also be identified for a correct analysis.
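
The word-scoring idea can be sketched in a few lines of Python. The lexicon and its scores are invented for this example, and the negation rule (a negation flips only the very next word) is a deliberate simplification, not a description of any particular tool.

```python
import re

# Hypothetical lexicon: scores from -10 to +10, invented for illustration.
LEXICON = {"great": 8, "happy": 7, "good": 5, "bad": -5, "sad": -6, "awful": -9}
NEGATIONS = {"not", "never", "no"}

def score_text(text):
    """Sum word scores, flipping the sign of a word that follows a negation."""
    total = 0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        score = LEXICON.get(word, 0)
        if score:
            total += -score if negate else score
        # Assumption: a negation only affects the very next word.
        negate = False
    return total

positive = score_text("The service was great and the staff were happy to help.")
negated = score_text("The food was not good, and the room was awful.")
print(positive)  # 15
print(negated)   # -14
```

Without the negation check, “not good” would count as +5 and the second sentence would score -4 instead of -14, which is why negations have to be identified.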

Another research direction is subjectivity/objectivity identification. This refers to classifying a given text as either subjective or objective, which is also a difficult job because of the many complications that may occur (think of an objective newspaper article that quotes somebody’s statement). The results of the estimation also depend on people’s definition of subjectivity.

The last and most refined type of analysis is called feature-based sentiment analysis. It deals with the individual opinions of ordinary users, extracted from text, regarding a certain product or subject. With it, one can determine whether the user is happy or not.

Open source software tools deploy machine learning, statistics, and natural language processing techniques to automate sentiment analysis on large collections of texts, including web pages, online news, internet discussion groups, online reviews, web blogs, and social media. Knowledge-based systems, instead, make use of publicly available resources to extract the semantic and affective information associated with natural language concepts.

That was all about sentiment analysis, which TheWebMiner is considering implementing soon. I hope you enjoyed it and learned something useful and interesting.