Category Archives: Data & Research

Get started with microformats

Microformats are small patterns that can be embedded into your HTML to make common published items, like people, events, dates or tags, easier to recognize and process. Web content is written for humans, so processing it automatically is hard; microformats simplify the task by attaching explicit semantics, paving the way for more reliable automated processing. They bring many advantages, the most crucial of which are described below.
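As an illustration, here is how a person might be marked up with the h-card microformat and how the fields can be pulled out with Python's standard html.parser. A minimal sketch: the class names follow the h-card convention, but the person and the values are made up.

```python
from html.parser import HTMLParser

# A made-up snippet using h-card microformat class names.
HTML = """
<div class="h-card">
  <span class="p-name">Jane Example</span>
  <a class="u-url" href="https://example.com">example.com</a>
</div>
"""

class HCardParser(HTMLParser):
    """Collects the text of elements whose class starts with 'p-'."""
    def __init__(self):
        super().__init__()
        self.fields = {}        # property name -> text
        self._current = None    # property currently being captured

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "").split()
        for c in classes:
            if c.startswith("p-"):
                self._current = c

    def handle_data(self, data):
        if self._current:
            self.fields[self._current] = data.strip()
            self._current = None

parser = HCardParser()
parser.feed(HTML)
print(parser.fields)  # {'p-name': 'Jane Example'}
```

A real consumer would also handle `u-*` and `dt-*` properties and nested items, but the idea is the same: the class names carry the semantics.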

I should mention here that microformats are a huge relief in web scraping, since they define lightweight standards for declaring information in any web page. HTML5 builds on the same idea with Microdata, which lets you define custom item types and set properties on them.

Now that you know what microformats are, we should focus on getting started. A really useful, quick and detailed guide can be found here, and more complex tasks are also covered. The only thing left is to wish you good luck implementing them.

Big Data and Data Mining Tools

Recently we tested a Data Mining tool about which I want to write today. It is called Datameer, and it's a cloud app based on Hadoop, so we don't need to install anything on our computers; we only need the data that we want analyzed.

Step 1: Importing the data

To import data of any kind, we must first select its format:

datameer0

Step 2: A small configuration

Some settings concern the data format, others the way certain data types are detected. The program tries to detect each column's type, and data types can also be added from a file:

datameer0.1

Step 3: Some fine adjustments

If the program doesn't detect the columns well, we can set them manually. A downside of this program is that, at this step, data can only be adjusted by removing the records that don't match the recently defined data type:

datameer1

Step 4: Selecting the sample used for preview

datameer2

So this is all there is to adding data into Datameer. From here on, an Excel-like interface shows all the data.
Here we can find a few buttons responsible for the magic:

Column Dependency
Shows the relation between different columns, essentially whether one variable depends on another.

Clustering
Using this we can group similar data.
The discovery itself is done by the program; we only have to specify the number of clusters we want.

Decision Tree
Builds a decision tree based on the data.
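Datameer's internals are not public, but grouping numeric data into a user-chosen number of clusters, as above, is typically done with k-means. A minimal sketch in plain Python, with invented one-dimensional data:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Group 1-D points into k clusters with the classic k-means loop."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)       # pick k distinct starting centers
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

data = [1.0, 1.2, 0.8, 9.9, 10.1, 10.3]   # invented measurements
print(kmeans(data, 2))                     # two centers, near 1.0 and 10.1
```

Real tools work on multi-dimensional rows and pick smarter starting centers, but the assign-then-update loop is the same idea.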

These are all the important functions of Datameer, but the true importance of this app lies not in the functions themselves but in its ability to process a huge quantity of data.

Processing of large text files

At TheWebMiner we often need to process large text files, and by large I mean files of a few hundred megabytes or more. Of all the text-processing tools that we've tested so far, we concluded that the best was Vim, or gVim (the Windows version of this famous editor).
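Editor aside, when a file does not fit comfortably in memory the usual trick is to stream it line by line rather than loading it whole. A minimal Python sketch (the function name and the filter are illustrative, not from any particular tool):

```python
def count_matching_lines(path, needle):
    """Stream a file line by line so memory use stays flat."""
    count = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:              # reads one buffered line at a time
            if needle in line:
                count += 1
    return count
```

The same pattern (iterate, transform, write out) scales to multi-gigabyte files, because only one line is held in memory at a time.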

Regular Expressions

Another useful tool in file processing is regular expressions, or, more simply, RegEx. These expressions help us find, or find and replace, pieces of text of a certain format, all automatically. Combining the two tools raises a new question.
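For instance, a find-and-replace that rewrites dates with Python's re module; the pattern and the sample text here are invented for illustration:

```python
import re

text = "Released on 05/11/2014, updated 23/01/2015."
# Rewrite DD/MM/YYYY as YYYY-MM-DD using numbered capture groups.
iso = re.sub(r"(\d{2})/(\d{2})/(\d{4})", r"\3-\2-\1", text)
print(iso)  # Released on 2014-11-05, updated 2015-01-23.
```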

How do we use Regular Expressions in Vim?

Vim has its own format for RegEx, so standard regular expressions cannot be used in it directly, but we have created and put at your disposal a converter for this purpose. You can find the converter on our site (www.thewebminer.com/regex-to-vim), and we hope it will be of help.
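As a taste of the differences: in Vim's default "magic" mode, the quantifier and grouping characters +, ?, (, ), | and { must be backslash-escaped to act as metacharacters, while ., * and [ ] work as usual. A naive sketch of the escaping step (it ignores character classes and input that is already escaped, which a real converter must handle):

```python
def to_vim_magic(pattern):
    """Naively escape the metacharacters that Vim treats as literal by default."""
    out = []
    for ch in pattern:
        if ch in "+?(){|}":
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)

print(to_vim_magic(r"(\d+)-(\d+)"))  # \(\d\+\)-\(\d\+\)
```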

Get ready to adapt your business to the future!


Recently, while browsing the internet, I stumbled upon an article that captured my whole attention. Articles about how life is going to change in the near future and how technology is becoming more and more a part of our lives are easy to find, but this one focuses on business development, in the near and more distant future, in all of its aspects, some of which I want to share with you.

First of all, we want to remind you that change is mandatory. Maintaining the same business plan in a company for a long time can only lead to stagnation and, in the end, to failure, as seen in the cases of Kodak or, more recently, BlackBerry.

With these examples in mind, we must not fear embracing promising new technologies to stay ahead of competitors. The job market will also evolve, creating new jobs that today sound simply weird, like Nostalgist, Simplicity Expert or End-of-Life Therapist, while Automation replaces more than 2 billion of today's jobs.

Coming to the advertising sector, we can say the revolution has already become part of our lives: targeting algorithms deliver the commercials best suited to us, culminating in knowing us better than our closest ones do.

There is much to tell about how things are predicted to change in a few decades, and we can't stay ahead of everything, but the least we can do is be prepared, and optimistic about change; after all, curiosity is our greatest gift.


Prescriptive Analytics is what really matters

I don’t know how much you have heard about Prescriptive Analytics, because it is not as popular as Descriptive and Predictive Analytics, but it surely has the power to change how we treat Big Data.

Taking a blunt look at the situation, we can say that Prescriptive Analytics is the new term for the step from analytics to knowledge in the data-to-knowledge pyramid. It builds on predictive analytics, the step below it, which utilizes a variety of statistical, modeling, data mining, and machine learning techniques to study recent and historical data, thereby allowing analysts to make predictions about the future. As we know, big data brings a huge amount of information, the majority of which is useless, hence the necessity for this new service.

The purpose of prescriptive analytics is not to tell you what is going to happen in the future but, because of its probabilistic nature, to inform you of what MIGHT happen and what to do about it, based on a predictive model with two additional components: actionable data and a feedback system that tracks the outcome produced by the action taken.
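That loop can be sketched in a few lines: predict, turn the prediction into an action, then feed the observed outcome back in. A toy example; the "model", the threshold and the action names are entirely invented:

```python
def predict(history):
    """A stand-in model: forecast the next value as the recent average."""
    recent = history[-3:]
    return sum(recent) / len(recent)

def recommend(forecast, threshold=100.0):
    """Turn the prediction into an action -- the prescriptive step."""
    return "restock" if forecast > threshold else "hold"

history = [90.0, 110.0, 130.0]        # invented demand figures
action = recommend(predict(history))  # forecast 110.0 -> "restock"
observed = 120.0
history.append(observed)              # feedback: the outcome is tracked
print(action)
```

A real system would use an actual statistical model and compare the recommended action's outcome against alternatives, but the predict / act / feed-back structure is the defining feature.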

This new type of analytics was first introduced in 2013, after Descriptive Analytics had been defined as the simplest class of analytics, one that allows you to condense big data into smaller, more useful nuggets of information, and Predictive algorithms as the next step in reducing information.

IBM’s vision is that descriptive analytics allows an understanding of what has happened, while advanced analytics, consisting of both predictive and prescriptive analytics, is where there is real impact on the decisions made by businesses every day.

 

Latest mobile app trends

The mobile app market has changed over the years in many unexpected ways, but if there is one thing everyone expected, it is that the market keeps growing. This rather new industry has expanded in every direction, pushing the limits of creators’ imagination and of mobile hardware capabilities.

The TheWebMiner team has put together a series of graphics showing not only the evolution of the mobile app market for Android and iOS but also the most important trends to follow. The presentation can also be viewed here.

In conclusion, we can certainly say the app market holds a steady position over the mobile web market, always searching for new possibilities and areas to expand into, a few of which will soon go mainstream, accustoming users to concepts like the Internet of Things or Mobile Payments in everyday situations.

TheWebMiner.com now offers structured lists of all apps on Google Play and iTunes, in any format that suits your needs, with respect to any indicator on the site (you can find the data here)!

The science behind an internet request

Altruism can be found in many shapes on the internet, especially on sites designed for user interaction, like blogs, forums or social networks. The giant Reddit even has a special section, Random Acts of Pizza, specialized in giving free pizza to strangers if the story they tell is worth one. It is fun, and the motto is as simple as that: “because … who doesn’t like helping out a stranger? The purpose is to have fun, eat pizza and help each other out. Together, we aim to restore faith in humanity, one slice at a time.”

This great opportunity raises an obvious question: what should one say to get free pizza, and, furthermore, what should one say to get any kind of free stuff on the internet? A possible answer comes once again from the science of data mining: researchers at Stanford University analyzed this intriguing problem, limited to Reddit posts.

By mining all the section’s posts from 2010 until today and passing them through filters like sentiment analysis, politeness, and, more importantly, whether they were successful or not, a pattern was established.

The resulting predictability rate reaches up to 70% accuracy, and besides the sociological observations, like the positive effect of longer posts or the negative effect of very polite posts, it is interesting to look at the algorithm that made all this possible: it divides the narratives into five types, those that mention money; a job; being a student; family; and a final group that includes mentions of friends, being drunk, celebrating and so on, which the team called “craving.”
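The paper's actual model is more involved, but the narrative split can be sketched with simple keyword matching. The keyword lists below are invented stand-ins, not the study's real lexicons:

```python
# Invented stand-in keywords for the five narrative types from the study.
NARRATIVES = {
    "money":   ["broke", "rent", "paycheck"],
    "job":     ["job", "interview", "laid off"],
    "student": ["exam", "college", "semester"],
    "family":  ["kids", "wife", "mother"],
    "craving": ["friends", "drunk", "celebrate"],
}

def classify(post):
    """Return every narrative type whose keywords appear in the post."""
    text = post.lower()
    return [name for name, words in NARRATIVES.items()
            if any(w in text for w in words)]

print(classify("Broke until payday and my kids are hungry"))
# ['money', 'family']
```

On top of labels like these, the researchers could then measure which narratives correlate with a request actually being granted.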

This study plays a very important role in the analysis of peer behavior on the internet and opens a wide area of research for better understanding online consumers around the world.


Challenging users to data science

A well-known problem in the data science world is the mismatch between people who have data and people who know how to use it. On the other hand, data scientists complain about the difficulties of the scraping process, or, more exactly, of obtaining the data. To bridge this mismatch, Kaggle was created, mediating a connection between data and analysts.

The platform was born on these principles and creates a competition between users, who upload solutions to diverse data sets to win points and, in the end, money.

On the other side, the uploader of the data gets a number of possible analysis solutions for his data sets, from which he can choose the most appropriate to his interests.

A very interesting case study, and a powerful demonstration of Kaggle’s capabilities, is the platform’s collaboration with NASA and the Royal Astronomical Society, in which the challenge was to find an algorithm for measuring the distortions in images of galaxies, so that scientists could prove the existence of dark matter. Within a week of the project’s start, users had matched the accuracy of the algorithms provided by NASA, obtained in studies begun back in 1934 and continued since. More than that, within three months of the start of the project, a user provided an algorithm more than 300% more accurate than any of the previous versions. The whole case study can be found here.

Essentially, the fun thing about Kaggle is that the winners of the competitions are folks from around the world with a knack for problem solving, and not always degrees in mathematics. Degrees don’t matter on Kaggle; all that matters is the result.


Enigma Analytics

Without any introduction, we can certainly say that Enigma is a tool no data enthusiast should ignore. First introduced to the wide public at TechCrunch Disrupt NY 2013, where this start-up was the grand winner, it has gained popularity through its simplicity of use and the wide availability of its content.

Enigma allows its users to explore a vast amount of data that is publicly available but not easy to obtain. The service pulls its data from more than 100,000 data sources, a major advantage being a deceptively simple process of sifting through all the information: a quick search for a person’s name or company brings up multiple detailed sources of information, and jumping in and playing with the data is thoughtfully executed.

By now, the excellent, simple design and the usefulness of having the information in one place have brought the company partnerships with Harvard Business School, research firm Gerson Lehrman Group, S&P Capital IQ, and newly minted strategic investor The New York Times.

Although it has by now proven itself a very useful tool, Enigma has its ups and downs. The biggest downside is that it only has databases collected from the American government and American local authorities, which is great because those datasets are public and free, but they are not very useful for researchers from other countries, unless they are studying their country’s relations with America. Second, its simple design can be a bit confusing at first, because it’s a new type of application and not all of its functions are clear. However, this can be avoided if, before browsing the site, you first visit the support section.

All in all, we have reached the verdict that Enigma is a great app if you are interested in public data about America that is not easy to obtain otherwise.

Companies join forces against FCC

After years of pressure from ISPs, net neutrality is under threat by the FCC itself. Chairman Tom Wheeler promised to revive the Open Internet Order after it saw an unceremonious defeat in January, but a leaked version of his latest proposal would let companies pay ISPs for a “fast lane” to subscribers, undermining the spirit of the original rules, which barred companies from discriminating between services. Despite Wheeler’s reassurances, this new proposal is the exact opposite of net neutrality. It could undermine both the companies of today and the startups of tomorrow. It might also be exactly the push activists need to fight back, according to The Verge.

As the Washington Post suggests, more than 150 internet firms are protesting in a letter to the Federal Communications Commission, asking federal regulators to reconsider a proposal that critics fear would allow Internet providers to charge for faster, better access to consumers. The list includes Amazon, Facebook, Google and Microsoft, along with dozens of other firms that called the prospect of paid fast lanes “a threat to the Internet.”

With just a week to go before the Federal Communications Commission meets to consider its proposed new rules for ISPs, the letter represents a late attempt by Silicon Valley to take a stance on the open Internet.

“Instead of permitting individualized bargaining and discrimination,” the companies wrote, “the commission’s rules should protect users and Internet companies on both fixed and mobile platforms against blocking, discrimination and paid prioritization, and should make the market for Internet services more transparent.”

The main question is whether a slow-down protest would have any impact. But it is undoubtedly worth starting a broader conversation about what the Internet community can do together to protest the FCC’s proposed rules.