Automatic Data Processing

Most of the data we deliver is in CSV (comma-separated values) format. Each row represents a value, and the different properties of that value are, you guessed it, separated by commas. Of course, for this to have any meaning, the order of the properties must be the same in every row.
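To make the row-and-property layout concrete, here is a minimal sketch of picking CSV columns in a different order with awk. The function name reorder_csv, the sample rows and the column meanings (name, city, phone) are all made up for illustration, and the plain comma split assumes no field contains a quoted comma.

```shell
# Hypothetical helper: read CSV on stdin and print the columns whose
# 1-based positions are given as arguments, in that order.
# Assumption: fields never contain commas inside quotes.
reorder_csv() {
  awk -F',' -v OFS=',' -v cols="$*" '
    BEGIN { n = split(cols, idx, " ") }
    {
      line = $(idx[1])
      for (i = 2; i <= n; i++) line = line OFS $(idx[i])
      print line
    }
  '
}

# Made-up sample rows with the layout name,city,phone.
# Print phone first, then name.
printf 'Acme,Bucharest,0211234567\nGlobex,Cluj,0264123456\n' | reorder_csv 3 1
```

Because every row keeps the same property order, a position such as 3 refers to the same property on every line, which is what makes this kind of reshaping possible.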

Many of our clients require data in their own format, so here I want to describe a case study in detail:

Suppose client X asks for a database of certain companies in a given area, to load into the company's CRM and use in developing a marketing strategy.

Continue reading

2015 retrospective

Only now, at the end of 2015, can we appreciate the magnitude of a whole year and what we managed to accomplish in it. For us at TheWebMiner this was a full year, marked by new experiences, new connections and, most importantly, successful data extractions.

More than that, 2015 was a productive year. After running an internship with several students from the Bucharest Academy of Economic Studies, we expanded our team by one member, a devoted programmer just as passionate about this science as any of us.

Continue reading

Where are the flying cars?

2015 came and is now almost gone, and we can see that we were mostly misled by the expectations the media industry made popular: hoverboards, flying cars and laser guns.

It’s obvious that we all wish for such cool gadgets and say we are eager to use them, but are we, actually? On this matter data science has a say: it reveals people’s hidden wishes by looking not at what they say or wish for, but at what they actually do to reach a goal. So far we have determined that people love to read or watch SF but don’t actually want to experiment with dangerous technologies that can be unstable. As much as The Jetsons inspired a sense of security, things are not quite so, and from a Darwinian point of view avoiding such risks is the most normal thing to do.

Continue reading

How to mount an existing EBS volume on an Amazon EC2 instance

By existing, I mean a non-empty EBS volume, a device that is already formatted :)

It’s very simple:

1. Use the lsblk command to view all attached devices:

[ec2-user ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdf 202:80 0 22G 0 disk
xvda1 202:1 0 8G 0 disk /

2. Create a directory and mount the device:

[ec2-user ~]$ sudo mkdir /mnt/my-data
[ec2-user ~]$ sudo mount /dev/xvdf /mnt/my-data
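When several volumes are attached, it can help to pick out the ones that are not mounted yet before running step 2. The following is a minimal sketch, not part of the original procedure: the function name unmounted_devices is hypothetical, and it assumes the seven-column layout shown in step 1, where the MOUNTPOINT column is simply missing for an unmounted device.

```shell
# Hypothetical helper: print the names of devices that have no mount point.
# Assumption: input has a header line and 7 columns, with MOUNTPOINT last,
# so an unmounted device shows up as a row with only 6 fields.
unmounted_devices() {
  awk 'NR > 1 && NF == 6 { print $1 }'
}

# Sample input copied from the listing in step 1.
unmounted_devices <<'EOF'
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvdf 202:80 0 22G 0 disk
xvda1 202:1 0 8G 0 disk /
EOF
```

On a real instance the input would come from something like lsblk -l, whose list format avoids the tree-drawing characters that the default output puts in front of partition names. Columns can differ between lsblk versions, so check the header before relying on the field count.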

Degree thesis subjects

I recently saw an event at a university in Romania (Universitatea Politehnica Bucuresti) that aims to help students choose the subject of their degree thesis. At this event, companies are invited to present themes to the students. Below is a short list of themes related to our industry:

1. Automatic website classification

Possible categories: e-commerce, company website, news/blog, other.

2. Detecting website structure (and representing as a tree)

E.g. the first level of an online store contains the main categories, the second level the subcategories, and the n-th level the product pages. The entire website can be represented as a tree.

3. Logo detection on the internet

When detecting logos on a web page, several issues can occur. For example: many logos in the same image, or scaled logos.

Please let us know if you want to develop one of the themes above, and we will help you with the results of our research.

 

An Internet filter engine

I have always thought that companies have needs different from those of end users (see classification by target, B2C or B2B), and I think this hypothesis also holds on the internet. These days I have been busy developing TheWebMiner Filter, and in the following lines I want to talk about internet search.

What is internet searching?

What I understand by search (and maybe many of you do too) is sorting. Google, Bing and other search engines try hard to find the most relevant page for our query, and the results are impressive. A colleague of mine told me that if you describe a movie's plot in a Google search, Google will find the movie's Wikipedia page. But this is the end user's point of view.

Continue reading