Most of the data we deliver is in CSV (comma-separated values) format. Each row represents one record, and the properties of that record are, you guessed it, separated by commas. For this to have any meaning, the properties must appear in the same order in every row.
Many of our clients require data in their own format, so here I want to walk through a case study in detail.
Suppose client X asks for a database of companies in a certain area, to load into the company CRM and use for developing a marketing strategy.
TheWebMiner can deliver this database, containing 10 million records, in a specific format described by the client, e.g. name:catalog no.:street address:postal code | name:catalog no.:street address:postal code. In this example, : separates the properties and | separates the records. This means the whole file is a single line, which makes it very difficult to process. For that reason, we use “\n” (newline) as the record separator while we work on the data, and replace “\n” with “|” only just before delivery.
A client may also want us to remove all records that lack a record number, and to sort the remaining records by one criterion, with a second criterion used to break ties.
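To make the two layouts concrete, here is a minimal sketch that turns a delivery-style single line back into one record per line for inspection. The two records are made-up sample data, and the sketch assumes that "|" never appears inside a field.

```shell
# Hypothetical two-record sample in the client's single-line layout.
delivered='Acme:7:1 Main St:10001|Globex:12:3 Elm Rd:10003'

# Swap the record separator "|" for a newline so that line-oriented
# tools (sed, sort, grep) can work on one record at a time.
records=$(printf '%s\n' "$delivered" | tr '|' '\n')
printf '%s\n' "$records"
```

This is only useful for checking a file by eye; the working copy stays newline-separated until delivery anyway.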
How do we do this?
Step 1: We export the data in the following format:
name:no. of category:address:postal code\n
name:no. of category:address:postal code\n
Step 2: We remove the records that have no record number (that is, an empty second field) using the following command:
sed -i=backup '/^.*::.*:.*$/d' filename
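The sed pattern above finds the empty field by looking for "::" followed by a later ":". As an alternative sketch (not the command used above), awk can test the second field directly, which may be easier to read. The sample rows and file names are made up for illustration.

```shell
# Hypothetical sample in the name:category:address:postal code layout.
workdir=$(mktemp -d)
printf '%s\n' \
  'Acme:12:1 Main St:10001' \
  'NoCategory::5 Side St:10002' \
  'Globex:7::10003' > "$workdir/records.txt"

# Keep only the rows whose second field (the category number) is non-empty.
# A row with an empty address (third field) is kept.
awk -F: '$2 != ""' "$workdir/records.txt" > "$workdir/filtered.txt"
cat "$workdir/filtered.txt"
```

Unlike sed -i, this writes to a new file, so the original export is left untouched.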
Step 3: We sort the data by category number (the primary criterion) and then by postal code (the secondary criterion) using the following command:
sort -t: -k2,2n -k4,4n oldfile > newfile
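A small run on made-up data shows what the two keys do: -t: sets the field separator, -k2,2n sorts numerically on field 2 only, and -k4,4n breaks ties numerically on field 4. The company names and values below are invented for illustration.

```shell
# Hypothetical sample; two rows share category 12 so the second key matters.
workdir=$(mktemp -d)
printf '%s\n' \
  'Initech:12:9 Oak Ave:20002' \
  'Acme:7:1 Main St:10001' \
  'Globex:12:3 Elm Rd:10003' \
  'Umbrella:100:4 Pine Ln:30004' > "$workdir/oldfile"

# Primary key: category number (field 2). Secondary key: postal code (field 4).
sort -t: -k2,2n -k4,4n "$workdir/oldfile" > "$workdir/newfile"
cat "$workdir/newfile"
```

The numeric flag matters here: a plain text sort would place category 100 before 12 and 7.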
Step 4: We replace “\n” with “|” using sed:
sed -i=backup ':a;N;$!ba;s/\n/|/g' filename
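The sed script above reads the whole file into its pattern space before substituting, and its one-line label syntax is understood by GNU sed. As a possible alternative (an assumption, not the method used above), the standard paste utility can join lines with a chosen delimiter. The sample rows and file names are invented.

```shell
# Hypothetical two-record sample, one record per line.
workdir=$(mktemp -d)
printf '%s\n' \
  'Acme:7:1 Main St:10001' \
  'Globex:12:3 Elm Rd:10003' > "$workdir/newfile"

# -s joins all lines of the file into one; -d sets the joining character.
# The output still ends with a single newline.
paste -s -d '|' "$workdir/newfile" > "$workdir/delivery.txt"
cat "$workdir/delivery.txt"
```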
Step 5: We archive it and send it to the client.
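Steps 2 to 5 can also be chained into one pipeline so that no intermediate files are edited in place. The sketch below uses made-up records and file names, and it assumes gzip as the archive tool, since the archive format depends on what the client accepts. The join script is the same one as in Step 4, split into separate -e expressions.

```shell
# Hypothetical export, one record per line (Step 1).
workdir=$(mktemp -d)
printf '%s\n' \
  'Initech:12:9 Oak Ave:20002' \
  'NoCategory::5 Side St:10002' \
  'Acme:7:1 Main St:10001' > "$workdir/export.txt"

# Step 2: drop rows with an empty second field.
# Step 3: sort by category number, then by postal code.
# Step 4: join all rows with "|".
sed '/^.*::.*:.*$/d' "$workdir/export.txt" \
  | sort -t: -k2,2n -k4,4n \
  | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/|/g' > "$workdir/delivery.txt"

# Step 5: compress the result (gzip is an assumption).
gzip -c "$workdir/delivery.txt" > "$workdir/delivery.txt.gz"
```

Decompressing with gzip -dc before sending is a cheap way to confirm that the archive holds the expected single line.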
Simple enough, right?
Using this technique, TheWebMiner has processed a Big Data set of more than 10 million records!
Image source: http://www.crossing-technologies.com/wp-content/uploads/2015/04/Big_data_image-300×256.jpg