If you are really into data mining, maybe you have wondered what happens to data after it is extracted: is it delivered as-is, or is there more to the process?
The truth is that extraction is only one part of the process, and it is followed by several others, including Data Cleaning, the subject of today’s article.
The need for such a process has always existed in scientific fields, where misleading results can induce false conclusions and defeat the original purpose of a study. Its automation, however, is relatively recent: only in the last two decades has cleaning had to be applied to very large quantities of data.
For data to be considered of high quality, it must fulfill a series of requirements, such as:
- Validity: the degree to which the data conforms to the usual business constraints. This is relatively easy to check by setting up specific rules such as data-type, range, or mandatory constraints (a small sketch follows this list).
- Decleansing: detecting errors and removing them syntactically so that the data is easier to process programmatically.
- Accuracy: the degree of conformity of a measure to a standard or a true value; verifying this usually requires an external data set for comparison.
- Completeness: the degree to which all required measures are known.
- Consistency: the degree to which a set of measures is equivalent across systems.
- Uniformity: the degree to which all measurements use the same units of measure; this overlaps with some aspects of validation.
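
To make the validity, completeness, and uniformity checks above a bit more concrete, here is a minimal sketch in Python using pandas. The `records` table, its column names, and the thresholds are invented purely for illustration and are not part of any particular cleaning tool.

```python
import pandas as pd

# Hypothetical customer records used only for illustration.
records = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, -5, 41, None],           # -5 breaks a range constraint, None a mandatory constraint
    "country": ["US", "US", "DE", "DE"],
    "height": [1.80, 1.65, 170.0, 1.75]  # 170.0 looks like centimetres mixed in with metres
})

issues = []

# Data-type constraint: age must be numeric.
if not pd.api.types.is_numeric_dtype(records["age"]):
    issues.append("age column is not numeric")

# Mandatory constraint: age must not be missing.
missing_age = records[records["age"].isna()]
if not missing_age.empty:
    issues.append(f"{len(missing_age)} record(s) missing a mandatory age value")

# Range constraint: age must fall in a plausible interval.
out_of_range = records[(records["age"] < 0) | (records["age"] > 120)]
if not out_of_range.empty:
    issues.append(f"{len(out_of_range)} record(s) with age outside 0-120")

# Uniformity check: flag heights that look like centimetres rather than metres.
suspect_units = records[records["height"] > 3]
if not suspect_units.empty:
    issues.append(f"{len(suspect_units)} record(s) with height probably recorded in cm")

# Completeness: share of required measures that are actually known.
completeness = records[["age", "height"]].notna().mean().mean()
issues.append(f"completeness of required measures: {completeness:.0%}")

for issue in issues:
    print(issue)
```

In a real pipeline these rules would come from a data dictionary or schema rather than being hard-coded, but the idea is the same: each quality dimension translates into a concrete, automatable check.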
This research area still has challenges to solve before the optimization problems it raises are fully addressed. Today, issues such as error correction and the loss of information it can cause, or the maintenance of already cleansed data, still create serious difficulties. But with the rise of Big Data and the growing interest of large companies such as IBM and Oracle in this field, we can be optimistic and say that we are on the right track.