Pre-processing in DataCleaner 2: Why?

20110217

Pre-processing in DataCleaner 2: Why?

Last monday we released the new DataCleaner 2.0 and one of the major new features in it is the ability to transform and filter your data using the tool. Previously the common answer for someone asking about transformation in DataCleaner has been to tell them that "there are already good tools for this available elsewhere, so use them". So why did we choose to focus on data processing in DataCleaner 2.0? And what are the capabilities of DataCleaner in terms of transformations? Is DataCleaner over time going to evolve into a full-fledged ETL tool? I'll try to answer these questions.

All your data quality functions in one place

The obvious answer is that we want to provide more data quality functionality and that transformations is something that a lot of people need. While doing data profiling it is often needed to do data adjustments, eg. to tokenize values, extract certain information, filter which rows gets profiled etc. You can also do all this also by applying an ETL tool or maybe even by creating database VIEWs. The problem with such an approach is that it will eventually get in your way because you're trying to get 2-3 independent tools to work nicely together, instead of just having these functions available where you need them.

Transformations for the DQ domain

Possibly even more important is that the transformations that you want to employ in Data Quality analysis are typically quite different than those that come out-of-the-box in database scripts and ETL tools. Such tools are typically quite generic and will provide general purpose tokenizers etc., but will typically lack transformations pertaining to the DQ-domain, such as date mask matchers, dictionary lookups and synonym replacements, standardization of email adresses, names and URL's.

Non-persistent transformations

When you do pre-processing in separate tools, you also need to persist your transformed datasets. In Data Quality analysis this is just a waste of resources and provides poor performance. If you need to perform transformations, apply filtering etc. for the purpose of analysis, profiling and exploring your data it is much more feasible to just perform these transformations when needed in stead of storing them up front. This also allows for a much more free user experience where you can actually experiment with your data and you analysis in stead of having to overthink it.

Is DataCleaner's validator gone? No, it's just filtering with added output handling!

DataCleaner 1.x was known to have a strict separation between the activities "profile" and "validate". Since this separation is gone in 2.0, one might ask, "Is DataCleaner's validator gone? I can only see analysis jobs!". But the answer is no, we just consider validation as a type of analysis (thus, analysis is a broader term, comprising both profiling, validation and more). You can easily perform all the validation operations of DataCleaner 1.x, but the approach is a bit different because you basically apply filters in stead of "validators". There is even an example of this in the "Introduction to analyzing, cleansing and filtering data" webcast, available on DataCleaner's website.

5 comments:

Garima said...: Does DalaCleaner support SQL queries? How can I issue SQL queries to explore data? Should it not be a part of a data analysis/profiling tool to allow to explore data?

I can generate various profiles to explore data, but in the output, I can see only the column on which profiling was done. What should I do to see other associated column values too for that row?; March 2, 2011 at 8:21 AM
Kasper Sørensen said...: Currently there is not a SQL console or something similar in DataCleaner. This was a quite conscious choice, because we want to provide something that will work the same for any database, or rather - datastore! A datastore might simply be a CSV file, or an Excel spreadsheet, so making a SQL console would be something only available for some of the datastore types.

The other thing about doing SQL is that I don't think it's very good for exploration. In DataCleaner there are a lot of filters and transformation steps that you can add, and you can choose to preview and even subset your data ("Write to datastore") after you've performed these steps. In my oppinion that's a better way to do exploration.

I'm not against making something for user-written queries, but it's just not on the top of my radar screen. If someone wants to work on such a feature I think it's great and I will do my best to help them out.

Regarding your question about associated columns, the answer is that you need to include these columns in some kind of analysis. It could be a simple "String analyzer" or something like that. The reason is that the rows that are saved for inspection by the user are based on the rows that are available while DataCleaner is executing. At the same time, DataCleaner will only query for the data it needs. So this means if you need more data (columns) to be available, then you should make sure these data are included in the analysis.; March 2, 2011 at 9:09 AM
Anonymous said...: rhaa, I lost my previous comment. So here is a shorter one:

I just wanted to say that SQL queries are faster than a java program in general. And that they are useful when you need to monitor your data quality on a regular basis (Do I have some nulls data today?).

On the other hand, complex analyses will require the data to be extracted from the datastore. In this case, SQL queries cannot be used.

@Garima, give a try at Talend Open Profiler which provides both types of analysis.; March 17, 2011 at 11:58 AM
Kasper Sørensen said...: But honestly - any tool can do a SQL query. You can invoke it using any command line tool that follows with your DB.

And it's not entirely true that SQL is always faster than Java. If you're only looking for a single measure, then yes, it typically is. But if you want to retrieve a lot of measures from the same set of data, then it is typically faster to query once and share the result between multiple measure generators.

Furthermore, SQL doesn't work on anything but relational databases... I find that doing a DQA most often requires incorporating a lot of data from different stores.

Monitoring is something completely different and whatever measures found there should be handled as temporal data stored for trends analysis etc. I don't think a profiling tool will be the obvious choice for that.; March 17, 2011 at 2:30 PM
Joshua Smith said...: For every kind of business it needs to be updated by using new software solutions. Some applications could be developed by outsource company that provides custom software for business companies.; August 19, 2011 at 1:59 PM

kasper's source

20110217