Development of DataCleaner has been a bit quiet lately (where 'lately' refers to the last month or so ;-)) but now I want to share an idea that we have been working on at Human Inference: data quality monitoring.
But first some motivation: I think by now DataCleaner has a pretty firm reputation as one of the leading open source data quality applications. We can, and we will, continue to improve its core functionality, but it is also time to take a look at the next step for it - the next big version, 3.0. For this we are picking up a major piece of functionality that is crucial to any DQ and MDM project - data quality monitoring. We all know the old saying that you cannot manage what you cannot measure. There is a lot of general truth to that statement, and it certainly applies to data quality. Data profiling tools traditionally focus on a one-time measurement of various metrics. But to truly manage your data quality, you need to monitor it over time and be able to act not only on the current status, but also on the progress you're making. This is something new, which none of our open source competitors provide - a monitoring application for tracking data quality levels over time! We also want to make it easy to share this intelligence with everyone in your organization - so it has to be web based.
Based on this we're setting forth the roadmap for DataCleaner 3.0, and we're already able to show very good results. In the following I will share some of what is there. Don't take it for anything more than it is: a snapshot of our current work in progress. But we do plan to deliver a final version within a few months.
The application's main purpose is to present the history of data quality measurements. Here's a screenshot of our current prototype's timeline view:
Shown there are the metrics collected by a DataCleaner job, recorded and displayed in a timeline.
It's really easy to customize the timeline, or create new timeline views. All you need to do is select the metrics you're interested in. Notice in the screenshot below that there are different types of metrics, and for those that are queryable you have a nice autocompletion/suggestion interface:
If you click a point in the timeline, you'll get the option to drill to the point-in-time profiling result:
If you drill down, you will be given a full data profiling result (like the ones you know from the current version of DataCleaner, but in the browser). Our prototype is still a bit simplified on this point, but most analysis results actually render nicely:
So how is it going to work in a larger context? Let me disclose a few of the ideas:
- Queryable metrics: Consider a gender column where you always expect the values "MALE" or "FEMALE". But you also know that there is quite some dirt in there, and you wish to measure the progress of eliminating that dirt. How would you define the metric then? What you need is a metric definition that takes a parameter/query, saying that you wish to monitor the number of values that are NOT "MALE" or "FEMALE". Similar cases exist for many other components, such as the Pattern finder, Dictionary matchers etc. In the DataCleaner monitor, the user will be able to define such queries using IN [...] and NOT IN [...] clauses, like very simple SQL (see the first sketch after this list).
- Scheduling: The monitoring application will include a scheduler to let you automatically run your periodic data quality assessments. The scheduler will support both periodic and trigger-based scheduling events. For instance, you might have a workflow where you wish to trigger data profiling and monitoring in a wider context, such as ETL jobs, business processes etc.
- Desktop integration: You will also be able to run jobs in the regular DataCleaner (desktop) application and then upload/synchronize your results with the monitoring application. This makes it easy to share your findings when you are working interactively with the desktop application.
- Email alerting: You will be able to set up expected ranges for particular metrics, and if values are recorded outside the allowed ranges, the data steward will be alerted by email (see the second sketch after this list).
- Repository: All jobs, results, configuration data etc. are stored in a central repository. The repository allows you to centrally manage the connection information, job definitions etc. that your team is using. The idea of a repository also opens the door to concepts like versioning, personal workspaces, security, multi-tenancy and more. I think it will become a great organizing factor and a "hub" for DataCleaner users.
- Daily snapshots: It is easy to define a profiling job that profiles a complete table. But when dealing with periodic data profiling, it is likely that you only wish to analyze the latest increment. Therefore DataCleaner's handling of date filters has been improved a lot. This ensures that you can easily request a profiling job covering "yesterday's and today's data" and thereby see profiling results based only on the data that was entered or changed in your system recently (see the third sketch after this list).
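To make the queryable metrics idea a bit more concrete, here is a sketch of what such a metric query could look like for the gender example above. Mind you, the exact syntax is still work in progress, so treat the details as hypothetical:

    -- count the values in the 'gender' column that are neither MALE nor FEMALE
    gender NOT IN ["MALE", "FEMALE"]

    -- or the inverse: count only the valid values
    gender IN ["MALE", "FEMALE"]

The point is that the query parameterizes an existing metric (for instance a value count from the Value distribution analyzer), rather than requiring you to write full SQL against the datastore.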
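Similarly, an email alert could simply attach an expected range to such a metric. The format below is purely illustrative (the metric, range and address are all made up), but it shows the idea:

    metric:   gender NOT IN ["MALE", "FEMALE"]
    expected: 0 - 100
    alert:    email data.steward@example.com when outside the expected range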
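And for the daily snapshots, imagine the job's date filter using relative dates instead of fixed ones. Again this is just a sketch, assuming a hypothetical 'last_modified' column:

    -- only profile records entered/changed since yesterday
    last_modified IN [yesterday, today]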
Sounds exciting? We certainly think so. And this idea has actually been talked about for quite some time (I found mentions of a "DC web monitor" application dating as far back as 2009!). So I am happy that we are finally putting the idea to the test. I would love to hear from you if you have any thoughts, remarks or additional ideas.
If you're interested in contributing or just taking a closer look at the development of DataCleaner 3.0, we've already started working on the new milestone in our issue tracking system. The code is located in DataCleaner's source code repository and we eagerly await you if you wish to pitch in!
Update
We now have an alpha version available that you can play around with. If you feel like being an early adopter, we would really appreciate your feedback!
Update (2)
We now have a beta version available that you can play around with. It is starting to look feature complete, so please check it out and let us know what you think!