Comparing apples with pears: how to come up with a reasonable strategy for data validation

First of all: I wonder if this image even works in English. In German you say “comparing apples with pears” when you want to point out that someone is trying to compare two things that are entirely different, so the comparison just doesn’t make sense. (How does that translate to English? Put it in the comments below…)
As I tried to outline in my previous post (link): data validation always comes down to comparing table-like data. That insight is clearly part of the solution for coming up with a data validation strategy, but it’s certainly not the whole story.

What you still have to work out is which data to select and compare. That is tricky because the data in source and target (or before and after the conversion; it could even be multiple systems, but for readability I’ll just say “source” and “target”) is probably supposed to differ: in format, in the data model, in the amount of data, etc. The whole point of a data transformation is that the result differs from what you started with. If nothing changed, you would not need a data transformation.
Summary
Hurry up, I have only five minutes until my next meeting starts
They say people are not able to memorize more than seven things at a time. So I’ll make it six so you have one thought left:
- Know your validation scope. If something is not in scope of your data transition, just make sure it really doesn’t change; that’s sufficient. Usually this is an easy check. Save your energy for the cases that are expected to change.
- There are proven check patterns. Don’t try to re-invent the wheel. Cross those patterns with the data that is in scope of your data transformation and you will see that only a few spots are left uncovered.
- Know what to validate in which case. Which check patterns to use for which data depends on the kind of data transition you are planning. There is a solution for everything, but there is not one solution for everything.
- Stay flexible. The project scope will change and data validation will have to adapt to it.
- Don’t dare to do this without a proper tool. Capgemini’s methodology is a tool-based approach that takes your input and automates whatever can be automated.
- Know whom to contact to get advice: www.linkedin.com/in/henrik-saterdag
Coming up with a strategy
Everything that follows has been proven in more than ten years of doing data validation in data transformation projects. However, these were all projects that dealt entirely or at least primarily with SAP systems. If you have non-SAP systems involved, it may still hold true, but I won’t guarantee it.
Choose your enemy
Maybe you want to grab a marker and a big sheet of paper and write two questions on it:
- What could go wrong during my data transition?
- How will I be able to detect it, if it happens, with a minimum of effort?
It might look obvious, but if you do an SAP S/4 conversion (a “brownfield” conversion), many tables in your SAP system are not even touched, and there is no point in validating those tables in depth. Maybe you do a row count for them; that can be done without much effort. Or you check the change timestamps in the database log and make sure they don’t move during your conversion. There is a slight chance that such a table is impacted by one of the activities even though it shouldn’t be, but that chance is so small that you don’t want to spend more than 15 minutes on it. Make sure that tables you don’t expect to change actually don’t change, and do no further checks. Focus on what is going to change.
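To illustrate, here is a minimal Python sketch of such a sanity check, assuming the row counts per table have already been pulled from both systems (the table names and counts are made up):

```python
# Sanity check for tables that are NOT in scope of the transition:
# their row counts should be identical in source and target.
# Counts are assumed to come from a simple SELECT COUNT(*) on each side.
source_counts = {"T001": 1200, "BKPF": 5400000, "ZCUSTOM1": 87}
target_counts = {"T001": 1200, "BKPF": 5400000, "ZCUSTOM1": 86}

out_of_scope_tables = ["T001", "ZCUSTOM1"]  # tables the conversion should not touch

for table in out_of_scope_tables:
    src, tgt = source_counts.get(table), target_counts.get(table)
    if src != tgt:
        print(f"WARNING: {table} changed although out of scope: {src} -> {tgt} rows")
```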
Independent of project size, duration and project type (system merger, split, conversion, etc.), you can boil the requirements for data validation down to the following three patterns.
- Table data completeness & table value correctness: direct table data comparisons
- Object consistency & object completeness: object checks
- Functional correctness & functional completeness: simply re-use the application logic implemented in standard (and customer) reports
The trick is to cover as much as possible with these three patterns. It’s a standardized and proven approach and it will give you a degree of coverage that you may not have thought possible.
Stay flexible
What’s even rarer than seeing a unicorn flying at full moon? Answer: an IT project where the scope does not change during the course of the project. While setting up your data validation strategy, keep in mind that the data transition scope that all of your efforts are based on is just a current snapshot. All of the following might change:
- The list of tables in scope
- The list of company codes, plant codes, etc. in scope
- The mapping logic to be applied to certain values
- …
The saddest experience I have had with data validation in any project was the one guy who had implemented a highly sophisticated validation algorithm. It was able to validate that one especially complicated case, but as the project scope shifted it was suddenly rendered worthless. A big adjustment of the validation logic would have been necessary to get valuable results again, and it was simply decided that it wasn’t worth it.
But even smaller changes can break your neck if they happen frequently (and small changes happen frequently). Who wants to be responsible for manually keeping the list of company codes up to date in the selection variants of 100 reports?
Use a tool-based approach
Breaking news: the whole thing will only fly if you have the right software supporting it. Validating tons of data will only be fast if you let the computer do it. And keeping track of scope changes is not something you want to do in Excel.
Check patterns — let’s get our hands dirty!
Trigger warning: from here on it gets a bit technical. If you are allergic to table names or other technical terminology, you might want to jump directly to the third pattern, standard and customer reports, or stop reading here entirely.
1. Direct table data comparisons
Let’s start with an easy example: table MARC. And let’s assume you do a Selective Data Transition (SDT) in which you transfer the data of one company code (and its respective plant codes) from your SAP ECC to your shiny brand-new SAP S/4.
An easy check would be:
- For all the plant codes associated with the company code in scope: do we have the same number of rows in table MARC in source and target?
- For all plant codes not in scope: do we have zero rows in the target?
(To be more precise, for both checks you would also have to account for any data that already existed in the target system prior to the data migration, because even an “empty” system sometimes contains data where you would not expect it.)
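To make this a bit more tangible, here is a minimal Python sketch of both checks. The plant assignments and row counts are invented; in a real project they would come from the customizing and from aggregated SELECTs on MARC in source and target:

```python
# Row counts per plant (WERKS) in table MARC, aggregated on each side,
# e.g. via SELECT werks, COUNT(*) FROM marc GROUP BY werks.
source_marc = {"P100": 1500, "P101": 230, "P900": 4000}   # P900 belongs to another company code
target_marc = {"P100": 1500, "P101": 229}

plants_in_scope = {"P100", "P101"}   # plants of the company code being migrated
preexisting_target = {}              # rows that already existed in the target before migration

# Check 1: in-scope plants must have the same number of rows in source and target.
for plant in plants_in_scope:
    expected = source_marc.get(plant, 0) + preexisting_target.get(plant, 0)
    actual = target_marc.get(plant, 0)
    if expected != actual:
        print(f"MARC mismatch for plant {plant}: expected {expected}, found {actual}")

# Check 2: out-of-scope plants must not show up in the target at all
# (apart from rows that were already there before the migration).
for plant, count in target_marc.items():
    if plant not in plants_in_scope and count > preexisting_target.get(plant, 0):
        print(f"MARC contains {count} unexpected rows for out-of-scope plant {plant}")
```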
And of course, you would not treat table MARC individually. Instead, you would check all tables that have a plant code in one go. That is hundreds of tables, validated with the same logic and the same implementation in one click. The same goes for all tables with a company code, a sales organization, etc.
Please do not take this example as the only way to use direct table comparisons for data validation. It’s just an easy, straightforward example. A direct comparison on table level is the check pattern with the most variability; it allows the implementation of pretty much any requirement. However, the associated effort can become quite significant as soon as you decide to put in a bit more logic than a simple row count comparison. Therefore, it is recommended not to over-complicate the check logic. It’s not the implementation of the logic in ABAP that drives the effort (by using proper modelling tools, you can avoid the ABAP part); it’s the brain power you must invest to get the specification done.
The right usage of data comparison on table level is:
a) to implement patterns that are valid for many tables
b) to cover requirements that cannot be covered by other means
A direct data comparison on table level means (slightly simplified): data is read via SELECT from one or multiple tables in the source, the same happens in the target, and finally the data is compared. In many cases the trick is not to use the same selection in source and target, but to apply different filtering, value mapping, etc. on each side.
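A small sketch of that idea, reduced to plain Python: the two “selections” are mocked as row lists, and an assumed plant code mapping is applied to the source side before comparing:

```python
# Simplified table comparison with different selection / value mapping per side.
# In a real setup the two row sets would come from SELECTs against source and target.
source_rows = [("M-100", "P100"), ("M-101", "P100"), ("M-102", "P101")]
target_rows = [("M-100", "1000"), ("M-101", "1000")]   # M-102 is missing in the target

plant_mapping = {"P100": "1000", "P101": "1010"}  # assumed mapping rule for this transition

# Apply the mapping to the source selection, then compare the two row sets.
mapped_source = {(material, plant_mapping.get(plant, plant)) for material, plant in source_rows}
target = set(target_rows)

print("missing in target:", mapped_source - target)
print("unexpected in target:", target - mapped_source)
```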
2. Object checks
The term “object” is often used in the context of data. However, there is no clear definition of what a data object is. Usually, the term describes a set of tables that somehow relate to each other, while it is not clear where one object ends and the next one starts. Objects in the SAP context can consist of 50 tables or more, and a single table can be part of multiple objects, depending on the content of each table row. Not to mention the infamous “intransparent fields” (if you don’t know what those are: SAP invented intransparent fields long ago to torture innocent data migration teams).
There are two ways of checking data objects:
- Object consistency: does each referenced data entity exist? For example: for each MARC entry, does the corresponding MARA entry exist?
- Object completeness: does each object instance (e.g. material with material number “100”) exist with all its dependent data? For example: does material number “100” exist for all five plant codes (“P100”, “P101”, etc.) that it existed for in the source? I know this one is a bit harder to grasp, so take your time and read it a second time. The main problem is: material “100” may (correctly) exist in table MARA, but if only the MARA entry survived the data transition, material “100” is still broken.
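Here is a minimal sketch of both object checks for the MARA/MARC example, with made-up key values and no scope filtering, just to show the principle:

```python
# MARA: material master, key = material number
# MARC: plant data for a material, key = (material number, plant)
target_mara = {"100", "200"}
target_marc = {("100", "P100"), ("200", "P100"), ("300", "P101")}

# Object consistency: every MARC entry must reference an existing MARA entry.
orphans = {(mat, plant) for mat, plant in target_marc if mat not in target_mara}
print("MARC entries without MARA:", orphans)          # ('300', 'P101') is inconsistent

# Object completeness: every material must still exist for all plants it had in the source.
source_marc = {("100", "P100"), ("100", "P101"), ("200", "P100")}
missing = source_marc - target_marc
print("MARC entries lost in transition:", missing)    # ('100', 'P101') got lost
```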
Both checks become relevant as soon as a selective data migration or a conversion creates the risk that object consistency or object completeness is harmed.
However, the number of checks can be a bit scary. Any SAP system consists of somewhere between 75,000 and 100,000 tables. Between those tables there are on average 500,000 to 1,000,000 table relations. Of course not all of them are relevant, because most tables are empty, but the number of table relations relevant for checking can still easily reach 20,000. Don’t even try to define them manually.
There are ways to crawl the system and determine the relevant data relationships automatically, and to do this in an iterative process so that the check scope adjusts automatically if the scope of the data transformation changes. You just define the scope of the data transformation so that the crawler knows which tables are relevant for checking, and the rest happens automatically.
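A very rough sketch of the idea behind such a crawler: start from the tables in scope and keep only the relations that connect them. The relation list here is hard-coded for illustration; in reality it would be harvested from the system’s dictionary metadata:

```python
# Table relations as (dependent_table, referenced_table) pairs, normally
# harvested from the system's dictionary metadata rather than typed in.
all_relations = [
    ("MARC", "MARA"), ("MARD", "MARC"), ("VBAP", "VBAK"),
    ("EKPO", "EKKO"), ("MVKE", "MARA"),
]

tables_in_scope = {"MARA", "MARC", "MARD"}   # comes from the data transformation scope

# Keep every relation where both ends are in scope; those become object checks.
relevant_checks = [r for r in all_relations
                   if r[0] in tables_in_scope and r[1] in tables_in_scope]
print(relevant_checks)   # [('MARC', 'MARA'), ('MARD', 'MARC')]
```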
Running those checks can still be a challenge, depending on the amount of data to be checked. But only 5% of the checks will require special treatment, because they run too long or eat up all system resources (or both). Don’t worry, the system guys will take care of that (that’s me :-) )

It’s worth repeating: there is a way to fully automatically set up and run object checks for thousands of table relationships, based on nothing more than the list of tables in scope for your data transformation.
3. Standard (and customer) reports
Hey, you’re still with me? Either you’re bored or you are really interested in data validation. Thanks mate :-)
Back to the topic.
Both patterns explained previously are rather technically driven: they look at which tables are in scope of a data transformation and take it from there. But there’s another way to do data validation that comes from the opposite direction. Ask yourself: how would the business departments validate data in an SAP system? The answer: they would probably use the same reports and transactions they use in their daily business to check whether all documents are there, whether balance figures match, etc. Because if you copy one company code from one system to another, you certainly want the company’s balances to be the same as before.
And when you look at an SAP S/4 brownfield conversion, it becomes even more obvious: the whole transition is technical. It doesn’t matter that the data model changes inside: all business figures, documents, etc. must look the same from the outside as before the transition. (Of course with SAP S/4 you can build fancy stuff on top, but your company’s balances should not change.)
So all you have to do is define the relevant SAP (and customer) reports, run them in source and target and compare the output. Right?
Well, here is the problem: assume you have identified 100 relevant reports and you want to run them for 250 company codes with a total of 1,000 plant codes. You will have to hire a lot of people to conduct this comparison. They should be low-cost resources, because your project budget is limited; on the other hand, they have to be fast, because their task contributes to the business downtime.
The solution to that problem is simple: integrate the report execution into a tool. The tool will execute the reports for you, take the report output, feed it into a parser (so that only the relevant figures are considered), do the comparison and present the results.
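A minimal sketch of the parse-and-compare step, assuming the tool has already executed the same report in source and target and captured the output as plain text (the report layout and figures are invented):

```python
import re

# Invented report output: one balance figure per company code.
source_output = """Company code 1000   balance:  1,234,567.89
Company code 2000   balance:    987,654.32"""
target_output = """Company code 1000   balance:  1,234,567.89
Company code 2000   balance:    987,654.00"""

def parse_balances(text: str) -> dict:
    """Extract only the relevant figures (company code -> balance) from the raw output."""
    pattern = re.compile(r"Company code (\d+)\s+balance:\s+([\d,.]+)")
    return {code: balance for code, balance in pattern.findall(text)}

source, target = parse_balances(source_output), parse_balances(target_output)
for code in source:
    if source[code] != target.get(code):
        print(f"Balance differs for company code {code}: {source[code]} vs {target.get(code)}")
```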
Of course this works best if you have a purely technical transition that allows a 1:1 comparison of the report output; that’s why it is so popular for SAP S/4 brownfield conversions. However, you can also use it in other contexts where values are going to change (e.g. a currency conversion). If values are supposed to change, you will have to account for that change before doing the comparison, but usually that is not too hard to implement. The complexity of data validation is usually not that high when you look at individual cases; it is the sheer amount of data that creates the challenge.
Conclusion
It’s not easy to find the right balance between a) just scratching the surface and b) describing too many details. I tried my best; let me know how well it worked.
The intention of this post is to show that data validation is not a trial-and-error process or something you do only after everything else is fixed. There are ways to approach it systematically and divide it into digestible pieces.
Just don’t try to do it without a proper tool.
There are more details to come in the next few days. I just need to figure out how medium.com works best for this.