We have a situation where we're unable to correctly extract data from an old SAP system for all languages. I don't fully understand the exact reason, but it's related to the database collation/code page. Currently, the client has this problem and was told that the only way to reliably extract data such as Chinese is by logging into SAP with the correct language and extracting the data manually. Note - we're unable to use RFC data extraction due to other client issues.
I want to create profiling reports that attempt to look for 'corrupted' data. I would suggest that any value containing a non-ASCII character is worth a 'heads up.' These reports are intended to be another tool, alongside other methods, for identifying problematic data. We have a profiling process that puts all values into a table; we then query that table with the following WHERE clause:
- WHERE FIELDVALUE LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'
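To make the intent of that LIKE pattern concrete, here is a rough Python sketch of the same check: flag any value containing a character outside the listed ASCII set (digits, letters, space, and common punctuation). The `needs_review` helper is hypothetical, not part of the actual profiling process, and a Python regex won't reproduce T-SQL collation semantics exactly.

```python
import re

# Characters considered "safe": digits, ASCII letters, space, and the
# punctuation listed in the LIKE pattern. Anything else triggers a flag.
ALLOWED = re.compile(r'[0-9a-zA-Z !"#$%&\'()*+,\-./:;<=>?@\[\]^_`{|}~\\]*')

def needs_review(value: str) -> bool:
    """Return True if the value contains any character outside the allowed ASCII set."""
    return ALLOWED.fullmatch(value) is None

print(needs_review("Plain ASCII text!"))  # False - all characters allowed
print(needs_review("Müller"))             # True - umlaut is outside the set
print(needs_review("北京"))                # True - Chinese characters flagged
```

This mirrors the `[^...]` negated character class: the SQL query returns rows where at least one character falls outside the allowed set.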
This seems to return a decent result; however, when I collate the values with Latin1_General_BIN, I get additional results that include umlauts and Spanish/Scandinavian characters that were extracted correctly:
- WHERE FIELDVALUE COLLATE Latin1_General_BIN LIKE '%[^0-9a-zA-Z !"#$%&''()*+,\-./:;<=>?@\[\^_`{|}~\]\\]%' ESCAPE '\'
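A likely explanation for the difference (hedged, since I can't see the database's default collation): in a non-binary SQL Server collation, a LIKE range such as `[a-z]` matches according to the collation's sort order, where accented letters like `ü` sort among the plain letters and therefore are *not* flagged by the negated class. Under `Latin1_General_BIN`, ranges are evaluated by raw code point, so accented letters fall outside `a-z` and get flagged. A small sketch of the code-point view a binary collation applies:

```python
def in_ascii_letter_range(ch: str) -> bool:
    # Code-point comparison, as a binary collation would do: only
    # 0x30-0x39, 0x41-0x5A, and 0x61-0x7A fall inside the ranges.
    return '0' <= ch <= '9' or 'a' <= ch <= 'z' or 'A' <= ch <= 'Z'

for ch in "nñuü":
    print(ch, hex(ord(ch)), in_ascii_letter_range(ch))
# n 0x6e True / ñ 0xf1 False / u 0x75 True / ü 0xfc False
```

So the binary-collated query is the stricter "pure ASCII" test, while the default-collation query tolerates accented Latin letters. Which one is right depends on whether correctly extracted umlauts and Spanish/Scandinavian characters should count as a 'heads up' or not.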
Advice would be appreciated from anyone who is familiar with this type of problem. My goal is to better understand what these WHERE clauses are doing, and whether there is a better way to approach this. Thanks