[SQL] Search for Patterns and Collation Implication

Solution: SQL Server

Creation of reports where we need to search for characters, words or patterns is a very common task in any Data Migration project, or many others, where we need to do some data analysis.

A common way to look for patterns in SQL is the use of the LIKE operator. It searches for strings, data and times values that match a specific pattern. It is very handy as you can use wildcards (%,_) and charlist ([]). This operator is easy to use but it can become tricky when using charlist if you are not aware of how SQL sorts the data.

To understand this, you have to be familiar with the collation.

The collation can be specified at different levels:

1. Database

2. Column

3. Casting the collation of an expression.

In this post we are interested to see how the collation influence the search of patterns.

I have created a simple table that contain a list of characters:

How would you search for all letters in the alphabet? First we would ask, which alphabet? Do we search for upper or lower case? Do we consider accents? Those questions might seem irrelevant, but their answers determine the way that you have to use the LIKE operator and the collation.

Here is an example:

If we use the following WHERE clause: where [Name] LIKE '%[A-Z]%'

Which of these options is the result set?

A,a,Z,z
A,á,a,z,Z,ñ
A,Z

Answer:

Is Option 2 correct? Yes and no, it depends upon the collation used. So, in the previous example, Option 2 was correct because we were using the standard SQL collation.

In the following example we are going to use the collation: Latin1_General_BIN and apply the same pattern search:

Now the result is:

Why is this happening?

When we search for patterns, SQL first sorts and then looks for the pattern. This order is determined by the collation that we use.

A collation specifies how strings are sorted and compared. In many collation names you can easily determine some of their properties, for instance, the default SQL collation:

sql_latin1_general_cp1_ci_as

CP1 specifies code page 1252
CI specifies case-insensitive (CS specifies case-sensitive)
AS specifies accent-sensitive (AI specifies accent-insensitive)

Let’s have a look at how SQL sorts using the Latin1_General_BIN:

For this reason, when we said LIKE '%[A-Z]%', it only found the result set between rows 4 and 5.

It is very different to the previous result set, where we have used the default collation. In that scenario, we can see that the order is totally different:

Now the statement LIKE '%[A-Z]%', returns all the records between 4 and 9.

The above examples are very simple - the idea is to make you aware of the importance of the collation and how SQL derives the result. During a project, you may find data from different countries with many unfamiliar characters. Here you should take time to analyse the data and search for patterns. This may seem straightforward, but sometimes is more complex than expected.

Alternatively, you can translate all characters to Unicode and then use those values to look for specific characters.

I hope you have found this post interesting.

Please do not hesitate to ask any questions.

[SQL] Search for Patterns and Collation Implication

Comments

Didn't find what you were looking for?