Friday, May 11, 2007

Proper string comparison

It seems developers have a knack for collecting technical books.  I have a system where books are available to me relative to their importance.  Books I crack once every few months stay at home.  If I start needing the book more than once a month, it goes on my bookshelf in my desk.  There are a few books I need on almost a daily basis, and these books stay within arms reach.  The two books I've always kept close to me are Framework Design Guidelines and CLR via C#.  I would almost consider these books required reading for a .NET team, though the latter can be fairly low-level at times.  One such situation I've needed these books close at hand is when I'm doing string comparisons.

An exciting bug

I was looking at a defect that came down to data not being saved into the database properly.  All of the fields in a record had a value, but one important field was conspicuously blank.  A colleague hunted down the persistence code and noticed the following line (scrubbed somewhat):

if (address.Other1.Type.Equals("taxcode"))
    AddInputParam(sqlCmd, "@tax_code", SqlDbType.Char, address.Other1.Value);
else
    AddInputParam(sqlCmd, "@tax_code", SqlDbType.Char, String.Empty);

This code seems innocuous.  If the address's Other1 property's Type equals "taxcode", then set the "@tax_code" parameter to its value, otherwise set the "@tax_code" parameter value to an empty string.  This stored procedure requires a "@tax_code" parameter, but the address may or may not have a "taxcode" value.  All very straightforward, and fairly explicit.

Capital letters are fun

Oh snap.  It looks like the value of "address.Other1.Type" isn't "taxcode", but rather "TAXCODE".  For this snippet of code, the intent was to ignore the case of the Type, but by default, string.Equals will do a character by character comparison in a case-sensitive fashion to determine equality.  Since chars are 16-bit Unicode values, internally the char.Equals will just compare the 16-bit values of two characters to determine equality.  So we need a good way to determine equality while taking into account case-sensitivity (hint: address.Other1.Type.ToUpper().Equals("TAXCODE") is NOT the right answer).

So how should we compare strings then?  This is where the CLR via C# book comes in extra-handy.  The answer depends on the context of comparison you want to do.  There are three types of comparisons we can perform on a string:

  • Current culture
  • Invariant culture
  • Ordinal

Additionally, each comparison type has a case-insensitive option available.

So what should I have used instead?

By far most string comparisons should use the Ordinal comparison types.  Ordinal comparisons use the UTF-16 value of the string, character by character, while oridinal case-insensitive comparisons use invariant culture rules.  Ordinal comparisons are also much faster than culture-aware comparisons.  By default, string comparisons use the CultureInfo.CurrentCulture settings, which can change during the course of execution, unknown to the code performing the string comparison.  That's why we don't want to ever use ToUpper() when we're not dealing with user input, since the underlying culture can change without us knowing.  Comparisons that were equal when executed in the US are now not in France.  This can cause major headaches when dealing with global code.  Ordinal comparisons are good for use in:

  • Path and file names
  • Database object names (database, table, field, etc.)
  • XML tags and attribute names
  • Local constants, identifiers and "magic values"

Notice that the defect was an example of the last entry.  Since I wanted a case-sensitive comparison, I would have used StringComparison.OrdinalIgnoreCase.  For simplicity and ease-of-use, the .NET Framework has static and member overloads of the following methods that will take the StringComparison enumeration:

  • Equals
  • Compare
  • StartsWith
  • EndsWith

So to fix the defect, I would have changed the code snippet to:

if (address.Other1.Type.Equals("taxcode", StringComparison.OrdinalIgnoreCase))
    AddInputParam(sqlCmd, "@tax_code", SqlDbType.Char, address.Other1.Value);
else
    AddInputParam(sqlCmd, "@tax_code", SqlDbType.Char, String.Empty);

So how do I compare strings in the future?

When I'm performing string comparisons, I really only need to ask myself two questions:

  • Where did each string come from?  (Hard-coded value, user input, etc.)
  • Do I care about the case?

The answer of each of the questions will point me to the correct StringComparison value.  Just to be explicit, I never use the default methods for ToUpper, Equals, etc.  When you leave out the StringComparison or other culture information, you're more likely to introduce bugs simply because you haven't been explicit about what kind of comparison you'd like to do, regardless whether the default implementation does what you'd want it to do.  If you're explicit, the next developer to look at that code will understand immediately what the intention of the comparison was instead of trying to figure out what kind of comparison it should have been.

For further information

No comments: