JavaScript and Unicode Character Validation

gists, javascript, csharp comments edit

I’m struggling right now with the fact that JavaScript/ECMAScript doesn’t allow for Unicode character classes in regular expressions. For example, if I want to set up a client-side JavaScript validation expression on a numeric field, I’d want to do something like ^\d+$ as my regular expression, right? Match one or more digits?

The problem is that in JavaScript, \d expands out to [0-9], which technically isn’t all of the digits, if you think about all of the other alphabets out there that exist and don’t use 0 through 9 to indicate numbers.

In .NET, they solve this by mapping to Unicode character classes. So \d maps to \p{Nd}, which is the Unicode character class for digits. Much more global, right? So how do you do that on the client side?

Well, I figure you have to expand the character classes on the server side and then feed those to the client. JavaScript supports Unicode character codes with a hexadecimal character code, so you can say like \uFFFF or whatever to specify a particular character. So you need to take \d and expand to the full set of Unicode characters.

Using \d as our example, a C# snippet that expands the digits looks like this:

static void Main(string[] args){
  string Nd = UnicodeExpansion(System.Globalization.UnicodeCategory.DecimalDigitNumber);
  Console.WriteLine(Nd);
  Console.ReadLine();
}

/// <summary>
/// Expands a Unicode character set into an ECMAScript compatible character
/// range string.
/// </summary>
/// <param name="category">
/// The Unicode character category to expand.
/// </param>
/// <returns>
/// A <see cref="System.String" /> that can be used in an ECMAScript regular
/// expression.
/// </returns>
/// <remarks>
/// <para>
/// ECMAScript (JavaScript) does not inherently understand Unicode in regular
/// expressions, which results in incorrect validation when using character
/// classes (\w, \s, \d, etc.).
/// </para>
/// <para>
/// This method expands a <see cref="System.Globalization.UnicodeCategory" />
/// into a string that can be used in an ECMAScript regular expression.  For
/// example, the category <see cref="System.Globalization.UnicodeCategory.LetterNumber" />
/// expands to <c>\u2160-\u2183\u3007\u3021-\u3029\u3038-\u303a</c>.
/// </para>
/// </remarks>
public static string UnicodeExpansion(System.Globalization.UnicodeCategory category){
  // The fully expanded block of characters
  string expansion = "";
  // Low-end of the character block
  int blockLow = -1;
  // High-end of the character block
  int blockHigh = -1;
  // Marks whether the current block has been written
  bool blockWritten = false;

  for(int charVal = 0; charVal <= Char.MaxValue; charVal++){
    // Get the category of the current character
    System.Globalization.UnicodeCategory charCat = Char.GetUnicodeCategory(Convert.ToChar(charVal));

    // We haven't written anything this loop; used to ensure
    // all blocks get written at the end.
    blockWritten = false;

    // Ignore characters that don't match the category.
    if(charCat != category){
      continue;
    }

    if(blockLow == -1){
      // Handle the very first block
      blockLow = charVal;
      blockHigh = charVal;
    }
    else if(
      // charVal skipped some characters OR
      blockHigh + 1 != charVal ||
      // We're at the end of the set of characters
      blockHigh + 1 > Char.MaxValue
      ){

      // Write the block to the expansion string
      if(blockLow == blockHigh){
        // This is a one-character block
        expansion += String.Format(@"\u{0:x4}", blockLow);
      }
      else{
        // This is a multi-char block
        expansion += String.Format(@"\u{0:x4}-\u{1:x4}", blockLow, blockHigh);
      }

      // Start a new block
      blockWritten = true;
      blockLow = charVal;
      blockHigh = charVal;
    }
    else{
      // We're still in the same block; increment the high end of the block.
      blockHigh = charVal;
    }
  }

  // If we didn't write the last block, write it now
  if(!blockWritten){
    if(blockLow == blockHigh){
      // This is a one-character block
      expansion += String.Format(@"\u{0:x4}", blockLow);
    }
    else{
      // This is a multi-char block
      expansion += String.Format(@"\u{0:x4}-\u{1:x4}", blockLow, blockHigh);
    }
    blockWritten = true;
  }

  return expansion;
}

For \d, it expands out to:

\u0030-\u0039\u0660-\u0669\u06f0-\u06f9\u0966-\u096f\u09e6-\u09ef\u0a66-\u0a6f\u0ae6-\u0aef\u0b66-\u0b6f\u0be7-\u0bef\u0c66-\u0c6f\u0ce6-\u0cef\u0d66-\u0d6f\u0e50-\u0e59\u0ed0-\u0ed9\u0f20-\u0f29\u1040-\u1049\u1369-\u1371\u17e0-\u17e9\u1810-\u1819\uff10-\uff19

Which means that rather than ^[\d]+$ to validate, you’d use ^[\u0030-\u0039\u0660-\u0669\u06f0-\u06f9\u0966-\u096f\u09e6-\u09ef\u0a66-\u0a6f\u0ae6-\u0aef\u0b66-\u0b6f\u0be7-\u0bef\u0c66-\u0c6f\u0ce6-\u0cef\u0d66-\u0d6f\u0e50-\u0e59\u0ed0-\u0ed9\u0f20-\u0f29\u1040-\u1049\u1369-\u1371\u17e0-\u17e9\u1810-\u1819\uff10-\uff19]+$.

You can try this out at http://www.regular-expressions.info/javascriptexample.html. Seems to work pretty well.

I’m using numbers as my example here, though the same thoughts could be applied to letters or any other character classes. Like in JavaScript, \w maps to [a-zA-Z_0-9], which is obviously not all the possible letters out there.

You could even take this a further step and pre-calculate all of the Unicode character blocks at application start time and cache the common character class expansions for use in regex translation on the server side.

Updated 9/9/2005 for boundary condition logic error and again on 9/11/2005 to fix accidental omission of the last block (thanks cougio); modified the method to be a standalone static for easier cut and paste into applications; added comments for readability.

Comments