Match characters

Match a character

Hexadecimal character

[a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonhexadecimal character

[^a-fA-F0-9]

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
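As a quick check of the two classes above, here is a minimal Python sketch. The names HEX_CHAR, NON_HEX_CHAR, and the sample string are illustrative assumptions, not part of the notes.

```python
import re

# The two character classes from the notes, compiled once.
HEX_CHAR = re.compile(r"[a-fA-F0-9]")
NON_HEX_CHAR = re.compile(r"[^a-fA-F0-9]")

sample = "0x1F, go!"

# findall returns every single character matched by the class.
hex_chars = HEX_CHAR.findall(sample)
non_hex_chars = NON_HEX_CHAR.findall(sample)

print(hex_chars)      # ['0', '1', 'F']
print(non_hex_chars)  # ['x', ',', ' ', 'g', 'o', '!']
```

Every character of the sample lands in exactly one of the two lists, because the second class is the negation of the first.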

The notation using square brackets is called a character class.

  • \, ^, -, and ] have special meanings inside square brackets.

  • JavaScript treats ‹[]› as an empty character class that always fails to match.

  • In all the regex flavors discussed in this book, a negated character class matches line break characters, unless you add them to the negated character class. Make sure that you don’t accidentally allow your regex to span across lines.
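The points above about special characters and line breaks can be sketched in Python as follows. The sample strings and variable names are made up for illustration; the JavaScript behavior of ‹[]› is not shown because this sketch only covers Python's re module.

```python
import re

# Inside a class, a backslash escapes \ and ]; ^ is literal when it is
# not first, and - is literal when it is last.
specials = re.findall(r"[\\\]^-]", r"a\b]c^d-e")
print(specials)  # ['\\', ']', '^', '-']

text = "key=value\nnext line"

# A negated class also matches line breaks, so this match spans two lines.
spanning = re.search(r"=[^;]+", text).group()

# Adding \n to the negated class keeps the match on one line.
single_line = re.search(r"=[^;\n]+", text).group()

print(repr(spanning))     # '=value\nnext line'
print(repr(single_line))  # '=value'
```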


About \w

  • Java 4 to 6, JavaScript, PCRE, Ruby: \w is identical to [a-zA-Z0-9_].
  • Java 7: \w matches Unicode characters only when the Pattern.UNICODE_CHARACTER_CLASS flag is set. The inline form of this flag is (?U).
  • Python 2.x: \w matches Unicode characters only when the UNICODE (or U) flag is set.
  • Python 3.x: \w matches Unicode characters by default. The ASCII (or A) flag makes \w ASCII-only.
  • Perl 5.14 and later: /u (Unicode) adds all Unicode scripts, and /l (locale) makes \w depend on the locale. With /d, or with none of the /adlu flags, \w follows the same rules as in Perl before 5.14.
  • Perl before 5.14: \w automatically includes Unicode scripts if the subject string or the regex is encoded as UTF-8, or if the regex includes a code point above 255 such as ‹\x{100}› or a Unicode property such as ‹\p{L}›. Otherwise, \w is pure ASCII.
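The Python 3.x rule can be checked with a short sketch. It assumes Python 3 and str patterns; the sample text is invented for illustration.

```python
import re

# "naïve café", written with escapes so the source stays ASCII.
text = "na\u00efve caf\u00e9"

# Default in Python 3: \w includes Unicode letters.
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \w behaves like [a-zA-Z0-9_], so accented letters split words.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

print(unicode_words)  # ['naïve', 'café']
print(ascii_words)    # ['na', 've', 'caf']
```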

About \d

  • \d follows the same rules as \w in all these flavors.
  • In .NET, digits from other scripts are always included.
  • In Python it depends on the UNICODE and ASCII flags, and whether you’re using Python 2.x or 3.x. In Perl 5.14, it depends on the /adlu flags.
  • In earlier versions of Perl, it depends on the encoding of the subject and regex, and on whether the regex has any Unicode tokens.
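For the Python case, a minimal sketch (assuming Python 3 and str patterns) shows \d picking up digits from another script unless the ASCII flag is set. The Arabic-Indic sample is an illustrative choice.

```python
import re

# "42" followed by the Arabic-Indic digits four and two (U+0664, U+0662).
text = "42 and \u0664\u0662"

# Default in Python 3: \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d+", text)

# With re.ASCII, \d is limited to 0-9.
ascii_digits = re.findall(r"\d+", text, flags=re.ASCII)

print(unicode_digits)  # ['42', '٤٢']
print(ascii_digits)    # ['42']
```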

About \s

  • \s matches any whitespace character. This includes spaces, tabs, and line breaks.
  • \S matches any character not matched by \s
  • In .NET and JavaScript, \s also matches any character defined as whitespace by the Unicode standard.
  • In Java, Perl, and Python, \s follows the same rules as \w and \d.
  • Notice that JavaScript uses Unicode for ‹\s› but ASCII for \d and \w.
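The same flag dependence applies to \s in Python, as this sketch suggests. It assumes Python 3 with str patterns and uses a no-break space (U+00A0) as an example of Unicode-only whitespace.

```python
import re

# A no-break space between a and b, an ordinary space between b and c.
text = "a\u00a0b c"

# Default in Python 3: \s treats the no-break space as whitespace.
unicode_parts = re.split(r"\s", text)

# With re.ASCII, only space, \t, \n, \r, \f, and \v count as whitespace.
ascii_parts = re.split(r"\s", text, flags=re.ASCII)

# \S matches exactly the characters that \s does not.
non_space = re.findall(r"\S", text)

print(unicode_parts)  # ['a', 'b', 'c']
print(ascii_parts)    # ['a\xa0b', 'c']
print(non_space)      # ['a', 'b', 'c']
```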

Flavor-Specific Features

[a-zA-Z0-9-[g-zG-Z]]
  • Regex options: None
  • Regex flavors: .NET 2.0 or later

[\w&&[a-fA-F0-9\s]]
  • uses character class intersection to match a hexadecimal digit.
  • Regex options: None
  • Regex flavors: Java

[a-zA-Z0-9&&[^g-zG-Z]]
  • emulates character class subtraction, by intersecting with a negated class, to match a single hexadecimal character in a roundabout way.
  • Regex options: None
  • Regex flavors: Java
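Python's built-in re module has neither the .NET subtraction syntax nor the Java intersection syntax. One way to get a similar effect, offered here as a swapped-in technique rather than anything from the notes, is a negative lookahead placed before the class. The name HEX_BY_SUBTRACTION is hypothetical.

```python
import re

# "Any alphanumeric that is not g-z or G-Z": the lookahead removes
# characters from the class that follows it.
HEX_BY_SUBTRACTION = re.compile(r"(?![g-zG-Z])[a-zA-Z0-9]")

# The direct class, for comparison.
HEX_DIRECT = re.compile(r"[a-fA-F0-9]")

sample = "Zebra 4f"
subtracted = HEX_BY_SUBTRACTION.findall(sample)
direct = HEX_DIRECT.findall(sample)

print(subtracted)  # ['e', 'b', 'a', '4', 'f']
print(direct)      # ['e', 'b', 'a', '4', 'f']
```

Both patterns accept the same characters, which is why the notes call the subtraction form roundabout for this particular task.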

Match any character

Any character except line breaks

'.'

  • Regex options: None (the “dot matches line breaks” option must not be set)
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Any character including line breaks

'.'

  • Regex options: Dot matches line breaks
  • Regex flavors: .NET, Java, PCRE, Perl, Python, Ruby

'[\s\S]'

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby
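The three variants above can be compared side by side in Python, where the "dot matches line breaks" option is the re.DOTALL flag. The subject string is an illustrative assumption: one quoted letter and one quoted line break.

```python
import re

# Two quoted characters: 'a' and a quoted newline.
subject = "'a' '\n'"

# Without the option, the dot refuses to match the line break.
dot_default = re.findall(r"'.'", subject)

# With re.DOTALL, the dot matches the line break as well.
dot_all = re.findall(r"'.'", subject, flags=re.DOTALL)

# [\s\S] matches any character without needing any option.
any_char_class = re.findall(r"'[\s\S]'", subject)

print(dot_default)     # ["'a'"]
print(dot_all)         # ["'a'", "'\n'"]
print(any_char_class)  # ["'a'", "'\n'"]
```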

Inline mode modifier

(?s)'.'

  • Regex options: None
  • Regex flavors: .NET, Java, XRegExp, PCRE, Perl, Python

(?m)'.'

  • Regex options: None
  • Regex flavors: Ruby
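A short Python sketch of the inline modifier follows. It also illustrates why (?m) is listed for Ruby only: in Python, (?m) selects multiline anchors and leaves the dot unchanged. The subject string is made up for illustration.

```python
import re

# A single quoted line break.
subject = "'\n'"

# (?s) at the start of the pattern turns on "dot matches line breaks".
with_s = re.findall(r"(?s)'.'", subject)

# In Python, (?m) only changes ^ and $, so the dot still skips line breaks.
with_m = re.findall(r"(?m)'.'", subject)

print(with_s)  # ["'\n'"]
print(with_m)  # []
```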