Match whole words

Problem

My cat is brown
category
octocat
staccato

  • Find the word ‘cat’
  • Find words that begin with ‘cat’
  • Find words that end with ‘cat’
  • Find words that contain ‘cat’
  • Find words that do not begin with ‘cat’
  • Find words that do not end with ‘cat’
  • Find words that do not contain ‘cat’

Solution

Word boundaries

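\bcat\b

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Nonboundaries

\Bcat\B

  • Regex options: None
  • Regex flavors: .NET, Java, JavaScript, PCRE, Perl, Python, Ruby

Variations and their lookaround equivalents

    \bcat    (?<!\w)(?=\w)cat    ‘cat’ at the start of a word
    cat\b    cat(?<=\w)(?!\w)    ‘cat’ at the end of a word
    \Bcat    (?<=\w)(?=\w)cat    ‘cat’ not at the start of a word
    cat\B    cat(?<=\w)(?=\w)    ‘cat’ not at the end of a word
    \b(?!\w*?cat\w*?)\w+?\b      a word that does not contain ‘cat’

\b -> (?<=\w)(?!\w)|(?<!\w)(?=\w)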
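
To check the patterns against the four sample lines from the Problem, here is a minimal Python sketch. The helper names (lines_matching, match_starts) and the PAIRS table are made up for illustration; the lookaround spellings in PAIRS are literal expansions of ‹\b› and ‹\B› around ‘cat’, written out as an assumption rather than quoted from any reference.

```python
import re

# The four sample lines from the Problem section.
LINES = ["My cat is brown", "category", "octocat", "staccato"]
SUBJECT = "\n".join(LINES)

# Boundary pattern -> lookaround spelling of the same test (illustrative).
PAIRS = {
    r"\bcat": r"(?<!\w)(?=\w)cat",
    r"cat\b": r"cat(?<=\w)(?!\w)",
    r"\Bcat": r"(?<=\w)(?=\w)cat",
    r"cat\B": r"cat(?<=\w)(?=\w)",
}


def lines_matching(pattern):
    """Return the sample lines in which `pattern` matches somewhere."""
    return [line for line in LINES if re.search(pattern, line)]


def match_starts(pattern):
    """Return the start offset of every match in the joined subject."""
    return [m.start() for m in re.finditer(pattern, SUBJECT)]


whole_word = lines_matching(r"\bcat\b")
inside_word = lines_matching(r"\Bcat\B")
without_cat = re.findall(r"\b(?!\w*?cat\w*?)\w+?\b", SUBJECT)

# True for a pair when both spellings hit exactly the same offsets.
equivalent = {p: match_starts(p) == match_starts(q) for p, q in PAIRS.items()}

for pattern in [r"\bcat\b", r"\Bcat\B", *PAIRS]:
    print(f"{pattern:10} {lines_matching(pattern)}")
print("words without 'cat':", without_cat)
print("lookaround spellings agree:", all(equivalent.values()))
```

Only ‘My cat is brown’ matches ‹\bcat\b›, and only ‘staccato’ matches ‹\Bcat\B›, because that is the only line where ‘cat’ has word characters on both sides.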

Discussion

‹\b› matches in these three positions:

  • Before the first character in the subject, if the first character is a word character
  • After the last character in the subject, if the last character is a word character
  • Between two characters in the subject, where one is a word character and the other
    is not a word character

‹\B› matches in these five positions:

  • Before the first character in the subject, if the first character is not a word character
  • After the last character in the subject, if the last character is not a word character
  • Between two word characters
  • Between two nonword characters
  • The empty string
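
The position rules above can be checked with a short Python sketch. The subject ‘My cat!’ and the helper name positions are assumptions chosen for illustration: the subject starts with a word character and ends with a nonword character, so both kinds of edge show up. The empty-subject case is left out because Python's re module has not, as far as I know, treated ‹\B› on an empty string the same way in every version.

```python
import re

# Starts with a word character, ends with a nonword character.
SUBJECT = "My cat!"


def positions(pattern, text):
    """Offsets at which the zero-width `pattern` matches in `text`."""
    return [m.start() for m in re.finditer(pattern, text)]


boundaries = positions(r"\b", SUBJECT)
nonboundaries = positions(r"\B", SUBJECT)
# Lookaround spelling of \b: end of a word, or start of a word.
expanded = positions(r"(?<=\w)(?!\w)|(?<!\w)(?=\w)", SUBJECT)

print(boundaries)     # [0, 2, 3, 6]
print(nonboundaries)  # [1, 4, 5, 7]
print(expanded == boundaries)
```

Every offset from 0 to the length of the subject lands in exactly one of the two lists: ‹\B› matches wherever ‹\b› does not.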

Word Characters

  • Java:
    • Java 4 to 6: ‹\w› matches only ASCII characters.
    • Java 7: ‹\w› also matches Unicode characters, but only if you set the UNICODE_CHARACTER_CLASS flag.
    • All Java versions: ‹\b› is Unicode-enabled and supports any script.
  • .NET, JavaScript, PCRE, Perl, Python, and Ruby:
    • ‹\b› matches between two characters where one is matched by ‹\w› and the other by ‹\W›.
    • ‹\B› matches between two characters where both are matched by ‹\w› or both by ‹\W›.
  • JavaScript, PCRE, and Ruby: ‹\w› is identical to ‹[a-zA-Z0-9_]›, so a “whole words only” search works only for languages written with the unaccented letters A to Z, such as English.
  • .NET: letters and digits from all scripts are word characters, so you can do a “whole words only” search on words in any language.
  • Python 2.x: non-ASCII characters are included only if you pass the UNICODE or U flag when creating the regex.
  • Python 3.x: non-ASCII characters are included by default, but you can exclude them with the ASCII or A flag. This flag affects both ‹\b› and ‹\w› equally.
  • Perl: whether ‹\w› is pure ASCII or includes all Unicode letters, digits, and underscores depends on your version of Perl and on the /adlu flags.
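
The Python 3.x behavior can be seen in a small sketch, assuming str patterns and the standard re module. The subject ‘café’ is an illustrative choice; it is written with an escape so that the é is a single precomposed code point.

```python
import re

SUBJECT = "caf\u00e9"  # "café" with a precomposed é

# Default for str patterns in Python 3: é counts as a word character.
unicode_words = re.findall(r"\b\w+\b", SUBJECT)

# With re.ASCII, both \w and \b treat é as a nonword character.
ascii_words = re.findall(r"\b\w+\b", SUBJECT, re.ASCII)

# A "whole words only" search for 'caf' behaves differently in each mode.
unicode_hit = re.search(r"\bcaf\b", SUBJECT)
ascii_hit = re.search(r"\bcaf\b", SUBJECT, re.ASCII)

print(unicode_words)            # one word: the whole subject
print(ascii_words)              # ['caf']
print(unicode_hit is None)      # True
print(ascii_hit is not None)    # True
```

In ASCII mode the é acts as a word break, so ‘caf’ is wrongly reported as a whole word. This is the failure to expect from any flavor whose ‹\w› is limited to ‹[a-zA-Z0-9_]›.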