github JFLarvoire/the_silver_searcher 2.2.4-Windows
Better support for Unicode

Changes in this release

  • Can now search in UTF-16 text files, in addition to the UTF-8 and Windows system code page files that were already supported.
    Note that searching in UTF-16 files is slower than for the other types of files supported, because the text is converted to UTF-8 first.
  • Can now search in UTF-8 text piped in via standard input, whatever the current console code page is. Previously, only piped text encoded in the current console code page was supported.
  • Converts \xXX, \uXXXX, and \UXXXXXXXX escape sequences in the search pattern to the equivalent Unicode character.
  • Added detailed explanations in the help screen about input and output encoding rules and limitations.
  • Bug fix: The console text color was not restored properly in case Ctrl-C was used to abort a long search.
  • Merged all changes from the Unix master sources as of 2021-06-03 (where the version is still 2.2), including several bug fixes and several new known file types.
  • Updated the MsvcLibX library to version 2021-06-03.
  • Updated the PCRE library to version 8.44.
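
The UTF-16 handling described in the first item can be pictured with the following minimal Python sketch. It is not the code used by ag.exe: detecting UTF-16 by its byte order mark and the helper name to_utf8 are assumptions made for illustration. The BOM is kept as a character during conversion, which is consistent with the BOM search example further down, but that too is an assumption about the implementation.

```python
import codecs

def to_utf8(raw: bytes) -> bytes:
    """Hypothetical helper: re-encode UTF-16 input as UTF-8 before searching.

    Assumption: UTF-16 input is recognized by its byte order mark.
    Anything else is returned unchanged.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):
        # "utf-16-le" keeps the BOM as the character U+FEFF.
        return raw.decode("utf-16-le").encode("utf-8")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw.decode("utf-16-be").encode("utf-8")
    return raw

# The "utf-16" codec writes a BOM followed by text in native byte order.
sample = "price: 10 €\n".encode("utf-16")
converted = to_utf8(sample)

# After conversion, a UTF-8 needle can be searched for directly.
found = "€".encode("utf-8") in converted
print(found)
```

The extra decode and encode pass over the whole input is the kind of overhead that makes this path slower than searching bytes that are already UTF-8.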

Details on the Unicode escape sequences support

The previous versions of ag.exe only handled \xXX escape sequences for defining bytes in regular expressions.
Due to limitations of the PCRE 1 library, the Unicode escape sequences \uXXXX and \UXXXXXXXX were not supported.
Also, no such escape sequence replacement was done when searching for fixed strings.

This version converts \xXX, \uXXXX, and \UXXXXXXXX escape sequences in the pattern string to the equivalent Unicode character.
This is done in the argument processing phase, prior to passing the search pattern to the search functions.
Use option --verbose to display the pattern string generated.
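
As a rough illustration of this conversion step, here is a minimal Python sketch. It is not the code used by ag.exe, and the name expand_escapes is hypothetical. It makes simplifying assumptions: an escaped backslash in front of a sequence is not treated specially, and code points beyond U+10FFFF simply raise an error.

```python
import re

# \xXX (2 hex digits), \uXXXX (4 hex digits), \UXXXXXXXX (8 hex digits).
_ESCAPE = re.compile(
    r"\\x([0-9A-Fa-f]{2})|\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})"
)

def expand_escapes(pattern: str) -> str:
    """Hypothetical helper: replace each escape with its Unicode character."""
    def replace(match: re.Match) -> str:
        # Exactly one of the three groups matched; take its hex digits.
        hex_digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(hex_digits, 16))
    return _ESCAPE.sub(replace, pattern)

# Raw strings keep the backslashes, as a command-line argument would.
print(expand_escapes(r"\u20AC"))       # the Euro sign
print(expand_escapes(r"caf\xE9"))      # "café"
print(expand_escapes(r"\U0001F600"))   # a character outside the BMP
```

Because the substitution happens on the pattern string before any searching starts, the same converted string can be handed to either a regular expression search or a fixed string search.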

This conversion is done when searching for regular expressions, and when searching for fixed strings (using option -F/--fixed-strings).
It is NOT done when using option -Q/--literal. (So in that sense, the -F and -Q options aren't strictly equivalent anymore.)

Example of use: List UTF-8 and UTF-16 text files that contain a Unicode BOM:

ag -l \uFEFF
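
A single code point can match files in both encodings because the BOM is the same character, U+FEFF, stored as different bytes. The short Python sketch below only shows those byte forms; it says nothing about how ag.exe reads files internally.

```python
bom = "\ufeff"

# The same character, serialized in each encoding.
utf8_bom = bom.encode("utf-8")         # bytes EF BB BF
utf16le_bom = bom.encode("utf-16-le")  # bytes FF FE
utf16be_bom = bom.encode("utf-16-be")  # bytes FE FF

for name, data in [("UTF-8", utf8_bom),
                   ("UTF-16 LE", utf16le_bom),
                   ("UTF-16 BE", utf16be_bom)]:
    print(name, data.hex(" "))
```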

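Caution: \x80 to \x9F designate the Unicode C1 control characters U+0080 to U+009F, not the printable characters that CP 1252 assigns to bytes 0x80 to 0x9F.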
To search for the Euro sign, use either "€" or "\u20AC", but not "\x80", even for CP 1252 files.
Likewise, to search for all non-ASCII characters, use "[^\x00-\x7F]", not "[\x80-\xFF]".
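
The two cautions above can be checked with a short Python sketch. It uses Python's own re module rather than PCRE, so it only illustrates the code point arithmetic, not the behavior of ag.exe itself.

```python
import re
import unicodedata

# Byte 0x80 in a CP 1252 file is the Euro sign, whose code point is U+20AC.
euro = b"\x80".decode("cp1252")
print(hex(ord(euro)))

# The code point U+0080 is a different character: a C1 control ("Cc").
print(unicodedata.category("\x80"))

text = "10 € for a café"

# "Everything outside ASCII" catches both the Euro sign and the e-acute.
non_ascii = re.compile(r"[^\x00-\x7F]").findall(text)

# U+0080 to U+00FF only covers the Latin-1 range, so the Euro sign is missed.
latin1_only = re.compile(r"[\x80-\xFF]").findall(text)

print(non_ascii)
print(latin1_only)
```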
