How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér).  I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I end up with "\x{0301}emus\x{0301}er" ( ́emuśer).  How can I reverse the string while still respecting the combining characters?

Solutions/Answers:

Answer 1:

The best answer is to use Unicode::GCString, as Sinan points out.


I modified Chas’s example a bit:

  • Set the encoding on STDOUT to avoid “wide character in print” warnings;
  • I originally used a positive lookahead assertion in split (with no separator retention mode), but that apparently doesn’t work after Perl 5.10, so I removed it.

It’s basically the same thing with a couple of tweaks.

use strict;
use warnings;

binmode STDOUT, ":utf8";

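my $original = "re\x{0301}sume\x{0301}";
# reverse in scalar context flips code points, detaching the accents
my $wrong    = reverse $original;
# \X matches one grapheme; the capture makes split return the graphemes
my $right    = join '', reverse split /(\X)/, $original;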

print <<HERE;
original: [$original]
   wrong: [$wrong]
   right: [$right]
HERE

Answer 2:

You can use the \X escape, which matches an extended grapheme cluster (roughly, a non-combining character plus all of the combining characters that follow it), with split to make a list of graphemes. Because the pattern is captured, the list also contains empty strings between the graphemes, but they vanish in the join. Reverse the list, then join it back together:

#!/usr/bin/perl

use strict;
use warnings;

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";
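
A variation on the same idea, as a minimal sketch: a global match in list context returns the graphemes directly, so the empty strings that split /(\X)/ leaves between them never appear. The helper name reverse_graphemes is made up for illustration and does not come from any of the answers.

```perl
use strict;
use warnings;

# Encode output as UTF-8 to avoid "Wide character in print" warnings.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: reverse a string one grapheme cluster at a time.
sub reverse_graphemes {
    my ($string) = @_;
    # In list context, /\X/g returns every extended grapheme cluster.
    return join '', reverse $string =~ /\X/g;
}

my $original = "re\x{0301}sume\x{0301}";
my $reversed = reverse_graphemes($original);

print "original: $original\n",
      "reversed: $reversed\n";
```

The result should be the same string that the split version produces; whether it reads better is a matter of taste.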

Answer 3:

You can use Unicode::GCString:

Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].

#!/usr/bin/env perl

use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);

use Unicode::GCString;

my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };

say "$x -> $wrong";
say "$y -> $correct";

Output:

résumé -> ́emuśer
résumé -> émusér

Answer 4:

Perl6::Str->reverse also works.

In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.
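
To make the limitation concrete, here is a rough sketch of the normalization approach. The second test string ("q" followed by a combining acute accent) is my own example, picked because Unicode has no precomposed "q with acute"; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode STDOUT, ':encoding(UTF-8)';

# Works: "e" + U+0301 composes to the single code point U+00E9,
# so a plain code point reverse keeps each accent on its letter.
my $composed = NFC("re\x{0301}sume\x{0301}");
my $reversed = reverse $composed;

# Does not work: "q" + U+0301 has no precomposed form, so NFC leaves
# two code points and reverse moves the accent onto the "a".
my $stubborn = NFC("q\x{0301}a");
my $broken   = reverse $stubborn;

print "composed, reversed: $reversed\n",
      "no precomposed form: $broken\n";
```

For strings like the second one you are back to splitting on \X or using Unicode::GCString.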

Answer 5:

Some of the other answers contain elements that don’t work well. Here is a working example tested on Perl 5.12 and 5.14. Failing to specify the binmode will cause print to emit “Wide character” warnings. Using a positive lookahead assertion (and no separator retention mode) in split produces incorrect output on my MacBook.

#!/usr/bin/perl

use strict;
use warnings;
use feature 'unicode_strings';

binmode STDOUT, ":utf8";

my $original = "re\x{0301}sume\x{0301}";
my $wrong    = reverse $original;
my $right    = join '', reverse split /(\X)/, $original;
print "original: $original\n",
      "wrong:    $wrong\n",
      "right:    $right\n";
