
Operations on strings don’t always commute

Earlier today, I saw a tweet linking to the ButFirst module, a Perl module that lets you run a block of code before something else. For example:

# Print a greeting, but first find caffeine.
{
    print "Good morning!\n";
} but first {
    print "I need a coffee\n";
}

This is a bad idea, which the README acknowledges as such (“Any use of this module should be considered a bug”) – but I love fun stuff like this.

I was particularly struck by the last example in the README:

while (<>) {
    print;
} butfirst {
    $_ = reverse $_;
} butfirst {
    $_ = uc $_;
}

This prints a series of lines, with each line reversed and uppercased – but in which order? The README explains in a comment, but I think it’s somewhat ambiguous – I could interpret this as reversing first, or uppercasing first.

That got me wondering: does it matter? Reversing a string and uppercasing a string should be completely orthogonal operations, so we should be able to swap the order with impunity. (This is called commutativity.) That seems reasonable, right?

But strings are rarely reasonable – there are lots of weird corners of Unicode where strings do unexpected things. Are there strings where upper(reverse(s)) ≠ reverse(upper(s))?
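To make the question concrete, here is a minimal Python sketch of the two orderings, checked against a few hand-picked strings. The helper names and the sample strings are my own illustrations, not taken from the ButFirst README, and `[::-1]` reverses by code point, which is only one possible definition of "reverse".

```python
def upper_then_reverse(s: str) -> str:
    """Uppercase the string, then reverse it code point by code point."""
    return s.upper()[::-1]


def reverse_then_upper(s: str) -> str:
    """Reverse the string code point by code point, then uppercase it."""
    return s[::-1].upper()


# Hand-picked samples (illustrative assumptions, not an exhaustive test).
samples = [
    "hello",
    "Good morning!",
    "cafe\u0301",  # 'e' followed by COMBINING ACUTE ACCENT
    "na\u00efve",  # precomposed LATIN SMALL LETTER I WITH DIAERESIS
]

for s in samples:
    # For these strings the order of the two operations makes no difference.
    assert upper_then_reverse(s) == reverse_then_upper(s)
```

Every sample here commutes, including the one with a combining character, which is why checking a handful of strings by hand is not very convincing either way.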

I tried a few examples by hand with combining characters and didn’t get anywhere useful, so I wrote a test to have Hypothesis search for interesting examples instead. (Source code) It tried a few hundred examples, then stumbled upon a string where uppercasing and reversing don’t commute:

>>> 'ﬁ'.upper()[::-1]
'IF'
>>> 'ﬁ'[::-1].upper()
'FI'

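The string here is a single character, the ligature ﬁ (U+FB01). Python uppercases it to the two letters FI, which is correct if you follow the Unicode spec, and feels intuitively fine. But it means the order matters: uppercase first and the reversal sees two characters to swap; reverse first and it sees only one. So we can’t swap the order of these string operations.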
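The ligature is unlikely to be the only offender. As a rough sketch (a standard-library brute-force scan, not the Hypothesis test mentioned above), the following looks at every code point whose uppercase form is longer than one character and keeps the ones where the two orderings disagree. The names `breaks_commutativity`, `expanding` and `offenders` are my own, and the exact list depends on the Unicode version bundled with your Python.

```python
import sys
import unicodedata


def breaks_commutativity(s: str) -> bool:
    """True if uppercasing and reversing give different results depending on order."""
    return s.upper()[::-1] != s[::-1].upper()


# Single characters whose uppercase form expands to several characters.
expanding = [
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if len(chr(cp).upper()) > 1
]

# Of those, the ones where the order of operations changes the result.
offenders = [ch for ch in expanding if breaks_commutativity(ch)]

# Print a few, using ASCII-only escapes so this works on any terminal.
for ch in offenders[:10]:
    name = unicodedata.name(ch, "<unnamed>")
    print(f"U+{ord(ch):04X} {name} -> {ch.upper()!a}")
```

Not every expanding character is an offender. The German ß (U+00DF) uppercases to SS, which reads the same in both directions, so for that character the two orderings still agree.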

Perl’s uc function doesn’t seem to be Unicode-aware, at least in the way I ran it, so uppercasing the ligature returns the same string unmodified. I did write another test to try to find strings where you can’t swap uppercasing and reversing in Perl, but it couldn’t find any examples. Maybe these operations are safe to swap in Perl (but I wouldn’t bet on it).

Either way, this is another reminder that strings can behave in decidedly unintuitive ways. Unicode is complicated, and I only know a fraction of the rough edges.