Saturday, October 5, 2013

Splitting a string while keeping the delimiters except escaped ones (regex)

If I have a String which is delimited by a character, let’s say this:

a-b-c

and I want to keep the delimiters, I can split on look-behind & look-ahead, which match the positions around each delimiter without consuming it, like:

string.split("((?<=-)|(?=-))");

which results in

  • a
  • -
  • b
  • -
  • c
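As a minimal runnable sketch of the split above (the class and method names are made up for illustration, not taken from the original post):

```java
import java.util.Arrays;

class SplitKeepDelimiters {
    // Splits at every position that is directly after or directly before a '-'.
    // Both look-arounds are zero-width, so the '-' itself is never consumed.
    static String[] splitKeepingDelimiters(String input) {
        return input.split("(?<=-)|(?=-)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitKeepingDelimiters("a-b-c")));
        // prints [a, -, b, -, c]
    }
}
```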

Now, if one of the delimiters is escaped, like this:

a-b\-c

and I want to honor the escape, I figured out that I can use a regex like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\-))))

ergo

string.split("((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))");

Now, this works & results in:

  • a
  • -
  • b\-c

(I’d remove the backslash afterwards with string.replace("\\", ""); I haven’t found a way to include that in the regex.)
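A small sketch of the two steps together, assuming that every backslash in the input is an escape character (a plain replace would also strip backslashes that are meant literally). The names SplitHonoringEscapes, SPLIT_REGEX and splitAndUnescape are hypothetical:

```java
import java.util.ArrayList;
import java.util.List;

class SplitHonoringEscapes {
    // The regex from the post, written as a Java string literal.
    static final String SPLIT_REGEX = "((?<=-(?!(?<=\\\\-)))|(?=-(?!(?<=\\\\-))))";

    // Splits around unescaped '-' and then drops the backslashes from each token.
    // Assumption: backslashes only ever appear as escapes.
    static List<String> splitAndUnescape(String input) {
        List<String> tokens = new ArrayList<>();
        for (String token : input.split(SPLIT_REGEX)) {
            tokens.add(token.replace("\\", ""));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The Java literal "a-b\\-c" is the text a-b\-c
        System.out.println(splitAndUnescape("a-b\\-c"));
        // prints [a, -, b-c]
    }
}
```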

My problem is one of understanding.
The way I understood it, the regex would be, in words,

split ((if '-' is before (unless ('\-' is before))) or (if '-' is after (unless ('\-' is before))))

Why shouldn’t the last part be “unless \ is before”? If ‘-’ is after, that means we’re between ‘\’ & ‘-’, so only \ should be before, not \\-, yet it doesn’t work if I alter the regex to reflect that, like this:

((?<=-(?!(?<=\\-)))|(?=-(?!(?<=\\))))

Result: a, -, b\, -c

What is the reason for this? Where is my error in reasoning?

Why shouldn’t the last part be “unless \ is before”?

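In

(?=-(?!(?<=\\-))))
    ^here

the cursor is already after the - when the inner look-behind is tested. So "\ is before" can never be true there, since the character directly before the current position is always the -. The "unless" therefore never applies, and the altered regex splits before every -, escaped or not. To see the backslash from that position, the look-behind has to look past the -, which is why it must be \\- rather than \\.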
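One way to check this is to list the positions where each look-ahead variant matches on its own. This is only a sketch; LookaheadCursorDemo and matchPositions are names invented for the example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCursorDemo {
    // Returns every index at which the (zero-width) regex matches in the input.
    static List<Integer> matchPositions(String regex, String input) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            positions.add(m.start());
        }
        return positions;
    }

    public static void main(String[] args) {
        String input = "a-b\\-c"; // the text a-b\-c, with '-' at index 1 and index 4

        // Altered variant: after consuming '-', the char before the cursor is '-',
        // never '\', so the negative check always passes.
        System.out.println(matchPositions("(?=-(?!(?<=\\\\)))", input));
        // prints [1, 4]

        // Original variant: the look-behind spans "\-" and rejects the escaped '-'.
        System.out.println(matchPositions("(?=-(?!(?<=\\\\-)))", input));
        // prints [1]
    }
}
```

Index 4 is the escaped delimiter, so the altered variant is the one that produces the unwanted split between b\ and -c.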


Maybe an easier regex would be

(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)

  • (?<=(?<!\\\\)-) checks whether we are right after a - that has no \ before it.
  • (?=(?<!\\\\)-) checks whether we are right before a - that has no \ before it.
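A short sketch of this simpler regex in use. The pattern is already in Java string-literal form (\\\\ in the source is one backslash in the regex); the class and method names are hypothetical:

```java
import java.util.Arrays;

class SimplerEscapeAwareSplit {
    // Split right after or right before a '-' that is not preceded by a backslash.
    static final String SPLIT_REGEX = "(?<=(?<!\\\\)-)|(?=(?<!\\\\)-)";

    static String[] split(String input) {
        return input.split(SPLIT_REGEX);
    }

    public static void main(String[] args) {
        // The Java literal "a-b\\-c" is the text a-b\-c
        System.out.println(Arrays.toString(split("a-b\\-c")));
        // prints [a, -, b\-c]
    }
}
```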
