I would like to remove any extra whitespace from my code while parsing a docblock. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.

For example, I use this to remove extra whitespace:

// Collapse each run of two or more spaces into a single space.
$string = preg_replace('/[ ]{2,}/', ' ', $string);

But I would like to keep the whitespace within <code></code>.

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

Comments

You might want to consider writing a simple parser for this. At the very least, you need to distinguish lines outside the code block from those inside it, and you can't do that with a single regexp.

Written by poke

You don’t need to use a character class for a single character; just write / {2,}/.

Written by Gumbo

Regular expressions allow for conditions, (?(x)y|z), but I have no idea how to apply that to match either line-wise or in blocks. You are better off iterating line-wise over the source text, setting and resetting a state flag on occurrences of </?code>, and applying the regex /^\s{2,}/ to a line only while the flag says you are outside a code block.

Written by mario
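
The line-wise approach from the comment above could look roughly like the sketch below. It is not code from the thread: the function name collapseOutsideCode is hypothetical, and it assumes the <code> and </code> tags sit on their own lines, so any line containing a tag is left untouched.

```php
<?php
// Hypothetical helper illustrating the state-flag idea; not from the thread.
// Assumes <code> and </code> each appear on their own line.
function collapseOutsideCode(string $text): string
{
    $inCode = false;
    $out = [];
    foreach (explode("\n", $text) as $line) {
        // Entering a code block: raise the flag before touching the line.
        if (!$inCode && stripos($line, '<code>') !== false) {
            $inCode = true;
        }
        if (!$inCode) {
            // Strip spaces at both ends, then squeeze inner runs to one space.
            $line = preg_replace('/ {2,}/', ' ', trim($line, ' '));
        }
        // Leaving a code block: lower the flag after the line is kept as-is.
        if ($inCode && stripos($line, '</code>') !== false) {
            $inCode = false;
        }
        $out[] = $line;
    }
    return implode("\n", $out);
}
```

Calling collapseOutsideCode($string) on the docblock text returns the cleaned text. Checking the opening tag before the cleanup and the closing tag after it means the tag lines themselves are never modified.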

@mario I was going to write that as an answer... please do that so I can upvote it :)

Written by alex

@alex: Too lazy. You go and write it and I vote it up! :P

Written by mario

@mario Done :P

Written by alex

Accepted Answer

You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

How does it work?

  • (<code>.*?</code>) - match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but the pattern can be made more elaborate if needed.
  • ^ + - match and remove spaces at the beginnings of lines.
  • [ ]+$ - match and remove spaces at the ends of lines (written as " +$" in the pattern; the brackets here only make the space visible).
  • ( ) + - match two or more spaces in the middle of lines, and capture the first one into the second group, $2.

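The replacement string, $1$2, keeps <code> blocks and the first space if captured, and removes anything else that matched. As for the flags: s lets . match newlines so a <code> block can span several lines, m makes ^ and $ match at every line boundary, and i makes the tag match case-insensitive.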
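
To make the behaviour concrete, here is a small self-contained run of the same pattern on the sample text from the question. The variable names $str and $clean are just for illustration.

```php
<?php
// Sample input copied from the question (nowdoc, so nothing is interpolated).
$str = <<<'TXT'
This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>
TXT;

// Group 1 swallows whole <code> blocks and puts them back unchanged;
// group 2 keeps a single space from each inner run of spaces.
$clean = preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);

echo $clean, "\n";
```

The two text lines come out as "This is some text" and "This is also some text", while the indentation inside the <code> block is preserved, because the whole block is consumed by the first alternative before the space rules can see it.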

Things to remember:

  • If $1 or $2 didn't capture anything, it is replaced with an empty string.
  • Alternations (a|b|c) are tried from left to right - as soon as one alternative matches, the engine is satisfied and doesn't try the others. That is why ^ +| +$ must come before ( ) +.
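
The ordering point can be checked with a short experiment. This is only an illustrative sketch on a made-up input line: it drops the <code> group, so the captured space is $1 here, and the reordered pattern is deliberately wrong.

```php
<?php
$line = "  indented   text  ";

// Order used in the answer: the line-edge alternatives are tried first,
// so edge spaces match without capturing and are replaced by nothing.
$good = preg_replace('~^ +| +$|( ) +~m', '$1', $line);

// Reordered for illustration only: "( ) +" now also wins at the line edges,
// so one captured space survives at each end.
$bad = preg_replace('~( ) +|^ +| +$~m', '$1', $line);

var_dump($good); // string(13) "indented text"
var_dump($bad);  // string(15) " indented text "
```

The first call also shows the empty-group rule: at the line edges $1 never captures, so the match is replaced by an empty string.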

Working example: http://ideone.com/HxbaV

Written by Kobi
The content is written by members of the stackoverflow.com community.
It is licensed under cc-wiki