RegEx Corner: Greedy Expressions

Back-end Development

I recently came across a need to mimic a Drupal entity reference field within a settings form. Because of the large number of items in the referenced taxonomy, I wanted my settings form to use something like the core autocomplete widget as the UI.

Using a field widget outside the context of an entity editing form is possible in Drupal: you mock up a fake entity and convince the widget that it is on an edit form. But this approach takes a lot of code, and because it is not an officially supported method, it could cause regressions in the future. So instead, I wanted merely to mimic the behavior of the form element myself.

One aspect of this UX pattern is that Drupal puts the results of the autocomplete operation in a regular text field before submission. This means that the submit handler for the form must be able to interpret those results, even though they have been "pretty-printed" for human readers. The contents of the text field look something like this:

First selection (52), Second selection (97)

The name of the selected item is displayed, then the entity ID within parentheses. We need to extract those entity IDs so that we can save the result.

This is not a horribly hard problem to solve from scratch, but I wanted to mimic the behavior of the core widget as much as possible, so I went looking for how this is done normally. In the EntityAutocomplete class we find:

public static function extractEntityIdFromAutocompleteInput($input) {
  $match = NULL;

  if (preg_match("/.+\s\(([^\)]+)\)/", $input, $matches)) {
    $match = $matches[1];
  }

  return $match;
}

So as a start, we can run that method against the field content. For the example above, it returns '97': only one ID, and the last one at that. A single result is to be expected, of course, since we are calling preg_match, which stops after the first match. What happens if we update this to call preg_match_all instead?

if (preg_match_all("/.+\s\(([^\)]+)\)/", $input, $matches, PREG_PATTERN_ORDER)) {
  $results = $matches[1];
}

By specifying PREG_PATTERN_ORDER, we should get an array of all the full-pattern matches in $matches[0], all the matches for the first subpattern in $matches[1], and so on. But we still only get one result, ['97']. Why?

The clue is that this is the last ID in the input string. Welcome to the world of "greedy" quantifiers!

The first sequence in the regular expression, .+, should be a familiar one. It matches one or more characters of any kind (other than a newline, by default). But the + quantifier (like the * one) is greedy: it matches as many characters as possible, as long as the rest of the expression still matches. So the .+ gobbles up everything up to the space before the very last open parenthesis, the single match spans the whole input, and our result is just the last ID.
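To see the greediness directly, it can help to look at the full-pattern match as well as the captured group. The snippet below is a small standalone sketch (the variable names are mine, not from core) that runs the unmodified pattern over the example input:

```php
<?php
// The example input from above and the original (greedy) pattern.
$input = 'First selection (52), Second selection (97)';
$pattern = '/.+\s\(([^\)]+)\)/';

preg_match_all($pattern, $input, $matches, PREG_PATTERN_ORDER);

// $matches[0] holds the full-pattern matches, $matches[1] the captured IDs.
$full_matches = $matches[0];
$ids = $matches[1];

print_r($full_matches); // a single entry: the entire input string
print_r($ids);          // a single entry: '97'
```

Because the one and only match already covers the whole string, there is nothing left for preg_match_all to scan for a second match.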

There are lots of ways we can solve this problem. It turns out that Drupal core first passes the input through an explode, and runs the regex separately on each independent string. This works! But what if we want the regex to do it all at once?
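Before moving on, here is roughly what the split-first approach looks like. This is only a sketch: the function name is hypothetical, and it uses PHP's plain explode on commas, so it assumes that labels never contain a comma. Core's own splitting may handle quoting and other edge cases differently.

```php
<?php
// Hypothetical helper, not core's implementation: split the field content
// on commas, then apply the original greedy pattern to each piece.
function extract_ids_by_splitting(string $input): array {
  $ids = [];
  foreach (explode(',', $input) as $piece) {
    // Each piece holds at most one "Label (ID)" pair, so greediness
    // is harmless here.
    if (preg_match('/.+\s\(([^\)]+)\)/', trim($piece), $matches)) {
      $ids[] = $matches[1];
    }
  }
  return $ids;
}

$ids = extract_ids_by_splitting('First selection (52), Second selection (97)');
print_r($ids); // '52' and '97'
```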

We could modify the expression to make that first operator non-greedy:

if (preg_match_all("/.+?\s\(([^\)]+)\)/", $input, $matches, PREG_PATTERN_ORDER)) {
  $results = $matches[1];
}

Now the .+? matches as few characters as possible, as long as the rest of the expression still matches. So the first match ends at the first ID, and preg_match_all is then able to continue on its merry way from that point, finding the rest in turn.
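Looking at the full-pattern matches shows where each match starts and stops. Again, this is just a standalone sketch with my own variable names:

```php
<?php
$input = 'First selection (52), Second selection (97)';

// Same pattern as before, but with the lazy quantifier +? in front.
preg_match_all('/.+?\s\(([^\)]+)\)/', $input, $matches, PREG_PATTERN_ORDER);

$full_matches = $matches[0];
$ids = $matches[1];

print_r($full_matches); // 'First selection (52)' and ', Second selection (97)'
print_r($ids);          // '52' and '97'
```

Note that the second match picks up right after the first closing parenthesis, so it includes the leading comma and label text.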

But we don't actually need these leading characters at all; we are just looking for the part in the parentheses. Because the pattern is not anchored, the regex engine can start a match anywhere in the input, so it's enough to require a single character in front of the space, which is what makes the input valid. We can be slightly more efficient by ditching the quantifier entirely:

if (preg_match_all("/.\s\(([^\)]+)\)/", $input, $matches, PREG_PATTERN_ORDER)) {
  $results = $matches[1];
}

Running this on our input First selection (52), Second selection (97), we get the result we desired: ['52', '97']. Et voilà!
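For use in a submit handler, the final pattern could be wrapped in a small helper along these lines. The function name is hypothetical and this is a minimal sketch, assuming the field content follows the "Label (ID)" format shown above; it returns the IDs as strings, exactly as the regex captures them.

```php
<?php
// Hypothetical helper: pull every parenthesized ID out of the field content
// in a single pass, using the quantifier-free pattern.
function extract_all_entity_ids(string $input): array {
  if (preg_match_all('/.\s\(([^\)]+)\)/', $input, $matches, PREG_PATTERN_ORDER)) {
    // $matches[1] holds the contents of each pair of parentheses.
    return $matches[1];
  }
  // No match (or a regex error): report no IDs.
  return [];
}

$ids = extract_all_entity_ids('First selection (52), Second selection (97)');
print_r($ids); // '52' and '97'
```

With this pattern each full match shrinks to just the last character of the label, the space, and the parenthesized ID, for example 'n (52)'.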