Regular expressions are an invaluable development tool, and also extremely handy for non-developers who need to comb through plain text in an editor. In this article, we'll look at a simple regex problem and dissect a possible solution.
The Problem
We have a complex piece of code that harvests data from an enormous CSV file with more than a hundred columns. We want to know which columns in the CSV file are actually used by the code.
The Text
if (trim($record[73]) && in_array(trim($record[74]), $valid_types)) { // alternate product
  if (array_key_exists($record[73], $alternates)) {
    $alternate = $alternates[$record[73]];
    $alternates[$record[73]]->present = true;
  } else {
    $alternate = entity_create('field_collection_item', array('field_name' => 'field_alternates')); // Create new field collection item.
    $alternate->setHostEntity('node', $node); // Attach it to the node.
    $alternate->field_unit[LANGUAGE_NONE][0]['value'] = $record[73];
  }
  $alternate->field_use[LANGUAGE_NONE][0]['value'] = $record[74];
  $alternate->field_count_per_primary[LANGUAGE_NONE][0]['value'] = $record[75];
  $alternate->field_upcharge_pct[LANGUAGE_NONE][0]['value'] = $record[76];
  ...
The Regex
If we assume that each parsed row of the CSV file is stored in the $record variable, then writing a regex to grab the array index is pretty simple. We can write:
\$record\[([0-9]+)\]
Let's break this down. The backslashes in front of $, [, and ] are necessary because we are searching for those actual characters, and all three of those have special meaning in regular expressions. The parentheses tell the regular expression that we want to "capture" whatever is inside them; that part of the matched pattern will be called $1 since it is the first subpattern. Inside the parentheses we have [0-9]+, which means "one or more characters, each of which is between 0 and 9 inclusive".
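To see the capture group at work outside an editor, here is a minimal sketch in Python. The language choice and the shortened `php_source` string are assumptions made for illustration; any language with a regex library would do.

```python
import re

# A shortened, made-up stand-in for the PHP import code shown above.
php_source = """
if (trim($record[73]) && in_array(trim($record[74]), $valid_types)) {
  $alternate->field_count_per_primary[LANGUAGE_NONE][0]['value'] = $record[75];
}
"""

# A raw string keeps the backslashes intact for the regex engine.
pattern = re.compile(r"\$record\[([0-9]+)\]")

# With exactly one capture group, findall returns only the captured text.
indices = pattern.findall(php_source)
print(indices)  # ['73', '74', '75']
```

Because the pattern has a single pair of parentheses, each result is just the digits, not the whole `$record[...]` match.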
If we're writing a script to extract the numbers, this is all we need. Let's complicate things a bit, though, by doing all of our work inside a text editor. I'm using TextMate, but the same approach should work in similar editors like Sublime Text, although the flag syntax may differ. In a text editor, we can use a regex to match everything up to and including each column index, and replace that match with just the index and a newline. This way, we'll get a nice list of just the column indices.
Search:
(?m).*?\$record\[([0-9]+)\]
Replace:
$1\n
So what's changed? Our search pattern is the same, except for a new prefix. First, let's look at the .*? part. The regex .* should be familiar to anyone who's used regular expressions: a dot means "any character," and a star means "zero or more of those," so .* means "any string." By default, quantifiers like * and + are "greedy," which means they match as much as they possibly can. This means the .* would swallow all the text up to the very last instance of $record in the document, which is clearly not what we want. A question mark after a quantifier makes it "non-greedy." By adding it, we match the shortest string we possibly can: just the stuff leading up to the next $record.
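The difference is easy to check in a few lines. This is a small sketch in Python, with a made-up one-line `text` sample; the greedy prefix runs to the last index, while the non-greedy prefix stops at the first.

```python
import re

# Hypothetical sample containing two column references.
text = "a = $record[73]; b = $record[74];"

# Greedy: ".*" grabs as much as it can, so the capture lands on the LAST index.
greedy = re.match(r".*\$record\[([0-9]+)\]", text)

# Non-greedy: ".*?" grabs as little as it can, so the capture lands on the FIRST index.
lazy = re.match(r".*?\$record\[([0-9]+)\]", text)

print(greedy.group(1))  # 74
print(lazy.group(1))    # 73
```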
What about that (?m) at the front? Well, by default, a . character does not match against line breaks, so our regex would only affect a line at a time. (?m) puts TextMate's regex engine into "multiline" mode, in which . matches newlines as well. Note: This meaning of (?m) comes from the Ruby-style syntax of TextMate's engine (Oniguruma) and is not universal. Most other implementations call this feature "dotall" or "single-line" mode and enable it with an s flag: (?s) inline, or a trailing "s" modifier in PHP, for example. In those engines, (?m) instead changes how ^ and $ behave.
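As a quick illustration of the flag's effect, here is a sketch in Python, where the equivalent inline flag is spelled (?s) rather than TextMate's (?m). The two-line `text` sample is made up.

```python
import re

# Hypothetical sample: the column reference sits on the second line.
text = "first line\n$record[73]"

# Without the flag, "." cannot cross the line break, so the prefix never reaches $record.
no_flag = re.match(r".*?\$record\[([0-9]+)\]", text)
print(no_flag)  # None

# With (?s), "." also matches newlines, so the prefix can span lines.
with_flag = re.match(r"(?s).*?\$record\[([0-9]+)\]", text)
print(with_flag.group(1))  # 73
```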
Finally, we have the replacement pattern. $1 means "the stuff inside the first subpattern," which is the column index in our case. The \n is a newline character. So, all the stuff leading up to the column index, plus the column index itself, is replaced with the column index and a newline. Any text after the final match is not part of any match, so it is left in place and has to be deleted by hand.
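The whole search-and-replace can be approximated in a script as well. The following is a sketch in Python under two assumptions: Python writes the dot-matches-newline flag as (?s) and the backreference as \1 instead of $1, and `source` is a shortened, made-up excerpt of the PHP code.

```python
import re

# Shortened, hypothetical excerpt of the code being searched.
source = (
    "if (trim($record[73]) && in_array(trim($record[74]), $valid_types)) {\n"
    "  $unit = $record[73];\n"
    "}\n"
)

# Replace each "prefix + $record[N]" match with just "N" and a newline.
result = re.sub(r"(?s).*?\$record\[([0-9]+)\]", r"\1\n", source)
print(result)
```

Here `result` starts with the lines 73, 74, and 73. The text after the last match (`;` and the closing brace) is still there, because the pattern only consumes text that leads up to a `$record[...]`.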
The Result
73
74
73
73
73
73
74
75
76
We can pass this result through sort and uniq to get rid of duplicates. Done!
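If the list ends up in a script rather than a shell pipeline, the same de-duplication might look like this in Python. It is a sketch; `indices` simply restates the result above.

```python
# The raw result from the search-and-replace, duplicates included.
indices = ["73", "74", "73", "73", "73", "73", "74", "75", "76"]

# set() drops duplicates; key=int sorts numerically rather than as text,
# which matters once the indices have different numbers of digits.
unique_sorted = sorted(set(indices), key=int)
print(unique_sorted)  # ['73', '74', '75', '76']
```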