the Perl 'split' operator and related operators for string processing
Kragen Sitaker
kragen@pobox.com
Sun, 27 Jan 2002 03:38:01 -0500 (EST)
Perl's 'split' operator is awfully useful, but it could be generalized.
split turns a string into a sequence of strings by chopping the string
up at occurrences of a pattern. If the pattern contains capturing
parentheses, it also includes the matched occurrences of the pattern
in the sequence, nearly doubling the number of items in that sequence.
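Python's re.split does the same thing when the pattern contains a capturing group, so it makes a handy illustration. The sample string is made up:

```python
import re

text = "a, b,c"
# Without a capturing group, only the text between delimiters comes back.
plain = re.split(r",\s*", text)
# With a capturing group, each matched delimiter is returned as well.
with_delims = re.split(r"(,\s*)", text)

print(plain)        # ['a', 'b', 'c']
print(with_delims)  # ['a', ', ', 'b', ',', 'c']
```

Three fields become five items: n fields plus the n-1 delimiters between them.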
Perl's <> operator and Python's file.readline method turn a file into
a sequence of lines by chopping the file up at occurrences of a
pattern, typically "\n". They include the matched pattern on the end
of each chopped bit, but it is very common (using chomp or chop or
[:-1]) to drop it, and forgetting to do so is a very frequent cause of
bugs.
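A tiny demonstration of the retained terminator, using io.StringIO as a stand-in for a real file (the contents are invented):

```python
import io

# io.StringIO stands in for an open text file here.
f = io.StringIO("first\nsecond\n")
raw = f.readline()           # the terminator is still attached
stripped = raw.rstrip("\n")  # the usual cleanup step

print(repr(raw))       # 'first\n'
print(repr(stripped))  # 'first'
```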
Suppose split lazily iterated over a sequence of (inbetweentext,
delimiter) pairs, where the last pair would have a delimiter of ''.
(Having a delimiter regex that could actually match '' would make this
correct.) Then you could conceptualize the ordinary loop:
    while (<>) {
        chomp;
        do_something($_);
    }
as
    for line, _ in split(file.contents):
        do_something(line)
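Here's a rough sketch of what such a split might look like as a Python generator. The name split_pairs and its argument order are my own invention for illustration, and I'm assuming that empty matches of the delimiter pattern are simply skipped:

```python
import re

def split_pairs(pattern, text):
    """Lazily yield (inbetweentext, delimiter) pairs from text.

    The final pair always has '' as its delimiter.
    """
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches of the pattern
        yield text[pos:m.start()], m.group()
        pos = m.end()
    yield text[pos:], ""

contents = "one\ntwo\nthree"
pairs = list(split_pairs(r"\n", contents))
print(pairs)  # [('one', '\n'), ('two', '\n'), ('three', '')]

# The chomp-free loop: the delimiter is just ignored.
lines = [line for line, _ in split_pairs(r"\n", contents)]
print(lines)  # ['one', 'two', 'three']
```

Because it's a generator, nothing past the current pair is computed until the loop asks for it.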
There's the interesting question as to whether the last pair should be
required to have a delimiter of '' or not, even if the string ends
with the separator. Always emitting that final pair is correct for
some applications but not for others; notably, it's wrong for Unix
text files, where "\n" terminates each line, so a file ending in a
newline would appear to have an extra empty line. (Correct for MS-DOS
text files, though.)
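One way to leave the choice to the caller is a flag. This is only a sketch: split_records and its terminated parameter are hypothetical names, and empty matches of the pattern are again assumed to be skipped.

```python
import re

def split_records(pattern, text, terminated=False):
    """Yield (inbetweentext, delimiter) pairs.

    With terminated=True the pattern is treated as a terminator, so no
    empty ('', '') pair is emitted when the text ends with a delimiter.
    """
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        yield text[pos:m.start()], m.group()
        pos = m.end()
    tail = text[pos:]
    if tail or not terminated:
        yield tail, ""

unix_file = "a\nb\n"
as_separator = list(split_records(r"\n", unix_file))
as_terminator = list(split_records(r"\n", unix_file, terminated=True))
print(as_separator)   # [('a', '\n'), ('b', '\n'), ('', '')]
print(as_terminator)  # [('a', '\n'), ('b', '\n')]
```

A last line with no newline still comes through in terminator mode, with '' as its delimiter.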
(Possibly the second item of each pair should be something with data
about the regular-expression match; in Python, it'd be a match object
in every case except possibly for the last, where it would be None.)
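A sketch of that variant, with the hypothetical name split_matches; the second item is an re match object, or None for the last pair. Empty matches are skipped, which is my assumption rather than anything settled above.

```python
import re

def split_matches(pattern, text):
    """Yield (inbetweentext, match) pairs; the last match is None."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        yield text[pos:m.start()], m
        pos = m.end()
    yield text[pos:], None

summary = []
for field, m in split_matches(r"[,;]", "a,b;c"):
    # The match object knows which delimiter matched and where.
    summary.append((field, m.group() if m else None, m.start() if m else None))

print(summary)  # [('a', ',', 1), ('b', ';', 3), ('c', None, None)]
```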
For some applications of this, you'd want a flattening version of map,
which concatenated the lists produced for individual items into a
single long list. (It's easy to define normal map and filter in terms
of this map.) Norvig defines it in one line in
http://www.norvig.com/python-lisp.html for his translation of the
first example from Paradigms of AI Programming, which suggests
that it's a generally useful routine; Perl users get it for free
because of Perl's autoflattening.
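    from functools import reduce  # reduce is not a builtin in Python 3

    def mappend(fn, list):
        "Append the results of calling fn on each element of list."
        # The [] initial value makes an empty input return [] instead of raising.
        return reduce(lambda x, y: x + y, map(fn, list), [])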
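To back up the claim that normal map and filter fall out of the flattening map, here is a small sketch. mappend is restated so the block runs on its own, with an initial [] added for the empty case; map_via_mappend and filter_via_mappend are names I made up.

```python
from functools import reduce

def mappend(fn, items):
    "Append the results of calling fn on each element of items."
    return reduce(lambda x, y: x + y, map(fn, items), [])

def map_via_mappend(fn, items):
    # Wrap each result in a one-element list; concatenation rebuilds the list.
    return mappend(lambda x: [fn(x)], items)

def filter_via_mappend(pred, items):
    # Keep x as [x] when pred holds, contribute [] otherwise.
    return mappend(lambda x: [x] if pred(x) else [], items)

print(mappend(lambda s: s.split(), ["a b", "c"]))          # ['a', 'b', 'c']
print(map_via_mappend(lambda x: x * x, [1, 2, 3]))         # [1, 4, 9]
print(filter_via_mappend(lambda x: x % 2, [1, 2, 3, 4]))   # [1, 3]
```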
This same modified split operator is useful for finding all matches
for a regex in a string, too.
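For instance, ignoring the in-between text and keeping only the delimiters gives you every match. This sketch restates a pair-yielding split under the hypothetical name split_pairs (skipping empty matches, by assumption) and builds an equally hypothetical find_all on top of it:

```python
import re

def split_pairs(pattern, text):
    """Lazily yield (inbetweentext, delimiter) pairs; the last delimiter is ''."""
    pos = 0
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # assumption: ignore empty matches
        yield text[pos:m.start()], m.group()
        pos = m.end()
    yield text[pos:], ""

def find_all(pattern, text):
    # Drop the in-between text and the final '' delimiter.
    return [delim for _, delim in split_pairs(pattern, text) if delim]

numbers = find_all(r"\d+", "a1b22c333")
print(numbers)  # ['1', '22', '333']
```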