the Perl 'split' operator and related operators for string processing

Kragen Sitaker kragen@pobox.com
Sun, 27 Jan 2002 03:38:01 -0500 (EST)


Perl's 'split' operator is awfully useful, but it could be generalized.

split turns a string into a sequence of strings by chopping the string
up at occurrences of a pattern.  If the pattern contains capturing
parentheses, it also includes the captured delimiter text in the
sequence, roughly doubling the number of items in that sequence.
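
Python's re.split behaves the same way with respect to capturing
groups, so it makes a convenient illustration.  The sample string here
is made up for the example:

```python
import re

text = "alpha, beta,gamma"

# Without a capturing group, only the text between delimiters comes back.
fields = re.split(r",\s*", text)

# With a capturing group, each matched delimiter is interleaved with the fields.
fields_and_delims = re.split(r"(,\s*)", text)
```

Three fields come back with two delimiters between them, so the count
goes from n to 2n-1 rather than exactly doubling.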

Perl's <> operator and Python's file.readline method turn a file into
a sequence of lines by chopping the file up at occurrences of a
pattern, typically "\n".  They include the matched pattern on the end
of each chopped bit, but it is very common (using chomp or chop or
[:-1]) to drop the delimiters, and forgetting to do so is a very
frequent cause of bugs.
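
A small sketch of that bug in Python, using io.StringIO as a stand-in
for a real file (the file contents are invented for the example):

```python
import io

f = io.StringIO("yes\nno\n")   # stands in for a real file

raw = f.readline()             # the "\n" delimiter is still attached
stripped = raw.rstrip("\n")    # roughly what Perl's chomp does

# Comparing the raw line against the bare word silently fails.
matches_raw = (raw == "yes")
matches_stripped = (stripped == "yes")
```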

Suppose split lazily iterated over a sequence of (inbetweentext,
delimiter) pairs, where the last pair would have a delimiter of ''.
(Having a delimiter regex that could actually match '' would make this
correct.)  Then you could conceptualize the ordinary loop:

    while (<>) {
        chomp;
        do_something($_);
    }

as

    for line, _ in split(file.contents):
        do_something(line)
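
One way such an operator might be sketched in Python is as a generator
built on re.finditer.  The name split_pairs and its argument order are
hypothetical, and the sketch assumes the pattern never matches the
empty string:

```python
import re

def split_pairs(pattern, text):
    """Lazily yield (inbetweentext, delimiter) pairs; the last delimiter is ''."""
    pos = 0
    for m in re.finditer(pattern, text):
        # Text since the previous delimiter, then the delimiter itself.
        yield text[pos:m.start()], m.group(0)
        pos = m.end()
    # Whatever follows the final delimiter, paired with an empty delimiter.
    yield text[pos:], ""

pairs = list(split_pairs(r"\n", "one\ntwo\nthree"))
lines = [line for line, _ in split_pairs(r"\n", "one\ntwo\nthree")]
```

Because it is a generator, nothing past the current pair is scanned
until the loop asks for the next one.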

There's the interesting question of whether a final pair with a
delimiter of '' should always be produced, even if the string ends
with the separator (in which case its text would be '' too).  That's
correct for some applications but not for others; notably, it's wrong
for Unix text files, where the newline terminates each line.  (Correct
for MS-DOS text files, though.)
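
To make the difference concrete, here is a sketch that treats the
newline as a terminator by dropping a final pair that is empty on both
sides.  Both function names are hypothetical, and split_pairs is just
one possible reading of the operator described above:

```python
import re

def split_pairs(pattern, text):
    """Yield (inbetweentext, delimiter) pairs; the last delimiter is ''."""
    pos = 0
    for m in re.finditer(pattern, text):
        yield text[pos:m.start()], m.group(0)
        pos = m.end()
    yield text[pos:], ""

def unix_lines(text):
    """Yield lines, treating newline as a terminator rather than a separator."""
    for line, delim in split_pairs(r"\n", text):
        if line == "" and delim == "":
            return  # the text ended with a newline, so there is no extra line
        yield line

separator_view = list(split_pairs(r"\n", "a\nb\n"))
terminator_view = list(unix_lines("a\nb\n"))
unterminated = list(unix_lines("a\nb"))
```

The separator view reports a third, empty item after the final
newline; the terminator view does not, and still copes with a last
line that lacks its newline.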

(Possibly the second item of each pair should be something with data
about the regular-expression match; in Python, it'd be a match object
in every case except possibly for the last, where it would be None.)
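
A sketch of that variant, again with a hypothetical name
(split_matches) and assuming the pattern never matches the empty
string:

```python
import re

def split_matches(pattern, text):
    """Yield (inbetweentext, match) pairs; the final pair's match is None."""
    pos = 0
    for m in re.finditer(pattern, text):
        yield text[pos:m.start()], m
        pos = m.end()
    yield text[pos:], None

# With a match object in hand, the caller can see which delimiter was used.
result = [(chunk, m.group(0) if m else None)
          for chunk, m in split_matches(r"\r?\n", "a\r\nb\nc")]
```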

For some applications of this, you'd want a flattening version of map,
which concatenated the lists produced for individual items into a
single long list.  (It's easy to define normal map and filter in terms
of this map.)  Norvig defines it in one line in
http://www.norvig.com/python-lisp.html for his translation of the
first example from "Paradigms of AI Programming", which suggests
that it's a generally useful routine; Perl users get it for free
because of Perl's autoflattening.

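    from functools import reduce  # needed in Python 3, where reduce is not a builtin

    def mappend(fn, list):
        "Append the results of calling fn on each element of list."
        # The [] initial value keeps this from failing on an empty list.
        return reduce(lambda x, y: x + y, map(fn, list), [])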
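
The claim that ordinary map and filter fall out of the flattening map
can be sketched as follows.  The names map_via_mappend and
filter_via_mappend are made up for this example, and mappend is
restated here with an empty-list initial value so the block runs on
its own:

```python
from functools import reduce

def mappend(fn, items):
    """Concatenate the lists produced by calling fn on each item."""
    return reduce(lambda x, y: x + y, map(fn, items), [])

def map_via_mappend(fn, items):
    # Wrap each result in a one-element list so flattening gives a plain map.
    return mappend(lambda item: [fn(item)], items)

def filter_via_mappend(pred, items):
    # One-element list for items that pass, empty list for the rest.
    return mappend(lambda item: [item] if pred(item) else [], items)

words = mappend(lambda s: s.split(), ["a b", "c"])
doubled = map_via_mappend(lambda x: x * 2, [1, 2, 3])
odds = filter_via_mappend(lambda x: x % 2 == 1, [1, 2, 3, 4])
```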



This same modified split operator is useful for finding all matches
for a regex in a string, too: the matches are simply the delimiters
of every pair except the last.
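
As a sketch of that use, under the same assumptions as before
(hypothetical names, and a pattern that never matches the empty
string), the matches are the delimiter halves of every pair but the
final one:

```python
import re

def split_pairs(pattern, text):
    """Yield (inbetweentext, delimiter) pairs; the last delimiter is ''."""
    pos = 0
    for m in re.finditer(pattern, text):
        yield text[pos:m.start()], m.group(0)
        pos = m.end()
    yield text[pos:], ""

def all_matches(pattern, text):
    """Collect the delimiters of every pair except the final ('...', '') one."""
    return [delim for _, delim in split_pairs(pattern, text)][:-1]

numbers = all_matches(r"\d+", "tel 555 ext 1234 rm 7")
```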