From kragen at pobox.com Mon Apr 2 03:37:01 2007
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Mon Apr 2 03:37:03 2007
Subject: APL, Python, redundancy, and polymorphism
Message-ID: <20070304211133.635AFE3410A@panacea.canonical.org>

(When reading this, please keep in mind I've never written an interesting program in APL or PHP, so I am speaking from ignorance.)

APL goes far out of its way to generalize the meanings of things. For example, it included a factorial operator (missing in dialects like A+ and K), but rather than give an error when fed a floating-point number, it would give the value of the gamma function at the point offset by 1 from the argument --- so it coincides with factorial for integers, but is more general. The iota operator is fairly useless (I think) applied to non-scalars, but it is defined so that its useful scalar behavior is just a special case of its more general behavior when fed vectors of any length, pretending that the scalar is a vector of length 1.

Along the same lines, APL's representation of booleans as 0 and 1 allows you to use +/ to count the number of times something is true; in K, the 'and' and 'or' functions have additionally been dropped in favor of the dyadic floor and ceiling functions, which act as 'and' and 'or' over the domain of 0 and 1.

And then there's its infamous use of every built-in function character to denote two different functions, depending on whether it is applied to two arguments or one. Some characters have even more meanings; I think / has three (one as the "reduce" operator), and in K I think it has four.

These aspects of its design are meant to remove redundancy from your program, making it shorter and more likely to mean something when you screw it up, so it gives the wrong answer rather than simply raising an exception. In general, I do not like this tendency to err quietly.

I was reminded of this language strategy of nonredundancy while reading the documentation for ocamllex.
Ocamllex is a lexer generator similar to Mike Lesk's lex, and in the regular expression patterns that define tokens, you can bind a variable to a part of the pattern, and there are some rules that define what the type of the variable is --- whether it's a char, a string, a char option, or a string option. (This being Ocaml, you can do different things with a string, a char, and a string option, so it's important to know which one you have if you want your program to compile.) This reminded me of the big language difference between Python and both Smalltalk and OCaml, which is that Python has relatively few kinds of collections, while Smalltalk and OCaml have lots. Python doesn't even have a char type; it just uses strings of length 1. Rather than having a fixed-size array type, a variable-size OrderedCollection type, a linked-list type, and a special type for fixed-size arrays of small integers, Python has a single resizable array type that it calls "list". (It also has an immutable tuple type.) Rather than having a red-black tree type, a hash type, an alist type, a compact type for many different dictionaries sharing the same set of keys, a skip-list type, and so on, it has a single dict type, which it implements with hashes. (However, like Smalltalk, the interfaces to these collection types are accessed by sending messages OO-style, so you can make your own collection type.) I like this a lot, even though it is one factor in making Python dramatically slower than Smalltalk. It makes it a lot easier to get started with the language, and the speed penalty is often acceptable, and it makes programs shorter. (In cases where efficiency is really important, there are extension libraries that provide other kinds of containers, some of which are even in the standard distribution; but they are used many times less frequently than lists.) 
This struck me as being somehow analogous to the APL approach; where APL uses one operator to mean many different (but related) things, Python uses one data structure to represent many different (but similar) kinds of collections. It even uses the same indexing operator to index into dicts and lists. PHP, JavaScript, and Lua take this approach even further, in slightly different directions.

So why do I like Python's data structures being general in this way and dislike APL's operators being similarly general? I think it's because I don't expect Python's data structure overloading to hide bugs, but only to cost performance, while APL's operator overloading definitely does hide bugs.

However, there are several cases where Python's data structures are fairly strict in ways that help catch bugs, but which are irritating coming from Perl or JavaScript, and which make your programs bigger:

- There's no implicit conversion between strings and numbers, or actually strings and anything.
- Trying to access off the end of an array gives you an error rather than returning null or resizing the array. There are explicit 'append' and 'extend' methods that make the array bigger.
- Dicts are different from lists, so indexing into a list with a string, or trying to append to a dict, gives you an error. (This also helps with efficiency; I vaguely recall that Python's predecessor ABC used a single painfully-slow AVL tree for a single data structure that served both purposes.)
- Dicts are also different from objects, so accessing an object attribute with a run-time name requires hairy syntax. (Also, this namespace separation provides more flexibility for emulating dicts with user-defined objects; because Perl doesn't do this, it requires a much hairier approach to doing this (called "tie"), which I consequently use much less.)
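The strictness described in the list above can be seen in a few lines of Python. This is only a sketch: the `raises` helper and the `Point` class are made up for illustration, and the exception types are the ones current CPython raises.

```python
def raises(exc_type, thunk):
    """Return True if calling thunk() raises exc_type (illustrative helper)."""
    try:
        thunk()
    except exc_type:
        return True
    return False


items = ["a", "b", "c"]
table = {"x": 1}

# No implicit conversion between strings and numbers.
no_implicit_conversion = raises(TypeError, lambda: "1" + 2)

# Reading off the end of a list is an error, not None and not a resize.
no_read_past_end = raises(IndexError, lambda: items[3])

# Lists want integer indices, and dicts have no append method.
no_string_index = raises(TypeError, lambda: items["x"])
no_dict_append = raises(AttributeError, lambda: table.append(2))

# Growing a list has to be spelled out.
items.append("d")
items.extend(["e", "f"])


class Point:
    """Hypothetical object with one attribute."""

    def __init__(self):
        self.x = 1


# Attribute access by run-time name goes through getattr, not indexing.
p = Point()
attr_name = "x"
by_name = getattr(p, attr_name)
no_object_index = raises(TypeError, lambda: p[attr_name])
```

Each of the `no_*` flags comes out true, which is the "irritating but bug-catching" behavior in question; Perl or JavaScript would quietly produce some value in most of these cases.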
You could imagine that Smalltalk's approach of having different kinds of collections for, say, fixed and variable-size arrays, might also catch some kinds of errors, in addition to improving speed; but I don't think it does.

From kragen at pobox.com Thu Apr 5 03:37:01 2007
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Thu Apr 5 03:37:04 2007
Subject: the end-to-end principle in human society: scholarly writing and freedom of speech
Message-ID: <20070314193211.F2246E34116@panacea.canonical.org>

Here I discuss the forms the end-to-end principle, considered as a computer systems architectural principle, has taken over the years, how it is reshaping society, and how it is merely an extension of the fundamental human right to freedom of speech --- and the consequences.

I'm sorry to be sending out such a rough draft. It probably should be completely rewritten.

Contents
--------

    The End-To-End Principle in Mainframes
    The End-To-End Principle in the Internet
    Scholarly Writing and the End-To-End Principle
    The OSI Layer Cake
    Content-Centric Networking
    Who Are The Ends?
    Manifesto
    Footnotes

The End-To-End Principle in Mainframes
--------------------------------------

The end-to-end principle was first stated, as far as I know, in Butler Lampson's "Hints For Systems Design" paper back in the 1970s. Basically, as I remember it, Butler's argument is that any kind of error-checking or error correction has to be done by the original creator of the information and the final user in order to be trustworthy.
In Butler's example (again, as I remember it), in getting a data file from the program that originally created it to the program that eventually uses it, there are a number of places the file can be corrupted:

- the operating system routines that accept data from the program may have defects that corrupt the data;
- the memory in which those routines store the data before writing it to disk may be flaky and corrupt the data;
- the bus that connects the memory to the disk controller may be flaky;
- the disk controller may be flaky;
- the wires from the disk controller to the disk may be noisy;
- the magnetic domains on the spinning disk may demagnetize over time;
- and when reading the data back, again we must contend with wires, disk controllers, buses, memory, and operating system.

Today we could add Solid Oak Software Cyber Patrol corrupting the data in transit over a network as another potential defect.

In this example, if the original program adds a CRC to the data file, the program that uses the file will be able to detect all of these errors with high probability; while if, say, the disk controller adds a CRC to each data block on disk, it will cost just as much, but will detect only a fraction of the errors. While a chain of overlapping error-checks, each covering only two or three links of the communications path, could in principle detect an error at any point along the path, such a chain is much more complex than a single end-to-end check, and also much more likely to fail.

So this kind of "hop-by-hop" error detection and correction is useful "as an optimization", because it can often *recover* from errors more cheaply --- but you shouldn't depend on it.

Butler wrote his paper in the 1970s, when many expensive computers offered a lot of "hop-by-hop" error detection --- in their CPUs, in their memory interfaces, in the buses to their peripherals, and so on.
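The end-to-end check in the data-file example above can be sketched as follows. This is an illustration rather than anything from Lampson's paper: the function names `seal`, `unseal`, and `flip_bit` are invented here, and CRC-32 from Python's standard `zlib` module stands in for whatever CRC the producing program would really use.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Producer side: append the CRC-32 of the payload as 4 big-endian bytes."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unseal(blob: bytes) -> bytes:
    """Consumer side: check the trailing CRC-32 and return the payload."""
    if len(blob) < 4:
        raise ValueError("too short to contain a checksum")
    payload, stored = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != stored:
        # We cannot tell which link failed (OS, memory, bus, controller,
        # disk, network), only that something between the two ends did.
        raise ValueError("data corrupted between producer and consumer")
    return payload


def flip_bit(blob: bytes, bit_index: int) -> bytes:
    """Simulate a flaky link by flipping one bit somewhere in the blob."""
    damaged = bytearray(blob)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


original = b"the data file, as written by the producing program"
stored_copy = seal(original)          # what travels through all the hops
recovered = unseal(stored_copy)       # an intact round trip

try:
    unseal(flip_bit(stored_copy, 42))
    corruption_detected = False
except ValueError:
    corruption_detected = True
```

The point of the design is that `seal` and `unseal` run only in the two programs at the ends; none of the layers in between has to know the checksum exists.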
A few years later, Tandem stole the high-reliability computer market with a design which, I believe, hewed much more closely to the end-to-end principle: rather than adding trimodular redundancy and CRCs and the like to each step in the path, they built their computers as what we would now call "clusters" of individual computers, which (I believe) had relatively little redundancy inside each one. The End-To-End Principle in the Internet ---------------------------------------- Telephone networks are designed for high reliability, hop-by-hop: each node in the path of your call is specified to keep running 99.999% of the time, at least in the US. Telco packet data protocols like X.25 used a similar approach --- each router in an X.25 path acknowledges each packet it receives from the previous router, so that a lost or corrupted packet can be retransmitted from the previous router, instead of from the original sender. Telco protocols for digital "isochronous" communications, such as voice data, reserve periodic timeslots for data transmission --- to make sure that no data is ever delayed or lost in transit, unless something goes wrong. The internet protocols have a completely different design. Routers are free to drop packets at any time without telling anybody, and they do, even if nothing is going wrong --- simply due to network congestion. Typically they only drop about 0.1% to 5% of the packets, at least according to mtr, but they do drop packets, all the time. They drop even more packets when a router goes down. This allows internet protocols to work correctly over lossy radio networks, noisy modem lines, and long network paths operated by a dozen or more independent companies. The sending and receiving computers are responsible for making sure nothing gets lost or corrupted, usually using a protocol called TCP. 
Each packet of data has a checksum to detect (with some probability) whether the packet has been corrupted in transit, and a sequence number so the receiver knows what order the packets go in, whether it's receiving duplicate packets, and whether it's missing any. Radio networks typically do retransmit packets at a hop-by-hop level, because they often have packet loss rates above 20% per hop, and rarely below 5%. Amateur radio operators use a variant of X.25 called AX.25 for this. If you're transmitting over a string of five noisy radio hops, each of which drops 20% of your packets, that's 67% packet loss, and TCP doesn't work very well any more. So while TCP is more *reliable* at recovering from packet losses (since it handles routers crashing and being buggy), AX.25 is more *efficient* at it. A side effect of radio-network packet retransmission is that sometimes it's the acknowledgement, not the data packet, that got lost, so sometimes you get multiple copies of a packet if you have an 802.11 link in your path --- something you'll almost never see with wireline networks like Ethernet. One nice effect of this is that internet communications costs a lot less than phone communications, because you can do it with shoddier equipment and more casual agreements among service providers. The more important effect of this end-to-end principle is that, since almost all the smarts of an application like voice over IP or file transfer or hypertext lives in your computer and the computer you're talking to, not in the dozen routers in between you, you can make a new networked application and start running it just by convincing one person to install it on their computer. This is why telephone networks have been promising videophones for 50 years, investing lots of money in services, and never delivering, but it took one creative college student a few days to revolutionize the music industry with Napster. 
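The 67% figure for five noisy radio hops earlier in this section is just compounded loss, assuming each hop drops packets independently of the others (an assumption of this sketch, not something measured). The function names below are made up for illustration.

```python
def end_to_end_loss(per_hop_loss: float, hops: int) -> float:
    """Chance a packet is lost somewhere along the path, assuming each hop
    drops packets independently with the same probability."""
    return 1.0 - (1.0 - per_hop_loss) ** hops


def expected_sends(loss: float) -> float:
    """Average number of end-to-end transmissions needed per delivered
    packet when retransmission happens only at the ends."""
    return 1.0 / (1.0 - loss)


five_hop_loss = end_to_end_loss(0.20, 5)    # 1 - 0.8**5, about 0.672
sends_end_to_end = expected_sends(five_hop_loss)
sends_per_hop = expected_sends(0.20)        # with hop-by-hop retransmission
```

With only end-to-end retransmission, each packet has to cross the whole path about three times on average before it arrives, while hop-by-hop retransmission repeats each individual hop only about 1.25 times; that is the sense in which the hop-by-hop approach is more efficient here.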
Scholarly Writing and the End-To-End Principle ---------------------------------------------- In mainstream journalism, for some reason, it seems to be considered bad form to cite sources, so that, for example, your readers can judge whether a quote was taken out of context; perhaps you're supposed to trust the journalist to have investigated adequately and represented the facts accurately, not argue with them. As you might expect, this kind of situation seems to attract the kind of writers who are least likely to investigate adequately. By contrast, in academic publishing, there seems to be a norm that no assertion, however widely accepted, should make it into your paper without a footnote to tell your readers why you believe it, unless it's a result of your personal direct observation, in which case you're expected to say that. Sometimes this makes for arduous reading and even more arduous writing, but it has the benefit that the reader can follow the references and, perhaps, find that the citing author has made a mistake; this makes it less likely that a simple error will be propagated from generation to generation. I have seen it asserted somewhere that this norm is what defines "scholarly writing". There's a clear analogy here to end-to-end error-correction in computer systems. The reader is expected to read the words written by the guy who originally asserted something and make up his own mind what they mean and whether they're well-supported --- not rely on error-free transmission of that information by a pyramid of intermediate authors. A weaker form of this norm has arisen in the blogosphere: if you refer to some published document, whether a blog post or a government report, whether to cite it with approval or dispute it, you are quite strongly expected to provide a link to that document, so that your readers can make up their own minds. (You are also expected to link to the source from which you heard about it.) 
The OSI Layer Cake ------------------ Often, computer networks are described in terms of a somewhat bogus seven-layer cake model developed by OSI, the failed phone-company attempt to reinvent the internet as something stupid. The lowest layers are things like the physical medium, the internetworking packet protocol that packages stuff up so it can move from physical medium to physical medium, and the stuff that puts checksums and sequence numbers on it to make sure it can be retransmitted and corruption detected, and the highest layers are things like data representation formats and the "application layer", which is the actual "application" you're using the network for --- HTTP to implement a distributed hypertext system, say, or IRC for online chatting, or NTP for setting your computer clock. The idea is that each communication starts at the top of the stack, tunnels its way down through the layers to the bottom of the cake, getting wrapped in successively more goop as it goes, bumps along the bottom couple of layers to get to its destination, and then erupts up through the layers at the destination, losing the goop as it goes. Different network systems have slightly different actual sets of layers, which map with more or less fudging onto the seven-layer OSI model. To be painfully concrete, and use the real layers that actually exist on the internet, an HTTP request starts as my browser, say, deciding it wants to load an image from some URL. So it packages that request into an HTTP GET request, which is a blob of text a few dozen words long, and asks the operating system to make it a connection to the server. The operating system [0] sends some DNS packets to random places around the network, which hopefully tell it the IP address of the server, and so it sends a TCP SYN packet to that IP address asking to open a new connection. 
To actually send the TCP packet, it puts it inside an IP packet, and then tries to figure out what to do with the IP packet, which involves consulting its routing tables and probably deciding it should send it to the same router it uses for everything else, which is connected to its Ethernet port, at which point it sends an ARP request out on the Ethernet port asking for the Ethernet address of the router. The router sends back an ARP reply with the router's Ethernet "MAC" address, and then the computer sends the router an Ethernet packet or "frame" (by wiggling voltages on the wire between the computer and the router), within which is an IP packet, within which is the TCP packet or "segment". Later on, my computer will send the router more Ethernet frames containing IP packets containing TCP segments containing the actual text of the HTTP request. At the other end, the router connected to the web server sends a different Ethernet frame to the web server, containing the same IP packet containing TCP containing HTTP, and the web server goes through more or less the same process in reverse: examine the IP packet to see if it's to the web server or to someone else (and who it's from), examine the TCP packet to see which connection it's for, hand the HTTP data to the particular server program handling that connection, and then figure out what the HTTP data means and what to do about it. So in this case there are actually only five layers --- HTTP, TCP, IP, Ethernet MAC, and voltages on a wire, not seven. Other connections have more or fewer layers, and you can increase the layer count almost arbitrarily if you start doing stuff like tunneling IP over HTTP, which you can do. So those are some of the reasons why the seven-layer model is pretty bogus, but that doesn't make it useless for discussion. Anyway, the "application layer" protocol like HTTP is often called "layer 7", since it corresponds to the seventh (or maybe sixth and seventh) layers in the OSI model. 
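The nesting in the HTTP walkthrough above (HTTP text inside a TCP segment inside an IP packet inside an Ethernet frame) can be sketched as a toy data structure. Nothing here is a real protocol implementation: the `Wrapped` class, the header fields, and the addresses are all made up for illustration, with the addresses taken from ranges reserved for documentation.

```python
from dataclasses import dataclass


@dataclass
class Wrapped:
    """One layer of 'goop': a header plus whatever the layer above handed down."""
    layer: str
    header: dict
    payload: object  # another Wrapped, or bytes for the innermost data


def encapsulate(http_text: bytes) -> Wrapped:
    """Wrap application data on its way down the stack."""
    segment = Wrapped("TCP", {"dst_port": 80, "seq": 1}, http_text)
    packet = Wrapped("IP", {"dst": "192.0.2.10"}, segment)
    frame = Wrapped("Ethernet", {"dst_mac": "00:00:5e:00:53:01"}, packet)
    return frame


def decapsulate(frame: Wrapped):
    """Peel the layers off again at the receiving end, outermost first."""
    seen = []
    node = frame
    while isinstance(node, Wrapped):
        seen.append(node.layer)
        node = node.payload
    return seen, node


request = b"GET /image.png HTTP/1.0\r\n\r\n"
layers_seen, delivered = decapsulate(encapsulate(request))
```

Each router along the way would throw away the Ethernet wrapper and put a new one around the same IP packet, which is why the frame arriving at the web server is different while the IP packet inside it is the same.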
But humorous discussions of network architecture often mention layers 8 and 9: layer 8 is the guy sitting at the keyboard clicking the link, and layer 9 is the organization he works for. Content-Centric Networking -------------------------- Van Jacobson is a guy who's famous for developing a lot of the stuff that keeps the current internet running. He's recently been working on some stuff he calls "content-centric networking", on the thesis that in, say, HTTP, the "ends" being connected are not two computer application programs, like a web browser and a web server --- they are a web browser and a named piece of information like an image. The web server is often merely a passive conduit through which that information passes. He points out that a big part of the difference between the internet and the telephone network is that the telephone network is basically providing the service of connecting wires together, so that if you wiggle the voltage level on one wire, the voltage level on the other wire wiggles too; while all the internet and computer networking work is layers on top of that. And it turned out that it's a lot easier to route data around at that higher level instead of hooking up a shared wire with whoever you wanted to talk to. So the internet was designed, basically, to provide remote-login and voice transmission services between a guy and a computer, or between two guys: making a connection. And all of our security and performance stuff has been about how to make those connections more secure and faster and more reliable. But the web server, which is the ultimate endpoint of the connection from the HTTP/TCP/IP point of view, is really just another kind of conduit, and a heck of a lot of the current usage of TCP/IP is for this kind of thing. So maybe we should be focusing on talking to the information instead, which is what Freenet and BitTorrent do, not to the server. Who Are The Ends? 
-----------------

I talk to people over AIM and Yahoo Messenger all the time. They both run on top of a "reliable" protocol, TCP, which is supposed to keep data from getting lost or corrupted. But I frequently have trouble with data getting lost or corrupted with these protocols, for several reasons:

- Their laptop runs out of battery.
- I lose my wireless connection for too long to keep the TCP connection alive.
- I'm using a locutorio computer with Cyber Patrol installed, and it cuts the connection because the other person said, "fuck".
- My IM client crashes.
- I send a message that gets eaten by the buggy IM server that relays data between us.
- I don't speak Spanish very well, and they're speaking Spanish.

So I use end-to-end error-detection --- I try to remain alert to signs of miscommunication and communication loss in the conversation, and "retransmit" by repeating myself or explaining things in a different way.

Likewise, even the image on the web server is not really the endpoint of my communication --- it's really the person who put it there, maybe the owner of the server or maybe not. The server, in this case, is merely a medium for communication among people. [1]

Generally, in all of these examples, computer programs and computer data are merely proxies for people who want to communicate. People are the ends.

PEOPLE ARE THE ENDS!

Manifesto
---------

We have begun to transform our society with the power of communications networks based on the end-to-end principle, as it changes us in return. Fundamentally, the power of these communications networks is the power of free speech or free communication, which enables the independent investigation of truth.

Free communication is the ability to communicate with anyone who wants to communicate with you, about anything, as much as you want, in private or public, as you choose. End-to-end computer networks make this ability not only legal, but practical and widespread.
In a world of free communication, abusive dictatorships cannot stand; human suffering cannot be concealed; information arbitrage in markets becomes marginal; and political propaganda becomes obvious for what it is, and only those who seek it out will be deceived by it.

Furthermore, the world of free communication decimates the barriers to knowledge that keep so much of the human race in poverty.

Finally, this new practical ability to exercise our fundamental right of free speech has given rise to new ways of developing knowledge and organizing work: ranging from Wikipedia and the free-software movement (producing fruits like the Firefox web browser and the Linux kernel) to eBay, Google, and the blogosphere. Our world already abounds with the fruits of these changes.

Given these blessings, only in the most extreme cases, such as preventing imprisoned convicts from plotting further crimes, can interference with this right of free communication be tolerated. In particular, this means that the following are intolerable:

- restrictions on the use, ownership, importation, or manufacture of particular equipment for communications, particularly restrictions intended to prevent people from having access to certain information, such as legal enforcement of Digital Restrictions Management schemes;
- restrictions on the permissible content of public communications;
- restrictions on the permissible identities of public speakers, or requirements that these speakers not be anonymous;
- schemes to prevent anonymous speech and publication, such as requiring communications technology to attach traceable serial numbers to each communication;
- schemes to prevent private communication, such as prohibitions or restrictions on access to encryption hardware or software;
- "takedown notice" mechanisms that enable purported copyright holders to exercise prior restraint over the speech of others, particularly without prior judicial review.
Furthermore, these are intolerable not just when they are imposed by the state, but also when they are imposed by private actors with sufficient power to make them difficult to escape --- for example, by a telecommunications company.

Footnotes
---------

[0] Actually, this is usually a library which talks to a server at my ISP, which does the actual work of spewing DNS packets all over the internet for me.

[1] There are exceptions, like network monitoring reports and computer simulations, but nearly all the communication over the web is from people to people, not between people and computer programs.

From kragen at pobox.com Mon Apr 9 03:37:01 2007
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Mon Apr 9 03:37:04 2007
Subject: comparative study of iteration and ordered collections
References: <1172281271.3852@arch> <87fy88ohd1.fsf@thunderbird.scannedinavian.com>
Message-ID: <20070319163103.B2D81E34108@panacea.canonical.org>

Introduction
------------

I'm trying to figure out what kind of data structure I should use for ordinary ordered collections in Bicicleta: some kind of numerically-indexed vector, or a Lisp-style externally-singly-linked list? I feel that I have to decide this now, because I need to add variadic functions and introspective slot/property/method listing.

Weighing on the side of numerically-indexed vectors, there are more efficient implementations possible (I think), they work reasonably well in Python, JavaScript, and Perl, and they don't require people to think recursively.

Weighing on the side of Lisp-style lists, they don't require an additional primitive type, they don't introduce numbers (as indices) into parts of your program that have nothing to do with arithmetic, and they might not require people to think recursively.

Both structures are capable of the same set of operations; it's just a matter of which ones are expensive and which ones are dangerous.
So I thought I'd look to see if one or the other would really cramp anybody's style.

(In here, where I say "array", it means "vector", and where I say "list", I might mean either "linked list" or "vector".)

Contents
--------

    Introduction
    Contents
    STL Algorithms
    Common Lisp Sequences Dictionary
    My Own Programming Style in Python
    My Own Programming Style in JavaScript
    Efficiency
    The Strange Case of string.join
    Conclusions

STL Algorithms
--------------

So here's a survey of the 1994/1995 STL algorithms and the iterator types they use. The STL contains a comprehensive collection of generally-useful algorithms, and carefully distinguishes what kind of data structure traversal each algorithm needs. Immutable Lisp-style lists can support "input" and "forward" traversal, but not "bidirectional" or "randomaccess", and by virtue of being immutable, they don't support "output" directly --- but generally "output" iterators can map to a returned list value.

    for_each                  input
    find                      input
    find_if                   input
    adjacent_find             forward
    count                     input
    count_if                  input
    mismatch                  input
    equal                     input
    search                    forward
    copy                      input output
    copy_backward             bidirectional
    iter_swap                 forward
    swap_ranges               forward
    transform                 input output
    replace                   forward
    replace_if                forward
    replace_copy              input output
    replace_copy_if           input output
    fill                      forward
    fill_n                    output
    generate                  forward
    generate_n                output
    remove                    forward
    remove_if                 forward
    remove_copy               input output
    remove_copy_if            input output
    unique                    forward
    unique_copy               input output
    reverse                   bidirectional
    reverse_copy              bidirectional output
    rotate                    forward
    rotate_copy               forward output
    random_shuffle            randomaccess
    partition                 bidirectional
    stable_partition          bidirectional
    sort                      randomaccess
    stable_sort               randomaccess
    partial_sort              randomaccess
    partial_sort_copy         input randomaccess
    nth_element               randomaccess
    lower_bound               forward
    upper_bound               forward
    equal_range               forward
    binary_search             forward
    merge                     input output
    inplace_merge             bidirectional
    includes                  input
    set_union                 input output
    set_intersection          input output
    set_difference            input output
    set_symmetric_difference  input output
    push_heap                 randomaccess
    pop_heap                  randomaccess
    make_heap                 randomaccess
    sort_heap                 randomaccess
    max_element               forward
    min_element               forward
    lexicographical_compare   input
    next_permutation          bidirectional
    prev_permutation          bidirectional
    accumulate                input
    inner_product             input
    partial_sum               input output
    adjacent_difference       input output

So only the following 18 algorithms out of the 64 require bidirectional or random-access iterators:

    copy_backward reverse reverse_copy random_shuffle partition
    stable_partition sort stable_sort partial_sort partial_sort_copy
    nth_element inplace_merge push_heap pop_heap make_heap sort_heap
    next_permutation prev_permutation

The other 46 algorithms work fine without. For most of the above 18, there are also well-known ways to get the desired effect with side-effect-free data structures.

copy_backward sounds like reverse_copy, but it makes a forward copy --- it just does it in reverse order so that it works when copying stuff upwards to an overlapping range.

reverse is very straightforward to perform with linked lists. I think it is only done with a bidirectional iterator in STL because there's no "backwards output iterator", which may be because STL's lists are doubly-linked, so they have bidirectional iterators.

random_shuffle looks pretty hairy. Probably best just to assign a random number to each element and sort them, although that feels like cheating.

partition (and stable_partition) is straightforward to do with lists, just less efficient.

As for sort and stable_sort, quicksort and mergesort can easily be done on lists. The clever thing about mergesort is that it's normally a stable sort, which is to say that it doesn't reorder elements that compare equal, but running mergesort using "forward iterators" suggests that you should use a different way of dividing up the list --- say, unshuffling it --- but that doesn't give you a stable sort.
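To make the unshuffling mergesort concrete, here is a sketch on Lisp-style lists, represented in Python as nested `(head, tail)` pairs with `None` as the empty list. The representation and all the function names are choices made for this illustration; every traversal is forward-only, and the length of the list is never computed.

```python
def cons_list(items):
    """Build a Lisp-style list of (head, tail) pairs; None is the empty list."""
    result = None
    for item in reversed(list(items)):
        result = (item, result)
    return result


def to_python_list(cell):
    """Walk a cons list front to back, collecting its elements."""
    out = []
    while cell is not None:
        out.append(cell[0])
        cell = cell[1]
    return out


def reverse(cell):
    """Reverse a cons list using only forward traversal."""
    out = None
    while cell is not None:
        out = (cell[0], out)
        cell = cell[1]
    return out


def unshuffle(cell):
    """Deal elements alternately into two lists, like dealing cards."""
    evens, odds = None, None
    take_even = True
    while cell is not None:
        if take_even:
            evens = (cell[0], evens)
        else:
            odds = (cell[0], odds)
        take_even = not take_even
        cell = cell[1]
    return reverse(evens), reverse(odds)


def merge(left, right, key):
    """Merge two sorted cons lists, preferring the left one on ties."""
    merged = None  # built back to front, flipped at the end
    while left is not None and right is not None:
        if key(right[0]) < key(left[0]):
            merged, right = (right[0], merged), right[1]
        else:
            merged, left = (left[0], merged), left[1]
    rest = left if left is not None else right
    while rest is not None:
        merged, rest = (rest[0], merged), rest[1]
    return reverse(merged)


def merge_sort(cell, key=lambda x: x):
    """Mergesort that splits by unshuffling instead of cutting in the middle."""
    if cell is None or cell[1] is None:
        return cell
    left, right = unshuffle(cell)
    return merge(merge_sort(left, key), merge_sort(right, key), key)


numbers = to_python_list(merge_sort(cons_list([3, 1, 2])))

# Three records with equal keys come back as 1, 3, 2: sorted, but not stable.
records = cons_list([("x", 1), ("x", 2), ("x", 3)])
unstable = to_python_list(merge_sort(records, key=lambda r: r[0]))
```

The instability comes entirely from `unshuffle`: the second record ends up in the right-hand half, and the merge then emits everything equal from the left-hand half first.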
There's a non-clever way to make any sort stable, which is to add a secondary sort key that is the original position in the list. There may be a clever way to do a stable mergesort without knowing the list lengths in advance, but I don't know it.

partial_sort selects the smallest N elements and sorts them. A lazy mergesort should provide this nicely.

nth_element is quickselect, which I think can also be done with lists without too much difficulty.

inplace_merge can be done with merge; indeed, in STL, merge is also present.

That leaves only the heap and permutation operations. I really don't know about them.

This discussion leaves out the fact that several of the algorithms (upper_bound, lower_bound, equal_range, binary_search, includes) have implementations that run faster when they have random-access iterators available.

Common Lisp Sequences Dictionary
--------------------------------

Common Lisp has a similar, though slightly smaller, set of algorithms that can work on any kind of sequence, including lists and vectors; it's known as the Sequences Dictionary.
Here's the list:

    (copy-seq sequence)
    (elt sequence index) setfable
    (fill sequence item &key start end)
    (make-sequence result-type size &key initial-element)
    (subseq sequence start &optional end) setfable
    (map result-type function &rest sequences+)
    (map-into result-sequence function &rest sequences+)
    (reduce function sequence &key key from-end start end initial-value)
    (count item sequence &key from-end start end key test test-not)
    (count-if predicate sequence &key from-end start end key)
    (count-if-not predicate sequence &key from-end start end key)
    (length sequence)
    (reverse sequence)
    (nreverse sequence)
    (sort sequence predicate &key key)
    (stable-sort sequence predicate &key key)
    (find item sequence &key from-end test test-not start end key)
    (find-if predicate sequence &key from-end start end key)
    (find-if-not predicate sequence &key from-end start end key)
    (position item sequence &key from-end test test-not start end key)
    (position-if predicate sequence &key from-end start end key)
    (position-if-not predicate sequence &key from-end start end key)
    (search sequence-1 sequence-2 &key from-end test test-not key start1 start2 end1 end2)
    (mismatch sequence-1 sequence-2 &key from-end test test-not key start1 start2 end1 end2)
    (replace sequence-1 sequence-2 &key start1 end1 start2 end2)
    (substitute newitem olditem sequence &key from-end test test-not start end count key)
    (substitute-if newitem predicate sequence &key from-end start end count key)
    (substitute-if-not newitem predicate sequence &key from-end start end count key)
    (nsubstitute newitem olditem sequence &key from-end test test-not start end count key)
    (nsubstitute-if newitem predicate sequence &key from-end start end count key)
    (nsubstitute-if-not newitem predicate sequence &key from-end start end count key)
    (concatenate result-type &rest sequences)
    (merge result-type sequence-1 sequence-2 predicate &key key)
    (remove item sequence &key from-end test test-not start end count key)
    (remove-if test sequence &key from-end start end count key)
    (remove-if-not test sequence &key from-end start end count key)
    (delete item sequence &key from-end test test-not start end count key)
    (delete-if test sequence &key from-end start end count key)
    (delete-if-not test sequence &key from-end start end count key)
    (remove-duplicates sequence &key from-end test test-not start end key)
    (delete-duplicates sequence &key from-end test test-not start end key)

There's also the Conses Dictionary, which is a much larger collection of algorithms that apply specifically to conses, mostly singly-linked lists made of them; many of these are widely-applicable algorithms as well, like set-difference and mapcan, that just happen to work only on lists. I haven't surveyed them because there are so many of them, although perhaps I should survey them if my objective is to see whether picking vector as a default ordered collection would cause problems.

There's an Arrays Dictionary too, but it is much more minimal, containing things like array-dimensions and simple-vector-p.

Some notes on these, because they are different in some particulars from the corresponding algorithms in other languages:

- unlike in C++ or Haskell, there is no lazy sequence type in Common Lisp.
- in the absence of initial-value, reduce calls its function with zero arguments when the sequence is empty.
- map maps in parallel over the sequences, which could express certain kinds of vector math more conveniently. It stops when any of the input sequences runs out.
- map-into stops when it reaches the end of its output sequence or any of the input sequences.
- setfing subseq only works if the new subsequence is of the right length, unlike in Python.
- as in STL, sequence indices count from zero, and ranges designated by start-end pairs include the start but not the end.
- many of the functions include a :key argument which extracts the key to be compared, tested, etc.
Usually, this is equivalent to composing some functional argument with the key, potentially in some hairy way (e.g. for sort or merge, #'< becomes #'(lambda (x y) (< (key x) (key y)))) but can be implemented more efficiently. My Own Programming Style in Python ---------------------------------- In Python, Perl, and JavaScript, the "lists" are numerically-indexed arrays, so if there's anywhere where I'd be prone to lots of numerical indexing of lists, it should be in these languages. So here's the Python I have lying around: bicicleta/objcalc.py bicicleta/objcalcpp.py fib.py laptoptable.py readbmp.py safe_repr.py ~/cursmail/cursmail.py ~/bloompy/bloom.py ~/textindex/textindex.py One by one: * bicicleta/objcalc.py, bicicleta/objcalcpp.py These two are the object-calculus interpreter I posted to kragen-hacks a couple of years ago. It's extremely non-numerical. It accesses lists with zip, for x in y:, list comprehensions, list appending with +, string.join, and construction of lists with "list displays" (the Python name for listing a bunch of expressions inside square brackets, separated by commas, to make a list of their return values); and there's one place where it says: if not lines[-1]: lines = lines[:-1] It also looks at strings a bit --- it cares whether their length is 0 and whether their char 0 is in "self.namechars". With these exceptions --- testing for emptiness, looking at the first item, looking at the last item, removing the last item --- it gains nothing from the array nature of Python lists. * fib.py Contains no lists, because it's a deliberately stupid implementation of the fibonacci sequence and is only four lines long. * laptoptable.py This program parses a text file full of RFC-822 headers and turns it into a web page with a table. 
It tests the first character of a line with line[0], strips the first character of a key with k[1:], and tests len(sys.argv) and extracts the first item with sys.argv[1]; otherwise it accesses lists and strings with string.strip, string.replace, string.join, map, list comprehensions, string.split, list displays (sort of). There's also a fair bit of JavaScript in this file. * readbmp.py This program is fairly numerical. It parses a 24-bit-color BMP file into an array of arrays of floats, generates another array of random floats, adds them together element by element, and turns the result into PostScript. Accordingly, it has stuff like this: (width, height) = struct.unpack(' I'm pleasantly surprised to see some promise in Bicicleta's current concrete syntax. A call to collect ----------------- So consider this call to prog.collect: prog.collect(f: "(" + f.item + ")", for_each="cthulhu", where=(f.item == "u").not) This returns the list ["(c)", "(t)", "(h)", "(l)", "(h)"]. Having just explained this code to several people, I am now aware that it is not a marvel of clarity, so I will start by explaining what this does, and some of its internal workings. It makes a list of "(" plus a character plus ")", for each character of the string "cthulhu", as long as the character is not "u". There are three arguments to 'prog.collect': 'arg1', which is "(" + f.item + ")"; 'for_each', which is "cthulhu"; and 'where', which is (f.item == "u").not. 'f' is a name used to refer to the collect expression as a whole. It's a little odd to have "arguments" whose value depends on the function they're being passed to, so I should explain that they're not really arguments, but methods. Bicicleta doesn't really have arguments. 
This expression is evaluated by evaluating prog.collect (the result of the 'collect' method on the variable 'prog', which conventionally refers to the top level of the program), deriving a new object from it by overriding the 'arg1', 'for_each', and 'where' methods described above, and then calling the '()' method on the resulting object. If we didn't want to do this last step, we could write

    prog.collect {f: "(" + f.item + ")", for_each="cthulhu", where=(f.item == "u").not }

which is just the object (the same object named by f).

Prog.collect evaluates 'arg1' and 'where' in a series of objects with different 'item' methods, which return successive elements of the 'for_each' value, in order to construct the list. 'for_each' can be any kind of sequence, not just a string.

Mechanics of 'collect'
----------------------

Here's the full code for prog.collect:

    # Collect: map+filter, in a more listcompy shape.
    # WORDY! Uck! Avoids prog.if because prog.if depends on collect.
    collect = {collect:
        arg1 = collect.item,
        cursor = collect.for_each.cursor
        item = collect.cursor.item
        where = prog.sys.bool.true
        next = collect { cursor = collect.cursor.advanced }
        '()' = collect.cursor.empty.if_true(
            then = collect.cursor
            else = collect.where.if_true(
                then = collect.arg1 @ collect.next()
                else = collect.next()))
    }

'arg1' defaults to collect.item (the same as f.item in the call earlier); 'item' is defined as collect.cursor.item; 'cursor' defaults to collect.for_each.cursor; 'where' defaults to true; 'next' is a method that returns the same object, except with a new value for 'cursor' (giving it different values for 'item', 'arg1', 'result', and maybe 'where'); and '()' either returns an empty list (if 'cursor' is empty) or else returns collect.arg1 @ collect.next() when 'where' is true and just collect.next() when 'where' is false. '@' is the cons or list-construction operator.
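The recursion just described can be loosely sketched in Python. This is only an analogy under stated assumptions, not a translation of how Bicicleta works: the overridden methods 'arg1' and 'where' become ordinary callables that take the current item, and the cursor object becomes an integer index (the parameter named cursor here is a hypothetical stand-in).

```python
def collect(for_each, arg1=lambda item: item, where=lambda item: True, cursor=0):
    """Recursive map+filter in roughly the shape of prog.collect.

    Assumptions: for_each is an indexable sequence, cursor is an integer
    index standing in for Bicicleta's cursor object, and arg1 and where
    are callables rather than overridden methods.
    """
    if cursor >= len(for_each):
        # Corresponds to cursor.empty: terminate the list.
        return []
    item = for_each[cursor]                             # collect.cursor.item
    rest = collect(for_each, arg1, where, cursor + 1)   # collect.next()
    if where(item):
        return [arg1(item)] + rest                      # collect.arg1 @ collect.next()
    return rest                                         # arg1 is never evaluated here


result = collect("cthulhu",
                 arg1=lambda c: "(" + c + ")",
                 where=lambda c: c != "u")
print(result)
```

As in the Bicicleta version, the defaults make 'arg1' the item itself and 'where' always true, so calling collect on a sequence with no other arguments just copies it into a list.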
So, in the call above, initially 'item' is "c", 'arg1' is overridden to be "(c)", 'where' is overridden to be true, and the cursor is not empty, so we end up with '()' returning "(c)" @ collect.next(). In collect.next, the cursor is advanced to point to the next item, 'item' is "t", arg1 evaluates to "(t)", and '()' evaluates to "(t)" @ collect.next(), so the top-level '()' evaluates to "(c)" @ ("(t)" @ some other stuff), and so it goes on. In the case where item is "u", because cursor is pointing to the beginning of "ulhu", 'where' evaluates to 'false', so the collect.where.if_true expression returns collect.next(), ignoring the "(u)" that 'arg1' would compute. Eventually, the cursor is empty, and '()' just returns that (empty) cursor, which serves to terminate the list; probably I should return prog.sys.nil instead. In those cases, it doesn't matter what 'item' and 'arg1' evaluate to, even if they evaluate to errors, because they're not being returned. Likewise in cases where 'where' evaluates to false --- '()' just returns collect.next() and bypasses 'result' and 'arg1' entirely. I anticipate that utilities like "collect" will be able to keep explicit recursion confined to tiny corners of the system libraries and to problems that really benefit from recursion. Why I Think This is Cool (Bicicleta, Python, OCaml, and Squeak) --------------------------------------------------------------- Loops are confusing and complicated, especially in functional languages that implement them by recursion. A lot of loops can be subsumed by simple one-variable list-comprehensions, often with improved comprehensibility and brevity. 
For this reason, Python and Haskell have list-comprehension syntax built into the language, so that you can write (in Python):

    ["(" + item + ")" for item in "cthulhu" if item != "u"]

which gives you the same result as the Bicicleta expression:

    prog.collect(f: "(" + f.item + ")", for_each="cthulhu", where=(f.item == "u").not)

(The .not is just because I haven't implemented != for strings yet, because right now my !=-derived-from-== magic is locked up in a numeric class from which I should factor out a "comparable".)

To my eyes, the Python version is more readable, but the difference is not enormous; they are closer to one another than either is to

    rv = []
    for item in "cthulhu":
        if item != "u":
            rv.append("(" + item + ")")
    # now do something with rv

If I added special syntax to Bicicleta to do list-comprehensions, I could eliminate the "prog.collect" part:

    [f: "(" + f.item + ")", for_each="cthulhu", where=(f.item == "u").not]

But even without special syntax, I think it's better already than Smalltalk:

    'cthulhu' asArray select: [:c | c ~= $u] thenCollect: [:c | '(', c asString, ')']

Or OCaml:

    let list_of_string string =
      let rv = ref [] in
      for i = String.length string - 1 downto 0 do
        rv := string.[i] :: !rv
      done ;
      !rv
    in List.map (fun item -> "(" ^ String.make 1 item ^ ")")
         (List.filter ((<>) 'u') (list_of_string "cthulhu")) ;;

Although, to be fair, a lot of the verbosity in the Smalltalk and OCaml versions has to do with excessive incompatible types (lists, strings, arrays, characters) rather than the clumsiness of the non-list-comprehension syntax.
But consider the ideal case, where those incompatibilities don't exist:

    ["(" + item + ")" for item in "cthulhu" if item != "u"]
    'cthulhu' select: [:c | c ~= $u] thenCollect: [:c | '(', c, ')']
    prog.collect(f: "(" + f.item + ")", for_each="cthulhu", where=f.item != "u")
    List.map (fun item -> "(" ^ item ^ ")") (List.filter ((<>) 'u') "cthulhu") ;;

With corresponding bits rearranged to more or less line up:

    "(" + item + ")"                          for item in "cthulhu"   if item != "u"
    thenCollect: [:c | '(', c, ')']           'cthulhu'               select: [:c | c ~= $u]
    "(" + f.item + ")",                       for_each="cthulhu",     where=f.item != "u"
    List.map (fun item -> "(" ^ item ^ ")")   "cthulhu"               (List.filter ((<>) 'u')

This suggests that there is some brevity benefit that attaches specifically to the practice of defining new methods in the form f.item != "u", rather than creating anonymous functions, even such lightweight functions as Squeak's [:c | c ~= $u]. Currying, such as ((<>) 'u'), is even shorter, of course. It turns out that you can use currying in Bicicleta similarly; you can write "u".'!=' to mean {op: '()' = "u" != op.arg1}. In this case, though, collect is defined to expect a method definition on itself, not an anonymous function that it would have to pass something to explicitly.

From kragen at pobox.com Mon Apr 23 03:37:01 2007
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Mon Apr 23 03:37:02 2007
Subject: why Darcs rocks
Message-ID: <20070419020519.040C9E34104@panacea.canonical.org>

I've just been importing the change history for the Bicicleta project (stored as a series of .tar.gz source tree snapshots, stone-age-style) into darcs. Often I've claimed that darcs is nice because it keeps the user-interface excise to a minimum, compared to other source-control systems; this is a sort of natural experiment in how small that excise really is, since I'm currently doing almost nothing but dealing with darcs (and tar).
I've just recorded 36 changesets in 82 minutes, so the average inter-changeset interval has been about 2.3 minutes, about 140 seconds. This is on a project with around 1000 lines of code as of the last changeset; the changesets I've currently committed represent about seven nights of work over two weeks. This 140-second excise means that darcs makes it practical to record changesets for work units as small as half an hour. It looks like most of the changesets I'm currently recording represent about an hour of work. Some of those 140 seconds are consumed by navigating and extracting the tar.gz snapshots, so darcs by itself is even more convenient. Darcs rocks. (P.S. some time after writing the above, I finished all of this importation work, with a total of 80 changesets. I'll push them out soon.) From kragen at pobox.com Thu Apr 26 03:37:01 2007 From: kragen at pobox.com (Kragen Javier Sitaker) Date: Thu Apr 26 03:37:02 2007 Subject: on low-tech Message-ID: <20070424014440.GA20604@canonical.org> I originally wrote this as part of a kragen-journal post in 2001; this is a lightly edited version of part of http://lists.canonical.org/pipermail/kragen-journal/2001-February/000475.html I've said several times that I prefer low-tech stuff, much to the shock of some of my co-workers; I'm not sure how to explain this adequately. The more I work with computers, the more I realize that complexity has hidden costs --- complex things are less reliable, fail in more unpredictable ways, are harder to diagnose problems in, are often harder to fix, are usually more trouble to keep running, are harder to change (especially to change without breaking), and are harder to get to work with other things. These alone are significant reasons to prefer a simpler solution to a more complex one unless the complex solution has significant advantages. But there's a nastier side to it, too. Complex things hide the intentions of their makers better. 
There's an interesting power relationship set up between a craftsman and the users of his work products: the work products are subject to the whim of the users, but also, the users are subject to the whims of the work products. This is a given in architecture: our mental state and behavior are powerfully affected by our surroundings; by manipulating people's surroundings, we can control what is easy and hard for them to do, and therefore, what they will do.

So when you're interacting with an artifact, your control over your environment is directly limited by the extent of your control over that artifact. To the extent that you don't control the artifact, but are controlled by it, you are under the control of the artifact's creator.

For example, I watched my ex-girlfriend Jamie deleting her spam one day. She used Outlook Express, which displayed each message as soon as she highlighted it in the message index. Because nearly all of her spam was HTML spam, it transmitted signals back to its senders (via "web bugs") indicating that she had read it, and therefore that her address was a valid, deliverable address --- simply because she had highlighted it in order to delete it. I suggested to her that she should not view the spam before deleting it. Unfortunately, neither of us knew how to tell Outlook Express not to view a message when it was highlighted in the index, nor how to delete a single message without highlighting it; our ignorance made us powerless and subjugated her to the designs of the Microsoft engineers who wrote the program --- and, by extension, to the mentally crippled morality-disabled vomit-lapping spammers who exploited its careless intrusion on her privacy.

This is a fairly innocent example; the only negative consequence is that her email address became less useful as time went on, eventually making it impossible for her to maintain very infrequent email correspondences, all as an inadvertent result of a UI design choice.
Not every harmful design choice is so innocent --- case in point: the version-to-version minor incompatibilities of Microsoft Office programs, which force you to upgrade your whole office, and which eventually render old documents unreadable --- and not every harmful design choice has such limited effects. As a result of many experiences like this, I do my best to use software that is as simple as possible. Maybe I'm too paranoid. Or maybe I'm just sensible, while many other people are blinded by the coolness of the stuff they're using. Only time and comp.risks will tell. I'd really like to work in an operating environment simple enough that I could actually read and understand every line of code, and flexible enough that I could easily change whatever I didn't like. It seems that it should be possible to build a mailreader, web browser, HTML editor, web server, GUI, and preemptively-multitasking OS, all in 100 000 lines of code or so. Viewpoints Research Institute has just gotten NSF funding to try to do this in 20 000 lines of code, but as I understand it, they are going to skip the web-browser part because they don't like the architecture of the web. I argued once that GUI programming is inherently harder than text-mode programming. Derek Robinson disagreed with me; he said that programming with the DOM in MSIE5, which one could certainly conceive as a kind of GUI programming, was far easier than programming with any GUI toolkit, or even in text mode, than anything he'd ever done before. I think he's right. From kragen at pobox.com Mon Apr 30 03:37:02 2007 From: kragen at pobox.com (Kragen Javier Sitaker) Date: Mon Apr 30 03:37:03 2007 Subject: "redices" and "indices" Message-ID: <20070428042436.GA748@canonical.org> I recently corrected someone who used the term "redices" to mean "reducible expressions" (in the context of the lambda-calculus and similar formal systems): "redices" is not the plural of "redex". 
If there were a Latin word "redix", it might pluralize as "redices", but there isn't, and "redix" is different from "redex" anyway. He pointed out: I think "redices" is fairly common use, as a googling confirms. . . . What makes you think "redices" isn't the plural of "redex"? How are you with "indices"? "Index" ------- I had to admit that he was right about "index" not being "indix". I don't speak Latin myself, but a friend of mine who does explained to me that "index" and "indices" are the nominative singular and plural of "index", a regular third-declension Latin noun. Charlton Lewis's Latin dictionary has the following entry for it [1]: index dicis, m and f [in+DIC-] , one who points out, a discloser, discoverer, informer, witness: falsus, S.: haec omnia indices detulerunt.-- An informer, betrayer, spy: vallatus indicibus: saeptus armatis indicibus: silex, qui nunc dicitur index, traitor's stone, O.--An index, sign, mark, indication, proof: complexus, benevolentiae indices: vox stultitiae: auctoris anulus, O.: Ianum indicem pacis bellique fecit, L.--A title, superscription, inscription: deceptus indicibus librorum: tabula in aedem cum indice hoc posita est, L.--A forefinger, index finger: pollex, non index: indice monstrare digito, H. I don't know the etymology of the term "redex" for certain, but it means "reducible expression", and is therefore an originally English word, not a Latin loanword. It doesn't appear in the Oxford English Dictionary Online. English Pluralization --------------------- English is somewhat unusual in that it often imports irregular pluralizations of loanwords along with the original loanwords --- thus we have mujahedin, Taliban, tableaux, criteria, cherubim, axes, and bacteria, rather than *mujahids, *Talibs, *tableaus, *criterions, cherubs, *axises, and *bacteriums. 
[2] This adds to the confusion of irregular plurals already natively present in English, things like "oxen", which was in Old English, coming from Proto-Germanic and ultimately from Proto-Indo-European. [3] As illustrated above, there are many cases in which the regular plural form is considered incorrect by almost everyone, but there are other cases where use of the irregular "classical" form is a way to show off the speaker's erudition; I think "index" and "cherub" are such examples. Both "indexes" and "cherubs" are legitimate English, but "indices" and "cherubim" are ways to demonstrate your erudition, and perhaps your knowledge of Latin and Hebrew. Damian Conway's article "An Algorithmic Approach to English Pluralization" [7] lists several more examples. Showing off one's knowledge may be thought to demonstrate an arrogant attitude of superiority, if those you're talking to don't share that same knowledge, or to demonstrate that you belong in the group, if they do. However, in either case, it's worse if the folks you're talking to know that the knowledge you're showing off is wrong. The Jargon File mentions deliberately irregular pluralizations in hacker jargon [4]: Further, note the prevalence of certain kinds of nonstandard plural forms. Some of these go back quite a ways; the TMRC Dictionary [from the 1950s?] includes an entry which implies that the plural of 'mouse' is meeces, and notes that the defined plural of 'caboose' is 'cabeese'. This latter has apparently been standard (or at least a standard joke) among railfans (railroad enthusiasts) for many years. On a similarly Anglo-Saxon note, almost anything ending in 'x' may form plurals in '-xen' (see VAXen and boxen in the main text) [following "oxen"]. Even words ending in phonetic /k/ alone are sometimes treated this way; e.g., 'soxen' for a bunch of socks. 
Other funny plurals are 'frobbotzim' for the plural of 'frobbozz' (see frobnitz) and 'Unices' and 'Twenices' (rather than 'Unixes' and 'Twenexes'; see Unix, TWENEX in main text). But note that 'Unixen' and 'Twenexen' are never used; it has been suggested that this is because '-ix' and '-ex' are Latin singular endings that attract a Latinate plural. Finally, it has been suggested to general approval that the plural of 'mongoose' ought to be 'polygoose'. The pattern here, as with other hackish grammatical quirks, is generalization of an inflectional rule that in English is either an import or a fossil (such as the Hebrew plural ending '-im', or the Anglo-Saxon plural suffix '-en') to cases where it isn't normally considered to apply. This is not 'poor grammar', as hackers are generally quite well aware of what they are doing when they distort the language. It is grammatical creativity, a form of playfulness. It is done not to impress but to amuse, and never at the expense of clarity. One might also speculate that hackers do this to poke fun at those who think that knowing Latin declensions makes them smart. Other, non-hacker occurrences of the same playful misapplied Latin pluralization have been reported, such as "grimi" for "grimaces" or "waitri" for "waitresses". [6] Back to "Redices" ----------------- So I can think of six likely interpretations a listener might arrive at when they hear "redices" used as the plural of "redex": 1. The speaker is trying to demonstrate that they're better-educated than I am, but they are failing, because they don't know enough Latin to know that "redex" isn't a Latin word. Therefore, they are not only arrogant but unskilled and unaware of it [5]. 2. The speaker is trying to demonstrate that they are as well educated as I am, but they are failing, because they don't know enough Latin to know that "redex" isn't a Latin word. 
Therefore, they think they are not worthy of associating with me, because they think they would need to have a classical education in order to be so, but they clearly don't have it. Furthermore, they are unskilled and unaware of it. 3. The speaker uttered an incomprehensible word. They must have a bigger vocabulary than I do; maybe they are smart and I should listen to them more carefully. 4. The speaker uttered an incomprehensible word. They must be talking nonsense. 5. The speaker is playfully forming a nonstandard plural. 6. This word "redex" I haven't heard before must be a Latin word, and "redices" is either its only correct plural or a correct show-off academic plural. I suspect that explanation #5 is the correct explanation of the term's origin, and it's prefigured by the "Unices" and "Twenices" examples from the Jargon File [4], but I intend to avoid the use of "redices" except in clearly playful contexts because of the possibility of interpretations #1, #2, #4, and especially #6, which will make it more difficult for the listener to discover the correct derivation from "reducible expression". Maybe this is just me taking myself way too seriously. Correcting People ----------------- If you correct people who are using "redices" (as I did), you run the risk of a similarly hazardous gamut of responses. 1. Why is he trying to show off his knowledge? Doesn't he know "redices" is a playful invention? He must be not only arrogant but unskilled and unaware of it. 2. I've always heard "redices" as the plural of "redex". Have I been looking dumb all these years? Oops. References ---------- These references are not intended to assign credit; they're just here so you can dig deeper if you're interested. [1] Charlton Lewis, "An Elementary Latin Dictionary", 1890, ISBN: 0199102058 > http://perseus.mpiwg-berlin.mpg.de/cgi-bin/ptext?doc=Perseus%3Atext%3A1999.04.0060%3Aentry%3D%237793 [2] Letter from Dr. 
Lim Chin Lam, Penang, to The Star of Malaysia newspaper, 2006-08-25: It must be noted that the above nouns have been adopted (or borrowed or hijacked) from other languages and normally retain the singular and plural forms in their original language. > http://thestar.com.my/english/story.asp?file=/2006/8/25/lifefocus/15088814&sec=lifefocus [3] Douglas Harper's Etymonline Online Etymology Dictionary entry "ox" > http://etymonline.com/?term=ox [4] Jargon File 4.2, dated 2000-01-31, attributed to a large collection of hackers but currently enclosed by Eric Raymond, section "Jargon Construction", subsection "Overgeneralization"; > http://www.science.uva.nl/~mes/jargon/o/overgeneralization.html [5] "Unskilled and Unaware of It: How Difficulties in Recognizing One's Own Incompetence Lead to Inflated Self-Assessments", by Justin Kruger and David Dunning, published in the Journal of Personality and Social Psychology, December 1999, Vol. 77, No. 6, 1121-1134 > http://www.phule.net/mirrors/unskilled-and-unaware.html > http://gagne.homedns.org/~tgagne/contrib/unskilled.html > http://www.apa.org/journals/features/psp7761121.pdf [6] "More False Latin", by John Algeo, American Speech, Vol. 41, No. 1 (Feb., 1966), pp. 72-74, doi:10.2307/453250 > http://links.jstor.org/sici?sici=0003-1283(196602)41%3A1%3C72%3AMFL%3E2.0.CO%3B2-G [7] "An Algorithmic Approach to English Pluralization", by Damian Conway; this describes the algorithm in Lingua::EN::Inflect. > http://www.csse.monash.edu.au/~damian/papers/HTML/Plurals.html