From kragen at canonical.org Mon Jun 2 03:37:02 2008 From: kragen at canonical.org (Kragen Javier Sitaker) Date: Mon Jun 2 03:37:03 2008 Subject: Failing to upgrade to CryWrap Message-ID: <20080523223404.15842183495@panacea.canonical.org> (Available in HTML at .) So, in March, we upgraded our machine from Sarge to Etch. We had been using `sslwrap` in Sarge, but `sslwrap` doesn't exist in Etch. According to Jonathan McDowell, the guy who used to maintain the Debian `sslwrap` package: sslwrap (2.0.6-18) unstable; urgency=low * Users might like to consider switching away from sslwrap to crywrap or investigating whether more recent versions of the services they're sslwrapping are themselves now ssl enabled. It is envisaged that at some point in the future I will request removal of sslwrap from the archive, though I hope to investigate the possibility of a smooth upgrade path to crywrap before that happens. sslwrap is effectively dead upstream and I think it's probably better to consider the existing alternatives that can perform the same function than continue to work on sslwrap long term. -- Jonathan McDowell Sat, 13 Aug 2005 13:01:06 +0100 (from ) See also , where the maintainer requested its removal from Debian. Why I Tried, Then Gave up on `CryWrap` ------------------------------------ `apt-cache search sslwrap` found only `crywrap`, and `apt-cache show crywrap` said: > `CryWrap` is intended to be a drop-in replacement for `sslwrap`. This is more or less a blatant lie. `CryWrap`'s command-line options have nothing in common with `sslwrap`'s, and `sslwrap` is written to run from `inetd` --- for example, it reports its errors through `syslogd`, not by printing them to standard error. `CryWrap` is not, even though it has an `--inetd` option. Here are the problems I encountered trying to use `CryWrap`: 1. `CryWrap` doesn't support `sslwrap` command-line options. 2. `CryWrap` reports its errors to `stderr`. 3. 
When I did get `CryWrap` to work, it reliably took about 110 seconds to negotiate an SSL connection, which is longer than Thunderbird is willing to wait. 4. `CryWrap`'s `sslwrap` wrapper doesn't support `inetd`. 5. `CryWrap`'s documented `-v` flag doesn't work as documented. I wasted an hour trying to get `CryWrap` to work, and eventually gave up and installed `stunnel4` instead. Here are some details, in case you find yourself in a similar situation: ### `CryWrap` doesn't support `sslwrap` command-line options. ### The line in our /etc/inetd.conf for `sslwrap` looked like this, all on one line: pop3s stream tcp nowait root /usr/sbin/tcpd /usr/sbin/sslwrap -cert /etc/sslwrap/server.pem -addr 127.0.0.1 -port 110 I `apt-get install`ed `crywrap`, changed `sslwrap` to `crywrap`, and hoped for the best. Initially, that failed because `tcpd` was specially configured to allow connections to `sslwrap` from weird places, in `/etc/hosts.deny`: ALL EXCEPT sslwrap: PARANOID EXCEPT Our Argentine ISP, TeleCentro, is so incompetent that our reverse DNS maps our IP address to a name that doesn't exist. So `tcp_wrappers`'s `PARANOID` rule won't allow us to connect to tcp-wrapped services. So I changed that to say ALL EXCEPT crywrap: PARANOID EXCEPT Then I ran into the problem that we had removed the `/etc/sslwrap` directory and the `server.pem` file inside it that contained the server's private key. After a little bit of digging, I found out where to get the private key file, stuck it in `/etc/crywrap/server.pem`, and put the following completely wrong line in `/etc/inetd.conf`: pop3s stream tcp nowait root /usr/sbin/tcpd /usr/sbin/crywrap -cert /etc/crywrap/server.pem -addr 127.0.0.1 -port 110 You see, at this point, I still believed the package description that claimed that `CryWrap` was "a drop-in replacement". 
I got a log line (apparently from `tcpd`) that said the connection had been made:

    Mar 29 19:51:03 panacea crywrap[27458]: connect from 190.55.55.32 (190.55.55.32)

But the `crywrap` process had died, and there were no error messages in any of the `/var/log` files explaining why. This is because of problem #2, "`CryWrap` reports its errors to `stderr`," which I explain below.

Upon consulting the man page and debugging from the error message for a while, I ended up with the following line in `/etc/inetd.conf` instead:

    pop3s stream tcp nowait root /usr/sbin/tcpd /home/kragen/crywrap -d 127.0.0.1/110 -i

`/etc/crywrap/server.pem` is the default location for `CryWrap` to look for a server certificate, so I omitted it from the command line.

### `CryWrap` reports its errors to `stderr`. ###

In order to find out what was wrong, I temporarily ran `/home/kragen/crywrap` instead of `/usr/sbin/crywrap` from `inetd`. `/home/kragen/crywrap` is this script:

    #!/bin/sh
    # Run the real crywrap under strace, logging its system calls to /tmp.
    /usr/bin/strace -s4096 -o /tmp/crywrap.strace /usr/sbin/crywrap "$@"

And it turned out that `CryWrap` was writing its error messages to `stderr`, file descriptor 2, instead of to a log file. `stderr` in a process run from `inetd` is actually connected to the socket talking to the client, so writing error messages to it is almost certain to violate the protocol expected by the client.

Here's one sample error message (a result of problem #4, "`CryWrap`'s `sslwrap` wrapper doesn't support `inetd`", below) from strace's output (wrapped for readability):

    write(2, "crywrap", 7) = 7
    write(2, ":", 1) = 1
    write(2, " ", 1) = 1
    write(2, "Could not resolve address: `/\'", 30) = 30
    write(2, "\n", 1) = 1
    write(2, "Try `crywrap --help\' or `crywrap --usage\' for more information.\n", 64) = 64
    exit_group(64) = ?

This was at the very end of the file.

Now, this would not be such a heinous sin in a program that was intended to speak, say, SMTP.
If a fatal error message gets sent to an SMTP client, it's likely to end up somewhere that a human being can see it and diagnose the problem. But SSL is a different matter. SSL connections are normally full of random toxic binary data, so almost no SSL-speaking programs will dump out that data on a human when there's a connection failure. So the only way I was able to find these error messages was by running the program under `strace(1)`. ### `CryWrap` took about 110 seconds to negotiate an SSL connection. ### Once I got `CryWrap` to run, my wife Beatrice was still reporting failures getting her mail in Thunderbird. `strace` showed that `CryWrap` was running and receiving data (`less /tmp/crywrap.strace` and then typing `>F` was very helpful to watch this in real time), but it was receiving it very slowly, a few bytes every few seconds. At Paul Visscher's suggestion, I tested the connection myself with the OpenSSL package's `openssl` command: openssl s_client -connect panacea.canonical.org:pop3s This did eventually connect and allow me to speak POP (simulated copy-and-paste here may contain errors): ... Timeout : 300 (sec) Verify return code: 21 (unable to verify the first certificate) --- +OK USER imaptest +OK PASS +OK QUIT DONE However, it took about a minute and 51 seconds. This is apparently more than Thunderbird's timeout. I don't know enough about SSL to know why this might be. `CryWrap` reported it with these `syslog` messages (wrapped and trimmed for readability): crywrap[27830]: Accepted connection from 190.55.55.32 on 0 to 127.0.0.1/110 crywrap[27830]: Handshake failed: A TLS packet with unexpected length was received. I never did figure out why this happened, and so I gave up on `CryWrap` and switched to stunnel4 (see below). 
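A minimal sketch of how a delay like this could be measured from the shell rather than with a stopwatch. The `elapsed` helper is a hypothetical name, not part of any package mentioned here, and it assumes a `date` that supports `+%s` (GNU and BSD `date` both do).

```shell
#!/bin/sh
# elapsed: run a command, discard its output, and print how many
# whole seconds of wall-clock time it took.
# (Hypothetical helper for illustration only.)
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example invocation (not run here because it needs the network):
#   elapsed openssl s_client -connect panacea.canonical.org:pop3s </dev/null
# Redirecting stdin from /dev/null makes s_client exit once the
# handshake is done instead of waiting for input.
```

Running it before and after a configuration change gives a number to compare against the roughly 111 seconds observed above.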
### `CryWrap`'s `sslwrap` wrapper doesn't support `inetd` ###

There is a shell script at `/usr/share/crywrap/sslwrap` that is intended to make `crywrap` act like `sslwrap`, but it doesn't handle the case of running from `inetd` (the `-i` or `--inetd` flag). Because it's a badly-written shell script, it doesn't notice that its "listen port" parameter is missing; it merely tries to invoke `CryWrap` with "-l /". (`CryWrap` uses a slash to separate IP address from port, instead of the traditional colon; in this case, both the IP address and the port are missing, leaving only the lonesome "/", like a girl who's been stood up on a date.) `CryWrap` reports this by sending the helpful message:

    crywrap: Could not resolve address: `/'

to the would-be SSL client. I extracted it from an `strace` output file in `/tmp`, except that I had to use `strace -ff` to follow the children of the `/usr/share/crywrap/sslwrap` script. (I guess I could have just redirected `stderr` to a file instead of using `strace`.)

### `CryWrap`'s documented `-v` flag doesn't work as documented. ###

`CryWrap`'s man page documents a `-v` flag. `-v 0` is documented to turn off client certificate validation, although having it turned off is documented to be the default. We thought that perhaps the default was actually something other than what it was documented to be, because on the successful `openssl s_client` connections (see above under #3), we were getting this message:

    crywrap[28190]: Error getting certificate from client: The peer did not send any certificate.

And it seemed plausible that this might explain the slowness (#3). So I tried adding `-v 0` to the command line, because the man page says:

> `--verify` (`-v`) [LEVEL]
>
> > Set the level of client certificate verification. Level
> > one simply logs the result, level two and above abort if
> > the certificate could not be verified.
> > Default is 0.

If you actually try running crywrap with `-v 0`, you get this error message: kragen@panacea:~$ /usr/sbin/crywrap -l /3802 -d /110 -v 0 crywrap: Too many arguments Try `crywrap --help' or `crywrap --usage' for more information. Except that I didn't originally get the error message at the command line; I had to dig it out of `strace` output in `/tmp` after editing `/etc/inetd.conf` and restarting `inetd`. It turns out that `-v0` is the supported syntax, despite what the man page says, and in violation of the usual Unix conventions. No space is permitted. Success with `stunnel4` ----------------------- I did this: $ sudo apt-get install stunnel4 Then, after skimming the `stunnel` man page, I stuck this in `/etc/inetd.conf` (all on one line) in place of the `crywrap` line: pop3s stream tcp nowait root /usr/sbin/tcpd /usr/bin/stunnel -p /etc/crywrap/server.pem -r 110 That worked. Then I moved `/etc/crywrap/server.pem` to `/etc/stunnel/server.pem` and all was good. The total elapsed time since giving up on `CryWrap` was just under eleven minutes. Things I Learned ---------------- Or was reminded of. 0. It's easy to underestimate how much of a pain in the ass your software will be for other people. Presumably `CryWrap`'s author wouldn't have had any of the above problems (except for #3, and he could have probably diagnosed that one). 1. If I write software and claim it's a "drop-in replacement" for something else, someone is going to be sad. Or pissed off. Because I'll probably forget something. (Although hopefully I'll do better than this!) 2. It's good to be careful about where error messages go. 3. I should try to make sure that my software handles errors (e.g. missing listen port) in a graceful fashion, i.e. by bombing out with an error ("listen port required") instead of proceeding to invoke something else with some broken default (in this case, the empty string) and relying on it to emit a useful error message (``crywrap: Could not resolve address: `/'``). 
Generally it's pretty easy to make this mistake in shell scripts, but in this case the listen port was explicitly set to the empty string before command-line parsing, as a default, so the problem would have been the same regardless of language. 4. It takes as long to write stuff like this up as it does to experience it. 5. Violating established conventions is likely to cause some frustration; be sure you're doing it for a good reason. By convention `-v 0` is equivalent to `-v0` when `-v` takes an argument; the violation of this convention made the software harder to use. 6. `stunnel` rocks and can do what `sslwrap` did. `CryWrap` sucks and can't. 7. OpenSSL has the `openssl s_client` command, which is like an SSL version of `netcat`, and also `openssl s_server`. These should be very handy for troubleshooting SSL stuff in general. 8. I'm not a great sysadmin, and I tend to be too persistent when I should give up and try something else a little sooner. Credits ------- Thanks to Gergely Nagy for writing `CryWrap`, Jonathan McDowell for maintaining the `sslwrap` Debian package for so long, Rick Kaseguma for writing `sslwrap` in the first place, Beatrice Murch for having the patience to help me test the mail server after the upgrade, Paul Visscher for helping me out with most of the above stuff and also doing a bunch of the work of the Etch upgrade on our machine, and Brett Smith and Jason Cook for doing most of the rest of that work. From kragen at pobox.com Thu Jun 5 03:37:02 2008 From: kragen at pobox.com (Kragen Javier Sitaker) Date: Thu Jun 5 03:37:03 2008 Subject: notes on using QEMU Message-ID: <20080520042720.B0969183437@panacea.canonical.org> I'm running QEMU with kqemu on my old 700MHz laptop. User-mode stuff is slowed down only slightly. This command line: time for x in $(seq 10000); do :; :; :; :; done takes 1.17 1.19 1.20 1.22 user seconds in emulation and 1.13 1.13 1.14 1.14 user seconds outside QEMU. 
However, it takes about 100ms of system time in place of about 10ms. (The `-kernel-kqemu` flag may solve this; haven't measured.) I had some kind of keyboard problem when I ran QEMU 0.8.2-4etch1 with `-snapshot`. Like, the keyboard just didn't work. That problem went away when I built QEMU 0.9.1 from source and started using that, but I still can't use `-snapshot` and `-loadvm` together. Networking: `tap` ----------------- This was a bad idea (for me). By default, QEMU uses `user` networking, which proxies network connections through normal sockets, like `slipknot` or `slirp` or `term`. (In fact, it uses `slirp`.) I thought this didn't give me a way to talk to it over the network (for example, if I'm running a web server on it). So I thought `-net tap` could help with this, but it has some drawbacks. It requires running QEMU as root, and then the network interface on the emulated machine needs to be configured statically, e.g. in `/etc/network/interfaces`, since `-net tap` doesn't provide DHCP by default. And then you have to set up IP masquerading, more or less as follows: qemu -net nic -net tap,script=ifup "$image" In file `ifup`: set -e /sbin/ifconfig "$1" 172.20.0.1 echo 1 > /proc/sys/net/ipv4/ip_forward /sbin/iptables -t nat -A POSTROUTING --source 172.20.0.0/24 -j MASQUERADE This does actually work, but you have to configure the network stuff inside of QEMU: IP address, netmask, default gateway, and worst of all, DNS server. And I think it might allow other people on your LAN to masquerade through you. What would be ideal would be bridging the virtual interface to my real Ethernet interface, but I never got around to doing this. Networking: `-redir` -------------------- It turns out there's an easier way. I can use the default `user` networking, and if I have a web server on the emulated host on port 8080, I can say qemu -redir tcp:8000::8080 "$image" and connect my web browser to . This works beautifully. 
The one downside I've found is that if you're using `qemu -loadvm`, the inner virtual machine has to re-request DHCP before the redirection works. Startup: `-loadvm` ------------------ Bootup takes an annoyingly long time. But, if you don't regularly have any permanent changes you want to save, you can use the `savevm` command to save an image of the virtual machine state after a boot, and then use `qemu -loadvm` to start QEMU in the already-booted state. From kragen at pobox.com Mon Jun 9 03:37:01 2008 From: kragen at pobox.com (Kragen Javier Sitaker) Date: Mon Jun 9 03:37:04 2008 Subject: Improving "science" in eSpeak's lexicon Message-ID: <20080520042720.F2CA018343D@panacea.canonical.org> (This is available in HTML at .) So I've been playing around with speech synthesis software tonight. [eSpeak](http://espeak.sourceforge.net/) looks a lot nicer than [Festival](http://www.cstr.ed.ac.uk/projects/festival/), just in that it's much easier to adjust its speed, correct its pronunciation, and play with variations: whisper, different accents, pitch, word spacing, creaky voice. I got to thinking, what would a logical policy for updating its lexicon look like? I thought the results I came up with were interesting. Maybe some other people will be interested too. The problem ----------- [eSpeak](http://espeak.sourceforge.net) gets "neuroscience" and "pseudoscience" wrong, pronouncing them with a `[[s,i@ns]]` rather than a `[[s'aI@ns]]`. It also gets "omniscience" and "prescience" wrong, or at least pronounces them rather differently than I would: $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "The science of neuroscience is not a scientific or quasiscientific pseudoscience. Conscientiously pursue omniscience and prescience." 
D@2 s'aI@ns Vv n'3:r-@s,i@ns I2z n,0t#@ saI@nt'IfIk _:_:O@ kw,eIzaIsi@nt'IfIk sj'u:d@s,i@ns k,0nsI2;'EnS@sli p3sj'u: '0mnIs,i@ns _:_:and pr'i:si@ns I would pronounce the "science" in "omniscience" and "prescience" as `[[S@ns]]` and put the accent on another syllable. There's a special rule for "scien" beginning a word, and for "conscience": en_list:conscience k0nS@ns en_rules: _sc) ie (n aI@ en_rules:?8 _sc) ie (n aIa2 However, Jonathan Duddington has said he wants to keep the eSpeak distribution small, so he "wouldn't want to include too many unusual or specialist words". (See where he talks about why he doesn't want to import the Festival lexicon.) Already, `espeak-data/en_dict` is 80KB, which is half the size of the `speak` binary. Replacement strategies ---------------------- There are several possible strategies that a maintainer could adopt in order to improve the coverage of their special-case word files without letting them get large. Suppose that there is a scalar metric of "goodness" that can be applied independently to each special case. Here are three plausible strategies, ordered from least to most stringent. - C-: They could never remove items from the file, adding new items as long as they were better than the worst item in the file. This will probably cause the average quality of the entries in the file to gradually decline, because many of the most important entries were probably added early on. It will eventually result in a very large file with very low average quality per entry, but very comprehensive coverage. - C+: They could keep the number of items in the file fixed, adding new items as long as they were better than the worst item in the file. This will cause the program to gradually work better, but each new version will introduce regressions --- words that the previous version pronounced correctly, but the new one does not. 
- A: They could never remove items, but add new items as long as they improved the median item quality of the file --- that is, as long as the new item improved the program's performance more than most of the items in the file. This will gradually slow down and eventually stop the addition of new items, because that median quality will gradually increase.

I am going to approximate "quality" with "frequency", on the theory that mispronouncing a rare word is always better than mispronouncing a common one.

Note the analogy to Google's famous hiring policy: only hiring candidates who would raise the company's average ability.

Evaluating word frequencies
---------------------------

Are these "science" words significant enough to include? `en_list` only contains 2869 lines, maybe 2400 of which are words. So maybe only the top 2400 or so exceptions to the normal rules of pronunciation are currently considered for inclusion.

Some time ago, I tabulated the frequencies of words in the British National Corpus and put the results online at . It has 109557 lines, ordered from the most common words ("the", "of", and "and", each occurring millions of times) to the least common (with a cutoff of 5 occurrences, because most of the words with fewer were actually misspellings).

I selected 20 lines at random from `en_list` with the following results:

    kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ ~/bin/unsort < en_list | head -20
    this %DIs $nounf $strend $verbsf
    barbeque bA@b@kju:
    con k0n
    ?5 thu TIR // Thursday
    _: koUl@n
    Ukraine ju:kr'eIn
    peculiar pI2kju:lI3
    unread Vnr'Ed $only
    inference Inf@r@ns
    José hoUs'eI
    unsure VnS'U@
    survey $verb
    ? $accent
    epistle I2pIs@L
    Munich mju:nIk
    scenic si:nIk
    synthesise sInT@saIz
    corps kO@ $only
    rajah rA:dZA:
    transports transpo@t|s $nounf

Where do these special cases appear in the British National Corpus tabulation?
Here are some results, edited for readability:

    kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -niE ' (this|barbeque|con|thu|ukraine|peculiar|unread|inference|José|unsure|survey|epistle|munich|scenic|synthesise|corps|rajah|transports)$' /home/kragen/devel/wordlist
    22:463240 this
    1178:7999 survey
    5102:1441 peculiar
    5831:1200 corps
    7165:888 ukraine
    8977:634 munich
    9045:627 unsure
    10552:494 inference
    11134:455 con
    15127:275 scenic
    29899:82 epistle
    31386:74 transports
    34270:62 synthesise
    37255:52 unread
    73679:11 thu
    74154:11 rajah
    87737:8 barbeque

The 50th percentile among the sample of 20 (of which two weren't words, and a third wasn't found) seems to be line 11134, with the word "con". That is, the exceptions in `en_list` are mostly drawn from the most frequently used eleven thousand words in the language. (Maybe words like "barbeque", "rajah", and "unread" should be dropped.)

So under the policies "C+" and "C-", any word that is more common than "barbeque", at position 87737 in the British National Corpus tabulation (or maybe some word even a bit rarer than that), should be added to the file. (Under policy "C+", some word would be removed to compensate, raising the threshold.) Under the policy "A", the threshold would be "con", at position 11134.

Unfortunately, José is missing. I think I excluded accented characters when I tabulated the frequencies initially.
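The median can be checked mechanically from the line numbers in the `grep` output above. This is a minimal sketch; `median_rank` is a hypothetical helper name, and for an even number of values it returns the lower of the two middle values rather than averaging them.

```shell
#!/bin/sh
# Ranks (line numbers in the frequency tabulation) of the 17 sampled
# en_list words that were found, copied from the grep output.
ranks="22 1178 5102 5831 7165 8977 9045 10552 11134 15127 29899 31386 34270 37255 73679 74154 87737"

# median_rank: print the middle value of its numeric arguments.
# (Hypothetical helper; picks the lower middle for even counts.)
median_rank() {
    printf '%s\n' "$@" | sort -n |
        awk '{ a[NR] = $1 } END { print a[int((NR + 1) / 2)] }'
}

# $ranks is deliberately unquoted so that it splits into 17 arguments.
median_rank $ranks
```

With 17 values the middle one is the ninth, which is the line number for "con".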
Anyway, that gives us a way to compare the "science" words: kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -n scien[tc] /home/kragen/devel/wordlist 870:10597 science 1614:5922 scientific 2584:3547 scientists 3865:2088 sciences 3977:2005 scientist 5342:1355 conscience 13365:338 conscientious 16976:227 scientifically 25757:109 consciences 26015:107 conscientiously 27861:93 unscientific 37040:53 omniscient 44349:36 prescient 49031:29 neuroscience 49706:28 prescience 50457:27 scientificity 50587:27 omniscience 53155:24 scientism 62346:17 geoscience 66943:14 scientia 67285:14 neuroscientists 68176:14 conscientiousness 82060:9 geoscientists 84433:8 scientology 84434:8 scienter 86513:8 geosciences 90235:7 neurosciences 93073:7 biosciences 93074:7 bioscience 95039:6 scientifique 95591:6 pseudoscience 103190:5 presciently 103191:5 prescientific Of these, only those more common than "conscience" seem to deserve a place in `en_list`. How does eSpeak do now? $ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "Science is scientific and done by scientists, who work in the sciences. A scientist with a conscience may be conscientious. Those with scientifically-minded consciences will conscientiously avoid unscientific claims of omniscient beings or prescient prophets." s'aI@ns I2z saI@nt'IfIk _:_:and d'Vn baI s'aI@nt#Ists _:_:h,u: w'3:k I2nD@2 s'aI@nsI2z a2 s'aI@nt#Ist wI2D a2 k'0nS@ns m'eI bi: k,0nsI2;'EnS@s DoUz wI2D saI@nt'IfIkli m'aIndI2d k'0nS@nsI2z wIl k,0nsI2;'EnS@sli; a2v'OId VnsaI@nt'IfIk kl'eImz Vv '0mnIs,i@nt b'i:;INz _:_:O@ pr'i:si@nt pr'0fIts It pronounces everything correctly until it gets to "omniscient" and "prescient", and maybe its pronunciations for those are correct, but at least they're not the pronunciations I would use. Under policy "A", those words are not common enough to add to `en_list`, because they would lower the average frequency of words in `en_list` unless you removed a less common word to compensate. 
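The comparisons in this section reduce to two rank thresholds. A minimal sketch, assuming the thresholds read off the 20-line sample ("barbeque" at 87737 as the rarest sampled word, "con" at 11134 as the median); `policies_for_rank` is a hypothetical name, and the sketch ignores the way the "C+" threshold would rise as words are removed.

```shell
#!/bin/sh
# Thresholds are ranks (line numbers) in the frequency tabulation.
WORST_RANK=87737    # "barbeque": rarest word found in the en_list sample
MEDIAN_RANK=11134   # "con": median of the en_list sample

# policies_for_rank: print which policies would admit a word whose
# rank in the tabulation is $1. (Hypothetical helper.)
policies_for_rank() {
    rank=$1
    admitted=""
    if [ "$rank" -lt "$WORST_RANK" ]; then
        admitted="C- C+"
    fi
    if [ "$rank" -lt "$MEDIAN_RANK" ]; then
        admitted="$admitted A"
    fi
    echo "${admitted:-none}"
}

policies_for_rank 5342    # conscience
policies_for_rank 37040   # omniscient
policies_for_rank 95591   # pseudoscience
```

Only "conscience" clears the stricter threshold; "pseudoscience" is rarer than anything in the sample and clears neither.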
Under policies "C+" and "C-", not only "omniscient" and "prescient" qualify, but so do "neuroscience", "geoscience", "neuroscientists", and "geoscientists", which eSpeak currently mispronounces.

(Including all the exceptions that are as rare as "prescient" might quadruple the size of `en_list`, and perhaps `en_dict` as a result, if arbitrary spellings were as common among rare words as they are among common words. Think of that as an upper bound. Including all the exceptions as rare as "neuroscientists" might multiply its size by seven. This is the downside of policy "C-", but it does not happen with policy "C+". On the other hand, under policy "C+", even "prescient" might not survive long after being added.)

Recommendation
--------------

There is a better solution than adding a bunch of one-word special cases to `en_list`. Probably in this case the solution is to change the special case for "conscience" to a special case for "conscien..." and change the "scien..." rule to a "...scien..." rule; that covers all the words except for "omniscien..." and "prescien...". Covering those two takes only two more rules in `en_rules`, if it's considered worthwhile; for comparison, "conscience" is ten times as common as both of those together and "con" three times as common, while "barbeque" is 18 times less common.

Alternatives
------------

I think there is a need for a larger `en_list` and `en_rules` to be available, even if they aren't part of the standard distribution. eSpeak's current footprint for a single language is about 160KB for the executable and 80KB for the dictionary. But it would be useful in many cases even if its dictionary were 800KB (as perhaps it would be with the Festival lexicon) or 8MB.

There is also a need for a better user interface for making changes to the dictionary, and especially to `en_rules`: currently it's hard to know which words' pronunciation you're changing when you change `en_rules`, and you have to master a phonological orthography system to make any contribution at all.
And then there's no `git`-like infrastructure for sharing your changes, and even learning `git` is a pretty big barrier to contributions. If, instead, you could twist a knob to jog back to the last mispronounced word, then hold down a button and say its correct pronunciation, the barrier to contributions would be much lower. You would need a reasonable phonological analysis system (like in a speech-to-text system) to turn the spoken word into the string of phonemes. Then, if you could share your accumulated corrections with all other users of the software with the push of a button, the process of coming up with the tens of thousands of special cases would be a lot quicker. From kragen at pobox.com Thu Jun 12 03:37:01 2008 From: kragen at pobox.com (Kragen Javier Sitaker) Date: Thu Jun 12 03:37:03 2008 Subject: desmoking the smoky Buenos Aires air with a jacuzzi Message-ID: <20080520050728.D81DC183458@panacea.canonical.org> It was kind of a chilly late autumn here in Buenos Aires as I wrote this, 2008-04-19; but, because of this [rather remarkable environmental phenomenon](http://modis.gsfc.nasa.gov/gallery/individual.php?db_date=2008-04-18 MODIS image of the fires and smoke), I had the air conditioner turned on full blast. This is apparently the best we can do at avoiding the smoke that blankets the city, and I thought I should document the crazy scheme I was using, since it seemed to work. The Crazy Scheme ---------------- I heated the house a lot using the water heater; like many Argentine houses, we have one of those wonderful tankless water heaters, which can provide an unlimited supply of scalding hot water unless they catch on fire or melt or something. (Don't laugh; our friends Kevin and Alicia's caught on fire several times.) 
So I ran the shower very slowly to fill up the jacuzzi tub (we have a jacuzzi tub, you see), then started the bubble jets running to transfer more of the heat into the air, while using a fan to vent the hot, humid air from the bathroom to the rest of the house, to give the air conditioner more heat to chew on. (Unlike, say, the stove, the hot water heater is properly vented to outside, with a little horizontal metal chimney. And it has a hell of a gas burner inside.) I ran the air conditioner because it does remove some of the smoke from the air, but it's driven by a thermostat, so it would stop running if the air cooled off. So that's why I used the hot water heater to heat up the house. It turns out that the hot shower and the jacuzzi bubble jets also remove smoke from the air. There was quite a bit of smoke deposited around the edge of the bathtub now from the bubbles. Photos will be forthcoming eventually if I didn't lose them all. The humidity added to the air also ameliorated the eye and bronchial irritation from the smoke. Fine-Tuning ----------- I added a little shampoo to the water in order to make it pick up smoke particles and transfer heat better. (Smaller bubbles, less spherical bubbles, and less tendency for the water to repel smoky oil particles.) Unfortunately, jacuzzis being jacuzzis, this resulted in a huge mass of bubbles growing from the water and threatening to swamp the bathroom. So I added a little conditioner, and the bubbles died back down. In general, the surfactant action I was looking for and the foaming action I got are not inseparable. I wonder if I could get better results with dishwasher detergent, for example (although ideally without bleach), industrial degreasers, or some simple combination like Calblend plus simethicone. The placement of the fan in the bathroom door proved important. 
Ideal would be a fan at the other end of a long duct, either blowing cool, dry air into the bathroom, or sucking hot, wet air out of the top of the bathroom. What seemed to work best with the fan I have was blowing hot, wet air out of the bathroom and into the bedroom, where the air conditioner is. This produced a certain amount of fog.

Real Air Filtration
-------------------

I was hoping to try to get a HEPA filtration system for the house. I knew it wouldn't be easy, because there aren't many stores here that sell them, and due to the smoke, even the stores that carry them seemed to be out. So we made do with N95 respirators, which I bought at a pharmacy that evening after hunting for hours for something better.

We spent a lot of yesterday and this morning with wet bandannas around our faces, inside the house. I have a headband that I used to secure the bottom of the bandanna against my chin, and I rolled up strips of paper towel to put on each side of my nose to block the spaces there. That seemed to provide some noticeable protection.

I picked up some deionized water at Carrefour that evening; I figured it might work better for air filtration than tap water. I'm not sure whether it did or not.

I wish I had some way of measuring the smoke, other than by counting how often I cough, because I'd like to know (for example) whether the N95 respirators work better than the bandannas, or even whether they're effective at all against smoke.

From kragen at pobox.com Mon Jun 16 03:37:01 2008
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Mon Jun 16 03:37:02 2008
Subject: Git's first few big lessons
Message-ID: <20080520050729.02C55183458@panacea.canonical.org>

Here are the top few things I learned about git in the first few hours I used it. This is the document I wished I had had, on top of the various introductions floating around. Maybe it will be useful to somebody else.

0. Git handles 400MB of HTML crawl data less gracefully than it handles 700K of Python.
But it handles that data more gracefully than `cp` and `rsync` do. 1. Don't `git push` to a repository that actually has a work area. Always use `git pull` instead. `git push` doesn't update the associated working area, or the index either, so if you try to `git commit` in that repository, you will commit a patch that undoes all the stuff you just did. See section "Push changes and the working copy". You can solve this with `git reset --mixed HEAD`, or eventually `git reset --hard HEAD` to throw away any changes in the working area. 23:05 < johnw> $ rsync -av .git/ server:/tmp/foo.git/ ; cd /tmp ; git clone ssh://server/tmp/foo.git 23:06 < johnw> that's all you need to setup a remote repository, and to start using it right away 2. `git repack -a -d -f` can achieve some truly astonishing compression ratios. This is how you make git checkouts faster than cp -a or rsync. In my case, three times faster than rsync over a slow network, due to a 7:1 compression ratio. 3. You have to `git add` changed files before you can `git commit` them, or use `git commit -a`, because `git commit` commits things from the index, not your work area. In older versions of Git, you used `git update-index` instead of `git add` on changed files. 4. `git commit` takes an option `--amend` which lets you amend previous commits. 5. `git clone -l` makes a hardlinked clone. 6. git has early-stage support for something called "submodules" in recent versions, similar to svn:externals. And there's an in-development `git hunk-commit` command that might end up in git someday that should add most of Darcs's UI niceness to git. I took the first 125 563 056 bytes of my mailbox and compressed them into 59M with git. However, git (1.4) doesn't seem to work very well with multi-gigabyte quantities. If you're using the Git 1.4 from Debian Stable, you'll want to know to use `init-db` instead of `init`, `repo-config` instead of `config`, and often `update-index` instead of `add`. 
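Points 3 and 4 can be tried out in a throwaway repository. This is a minimal sketch, assuming a reasonably recent git (not the 1.4 discussed above, which spells some commands differently); the function name, file name, and identity settings are placeholders chosen for illustration.

```shell
#!/bin/sh
# git_index_demo: show that `git commit` records what is in the index,
# not what is in the work area, and that --amend rewrites the latest
# commit instead of adding one. Prints the final number of commits.
# (Hypothetical demo function; runs in a subshell and a temp directory.)
git_index_demo() (
    set -e
    demo=$(mktemp -d)
    cd "$demo"
    git init -q
    git config user.name "Example User"
    git config user.email "user@example.org"

    echo one > notes.txt
    git add notes.txt
    git commit -q -m "first"

    # Change the file without adding it: the index is unchanged,
    # so a plain `git commit` has nothing to record and fails.
    echo two >> notes.txt
    if git commit -q -m "not staged" >/dev/null 2>&1; then
        echo "unexpected: commit succeeded without git add"
    fi

    # Stage the change, commit it, then reword that same commit.
    git add notes.txt
    git commit -q -m "second"
    git commit -q --amend -m "second, reworded"

    git rev-list HEAD | wc -l | tr -d ' '
    cd /
    rm -rf "$demo"
)

git_index_demo
```

The count stays at two because the attempt with nothing staged records no commit, and `--amend` replaces the second commit rather than adding a third.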

From kragen at canonical.org Thu Jun 19 03:37:02 2008
From: kragen at canonical.org (Kragen Javier Sitaker)
Date: Thu Jun 19 03:37:03 2008
Subject: upgrading to Emacs 22 is FANTASTIC
Message-ID: <20080522084453.B920D18346C@panacea.canonical.org>

So I just upgraded to Emacs 22 in April, despite Debian Etch not supporting it. It solves several of my daily annoyances with Emacs 21:

- It recognizes "Password: " as a password prompt, so ssh and sudo get the benefit of me not having to manually type M-x send-invisible.
- I can paste Unicode text into it from a web browser, including asymmetrical quotes, real apostrophes, and em dashes, and have it save them to a UTF-8 file without fuss. (Although it still displays the quotes in an obnoxious double-width fashion until the file has been saved and reloaded.)
- TRAMP works out of the box.
- The documentation is included, unlike in Debian. (There's a licensing dispute over whether the GNU Free Documentation License is free enough to satisfy the Debian Free Software Definition.)
- comment-region now asks what comment syntax to use if it doesn't know.
- When I run e.g. "darcs" by itself in shell-mode, occasionally Emacs used to take quite a while to display its output usage message, because it was reading it one character at a time. This has been fixed.

I also anticipate joy using MuMaMo, but I haven't actually tried that yet.

There are some changelog/news entries that sounded pretty good:

...if you set `set-mark-command-repeat-pop' to t. I.e. C-u C-SPC C-SPC C-SPC ... cycles through the mark ring. Use C-u C-u C-SPC to set the mark immediately after a jump. [Haven't tried this yet.]

...M-% typed in isearch mode invokes `query-replace' or `query-replace-regexp' (depending on search mode) with the current search string used as the string to replace. [Haven't tried this yet.]

You can now customize the use of window fringes. To control this for all frames, use M-x fringe-mode or the Show/Hide submenu of...
[so now I can have two 80-column windows on my screen at once, which is awesome]

A new minor mode `next-error-follow-minor-mode' ... In this mode, cursor motion in the buffer causes automatic display in another window of the corresponding matches, compilation errors, etc. [Haven't tried this.]

The new command `multi-occur' is just like `occur', except it can search multiple buffers. [Useful. Also I didn't know about `occur`.]

The grep commands provide highlighting support. Hits are fontified in green, and hits in binary files in orange. Grep buffers can be saved and automatically revisited. [This is in fact extremely awesome.]

In addition, when ending or calling a macro with C-x e, the macro can be repeated immediately by typing just the `e'. [This sounds nice, but the F3 and F4 macro keybindings are better.]

The new package longlines.el provides ... "soft word wrap" [like actual word processors have had since the 1970s. Turns out to be fantastic.]

SES mode (ses-mode) is a new major mode for creating and editing spreadsheet files. [Haven't tried this yet.]

The new package table.el implements editable, WYSIWYG, embedded `text tables' in Emacs buffers [Haven't tried this yet.]

The new package flymake.el does on-the-fly syntax checking of program source files. [Haven't tried this yet.]

savehist saves minibuffer histories between sessions. [Haven't tried this yet.]

isearch in Info uses Info-search and searches through multiple nodes. [This is fantastic.]

Atomic change groups: To perform some changes in the current buffer "atomically" so that they either all succeed or are all undone, use `atomic-change-group' around the code that makes changes. [Sounds like a fantastic idea, but I haven't tried it either.]

So far I've only noticed two new annoyances: one is that it uses its own python-mode that I don't like as well as the one that comes with Python, and the other is that C-x C-f RET no longer reverts the file to the version in the filesystem (assuming the buffer wasn't edited); now you actually have to type the filename.

The stuff in the NEWS file (C-h N) looks pretty innocuous. Nothing is terribly exciting, though.

From kragen at canonical.org Mon Jun 23 03:37:01 2008
From: kragen at canonical.org (Kragen Javier Sitaker)
Date: Mon Jun 23 03:37:03 2008
Subject: wood and stone personal digital assistants
Message-ID: <20080522084453.D3B1118343E@panacea.canonical.org>

(Available in HTML at .)

Polished-Stone Handheld Computers
---------------------------------

So I've been thinking about making a handheld computer with the look and feel (shininess, irregularity, weight, seamlessness) of a polished semiprecious stone.

One way to do this would be to embed the electronics in polyester resin poured into a mold, with an embedded induction coil for charging, some embedded lead shot for weight, and a dark, but not quite opaque, surface layer to hide the interior except for when it was glowing.

Input would probably be piezoelectric, localizing surface taps or using rhythm. (See earlier kragen-tol post [magic boxes and secret knocks][magic].) Output could be through embedded LEDs shining through the surface layer or through audio, especially if you held it against a window.

(How much lead shot would you need? Lead has a density of 11.3g/cc, against quartz's 2.6g/cc and [the polyester resin's 1.11g/cc][EP4117], so only 14.6% of the volume would need to be lead to equal quartz's density.)

It would be shockproof, waterproof, crushproof, not particularly prone to damage from ESD, and it would feel really good in your hand. Some hard silicone around the outside might improve its thermal conductivity. (There are hard silicone resins with high thermal conductivity, right?)
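
The lead-shot figure above is a volume-weighted average of densities, solved for the lead fraction. A minimal sketch of that arithmetic, assuming the lead and resin volumes simply add (no voids between the shot) and ignoring the volume of the electronics; the function name is made up for illustration:

```python
def filler_volume_fraction(target, matrix, filler):
    """Volume fraction of filler needed for a two-part mix to reach `target` density.

    Solves  f * filler + (1 - f) * matrix = target  for f.
    Assumes volumes are additive (an idealization, not stated in the post).
    """
    return (target - matrix) / (filler - matrix)


# Densities in g/cc, as quoted in the text.
LEAD = 11.3
QUARTZ = 2.6
RESIN = 1.11

lead_fraction = filler_volume_fraction(QUARTZ, RESIN, LEAD)
print(f"{lead_fraction:.1%} of the volume would be lead")  # 14.6% of the volume would be lead
```

The same function gives the fraction for any other target stone by swapping in its density.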

Beatrice suggested that you could use an actual polished semiprecious stone instead; cut out a circle from one side, drill out a cavity underneath, put the electronics inside, pot them with epoxy, replace the circle, wipe off the excess epoxy, and then polish the result.

Wood-Block Handheld Computers
-----------------------------

Another "everyday object" kind of electronic device case: a block of wood.

Some time ago I saw a web page about a wooden clock. It seems to be widely available now; for example, advertises it for ?93.99. It explains:

> A totally minimal block of wood with digital numbers floating
> across the surface. These clever clocks have a very thin layer of
> real maple wood veneer that permits the LEDs to shine through.
>
> Each one is slightly different due to the natural variation in
> wood grain.
>
> Dimensions: 208 x 90 x 90mm
> Weight: 1.2kg

Another page says:

> TO:CA 'wood' LED clock designed by kouji iwasaki in 2002. this
> 'wooden' LED clock won top prize at the asahikawa international
> design fair in 2002.

A third page says they're actually made of MDF under the maple veneer, and has a photograph of the back that seems to confirm this, and a fourth page says the manufacturer is "Takumi of Japan".

I think a handheld computer that looks like a block of wood would be pretty nice too. Something the size of a business card (3.5" x 2", or 89 x 51 mm) but fairly thick (say, 15mm), with veneer on at least one side.

The resolution of the display would be limited by the light blurring on the way through the translucent veneer; each spot of light would have a radius on the order of the thickness of the veneer. [Veneers are typically 0.8mm][veneers] but are available as thin as 0.3mm. If spaced 1.6mm apart, you could get almost 1800 pixels in a rectangular array into the business-card size.
You could do a little better with a hexagonal array: if the distance from the center of a regular hexagon to the center of one of its sides is r, then the distance to one of its corners is about 1.15r, which is the same as the length of each side; and its area is 1½ * 1.15r * 2r = 3.45r², which is 14% smaller than a square circumscribed around the same size of circle. In the case of r=0.8mm, you'd have 2.2mm² per pixel instead of 2.56, so you'd get about 2000 pixels. But then you'd have to deal with the hexagonal array in your software.

1800 pixels is enough for about 45 letters in a traditional 5x8 single-bit-deep font, which is pretty cramped; my cheap two-year-old US$30 cellphone has something like 65 letters' worth of space on its display. But it's enough to be useful. It's a lot more than any of the under-US-$10 devices I picked up for the "[cheap electronics dissection project][electronics]" in 2006, and they are useful for some things.

I don't know how easy or hard it is to populate a PC board with 1800-2000 LEDs. I know I wouldn't want to do it by hand.

You could hollow out the middle of a block of wood with just a drill and jigsaw; a keyhole saw or wire saw might work in place of the jigsaw. Cutting all the way through it would be a lot easier than just chiseling out a hollow in one side of the block; then you'd need to put veneer on both sides instead of just one. To add strength and keep it from sounding hollow, you'd probably want to pot the whole interior with epoxy or something.

You could have a couple of finishing nails visible on one end if you wanted to charge it through actual electrical contacts rather than with induction.

Other Everyday Items
--------------------

You could also embed handheld computers in the following: oyster shells; bricks; pens (I suggested this previously on kragen-tol); ceramic tiles; beanbags, pillows, and stuffed animals (like the Chumby and the Furby).
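
The pixel-count estimates in the wood-block section can be reproduced in a few lines. This is a rough sketch, assuming pixels are counted by dividing the card's face area by the area of one cell, as the text appears to do; the variable names are made up, and the whole-cell count for the square grid is an extra figure not in the post.

```python
import math

CARD_W_MM, CARD_H_MM = 89.0, 51.0  # business-card face, from the text
R_MM = 0.8                         # blur radius, taken as the typical veneer thickness

card_area = CARD_W_MM * CARD_H_MM

# Square cell circumscribed around a circle of radius r: side 2r, area 4 r^2.
square_cell = (2 * R_MM) ** 2

# Regular hexagon with apothem r: area = 2 * sqrt(3) * r^2, about 3.46 r^2.
hex_cell = 2 * math.sqrt(3) * R_MM ** 2

# Area-based estimates (these ignore wasted strips at the edges).
square_pixels = int(card_area / square_cell)
hex_pixels = int(card_area / hex_cell)

# Counting only whole cells in a square grid gives a somewhat lower number.
whole_square_pixels = (math.floor(CARD_W_MM / (2 * R_MM))
                       * math.floor(CARD_H_MM / (2 * R_MM)))

GLYPH_PIXELS = 5 * 8  # one character cell in a 5x8 one-bit font

print(square_pixels, hex_pixels, whole_square_pixels)  # 1773 2047 1705
print(square_pixels // GLYPH_PIXELS)                   # 44
```

With the unrounded hexagon area the saving over the square cell comes out nearer 13% than 14%, and the letter count lands at 44 rather than 45; both are within the rounding the text uses.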

References
----------

[EP4117]: http://www.eagerplastics.com/4117.htm "Eager Plastics EP4117"

Eager Plastics, aka Eager Polymers, has an "[EP4117][] General Purpose Polyester Laminating Resin" with a density of 1.11 g/cc.

[magic]: http://lists.canonical.org/pipermail/kragen-tol/2002-April/000700.html

In April 2002, I posted "[magic boxes and secret knocks][magic]" to kragen-tol.

[veneers]: http://www.diyinfo.org/wiki/Using_Veneers "an article on DIYinfo.org"

The article [Using Veneers][veneers] describes the different kinds of wood veneers available today.

[electronics]: http://courageous.murch-sitaker.org/~kragen/electronics/

In 2006 I wrote a web page about my "[cheap electronics dissection project][electronics]", where I bought a bunch of cheap electronics and looked inside them.

From kragen at canonical.org Thu Jun 26 03:37:01 2008
From: kragen at canonical.org (Kragen Javier Sitaker)
Date: Thu Jun 26 03:37:02 2008
Subject: HTML is terser and more robust than S-expressions
Message-ID: <20080522084453.EA12D183479@panacea.canonical.org>

(This is available in HTML at .)

HTML is more succinct for things in its intended domain than S-expressions, but still has better error-detection and correction capabilities.

S-expression fans like to say that HTML, SGML, and XML are just bastardized S-expression languages. SGML partisans often respond that matching end-tags allow for better error-reporting and correction. But for typical HTML content --- mostly running text with a little bit of interspersed markup --- S-expressions are not only harder to correct, but also more verbose.

Consider this partial paragraph from the Ur-Scheme web page:

    <li><b>Reasonably fast.</b> It <b>generates reasonably fast code</b> &mdash; when compiled with itself, it runs 2? times faster (in user CPU time) than when it's compiled with <a href="http://www.call-with-current-continuation.org/">Chicken</a>, 1? times faster than when it's compiled with...</li>

Now, in traditional HTML, I could have left out the quotes around the URL and the ending `</li>` tag. Consider this S-expression version:

    (li (b "Reasonably fast.") " It "
        (b "generates reasonably fast code") " " mdash
        " when compiled with itself, it runs 2? times faster (in user CPU time) than when it's compiled with "
        (a :href "http://www.call-with-current-continuation.org/" "Chicken")
        ", 1? times faster than when it's compiled with...")

Most of the markup constructs take up more characters here:

    LI:  '<li></li>' (end tag could be omitted in traditional HTML)   '(li "")'
    B:   '<b></b>'                                                     '(b "") '
    B:   '<b></b>' (the second one)                                    '" (b "") "'
    ---  '&mdash;'                                                     '" mdash "'
    A:   '<a href=""></a>' (quotes could traditionally be omitted)     '" (a :href "" "") "'

If you look at this in a fixed-width font, you'll see that the number of markup characters is detectably larger in the S-expression serialization of the structure, with the exception of the first two.

I maintain that this is typical of the bulk of HTML, especially if you weight it by how often people write it instead of how often it gets sent to browsers.

You can come up with examples where that is not the case:

    <html><head><title>...</title>
    <link rel="stylesheet" href="../../style.css">
    <meta http-equiv="Content-Type" content="...">
    <style type="text/css">...</style></head></html>

vs.

    (html (head (title "...")
                (link :rel "stylesheet" :href "../../style.css")
                (meta :http-equiv "Content-Type" :content "...")
                (style :type "text/css" "...")))

but those structure-heavy, text-light examples with long-winded tag names are relatively rare for people to read and write.

Of course, the cost of terser syntax is often that errors are hard to diagnose. Ada's `end loop`, `end if`, `end record`, and so on mean that if you leave out an `end` delimiter, the compiler will usually be able to tell you which one you left out. At the opposite end of the spectrum, S-expression languages in which all the various kinds of `end` are spelled as `)` can only tell you when they get to the end of the program or to something that doesn't make sense in the current context.

> This is not a phenomenon limited to end-delimiters. In
> programming languages, there are many other examples of verbosity
> that helps to diagnose errors; for example, explicit type
> declarations, mandatory delimiter characters (in cases where the
> syntax would be no more ambiguous if they were removed from the
> grammar), sequences of single-line comments, and the conventional
> parenthesization of the arguments of fixed-arity functions ("ratio
> square sin x square sin y" is perfectly unambiguous, after all,
> and Forth, PostScript, Logo, and REBOL use more or less that
> syntax).
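
The contrast between labeled and anonymous closing delimiters can be shown with two toy checkers. This is only a sketch: both functions are hypothetical, the paren checker ignores parentheses inside strings, and the tag checker assumes every element is closed explicitly (so it would reject HTML that relies on optional end tags).

```python
import re


def check_parens(text):
    """Check '(' / ')' balance. Closers are anonymous, so a missing ')' can
    only be reported once the input runs out."""
    depth = 0
    for pos, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"unexpected ')' at offset {pos}"
    if depth:
        return f"{depth} unclosed '(' at end of input"
    return None


TAG = re.compile(r"<(/?)([a-z]+)[^>]*>")


def check_tags(text):
    """Check <x> ... </x> nesting. Closers are labeled, so a mismatch names
    the element that was left open and where it started."""
    stack = []  # (name, offset) of elements that are still open
    for m in TAG.finditer(text):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        elif stack:
            open_name, open_pos = stack[-1]
            return (f"</{name}> at offset {m.start()}, but <{open_name}> "
                    f"opened at offset {open_pos} is still open")
        else:
            return f"unexpected </{name}> at offset {m.start()}"
    if stack:
        open_name, open_pos = stack[-1]
        return f"<{open_name}> opened at offset {open_pos} is never closed"
    return None


# The same mistake in both notations: the closer for the bold element is missing.
print(check_parens('(li (b "fast code" " and more")'))
# 1 unclosed '(' at end of input
print(check_tags("<li><b>fast code and more</li>"))
# </li> at offset 25, but <b> opened at offset 4 is still open
```

The paren checker knows only that something is unclosed; the tag checker can say which element it was and where it began, which is the information an error-correcting parser needs.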
However, in the case of HTML, the terser syntax does not make errors harder to diagnose; in fact, the HTML syntax permits better error-detection and even error-correction, because all of the end-tags are explicitly labeled. (It differs from SGML in this regard; in SGML, you can write `
  • ` and eliminate the redundant end-tags altogether.)

From kragen at canonical.org Mon Jun 30 03:37:02 2008
From: kragen at canonical.org (Kragen Javier Sitaker)
Date: Mon Jun 30 03:37:03 2008
Subject: double-ended log-structured filesystems
Message-ID: <20080522084453.A802C18345F@panacea.canonical.org>

I use a sort of log-structured filesystem for my notebooks. I fill the notebooks in chronological order (more or less) from the second page to the last page. (The first page is left blank at first.) Everything is under some heading; the current heading is repeated at the top of every page, with the date, but sometimes there are several headings on a single page. The headings are underlined so they're easy to see looking at the page.

So I can find things by paging through the recent pages and looking at the headings. When that gets to be too much, I append a new "table of previous contents" section, under a heading just like everything else; it lists all the headings, with dates, since the last "table of previous contents". The first page contains a list of tables of previous contents, with their dates, so that I can find them relatively quickly.

This allows me to find my notes more quickly by reading through the few pages that are full of tables of previous contents, rather than leafing through all the pages in the book looking for headings.

If I were a disk, which I'm not, this would be a reasonably efficient scheme for writes: regardless of how much stuff I have to write, I could append it all in a single write to the end of the currently-written data, possibly including a new table of previous contents, then update the "superblock" on the first page with a pointer to the new table. So writing any amount of data less than a notebookful requires a seek to the end of the previous ToC, possibly a read of data following it, a write of the new data, and possibly a second seek and a second write to the superblock. Two seeks.

Finding something in a notebook with three ToCs requires at most four seeks: one to each ToC, then another one to the data; if it's not listed in any ToC, you can sequentially scan for it after the last ToC. With this scheme, there's a tradeoff (for either humans or disks) between the amount of sequential scanning you may have to do (due to still-unrubricated items) and the number of ToCs you may have to seek to and read.

Beatrice pointed out the other day that it would be easier for a human to write the notes sequentially from the beginning of the book, while writing the ToC entries sequentially from the end of the book. This way, all the ToC entries are in a single sequential chunk, the tradeoff between maximum sequential scan length and ToC fragmentation is eliminated, and writing still requires only two seeks.

Of course she is correct, and this might be a reasonable strategy for log-structured filesystems too, although there are usually more levels of indirection: from superblock, through various levels of inodes and directories, to the actual file extents on disk.

You could probably do a reasonable job by putting a B-tree of pathnames at a fixed location on the disk, and putting the inodes and data extents contiguously somewhere else. `/var/cache/locate/locatedb` is a reasonable approximation of the contents of this B-tree; on my current laptop, it's 5.3MB, indexing 95GB of files using 596 662 inodes (i.e. 596 662 files, although `sudo locate / | wc -l` only finds 494 488 files).

Repacking a 5-20MB B-tree when it got too large and loose would take a significant fraction of a second on a modern disk, but on my laptop would take perhaps 10-20 seconds, due to the slowness of on-CPU disk encryption. So it might be better to defragment the tree incrementally.
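
The seek counts for the two notebook layouts can be made concrete with a toy model. This is a sketch under simplifying assumptions that are mine, not the post's: each note or ToC occupies one slot, reading the first-page superblock is free, reading a ToC costs one seek, and fetching the note costs one more. The class and method names are hypothetical.

```python
class SingleEndedNotebook:
    """Notes and periodic tables of previous contents share one append-only log."""

    def __init__(self):
        self.log = []             # ("note", heading) or ("toc", {heading: position})
        self.superblock = []      # positions of the ToC sections (the first page)
        self.unindexed_from = 0   # first slot not yet covered by any ToC

    def write(self, heading):
        self.log.append(("note", heading))

    def write_toc(self):
        # Index every note written since the previous ToC.
        toc = {h: i for i, (kind, h) in enumerate(self.log)
               if kind == "note" and i >= self.unindexed_from}
        self.superblock.append(len(self.log))
        self.log.append(("toc", toc))
        self.unindexed_from = len(self.log)

    def find(self, heading):
        """Return (position, seeks), visiting ToCs oldest first."""
        seeks = 0
        for toc_pos in self.superblock:
            seeks += 1                        # seek to this ToC
            toc = self.log[toc_pos][1]
            if heading in toc:
                return toc[heading], seeks + 1  # plus one seek to the note
        # Not indexed yet: keep reading sequentially after the last ToC.
        for i in range(self.unindexed_from, len(self.log)):
            if self.log[i] == ("note", heading):
                return i, seeks
        return None, seeks


class DoubleEndedNotebook:
    """Notes grow from the front; ToC entries grow from the back in one run."""

    def __init__(self):
        self.notes = []
        self.toc = []   # (heading, position), contiguous at the far end

    def write(self, heading):
        self.toc.append((heading, len(self.notes)))
        self.notes.append(heading)

    def find(self, heading):
        for h, pos in self.toc:   # one seek to the ToC, then a sequential read
            if h == heading:
                return pos, 2     # ToC seek + note seek
        return None, 1


single, double = SingleEndedNotebook(), DoubleEndedNotebook()
for n in range(9):
    single.write(f"note {n}")
    double.write(f"note {n}")
    if n % 3 == 2:
        single.write_toc()        # three ToCs in total

print(single.find("note 7"))  # (9, 4): three ToC seeks, then the note
print(double.find("note 7"))  # (7, 2)
```

In the single-ended layout the worst case grows with the number of ToCs; in the double-ended layout it stays at two seeks, at the price of a longer sequential read of the one ToC.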