From kragen at pobox.com Thu Mar 6 03:37:02 2008
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Thu Mar 6 03:37:03 2008
Subject: dynamic code generation to superoptimize calendar calculations
Message-ID: <20080303094312.GA8484@canonical.org>

In http://blog.plover.com/calendar/leapday.html Mark-Jason Dominus suggests this algorithm for calculating leap years, as a proposed replacement for the Gregorian system:

    1. Divide the year by 33. If the result [remainder?] is 0, it is not a leap year. Otherwise,
    2. If the result is divisible by 4, it is a leap year.

This Dominus Calendar has an average leap-day correction of 0.24242424... leap-days per year, against the Gregorian calendar's 0.2425, the tropical year's 0.24219 or so leap-days per year, and the vernal equinox year's 0.2422 or so leap-days per year.

Perhaps it would be slightly simpler, and equally accurate, to do the following:

    1. Divide the year by 33.
    2. Divide the remainder by 4.
    3. If the remainder is 1, it is a leap year.

Other values that would work in place of 1 are 2 and 3, but not 0, because nine of the 33 possible remainders are divisible by 4 but only eight leave each of the other remainders.

I suspect there is some simpler algorithm (in the sense of not requiring division by a large number such as 33) that is also more accurate.

You could write the second one as

    \ y . ((y % 33) % 4) == 1

and the first one as

    \ y . (\ r . (r != 0) && ((r % 4) == 0)) (y % 33)

or, if you're into flat representations,

    r = y % 33
    r2 = r % 4
    result = r2 == 1

or

    r = y % 33
    r2 = r != 0
    r3 = r % 4
    r4 = r3 == 0
    result = r2 & r4

If our repertoire of primitives is %, ==, !=, <, &, and |, and our repertoire of operands is limited to integers in [0, 33], the year itself, and previously produced results, then the formula in each step has at most 39 * 6 * 39 = 9126 possibilities. If our number of steps is limited to 5, then there are only 9126^5 programs to search to guarantee that we find the above two, which is unfortunately 63 299 797 718 449 889 376 programs.
That's on the edge of being practical to search by brute force; I think it would be less than a thousand years on a thousand-CPU cluster. The four-step programs should be practical to exhaustively search --- they should take a few days at most on such a cluster.

A language including + but only integers up to 17 would also include a form of my short program (for example, by building 33 as 16 + 17), and would have at most (18+4) * 7 * (18+4) = 3388 possible formulas for a step, which gives less than 3388^4 programs to search, which is only 131 756 972 359 936. More exactly, there are 2527 possible first formulas, 2800 possible second ones, and then 3087 and 3388 for the third and fourth. The actual product is 74 001 973 953 600, a little over half of that bound.

Type-compatibility (the result has to be boolean; & and | can only apply to two booleans; other operators can only apply to two numbers) restricts the number of formulas further, but that number is difficult to calculate exactly.

One difficulty is that measuring the accuracy of such a leap-day calculation program presumably requires running it a number of times on different year numbers, which adds an order of magnitude or two to the search time. If you have to run each one on average 30 times to find out that it's unacceptable, you could end up running 2 x 10^15 steps or so, which is maybe 10^16 instructions. A modern quad-core CPU can probably run around 5 x 10^9 instructions per second, so that leaves 2 x 10^6 seconds (about 23 days) of testing if each primitive instruction above is a CPU instruction.

A better approach may be to run the instructions SIMD-fashion, APL-like, on a bunch of years at once. This should avoid the need for dynamic code generation to get reasonable performance, although you probably still want to have the inner loops written in C or assembly. (I don't think you can use MMX or SSE for the division, but you can probably use it for comparison and population count.)

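The leap-rule claims and the four-step formula counts in this message can be checked with a short Python sketch. The function names are mine, chosen for illustration; nothing here comes from Dominus's post, and the counting ignores type-compatibility, just as the upper bounds in the message do.

```python
from math import prod


def dominus_leap(year: int) -> bool:
    """Dominus's rule: remainder mod 33 is nonzero and divisible by 4."""
    r = year % 33
    return r != 0 and r % 4 == 0


def short_leap(year: int, k: int = 1) -> bool:
    """The shorter rule: (year mod 33) mod 4 equals k."""
    return (year % 33) % 4 == k


def leap_days_per_cycle(rule) -> int:
    """Count leap years in one 33-year cycle; both rules repeat every 33 years."""
    return sum(1 for year in range(33) if rule(year))


def formulas_at_step(step: int, n_ints: int = 18, n_ops: int = 7) -> int:
    """Upper bound on the formulas available at a 1-based step.

    Each operand is one of n_ints integer constants, the year itself,
    or one of the (step - 1) earlier results; types are ignored.
    """
    operands = n_ints + 1 + (step - 1)
    return operands * n_ops * operands


if __name__ == "__main__":
    for name, rule in [("dominus", dominus_leap), ("short", short_leap)]:
        n = leap_days_per_cycle(rule)
        print(f"{name}: {n} leap years per 33, {n / 33:.8f} per year")
    for k in range(4):
        print(f"k={k}: {leap_days_per_cycle(lambda y: short_leap(y, k))} per 33")
    per_step = [formulas_at_step(s) for s in range(1, 5)]
    print(per_step, prod(per_step), formulas_at_step(4) ** 4)
```

Both rules give eight leap years per 33-year cycle, which is where 8/33 = 0.2424... comes from; choosing k = 0 in the shorter rule gives nine, which is why 0 does not work.
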
From kragen at pobox.com Sat Mar 29 17:29:30 2008
From: kragen at pobox.com (Kragen Javier Sitaker)
Date: Sat Mar 29 17:29:32 2008
Subject: orthographic reform of English
Message-ID: <20080329212930.GA23684@canonical.org>

At some time in their lives, all eccentrics who spend a lot of time reading must take on the doomed project of the orthographic reform of their language. Occasionally this project is not doomed; for example, if their scheme is backed by a king or revolutionary government, it may have some chance of success. There is a history of some of these successful attempts in http://en.wikipedia.org/wiki/Spelling_reform and a catalogue of fourteen unsuccessful attempts in English at http://en.wikipedia.org/wiki/English_reform.

So I am offering these suggestions for the orthographic reform of English without any real hope that they have any chance of widespread adoption, except perhaps through automated translation software. Briefly, I advocate phonetic spelling, syllable blocks, boldface for sentence stress, and syntactic layout.

1. Phonetic spelling. There's an existing, widely-understood phonetic alphabet, used in almost all the dictionaries of the world except for English ones; it's called the International Phonetic Alphabet, or IPA. Continuing to write English in the impoverished Latin alphabet, without even using accents as most other languages do, wastes the time of countless generations of youngsters, who could be spending their elementary-school days on algebra, music, literature, art, or vocabulary, rather than spelling. So we should write English with the IPA.

Of course, we would have to pick a standard pronunciation to use for the phonetic spelling. I propose using the dialect of English with the largest number of speakers: Indian English, with 350 million users.
It may have the disadvantage that its phonology is somewhat less complex than that of most American, English, and Australian dialects, which may make it difficult to infer the English (etc.) pronunciations for words from their spelling. But this should be much less of a problem than at present.

George Bernard Shaw famously willed much of his estate to a failed attempt to promulgate a phonetic spelling system for English. See http://en.wikipedia.org/wiki/Shavian for details. Other famous would-be English-spelling reformers include Benjamin Franklin, Melvil Dewey, Theodore Roosevelt, Mark Twain, and Noah Webster.

2. Syllable blocks. Korea's Hangul is the only script to successfully combine the easy skimmability of Chinese logograms with the easy learning of phonetic writing systems. So the letters used in writing English should be similarly arranged into syllable blocks.

I have the impression that Korean has very little inflection and consequently fewer inflection-related vowel changes, so this may not work as well for English as for Korean, but most words in English do not have any inflection-related vowel changes either. For example, I think the previous sentence contains none, and this sentence contains only "think".

Note that, according to Wikipedia, although hangul was created in the 1400s and promoted by the king, it didn't displace the Chinese-character system until the 20th century; from http://en.wikipedia.org/wiki/Hangul#Other_names:

    Until the early twentieth century, hangul was denigrated as vulgar by the literate elite who preferred the traditional hanja writing system[citation needed]. They gave it such names as:

    * Eonmun ("vernacular script").
    * Amkeul ("women's script").
    * Ahaekkeul or ahaegeul ("children's script").
    * Achimgeul ("writing you can learn within a morning").

3. Boldface for sentence stress. This *convention* is already widely used in *comic books*, in order to facilitate *comprehension*.
I *suspect*, but have no *proof*, that it could convey *much* of the emotional *content* that is so often *misread* in *email* today. Conveying emotions *clearly* with only *word choice* is a very difficult *discipline*, the discipline of *poetry*. While poetry is a *priceless part* of our cultural *heritage*, it is a *serious problem* that communicating emotions *clearly* through email requires *writing poetry*.

4. Syntactic layout. Rather than being divided into paragraph blocks, text should be divided into lines according to phrasal divisions, and indented to show the hierarchical structure of the phrases. This is essentially universal practice for writing computer programs, with the partial exception of assembly language, and has been for decades, for the excellent reason that it makes the programs dramatically easier to understand.

Buckminster Fuller called it "ventilated prose", and used it for the same reason, but the unfortunate effect of his writing in this format was that his work was often dismissed as "poetry":

    Though the preparation for that mid-nineteen-thirties presentation had been developed under the close observation of the corporation's Director of Research, my final written presentation of it was declared by the Director to be incomprehensible. Disgruntled, I re-read it carefully and returned to the Director saying, "Please listen to this," and proceeded to read in spontaneously metered "doses" from my manuscript. As I read I also watched for expressions of comprehension on the Director's face. The Director pondered each verbal dose, and when his face signalled "that is clear" I would intuitively measure out the next portion. Finally, the Director said, "Why don't you write it that way?" I said, "I am reading directly and without skipping from my original text"; so the Director said, "It just doesn't read that way." The explanation was that the intuitive doses did not correspond to conventional syntax.
    When the re-written report was submitted, the Director said, "This is lucid, but it is poetry, and I cannot possibly hand it to the President of the Corporation for submission to the Board of Directors." I insisted that it was obviously not poetry, since both he and I knew how I had chopped up a conventional prose report. The Director said, "I am having two poets for dinner tonight and I will take this to them and see what they say." He returned the next day and said, "It's too bad --- it's poetry."

(That's according to http://webhome.cs.uvic.ca/~vanemden/zzVentProse.html which has no visible authorship information, but it is on Maarten van Emden's home page, and it is supposedly a quote "from the preface of No More Second-Hand God" by Buckminster Fuller, Southern Illinois University Press, 1963.)

Here's an example, supposedly from "Intuition", via http://listserv.acsu.buffalo.edu/cgi-bin/wa?A2=ind9411&L=geodesic&T=0&P=5919:

    And wherever they came from,
    The thoughts arranged in this book
    Are discoveries
    Of its author
    Since he first came in 1913
    To think
    That nature did not have
    Separate departments of
    Mathematics, physics,
    Chemistry, biology,
    History, and languages,
    Which would require
    Department head meetings
    To decide what to do
    Whenever a boy threw
    A stone in the water,
    With the complex of consequences
    Crossing all departmental lines.
    Ergo, I came to think that nature
    Has only one department --
    And I set out to discover its
    Obviously
    Omnirational
    Comprehensively co-ordinate system,
    And thankfully found it.

Fuller's "ventilated prose" fails to take advantage of indentation. More recently, a group of researchers have written software to parse and automatically reformat text in this format, under the name "Visual-Syntactic Text Formatting" or "Live Ink", and conducted numerous experiments to measure its effect on readability. They found that it improved readability substantially.
For more details, see http://www.readingonline.org/articles/r_walker/ "Visual-Syntactic Text Formatting", by Stan Walker, P. Schloss, C. R. Fletcher, C. A. Vogel, & R. C. Walker, 2005-05, via Reading Online 8(6), ISSN 1096-1232; their software online at http://phil.red-castle.com/cgi-bin/HtmlClipRead80.exe rendered Fuller's text above as follows:

    And wherever they came from, The thoughts arranged in this book are discoveries of its author since he first came in 1913 to think that nature did not have separate departments of mathematics, physics, chemistry, biology, history, and languages, Which would require department head meetings to decide what to do whenever a boy threw a stone in the water, With the complex of consequences crossing all departmental lines.

The parsing contains some errors; this would be more accurate:

    And wherever they came from, The thoughts arranged in this book are discoveries of its author since he first came in 1913 to think that nature did not have separate departments of mathematics, physics, chemistry, biology, history, and languages, which would require department head meetings to decide what to do whenever a boy threw a stone in the water, with the complex of consequences crossing all departmental lines.

There is considerable room for debate about the best layout for English text; even for simpler languages like OCaml that are traditionally written indented in this fashion, there is often some ambiguity about the best way to format code. The basic principle, though, is that the hierarchical structure of the sentences should be reflected in a layout with the smaller parts of the sentence indented further to the right.

These changes to English orthography would make English much easier to learn, read, write, and even speak. But there is no chance that they would ever be adopted, even if people came to believe that I was some kind of super-genius; the obstacles to orthography changes are simply too great.