offline web reading
Kragen Sitaker
kragen at pobox.com
Thu Nov 23 03:37:01 EST 2006
Mostly a fairly random collection of notes.
I'm using WWWOFFLE 2.8d now for offline web browsing. It's a proxy
server that saves all the pages you ask for, and in offline mode,
returns you an ugly page that tells you it remembered your request.
Then, when you go online and give it the "fetch" command, it downloads
all the pages you previously asked for, plus all their inline images,
stylesheets, Flash files, and so on.
The main differences between WWWOFFLE and a normal caching proxy server:
- it caches stuff longer than the HTTP spec says it should, and in
offline mode, it serves it up when you ask for it (in lieu of an
error page) even if the HTTP spec says it's stale;
- it remembers requests it couldn't fulfill in order to fulfill them
later.
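Those two differences can be sketched as a toy model. This is only an illustration under stated assumptions, not WWWOFFLE's actual code: the `OfflineCache` class, its `fetch` callable, and the placeholder text are all made up for the example, and the online path is simplified to always refetch.

```python
import time


class OfflineCache:
    """Toy model of the two behaviours described above (hypothetical names)."""

    def __init__(self, fetch):
        self.fetch = fetch      # assumed callable: url -> body
        self.online = False
        self.store = {}         # url -> (body, time stored)
        self.pending = []       # URLs asked for while offline, in order

    def get(self, url):
        if self.online:
            # Simplification: always refetch when online.
            body = self.fetch(url)
            self.store[url] = (body, time.time())
            return body
        if url in self.store:
            # Offline: serve the cached copy no matter how old it is.
            return self.store[url][0]
        # Offline and not cached: remember the request for later.
        if url not in self.pending:
            self.pending.append(url)
        return "Request for %s recorded; it will be fetched when online." % url

    def fetch_pending(self):
        """Rough equivalent of the "fetch" command: drain the pending list."""
        fetched = []
        while self.pending:
            url = self.pending.pop(0)
            self.store[url] = (self.fetch(url), time.time())
            fetched.append(url)
        return fetched
```

A normal caching proxy would return an error in the offline branches; here the stale copy or the "recorded" placeholder comes back instead.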
It also has a fairly kickass recursive-get mode.
It actually allows me to use e.g. Google Maps offline! Hooray for
REST!
Some features I wish it had/bugs that annoy me:
- being responsive. Actually, I can't tell if the long spinners when
Firefox gets pages from it are Firefox's fault or WWWOFFLE's, but I
suspect the latter;
- noticing DNS server changes, which is apparently impossible to do
with the standard resolver library in a portable fashion, other than
by restarting the server;
- remembering which pages I'd actually asked for, and which ones I was
"done with", which clearly requires some better UI;
- implicitly recursively getting later pages of multi-page articles
(e.g. on Wired and OSNews; clearly this has to have per-site regexes
and crap);
- storing annotations on the pages so I could search by annotations
(it has support for searching your offline cache of part of the
web);
- a standard blocklist of ad providers, since ads made up a
substantial fraction of the hazardous, expensive, battery-draining
time that I was online;
- better integration into the browser UI --- I'd like to have buttons
for "recursively fetch from this page" (which would take me to an
options screen, rather than fetch immediately), "block this URL"
(likewise, since I'd need to be able to wildcard the URL), "pages
that linked to this page", and "already-stored pages linked from
this page" --- which should also show up in link colors.
- while you can see the list of pending fetches
(http://localhost:8080/index/outgoing/?sort=alpha;all), if you see
something bad in the list (hundreds of requests for things like
http://ad.doubleclick.net/adj/ttm.osnews/ros;tile=2;sz=120x600;ord=08028075),
it's too hard to remove it and blacklist it. The remove button is
hidden by default and takes you to a different page when you use it.
Blacklisting is an available option, but it's also hidden by default
and fairly painful to use, and it doesn't actually remove the URL
from the list of pending fetches. Click "Config", change Path to
"Any Path" (two clicks), change Arguments to "Any or none" (two
clicks), click "Change URL-SPECIFICATION", page down five times to
dont-request, click it, click the "Yes" radio button, click "make
change", close window: 15 UI actions to add each wildcard. And
that's starting from the list of pending fetches rather than from
the page with the offending content. I suppose the list of pending
fetches does tell you who the worst offenders are, but it doesn't
show where they're linked from or what other similar pages there
were (although that is available through a few more clicks), so
it's hard to distinguish ads from Google Maps images
(e.g. kh1.google.com).
- a less ugly UI
- would it be so hard to use <label> elements in the UI?
- better handling of temporary failures, such as timeouts. "Better
handling" means "retry".
- also it would be nice if I could upload my configuration and list of
desired URLs to a server somewhere else (like my colo) which would
do the fetches for me while I was offline, and then download a big
blob of compressed web pages later. Ideally I could put the
configuration and URLs in an encrypted file that I could take to an
internet cafe and upload, and download the encrypted blob of
compressed web pages in the same way.
- when I used Squid for offline web browsing, I added meta http-equiv
refreshes (typically refreshing about once every half hour) to all
of its error pages, so that any tab displaying an error page would
eventually display the web page. It would be nice if WWWOFFLE's
"remembered your request" pages did the same.
- URL blocking that actually worked would be nice too. I added a
bunch of domains to the dont-fetch list, but WWWOFFLE kept fetching
stuff from them anyway.
- hey, it would be nice if I could see why a particular page wasn't
fetched after the first time I try to GET it. E.g. if it's an
image, the first GET might be as an inline image --- I've seen this
happen sometimes.
- there's an option to list the "pages" that were requested the last
time I was offline. As I said before, it would be nice to have a
better display that distinguished pages I actually loaded
interactively, by clicking on a link, from inline resources like
<iframe>, <script>, <img>, <embed>, and stylesheet <link> elements
(which might be difficult without some more browser integration).
It would also be nice to be able to see the titles of the pages.
As it is, it's difficult to find the dozens of web pages I wanted
to read, in among the thousands of Google Maps images.
- while wwwoffle does nicely store things on disk in separate files,
the content of the HTTP response isn't stored in a separate file ---
so you can't use gthumb, gv, file, gzip, and so on, on the files
from wwwoffle's cache dirs --- you have to access them through
wwwoffle, which means through HTTP. That's a little inconvenient.
- At first you might wonder why "sort by file type" doesn't display
the content-type it's sorting by. Then you realize that it's
actually sorting by the file extension, and is therefore worse than
useless:
e.g. http://www.folklore.org/StoryView.py?project=Macintosh&story=90_Hours_A_Week_And_Loving_It.txt
(which is HTML!) ends up next to
http://gnosis.cx/download/gnosis/util/convert/curses_txt2html.py
- It's fairly opaque about how it chooses to purge things from cache.
There's a "purge" section in the config file, but from my reading of
it, it shouldn't delete anything until I haven't read it for four
weeks. But in fact it's already deleted a bunch of stuff.
- And it would be nice if it remembered more than one version of each
page.
- strace says it forks a new thread (using clone) for every child
request. Perhaps this explains why it is so slow.
- And sometimes it tries to send me an error page with
"Transfer-Encoding: chunked,chunked" and two layers of chunking,
which Firefox doesn't like --- the result is that the error page
displays with some three-digit hexadecimal numbers at the top. (I
think this should have been fixed in WWWOFFLE 2.8b.)
- Every once in a while, especially under heavy load, it serves up the
wrong representation --- I got PNGs and GIF89as for a bunch of my
HTML pages today.
- You can push it on or offline with an HTTP GET.
- I caught it doing continuous DNS lookups on www.meetomatic.com,
which had an A record, for many minutes --- several requests per
second for perhaps half an hour. This wouldn't be so bad except
that it did lots of AAAA requests, which failed and thus weren't
cached.
- Its connect timeouts seem to be painfully short, on the order of
tens of seconds.
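The blocklist complaints above come down to two things: matching hosts against wildcards, and applying a new wildcard to the requests that are already pending. Here is a minimal sketch of what I mean. It assumes shell-style globs on hostnames, which is not WWWOFFLE's URL-SPECIFICATION syntax, and `is_blocked` and `prune_pending` are hypothetical names.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def is_blocked(url, host_patterns):
    """True if the URL's host matches any shell-style pattern (assumed syntax)."""
    host = urlsplit(url).hostname or ""
    return any(fnmatchcase(host, pattern) for pattern in host_patterns)


def prune_pending(pending, host_patterns):
    """Split the pending list into (kept, removed) under the blocklist,
    so that adding a wildcard also clears matching pending fetches."""
    kept = [u for u in pending if not is_blocked(u, host_patterns)]
    removed = [u for u in pending if is_blocked(u, host_patterns)]
    return kept, removed
```

With the pattern `*.doubleclick.net`, the ad.doubleclick.net request from the list above would be removed, while a request to kh1.google.com would stay.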
As far as I can tell from reading the changelog (NEWS file), none of
the things listed above have been fixed in more recent WWWOFFLEs.
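To make the double-chunking item in the list concrete, here is a small sketch of HTTP/1.1 chunked transfer coding. It is a simplified encoder and decoder written for illustration (no trailers, and the 256-byte chunk size is an arbitrary choice), not code from WWWOFFLE or Firefox. If a body is chunk-encoded twice and a client undoes only one layer, the leftover inner layer begins with a hexadecimal chunk size, which would explain the stray hex digits at the top of the page.

```python
def chunk_encode(body, size=256):
    """Apply chunked transfer coding to a bytes body (no trailers)."""
    out = b""
    for i in range(0, len(body), size):
        piece = body[i:i + size]
        out += b"%x\r\n" % len(piece) + piece + b"\r\n"
    return out + b"0\r\n\r\n"   # zero-length chunk ends the body


def chunk_decode(data):
    """Undo exactly one layer of chunked transfer coding."""
    body = b""
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # The chunk size is hex; ignore any ";extension" after it.
        n = int(data[pos:eol].split(b";")[0], 16)
        if n == 0:
            return body
        start = eol + 2
        body += data[start:start + n]
        pos = start + n + 2     # skip the chunk data and its trailing CRLF


page = b"<html>" + b"x" * 300 + b"</html>"
twice = chunk_encode(chunk_encode(page))
once = chunk_decode(twice)      # what a client sees after one decode
```

Here `once` still starts with `100` followed by CRLF, the size of the first inner chunk, and only a second `chunk_decode` recovers `page`.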
Unfortunately I think a good system for this kind of thing implicitly
encodes some workflow. At a minimum, each URL is in one of these
states:
- nobody cares
- requested
- currently being fetched
- waiting to be read
- discarded
- archived
The idea is that from "waiting to be read", a page can go into
"discarded" or "archived". I also think you need some kind of
categorization system for this --- anyway, I do --- such that some of
these states can pertain to a particular category. Normally pages
should inherit their categories from the page they were linked from,
and you should be able to see and edit the categories for a page in
the browser when you have it open.
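The workflow above could be sketched as a small state machine. Only the transitions from "waiting to be read" to "discarded" or "archived" come from the description; the other transitions in `ALLOWED` are my guesses, the `Page` class is hypothetical, and the sketch simplifies things to one state per page rather than one state per category.

```python
from enum import Enum


class State(Enum):
    NOBODY_CARES = "nobody cares"
    REQUESTED = "requested"
    FETCHING = "currently being fetched"
    WAITING = "waiting to be read"
    DISCARDED = "discarded"
    ARCHIVED = "archived"


# Only WAITING -> DISCARDED/ARCHIVED is stated in the text; the rest is assumed.
ALLOWED = {
    State.NOBODY_CARES: {State.REQUESTED},
    State.REQUESTED: {State.FETCHING},
    State.FETCHING: {State.WAITING, State.REQUESTED},  # REQUESTED = retry (assumed)
    State.WAITING: {State.DISCARDED, State.ARCHIVED},
    State.DISCARDED: set(),
    State.ARCHIVED: set(),
}


class Page:
    def __init__(self, url, linked_from=None):
        self.url = url
        self.state = State.NOBODY_CARES
        # Inherit categories from the linking page; they stay editable.
        self.categories = set(linked_from.categories) if linked_from else set()

    def move_to(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError("%s -> %s not allowed"
                             % (self.state.value, new_state.value))
        self.state = new_state
```

A page created with `linked_from=parent` starts with a copy of the parent's categories, so editing the child's set afterwards leaves the parent untouched.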