[Aptitude-devel] Caching network data

Sat May 2 02:32:26 UTC 2009

On Fri, May 01, 2009 at 02:04:01PM +0200, Stefano Zacchiroli <zack at debian.org> was heard to say:
> On Thu, Apr 30, 2009 at 08:41:51AM -0700, Daniel Burrows wrote:
> > I'd much rather have something that aptitude can link against at
> > run-time.
> 
> Ah, I didn't get that from your first post: a proxy-enabled network
> library, is that it? That would be very reasonable, even though I
> don't know of any such library off the top of my head.

  I'm not really interested in the proxy part, although I'd use it if
it were there.  I'm more interested in basically getting two functions
that let me store something I retrieved and check for it later:

  cachePut : (URL * File) -> ()
  cacheGet : URL -> File Option

  (types written out for you ML-heads ;-) )

  The proxy functionality would only be interesting in that I could
check the last-seen date and decide whether to use the cached value
based on that, and it looks like apt's built-in HTTP client is capable
of doing that.  So the logic would be something like:

# sqliteTransaction, sqliteDo, sqliteDbSize, getFileSize and
# getCacheLimit are placeholder helpers, not a real API.
def cachePut(url, file):
    # If the file is too large for the cache, don't even try.
    if getFileSize(file) > getCacheLimit():
        return

    with sqliteTransaction():
        # If the file will make the cache too large, drop the
        # least-recently-used files until it fits.
        while sqliteDbSize() + getFileSize(file) > getCacheLimit():
            sqliteDo("""delete from Cache
                        where LastUsedDate =
                              (select min(LastUsedDate) from Cache)""")

        # Stuff the file into the cache with its timestamp set to the
        # present time.
        sqliteDo("""insert into Cache (URL, File, LastUsedDate)
                    values (?, ?, datetime('now'))""",
                 url, file)

def cacheGet(url):
    with sqliteTransaction():
        for itemId, file in sqliteDo(
                "select Id, File from Cache where URL = ?", url):
            # We found an entry in the cache; bump its timestamp and
            # return it.
            sqliteDo("""update Cache set LastUsedDate = datetime('now')
                        where Id = ?""", itemId)
            return file

        # Nothing is cached for this URL (the "File Option" is empty).
        return None

  (code written to illustrate the logic; there are some obvious
inefficiencies in it, and a few practical complications, like stuffing
files into and out of SQLite, dealing with SQLite's space overhead when
deciding what to drop, etc., are ignored)
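
  A minimal runnable sketch of the same logic, using Python's standard
sqlite3 module, might look like the following.  The table layout, the
HttpCache class name, the injectable clock and the rule that a new put
replaces any older copy of the same URL are all assumptions made for
illustration, not anything aptitude does.  Size accounting counts
payload bytes only, so SQLite's own space overhead is still ignored.

```python
import sqlite3
import time
from typing import Callable, Optional

# Hypothetical schema; the message above does not specify one.
SCHEMA = """
create table if not exists Cache (
    Id           integer primary key,
    URL          text unique not null,
    File         blob not null,
    LastUsedDate real not null
)
"""


class HttpCache:
    """Size-limited cache that evicts the least-recently-used entry."""

    def __init__(self, path: str = ":memory:",
                 limit: int = 10 * 1024 * 1024,
                 clock: Callable[[], float] = time.time):
        self.db = sqlite3.connect(path)
        self.limit = limit      # maximum total payload bytes
        self.clock = clock      # injectable so tests can control time
        with self.db:
            self.db.execute(SCHEMA)

    def _stored_bytes(self) -> int:
        # Sum of payload sizes only; page and index overhead is ignored.
        row = self.db.execute(
            "select coalesce(sum(length(File)), 0) from Cache").fetchone()
        return row[0]

    def put(self, url: str, data: bytes) -> None:
        # If the payload is too large for the cache, don't even try.
        if len(data) > self.limit:
            return
        with self.db:  # one transaction: commit on success, else roll back
            # Assumption: a new put replaces any older copy of the URL.
            self.db.execute("delete from Cache where URL = ?", (url,))
            # Evict least-recently-used entries until the payload fits.
            while self._stored_bytes() + len(data) > self.limit:
                self.db.execute(
                    "delete from Cache where Id = "
                    "(select Id from Cache "
                    " order by LastUsedDate, Id limit 1)")
            self.db.execute(
                "insert into Cache (URL, File, LastUsedDate) "
                "values (?, ?, ?)", (url, data, self.clock()))

    def get(self, url: str) -> Optional[bytes]:
        with self.db:
            row = self.db.execute(
                "select Id, File from Cache where URL = ?",
                (url,)).fetchone()
            if row is None:
                return None
            # Found an entry: bump its timestamp and return the payload.
            self.db.execute(
                "update Cache set LastUsedDate = ? where Id = ?",
                (self.clock(), row[0]))
            return row[1]
```

  Ordering by (LastUsedDate, Id) rather than comparing timestamps alone
means that two entries with the same timestamp are evicted one at a
time instead of together.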

  "Bulk" gets and/or puts would make things a bit faster, but I don't
know if they'd be necessary, given the low load I expect to put on the
cache.
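
  For what a "bulk" get could look like, here is a hedged sketch that
fetches several URLs with one query and bumps their timestamps in one
batch.  The function name cache_get_many and the Cache table columns
(Id, URL, File, LastUsedDate) are assumptions carried over from the
pseudocode above, and the caller is assumed to have created the table
already.

```python
import sqlite3
from typing import Dict, Iterable


def cache_get_many(db: sqlite3.Connection,
                   urls: Iterable[str]) -> Dict[str, bytes]:
    """Return {url: payload} for every requested URL that is cached."""
    urls = list(urls)
    if not urls:
        return {}
    # One "?" placeholder per URL.  SQLite caps the number of bound
    # parameters per statement, so very long lists would need chunking.
    marks = ",".join("?" * len(urls))
    with db:  # single transaction for the lookup and the timestamp bump
        rows = db.execute(
            f"select Id, URL, File from Cache where URL in ({marks})",
            urls).fetchall()
        db.executemany(
            "update Cache set LastUsedDate = datetime('now') where Id = ?",
            [(row[0],) for row in rows])
    return {url: data for _, url, data in rows}
```

  URLs that are not in the cache are simply absent from the returned
dictionary, so the caller can download those and put them afterwards.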

> >   The thing that appeals to me about sqlite is that it can handle a
> > lot of the most tricky bits of managing a cache (maintaining
> > consistency, concurrent read/write access, indexed lookups, throwing
> > out old entries, etc) in a lightweight way, with no need for an
> > external process.
> 
> True. The drawbacks are that if you want it to be transparent for the
> user you have to be a bit careful. A cache can grow and I (as a user)
> like to know that I can throw it away when I want without breaking the
> depending application. That works for most of the stuff living under
> /var/cache/. So, if you want to go that way, I suggest using a
> .sqlite file there, ensuring that the user can delete it whenever she
> wants.

  Well, aptitude can run as a user, so what I was thinking about was

~/.aptitude/http_cache

  and throwing out old data when the cache exceeds a certain size (see
above).  I'm not sure what the default limit will be; I'll have to see how
much data it usually stores, but I guess 10MB would be very generous,
especially if I (say) run things through bzip2 on the way in.  /var
could be used, but then I'd have to deal with sticky directories and
stuff.
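
  As a rough illustration of the "bzip2 on the way in" idea, the sketch
below compresses a payload before it would be stored and decompresses
it on the way out, using Python's standard bz2 module.  The pack and
unpack names and the sample changelog text are made up for the example;
the cache path is the one proposed above, and nothing is written to it
here.

```python
import bz2
import os

# Per-user cache location proposed in the message; only computed here.
CACHE_PATH = os.path.expanduser("~/.aptitude/http_cache")


def pack(data: bytes) -> bytes:
    """Compress a payload before it goes into the cache."""
    return bz2.compress(data)


def unpack(blob: bytes) -> bytes:
    """Recover the original payload from a cached blob."""
    return bz2.decompress(blob)


if __name__ == "__main__":
    # Made-up, highly repetitive text standing in for a changelog.
    sample = b"  * Fix a bug in the cache.  (Closes: #000000)\n" * 200
    stored = pack(sample)
    print(len(sample), "bytes raw,", len(stored), "bytes stored")
```

  If the size limit is enforced on the compressed bytes, the same limit
holds correspondingly more changelogs; how much more depends on the
data, which is why the real default would need measuring.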

  It's crude, but should be effective for my purposes, which right now
are mainly to not download the same 200 changelogs every time aptitude
starts, and eventually to not download the same 200 screenshots every
time the user browses a package list, thus ensuring that the ftpmasters
don't hunt me down and kill me. :-)

  Anyway, I would still rather have a prebuilt library; I'm just trying
to explain why I think it's not totally nuts to roll my own.

  Daniel