[Aptitude-devel] r3015 - in branches/aptitude-0.3/aptitude: . src/generic/problemresolver

Daniel Burrows dburrows@costa.debian.org
Wed, 20 Apr 2005 17:19:38 +0000

Author: dburrows
Date: Wed Apr 20 17:19:35 2005
New Revision: 3015

Add a writeup of the theory behind the problem resolver.

Modified: branches/aptitude-0.3/aptitude/ChangeLog
--- branches/aptitude-0.3/aptitude/ChangeLog	(original)
+++ branches/aptitude-0.3/aptitude/ChangeLog	Wed Apr 20 17:19:35 2005
@@ -1,3 +1,10 @@
+2005-04-20  Daniel Burrows  <dburrows@debian.org>
+	* src/generic/problemresolver/model.tex:
+	  Add some discussion of the theoretical basis for the new problem
+	  resolver.
 2005-04-10  Daniel Burrows  <dburrows@debian.org>
 	* src/cmdline/cmdline_do_action.cc, src/generic/aptitude_resolver.cc, src/generic/aptitude_resolver.h:

Added: branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex
--- (empty file)
+++ branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex	Wed Apr 20 17:19:35 2005
@@ -0,0 +1,569 @@
+\lhead{Modelling and Resolving Software Dependencies}
+\rhead{Daniel Burrows}
+\author{Daniel Burrows \url{<dburrows@debian.org>}}
+\title{Modelling and Resolving Software Dependencies}
+\newcommand{\st}{\ |\ }
+  Many Linux distributions and other modern operating systems feature
+  the explicit declaration of (often complex) dependency relationships
+  between the pieces of software that may be installed on a system.
+  Resolving incompatibilities between different pieces of software is
+  an NP-complete problem, and existing solutions require the user to
+  manually resolve many ``simple'' dependency problems.
+  I present a simplified, abstract model of dependency relationships,
+  and a restartable technique based on best-first-search to calculate
+  resolutions.
+It is common nowadays for hundreds or thousands of software packages
+to be installed on a single computer system, and for many of these
+software packages to interact with one another.  Because some
+combinations of software packages will not function properly -- for
+instance, an application program might require a particular version of
+a graphics library -- installing software manually while avoiding
+unexpected breakage is an increasingly unpleasant chore.
+To address this problem, programs known as \emph{package systems} were
+developed.  A package system typically manages \emph{packages} that
+consist of the files of a program or library, along with metadata such
+as the name and version of the package, a brief description of what it
+contains, and (most importantly for our purposes) a list of which
+other packages it requires or is incompatible with.  The package
+installation software warns the user upon any attempt to install or
+remove software that would violate these constraints.
+Unfortunately, the early versions of these tools replaced the chore of
+manual software installation with the chore of dependency resolution:
+for instance, installing a package of the popular game \pkg{wesnoth}
+might produce an error indicating that the user should find, download,
+and install a new version of the \pkg{SDL} graphics library.  A new
+version of the \pkg{kmail} mail client might require the user to
+upgrade his or her entire operating system, generally indicating this
+fact by a slew of cascading error messages.
+As a result of this ``dependency hell'', a new generation of automated
+tools, such as \apt and \pkg{up2date}, was developed.  These
+tools maintain a database of software installed on the user's system,
+along with software available from any number of remote sites.  To
+install or remove a piece of software, the user issues a request to,
+for instance, ``install wesnoth'' or ``upgrade kmail''.  The
+installation tool will proceed to find a set of package installations
+or removals which leads to a consistent result.  Typically, it then
+presents this list of actions to the user and prompts for
+confirmation; the user can either accept the proposed solution, or
+reject it and proceed to fix the problem in a fully manual way.
+This approach has two major drawbacks:
+\item The user interface for resolving dependencies is a ``take it or
+  leave it'' proposition: there is no way for the user to ask the
+  algorithm to find another solution.  This means that if the
+  algorithm makes a poor or undesired choice (which, as I will argue
+  below, will inevitably occur sometimes) the user is forced to fall
+  back to fully manual operation.
+\item In at least some cases (particularly \apt), the algorithm used
+  in resolving dependency conflicts deals poorly -- which is a
+  euphemism for ``not at all'' -- when there are more than two
+  versions of a package to choose from\footnote{More precisely, if
+    more than one version other than the currently installed version
+    (if any) exists.}.  For instance, if versions 1, 2, and 3 of
+  package \pkg{A} are available, with 2 being the default version of
+  the package, and if package B requires version 3 of package A, when
+  the user tries to install package B, he or she will receive an error
+  message indicating that the dependency on A cannot be fulfilled.
+Another general difficulty in solving dependencies in these systems is
+that the package systems contain many features which, although they
+are arguably ``syntactic sugar'', tend to cause algorithms that
+operate on packages to become strewn with complex iteration constructs
+and unpleasant corner cases.  Although some attempts have been made to
+find general models of package dependencies (for instance, the
+internal structures of \apt can represent either Debian or Red Hat
+packages), the models with which I am familiar work by taking a
+``greatest upper bound'' of the systems that they cover, leading to an
+even more convoluted architecture.
+  I have not yet performed an extensive survey of package systems, and
+  it may be that there already exist systems that fix one, two, or all
+  of the drawbacks listed above.
+\section{Example: the Debian Package System}
+In the Debian package system, each package may have one or more
+versions, but at most one version of each package may be installed at
+any given time.  The basic relationships between packages are
+\emph{dependencies} and \emph{conflicts}.  For instance, version
+6.14.00-1 of the \pkg{tcsh} command shell depends on version
+2.3.2.ds-4 or greater of the \pkg{libc6} package, and version 5.4-1 or
+greater of the \pkg{libncurses5} package: it may not be installed
+unless an appropriate version of each of these package is installed.
+On the other hand, the same package conflicts with all versions of the
+\pkg{tcsh-kanji} and \pkg{tcsh-i18n} packages: \pkg{tcsh} may not be
+installed at the same time as either of these packages.
+A single dependency may name several packages, combined with and OR
+operator (indicated by a vertical bar).  For instance, version
+\pkg{1.4.48} of the \pkg{debconf} package depends upon
+\verb!debconf-i18n | debconf-english!; in order to install debconf,
+you must also install one of these two packages.
+Last but not least, dpkg supports what are known as ``virtual''
+packages.  Each version of each package may \emph{provide} one or more
+package names; each named package will become a virtual package.
+Virtual packages can have the same name as a normal package, in which
+case they are known as \emph{mixed} virtual, or they can exist only
+through being provided by a normal package, in which case they are
+known as \emph{pure} virtual.  An unversioned dependency on a virtual
+package will be satisfied if any provider of the name is installed,
+while an unversioned conflicts will require that all providers of the
+name are removed -- but as a special case, direct or implicit
+self-conflicts are ignored.  Versioned dependencies and conflicts do
+not, as of this writing, follow provided package names.
+For instance, the virtual package \pkg{mail-transport-agent} is used
+to identify packages containing mail transport agents.  Every such
+package both provides and conflicts with the virtual package, and
+packages that require a mail transport agent depend on
+it.\footnote{Due to a quirk in how \pkg{apt} resolves dependencies,
+  dependencies on a virtual package are required to include a real
+  package as an alternative: for instance, \pkg{bugzilla} depends on
+  \url{sendmail | mail-transport-agent}.}
+\section{An Abstract Model of Dependency Relationships}
+As can be seen in the previous section, real dependency systems are
+complex.  This tends to complicate the business of reasoning about how
+to find solutions to dependency problems, and to cause algorithms that
+manipulate dependencies to become horribly messy.  In situations like
+this, it is often a good idea to find a simpler, more mathematical
+model of the problem to be analyzed.  Of course, it is well-known that
+package dependencies can be reduced to the satisfaction of Boolean
+equations, but such a reduction is arguably \emph{too} extreme: it
+certainly results in a mathematical model, but the model it produces
+hides the structure of the original problem.  I will instead propose a
+model which is sufficient to describe any dependency problem (at least
+in Debian) and retains the structure of a package system.
+\subsection{Basic Concepts}
+In this simplified model, the only objects in the world are packages,
+versions of packages, and dependencies.  Packages will typically be
+denoted by $p_1, \dots$; versions will typically be denoted by $v_1,
+\dots$; and dependencies are of the form $v -> \{v_1', \dots\}$,
+indicating that the version $v$ requires the versions $\{v_1',
+\dots\}$.  The package associated with a version $v$ is denoted by
+To represent the state of the entire system, the following sets are
+\item $\P$ is the set of all packages.
+\item $\V$ is the set of all package versions.
+\item $\D$ is the set of all dependencies.\footnote{Not the set of all
+    \emph{potential} dependencies, but the set of all dependencies
+    asserted in the current package system.}
+\subsection{Reduction of Debian Dependencies to the Model}
+As claimed above, it is possible to reduce a Debian dependency system
+to this abstracted model.  The reduction proceeds in approximately the
+following way:
+\item $\P$ is the set of Debian packages.
+\item $\V$ is the set of versions of those packages, plus one
+  additional version for each package.  This version represents the
+  state of the package when it is not installed.  Versions
+  corresponding to versions of the Debian package are indicated by $p
+  : n$ where $n$ is the version number, while the ``uninstalled''
+  version is indicated by $p_u$.
+\item For each dependency of the version $v$ of a package on
+  $\pkg{A}_1\ |\ \dots$, accumulate a set $S$ containing all the
+  matching versions of each named package, combined with every package
+  version that provides a named package (if the dependency is
+  unversioned).
+  For instance, if $v$ declares a dependency on \verb!A (>= 3.2) | B!,
+  versions $3.1$, $3.2$, and $3.3$ of package \pkg{A} are available,
+  versions $1$, $2$, and $3$ of package \pkg{B} are available, and
+  package \pkg{C} version $3.14$ provides \pkg{B}, then
+  $S=\{\pkg{A}:3.2,\pkg{A}:3.3,\pkg{B}:1,\pkg{B}:2,B:3,\pkg{C}:3.14\}$.
+  $\D$ contains $v -> S$ for every such dependency.
+\item For each conflict declared by a version $v$ on the package $p$,
+  accumulate a set $S$ containing all the \emph{non}-matching versions
+  of $p$ and insert $v -> S$ into $\D$.  Furthermore, if the conflict
+  is not versioned, then for each package $p'$ and version $v'$ of
+  $p'$ such that $v'$ provides $p$, let $S=\{v'' \st \pkgof{v''}=p' \land
+  \text{$v''$ does not provide $p$}\}$ and insert $v -> S$ into $\D$.
+  For instance, if $v$ conflicts with \pkg{A}, of which versions 3.2
+  and 3.3 are available, and versions 2 and 3 of \pkg{B} provide
+  \pkg{A}, then $S=\{\pkg{A}:3.2,\pkg{A}:3.3,\pkg{B}:2,\pkg{B}:3\}$.
+  \begin{note}
+    In reality, extra care must be taken to screen out self-conflicts
+    in this process, but the description above is complicated enough
+    as it stands!
+  \end{note}
+  Although the above reduction is complicated to describe, its major
+  steps must be performed whenever any program is analyzing
+  dependencies: for instance, when listing all the versions that can
+  fulfill a dependency, it is necessary to iterate over all members of
+  each OR and to search their providing packages as necessary.  Thus,
+  an ``on-the-fly'' reduction in an algorithm written for the generic
+  model is conceivably almost as efficient as an algorithm that works
+  with the Debian package structure directly.
+An \emph{installation} represents a choice about which package
+versions to install; it is a function that maps each package to a
+version of that package.  An installation $I$ is \emph{consistent} if
+it satisfies each dependency in the world: that is, for all $v ->
+\{v_1',\dots\}$ in $\D$, either $I(\pkgof{v}) \neq v$ or
+$I(\pkgof{v_i'})=v_i'$ for some $i$.
+If $I$ is an installation, then $I;p \mapsto v$ is a new installation
+which installs the version $v$ of the package $p$ and leaves all other
+packages unchanged:
+  (I;p \mapsto v)(p') = \begin{cases}
+    I(p'), & p' \neq p \\
+    v, & p' = p
+  \end{cases}
+\section{The Dependency Resolution Problem}
+\subsection{Problem Definition}
+Let $I$ be an inconsistent installation.  We would like to find a
+consistent installation that is ``similar to'' $I$.  This is the
+dependency resolution problem.  In a real package system, it
+corresponds to a situation in which the user's requests have resulted
+in an invalid state (with unsatisfied dependencies or conflicts); the
+goal of a package manager in this situation is to find a ``small'' and
+``good'' set of changes that result in a valid state.
+  This problem is poorly defined: ``small'' and ``good'' are not
+  precise terms.  The goal, from a UI point of view, is to not change
+  too many packages, but to make reasonable decisions: for instance,
+  if the user has requested that some packages be installed and these
+  installations cause dependency clashes, ``solving'' the problem by
+  cancelling the installations is probably not the desired result.
+  However, while it might have obviously wrong solutions, \emph{this
+    problem has no principled correct solution}, because it is
+  possible that if several different users view a single dependency
+  problem, each prefers a different solution from the others.  In
+  other words, some of the information necessary to find the ``best''
+  solution is inside the user's head.
+  Thus, the best we can do is to define some criteria for ``goodness''
+  and allow the user to see alternatives to an undesired solution.
+\subsection{Dependency Resolution is NP-complete}
+In order to find a ``good'' solution, we must first find \emph{any}
+solution to the existing set of dependencies.  Unfortunately, as shown
+below, this is an NP-complete problem.
+  Dependency resolution is NP-complete.
+  Proof is by reduction from CNF-SAT to the problem ``does a
+  consistent installation $I$ exist?''
+  Create one package for each variable and for each clause in the SAT
+  problem.  For each variable $x$, let the versions of the
+  corresponding package be $x:0$ and $x:1$; for each clause, create
+  exactly one version.  For each SAT clause let $v_c$ be the package
+  corresponding to the clause, and insert $v_c -> S$ into $\D$, where
+  for each literal of a variable $x$ appearing in the clause, $S$
+  contains $x:0$ if $x$ is a negative literal and $x:1$ if $x$ is a
+  positive literal.  This reduction is clearly polynomial-time; I
+  claim that a solution to this set of dependencies exists if and only
+  if a solution to the corresponding SAT problem exists.
+  Suppose that there is an assignment that solves the SAT problem.
+  Define an installation $I$ as follows: if $p$ corresponds to a
+  clause, $I(p)$ is the single version of $p$; if $p$ corresponds to a
+  variable $x$, $I(p)=x:0$ if $x$ is FALSE in the SAT solution and
+  $x:1$ if $x$ is TRUE in the SAT solution.  Now, consider any
+  dependency $v -> S$.  From the construction above, $S$ corresponds
+  to a clause of the SAT instance.  At least one literal in this
+  clause must be assigned the value TRUE (otherwise the clause is not
+  satisfied); let $x$ be the corresponding variable.  If the literal
+  is positive, then (by construction) $S$ contains $x:1$ and (since
+  $x$ must be assigned the value TRUE) $I(p)=x:1$.  On the other hand,
+  if the literal is negative, then $S$ contains $x:0$ and $I(p)=x:0$.
+  Thus, $I$ is a consistent installation.
+  On the other hand, suppose that there is a consistent installation
+  $I$.  For all variables $x$, let $p$ be the corresponding package;
+  if $I(p)=x:0$, assign FALSE to $x$, and if $I(p)=x:1$, assign TRUE
+  to $x$.  Now consider any clause in the SAT problem: from the
+  construction above, $\D$ contains a dependency $v_c -> S$ where
+  $v_c$ is the single version of the package corresponding to the
+  clause.  Since $I$ must map the clause's package to $v_c$ and $I$ is
+  consistent, there must be a version $v \in S$ such that
+  $I(\pkgof{v})=v$; i.e., $I$ installs $v$.  But from the
+  construction, $v^p$ corresponds to some variable $x$, and $v$ is
+  either $x:1$, where $x$ appears as a positive literal in the clause
+  or $x:0$, where $x$ appears as a negative literal in the clause.
+  Thus, the clause is satisfied, and so the assignment described above
+  satisfies all clauses.
+  Therefore, dependency resolution is NP-complete.
+\subsection{Don't Panic}
+Although the problem at hand is NP-complete in general, there is good
+reason to think that the instances that arise in practice are
+tractable.  It is well-known that many NP-complete problems have
+``easy'' and ``hard'' instances: some instances of the problem can be
+solved quickly by relatively naive algorithms, while others are
+intractable even using the most sophisticated techniques available.
+In the particular case of package dependencies, the traditions that
+have grown up around package tools seem to encourage the creation of
+easy instances of the dependency problem; furthermore, the user's
+desired installation is typically consistent or ``almost consistent''
+(meaning that few dependencies are violated).  It is usually
+straightforward, when solving problems in an ad hoc way, to isolate a
+small part of the dependency graph in which the problem occurs; for
+instance, by informally applying a constraint such as ``don't solve
+dependencies by removing core library packages''.  Once this is done,
+the problem can be declared either solvable or unsolvable on the basis
+of a quick analysis of that region of the graph.
+In fact, when even relatively basic search techniques are applied to a
+typical dependency problem, the difficulties that arise are related
+not to a paucity of solutions, but rather to an excess of them.  That
+is, finding \emph{a} solution is easy, but finding the \emph{right}
+solution is more problematic.  Indeed, in the Debian framework there
+is always at least one solution: removing every package on the system
+will satisfy all the dependencies.  However, for obvious reasons, this
+is not a solution that we want to produce!
+\subsection{Solving Dependencies Through Best-First Search}
+I propose the use of a relatively simple algorithm -- best-first
+search -- to resolve dependencies.  To briefly review, best-first
+search works by keeping a priority queue, known as the ``open'' queue,
+of potential (or partial) solutions.  The priority queue is sorted
+according to some heuristic function that quantifies the ``goodness''
+of each node (often in terms of nearness to a full solution).  In each
+iteration of the algorithm, the ``best'' partial solution is removed
+from the queue.  If it is a full solution, the algorithm terminates;
+otherwise, each ``successor'' node is generated and placed in the
+There are two main issues to resolve:
+\item How should successors be generated?
+\item What heuristic should be used?
+To generate successors, we could simply enqueue all possible changes
+to a single package.  However, this would result in a gigantic
+branching factor (over 1500 branches at each step in the current
+Debian archive), and it would cause the algorithm to consider
+adjusting packages that are utterly irrelevant to the problem at hand,
+as well as changing a package multiple times (which can lead to
+choices being made for reasons that are obscure to the user).  A more
+focussed approach is needed.
+Similarly, we could simply use the number of currently unsatisfied
+dependencies as our heuristic, but this does not provide any guidance
+as to how dependencies should be resolved.  If \pkg{A} depends on
+\pkg{B}, \pkg{A} is installed, and \pkg{B} is not installed, it is
+usually better to install \pkg{B} than to remove \pkg{A}; however, a
+straight count of broken dependencies might prefer either outcome.
+\subsubsection{Generating Successors to a Partial Solution}
+An obvious way of generating the successors of a given solution is to
+do it on the basis of unsatisfied dependencies.  If the installation
+$I$ does not satisfy the dependency $v -> S$, we know that $v$ is
+installed but no member of $S$ is.  To resolve this dependency, we can
+either install a different version of $\pkgof{v}$ or install any
+element of $S$.  Applying this rule to each ``broken'' dependency in
+turn will produce a set of successors that each solve at least one
+dependency (although they may break others in the process).
+However, this approach still has the potential to ``run in circles''
+by installing one version of a package, encountering broken
+dependencies, and then moving to a different version (possibly after
+resolving some dependencies of the intermediate version).  The problem
+resolver of \pkg{apt}, for instance, sometimes confuses users by
+exhibiting this behavior.  To fix this, I propose a simple rule: a
+solution should never modify a package twice.  Formally, if the
+original installation was $I_0$, then for any $I$, the installation
+$I'=I;p \mapsto v$ is only a successor of $I$ if $v \neq
+I_0(\pkgof{v})$ and $I(\pkgof{v})=I_0(\pkgof{v})$.
+One might wonder whether this approach risks overlooking solutions:
+for instance, maybe it is really necessary to ``go in circles'' in
+order to find a particular solution.  As shown below, if a solution
+cannot be generated by repeatedly using this limited set of
+successors, it is only because there is a ``simpler'' version of the
+solution (one that modifies the states of fewer packages) that is
+generated.  To prove this, I first will introduce some definitions and
+  Let $I_1$, $I_2$ be installations.
+  \begin{equation}
+    \idist{I_1}{I_2}=\len{\{p \st I_1(p) \neq I_2(p)\}}
+  \end{equation}
+  Let $I_1$, $I_2$ be installations.  An installation $I'$ is a
+  \emph{hybrid} of $I_1$ and $I_2$ if for all $p$, either
+  $I'(p)=I_1(p)$ or $I'(p)=I_2(p)$.
+  If $I'$ is a successor of $I$ with respect to $I_0$, then $I
+  \nsol{I_0} I'$.  If there exist $I_1, \dots, I_n$ such that $I_1
+  \nsol{I_0} I_2 \nsol{I_0} \dots \nsol{I_0} I_n$, then $I_1
+  \nsolmany{I_0} I_n$.
+  Let $I_c$ be any consistent installation (if one exists) and $I_0$
+  be any installation.  For all hybrids $I$ of $I_0$ and $I_c$, either
+  $I$ is consistent or there exists a successor $I'$ of $I$ such that
+  $I'$ is a hybrid of $I_0$ and $I_c$, and $\idist{I'}{I_c} <
+  \idist{I}{I_c}$.
+  Consider any hybrid $I$ of $I_0$ and $I_c$ such that $I$ is not a
+  solution.  Since $I$ is inconsistent, there is a dependency $v->S$
+  such that $I(\pkgof{v})=v$ and for all $v' \in S$, $I(\pkgof{v'})
+  \neq v'$.  However, since $I_c$ is consistent, either
+  $I_c(\pkgof{v}) \neq v$ or there exists a $v' \in S$ such that
+  $I_c(\pkgof{v'})=v'$.
+  Suppose that $I_c(\pkgof{v}) \neq v$.  Since $I$ is a hybrid of
+  $I_0$ and $I_c$, $I_0(\pkgof{v})=v$.  Thus, $I \nsol{I_0} I'$, where
+  \begin{equation}
+    I'=I;\pkgof{v} \mapsto I_c(\pkgof{v})
+  \end{equation}
+  On the other hand, if $I_c(\pkgof{v'})=v'$ for some $v' \in S$, note
+  that as before, $I_0(\pkgof{v'}) \neq v'$.  Therefore, $I \nsol{I_0}
+  I'$, where
+  \begin{equation}
+    I'=I;\pkgof{v'} \mapsto v'
+  \end{equation}
+  In either case, clearly $I'$ is a hybrid of $I_0$ and $I_c$, and
+  $\idist{I'}{I_c}<\idist{I}{I_c}$, proving the lemma.
+  For any consistent installation $I_c$ and any inconsistent
+  installation $I_0$, there exists a consistent installation $I_c'$
+  such that $I_c'$ is a hybrid of $I_0$ and $I_c$, and $I_0
+  \nsolmany{I_0} I_c'$.
+  Proof is by repeated application of the previous lemma.  For each
+  inconsistent hybrid $I$ of $I_0$ and $I_c$, let $I^{+}$ be the
+  hybrid shown to exist in the previous lemma, and define a sequence
+  $I_1, \dots$ as follows:
+  \begin{equation}
+    I_k = \begin{cases}
+      I_{k-1} & \text{if $I_k$ is consistent} \\
+      I_k^{+} & \text{otherwise}
+    \end{cases}
+  \end{equation}
+  I claim that this sequence converges; i.e., that for some finite $n$
+  and all $m>n$, $I_n=I_m$.  Proof: let $D_k=\idist{I_k}{I_c}$ and
+  $n=\idist{I_0}{I_c}$.  By the previous lemma, $D_k \geq D_{k+1}$ for
+  all $k$, and $D_k=D_{k+1}$ if and only if $I_k$ is a solution.
+  Thus, if $I_k$ is not a solution, we have $D_k \leq n-k$.  But by
+  definition, $D_k \geq 0$ for all $k$, so clearly $I_{n+1}$ is a
+  solution (else we have $0 \leq D_{n+1} \leq -1$).
+  Therefore, the theorem holds with $I_c'=D_{n+1}$.