[Aptitude-devel] r3217 - in branches/aptitude-0.3/aptitude: . src/generic/problemresolver

Daniel Burrows dburrows@costa.debian.org
Sun, 01 May 2005 01:32:13 +0000


Author: dburrows
Date: Sun May  1 01:32:10 2005
New Revision: 3217

Modified:
   branches/aptitude-0.3/aptitude/ChangeLog
   branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex
Log:
Describe more of the algorithm..need to flesh this out/make it more comprehensible.

Modified: branches/aptitude-0.3/aptitude/ChangeLog
==============================================================================
--- branches/aptitude-0.3/aptitude/ChangeLog	(original)
+++ branches/aptitude-0.3/aptitude/ChangeLog	Sun May  1 01:32:10 2005
@@ -1,5 +1,9 @@
 2005-04-30  Daniel Burrows  <dburrows@debian.org>
 
+	* src/generic/problemresolver/model.tex:
+
+	  Further description of the algorithm.
+
 	* src/generic/aptcache.cc:
 
 	  Make packages installed/removed by the resolver get marked as

Modified: branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex
==============================================================================
--- branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex	(original)
+++ branches/aptitude-0.3/aptitude/src/generic/problemresolver/model.tex	Sun May  1 01:32:10 2005
@@ -86,13 +86,14 @@
 might produce an error indicating that the user should find, download,
 and install a new version of the \pkg{SDL} graphics library.  A new
 version of the \pkg{kmail} mail client might require the user to
-upgrade his or her entire operating system, generally indicating this
-fact by a slew of cascading error messages.
+upgrade his or her entire operating system, indicating this fact by a
+slew of cascading error messages.
 
 As a result of this so-called ``dependency hell'', new and more
 automated tools, such as \pkg{apt} and \pkg{up2date}, were developed.
 These tools maintain a database of software installed on the user's
 system, along with software available from any number of remote sites.
+
 To install or remove a piece of software, the user issues a request
 to, for instance, ``install wesnoth'' or ``upgrade kmail''.  The
 installation tool will proceed to find a set of package installations
@@ -302,28 +303,39 @@
 
 An \emph{installation} represents a choice about which package
 versions to install; it is a function that maps each package to a
-version of that package.  An installation $I$ is \emph{consistent} if
-it satisfies each dependency in the world: that is, for all $v ->
-\{v_1',\dots\}$ in $\D$, either $I(\pkgof{v}) \neq v$ or
-$I(\pkgof{v_i'})=v_i'$ for some $i$.
-
-If $I$ is an installation, then $I;p \mapsto v$ is a new installation
-which installs the version $v$ of the package $p$ and leaves all other
-packages unchanged:
+version of that package.
 
-\begin{equation}
-  (I;p \mapsto v)(p') = \begin{cases}
-    I(p'), & p' \neq p \\
-    v, & p' = p
-  \end{cases}
-\end{equation}
+\begin{definition}
+  An installation $I$ installs a version $v$, written $I \installs v$,
+  if $I(\pkgof{v})=v$.
+\end{definition}
+
+\begin{definition}
+  An installation $I$ is \emph{consistent} if it satisfies each
+  dependency in the world: that is, for all $v -> \{v_1',\dots\}$ in
+  $\D$, either $I \not \installs v$ or $I \installs v_i'$ for some
+  $i$.
+\end{definition}
+
+\begin{definition}
+  If $I$ is an installation, then $I;p \mapsto v$ is a new
+  installation which installs the version $v$ of the package $p$ and
+  leaves all other packages unchanged:
+
+  \begin{equation}
+    (I;p \mapsto v)(p') = \begin{cases}
+      I(p'), & p' \neq p \\
+      v, & p' = p
+    \end{cases}
+  \end{equation}
+\end{definition}
 
 \section{The Dependency Resolution Problem}
 
 \subsection{Problem Definition}
 
-Let $I$ be an inconsistent installation.  We would like to find a
-consistent installation that is ``similar to'' $I$.  This is the
+Let $I_0$ be an inconsistent installation.  We would like to find a
+consistent installation that is ``similar to'' $I_0$.  This is the
 dependency resolution problem.  In a real package system, it
 corresponds to a situation in which the user's requests have resulted
 in an invalid state (with unsatisfied dependencies or conflicts); the
@@ -345,6 +357,7 @@
   solution is inside the user's head.
 
   Thus, the best we can do is to define some criteria for ``goodness''
+  (to prioritize solutions that are more likely to interest the user)
   and allow the user to see alternatives to an undesired solution.
 \end{note}
 
@@ -487,11 +500,14 @@
 dependencies, and then moving to a different version (possibly after
 resolving some dependencies of the intermediate version).  The problem
 resolver of \pkg{apt}, for instance, sometimes confuses users by
-exhibiting this behavior.  To fix this, I propose a simple rule: a
-solution should never modify a package twice.  Formally, if the
-original installation was $I_0$, then for any $I$, the installation
-$I'=I;p \mapsto v$ is only a successor of $I$ if $v \neq
-I_0(\pkgof{v})$ and $I(\pkgof{v})=I_0(\pkgof{v})$.
+exhibiting this behavior.  To fix this, I enforce a simple rule in
+generating solutions: a solution should never modify a package twice.
+
+\begin{definition}
+  If the original installation was $I_0$, then for any $I$, the
+  installation $I'=I;p \mapsto v$ is a successor of $I$ if $v \neq
+  I_0(\pkgof{v})$ and $I(\pkgof{v})=I_0(\pkgof{v})$.
+\end{definition}
 
 One might wonder whether this approach risks overlooking solutions:
 for instance, maybe it is really necessary to ``go in circles'' in
@@ -563,6 +579,7 @@
   installation $I_0$, there exists a consistent installation $I_c'$
   such that $I_c'$ is a hybrid of $I_0$ and $I_c$, and $I_0
   \nsolmany{I_0} I_c'$.
+  \label{thm:succ-sufficient}
 \end{theorem}
 
 \begin{proof}
@@ -591,21 +608,181 @@
 
 \subsubsection{Scoring}
 
-TODO: content
+The second key ingredient of a best-first search is a scheme for
+ordering search nodes, typically by assigning a numerical value to
+each prospective solution.  In doing so, we must balance two
+priorities: the desire to find a solution quickly, and the desire to
+find a \emph{good} solution.
+
+The most obvious way to guide the search towards a solution is to
+``reward'' avenues of inquiry that decrease the number of unsatisfied
+dependencies.  This is not, of course, guaranteed to produce a
+solution quickly; however, in practice, it seems to be a sufficient
+hint for the algorithm to reach a goal node in under 100 steps.
+Finding ``good'' solutions is somewhat more difficult, not least
+because of the fact that ``good'' is an ill-defined property.  The
+experimental implementation of this algorithm in \pkg{aptitude} uses
+the following criteria to assign scores to nodes:
 
 \begin{itemize}
-\item Score/penalty per broken dependency
-\item Score/penalty per ``step''
-\item Score/penalty for each package version
-\item Arbitrary cutoff to avoid exponential blowup
+\item Most importantly, each version of each package is assigned a
+  separate score.  The score of a potential solution is based in part
+  on the \emph{difference} between the total score of the package
+  versions it installs and the score of the versions installed by
+  $I_0$.\footnote{In principle, a solution's score is simply the total
+    score of the package versions it installs; $I_0$ is taken to be a
+    baseline for this component of the score because it does not
+    affect the ordering of solutions, and it is much more convenient
+    to calculate.}
+
+  These weights define the priorities of the search.  In the current
+  aptitude implementation, as of this writing, the scores of the
+  versions the package $p$ are assigned as follows\footnote{Very minor
+    tweaking is also performed to adjust the score of packages
+    according to their Debian-assigned ``priority'', but this affects
+    the score by no more than 5 points.}:
+
+  \begin{itemize}
+  \item If $I_0(p)$ was explicitly selected by the user, it is
+    assigned $40$ points.
+  \item If $I_0(p)$ was automatically selected, it is assigned $0$
+    points.  This has the effect of biasing the algorithm towards
+    modifying packages whose current states were not explicitly
+    requested by the user.
+  \item If $I_0$ does not remove $p$, its removal is assigned $-300$
+    points.  If $p$ is an ``essential'' package, such as the C library
+    or the \pkg{dpkg} package manager, its removal is assigned an
+    additional $-100000$ points.
+  \item If $I_0(p)$ is not the current version of the
+    package\footnote{The ``current version'' refers to the version
+      installed on the system; $I_0$ is simply a representation of the
+      package installation and removal commands issued by the user.},
+    then $p$'s current version -- which corresponds to cancelling a
+    user request or automatic decision -- is assigned $-50$ points.
+  \item Installing $p$ (i.e., the ``default'' version of $p$ if $p$ is
+    not currently installed) is assigned $-20$ points.
+  \item Upgrading $p$ (i.e., the ``default'' version of $p$ if $p$ is
+    currently installed at a different version\footnote{This is a
+      slight misnomer, as it includes downgrades to the default
+      version.}) is assigned $-15$ points.
+  \item Installing a version of $p$ other than the ``default'' version
+    (as usual, if $I_0$ does not install this version of $p$) is
+    assigned -40 points.
+  \end{itemize}
+
+\item A penalty is applied to each search node based on its distance
+  from the root of the search.  This works to favor ``simpler''
+  solutions and penalize more complex ones.
+
+\item Nodes that resolve all dependencies are given an additional
+  bonus -- usually a relatively minor one.  Goal nodes are moved
+  through the priority queue in the normal manner, rather than
+  immediately being returned, in order to ensure that ``better''
+  solutions are produced first -- and to make sure that solutions that
+  are particularly ``bad'' are not produced unless it is absolutely
+  necessary.
 \end{itemize}
 
+Thus, letting $B(I)$ be the set of dependencies that are ``broken''
+(not satisfied) by $I$ and letting $h(v)$ be the score of the version
+$v$, the total score of an installation is
+
+\begin{equation}
+  h(I) = \alpha_B|B(I)| + \alpha_L \idist{I}{I_0} + \alpha_G \delta(0,|B(I)|) + \sum_{p \in \P} h(I(p))
+\end{equation}
+
+where $\alpha_B$, $\alpha_L$, and $\alpha_G$ are weighting factors and
+$\delta$ is the Kronacker delta function (i.e., $\delta(i,j)$ is $1$
+if $i=j$ and $0$ otherwise).  In the current implementation,
+$\alpha_B=-100$, $\alpha_L=-10$, and $\alpha_G=50$.
+
+\section{Avoiding Exponential Blowup}
+
+The algorithm laid out above is sufficient to solve many of the
+dependency problems that are encountered in practice.  However, some
+problems still cause the search to take an unacceptable amount of time
+to reach a solution.  The problems observed fall into two main
+categories:
+
+\begin{enumerate}
+\item Too many reverse dependencies.
+
+  In order to calculate the score of a successor of an installation
+  (and of course to analyze that solution later on) it is necessary to
+  generate the set of dependencies which are not satisfied by that
+  successor.  However, there are some one hundred thousand
+  dependencies in the Debian archive; so that it completes in a
+  reasonable amount of time, the current implementation uses the
+  obvious optimization of only testing those dependencies which either
+  were previously broken, which impinge on the package version being
+  removed, or which impinge on the package version being
+  installed.\footnote{Recall that a successor to $I$ will install
+    version $v$ of $p$, removing $I(p)$ in the process.}
+
+  Unfortunately, some packages have very many reverse dependencies.
+  For instance, if $I$ removes the system C library, over a thousand
+  dependencies will be unsatisfied -- and simply generating the
+  successors of this node will require time \emph{quadratic} in the
+  number of reverse dependencies of \pkg{libc}; i.e., over a million
+  dependencies will be tested.  This can impose a significant
+  performance penalty on the process of successor generation.
+
+\item Removing the bottom of a dependency chain.
+
+  When an important library such as GTK is removed, it is necessary to
+  propagate the removal ``up the dependency tree''.  However, the
+  search technique outlined above will search exponentially many
+  installations before coming to this solution.  Aside from the goal
+  node of ``keep the library on the system'', the first step of the
+  search will enqueue one node for each package depending on GTK; each
+  node will remove its corresponding package.  As these nodes are
+  processed, pairs of packages will be removed from the system; then
+  triples, and so on, until the full power set of all packages
+  depending (directly or indirectly) on GTK is generated.  Worse, at
+  each step, solutions that suggest installing GTK (and removing many
+  packages) will be generated.
+\end{enumerate}
+
+The problem of ``too many reverse dependencies'' is relatively easily
+solved.  Instead of generating sucecssors for every dependency, it is
+sufficient to generate successors for a single dependency (this is
+evident from the proof of Theorem~\ref{thm:succ-sufficient}; a
+stronger condition will be demonstrated explicitly below).  In theory,
+this practice could lead to somewhat less optimal ordering of
+generated solutions, but the decrease in the problem's branching
+factor is well worth it.
+
+The second problem is somewhat trickier, but it can be resolved by
+restructuring the search space.  To each solution node $I$, attach a
+set $I_F$ of \emph{forbidden} versions; the successors of $I$ are
+restricted to those which do not install any version in $I_F$.  For
+all successors $I'$ of $I$, $I'_F \subseteq I_F$; furthermore, if a
+successor $I'$ of $I$ is generated by removing the source version of a
+dependency, then all of the targets of that dependency are members of
+$I'_F$.  This has the effect of forcing the algorithm to ``stick
+with'' a decision to forgo installing the targets of a dependency in
+favor of shifting the source.
+
+In combination with the tracking of forbidden versions, the algorithm
+is also modified to detect \emph{essential conflicts}: that is,
+dependencies which cannot be resolved due to the constraints placed on
+the generation of successors.  If $d$ is a dependency not satisfied by
+$I$, and if every package version which could resolve a dependency is
+illegal -- either because of the ``only modify it once'' rule or
+because it is a member of $I_F$ -- then $d$ is deemed to be an
+essential conflict and $I$ is discarded without further processing.
+
+Of course, it is important to verify that cutting off wide swathes of
+the search space in this manner does not impede our ability to
+generate solutions:
+
+\begin{theorem}
+  TODO.
+\end{theorem}
+
 \section{Implementation}
 
 A prototype implementation exists in the experimental branch of
-\pkg{aptitude}, but it is not currently used as the main problem
-resolver of the program and as a result has not been widely tested.
-Work is ongoing to connect this prototype to the main user interface
-of \pkg{aptitude}.
+\pkg{aptitude}.
 
 \end{document}