[Pkg-libvirt-maintainers] Bug#719675: Bug#719675: Live migration of KVM guests fails if it takes more than 30 seconds (large memory guests)

Christian Balzer chibi at gol.com
Thu Aug 15 06:49:50 UTC 2013


On Thu, 15 Aug 2013 08:16:02 +0200 Guido Günther wrote:

> On Thu, Aug 15, 2013 at 09:35:09AM +0900, Christian Balzer wrote:
> > On Wed, 14 Aug 2013 21:50:22 +0200 Guido Günther wrote:
> > 
> > > On Wed, Aug 14, 2013 at 04:49:42PM +0900, Christian Balzer wrote:
> > > > 
> > > > Package: libvirt0
> > > > Version: 0.9.12-11+deb7u1
> > > > Severity: important
> > > > 
> > > > Hello,
> > > > 
> > > > when doing a live migration using Pacemaker (the OCF VirtualDomain
> > > > RA) on a cluster with DRBD (active/active) backing storage
> > > > everything works fine with recently started (small memory
> > > > footprint of about 200MB at most) KVM guests. 
> > > > 
> > > > After inflating one guest to 2GB memory usage (memtester comes in
> > > > handy for that) the migration failed after 30 seconds, having
> > > > managed to migrate about 400MB in that time over the direct,
> > > > dedicated GbE link between my test cluster host nodes. 
> > > > 
> > > > libvirtd.log on the migration target node, migration start time is
> > > > 07:24:51 :
> > > > ---
> > > > 2013-08-13 07:24:51.807+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > > 2013-08-13 07:24:51.886+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > > 2013-08-13 07:24:51.888+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > > 2013-08-13 07:24:51.948+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > > 2013-08-13 07:24:51.948+0000: 31953: warning : qemuDomainObjEnterMonitorInternal:994 : This thread seems to be the async job owner; entering monitor without asking for a nested job is dangerous
> > > > 2013-08-13 07:25:21.217+0000: 31950: warning : virKeepAliveTimer:182 : No response from client 0x1948280 after 5 keepalive messages in 30 seconds
> > > > 2013-08-13 07:25:31.224+0000: 31950: warning : qemuProcessKill:3813 : Timed out waiting after SIGTERM to process 15926, sending SIGKILL
> > > 
> > > This looks more like you're not replying via the keepalive protocol.
> > > What are you using to migrate VMs?
> > >  -- Guido
> > > 
> > As I said up there, the Pacemaker (heartbeat, OCF really) resource
> > agent, with SSH as the transport, which is the only option set.
> 
> This is not telling me how this is done within pacemaker. RHCS used to
> do this with virsh internally. I'll check the sources once I get around
> to it.

Sorry, I was assuming some familiarity with this resource agent.
It does indeed build a virsh command line internally. The relevant code for
this case is basically:
---
        # Find out the remote hypervisor to connect to. That is, turn
        # something like "qemu://foo:9999/system" into
        # "qemu+tcp://bar:9999/system"
        if [ -n "${OCF_RESKEY_migration_transport}" ]; then
            transport_suffix="+${OCF_RESKEY_migration_transport}"
        fi
---
The above defines the transport, ssh in my case.
And then later:
---
        # Scared of that sed expression? So am I. :-)
        remoteuri=$(echo ${OCF_RESKEY_hypervisor} | sed -e "s,\(.*\)://[^/:]*\(:\?[0-9]*\)/\(.*\),\1${transport_suffix}://${target_node}\2/\3,")

        # OK, we know where to connect to. Now do the actual migration.
        ocf_log info "$DOMAIN_NAME: Starting live migration to ${target_node} (using remote hypervisor URI ${remoteuri} ${migrateuri})."
        virsh ${VIRSH_OPTIONS} migrate --live $DOMAIN_NAME ${remoteuri} ${migrateuri}
        rc=$?
---
In my case migrateuri is empty because I didn't define anything, so I left
out the code that would potentially define it.
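
To make the URI rewrite easier to follow, here is a minimal standalone sketch
of the two snippets above. The function name build_remote_uri, the node name
"node2" and the example hypervisor URIs are made up for illustration and are
not part of the resource agent; only the sed expression is copied from it.
It assumes GNU sed, since \? is a GNU extension.

```shell
#!/bin/sh
# Hypothetical helper (not in the resource agent): reproduces the
# transport_suffix logic and the sed rewrite quoted above.
build_remote_uri() {
    hypervisor="$1"     # stand-in for OCF_RESKEY_hypervisor
    transport="$2"      # stand-in for OCF_RESKEY_migration_transport
    target_node="$3"    # migration destination host

    # An empty transport leaves the URI scheme untouched.
    transport_suffix=""
    if [ -n "$transport" ]; then
        transport_suffix="+${transport}"
    fi

    # \1 = scheme, \2 = optional ":port", \3 = path; the original host
    # is dropped and replaced by the target node.
    echo "$hypervisor" | sed -e "s,\(.*\)://[^/:]*\(:\?[0-9]*\)/\(.*\),\1${transport_suffix}://${target_node}\2/\3,"
}

# Example invocations (assumed values):
build_remote_uri "qemu:///system" "ssh" "node2"
build_remote_uri "qemu://foo:9999/system" "tcp" "bar"
```

With ssh as the transport and a hypervisor URI of qemu:///system, this should
yield qemu+ssh://node2/system, which virsh then receives as the destination
URI. The second call reproduces the example from the agent's own comment.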

Hope that helps,

Christian
-- 
Christian Balzer        Network/Systems Engineer                
chibi at gol.com   	Global OnLine Japan/Fusion Communications
http://www.gol.com/


