[OmniOS-discuss] Bizarre zfs-related hang in omnios r151008 on 1-CPU VM

Fri Dec 6 04:14:28 UTC 2013

I'm investigating a bizarre hang situation which I noticed by accident
on the latest stable omnios release. When I'm running in VMware Fusion
on a 1-CPU VM and doing any significant write IO to the pool (e.g. just
dd'ing something around is enough to trigger this), the VM will, with
100% certainty, hang. Console input works, but all userspace programs
are stopped and nothing responds (e.g. attempting to telnet to sshd over
the network establishes the socket, but then sshd doesn't print the
version string).
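
For anyone who wants to try reproducing this, the trigger can be sketched
roughly as below. This is only a minimal sketch, not the exact command I
ran: the file name ddtest.bin and the TARGET/COUNT variables are made up
for illustration, and the 16k block size is chosen only because it matches
the 0x4000-byte write visible in the stack trace. Run it from a directory
that lives on the affected pool, on a VM configured with a single CPU.

```shell
#!/bin/sh
# Hypothetical reproduction sketch: sustained sequential writes via dd.
# TARGET and COUNT are illustrative names, not anything from the original report.
TARGET="${TARGET:-./ddtest.bin}"   # assumption: current directory is on the ZFS pool
COUNT="${COUNT:-4096}"             # 4096 blocks of 16 KiB = 64 MiB in total

# Write COUNT blocks of 16 KiB of zeroes to the target file.
dd if=/dev/zero of="$TARGET" bs=16k count="$COUNT"
```

If the write completes without a hang, raise COUNT or run it a few times
in a row, and remove the test file afterwards.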

Using some dtrace-fu and kmdb I was able to trace it to roughly the
following stack (the exact trace changes between hangs, which is mighty
weird in itself):

    atomic_dec_32_nv+8()
    dbuf_read+0x179(ffffff00d2393600, ffffff00c72f98f0, a)
    dmu_tx_check_ioerr+0x76(ffffff00c72f98f0, ffffff00d2279cf0, 0, 1e0)
    dmu_tx_count_write+0x395(ffffff00ce0536e0, 3c04000, 4000)
    dmu_tx_hold_write+0x5a(ffffff00d1a55300, 4009, 3c04000, 4000)
    zfs_write+0x3e3(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)
    fop_write+0x5b(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)
    write+0x250(1, 440660, 4000)
    sys_syscall+0x17a()

(usually the trace is identical up to dmu_tx_hold_write)

I can definitely confirm that this doesn't happen on omnios r151006 and
it doesn't happen on my vanilla kernels either. My suspicion is that
something got botched in the "OMNIOS#72 Integrate Joyent updated zone
write throttle" commit, but I can't put my finger on it.

Can somebody please confirm this?

Cheers,
-- 
Saso