[OmniOS-discuss] [zfs] Bizarre zfs-related hang in omnios r151008 on 1-CPU VM

Fri Dec 6 06:31:38 UTC 2013

Be sure you have the following fix; without it I recall seeing spins from
the ZPL similar to that stack trace.  With only 1 CPU, if a kernel thread
spins, it can be very hard to get other threads to run.

commit e722410c49fe67cbf0f639cbcc288bd6cbcf7dd1
Author: Matthew Ahrens <mahrens at delphix.com>
Date:   Tue Nov 26 13:47:33 2013 -0500

    4347 ZPL can use dmu_tx_assign(TXG_WAIT)
    Reviewed by: George Wilson <george.wilson at delphix.com>
    Reviewed by: Adam Leventhal <ahl at delphix.com>
    Reviewed by: Dan McDonald <danmcd at nexenta.com>
    Reviewed by: Boris Protopopov <boris.protopopov at nexenta.com>
    Approved by: Dan McDonald <danmcd at nexenta.com>
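
For readers unfamiliar with the commit title, the following is a rough
user-space sketch of the two calling conventions it contrasts: a caller-side
retry loop around a non-blocking assign, versus a single blocking assign.
This is not the illumos code. Every name prefixed model_ or MODEL_ is
hypothetical, the "txg" is reduced to a countdown, and the sketch makes no
attempt to reproduce the spin itself. The assumption behind it is that,
before this change, ZPL callers used dmu_tx_assign() with TXG_NOWAIT and, on
ERESTART, called dmu_tx_wait() and dmu_tx_abort() and then started over.

```c
#include <assert.h>

/* Hypothetical stand-ins for TXG_NOWAIT / TXG_WAIT and ERESTART. */
enum model_how { MODEL_NOWAIT, MODEL_WAIT };
#define MODEL_ERESTART 1

/* Toy "open txg": it cannot accept a tx until busy_ticks reaches 0. */
struct model_txg {
    int busy_ticks;
};

/* Toy wait: each call lets one tick of simulated time pass. */
static void model_wait(struct model_txg *t)
{
    if (t->busy_ticks > 0)
        t->busy_ticks--;
}

/*
 * Toy assign. With MODEL_NOWAIT it fails immediately while the txg is
 * busy; with MODEL_WAIT it waits internally and always returns 0.
 */
static int model_assign(struct model_txg *t, enum model_how how)
{
    if (how == MODEL_WAIT) {
        while (t->busy_ticks > 0)
            model_wait(t);
        return 0;
    }
    return (t->busy_ticks > 0) ? MODEL_ERESTART : 0;
}

/*
 * Caller-side retry convention: the caller owns the loop.
 * Returns how many assign attempts were needed.
 */
static int write_with_retry(struct model_txg *t)
{
    int attempts = 0;

    for (;;) {
        attempts++;
        int err = model_assign(t, MODEL_NOWAIT);
        if (err == 0)
            return attempts;
        assert(err == MODEL_ERESTART);
        model_wait(t);
        /* A real caller would abort and recreate the tx here. */
    }
}

/* Blocking convention: one call, no retry loop in the caller. */
static int write_with_wait(struct model_txg *t)
{
    int err = model_assign(t, MODEL_WAIT);
    assert(err == 0);
    return 1;
}
```

With busy_ticks set to 3, write_with_retry() makes four assign attempts
while write_with_wait() makes one; the point of the blocking form is that
the retry logic lives in one place rather than in every caller.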

On Thu, Dec 5, 2013 at 8:14 PM, Saso Kiselkov <skiselkov.ml at gmail.com> wrote:

> I'm investigating a bizarre hang situation which I noticed by accident
> on the latest stable omnios release. When I'm running in VMware Fusion
> on a 1-CPU VM and doing any significant write IO to the pool (e.g. just
> dd'ing something around is enough to trigger this), the VM will, with
> 100% certainty, hang. Console input works, but all userspace programs
> are stopped and nothing responds (e.g. attempting to telnet to sshd over
> the network establishes the socket, but then sshd doesn't print the
> version string).
>
> Using some dtrace foo and kmdb I was able to trace it (roughly, the
> exact stack trace changes between hangs, which is mighty weird in itself):
>
>     atomic_dec_32_nv+8()
>     dbuf_read+0x179(ffffff00d2393600, ffffff00c72f98f0, a)
>     dmu_tx_check_ioerr+0x76(ffffff00c72f98f0, ffffff00d2279cf0, 0, 1e0)
>     dmu_tx_count_write+0x395(ffffff00ce0536e0, 3c04000, 4000)
>     dmu_tx_hold_write+0x5a(ffffff00d1a55300, 4009, 3c04000, 4000)
>     zfs_write+0x3e3(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)
>     fop_write+0x5b(ffffff00d09ef540, ffffff00028e7e60, 0, ffffff00cd511748, 0)
>     write+0x250(1, 440660, 4000)
>     sys_syscall+0x17a()
>
> (usually the trace is identical up to dmu_tx_hold_write)
>
> I can definitely confirm that this doesn't happen on omnios r151006 and
> it doesn't happen on my vanilla kernels either. My suspicion is that
> something got botched in the "OMNIOS#72 Integrate Joyent updated zone
> write throttle" commit, but I can't put my finger on it.
>
> Can somebody please confirm this?
>
> Cheers,
> --
> Saso