[OmniOS-discuss] latency spikes ~every hour

Wed Dec 17 21:16:37 UTC 2014

On 15 December 2014 13:27:39 CET, Rune Tipsmark <rt at steait.net> wrote:
>ok I removed some of my SLOG devices and currently I am only using a
>single SLOG (no mirror or anything) and no spikes seen since.
>
>I wonder why multiple SLOG devices would cause this.
>
>br.
>
>Rune
>
>________________________________
>From: OmniOS-discuss <omnios-discuss-bounces at lists.omniti.com> on
>behalf of Rune Tipsmark <rt at steait.net>
>Sent: Sunday, December 14, 2014 2:27 PM
>To: omnios-discuss at lists.omniti.com
>Subject: [OmniOS-discuss] latency spikes ~every hour
>
>
>hi all,
>
>
>
>All my vSphere (ESXi5.1) hosts experience a big spike in latency every
>hour or so.
>
>I tested on InfiniBand iSER and SRP and also 4Gbit FC and 8Gbit FC. All
>exhibit the same behavior so I don't think it's the connection that is
>causing this.
>
>When I set arc_shrink_shift to 10 (192GB RAM in the system) it
>helps a bit; the spikes still come with the same regularity but latency
>peaks at about ~5000ms for the most part. If I leave arc_shrink_shift
>at the default value they can be higher - up to 15000ms.
>
>Looking at the latency as seen from the vSphere hosts the average is
>usually below 1ms for most datastores.
>
>
>
>Any ideas what can cause this or what can be done to fix?
>
>br,
>
>Rune
>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>OmniOS-discuss mailing list
>OmniOS-discuss at lists.omniti.com
>http://lists.omniti.com/mailman/listinfo/omnios-discuss

One idea stems from old experience: there used to be two triggers for starting a TXG sync, which is when cached (async) writes actually get to disk: a timeout and a filled-up buffer. While the sync is underway, the system can indeed feel locked up, with new I/Os delayed and so on. In such cases the tuning was to decrease the time limit or, rather, the size limit, to make the TXG syncs shorter and faster, though at the risk of more fragmentation of on-disk data.
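
To illustrate the two triggers, here is a toy model in Python. It is only a sketch of the idea described above, not how ZFS implements it: the function name and the timeout_s and size_limit parameters are made up for this example and are not real tunables.

```python
def txg_sync_times(writes, timeout_s=5.0, size_limit=4 * 1024**3):
    """Toy model: return (time, reason, nbytes) for each simulated TXG sync.

    writes: list of (time_s, nbytes) async writes, sorted by time.
    timeout_s and size_limit are illustrative values, not real tunables.
    """
    syncs = []
    opened = 0.0  # time at which the current TXG was opened
    dirty = 0     # bytes buffered in the current TXG
    for t, nbytes in writes:
        # Timer trigger: close every TXG whose timeout expired before t.
        while t - opened >= timeout_s:
            opened += timeout_s
            if dirty:
                syncs.append((opened, "timeout", dirty))
                dirty = 0
        dirty += nbytes
        # Size trigger: the buffer filled up before the timer expired.
        if dirty >= size_limit:
            syncs.append((t, "full", dirty))
            dirty = 0
            opened = t
    # Whatever is still buffered goes out when the last timer expires.
    if dirty:
        syncs.append((opened + timeout_s, "timeout", dirty))
    return syncs
```

In this model, lowering size_limit or timeout_s yields more frequent syncs that each carry fewer bytes, which is the trade-off mentioned above.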

This is easily seen in iostat or zpool iostat as a considerable spike in write traffic occurring regularly (e.g. every 5 seconds). With your RAM size, the metadata is probably cached, so not many reads are needed to complete the write.
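
If you would rather not eyeball the output, a small Python sketch like the following could flag the spikes and show how regularly they occur. It assumes you have already collected (time, write rate) samples yourself, e.g. from interval output of iostat or zpool iostat; the function names and the factor threshold are hypothetical choices for this example.

```python
from statistics import median


def find_write_spikes(samples, factor=5.0):
    """Return timestamps whose write rate exceeds factor x the median rate.

    samples: list of (time_s, write_bytes_per_s) pairs.
    factor: illustrative threshold, tune it for your workload.
    """
    rates = [rate for _, rate in samples]
    baseline = median(rates)
    return [t for t, rate in samples if rate > factor * baseline]


def spike_intervals(spike_times):
    """Gaps in seconds between consecutive spikes."""
    return [b - a for a, b in zip(spike_times, spike_times[1:])]
```

Intervals of a few seconds would point at the regular TXG sync cadence, while intervals near 3600 seconds would point at something scheduled hourly.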

Your hourly hiccups = backups starting up, maybe?

It may be that the mechanism or the tuning knobs were changed in illumos-zfs in the past year or two, but it may also be that your (unnamed?) distro still has that behavior.

--
Typos courtesy of K-9 Mail on my Samsung Android