[OmniOS-discuss] OmniOS backup box hanging regularly

Thu Oct 22 16:59:15 UTC 2015

Hello all,
I have this HP-Z400 workstation with 16Gb ECC(should be) RAM running OmniOS bloody, which acts as a backup server for our production systems (regularly rsync'ing large files off Linux boxes, and rotating ZFS auto-snapshots to keep its space free). Sometimes it also runs replicas of infrastructure (DHCP, DNS) and was set up as a VirtualBox + phpVirtualBox host to test that out, but no VMs running.
So the essential loads are ZFS snapshots and ZFS scrubs :)
And it freezes roughly every week. Stops responding to ping, attempts to log in via SSH or physical console - it processes keypresses on the latter, but does not present a login prompt. It used to be stable, and such regular hangs began around summertime.

My primary guess would be for flaky disks, maybe timing out under load or going to sleep or whatever... But I have yet to prove it, or any other theory. Maybe just CPU is overheating due to regular near-100% load with disk I/O... At least I want to rule out OS errors and rule out (or point out) operator/box errors as much as possible - which is something I can change to try and fix ;)
Before I proceed to TL;DR screenshots, I'd overview what I see:
* In the "top" output, processes owned by zfssnap lead most of the time... But even the SSH shell is noticeably slow to respond (1 sec per line when just pressing enter to clear the screen to prepare nice screenshots).
* SMART was not enabled on 3TB mirrored "pool" SATA disks (is now, long tests initiated), but was in place on the "rpool" SAS disk where it logged some corrected ECC errors - but none uncorrected.
Maybe the cabling should be reseated.
* iostat shows disks are generally not busy (they don't audibly rattle nor visibly blink all the time, either)
* zpool scrubs return clean
* there are partitions of the system rpool disk (10K RPM SAS) used as log and cache devices for the main data pool on 3TB SATA disks. The system disk is fast and underutilized, so what the heck ;) And it was not a problem for the first year of this system's honest and stable workouts. These devices are pretty empty at the moment.

I have enabled deadman panics according to Wiki, but none have happened so far:
# cat /etc/system  | egrep -v '(^\*|^$)'
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi = 1

In the "top" output, processes owned by zfssnap lead most of the time:

last pid: 22599;  load avg:  12.9,  12.2,  11.2;       up 0+09:52:11                                                                          18:34:41
140 processes: 125 sleeping, 13 running, 2 on cpu
CPU states:  0.0% idle, 22.9% user, 77.1% kernel,  0.0% iowait,  0.0% swap
Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap
Seconds to delay:
   PID USERNAME LWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 21389 zfssnap    1  43    2  863M  860M run      5:04 35.61% zfs
 22360 zfssnap    1  52    2  118M  115M run      0:37 16.50% zfs
 21778 zfssnap    1  52    2  563M  560M run      3:15 13.17% zfs
 21278 zfssnap    1  52    2  947M  944M run      5:32  6.91% zfs
 21881 zfssnap    1  43    2  433M  431M run      2:31  5.41% zfs
 21852 zfssnap    1  52    2  459M  456M run      2:39  5.16% zfs
 21266 zfssnap    1  43    2  906M  903M run      5:18  3.95% zfs
 21757 zfssnap    1  43    2  597M  594M run      3:26  2.91% zfs
 21274 zfssnap    1  52    2  930M  927M cpu/0    5:27  2.78% zfs
 22588 zfssnap    1  43    2   30M   27M run      0:08  2.48% zfs
 22580 zfssnap    1  52    2   49M   46M run      0:14  0.71% zfs
 22038 root       1  59    0 5312K 3816K cpu/1    0:01  0.10% top
 22014 root       1  59    0 8020K 4988K sleep    0:00  0.02% sshd

Average "iostats" are not that busy:

# zpool iostat -Td 5
Thu Oct 22 18:24:59 CEST 2015
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
pool        2.52T   207G    802    116  28.3M   840K
rpool       33.0G   118G      0      4  4.52K  58.7K
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:04 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0     10      0  97.9K
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:09 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:14 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      9      0  93.5K
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:19 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:24 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:29 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:34 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:25:39 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0     16      0   374K
----------  -----  -----  -----  -----  -----  -----
...
Thu Oct 22 18:33:49 CEST 2015
pool        2.52T   207G      0      0      0      0
rpool       33.0G   118G      0     11      0  94.5K
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:33:54 CEST 2015
pool        2.52T   207G      0     13    819  80.0K
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:33:59 CEST 2015
pool        2.52T   207G      0    129      0  1.06M
rpool       33.0G   118G      0      0      0      0
----------  -----  -----  -----  -----  -----  -----
Thu Oct 22 18:34:04 CEST 2015
pool        2.52T   207G      0     55      0   503K
rpool       33.0G   118G      0     11      0  97.9K
----------  -----  -----  -----  -----  -----  -----
...
just occasional bursts of work. 
I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb "rpool") and ran some short tests and triggered long tests (hopefully they'd succeed by tomorrow); current results are:

# for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi -a $D ; done
===== /dev/rdsk/c0t3d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model:     WDC WD3003FZEX-00Z4SA0
Serial Number:    WD-WCC5D1KKU0PA
LU WWN Device Id: 5 0014ee 2610716b7
Firmware Version: 01.01A01
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (32880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 357) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x7035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   246   154   021    Pre-fail  Always       -       6691
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       14
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4869
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       14
 16 Unknown_Attribute       0x0022   130   070   000    Old_age   Always       -       2289651870502
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       12
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2
194 Temperature_Celsius     0x0022   117   111   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4869         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
===== /dev/rdsk/c0t5d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family:     Seagate SV35
Device Model:     ST3000VX000-1ES166
Serial Number:    Z500S3L8
LU WWN Device Id: 5 000c50 079e3757b
Firmware Version: CV26
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.
Total time to complete Offline
data collection:                (   80) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 325) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10b9) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   099   006    Pre-fail  Always       -       8600880
  3 Spin_Up_Time            0x0003   096   094   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       19
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       342685681
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       4214
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       19
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   028   028   000    Old_age   Always       -       72
190 Airflow_Temperature_Cel 0x0022   069   065   045    Old_age   Always       -       31 (Min/Max 29/32)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       28
194 Temperature_Celsius     0x0022   031   040   000    Old_age   Always       -       31 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Self-test routine in progress 90%      4214         -
# 2  Short offline       Completed without error       00%      4214         -
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Vendor:               HITACHI
Product:              HUS156030VLS600
Revision:             HPH1
User Capacity:        300,000,000,000 bytes [300 GB]
Logical block size:   512 bytes
Logical Unit id:      0x5000cca02a1292dc
Serial number:                LVVA6NHS
Device type:          disk
Transport protocol:   SAS
Local Time is:        Thu Oct 22 18:45:29 2015 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature:     45 C
Drive Trip Temperature:        70 C
Manufactured in week 14 of year 2012
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  80
Elements in grown defect list: 0
Vendor (Seagate) cache information
  Blocks sent to initiator = 2340336504406016
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0   888890         0    888890          0      29326.957           0
write:         0   961315         0    961315          0       6277.560           0
Non-medium error count:      283
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Self test in progress ...   -     NOW                 - [-   -    -]
# 2  Background long   Aborted (device reset ?)    -   14354                 - [-   -    -]
# 3  Background short  Completed                   -   14354                 - [-   -    -]
# 4  Background long   Aborted (device reset ?)    -   14354                 - [-   -    -]
# 5  Background long   Aborted (device reset ?)    -   14354                 - [-   -    -]
Long (extended) Self Test duration: 2506 seconds [41.8 minutes]

The zpool scrub results and general layout:

# zpool status -v
  pool: pool
 state: ONLINE
  scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33 2015
config:
        NAME                       STATE     READ WRITE CKSUM
        pool                       ONLINE       0     0     0
          mirror-0                 ONLINE       0     0     0
            c0t3d0                 ONLINE       0     0     0
            c0t5d0                 ONLINE       0     0     0
        logs
          c4t5000CCA02A1292DDd0p2  ONLINE       0     0     0
        cache
          c4t5000CCA02A1292DDd0p3  ONLINE       0     0     0
errors: No known data errors
  pool: rpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct  8 04:12:35 2015
config:
        NAME                       STATE     READ WRITE CKSUM
        rpool                      ONLINE       0     0     0
          c4t5000CCA02A1292DDd0s0  ONLINE       0     0     0
errors: No known data errors

# zpool list -v
NAME                        SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
pool                       2.72T  2.52T   207G         -    68%    92%  1.36x  ONLINE  /
  mirror                   2.72T  2.52T   207G         -    68%    92%
    c0t3d0                     -      -      -         -      -      -
    c0t5d0                     -      -      -         -      -      -
log                            -      -      -         -      -      -
  c4t5000CCA02A1292DDd0p2     8G   148K  8.00G         -     0%     0%
cache                          -      -      -         -      -      -
  c4t5000CCA02A1292DDd0p3   120G  1.80G   118G         -     0%     1%
rpool                       151G  33.0G   118G         -    76%    21%  1.00x  ONLINE  -
  c4t5000CCA02A1292DDd0s0   151G  33.0G   118G         -    76%    21%

Note the long scrub time may have included the downtime while the system was frozen until it was rebooted.

Thanks in advance for the fresh pairs of eyeballs,
Jim Klimov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://omniosce.org/ml-archive/attachments/20151022/d2b1a76b/attachment-0001.html>