[OmniOS-discuss] OmniOS backup box hanging regularly
Jim Klimov
jim at cos.ru
Thu Oct 22 16:59:15 UTC 2015
Hello all,
I have this HP-Z400 workstation with 16Gb ECC(should be) RAM running OmniOS bloody, which acts as a backup server for our production systems (regularly rsync'ing large files off Linux boxes, and rotating ZFS auto-snapshots to keep its space free). Sometimes it also runs replicas of infrastructure (DHCP, DNS) and was set up as a VirtualBox + phpVirtualBox host to test that out, but no VMs running.
So the essential loads are ZFS snapshots and ZFS scrubs :)
And it freezes roughly every week. Stops responding to ping, attempts to log in via SSH or physical console - it processes keypresses on the latter, but does not present a login prompt. It used to be stable, and such regular hangs began around summertime.
My primary guess would be for flaky disks, maybe timing out under load or going to sleep or whatever... But I have yet to prove it, or any other theory. Maybe just CPU is overheating due to regular near-100% load with disk I/O... At least I want to rule out OS errors and rule out (or point out) operator/box errors as much as possible - which is something I can change to try and fix ;)
Before I proceed to TL;DR screenshots, I'd overview what I see:
* In the "top" output, processes owned by zfssnap lead most of the time... But even the SSH shell is noticeably slow to respond (1 sec per line when just pressing enter to clear the screen to prepare nice screenshots).
* SMART was not enabled on 3TB mirrored "pool" SATA disks (is now, long tests initiated), but was in place on the "rpool" SAS disk where it logged some corrected ECC errors - but none uncorrected.
Maybe the cabling should be reseated.
* iostat shows disks are generally not busy (they don't audibly rattle nor visibly blink all the time, either)
* zpool scrubs return clean
* there are partitions of the system rpool disk (10K RPM SAS) used as log and cache devices for the main data pool on 3TB SATA disks. The system disk is fast and underutilized, so what the heck ;) And it was not a problem for the first year of this system's honest and stable workouts. These devices are pretty empty at the moment.
I have enabled deadman panics according to Wiki, but none have happened so far:
# cat /etc/system | egrep -v '(^\*|^$)'
set snooping=1
set pcplusmp:apic_panic_on_nmi=1
set apix:apic_panic_on_nmi = 1
In the "top" output, processes owned by zfssnap lead most of the time:
last pid: 22599; load avg: 12.9, 12.2, 11.2; up 0+09:52:11 18:34:41
140 processes: 125 sleeping, 13 running, 2 on cpu
CPU states: 0.0% idle, 22.9% user, 77.1% kernel, 0.0% iowait, 0.0% swap
Memory: 16G phys mem, 1765M free mem, 2048M total swap, 2048M free swap
Seconds to delay:
PID USERNAME LWP PRI NICE SIZE RES STATE TIME CPU COMMAND
21389 zfssnap 1 43 2 863M 860M run 5:04 35.61% zfs
22360 zfssnap 1 52 2 118M 115M run 0:37 16.50% zfs
21778 zfssnap 1 52 2 563M 560M run 3:15 13.17% zfs
21278 zfssnap 1 52 2 947M 944M run 5:32 6.91% zfs
21881 zfssnap 1 43 2 433M 431M run 2:31 5.41% zfs
21852 zfssnap 1 52 2 459M 456M run 2:39 5.16% zfs
21266 zfssnap 1 43 2 906M 903M run 5:18 3.95% zfs
21757 zfssnap 1 43 2 597M 594M run 3:26 2.91% zfs
21274 zfssnap 1 52 2 930M 927M cpu/0 5:27 2.78% zfs
22588 zfssnap 1 43 2 30M 27M run 0:08 2.48% zfs
22580 zfssnap 1 52 2 49M 46M run 0:14 0.71% zfs
22038 root 1 59 0 5312K 3816K cpu/1 0:01 0.10% top
22014 root 1 59 0 8020K 4988K sleep 0:00 0.02% sshd
Average "iostats" are not that busy:
# zpool iostat -Td 5
Thu Oct 22 18:24:59 CEST 2015
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
pool 2.52T 207G 802 116 28.3M 840K
rpool 33.0G 118G 0 4 4.52K 58.7K
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:04 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 10 0 97.9K
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:09 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:14 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 9 0 93.5K
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:19 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:24 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:29 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:34 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:25:39 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 16 0 374K
---------- ----- ----- ----- ----- ----- -----
...
Thu Oct 22 18:33:49 CEST 2015
pool 2.52T 207G 0 0 0 0
rpool 33.0G 118G 0 11 0 94.5K
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:33:54 CEST 2015
pool 2.52T 207G 0 13 819 80.0K
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:33:59 CEST 2015
pool 2.52T 207G 0 129 0 1.06M
rpool 33.0G 118G 0 0 0 0
---------- ----- ----- ----- ----- ----- -----
Thu Oct 22 18:34:04 CEST 2015
pool 2.52T 207G 0 55 0 503K
rpool 33.0G 118G 0 11 0 97.9K
---------- ----- ----- ----- ----- ----- -----
...
just occasional bursts of work.
I've now enabled SMART on the disks (2*3Tb mirror "pool" and 1*300Gb "rpool") and ran some short tests and triggered long tests (hopefully they'd succeed by tomorrow); current results are:
# for D in /dev/rdsk/c0*s0; do echo "===== $D :"; smartctl -d sat,12 -a $D ; done ; for D in /dev/rdsk/c4*s0 ; do echo "===== $D :"; smartctl -d scsi -a $D ; done
===== /dev/rdsk/c0t3d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD3003FZEX-00Z4SA0
Serial Number: WD-WCC5D1KKU0PA
LU WWN Device Id: 5 0014ee 2610716b7
Firmware Version: 01.01A01
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: (32880) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 357) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 246 154 021 Pre-fail Always - 6691
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 4869
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 14
16 Unknown_Attribute 0x0022 130 070 000 Old_age Always - 2289651870502
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 12
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 2
194 Temperature_Celsius 0x0022 117 111 000 Old_age Always - 35
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 4869 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
===== /dev/rdsk/c0t5d0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate SV35
Device Model: ST3000VX000-1ES166
Serial Number: Z500S3L8
LU WWN Device Id: 5 000c50 079e3757b
Firmware Version: CV26
User Capacity: 3,000,592,982,016 bytes [3.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Thu Oct 22 18:45:28 2015 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
Total time to complete Offline
data collection: ( 80) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 325) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x10b9) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 105 099 006 Pre-fail Always - 8600880
3 Spin_Up_Time 0x0003 096 094 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 19
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 342685681
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 4214
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 19
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 028 028 000 Old_age Always - 72
190 Airflow_Temperature_Cel 0x0022 069 065 045 Old_age Always - 31 (Min/Max 29/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 28
194 Temperature_Celsius 0x0022 031 040 000 Old_age Always - 31 (0 20 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Self-test routine in progress 90% 4214 -
# 2 Short offline Completed without error 00% 4214 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
===== /dev/rdsk/c4t5000CCA02A1292DDd0s0 :
smartctl 6.0 2012-10-10 r3643 [i386-pc-solaris2.11] (local build)
Copyright (C) 2002-12, Bruce Allen, Christian Franke, www.smartmontools.org
Vendor: HITACHI
Product: HUS156030VLS600
Revision: HPH1
User Capacity: 300,000,000,000 bytes [300 GB]
Logical block size: 512 bytes
Logical Unit id: 0x5000cca02a1292dc
Serial number: LVVA6NHS
Device type: disk
Transport protocol: SAS
Local Time is: Thu Oct 22 18:45:29 2015 CEST
Device supports SMART and is Enabled
Temperature Warning Enabled
SMART Health Status: OK
Current Drive Temperature: 45 C
Drive Trip Temperature: 70 C
Manufactured in week 14 of year 2012
Specified cycle count over device lifetime: 50000
Accumulated start-stop cycles: 80
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 2340336504406016
Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 0 888890 0 888890 0 29326.957 0
write: 0 961315 0 961315 0 6277.560 0
Non-medium error count: 283
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background long Self test in progress ... - NOW - [- - -]
# 2 Background long Aborted (device reset ?) - 14354 - [- - -]
# 3 Background short Completed - 14354 - [- - -]
# 4 Background long Aborted (device reset ?) - 14354 - [- - -]
# 5 Background long Aborted (device reset ?) - 14354 - [- - -]
Long (extended) Self Test duration: 2506 seconds [41.8 minutes]
The zpool scrub results and general layout:
# zpool status -v
pool: pool
state: ONLINE
scan: scrub repaired 0 in 164h13m with 0 errors on Thu Oct 22 18:13:33 2015
config:
NAME STATE READ WRITE CKSUM
pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
c0t3d0 ONLINE 0 0 0
c0t5d0 ONLINE 0 0 0
logs
c4t5000CCA02A1292DDd0p2 ONLINE 0 0 0
cache
c4t5000CCA02A1292DDd0p3 ONLINE 0 0 0
errors: No known data errors
pool: rpool
state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
the pool may no longer be accessible by software that does not support
the features. See zpool-features(5) for details.
scan: scrub repaired 0 in 3h3m with 0 errors on Thu Oct 8 04:12:35 2015
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
c4t5000CCA02A1292DDd0s0 ONLINE 0 0 0
errors: No known data errors
# zpool list -v
NAME SIZE ALLOC FREE EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
pool 2.72T 2.52T 207G - 68% 92% 1.36x ONLINE /
mirror 2.72T 2.52T 207G - 68% 92%
c0t3d0 - - - - - -
c0t5d0 - - - - - -
log - - - - - -
c4t5000CCA02A1292DDd0p2 8G 148K 8.00G - 0% 0%
cache - - - - - -
c4t5000CCA02A1292DDd0p3 120G 1.80G 118G - 0% 1%
rpool 151G 33.0G 118G - 76% 21% 1.00x ONLINE -
c4t5000CCA02A1292DDd0s0 151G 33.0G 118G - 76% 21%
Note the long scrub time may have included the downtime while the system was frozen until it was rebooted.
Thanks in advance for the fresh pairs of eyeballs,
Jim Klimov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://omniosce.org/ml-archive/attachments/20151022/d2b1a76b/attachment-0001.html>
More information about the OmniOS-discuss
mailing list