[OmniOS-discuss] iSCSI traffic suddenly comes to a halt and then resumes
Matej Zerovnik
matej at zunaj.si
Wed May 27 06:58:16 UTC 2015
Hello Josten,
> On 26 May 2015, at 22:18, Anon <anon at omniti.com> wrote:
>
> Hi Matej,
>
> Do you have sar running on your system? I'd recommend maybe running it at a short interval so that you can get historical disk statistics. You can use this info to rule out if it's the disks or not. You can also use iotop -P to get a real time view of %IO to see if it's the disks. You can also use zpool iostat -v 1.
I didn’t have sar or iotop running, but I did have 'iostat -xn' and 'zpool iostat -v 1' running when things stopped working, and there is nothing unusual in there. Write ops suddenly fall to 0 and that’s it. Reads are still happening, and judging by the network traffic, there is outgoing traffic even while I’m unable to write to the ZFS filesystem (even locally on the server). I created a simple text file, so the next time the system hangs I will be able to check whether the filesystem is still readable (currently I only have iSCSI volumes, so I’m unable to check that locally on the server).
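If it helps anyone else, here is a rough sketch of the write/read probe I have in mind, suitable for running from cron so the next hang gets caught with a timestamp. The path and filename are placeholders; in practice I would point it at the pool mountpoint (e.g. /volumes/data). Note that a truly hung write will block this script in the kernel, so it should be run with some external watchdog if one is available:

```shell
#!/bin/sh
# Minimal probe: does the filesystem still accept writes, and are reads
# still possible? Path defaults to /tmp here for illustration; pass the
# pool mountpoint as the first argument.
probe_fs() {
    mnt=$1
    f="$mnt/.write-probe.$$"

    # Write probe. If writes are stalled, this is where the script blocks.
    if echo "probe $(date -u)" > "$f" 2>/dev/null; then
        echo "write: ok"
    else
        echo "write: FAILED"
    fi

    # Read probe: confirms reads keep working even while writes are stalled.
    if cat "$f" >/dev/null 2>&1; then
        echo "read: ok"
    else
        echo "read: FAILED"
    fi
    rm -f "$f"
}

probe_fs "${1:-/tmp}"
```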
>
> Also, do you have baseline benchmark of performance and know if you're meeting/exceeding it? The baseline should be for random and sequential IO; you can use bonnie++ to get this information.
I can say with 99.99% certainty that I’m exceeding the performance of the pool itself. It’s a single raidz2 vdev with 50 hard drives and 70 connected clients. Some are idling, but 10-20 clients are pushing data to the server at any given time. I know the zpool configuration is very bad, but that’s a legacy setup I can’t change easily. I’m already syncing data to another server with 7 vdevs, but since this server is so busy, the transfers are VERY slow (read: the zfs sync is doing 10MB/s).
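A back-of-envelope check makes the mismatch obvious. The numbers below are assumptions (roughly 150 random IOPS for a single 7200rpm drive, and the rule of thumb that a raidz vdev delivers about the random IOPS of one member disk regardless of width):

```shell
#!/bin/sh
# Rough capacity sanity check. All numbers are assumptions, not
# measurements: ~150 random IOPS per spinning disk, and a raidz/raidz2
# vdev delivering roughly the random IOPS of ONE member disk.
DISK_IOPS=150
VDEVS=1            # the pool is a single raidz2 vdev
ACTIVE_CLIENTS=20  # 10-20 clients actively writing

POOL_IOPS=$((DISK_IOPS * VDEVS))
echo "pool random IOPS: ~$POOL_IOPS"
echo "per active client: ~$((POOL_IOPS / ACTIVE_CLIENTS))"
```

So despite having 50 drives, the pool behaves like one disk for random IO, leaving only a handful of IOPS per active client. This is why the 7-vdev replacement server should behave very differently.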
>
> Are you able to share your ZFS configuration and iSCSI configuration?
Sure! Here are zfs settings:
zfs get all data:
NAME  PROPERTY              VALUE                  SOURCE
data  type                  filesystem             -
data  creation              Fri Oct 25 20:26 2013  -
data  used                  104T                   -
data  available             61.6T                  -
data  referenced            1.09M                  -
data  compressratio         1.08x                  -
data  mounted               yes                    -
data  quota                 none                   default
data  reservation           none                   default
data  recordsize            128K                   default
data  mountpoint            /volumes/data          received
data  sharenfs              off                    default
data  checksum              on                     default
data  compression           off                    received
data  atime                 off                    local
data  devices               on                     default
data  exec                  on                     default
data  setuid                on                     default
data  readonly              off                    local
data  zoned                 off                    default
data  snapdir               hidden                 default
data  aclmode               discard                default
data  aclinherit            restricted             default
data  canmount              on                     default
data  xattr                 on                     default
data  copies                1                      default
data  version               5                      -
data  utf8only              off                    -
data  normalization         none                   -
data  casesensitivity       sensitive              -
data  vscan                 off                    default
data  nbmand                off                    default
data  sharesmb              off                    default
data  refquota              none                   default
data  refreservation        none                   default
data  primarycache          all                    default
data  secondarycache        all                    default
data  usedbysnapshots       0                      -
data  usedbydataset         1.09M                  -
data  usedbychildren        104T                   -
data  usedbyrefreservation  0                      -
data  logbias               latency                default
data  dedup                 off                    local
data  mlslabel              none                   default
data  sync                  standard               default
data  refcompressratio      1.00x                  -
data  written               1.09M                  -
data  logicalused           98.1T                  -
data  logicalreferenced     398K                   -
data  filesystem_limit      none                   default
data  snapshot_limit        none                   default
data  filesystem_count      none                   default
data  snapshot_count        none                   default
data  redundant_metadata    all                    default
data  nms:dedup-dirty       on                     received
data  nms:description       datauporabnikov        received
I’m not sure which iSCSI configuration you want/need? But as far as I figured out during the last 'freeze', iSCSI is not the problem, since I’m unable to write to the ZFS volume even locally on the server itself.
>
> For iSCSI, can you take a look at this: http://docs.oracle.com/cd/E23824_01/html/821-1459/fpjwy.html#fsume
Interesting. I tried running 'iscsiadm list target' but it doesn’t return anything. There is also nothing in /var/adm/messages, as usual :) But the target service is online (according to svcs), and clients are connected and passing traffic.
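If I understand the illumos tooling correctly, that may be expected: 'iscsiadm' administers the iSCSI *initiator* side, while the target on OmniOS is COMSTAR, administered with itadm and stmfadm. A rough sketch of the target-side checks I plan to run next time (wrapped so it skips tools that aren't present on a given host):

```shell
#!/bin/sh
# Target-side status checks for COMSTAR. 'iscsiadm list target' queries
# the initiator side, which would explain empty output on a pure target
# host. Each command is skipped if its tool is not installed here.
run_check() {
    for cmd in "itadm list-target -v" \
               "stmfadm list-state" \
               "stmfadm list-lu -v"; do
        tool=${cmd%% *}
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $cmd =="
            $cmd
        else
            echo "skipping: $cmd (command not present on this host)"
        fi
    done
}
run_check
```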
>
> Do you have detailed logs for the clients experiencing the issues? If not are you able to enable verbose logging (such as debug level logs)?
I have the client logs, but they mostly just report losing connections and reconnecting:
Example 1:
Apr 29 10:33:53 eee kernel: connection1:0: detected conn error (1021)
Apr 29 10:33:54 eee iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)
Apr 29 10:33:56 eee iscsid: connection1:0 is operational after recovery (1 attempts)
Apr 29 10:36:37 eee kernel: connection1:0: detected conn error (1021)
Apr 29 10:36:37 eee iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)
Apr 29 10:36:40 eee iscsid: connection1:0 is operational after recovery (1 attempts)
Apr 29 10:36:50 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery
Apr 29 10:36:51 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery
Apr 29 10:36:51 eee kernel: sd 3:0:0:0: Device offlined - not ready after error recovery
Example 2:
Apr 16 08:41:40 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. Reason code 0x7
Apr 16 08:43:11 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. Reason code 0x7
Apr 16 08:44:13 vf kernel: connection1:0: pdu (op 0x5e itt 0x1) rejected. Reason code 0x7
Apr 16 08:45:51 vf kernel: connection1:0: detected conn error (1021)
Apr 16 08:45:51 317 iscsid: Kernel reported iSCSI connection 1:0 error (1021 - ISCSI_ERR_SCSI_EH_SESSION_RST: Session was dropped as a result of SCSI error recovery) state (3)
Apr 16 08:45:53 vf iscsid: connection1:0 is operational after recovery (1 attempts)
I’m already in contact with OmniTI regarding our new build, but in the meantime I would love for our clients to be able to use the storage, so I’m trying to resolve the current issue somehow…
Matej