[OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2
Stephan Budach
stephan.budach at JVM.DE
Wed May 11 13:05:08 UTC 2016
On 11.05.16 at 14:50, Stephan Budach wrote:
> On 11.05.16 at 13:36, Stephan Budach wrote:
>> On 09.05.16 at 20:43, Dale Ghent wrote:
>>>> On May 9, 2016, at 2:04 PM, Stephan Budach <stephan.budach at JVM.DE>
>>>> wrote:
>>>>
>>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach
>>>>>> <stephan.budach at JVM.DE> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a strange behaviour where OmniOS omnios-r151018-ae3141d
>>>>>> will break the LACP aggr-link on different boxes when Intel
>>>>>> X540-T2s are involved. It starts with a couple of link
>>>>>> downs/ups on one port, and finally the link on that port
>>>>>> negotiates to 1GbE instead of 10GbE, which then breaks the LACP
>>>>>> channel on my Cisco Nexus for this connection.
>>>>>>
>>>>>> I have tried swapping and interchanging cables and thus
>>>>>> switchports, but to no avail.
>>>>>>
>>>>>> Has anyone else noticed this or, even better, know of a solution?
>>>>> Was this an issue noticed only with r151018 and not with previous
>>>>> versions, or have you only tried this with 018?
>>>>>
>>>>> By your description, I presume that the two ixgbe physical links
>>>>> will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>>
>>>>> /dale
>>>> I have noticed that on prior versions of OmniOS as well, but we
>>>> only recently started deploying 10GbE LACP bonds, when we
>>>> introduced our Nexus gear to our network. I will have to check if
>>>> both links stay at 10GbE, when not being configured as a LACP bond.
>>>> Let me check that tomorrow and report back. As we're heading for a
>>>> stretched DC, we are mainly configuring 2-way LACP bonds over our
>>>> Nexus gear, so we don't actually have any single 10GbE connection,
>>>> as they will all have to be connected to both DCs. This is achieved
>>>> by using VPCs on our Nexus switches.
>>> Provide as much detail as you can - if you're using hw flow control,
>>> whether both links act this way at the same time or independently,
>>> and so-on. Problems like this often boil down to a very small and
>>> seemingly insignificant detail.
>>>
>>> I currently have ixgbe on the operating table for adding X550
>>> support, so I can take a look at this; however I don't have your
>>> type of switches available to me so LACP-specific testing is
>>> something I can't do for you.
>>>
>>> /dale
>> I checked the ixgbe.conf files on each host and they all are still at
>> the standard setting, which includes flow_control = 3;
>> So they all have flow control enabled. As for the Nexus config, all
>> of those ports are still on standard ethernet ports and modifications
>> have only been made globally to the switch.
>> I will now have to pull the one port on one of the hosts out of the
>> aggr and configure it as a standalone port. Then we will see whether it
>> still gets the disconnects/reconnects and finally the negotiation
>> to 1GbE instead of 10GbE. This only ever seems to happen to the same
>> port; I have never seen other ports of the affected aggrs acting up.
>> I also thought I noticed that it was always the "same" physical
>> port, that is, the first port on the card (ixgbe0), but that might of
>> course be a coincidence.
>>
>> Thanks,
>> Stephan
>
> Ok, so we can likely rule out LACP as a generic reason for this issue…
> After removing ixgbe0 from the aggr1, I plugged it into an unused port
> of my Nexus FEX and lo and behold, here we go:
>
> root@tr1206902:/root# tail -f /var/adm/messages
> May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0
> link up, 1000 Mbps, full duplex
> May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0
> link down
> May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0
> link up, 10000 Mbps, full duplex
>
> May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0
> link down
> May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0
> link up, 10000 Mbps, full duplex
>
> So, after less than an hour, we had the first link-cycle on ixgbe0,
> albeit on another switch port, which has no LACP config whatsoever. I
> will monitor this for a while and see whether we get more of those.
>
> Thanks,
> Stephan
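For watching this over a longer period, something like the following could pull the link events out of /var/adm/messages and flag any renegotiation below 10GbE. It is only a minimal sketch: the regular expression is modelled on the few lines quoted above, the log lines are assumed to be unwrapped (one event per line, as in the file itself), and the names `parse_link_events`, `degraded` and the `year` default are made up for illustration.

```python
import re
from datetime import datetime

# Pattern modelled on the mac-layer notices quoted above (an assumption,
# not an exhaustive description of the message format).
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+mac:.*NOTICE:\s+"
    r"(?P<nic>\S+)\s+link\s+(?P<state>up|down)"
    r"(?:,\s+(?P<mbps>\d+)\s+Mbps)?"
)

def parse_link_events(lines, year=2016):
    """Return (timestamp, nic, state, mbps) tuples; mbps is None for 'down'."""
    events = []
    for line in lines:
        m = LINK_RE.search(line)
        if not m:
            continue
        # syslog timestamps carry no year, so one has to be supplied
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        mbps = int(m["mbps"]) if m["mbps"] else None
        events.append((ts, m["nic"], m["state"], mbps))
    return events

def degraded(events, expected_mbps=10000):
    """Link-up events that negotiated below the expected speed."""
    return [e for e in events
            if e[2] == "up" and e[3] is not None and e[3] < expected_mbps]

# The five events from the excerpt above, joined back into single lines.
SAMPLE = [
    "May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex",
    "May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down",
    "May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 10000 Mbps, full duplex",
    "May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down",
    "May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 10000 Mbps, full duplex",
]

events = parse_link_events(SAMPLE)
for ts, nic, state, mbps in degraded(events):
    print(f"{ts} {nic} came up at only {mbps} Mbps")
```

On the excerpt above this flags only the 14:37:17 link-up at 1000 Mbps; the later cycles came back at 10000 Mbps.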
Ehh… and sorry, I almost forgot to paste the log from the Cisco Nexus
switch:
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-SPEED: Interface
Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface
Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface
Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface
Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_UP: Interface
Ethernet141/1/9 is up in mode access
2016 May 11 14:07:29 gh79-nx-01 %ETHPORT-5-IF_DOWN_LINK_FAILURE:
Interface Ethernet141/1/9 is down (Link failure)
2016 May 11 14:07:45 gh79-nx-01 last message repeated 1 time
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-SPEED: Interface
Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface
Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface
Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface
Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_UP: Interface
Ethernet141/1/9 is up in mode access
Apart from the clock, which seems a bit off, you can see that to the
switch this looks as if the cable had simply been pulled and then re-plugged.
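To line the two logs up despite the clock difference, the offset can be estimated by pairing matching events. This is a rough sketch only: the pairing of host and switch events is my assumption (the link-up at 10000 Mbps and the later down/up cycle on the host, matched against the IF_UP and IF_DOWN_LINK_FAILURE entries on the Nexus), the timestamps are copied from the two excerpts, and `clock_offsets` is just an illustrative helper name.

```python
from datetime import datetime

# (state, timestamp) as logged on the host in /var/adm/messages
host_events = [
    ("up",   datetime(2016, 5, 11, 14, 38, 48)),
    ("down", datetime(2016, 5, 11, 15, 24, 55)),
    ("up",   datetime(2016, 5, 11, 15, 25, 10)),
]

# (state, timestamp) as logged on the Nexus for Ethernet141/1/9
switch_events = [
    ("up",   datetime(2016, 5, 11, 13, 21, 22)),
    ("down", datetime(2016, 5, 11, 14, 7, 29)),
    ("up",   datetime(2016, 5, 11, 14, 7, 45)),
]

def clock_offsets(host, switch):
    """Pair events in order and return host-minus-switch time differences."""
    offsets = []
    for (h_state, h_ts), (s_state, s_ts) in zip(host, switch):
        if h_state != s_state:
            raise ValueError(f"state mismatch: host {h_state}, switch {s_state}")
        offsets.append(h_ts - s_ts)
    return offsets

offsets = clock_offsets(host_events, switch_events)
for (state, _), off in zip(host_events, offsets):
    print(f"{state:5s} host - switch = {off}")
```

With these pairs the difference comes out at 1:17:26, 1:17:26 and 1:17:25, so the offset is stable to within a second and the events on both sides are very likely the same link cycles.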
Cheers,
Stephan