[OmniOS-discuss] ixgbe: breaking aggr on 10GbE X540-T2

Wed May 11 13:05:08 UTC 2016

On 11.05.16 at 14:50, Stephan Budach wrote:
> On 11.05.16 at 13:36, Stephan Budach wrote:
>> On 09.05.16 at 20:43, Dale Ghent wrote:
>>>> On May 9, 2016, at 2:04 PM, Stephan Budach <stephan.budach at JVM.DE> 
>>>> wrote:
>>>>
>>>> On 09.05.16 at 16:33, Dale Ghent wrote:
>>>>>> On May 9, 2016, at 8:24 AM, Stephan Budach 
>>>>>> <stephan.budach at JVM.DE> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am seeing strange behaviour where OmniOS omnios-r151018-ae3141d
>>>>>> breaks the LACP aggr link on different boxes when Intel
>>>>>> X540-T2s are involved. It starts with a couple of link
>>>>>> downs/ups on one port, and finally the link on that port
>>>>>> negotiates to 1GbE instead of 10GbE, which then breaks the LACP
>>>>>> channel on my Cisco Nexus for this connection.
>>>>>>
>>>>>> I have tried swapping and interchanging cables, and thus
>>>>>> switch ports, but to no avail.
>>>>>>
>>>>>> Has anyone else noticed this or, even better, knows a solution?
>>>>> Was this an issue noticed only with r151018 and not with previous 
>>>>> versions, or have you only tried this with 018?
>>>>>
>>>>> By your description, I presume that the two ixgbe physical links 
>>>>> will stay at 10Gb and not bounce down to 1Gb if not LACP'd together?
>>>>>
>>>>> /dale
>>>> I have noticed this on prior versions of OmniOS as well, but we
>>>> only recently started deploying 10GbE LACP bonds, when we
>>>> introduced our Nexus gear to our network. I will have to check
>>>> whether both links stay at 10GbE when not configured as an LACP
>>>> bond. Let me check that tomorrow and report back. As we're heading
>>>> for a stretched DC, we are mainly configuring 2-way LACP bonds over
>>>> our Nexus gear, so we don't actually have any single 10GbE
>>>> connection, as they will all have to be connected to both DCs.
>>>> This is achieved by using VPCs on our Nexus switches.
>>> Provide as much detail as you can - if you're using hw flow control, 
>>> whether both links act this way at the same time or independently, 
>>> and so-on. Problems like this often boil down to a very small and 
>>> seemingly insignificant detail.
>>>
>>> I currently have ixgbe on the operating table for adding X550 
>>> support, so I can take a look at this; however I don't have your 
>>> type of switches available to me so LACP-specific testing is 
>>> something I can't do for you.
>>>
>>> /dale
>> I checked the ixgbe.conf files on each host and they are all still at
>> the standard setting, which includes flow_control = 3;
>> So they all have flow control enabled. As for the Nexus config, all
>> of those ports are still standard Ethernet ports, and modifications
>> have only been made globally to the switch.
>> I will now have to yank the one port on one of the hosts from the
>> aggr and configure it as a standalone port. Then we will see whether
>> it still gets the disconnects/reconnects and finally the negotiation
>> to 1GbE instead of 10GbE. This only seems to happen to the same
>> port; I have never seen other ports of the affected aggrs acting up.
>> I also seem to have noticed that it was always the "same" physical
>> port, that is, the first port on the card (ixgbe0), but that might of
>> course be a coincidence.
>>
>> Thanks,
>> Stephan
>
> Ok, so we can likely rule out LACP as a generic reason for this issue…
> After removing ixgbe0 from aggr1, I plugged it into an unused port
> of my Nexus FEX and, lo and behold, here we go:
>
> root@tr1206902:/root# tail -f /var/adm/messages
> May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 
> link up, 1000 Mbps, full duplex
> May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 
> link down
> May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 
> link up, 10000 Mbps, full duplex
>
> May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 
> link down
> May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 
> link up, 10000 Mbps, full duplex
>
> So, after less than an hour, we had the first link cycle on ixgbe0,
> albeit on another switch port, which has no LACP config whatsoever.
> I will monitor this for a while and see whether we get more of those.
>
> Thanks,
> Stephan 

Ehh… and sorry, I almost forgot to paste the log from the Cisco Nexus 
switch:

2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-SPEED: Interface 
Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface 
Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface 
Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface 
Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 13:21:22 gh79-nx-01 %ETHPORT-5-IF_UP: Interface 
Ethernet141/1/9 is up in mode access
2016 May 11 14:07:29 gh79-nx-01 %ETHPORT-5-IF_DOWN_LINK_FAILURE: 
Interface Ethernet141/1/9 is down (Link failure)
2016 May 11 14:07:45 gh79-nx-01 last message repeated 1 time
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-SPEED: Interface 
Ethernet141/1/9, operational speed changed to 10 Gbps
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_DUPLEX: Interface 
Ethernet141/1/9, operational duplex mode changed to Full
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_RX_FLOW_CONTROL: Interface 
Ethernet141/1/9, operational Receive Flow Control state changed to off
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_TX_FLOW_CONTROL: Interface 
Ethernet141/1/9, operational Transmit Flow Control state changed to on
2016 May 11 14:07:45 gh79-nx-01 %ETHPORT-5-IF_UP: Interface 
Ethernet141/1/9 is up in mode access

Apart from the switch clock, which runs about 1h17m behind the host's,
you can see that to the switch this looks as if the cable had simply
been pulled and then re-plugged.
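
For anyone who wants to keep an eye on these link cycles without
watching tail -f, here is a minimal sketch in Python that counts the
ixgbe link events from /var/adm/messages. It assumes the notices sit on
one line each in the real file (the mail client wrapped them above) and
keep exactly the format shown in my paste. The function names, the year
argument, and the 10000 Mbps expectation are my own assumptions, not
anything shipped with OmniOS.

```python
import re
from datetime import datetime

# Matches the mac-layer link notices in the format pasted above.
# Assumption: one event per line, English month names (%b is locale-dependent).
LINK_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ mac: \[ID \d+ kern\.info\] "
    r"NOTICE: (?P<nic>\S+) link (?P<state>up|down)"
    r"(?:, (?P<mbps>\d+) Mbps)?"
)


def parse_link_events(lines, year=2016):
    """Return (timestamp, nic, state, mbps) tuples; mbps is None for 'down'."""
    events = []
    for line in lines:
        m = LINK_RE.match(line.strip())
        if not m:
            continue
        # syslog timestamps carry no year, so it has to be supplied.
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        mbps = int(m["mbps"]) if m["mbps"] else None
        events.append((ts, m["nic"], m["state"], mbps))
    return events


def summarize(events, expected_mbps=10000):
    """Count link-down events, list degraded link-ups, and time each outage."""
    flaps = sum(1 for e in events if e[2] == "down")
    degraded = [e for e in events if e[2] == "up" and e[3] != expected_mbps]
    outages = []
    down_at = None
    for ts, _nic, state, _mbps in events:
        if state == "down":
            down_at = ts
        elif down_at is not None:
            outages.append((ts - down_at).total_seconds())
            down_at = None
    return flaps, degraded, outages


# The five notices from the paste above, re-joined to one line each.
SAMPLE = [
    "May 11 14:37:17 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 1000 Mbps, full duplex",
    "May 11 14:38:35 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down",
    "May 11 14:38:48 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 10000 Mbps, full duplex",
    "May 11 15:24:55 tr1206902 mac: [ID 486395 kern.info] NOTICE: ixgbe0 link down",
    "May 11 15:25:10 tr1206902 mac: [ID 435574 kern.info] NOTICE: ixgbe0 link up, 10000 Mbps, full duplex",
]

if __name__ == "__main__":
    flaps, degraded, outages = summarize(parse_link_events(SAMPLE))
    print("link-down events:", flaps)
    print("link-ups below 10GbE:", len(degraded))
    print("outage durations (s):", outages)
```

To run it against a live box, pass open("/var/adm/messages") to
parse_link_events() instead of SAMPLE. This only looks at a single NIC's
events in order; with several ixgbe ports in one log, the outage timing
would need to be tracked per NIC.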

Cheers,
Stephan