Network issue - Mellanox ConnectX-3 and PCIe Slot

Started by Gothmoth, December 10, 2020, 08:13:59 AM


Gothmoth

hello,

i am running windows 10 1909 64bit.

i bought two 10GbE mellanox connectx-3 cards to directly connect two systems.
everything seemed to go fine.
windows installed drivers and in 10 minutes i had set up a 10GbE network.

until a day later.
macrium reflect was making an automatic backup and i was informed that the verify ended with a hash error.
i ran the backup again.. same result: hash error.

i then did a file copy with verify (using totalcommander).
i copied ~3000 jpg and tif files to the second system.
totalcommander nearly immediately warned me that some of the written files were different.
and indeed when checking the files they showed artefacts or only half of the file was displayed; the bottom parts of the images were only artefacts.
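To see what such damage looks like at the byte level, a source file can be compared with its copy. This is a minimal sketch, not something used in the thread: the helper name `differing_ranges` and the chunk size are assumptions made for illustration. It reports the byte ranges where two files differ, which can show whether the damage starts at a certain offset or comes in fixed-size blocks.

```python
def differing_ranges(path_a, path_b, chunk_size=1 << 20):
    """Return a list of (start, end) byte ranges where the files differ.

    `end` is exclusive. Hypothetical helper for illustration only.
    """
    ranges = []
    offset = 0          # absolute offset of the current chunk
    run_start = None    # start of the current differing run, if any
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            a = fa.read(chunk_size)
            b = fb.read(chunk_size)
            if not a and not b:
                break
            n = max(len(a), len(b))
            if a == b:
                # Identical chunk: close any run that was still open.
                if run_start is not None:
                    ranges.append((run_start, offset))
                    run_start = None
            else:
                for i in range(n):
                    same = i < len(a) and i < len(b) and a[i] == b[i]
                    if same:
                        if run_start is not None:
                            ranges.append((run_start, offset + i))
                            run_start = None
                    elif run_start is None:
                        run_start = offset + i
            offset += n
    if run_start is not None:
        ranges.append((run_start, offset))
    return ranges
```

Calling `differing_ranges("original.jpg", "copy.jpg")` (placeholder file names) returns an empty list for an intact copy. A file that is longer than the other counts as differing over the extra bytes.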

i was pretty shocked that the file transfers, seemingly, went fine when not verifying.
but around 5% of the files were destroyed after copying.
had i not used macrium reflect and verified the backup image, i would not have noticed it that early.

i did a bunch of tests to figure out what the culprit is (cards, cable, hard disk etc.).
i don't want to bore you with that.

in the end i noticed that when i put the card in the lower PCIe slot (the long one at the bottom called PCI4_3) of my asus crosshair hero 7 the network card produces these errors.
files sent from this system to the other system (no matter if the second system uses the internal intel 1GbE via switch or a direct connection to the mellanox 10GbE network card) can end up defective.

when i put the network card into the second x16 slot, beneath the graphics card, the files are transferred without issues.

i have only one graphics card in this system, no other PCIe devices.

the PCIe slot in question is a PCIe 2.0 x4 slot.
it should not share any resources as the other x1 PCIe slots on my mainboard are not used.

the mellanox connectx-3 cards are PCIe 3.0 cards... but they should work in a PCIe 2.0 slot.
the manual says they are compatible with PCIe 2.0 and PCIe 1.0.

i tried with the drivers windows installs and the latest recommended drivers from the mellanox website:

https://i.imgur.com/fnrfokT.jpg

what can cause this?
does anyone have similar experiences?


deanwebb

Well, I'm wondering about the backplane data transfer rates in those slots. 10Gbps is a LOT of throughput, so if the PCI2 slot can't support data transfer rates that high, you'll have the errors. I see a "Super I/O" chip on that diagram and I'm wondering if it's only wired to the second slot and not the third... if so, that would be something you could see for yourself on the motherboard. If that's not it, we're still dealing with other potential bottlenecks for the 10Gbps throughput. I know on systems my company provides, we needed to upgrade the previous hardware platform to get true 10Gbps packet capture. It wasn't just the card, but the chipset behind the card that had to be upgraded to meet that spec.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Gothmoth

hi,

the PCIe 2.0 x4 slot should be good for 4x 500 MB/s. so around 900-1100 MB/s with overhead for the 10GbE card should be no problem.
i have only tested with sata SSDs that cannot deliver more than 540 MB/s.
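The arithmetic behind these numbers can be sketched as follows. The figures are the standard ones for PCIe 2.0 (5 GT/s per lane with 8b/10b encoding) and give the raw rate per direction. Real throughput is lower because of packet headers and other protocol overhead, which this sketch does not model. The function names are made up for illustration.

```python
# PCIe 2.0 signals at 5 GT/s per lane and uses 8b/10b encoding,
# so only 8 of every 10 transferred bits carry data.
PCIE2_TRANSFERS_PER_S = 5_000_000_000
ENCODED_BITS, DATA_BITS = 10, 8

def pcie2_bandwidth_mb_s(lanes):
    """Raw per-direction data rate of a PCIe 2.0 link in MB/s."""
    data_bits_per_s = PCIE2_TRANSFERS_PER_S * DATA_BITS // ENCODED_BITS
    return data_bits_per_s // 8 // 1_000_000 * lanes

def ethernet_line_rate_mb_s(gbit_per_s):
    """Raw Ethernet line rate in MB/s, before any framing overhead."""
    return gbit_per_s * 1_000_000_000 // 8 // 1_000_000

slot = pcie2_bandwidth_mb_s(4)        # the x4 slot in question
link = ethernet_line_rate_mb_s(10)    # one 10GbE port
print(f"PCIe 2.0 x4: {slot} MB/s, 10GbE: {link} MB/s, headroom: {slot - link} MB/s")
```

On these raw numbers the x4 slot has more capacity than one 10GbE port needs, and a SATA SSD at about 540 MB/s stays well below both.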

but even when it is bottlenecked.. it should not transfer defective files, right?
my problem is not the bandwidth but that files which are transferred over the network are defective.
exe files cannot be started, JPG files are destroyed.

is there no safety measure in the TCP protocol that prevents this from happening?
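For context, TCP does carry a 16-bit checksum in every segment. The sketch below shows the ones' complement sum it is based on (the Internet checksum described in RFC 1071). It is simplified: real TCP also sums a pseudo-header with the IP addresses, and the function name is chosen for illustration only.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over `data` (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole 16-bit word
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

payload = b"\x12\x34\x56\x78"
flipped = b"\x12\x35\x56\x78"                # one bit changed
swapped = b"\x56\x78\x12\x34"                # two 16-bit words exchanged

print(hex(internet_checksum(payload)))       # 0x9753
print(internet_checksum(payload) != internet_checksum(flipped))   # True
print(internet_checksum(payload) == internet_checksum(swapped))   # True
```

A single flipped bit is caught, but the checksum is weak (reordered words go unnoticed) and it only protects the data between the point where it is computed and the point where it is checked. If checksum offload is enabled, the sending card computes it over whatever data reached the card, so data damaged on the way to the card would get a valid checksum.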



Otanx

I would guess it is a defective motherboard. If moving the card fixes it, then it isn't the card. Depending on exactly how the files are being transferred over the network, there can be error detection. At the lowest level is the packet checksum. The Mellanox card is probably doing checksum offload, so it is handling the packet checksum. Those are probably all OK. Then it hands the data off to the PCIe bus. Then it would be up to the PCIe bus error checking; after a quick Google, forward error correction is not introduced until the PCIe 6.0 standard, although earlier versions do protect each packet on the link with a CRC and retry. Once past the PCIe bus is when the application will get it. If the application isn't doing error correction/detection, then it will save a corrupt file. Another quick Google search, and it does not look like native Windows copy protocols do any error checking of their own. So that is why it works, but files are bad. Your backup system is doing the check, and giving the hash errors.
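What an application-level check amounts to can be sketched like this. It is a minimal illustration, not how Macrium Reflect or Total Commander actually implement their verify step: the function names are hypothetical and SHA-256 is an arbitrary choice of hash.

```python
import hashlib
import shutil

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_with_verify(src, dst):
    """Copy src to dst, then re-read both and compare their hashes."""
    shutil.copyfile(src, dst)
    src_hash = sha256_of(src)
    dst_hash = sha256_of(dst)
    if src_hash != dst_hash:
        raise IOError(f"verify failed for {dst}: {src_hash} != {dst_hash}")
    return dst_hash
```

One caveat of this sketch: the re-read of the destination may be answered from a cache instead of the disk, so a dedicated verify tool is the more trustworthy check.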

-Otanx

Gothmoth

there is one more thing i have to check.
when testing the card in the bottom slot i screwed it in, because it should stay there.
when using the second x16 slot for testing, i just put the card in but did not screw it down.
i did it this way each time i tested the different slots.

i just noticed that when i screw the card in, the card moves a tiny bit.
because the slot shield (sorry i am german, don't know what it's called in english) is bent a bit.
so the end of the card moves 1-1.5mm up when screwed in.

it seems to sit fine in the PCIe slot but maybe the connection to one or two of the pins is not 100%.

could this be the reason? should i not get more error messages in that case?

will have to run more tests again....

icecream-guy

can you not use the card in the working slot?  or is something else populating that slot?
:professorcat:

My Moral Fibers have been cut.

Gothmoth

yes i could do that.
but this puzzles me and i want to figure out what causes it.

first and foremost i was shocked that the files were transferred with no error message but defective.
i thought that network transfers have safety measures (checksums) that prevent this from happening.

the second thing is that if the slot does not work, i would return the board as defective.
but i first have to be sure it is the board not something else.

but it seems it was, at least in part, my error.

i always make sure that the PCIe cards are firmly in the PCIe slots. i push them in with my thumb.
but i did not realize that, when i screw the card to the case, the card moves a bit because of the network card bracket.
it's only ~1 mm but that seems to cause my problems.

today i have transferred 1.5 TB of data and no defective file yet.
before, it took at most 2-4GB until i had a defective file.

a big 10GB image file was almost always defective after transfer.



EDIT:

i spoke too early.
i just rebooted, did nothing else.

after 200MB the first file was defective.

so a bad connection between the PCIe slot and the network card was not the issue.



deanwebb
