Windows 2012 file copy 0 bytes/s

Started by Dieselboy, April 04, 2017, 07:29:45 AM

Dieselboy

I noticed an issue when transferring a file from a network drive, located on a Windows 2012 file server, to the desktop of another Windows 2012 server. (I've had some weird problems come up lately!) The file copy would start, then the speed would drop to 355kbps and then to 0kbps for ages. Then it would start again and repeat. I found an MTU issue (I hadn't done this for a while, so it took me an hour or so to re-learn protocol sizes and packet construction  :XD:). After fixing the MTU issue the problem was better but still there. I took a whole bunch of packet captures and can't really see anything wrong there. The MTU issue is fixed, though.

TL;DR: I googled the symptom and found others complaining that SMB3 copies sometimes have this problem. As a test I disabled SMB2/3, and the connection was then identified as CIFS by the Riverbed (PS: the issue is there whether the traffic goes through the Riverbed or not). This copy proceeded fine. A Windows 7 VM had no issue either.

I don't really want to leave SMB2/3 turned off. Has anyone else seen this? I have some digging to do tomorrow, I reckon.

wintermute000

Turn it back on and explicitly bypass the RB to double check.
Then take dumps and do a packet capture on your RB, then call in RB TAC; this should be bread and butter stuff.

I've seen that exact symptom but with WaaS and a huge global multi-tier DMVPN network, once I brute-forced all tunnels to 1400 MTU / tcp-adjust mss and made sure no ICMP blocking thingys were in the path, problem solved.


Out of curiosity, what was your specific MTU issue and fix?

Dieselboy

I ran tests bypassing the RB and the issue was the same. Speed dropped to 0kbps. Running a "ping 192.168.7.233 -l 1394 -f -t" showed zero packet loss during this time. This ping is to the file server, with maximum data size, setting the DF bit and making it continuous (as you know but mentioning here for the sake of it).

The MTU issue was that the RB inpath interface was left at the default 1500 bytes, but my VTI tunnel is capable of 1422. I had set the MTU on the tunnel to 1400 and tcp adjust mss to 1330, but I don't remember why I'd done that, probably to investigate a similar issue in the past. So tcp adjust mss would have negotiated a size to help stop fragmentation, but it wasn't optimal.

So now with the correct 1422 MTU and tcp adjust mss of 1394, AND the Riverbed set for 1422 inpath MTU, things should be optimal. There's probably no real noticeable effect, but still.
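
For reference, a rough sketch of the header arithmetic behind these sizes, assuming plain IPv4 with no IP or TCP options (20-byte IP header, 20-byte TCP header, 8-byte ICMP header). The function names are made up for illustration. Under those assumptions 1394 is the largest unfragmented ping payload for a 1422 MTU, while the largest TCP MSS would work out 12 bytes lower, so the 1394 adjust-mss value may be worth double checking.

```python
# Header sizes assumed here: IPv4 without options, TCP without options.
IPV4_HEADER = 20  # bytes
TCP_HEADER = 20   # bytes
ICMP_HEADER = 8   # bytes (echo request)


def max_ping_payload(mtu: int) -> int:
    """Largest `ping -l` size that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def max_tcp_mss(mtu: int) -> int:
    """Largest TCP payload per segment that fits in one IPv4 packet."""
    return mtu - IPV4_HEADER - TCP_HEADER


tunnel_mtu = 1422  # tunnel MTU quoted in the post
print("max ping payload:", max_ping_payload(tunnel_mtu))
print("max TCP MSS:     ", max_tcp_mss(tunnel_mtu))
```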

Captures taken without optimisation show what initially looks like normal traffic. What I do see, though, is that the window size never really goes above 2000 bytes. There are no retransmissions at all, but the window size goes from around 1850 up to around 1950 and then drops back down into the 1800s, occasionally going over 2000. After some retransmissions, the window size doesn't really change much either, only dropping to around 1850 again.

I worked out that (2000 bytes x 8) / 0.1s average latency = ~160kbps which is about what I'm seeing TBH.
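
The same estimate as a small sketch, assuming one full window is delivered per round trip and nothing else limits the transfer. The function name and the 65535-byte comparison window are hypothetical, added only to show how strongly the ceiling depends on window size.

```python
def window_limited_bps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput when one window is sent per round trip."""
    return window_bytes * 8 * 1000 / rtt_ms


rtt_ms = 100  # average latency quoted in the post

# 2000 bytes is the window seen in the capture; 65535 is only a comparison.
for window in (2000, 65535):
    kbps = window_limited_bps(window, rtt_ms) / 1000
    print(f"window {window} bytes -> about {kbps:.0f} kbps")
```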

I can replicate the exact same test through HTTP (I have a company dropbox-like web server which is a web front end to the back end network drives). Using Google Chrome as the HTTP client I see faster continuous transfer up to around 400KB/s (unfortunately this is during business hours so is QoS'd below voice and video) but it never just stops transferring. The same test outside of the VPN is a tiny bit faster and I'm putting that down to the VPN being constrained by MTU and VPN overhead.

Confusingly, running the PowerShell command "Set-SmbServerConfiguration -EnableSMB2Protocol $false", which disables SMB2 and 3 on the file server, gives a better experience in that the transfer bitrate is more constant, similar to the HTTP test.
So to verify, I disabled SMB3 (and 2) and re-ran the same test while taking a packet capture. This time I see the bitrate peak at over 600KBps. Looking at the capture, first glance shows "win=132096" on the first ACK from the client.

Running the same problematic test (SMB3 enabled) and making sure I capture the very start of the communication SYN/ACK, it takes forever just browsing the network share directory. Eventually, I manage to open the folder and start to copy and find the experience is still slow but reported bit rate is a lot better on this test going up to almost the 5Mbps maximum over 100ms. The packet capture shows that the window size is showing 132096 initially, and even going up to 405760 (which is not the same as the initial problematic capture).
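
To put those window sizes in context, here is a sketch of the bandwidth-delay product, i.e. roughly how many bytes need to be in flight to fill the link. It assumes the 5Mbps and 100ms figures above; the function name is hypothetical. Under that assumption the roughly 2000-byte window from the first capture is far too small to fill the pipe, while 132096 and 405760 are both more than enough.

```python
def bdp_bytes(link_bps: int, rtt_ms: int) -> float:
    """Bandwidth-delay product: bytes in flight needed to keep the link full."""
    return link_bps * rtt_ms / 8000


needed = bdp_bytes(5_000_000, 100)  # 5 Mbps link, 100 ms round trip
print("window needed to fill the link:", needed, "bytes")

# Window sizes observed in the captures described in the post.
for window in (2000, 132096, 405760):
    verdict = "enough" if window >= needed else "too small"
    print(f"window {window}: {verdict}")
```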

Weird.





Dieselboy

Moved thread to a more suitable place.

I am not sure yet how to determine exactly what the protocol is doing in terms of multiple "channels". I am wondering if the packets are arriving out of order across the VPN and this is causing the problem. But I'm not sure if multiple channels are being used yet.

I can / will do more tests with different client / servers over the same VPN tunnel to see if I can see a pattern.

deanwebb

Keep in mind that Microsoft is an 800-pound gorilla. It will implement protocols as it pleases and it is the job of everyone else to change behavior to support their implementation, in their view.

With that in mind, looks like this protocol wants lots of NICs. See if you can set up a 2-NIC client and a 2-NIC server and see if the transfer goes much better than before. If not, bypass Riverbed and check.
Take a baseball bat and trash all the routers, shout out "IT'S A NETWORK PROBLEM NOW, SUCKERS!" and then peel out of the parking lot in your Ferrari.
"The world could perish if people only worked on things that were easy to handle." -- Vladimir Savchenko
Вопросы есть? Вопросов нет! | BCEB: Belkin Certified Expert Baffler | "Plan B is Plan A with an element of panic." -- John Clarke
Accounting is architecture, remember that!
Air gaps are high-latency Internet connections.

Dieselboy

I was trying to find out how it works with a single-NIC client and a dual-NIC server first. How would each of them know the other had dual NICs?

deanwebb

Quote from: Dieselboy on April 05, 2017, 08:28:50 AM
I was trying to find out how it works with a single-NIC client and a dual-NIC server first. How would each of them know the other had dual NICs?

They likely pass parameters at the start of the session... but if the client/server aren't set right, do they pass parameters that reflect reality or parameters as coded on the box, even if they screw things up? I'm guessing the latter. The software may not be able to autodetect settings, so it relies on how they're coded in the config.

Dieselboy

I gave the server a second NIC and left it on DHCP. Netstat now shows:


C:\Users\monkey>netstat -an | find "192.168.33.234"
  TCP    192.168.7.104:445      192.168.33.234:59462   ESTABLISHED
  TCP    192.168.7.233:445      192.168.33.234:59459   ESTABLISHED


Client is 192.168.33.234 (Win 2012).
Server is 192.168.7.233 (also Win 2012).
The second NIC is on DHCP at 192.168.7.104.
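
A small sketch for pulling the same information out of saved netstat output: it collects the distinct server-side addresses that have an established port 445 session from one client. The function name is hypothetical and the parsing assumes the four-column "netstat -an" layout shown above. More than one address is consistent with SMB using both NICs, though on its own it does not prove multichannel is in play.

```python
def smb_server_addresses(netstat_text: str, client_ip: str) -> set:
    """Return local IPs with an ESTABLISHED port 445 session from client_ip."""
    servers = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Expect: protocol, local address, remote address, state
        if len(fields) != 4 or fields[0] != "TCP" or fields[3] != "ESTABLISHED":
            continue
        local_ip, _, local_port = fields[1].rpartition(":")
        remote_ip = fields[2].rpartition(":")[0]
        if local_port == "445" and remote_ip == client_ip:
            servers.add(local_ip)
    return servers


# The two lines from the netstat output in the post.
sample = """\
  TCP    192.168.7.104:445      192.168.33.234:59462   ESTABLISHED
  TCP    192.168.7.233:445      192.168.33.234:59459   ESTABLISHED
"""
print(sorted(smb_server_addresses(sample, "192.168.33.234")))
```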

All I did was add an additional NIC. I then re-tried the file copy (bypassing the Riverbed) and throughput was pretty high, almost up to the bandwidth of the internet pipe / VPN, which is slightly under 10Mbps. But I still see the bitrate drop to 0 and then pick up again, so I would like to iron out what that is.

I also ran this command on the file server and noticed an increase in throughput (prior to adding the 2nd nic): "Set-SmbClientConfiguration -EnableBandwidthThrottling $false"

So when I'm making changes, I can see performance increases when running the tests. Peak bit rate increases. But there's still this issue with SMB3 dropping to 0bytes/s.

...Back to Wireshark.

deanwebb

I see the same thing on my file transfers. Big ramp up, build a lot of momentum and then... phud. Zero. The file transfer operation looks like a set of camel humps. If I use SCP to a Linux host, the rate remains constant through the transfer.

So what causes the OS to determine that it needs to throttle back?

deanwebb

So I do some research...

https://support.symantec.com/en_US/article.TECH41876.html shows it's been an issue since XP, and that switches and routers can have SMB throttling commands on them.

http://serverfault.com/questions/41341/throttling-network-speed-when-copying-files-to-smb-mounted-nas-drive says SMB is even an issue on Linux, both on the client and server in this case. Recommendation there was to use rsync...

http://www.aidanfinn.com/?p=15262 talks about the SMB Bandwidth Limit...

Dieselboy

Dean, you see this problem too?
I need to verify, but I *may* have seen this with SMB 2.1 as well. I just need to confirm that the protocol used was SMB 2.1.

UPDATE: I just came back here to say that this morning (after thinking about it for a while) I set the client-side Steelhead to down-negotiate SMB2/SMB3 to SMB1. The Riverbed is now picking this up as CIFS and the bitrate is constant-ish at around the 8.5Mbps capacity of the VPN tunnel (or more).

As far as I know, I don't have any switch config to throttle SMB. I had just noticed the setting being listed there in PowerShell.

I ran iperf tests yesterday just for peace of mind. Pass-through tests (not optimised by the Riverbed) get around the expected 8.5Mbps. Tests optimised by the Riverbed get up to 650Mbps over the same VPN. All looks good there.

Dieselboy

Dean, one thing I will mention is that when Windows reports the bitrate has dropped to 0kBps, I cannot find any lull in packets in either direction, from the server or the client, during the time this happens. The 0kbps lasts for a few seconds.

I haven't done this yet, but I may re-install Wireshark on the Windows client, set the filter to only show the IP address of the server I'm downloading from, and see what is happening in terms of packets arriving at the point the bitrate goes to zero. I'm now wondering if Windows is lying to me about the bitrate. The SMB1 copy looks good in the window but it might be slower than the broken SMB3.
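
One way to check whether the wire really goes quiet is to bucket the captured bytes per second and look for empty buckets. This is only a sketch: it assumes the capture has already been exported as (timestamp in seconds, frame length in bytes) pairs, the function names are hypothetical, and the sample data is made up.

```python
from collections import Counter


def bytes_per_second(packets):
    """Sum frame lengths into one-second buckets keyed by whole seconds."""
    buckets = Counter()
    for timestamp, length in packets:
        buckets[int(timestamp)] += length
    return buckets


def stalled_seconds(packets):
    """Whole seconds between first and last packet that carried no bytes."""
    buckets = bytes_per_second(packets)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [s for s in range(first, last + 1) if buckets.get(s, 0) == 0]


# Made-up sample: traffic in seconds 0 and 1, nothing until second 4.
sample_packets = [(0.1, 1400), (0.6, 1400), (1.2, 1400), (4.3, 1400)]
print("bytes per second:", dict(bytes_per_second(sample_packets)))
print("seconds with no traffic:", stalled_seconds(sample_packets))
```

If the list of stalled seconds is empty while the copy dialog shows 0, that would point at the display rather than the network.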

deanwebb

Quote from: Dieselboy on April 07, 2017, 02:46:31 AM
Dean, one thing I will mention is that when Windows reports the bitrate has dropped to 0kBps, I cannot find any lull in packets in either direction, from the server or the client, during the time this happens. The 0kbps lasts for a few seconds.

I haven't done this yet, but I may re-install Wireshark on the Windows client, set the filter to only show the IP address of the server I'm downloading from, and see what is happening in terms of packets arriving at the point the bitrate goes to zero. I'm now wondering if Windows is lying to me about the bitrate. The SMB1 copy looks good in the window but it might be slower than the broken SMB3.

Your last line there reminded me of one of the first things I heard when supporting Windows 95: "Windows lies."

:vendors:

Problem is, that cute little graphic showing speed is how we psychologically determine if something is fast or slow. It goes to zero on a 40Gb line, we howl. It shows OC128 throughput on a 9600 baud line, we think we're going like gangbusters and don't complain about nothing... until it looks like the system may have crashed, since that file should have downloaded by now...