Tuesday, March 13, 2012

Blank NFS Datastore under ESXi - Check your MTU

If you've used ESX/ESXi over NFS long enough, you have probably encountered a situation where you have a connected NFS datastore, but when you browse the datastore, it's blank, or only a few files are showing.

There are many possible reasons for this, and there are plenty of VMware KB articles about troubleshooting such a problem.

Turning on NFS logging is a good idea.

This is the standard VMware troubleshooting document.

One thing that you may not be thinking of is an MTU mismatch.

I had this exact problem tonight, and it unfortunately took a long time to figure out because I suspected FreeBSD was the culprit, due to some struggles I've had with it lately.

I came across this by issuing this command:

vmkping -s 8500 san0

This tells the ping command to send a much larger payload than usual (8500 bytes). My 10GbE network is set up with an MTU of 9000, something I had verified many times over, so I really didn't think this was the issue. I only ran it for the sake of completeness, after realizing this was a serious problem that wasn't going to be solved quickly and that I needed to start a proper documentation trail of what I was doing if I hoped to solve it.
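As a rough sketch of the arithmetic behind picking a ping size: the largest payload that fits in a single unfragmented IPv4 packet is the MTU minus 20 bytes of IP header and 8 bytes of ICMP header. The helper names below are made up for illustration, and the -d (don't fragment) flag is an addition that was not part of the command I actually ran.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_icmp_payload(mtu: int) -> int:
    """Largest ping payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def vmkping_command(host: str, mtu: int) -> str:
    """Build a vmkping command line that tests the full MTU.

    -s sets the payload size; -d sets the don't-fragment bit
    (an assumption added here, not in the original command).
    """
    return f"vmkping -d -s {max_icmp_payload(mtu)} {host}"


if __name__ == "__main__":
    # san0 is the storage host name used in this post
    print(vmkping_command("san0", 9000))
```

Any payload above 1472 bytes is already enough to expose a hop stuck at 1500, which is why 8500 did the job here.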

I won't bore you with the details of my troubleshooting; I'll just get to the point.

It turns out my new Dell M8024-k blade 10GbE switch requires you to set the MTU on the LAG channel group separately from the ports. That's not uncommon, but as far as I can tell, this setting isn't anywhere in the M8024-k GUI; it can only be changed via the CLI.

I guess I've been spoiled by LAGs that set their MTU based on their members' MTU, and when I didn't see a separate LAG MTU setting anywhere in the GUI, I assumed that everything was working fine. After all, pings were working, and they didn't work in the past when I had the wrong MTU applied on different adapters (although that could also be an oddity of the FreeBSD lagg driver).

By default, the Dell M8024-k creates LAG groups with an MTU of 1500.

What's odd is that some connectivity is maintained with this mismatch. You can browse and list some directories, and even see some information. I believe the point where it really breaks is when a single transfer needs more than the 1500-byte limit allows.
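Here is a minimal model of why small requests get through while large ones vanish. It assumes the effective MTU of a path is simply the smallest MTU of any hop, and that an oversized packet is silently dropped at that hop. The hop names and the example sizes are illustrative, not measurements from my network.

```python
def path_mtu(hop_mtus):
    """Effective MTU of a path: the smallest MTU of any hop along it."""
    return min(hop_mtus)


def delivered(packet_size, hop_mtus):
    """True if a packet of this size fits through every hop (simplified)."""
    return packet_size <= path_mtu(hop_mtus)


# Illustrative chain: everything is set to 9000 except the LAG itself
chain = {
    "ESXi vmknic": 9000,
    "switch ports": 9000,
    "switch LAG": 1500,
    "FreeBSD lagg": 9000,
}

if __name__ == "__main__":
    mtus = list(chain.values())
    print("path MTU:", path_mtu(mtus))
    # A default-sized ping or a short directory listing fits
    print("small packet delivered:", delivered(100, mtus))
    # A jumbo-sized packet, such as a large NFS reply, does not
    print("jumbo packet delivered:", delivered(8500, mtus))
```

This matches what I saw: ordinary pings and small listings worked, and anything that filled a jumbo frame never arrived.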

Further oddness: FreeBSD didn't have problems mounting and browsing the directories over NFS that ESXi couldn't browse or list properly. This could be due to FreeBSD connecting via UDP instead of TCP; at this point I'm not sure.

I thought I'd pass this along in case anyone else is forgetting to check every point in their data transmission chain where an MTU mismatch could be hiding.

1 comment:

  1. Thanks for the post. Several have been of use to me since we seem to use similar hardware. I have to ask your opinion of Nexenta and would love some details regarding your primary/secondary SAN solutions.

    Is there any chance we can exchange information? I'm currently running a build that may be of interest to you. I am expanding it soon and would appreciate your input.

    Thanks, and if you're interested, you can reach me at kerberos242 @ gmail . Com