Technical Musings: December 2010

Friday, December 31, 2010

OpenBSD 4.8 vs FreeBSD 8.1 em0 Network Performance

As mentioned in one of my earlier postings, I have been looking into OpenBSD as a possible firewall OS.

I'll post a more opinion-based article shortly on what I think about OpenBSD vs FreeBSD, but for now, I thought I'd report this tidbit of info:

When benchmarking with netperf, OpenBSD 4.8 wasn't as fast as FreeBSD 8.1 until I tweaked some sysctls.

OpenBSD 4.8 - 520.49 Mbits/sec
FreeBSD 8.1 - 941.49 Mbits/sec

I then applied the speed tweaks for OpenBSD found here: https://calomel.org/network_performance.htm, and was then able to match FreeBSD's performance using netperf.

While this was with the em driver (Intel), I believe it would show on other drivers as well, as the sysctls adjusted weren't specific to the em driver.

Conclusion? Make sure you're applying these tuning tips if you're running OpenBSD 4.8.

Friday, December 17, 2010

Building a Ghetto-SAN - Part 1 - Basic Considerations

What is a Ghetto-SAN?

It's a SAN built on the cheap with whatever you can get your hands on.

In my situation, it's not sacrificing any type of reliability or speed; it's simply putting together a lot of parts that may not typically exist in enterprise SAN deployments.

I built a 6 TB FreeBSD/Samba server roughly two years ago when 1.5 TB drives first came out, but we've long since outgrown that. I could build another 12 TB easily enough, but that may not last me the year, and I don't want to start piling more servers around the office. I need consolidation as well as massive storage.

I project that we need at least 24 TB to make it two more years, and 36 TB may be preferable. Some of my data recovery projects are chewing up 3-5 TB and the complex ones involving law firms as clients sometimes mean I need to hold on to that data for 3 months before I can delete it.

I need a SAN, but I can't afford even the "affordable" SANs from EMC, EqualLogic, etc. These things start at $100k for the features and storage density that I need today.

So I decided to build my own, basing it on open source software (thanks guys!), ZFS, SATA drives, and whatever else I needed to make this thing the foundation of our little data centre. I'm flip-flopping on FreeBSD/Solaris/Nexenta as my SAN OS, but that's another topic.

It's been over 6 months since I started building and testing the environment, and I've learned a lot. This is not a task for the faint of heart, nor for those who don't like to do a lot of testing.

I plan on posting more information about what I've learned, my SAN specs, etc. over the next few months, but first I wanted to quickly comment on what I feel are two dead-end paths for some people trying to design their own Ghetto-SANs.

Mistake #1 - Getting hung up with Controller/Disk bandwidth when you don't have the network for it

You need to realize that a 6 drive FreeBSD/ZFS raidz on a 3 year old Core 2 system with Intel's ICH 7 drive chipset, and 4 gig of RAM can saturate a 1 Gb network port - At least mine could.

This was my test bench for early SAN performance stats. It's when I knew I needed to look into faster network performance. (I eventually settled on 10 Gb Ethernet rather than Fibre Channel or InfiniBand. That's also another post.)

If you're not talking about a 10 Gbe network as the backbone for your SAN, don't worry about your drive performance. There's no need to buy separate SAS controllers and SAS drives, and make sure everything is connected end-to-end.

Just get a simple SAS HBA for ~$200, hook it into a SAS backplane (the old 3.0 Gb/sec ones are dropping nicely in price now), and you're going to be able to run that 1 Gb network port into the ground. It will probably also be faster than a basic 4 drive RAID-10 Direct Attached Storage (DAS) setup for a lot of things, thanks to ZFS having a great caching system. It beats the hell out of a PERC6 or H700 4 drive RAID-10 with 300 gig 6.0 Gb/sec SAS drives. You do have more latency, but the throughput is at least double in my tests, which more than makes up for the 100 ms access time. Every VM I've moved to my test SAN has performed better, and "felt" better, than the DAS units.

So you have two 1 Gbe ports that you're going to team? Same story: I bet you'll be running Ethernet into the ground before your disks are starving for I/O.

With 4 Ethernet ports, you may have cause - but you still need to do some testing. I decided on a single SAS 6.0 Gb/s HBA and SAS 6.0 Gb/s backplane for my SAN, and it's working out great for me.

Mistake #2 - Putting all your eggs in one basket

Everything fails. Everything.

If you have just one of anything that you depend on, you need to go read that last line again. I've got over 20 years of experience in IT, and trust me when I say that the only thing you can count on with technology is its eventual failure.

I have dual everything in my data centre: dual UPSes connected to dual battery banks, feeding dual PDUs to servers that all have dual power supplies, so each server is fed from two UPSes in case one of them fails under load when the power cuts out and the generator hasn't started yet. Dual firewalls, dual ESX servers, etc. You get the picture.

Same for your SAN. It's far better to have two cheaper SAN boxes than one more expensive one.

I've gone with a fairly expensive Primary SAN (redundant back planes, power, network, etc). It will be my main go-to box, and will deliver the day-to-day performance that I need to stay sane.

I will also have a Secondary SAN that will be run off an older server that will save my ass if the primary goes down. It won't be very fast at all, will be tight on storage, and low on RAM - BUT it will hold a copy of all my critical data, and a very recent snapshot backup of the main critical servers so I can get my network up and running within 15 minutes in case of failure in the main SAN.

This is completely redundant hardware, done on the cheap - and you need something like this if you care about your data. I just need to stuff more drives into an old case, set up some zfs send/recv and snapshot jobs, and I'm done.
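A minimal sketch of what one such snapshot and send/recv job might look like. The dataset name tank/vmstore, the secondary host san2, and the target backup/vmstore are made-up placeholders, not my actual layout, and the very first run would need a full (non-incremental) zfs send before incrementals like this can work.

```sh
#!/bin/sh
# Hypothetical names: adjust the dataset, host, and target to your own layout.
DATASET="tank/vmstore"            # dataset on the primary SAN
REMOTE="san2"                     # secondary SAN, reachable over ssh
TARGET="backup/vmstore"           # dataset on the secondary SAN
PREV="$1"                         # last snapshot name already on the secondary
NOW="auto-$(date +%Y%m%d-%H%M)"   # name for the new snapshot

# Take a new point-in-time snapshot on the primary.
zfs snapshot "${DATASET}@${NOW}"

# Send only the blocks changed since PREV and apply them on the secondary.
zfs send -i "${DATASET}@${PREV}" "${DATASET}@${NOW}" | \
    ssh "${REMOTE}" zfs receive -F "${TARGET}"
```

Run from cron, something along these lines keeps the secondary only one interval behind the primary; how often to run it depends on how much data you can afford to lose.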

I'm currently lucky - Not losing data is my primary concern. Quick repair is my second, but my clients will tolerate a rare incident that may result in their data access being interrupted for 30 minutes, followed by a day of slow service until we get the primary back on-line. Not everyone may be that lucky, so make sure you take the worst-case scenario into account when you start designing your SAN.

--

I hope to follow up on this more regularly as my SAN timidly transitions from testing into full production. Comments and opinions welcome.

Intel X520 DA 10 Gbe Network Card and FreeBSD 8.2-PRE and 9.0-Current

I love my Intel X520 DA. It's two SFP+ 10 Gbe Network ports on one PCIe 4x card.

Out of the box, it works great with FreeBSD 8.1, even if you ignore Intel's optimization advice.

Currently for $380 a card (Retail close to $500), it's part of my Ghetto-SAN foundation. No need for an expensive 10 Gbe switch when you can afford to put two of these in each machine and make a simple point-to-point SAN network.

However, under FreeBSD 8.2-PRE and 9.0-CURRENT (neither is an official release yet) it won't work properly if you set the card to an MTU of 9000 or higher.

The reason is that it runs out of buffers to handle traffic of that size.

The quick fix is to put these settings into your /etc/sysctl.conf:

hw.intr_storm_threshold=9000
kern.ipc.nmbclusters=262144
kern.ipc.nmbjumbop=262144
kern.ipc.nmbjumbo9=48000

The kicker is the last line - nmbjumbo9. By default FreeBSD allocates 6400, which isn't enough to handle what this card can produce. This isn't some little 100 Mbit card; that's 10,000 Mbit that it needs to push. Expect to require deep buffers.

Remember to check your netstat -m output to see if you are close to exhaustion with nmbjumbo9 at 48000 - Mine was using ~36000 after some load testing.
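As a small sketch of that check: the commands below filter the mbuf statistics down to the jumbo cluster lines and print the configured limits next to them. The grep pattern is just a convenience and assumes the jumbo lines in netstat -m are labelled with the word "jumbo", as they are on the FreeBSD versions I've seen.

```sh
# Show only the jumbo cluster lines (in use, cached, total, max, and denied requests).
netstat -m | grep -i jumbo

# Print the configured limits for comparison.
sysctl kern.ipc.nmbclusters kern.ipc.nmbjumbop kern.ipc.nmbjumbo9
```

If the 9k jumbo "in use" figure is creeping up on the max, or the denied counters are non-zero, the limit is probably still too low for the load.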

Here's some additional information on settings for the card, and drivers, etc.

http://www.intel.com/support/network/sb/CS-025829.htm

http://downloadcenter.intel.com/SearchResult.aspx?lang=eng&ProductFamily=Network+Connectivity&ProductLine=Intel%C2%AE+Server+Adapters&ProductProduct=Intel%C2%AE+Ethernet+Server+Adapter+X520+Series

Wednesday, December 15, 2010

lagg performance penalty between ESXi 4.1 and FreeBSD 8.1

While doing some benchmarks of my ZFS/NFS/ESX setups, I started fiddling with the lagg driver to create a fail-over connection between my ESX server and the ZFS server.

Both boxes have the Intel X520 DA 10 Gbe adapter, which has two 10 Gb ports.

ESXi doesn't offer proper load balancing between two ports unless you get into the Cisco Nexus vswitch, which I currently don't have.
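For reference, a fail-over lagg on the FreeBSD side can be sketched roughly as follows. The interface names ix0 and ix1 and the address are assumptions for illustration; check what your card's ports are actually called with ifconfig before trying it.

```sh
# Assumed interface names (ix0, ix1) and a placeholder address.
ifconfig ix0 up
ifconfig ix1 up

# Create the lagg interface and put both ports into it in failover mode.
# The first laggport listed is the primary; the second takes over if it loses link.
ifconfig lagg0 create
ifconfig lagg0 laggproto failover laggport ix0 laggport ix1 \
    192.168.10.2 netmask 255.255.255.0
```

In failover mode only one port carries traffic at a time, so this buys redundancy rather than extra bandwidth.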

I ran a few quick tests with my usual 2003 32-bit VM and PerformanceTest 6.1, and here's what I found:

Type of Test    With lagg    Without lagg    % Diff
File Server     478.16       491             2.6
Webserver       84.19        86.46           2.6
Workstation     0.52         0.6             13
Database        127.88       136.15          6

Every test was slower with lagg being used.
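The % Diff column appears to be the shortfall relative to the without-lagg figure, rounded. A small sketch that recomputes it from the numbers above (the function name pct_slower is made up for this example):

```sh
# Percentage by which the lagg result trails the non-lagg result.
pct_slower() {
    awk -v w="$1" -v wo="$2" \
        'BEGIN { printf "%.1f\n", (wo - w) / wo * 100 }'
}

pct_slower 478.16 491      # File Server
pct_slower 84.19 86.46     # Webserver
pct_slower 0.52 0.6        # Workstation
pct_slower 127.88 136.15   # Database
```

The same function works for any pair of before/after benchmark numbers where higher is better.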

This is something interesting to think about, as a lot of admins enable lagg or similar failover technology without thinking that there may be a performance impact.

At this stage, I don't know if it's on the lagg side, or ESX's side - I'd have to run further tests. I'll add it to the list of things that I'd like to know if I had time.

But for now, I know that I'll take a small performance hit to have this fail-over setup.

Sunday, December 5, 2010

Netperf and SMP - Oddness (Part 2 of 2)

To continue my previous post: http://christopher-technicalmusings.blogspot.com/2010/12/netperf-and-smp-oddness.html

I was concerned that the poor performance of netperf when it was run on a multi-processor system was due to SMP overhead.

The only way to know for sure would be to run the test with netperf and netserver on separate (but identical) machines. That way I could ensure data was transferring across the network cable, and not being accelerated by any buffer-copy process within the processors or system.

I once again set up my netperf and netserver test, using FreeBSD 8.1 AMD64 on two Dell PowerEdge 1950s.

In one set of tests, I restricted both ends to use a single CPU. In the other I let the system choose what CPU to run on (which results in the process jumping CPU a number of times as it executes the threads on the most opportunistic CPU - as noted with top -P).

The command was:

netperf -H 192.168.88.1 -L 192.168.88.2 -t TCP_STREAM -l 300 -f m

The results show no difference between multi-processor and single processor results. In either test, I saw 941.49 Megabits/sec, regardless of single or multiple CPUs.
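That 941.49 figure is very close to a back-of-the-envelope ceiling for TCP payload on a 1 Gb link, which would suggest both runs were limited by the wire rather than by the CPUs. The sketch below works it out under stated assumptions: a standard 1500-byte MTU, IPv4, and the TCP timestamp option on every segment. The function name is made up for this example.

```sh
# Rough ceiling for TCP payload throughput on 1 Gb Ethernet, in Mbits/sec.
# Assumptions: 1500-byte MTU, IPv4, TCP timestamp option in every segment.
gbe_tcp_ceiling() {
    awk 'BEGIN {
        mtu      = 1500
        headers  = 20 + 20 + 12       # IPv4 + TCP + TCP timestamp option
        overhead = 8 + 14 + 4 + 12    # preamble + Ethernet header + FCS + inter-frame gap
        payload  = mtu - headers      # bytes of user data per frame
        printf "%.2f\n", 1000 * payload / (mtu + overhead)
    }'
}

gbe_tcp_ceiling
```

If the measured number sits right at this ceiling, adding or pinning CPUs can't be expected to change it.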

Which means that there are some serious optimizations for working between 2 network cards in the same physical machine, as seen in my previous tests. The same tests run from one internal NIC to the other result in 4500 - 9000 Mbits/sec, depending on whether it's all through one CPU or across multiple CPUs.

I obviously need to spend more time understanding this, as it would make a large difference when you're considering a firewall application, or anything that needs to move a lot of data between interfaces.

I welcome any insights.

Danger with Dell's H700 RAID card and FreeBSD 8.1-9.0

My new Dell PowerEdge T710 came with a fairly decent Dell PERC RAID card - the H700 with 512M of battery-backed cache. It's an LSI 2108 based solution, and was providing far better performance than their older PERC5 or PERC6 solutions. It is even 6.0 Gb/s.

However - it's still a Dell RAID card, and those are known for their slow speed. What do you expect? They need to protect their profit margins. :-)

Slow speed was something I was prepared to deal with - but that wasn't my issue. Specifically, I could never get a stable system using the mirrors I made with the H700.

After anywhere from a few hours to a day, I'd start to see device timeout errors popping up on the console for one of the RAID mirror devices. I knew something was up, because my NFS connection to this box would die.

Oddly enough, the FreeBSD system was up, responsive, and I could browse the directories. I could not for the life of me get ESX to see my NFS shares on this box anymore. It didn't matter what services I restarted, it was down.

A reboot, and everything would be fine, until the next set of errors on the console. Very frustrating as you can imagine.

After some research, I've found that the LSI drivers are not the best in FreeBSD due to corporate disclosure issues. I found this to be a shame, as I've loved LSI products for years in a Windows environment, and have even used their older chipsets without issue in FreeBSD as well.

If LSI would work with the open source community a bit more, maybe we'd have a stable driver for it. At this stage, I'd say it's not a good idea in a production FreeBSD box.

So I needed a new brand to depend on, and this time around I was going to choose a company that had solid FreeBSD support.

Research returned good opinions about the Areca brand controllers. Not only are they often at the top of the benchmarks against other brands of RAID card, but they have native FreeBSD drivers - source or compiled kernel loadable.

I dove in and purchased an Areca 1880ix-12.

It may be a bit of overkill, but this thing can be expanded to 4 gig of RAM, has a dedicated RJ-45 port in the back for full web management, and it's currently on the top of the SAS 6.0 RAID HBA speed charts.

I also really like the 4 SAS 6.0 ports. I'll explain why in a future blog article when I detail my SAN build.

After a month of running this card, I've yet to have an odd panic, disk issue, or other crash with my SAN. It's rock solid with no complaints.

It's been so long that I don't even have the logs of the original error message that the H700 would throw. Sorry, but you'll see it if you hook one up with FreeBSD. :-)

If anyone gets an H700 running cleanly with a RAID mirror, let me know. I still have my H700; it's destined for eBay or possibly a spare for a Windows-only box.

Friday, December 3, 2010

Netperf and SMP - Oddness (Part 1 of 2)

Netperf seems to still be a fairly standard network performance tool. I see iperf out there as well, and interestingly enough, it generates very different numbers from netperf.

I've decided to go with netperf for my benchmarking needs, and have started running some simple tests with it to become more familiar with its operation.

The first oddness I notice is with SMP. I get wildly different results depending on whether multiple CPUs are used.

Let's start with what I'm running on: FreeBSD 8.1 AMD64 on a Dell 1850 PowerEdge, with 2 Xeon 5340 CPUs. These machines have 2 Intel 1000MT NICs built into the board, to which I have assigned 192.168.88.1 and 192.168.88.2. They run the em driver. I've connected them together with a cross-over cable.

FreeBSD is set up as stock, installed from the CD, no changes.

I'm using cpuset to drive the applications to one CPU or the other. Here's what I'm executing:

cpuset -c -l 0 netserver -L 192.168.88.1 -n 2

cpuset -c -l 0 netperf -H 192.168.88.1 -L 192.168.88.2 -t TCP_STREAM -l 300

By specifying the different IPs, I'm forcing the data to move from one NIC to the other. It's running IPv4, not IPv6.

This combination drives both the receive server (netserver) and the test program (netperf) from the same CPU. If I want to make them run on different CPUs, I'd change one of the -l 0's to -l 1. If I want to leave it up to the kernel to schedule, I leave out the cpuset command entirely.

All hyperthreading is turned off. These are two standalone CPUs.

Here's what I'm getting, expressed in gigabytes per second:

Same CPU: 1.05 GB/sec

Different CPU: 0.47 GB/sec

No Preference: 0.88 GB/sec.

Very interesting..

We're looking at 1/2 speed when we run it on different CPUs. When we don't set a preference for the CPU, it will flip-flop between the two, sometimes both on one, other times separate. The speed for the no preference run falls between the single CPU and dual CPU speeds, closer to the single CPU figure.
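To line these figures up with netperf's usual megabit output, a quick conversion helps. This sketch assumes decimal gigabytes (1 GB = 1000 MB), which may not be exactly how the numbers above were rounded; the function name is made up for this example.

```sh
# Convert gigabytes per second to megabits per second (decimal units assumed).
gb_to_mbit() {
    awk -v gb="$1" 'BEGIN { printf "%.0f\n", gb * 8 * 1000 }'
}

gb_to_mbit 1.05   # Same CPU
gb_to_mbit 0.47   # Different CPU
gb_to_mbit 0.88   # No Preference
```

Either way, all three results are far above what a 1 Gb link can carry, so this traffic clearly isn't being limited by the wire.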

I've researched this online, and found a few other people mentioning similar issues, but the threads never come to a conclusion.

I can think of two reasons for this wide spread between single and dual CPU speeds:

1) Because all the work is happening on one CPU, there is some sort of cache/memory/buffer combining that allows for a faster transfer of data on the PCI bus. Maybe the data isn't transferring - but I do see the little link lights blinking away furiously when I run the tests.

2) There is significant overhead between the processors for SMP.

I do have a second identical PowerEdge 1850 that I plan on bringing into this equation shortly to try and figure out where this is coming from. By sending to a separate machine, I'm going to eliminate the possibility that the CPU is combining something.

However, if you're looking to make a firewall run quickly, it looks from this first small test like a single CPU firewall will outperform a dual. That's an early conclusion, and I'll post more shortly when I know more.

If anyone has more info on this, that would be great.

Continued here:

http://christopher-technicalmusings.blogspot.com/2010/12/netperf-and-smp-oddness-part-1-of-2.html

Thursday, December 2, 2010

Switching to OpenBSD from FreeBSD for the new pf syntax

There is a new syntax available for pf in OpenBSD 4.7 and 4.8 that is quite interesting. You can read a bit more about it here:

http://marc.info/?l=openbsd-misc&m=125181847818600

The item that has me the most interested is the new NAT featureset. They've changed it so when you do NAT in the firewall rules, it appears to change the address on the fly to the new IP.

This makes matching rules further down in the ruleset more interesting, and in my mind, clearer, because as you run through the ruleset, you're not going to be concerned with pass rules for both the WAN and LAN addresses - The LAN address will become the WAN address, so there really is just the one rule.
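To make the difference concrete, here is a rough sketch that writes a tiny test ruleset in the new style and asks pfctl to parse it without loading it. The interface name em0, the LAN range, and the file path are placeholders I've picked for illustration, and the old-style line is shown only as a comment for comparison.

```sh
# Write a small test ruleset (hypothetical interface and network).
cat > /tmp/pf-nat-test.conf <<'EOF'
ext_if  = "em0"              # assumed WAN interface
lan_net = "192.168.1.0/24"   # assumed LAN range

# Old form, for comparison (translation was a separate rule type):
#   nat on $ext_if from $lan_net to any -> ($ext_if)

# New form: translation is an option on a match or pass rule.
match out on $ext_if from $lan_net to any nat-to ($ext_if)

# Rules after the match see the translated source address,
# so a single pass rule on the WAN address covers the NATed traffic.
pass out on $ext_if from ($ext_if) to any
EOF

# Parse only (-n): report syntax errors without touching the running ruleset.
pfctl -nf /tmp/pf-nat-test.conf
```

Checking with -n first is a cheap way to find out how much of an existing pf.conf will need converting before committing to the switch.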

It looks like it will be a while before FreeBSD picks up the new syntax. Current plans are to update FreeBSD 9.0 to OpenBSD 4.5's version of pf. If you want to play with it, you'll need OpenBSD for now.

Since I'm designing a pretty hefty dual redundant firewall from scratch, complete with ALTQ, pfsync, OpenVPN, load balancing, fail-over, and some monitoring tools, I'm firing up an OpenBSD 4.8 box now to check it out, and see if it's really as good as it seems.

BTW, here is a link to a conversion script that should help you convert to the new format:

http://jim-code-rand.blogspot.com/2010/05/openbsd-47-release-pfconf-conversion.html

I'll report back as I make progress. Since I have 2 identical Xeon machines to act as a firewall, I may have a chance to do a small performance test between OpenBSD and FreeBSD. I'm not sure where the advantage will be. I have a lot of faith in FreeBSD, but I also know that ALTQ and pf are a port in FreeBSD, whereas in OpenBSD they are built in and have a few more features.

Time will tell.

If anyone else has recently made the switch, I'd love to hear about it.