HomeLab Adventures: The Expansioning


Supermicro X8DT3

Over the last few months, I have been building out more and more virtual machines in my HomeLab ESX cluster. It's time to expand.

First came a four node Ceph Cluster, then an OpenStack Juno environment, and then an OpenStack Icehouse environment. Plus, I still needed to build out a Docker/Kubernetes test environment. It started to become apparent that I needed to add more capacity. Especially memory.

I figured it was time to dive into my hardware closet and see if I had any decent hardware. Luckily, I was able to find a Supermicro X8DT3 and a couple of Xeon X5550s. I was off to a good start, but what I needed most was RAM. A bit more digging turned up a box of 4GB DIMMs. I needed 12 total… I found exactly that.


Half of the Memory DIMMs Installed

Next, I needed to find a case. I did not feel like dropping $100+ on a new E-ATX case, especially when I did not know if my motherboard actually worked. Luckily, I found a used Cooler Master XM on eBay. $60 shipped was worth the risk. It had hardly even been used, and was in great shape. You can see the original sticker still in place over the dual hot-swap drive bay.


Somebody Donated this Beast to the Salvation Army

I also needed a power supply, but I was not looking forward to dropping $100+ on a massive modular PSU. Instead, I picked up a mid-range 650W power supply off Amazon, along with an 8-pin CPU splitter. 650W should be plenty to push two 90W Xeons. I read somewhere that DDR3 memory is fairly power efficient, at around 2-5 watts per DIMM. I figured I was safe since I was not going to stuff this new box full of disks; I have a FreeNAS box for that.

Once all my parts arrived, I spent an evening cobbling everything together.  Much to my surprise the system booted without issue. A firmware upgrade was in order, so I finished that off in no time flat.


It Lives

A bit more digging through the pile turned up an LSI 8888ELP RAID controller. I plan to run two reclaimed 250GB drives in a RAID 1 config (with a cold spare still in the closet) to give me a bit of local storage, plus an SSD for the OS.

I had the pleasure of working with this beauty on my test bench. It makes system building much more enjoyable.


Next up: install ESXi and add the new machine to the cluster. This will bring my total ESXi server count to three, perfect for a true cluster. All systems have the same motherboard, CPU, and memory configuration.

CEPH: TCP Performance Tuning


Below are a few TCP tunables that I came across while looking into TCP performance tuning for CEPH.

Note that there are two separate sections for 10GE connectivity, so you will want to test with both to find what works best for your environment.

To implement, we just add the settings below to /etc/sysctl.d/99-sysctl.conf and run "sysctl -p /etc/sysctl.d/99-sysctl.conf". The changes are persistent across reboots. Ideally these TCP tunables should be deployed to all CEPH nodes (the OSD nodes most importantly).

## Increase Linux autotuning TCP buffer limits
## Set max to 16MB (16777216) for 1GE
## 32MB (33554432) or 54MB (56623104) for 10GE

# 1GE/16MB (16777216)
#net.core.rmem_max = 16777216
#net.core.wmem_max = 16777216
#net.core.rmem_default = 16777216
#net.core.wmem_default = 16777216
#net.core.optmem_max = 40960
#net.ipv4.tcp_rmem = 4096 87380 16777216
#net.ipv4.tcp_wmem = 4096 65536 16777216

# 10GE/32MB (33554432)
#net.core.rmem_max = 33554432
#net.core.wmem_max = 33554432
#net.core.rmem_default = 33554432
#net.core.wmem_default = 33554432
#net.core.optmem_max = 40960
#net.ipv4.tcp_rmem = 4096 87380 33554432
#net.ipv4.tcp_wmem = 4096 65536 33554432

# 10GE/54MB (56623104)
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104

## Increase the maximum number of queued incoming connections; raise it to handle bursts of requests. Default is 128
net.core.somaxconn = 1024

## Increase number of incoming connections backlog, default is 1000
net.core.netdev_max_backlog = 50000

##  Maximum number of remembered connection requests, default is 128
net.ipv4.tcp_max_syn_backlog = 30000

## Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks, default is 8192
net.ipv4.tcp_max_tw_buckets = 2000000

# Recycle and reuse TIME_WAIT sockets faster, default is 0 for both
# Caution: tcp_tw_recycle breaks clients behind NAT and was removed
# entirely in Linux 4.12
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1

## Shorten the FIN-WAIT-2 timeout for orphaned connections, default is 60 seconds
net.ipv4.tcp_fin_timeout = 10
## Do not reset the congestion window back to the default size for
## TCP connections that have been idle, default is 1 (reset)
net.ipv4.tcp_slow_start_after_idle = 0
# If your servers talk UDP, also raise these limits, default is 4096
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
## Disable sending and accepting ICMP redirects
## Default is 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0

## Disable source routing, default is 0
net.ipv4.conf.all.accept_source_route = 0
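As a quick sanity check on the buffer values above, each byte figure is just its MB label multiplied out. This is plain shell arithmetic, nothing Ceph-specific:

```shell
# 1 MB = 1048576 bytes; confirm the three buffer sizes used above
for mb in 16 32 54; do
  echo "${mb}MB = $((mb * 1024 * 1024)) bytes"
done
# 16MB = 16777216 bytes
# 32MB = 33554432 bytes
# 54MB = 56623104 bytes
```

Those match the 1GE, 32MB, and 54MB sections above, so you can mix and match sizes with confidence.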

Reference here

Salt Minion: ExecStart=/usr/bin/salt-minion (code=exited, status=1/FAILURE)


Ran into this error today when trying to rebuild my CEPH cluster. After removing all CEPH packages I was working through the Red Hat CEPH 1.3 install guide and was running the command below.

# salt '*' state.highstate

The output of the command above indicated that one of the minions was not healthy.

Minion did not return. [Not connected]

So I logged into the problematic minion and attempted to start salt-minion manually.

# systemctl start salt-minion

Salt-minion failed to start and barfed out the errors below.

salt-minion.service - The Salt Minion
Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled)
Active: failed (Result: exit-code) since Wed 2015-08-19 09:17:08 EDT; 35min ago
Process: 841 ExecStart=/usr/bin/salt-minion (code=exited, status=1/FAILURE)
Main PID: 841 (code=exited, status=1/FAILURE)
CGroup: /system.slice/salt-minion.service

Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: File "/usr/lib/python2.7/site-packages/salt/payload.py", line 204, in send_auto
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: return self.send(enc, load, tries, timeout)
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: File "/usr/lib/python2.7/site-packages/salt/payload.py", line 196, in send
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: return self.serial.loads(self.socket.recv())
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: File "/usr/lib/python2.7/site-packages/salt/payload.py", line 95, in loads
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: return msgpack.loads(msg, use_list=True)
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: File "msgpack/_unpacker.pyx", line 142, in msgpack._unpacker.unpackb (msgpack/_unpacker.cpp:142)
Aug 19 09:17:08 osd01.lab.localdomain salt-minion[841]: msgpack.exceptions.ExtraData: unpack(b) received extra data.
Aug 19 09:17:08 osd01.lab.localdomain systemd[1]: salt-minion.service: main process exited, code=exited, status=1/FAILURE
Aug 19 09:17:08 osd01.lab.localdomain systemd[1]: Unit salt-minion.service entered failed state.

After a bit of Googling, I was able to figure out that the issue was related to the local Salt cache files. So I took the scorched-earth approach and removed them. Note that this was performed on the minion, osd01.

# cd /var/cache/salt/
# rm -Rf minion

Now let's restart salt-minion.

# systemctl restart salt-minion

And check its status.

# systemctl status salt-minion
salt-minion.service - The Salt Minion
Loaded: loaded (/usr/lib/systemd/system/salt-minion.service; enabled)
Active: active (running) since Wed 2015-08-19 09:59:06 EDT; 6s ago
Main PID: 2501 (salt-minion)
CGroup: /system.slice/salt-minion.service
├─2501 /usr/bin/python /usr/bin/salt-minion
└─2729 /usr/bin/python /usr/bin/salt-minion

Aug 19 09:59:06 osd01.lab.localdomain systemd[1]: Starting The Salt Minion…
Aug 19 09:59:06 osd01.lab.localdomain systemd[1]: Started The Salt Minion.

Excellent, issue resolved. Apparently other issues, such as mismatched Salt RPMs between the master and the minion, can also cause this error; however, that was not the case for me.
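For reference, the scorched-earth steps above can be wrapped into a tiny script run on the minion. This is just a sketch; the cache path is the default /var/cache/salt/minion, and the CACHE override exists only so the script can be exercised safely outside a real minion:

```shell
#!/bin/sh
# Wipe the local Salt minion cache so it is rebuilt on the next start.
# CACHE defaults to the standard minion cache path; override for testing.
CACHE="${CACHE:-/var/cache/salt/minion}"
rm -rf "$CACHE"
mkdir -p "$CACHE"
echo "cleared $CACHE"
# On the real minion, follow up with: systemctl restart salt-minion
```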

CEPH: How to Restart an Install, or How to Reset a Cluster

Hey, are you installing CEPH in your test lab and screwed it up? Or has something gone wrong with your test cluster and you want to start over? Well, the instructions on how to do this are below.

Note that most of this process is actually well documented, but I added a few extra steps that are good to run if you want to start from scratch without having to reinstall the OS.

First, remove all CEPH RPMs from your CEPH hosts; this includes monitor nodes and OSD nodes. Note that I am in /root/ceph-deploy on my monitor/admin server. If you have separate admin and monitor nodes, then run these commands from your admin node.

# ceph-deploy purge mon01 osd01 osd02 osd03

Now purge all config files.

# ceph-deploy purgedata mon01 osd01 osd02 osd03

Now remove all keys.

# ceph-deploy forgetkeys

Now delete any remaining files or keys from /root/ceph-deploy. If there are any files in here that you want to keep, just copy them to /root.

# rm -rf /root/ceph-deploy/*

Now remove Calamari. Note that in my test lab I am running Calamari on my one MON node, which also happens to be my admin node, so I run this command there.

# yum remove ceph-deploy calamari-server calamari-clients

You can now start over. I am installing Red Hat Ceph 1.3, so I am using the instructions here.

You might also want to uninstall Salt from your servers. If that's the case, just look for any of the packages below and rip them out. Your versions will vary.

  • salt-2015.5.3-4.el7.noarch
  • salt-minion-2015.5.3-4.el7.noarch
  • salt-master-2015.5.3-4.el7.noarch
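Since the exact version strings vary, you can just grep the installed package list for anything starting with salt. Here is a sketch against a sample `rpm -qa` style listing (the kernel line is a made-up placeholder to show that non-Salt packages get filtered out):

```shell
# Filter Salt packages out of an rpm -qa style listing.
# On a real host: rpm -qa | grep '^salt'   (then feed the result to yum remove)
printf '%s\n' \
  'salt-2015.5.3-4.el7.noarch' \
  'kernel-3.10.0-229.el7.x86_64' \
  'salt-minion-2015.5.3-4.el7.noarch' \
  'salt-master-2015.5.3-4.el7.noarch' | grep '^salt'
```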

Ceph: Simple Ceph Pool Commands for Beginners


CEPH is a very well-documented technology. Just check out the documentation at ceph.com. Pretty much everything that you want to know about CEPH is documented there. However, this also means that you sometimes need to dig around just to remember a few simple commands.

Because of this, I have decided to put a few notes below on creating, deleting, and working with CEPH pools.

List Pools

You can list your existing pools with the command below. In this example I only have one pool, called rbd, with a pool number of 13.

[root@mon01 ~]# ceph osd lspools
13 rbd,

Create a Pool

OK, so now let's create a test pool, aptly named test-pool. I will create it with 128 PGs (placement groups).
[root@mon01 ~]# ceph osd pool create test-pool 128
pool 'test-pool' created

Placement Groups

Note, however, that when I run my script called pg_per_osd.bash, I see my new pool (number 14) was actually created with 384 PGs. What?
[root@mon01 ~]# ./pg_per_osd.bash
dumped all in format plain

pool :  13      14      | SUM
osd.26  48      36      | 84
osd.27  30      43      | 73
osd.19  41      38      | 79
osd.20  45      37      | 82
osd.21  42      53      | 95
osd.22  43      39      | 82
osd.23  50      38      | 88
osd.24  35      51      | 86
osd.25  50      49      | 99
SUM :   384     384     |

What has occurred is that my new pool (test-pool) was created with 128 PGs; however, replica PGs were also created based on the global setting (osd_pool_default_size = 3) in /etc/ceph/ceph.conf.
We can verify that this is the case by taking a closer look at our new pool, test-pool.
Here we can verify that the pool was actually created with 128 PGs.

[root@mon01 ~]# ceph osd pool get test-pool pg_num
pg_num: 128
We can also check the pool’s setting to see how many replicas will be created. Note that the number is 3. Multiply 128 PGs by 3 replicas and you get 384.
[root@mon01 ~]# ceph osd pool get test-pool size
size: 3

You can also take a sneak peek at the minimum number of replicas that a pool can have before running in a degraded state.
[root@mon01 ~]# ceph osd pool get test-pool min_size
min_size: 2
Other get/set commands are listed here.
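The replica math above is worth double-checking whenever PG counts look inflated; the total number of PG copies is just pg_num times the pool's size:

```shell
# pg_num from `ceph osd pool get test-pool pg_num`,
# size from `ceph osd pool get test-pool size`
pg_num=128
size=3
echo "total PG copies: $((pg_num * size))"
# total PG copies: 384
```

That 384 matches the SUM row reported by pg_per_osd.bash, so nothing is actually wrong with the pool.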

Deleting A Pool

So this is a fun one, as you have to be very specific to delete a pool, going as far as typing out the pool name twice and including an option that would be very hard to accidentally type (however, be careful with the up arrow).
[root@mon01 ~]# ceph osd pool delete test-pool test-pool --yes-i-really-really-mean-it
pool 'test-pool' removed

Ceph: Show OSD to Journal Mapping


In Ceph, when you create an OSD (Object Storage Device) you also need to create its journal, which is where data is initially written before it is flushed to the OSD. Note that to maximize I/O it is suggested to use SSD drives for the journal partitions backing your OSDs (see this link for reference).

So this is exactly what I did. I basically followed the instructions here regarding the creation of OSDs and Journals.

However post-deployment, I wanted to verify that my journal partitions were actually created properly and were being used as expected. That was a little bit tougher to figure out.

First you need to SSH directly to one of your OSD servers; this command cannot be run from the monitor/admin node.

[root@osd01 ceph-20]# ceph-disk list
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sda :
/dev/sda1 other, xfs, mounted on /boot
/dev/sda2 other, LVM2_member
/dev/sdb :
/dev/sdb1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
/dev/sdd :
/dev/sdd1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.21, journal /dev/sde3
/dev/sde :
/dev/sde1 ceph journal, for /dev/sdb1
/dev/sde2 ceph journal, for /dev/sdc1
/dev/sde3 ceph journal, for /dev/sdd1

In the output above you can see that I have three OSD data partitions (sdb1, sdc1, sdd1), that my journal disk (sde) has three partitions, and how each OSD is mapped to its own SSD journal partition.
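If you only want the OSD-to-journal pairs out of that listing, a little awk does the trick. This sketch feeds in the sample "ceph data" lines from above; on a real OSD node you would pipe `ceph-disk list` in instead:

```shell
# Pull "osd.N -> journal device" pairs out of ceph-disk list output.
printf '%s\n' \
 '/dev/sdb1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1' \
 '/dev/sdc1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2' \
 '/dev/sdd1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.21, journal /dev/sde3' |
awk '/ceph data/ { for (i = 1; i <= NF; i++) if ($i ~ /^osd\./) { osd = $i; sub(/,$/, "", osd) } print osd, "->", $NF }'
# osd.19 -> /dev/sde1
# osd.20 -> /dev/sde2
# osd.21 -> /dev/sde3
```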

Ceph: Show Placement Group Totals by OSD


Note that I did not write this scriptlet, nor do I claim to have written it. However, I did want to make sure that I did not lose the link to such a handy command.

The original can be found here, plus the original article has links to several more useful URLs, so feel free to check it out.


Anyway, now that we have got that out of the way, here is the script.

ceph pg dump | awk '
 /^pg_stat/ { col=1; while($col!="up") {col++}; col++ }
 /^[0-9a-f]+\.[0-9a-f]+/ { match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
 up=$col; i=0; RSTART=0; RLENGTH=0; delete osds; while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
 for(i in osds) {array[osds[i],pool]++; osdlist[osds[i]];}
 }
 END {
 printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
 for (i in poollist) printf("--------"); printf("----------------\n");
 for (i in osdlist) { printf("osd.%i\t", i); sum=0;
 for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; poollist[j]+=array[i,j] }; printf("| %i\n",sum) }
 for (i in poollist) printf("--------"); printf("----------------\n");
 printf("SUM :\t"); for (i in poollist) printf("%s\t",poollist[i]); printf("|\n");
 }'

This gives you a nice tabular output of placement group counts per OSD, per pool, like the pg_per_osd.bash output shown in the pool post above.