CEPH: TCP Performance Tuning


Below are a few TCP tunables that I ran into when looking into TCP performance tuning for CEPH.

Note that there are two separate sections for 10GE connectivity, so you will want to test with both to find what works best for your environment.
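One way to reason about which maximum to start testing with is the bandwidth-delay product (link speed multiplied by round-trip time). The sketch below is only illustrative arithmetic: the helper name bdp_bytes and the round-trip times are assumptions, not measurements from my cluster.

```shell
# bdp_bytes LINK_GBIT RTT_MS -> bandwidth-delay product in bytes
# bytes = (Gbit/s * 10^9 / 8) * (ms / 1000) = Gbit/s * ms * 125000
bdp_bytes() {
    echo $(( $1 * $2 * 125000 ))
}

# Example round-trip times (assumed) that roughly fill each maximum:
#   bdp_bytes 1 134   -> 16750000  (close to the 16MB setting)
#   bdp_bytes 10 27   -> 33750000  (close to the 32MB setting)
#   bdp_bytes 10 45   -> 56250000  (close to the 54MB setting)
```

Measure the real round-trip time between your nodes (with ping, for example) and plug it in to see how much headroom each setting gives you.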

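To implement, we just add what is below to /etc/sysctl.d/99-sysctl.conf and run “sysctl -p /etc/sysctl.d/99-sysctl.conf” (or “sysctl --system” to reload every sysctl file; plain “sysctl -p” reads only /etc/sysctl.conf). Changes are persistent across reboots. Ideally these TCP tunables should be deployed to all CEPH nodes (OSD nodes most importantly).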

## Increase Linux autotuning TCP buffer limits
## Set max to 16MB (16777216) for 1GE
## 32MB (33554432) or 54MB (56623104) for 10GE

# 1GE/16MB (16777216)
#net.core.rmem_max = 16777216
#net.core.wmem_max = 16777216
#net.core.rmem_default = 16777216
#net.core.wmem_default = 16777216
#net.core.optmem_max = 40960
#net.ipv4.tcp_rmem = 4096 87380 16777216
#net.ipv4.tcp_wmem = 4096 65536 16777216

# 10GE/32MB (33554432)
#net.core.rmem_max = 33554432
#net.core.wmem_max = 33554432
#net.core.rmem_default = 33554432
#net.core.wmem_default = 33554432
#net.core.optmem_max = 40960
#net.ipv4.tcp_rmem = 4096 87380 33554432
#net.ipv4.tcp_wmem = 4096 65536 33554432

# 10GE/54MB (56623104)
net.core.rmem_max = 56623104
net.core.wmem_max = 56623104
net.core.rmem_default = 56623104
net.core.wmem_default = 56623104
net.core.optmem_max = 40960
net.ipv4.tcp_rmem = 4096 87380 56623104
net.ipv4.tcp_wmem = 4096 65536 56623104

## Increase the listen queue for incoming connections. The value can be raised to absorb bursts of requests, default is 128
net.core.somaxconn = 1024

## Increase the backlog of incoming packets queued when an interface receives them faster than the kernel can process them, default is 1000
net.core.netdev_max_backlog = 50000

## Maximum number of remembered connection requests, default is 128
net.ipv4.tcp_max_syn_backlog = 30000

## Increase the tcp-time-wait buckets pool size to prevent simple DOS attacks, default is 8192
net.ipv4.tcp_max_tw_buckets = 2000000

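## Recycle and reuse TIME_WAIT sockets faster, default is 0 for both
## Note: tcp_tw_recycle breaks clients behind NAT and was removed in Linux 4.12, so drop that line on newer kernels
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1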

## Decrease the time orphaned connections stay in FIN_WAIT_2, default is 60 seconds
net.ipv4.tcp_fin_timeout = 10

## Tells the system whether it should fall back to the default window size for TCP connections
## that have been idle for too long, default is 1
net.ipv4.tcp_slow_start_after_idle = 0

## If your servers talk UDP, also up these limits, default is 4096
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192

## Disable ICMP redirects
## Default is 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.all.accept_redirects = 0

## Disable source routing, default is 0
net.ipv4.conf.all.accept_source_route = 0
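
After loading the file it is worth confirming that the kernel actually took every value. Below is a minimal sketch of one way to do that. The function names sysctl_conf_pairs and check_sysctl are my own (hypothetical), and the check assumes the file uses the plain "key = value" layout shown above.

```shell
# Print "key=value" for every active (uncommented) setting in a sysctl file.
# Whitespace around "=" is stripped and runs of blanks inside a value collapse to one space.
sysctl_conf_pairs() {
    awk '
        /^[ \t]*[#;]/ || /^[ \t]*$/ { next }   # skip comments and blank lines
        index($0, "=") {
            key = substr($0, 1, index($0, "=") - 1)
            val = substr($0, index($0, "=") + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            gsub(/[ \t]+/, " ", val)
            print key "=" val
        }' "$1"
}

# Compare each setting in the file with the live kernel value and report differences.
# A key the running kernel does not know shows up as a mismatch with an empty live value.
check_sysctl() {
    sysctl_conf_pairs "$1" | while IFS='=' read -r key want; do
        have=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
        [ "$have" = "$want" ] || echo "MISMATCH $key: want '$want', have '$have'"
    done
}

# Usage (on a Ceph node):
#   check_sysctl /etc/sysctl.d/99-sysctl.conf
```

No output from check_sysctl means every uncommented line in the file matches the running kernel.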

Reference here

CEPH: How to Restart an Install, or How to Reset a Cluster

Hey, are you installing CEPH in your test lab and you screwed it up, or something has gone wrong with your test cluster and you want to start over? Well, the instructions on how to do this are below.

Note that most of this process is actually well documented, but I added in a few extra steps that are good to run if you want to start from scratch without having to reinstall an OS.

First remove all CEPH rpms from your CEPH hosts; this includes Monitor nodes and OSD nodes. Note that I am in /root/ceph-deploy on my monitor/admin server. If you have separate admin and monitor nodes then run these commands from your admin node.

# ceph-deploy purge mon01 osd01 osd02 osd03

Now purge all config files.

# ceph-deploy purgedata mon01 osd01 osd02 osd03

Now remove all keys.

# ceph-deploy forgetkeys
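
If you reset often, the three ceph-deploy steps above can be generated for any list of hosts. This is only a sketch: reset_plan is a hypothetical helper of mine, and it deliberately prints the commands instead of running them so you can review them first.

```shell
# reset_plan HOST... -> print the ceph-deploy teardown commands for the given hosts
reset_plan() {
    echo "ceph-deploy purge $*"
    echo "ceph-deploy purgedata $*"
    echo "ceph-deploy forgetkeys"
}

# Usage (from the ceph-deploy working directory on the admin node):
#   reset_plan mon01 osd01 osd02 osd03          # review the plan
#   reset_plan mon01 osd01 osd02 osd03 | sh -x  # then run it
```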

Now delete any remaining files or keys from /root/ceph-deploy. If there are any files in here that you may want to keep, just copy them to /root first.

# rm -rf /root/ceph-deploy/*

Now remove Calamari. Note that in my test lab I am running Calamari on my one MON node, which also happens to be my admin node, so I run this command there.

# yum remove ceph-deploy calamari-server calamari-clients

You can now start over. I am installing Red Hat Ceph 1.3, so I am using the instructions here.

You might also want to uninstall salt from your servers. If that’s the case just look for any of the packages below and rip them out. Your versions will vary.

  • salt-2015.5.3-4.el7.noarch
  • salt-minion-2015.5.3-4.el7.noarch
  • salt-master-2015.5.3-4.el7.noarch
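
To find whichever salt packages are actually installed, you can filter the rpm list by name. A small sketch, assuming the package names follow the pattern in the list above; salt_packages is a hypothetical helper name.

```shell
# Read package names on stdin and keep only salt itself and its salt-<component> packages.
salt_packages() {
    grep -E '^salt(-[a-z]+)?-[0-9]'
}

# Usage (on each server):
#   rpm -qa | salt_packages                 # see what is installed
#   yum remove $(rpm -qa | salt_packages)   # rip them out after reviewing the list
```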

Ceph: Simple Ceph Pool Commands for Beginners


CEPH is a very well documented technology. Just check out the documentation for ceph at ceph.com. Pretty much everything that you want to know about CEPH is documented there. However, this also means that you possibly need to dig around just to remember a few simple commands.

Because of this, I have decided to put a few notes below on creating, deleting, and working with CEPH pools.

List Pools

You can list your existing pools with the command below. In this example I only have one pool called rbd with a pool number of 13.

[root@mon01 ~]# ceph osd lspools
13 rbd,

Create a Pool

OK, so now let's create a test pool, aptly named test-pool. I will create it with 128 PGs (placement groups).

[root@mon01 ~]# ceph osd pool create test-pool 128
pool 'test-pool' created

Placement Groups

Note, however, that when I run my script called pg_per_osd.bash, I see that my new pool (number 14) was actually created with 384 PGs. What?
[root@mon01 ~]# ./pg_per_osd.bash
dumped all in format plain

pool :  13      14      | SUM
osd.26  48      36      | 84
osd.27  30      43      | 73
osd.19  41      38      | 79
osd.20  45      37      | 82
osd.21  42      53      | 95
osd.22  43      39      | 82
osd.23  50      38      | 88
osd.24  35      51      | 86
osd.25  50      49      | 99
SUM :   384     384     |

What has occurred is that my new pool (test-pool) was created with 128 PGs; however, the script counts every copy of each PG, and the pool keeps three replicas of each one based on the global setting (osd_pool_default_size = 3) in /etc/ceph/ceph.conf.
We can verify that this is the case by taking a closer look at our new pool, test-pool.
Here we can verify that the pool was actually created with 128 PGs.

[root@mon01 ~]# ceph osd pool get test-pool pg_num
pg_num: 128
We can also check the pool’s setting to see how many replicas will be created. Note that the number is 3. Multiply 128 PGs by 3 replicas and you get 384.
[root@mon01 ~]# ceph osd pool get test-pool size
size: 3
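
The same arithmetic can be scripted so you can check any pool. A minimal sketch: field2 and pg_copies are hypothetical helper names, and the parsing assumes the "name: value" output format shown above.

```shell
# field2: print the value from a "name: value" line such as "pg_num: 128"
field2() { awk '{ print $2 }'; }

# pg_copies PG_NUM SIZE -> total PG copies placed across the OSDs
pg_copies() { echo $(( $1 * $2 )); }

# Usage against a live cluster:
#   pg_num=$(ceph osd pool get test-pool pg_num | field2)
#   size=$(ceph osd pool get test-pool size | field2)
#   pg_copies "$pg_num" "$size"     # 128 and 3 give 384
```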

You can also take a sneak peek at min_size, the minimum number of replicas that must be available for the pool to keep serving I/O. With fewer than size but at least min_size replicas, the pool runs in a degraded state.
[root@mon01 ~]# ceph osd pool get test-pool min_size
min_size: 2
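
How size and min_size interact can be pictured with a deliberately simplified model. This sketch is my own illustration, not Ceph code: pg_io_state is a hypothetical helper, and the three labels are shorthand rather than the PG state names that ceph itself prints.

```shell
# pg_io_state LIVE SIZE MIN_SIZE
#   clean    : all replicas are available
#   degraded : some replicas are missing, but at least min_size remain, so I/O continues
#   blocked  : fewer than min_size replicas remain, so I/O stops until recovery
pg_io_state() {
    live=$1 size=$2 min_size=$3
    if [ "$live" -ge "$size" ]; then
        echo "clean"
    elif [ "$live" -ge "$min_size" ]; then
        echo "degraded"
    else
        echo "blocked"
    fi
}

# With the values from test-pool (size 3, min_size 2):
#   pg_io_state 3 3 2   -> clean
#   pg_io_state 2 3 2   -> degraded
#   pg_io_state 1 3 2   -> blocked
```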
Other get/set commands are listed here.

Deleting A Pool

So this is a fun one, as you have to be very specific to delete a pool, going as far as typing out the pool name twice and including an option that would be very hard to type accidentally (however, be careful with the up arrow).
[root@mon01 ~]# ceph osd pool delete test-pool test-pool --yes-i-really-really-mean-it
pool 'test-pool' removed

Ceph: Show OSD to Journal Mapping


In Ceph, when you create an OSD (Object Storage Device) you also need to create its Journal, which is where data is initially written before it is flushed to an OSD. Note that to maximize I/O it is suggested to use SSD drives as the journal partitions for your OSDs (see this link for reference).

So this is exactly what I did. I basically followed the instructions here regarding the creation of OSDs and Journals.

However post-deployment, I wanted to verify that my journal partitions were actually created properly and were being used as expected. That was a little bit tougher to figure out.

First you need to ssh directly to one of your OSD servers, as this command cannot be run from the monitor/admin node.

[root@osd01 ceph-20]# ceph-disk list
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sda :
/dev/sda1 other, xfs, mounted on /boot
/dev/sda2 other, LVM2_member
/dev/sdb :
/dev/sdb1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
/dev/sdd :
/dev/sdd1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.21, journal /dev/sde3
/dev/sde :
/dev/sde1 ceph journal, for /dev/sdb1
/dev/sde2 ceph journal, for /dev/sdc1
/dev/sde3 ceph journal, for /dev/sdd1

In the output above you can see that I have three OSDs (sdb1, sdc1, sdd1), that my journal disk (sde) has three partitions, and how each journal partition is mapped to an OSD.
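
If you have many disks, the mapping is easier to read as one line per OSD. Here is a rough sketch that filters the ceph-disk output. osd_journal_map is a hypothetical name, and the parsing assumes the "ceph data ... osd.N, journal /dev/X" line format shown above.

```shell
# Read "ceph-disk list" output on stdin and print: <osd> <data partition> <journal partition>
osd_journal_map() {
    awk '/ceph data/ && /journal/ {
        osd = ""; journal = ""
        for (i = 1; i <= NF; i++) {
            # the OSD id appears as "osd.19," so strip the trailing comma
            if ($i ~ /^osd\.[0-9]+,?$/) { osd = $i; sub(/,$/, "", osd) }
            # the journal device is the field right after the word "journal"
            if ($i == "journal" && i < NF) journal = $(i + 1)
        }
        if (osd != "" && journal != "") print osd, $1, journal
    }'
}

# Usage (on an OSD server):
#   ceph-disk list 2>/dev/null | osd_journal_map
```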

Ceph: Show Placement Group Totals by OSD


Note that I did not write this scriptlet, nor do I claim to have written it. However, I did want to make sure that I did not lose the link to such a handy command.

The original can be found here, plus the original article has links to several more useful URLs, so feel free to check it out.

Anyway, now that we got that out of the way, here is the script.

ceph pg dump | awk '
 # Header row: locate the "up" column, then shift one field right to line up with the data rows
 /^pg_stat/ { col=1; while($col!="up") {col++}; col++ }
 # PG rows start with <pool>.<pg>: count every OSD in the up set, per pool
 /^[0-9a-f]+\.[0-9a-f]+/ { match($0,/^[0-9a-f]+/); pool=substr($0, RSTART, RLENGTH); poollist[pool]=0;
 up=$col; i=0; RSTART=0; RLENGTH=0; delete osds; while(match(up,/[0-9]+/)>0) { osds[++i]=substr(up,RSTART,RLENGTH); up = substr(up, RSTART+RLENGTH) }
 for(i in osds) {array[osds[i],pool]++; osdlist[osds[i]];}
 }
 # After the last row: print one line per OSD, one column per pool, plus the sums
 END {
 printf("\n");
 printf("pool :\t"); for (i in poollist) printf("%s\t",i); printf("| SUM \n");
 for (i in poollist) printf("--------"); printf("----------------\n");
 for (i in osdlist) { printf("osd.%i\t", i); sum=0;
 for (j in poollist) { printf("%i\t", array[i,j]); sum+=array[i,j]; poollist[j]+=array[i,j] }; printf("| %i\n",sum) }
 for (i in poollist) printf("--------"); printf("----------------\n");
 printf("SUM :\t"); for (i in poollist) printf("%s\t",poollist[i]); printf("|\n");
 }'

This gives you a nice table of PG counts per OSD and per pool, with a SUM row and a SUM column.


Ceph: Cluster Updates Are Stale. The Cluster isn’t updating Calamari. Please contact Administrator


There are probably many different issues that can cause this error in the Calamari WebUI. However, this fix worked for me. Note that this was post install and OSD deployment… I did not have a working cluster at this point.

First ssh into your Admin node; in this case my admin and monitor node are one and the same.

Then run the command below.

[root@mon01 calamari]# calamari-ctl clear
[WARNING] This will remove all stored Calamari monitoring status and history.  Use '--yes-i-am-sure' to proceed
OK, now run the command above again with the “--yes-i-am-sure” option.
[root@mon01 calamari]# calamari-ctl clear --yes-i-am-sure
[INFO] Loading configuration..
[INFO] Dropping tables
[INFO] Complete.  Now run `calamari-ctl initialize`
Now reinitialize Calamari.
[root@mon01 calamari]# calamari-ctl initialize
[INFO] Loading configuration..
[INFO] Starting/enabling salt...
[INFO] Starting/enabling postgres...
[INFO] Initializing database...
[INFO] Initializing web interface...
[INFO] Starting/enabling services...
[INFO] Restarting services...
[INFO] Complete.

Refresh the Calamari WebUI and the error should be gone, or at least it was for me.

Ceph: Troubleshooting Failed OSD Creation


Introduction to Ceph

According to Wikipedia, “Ceph is a free software storage platform designed to present object, block, and file storage from a single distributed computer cluster. Ceph’s main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely-available.”

More information pertaining to Ceph can be found here.

Lab Buildout

In my homelab I am building out a small Ceph cluster for testing and learning purposes. My small cluster consists of 4 virtual machines as shown below. I plan to use this cluster primarily as a backend for OpenStack.

Monitor Servers

  • Count: 1
  • Memory (GB): 2
  • Primary Disk (GB): 16

OSD Servers

  • Count: 3
  • Memory (GB): 2
  • Primary Disk (GB): 16
  • OSD Disk (GB): 10
  • OSD Disk (GB): 10
  • OSD Disk (GB): 10
  • SSD Journal (GB): 6

Troubleshooting OSD Creation

On my monitor server which is also serving as my Admin node, I run the following command to remove all partitioning on all disks that I intend to use for Ceph.

# for disk in sdb sdc sdd sde; do ceph-deploy disk zap osd01:/dev/$disk; done
Next I run the command below to prepare each OSD and specify the journal disk to use for each OSD. This command “should” create a partition on each OSD disk, format it, label it as a Ceph disk, and then create a journal partition for each OSD on the journal disk (sde in this case).
# ceph-deploy osd prepare osd01:sdb:sde osd01:sdc:sde osd01:sdd:sde
Unfortunately, the command above kept failing, stating that it was unable to create some of the partitions, even though it did create partitions on some of the disks and mounted them locally. This left my OSDs in a bad state, as running the command again would throw all sorts of errors. So I figured that I would start over and run the zap command again. However, that command now failed as well, because some of the disks were mounted and Ceph was running.
The next step was to ssh into the OSD server, aptly named osd01, and stop Ceph.
# /etc/init.d/ceph stop
Then unmount any OSDs that were mounted.
# umount /var/lib/ceph/osd/ceph-7 /var/lib/ceph/osd/ceph-8 /var/lib/ceph/osd/ceph-9
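
Rather than typing each mount point by hand, you can pull them from the mount table. A small sketch, assuming the default /var/lib/ceph/osd/ layout used above; mounted_osd_dirs is a hypothetical helper name.

```shell
# Print the mount point of every filesystem mounted under /var/lib/ceph/osd/.
# Reads /proc/mounts by default, or another mount table passed as the first argument.
mounted_osd_dirs() {
    awk '$2 ~ /^\/var\/lib\/ceph\/osd\// { print $2 }' "${1:-/proc/mounts}"
}

# Usage (on the OSD server, after stopping Ceph):
#   mounted_osd_dirs                     # review the list
#   umount $(mounted_osd_dirs)           # then unmount them all
```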
Then using fdisk, delete any existing partitions; this seemed to be necessary to remove the partitions created on the SSD journal disk. Next run partx to force the OS to re-read the partition table on each disk.
# for disk in sdb sdc sdd sde; do partx -a /dev/$disk; done
At this point I was able to log back into the admin node and re-run the prepare command.

Additional Troubleshooting

So, apparently this was not the end of all my woes. I ran into the same issue on my second OSD server, osd02. First thing I did was ssh into the OSD server and run the command below.
[root@osd02 ceph]# /etc/init.d/ceph status
=== osd.3 ===
osd.3: not running.
=== osd.13 ===
osd.13: running {"version":"0.94.1"}
=== osd.14 ===
osd.14: running {"version":"0.94.1"}
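
On a server with more OSDs it helps to filter the status output down to the ones that are stopped. A minimal sketch, assuming the "osd.N: not running." line format shown above; osds_not_running is a hypothetical helper name.

```shell
# Read "/etc/init.d/ceph status" output on stdin and print the OSDs that are not running.
osds_not_running() {
    awk '/^osd\.[0-9]+: not running/ { sub(/:$/, "", $1); print $1 }'
}

# Usage (on the OSD server):
#   /etc/init.d/ceph status | osds_not_running
```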
So I stopped Ceph.
[root@osd02 ceph]# /etc/init.d/ceph stop
=== osd.14 ===
Stopping Ceph osd.14 on osd02...kill 224396...kill 224396...done
=== osd.13 ===
Stopping Ceph osd.13 on osd02...kill 223838...kill 223838...done
=== osd.3 ===
Stopping Ceph osd.3 on osd02...done
Then I unmounted osd.3.
[root@osd02 ceph]# umount /var/lib/ceph/osd/ceph-3
Then I locally prepared osd.3, where /dev/sdb is the OSD disk and /dev/sde is the journal disk.
[root@osd02 ceph]# ceph-disk -v prepare --fs-type xfs --cluster ceph -- /dev/sdb /dev/sde
I then verified that I had three Ceph journal partitions on my SSD.
[root@osd02 ceph]# fdisk -l /dev/sde
WARNING: fdisk GPT support is currently new, and therefore in an experimental phase. Use at your own discretion.

Disk /dev/sde: 6442 MB, 6442450944 bytes, 12582912 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: gpt

#         Start          End    Size  Type            Name
1         2048      4098047      2G  unknown         ceph journal
2      4098048      8194047      2G  unknown         ceph journal
3      8194048     12290047      2G  unknown         ceph journal

Then I checked my OSDs again. All were running.
[root@osd02 ceph]# /etc/init.d/ceph status
=== osd.13 ===
osd.13: running {"version":"0.94.1"}
=== osd.14 ===
osd.14: running {"version":"0.94.1"}
=== osd.18 ===
osd.18: running {"version":"0.94.1"}