Hey, are you installing CEPH in your test lab and screwed it up, or has something gone wrong with your test cluster and you want to start over? Well, the instructions on how to do this are below.
Note that most of this process is actually well documented, but I added a few extra steps that are good to run if you want to start from scratch without having to reinstall an OS.
First, remove all CEPH rpms from your CEPH hosts; this includes Monitor nodes and OSD nodes. Note that I am in /root/ceph-deploy on my monitor/admin server. If you have separate admin and monitor nodes, then run these commands from your admin node.
# ceph-deploy purge mon01 osd01 osd02 osd03
Now purge all config files.
# ceph-deploy purgedata mon01 osd01 osd02 osd03
Now remove all keys.
# ceph-deploy forgetkeys
Now delete any remaining files or keys from /root/ceph-deploy; if there are any files in here that you want to keep, just copy them to /root first.
# rm -rf /root/ceph-deploy/*
Now remove Calamari. Note that in my test lab I am running Calamari on my one MON node, which also happens to be my admin node, so I run this command there.
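The exact removal command depends on how Calamari was installed; on a RHEL-based setup it typically looks something like this (the package names below are my assumption, so check what rpm -qa | grep -i calamari turns up first):
# rpm -qa | grep -i calamari
# yum remove calamari-server calamari-clients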
You can now start over. I am installing Red Hat Ceph 1.3, so I am using the instructions here.
You might also want to uninstall salt from your servers. If that's the case, just look for any of the packages below and rip them out. Your versions will vary.
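A quick way to see which salt packages you actually have installed is something like the following (your list and versions will differ from mine), then yum remove whatever it turns up:
# rpm -qa | grep -i salt
# yum remove salt salt-minion salt-master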
In Ceph, when you create an OSD (Object Storage Device) you also need to create its journal, which is where data is initially written before it is flushed to the OSD's data store. Note that to maximize I/O it is suggested to use SSD drives as the journal partitions for your OSDs (see this link for reference).
So this is exactly what I did. I basically followed the instructions here regarding the creation of OSDs and Journals.
However post-deployment, I wanted to verify that my journal partitions were actually created properly and were being used as expected. That was a little bit tougher to figure out.
First you need to ssh directly to one of your OSD servers; this command cannot be run from the monitor/admin node.
[root@osd01 ceph-20]# ceph-disk list
WARNING:ceph-disk:Old blkid does not support ID_PART_ENTRY_* fields, trying sgdisk; may not correctly identify ceph volumes with dmcrypt
/dev/sda :
/dev/sda1 other, xfs, mounted on /boot
/dev/sda2 other, LVM2_member
/dev/sdb :
/dev/sdb1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.19, journal /dev/sde1
/dev/sdc :
/dev/sdc1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.20, journal /dev/sde2
/dev/sdd :
/dev/sdd1 ceph data, active, unknown cluster 6f7cebf2-ceef-49b1-8928-2d36e6044db4, osd.21, journal /dev/sde3
/dev/sde :
/dev/sde1 ceph journal, for /dev/sdb1
/dev/sde2 ceph journal, for /dev/sdc1
/dev/sde3 ceph journal, for /dev/sdd1
In the output above you can see that I have three OSDs (sdb1, sdc1, sdd1), that my SSD journal disk (sde) has three partitions, and how each journal partition is mapped to its OSD.
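Another quick sanity check, assuming your OSDs live under the default /var/lib/ceph/osd/ path, is to look at the journal symlink inside each OSD's data directory (on ceph-disk prepared OSDs these usually point at the journal partitions by partuuid rather than by /dev/sdX name):
# ls -l /var/lib/ceph/osd/ceph-*/journal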
So before we get started deleting a Storage Repository, we need to know a few key terms.
In XenServer a Storage Repository (SR) is a storage target that contains virtual disks (VDIs) and ISOs.
A PBD (physical block device) is what they call the interface between a physical host and an attached SR; it is responsible for storing the device configuration that allows the host to interact with the storage target.
So now that we have gotten that out of the way, let's get started.
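As a rough sketch, the general flow is to find the SR's uuid, unplug its PBDs, and then forget (or destroy) the SR. The uuids below are placeholders for your own values, and note that sr-forget only detaches the SR from the pool; use sr-destroy instead if you actually want to wipe the data on it:
# xe sr-list name-label=<your SR name> params=uuid
# xe pbd-list sr-uuid=<sr-uuid> params=uuid
# xe pbd-unplug uuid=<pbd-uuid>
# xe sr-forget uuid=<sr-uuid>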
First of all, let me start this off by saying that there is a lot of information out there on how to set up a dedicated storage interface on XenServer. However, I was unable to find anything specifically related to bonding two unmanaged interfaces and using them as a dedicated uplink, which seems rather silly to me, as why would you not want a highly redundant network connection to your NFS storage? I digress.
Anyway, the first thing you need to do is ssh into one of your XenServer hosts. In my environment I am building out a three-node cluster and I need to make sure that I am working specifically with the first host in the cluster. So….
Once you are on the host, the first thing to do is change the network backend of your XenServer from "openvswitch" to "Linux Bridge". You accomplish this with the following command.
# xe-switch-network-backend bridge
Now you will need to reboot. Note that you can check your network-backend mode at any time with the following command.
# cat /etc/xensource/network.conf
First, get the uuid of the local XenServer host; use the hostname to do this.
# xe host-list name-label=xen01
The command above will return the uuid of the server.
Then you need to get a list of PIFs on the host that you are working with (making sure to exclude any other host's interfaces). The command below will output this list. We will need to grab the uuids of eth2 and eth3, since they are the interfaces that we are going to use to build our bond. Note that we are running this command so that it spits out the MAC addresses as well… make sure that you take note of these, as you will need them.
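Something along these lines will do it (the host-uuid is the one returned above, and the params list is just what I find handy):
# xe pif-list host-uuid=<host-uuid> params=uuid,device,MAC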
Next we will tell XenServer to "forget" or un-manage eth2. Then we will do the same to eth3. We will use the uuids of these interfaces to identify them to XenServer.
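The forget itself is a single command; the uuid below is the PIF uuid for eth2 that you grabbed in the previous step:
# xe pif-forget uuid=<eth2-pif-uuid>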
Example with interface eth2 in unmanaged mode. Rinse and repeat for eth3.
If you have successfully removed them, it's time to start creating your bond.
First, define your bond in /etc/modprobe.conf. I am calling my bond bond51.
alias bond51 bonding
options bond51 miimon=100 mode=7
Then edit /etc/sysconfig/network-scripts/ifcfg-eth2 and /etc/sysconfig/network-scripts/ifcfg-eth3 and make them look like the file below. For ifcfg-eth3, change the device name to eth3 and use that interface's MAC address for HWADDR.
DEVICE=eth2
BOOTPROTO=none
HWADDR=<MAC ADDRESS OF YOUR INTERFACE>
ONBOOT=yes
MASTER=bond51
SLAVE=yes
Then create /etc/sysconfig/network-scripts/ifcfg-bond51.
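What goes in that file depends on your storage network, but as a minimal sketch (the IP and netmask are placeholders, and you could use dhcp for BOOTPROTO if that is how your storage network is set up) it looks something like this:
DEVICE=bond51
BOOTPROTO=static
IPADDR=<your storage network IP>
NETMASK=<your storage network netmask>
ONBOOT=yes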
Beep Boop. Run ifup bond51 to bring up the bond and its slave interfaces.
You can check the status of the bond via the command below.
# cat /proc/net/bonding/bond51
Please know that I have done little more than reboot the XenServer host to make sure that the configuration I built would persist across reboots and fail over from one interface to another. I have not tested performance yet in any way, shape, or form.
As a Systems Administrator, I deal with RAID 1 (mirroring) pretty much exclusively. Hell, nowadays when you build a server it automatically mirrors your operating system disks for you, which means that you do not even need to understand what is happening behind the scenes. You just pop your two drives in your server and go. However, the world of the SAN Administrator is much more complicated.
First off its important to know that RAID stands for either “Redundant Array of Independent Disks”, or less commonly “Redundant Array of Inexpensive Disks”. Either way you slice it (pun intended) the basic idea of RAID is to combine multiple hard disks to either increase performance or increase redundancy.
Before I get started it's important to introduce the term LUN. A LUN is a logical disk that consists of raw physical disk space. LUNs are created as a basic part of the storage provisioning process, and they are presented across a SAN to a server as a single physical disk.
Note that the title of this article is “Raid Levels Explained and Simplified“, and when I say simplified I mean it. I am going to give a brief overview of most of the common RAID levels and then present a weakness and strength. Scroll down to the bottom of the article for links to more in depth articles and web pages.
RAID 0: Striped…No Fault Tolerance
OK, in my opinion, and in the opinion of many others, RAID 0 is not even RAID, because there is no redundancy. If a disk fails, you are toast. Basically you take slices of two or more disks and create a LUN. For example, let's say that you, as the sysadmin, request one 80GB disk from your local SAN admin. In this scenario your SAN guru would carve eight 10GB blocks and present them in order (blocks 1,2,3,4,5,6,7,8) to you as a single LUN. RAID 0 provides good read and write performance. In the end, RAID 0 is striping, which is the most important thing that you probably need to know about it.
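This is not how a SAN array builds its stripes internally, but if you want to see plain striping in action on a Linux box, mdadm will happily build a RAID 0 set out of a couple of spare disks (the device names here are just examples):
# mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1
# cat /proc/mdstat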
A little background…
Most of the time, I have used the RDAC driver to manage SAN disks in Linux. The RDAC driver hides the complexity of multiple paths and presents redundant paths as a single path, which can be used just as you would use a standard SCSI / IDE / SAS / SATA drive. Seeing only one device makes managing your disks much easier.
However, where I work we only use RDAC with our IBM FastT, Sun 6140, and STK Flexline storage arrays (all LSI-based). RDAC does not cover our other storage, such as the Hitachi and EMC CLARiiON arrays; for those we manage SAN disks with DM-Multipath.
Setup…
Setting up DM-Multipath is not hard. First you need to make sure that you install the device-mapper-multipath package, and then you will need to configure your multipath.conf and drop it into /etc. Below is some info on how to do so.
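As a rough sketch of the install side on a yum-based system (the sample config filenames shipped in the doc directory vary by version, so list them and copy whichever one suits you to /etc/multipath.conf as a starting point):
# yum install device-mapper-multipath
# ls /usr/share/doc/device-mapper-multipath-*/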
You will also need to make sure that you enable the multipathd daemon. This daemon is in charge of checking for failed paths.
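On a RHEL-style init system, enabling it looks something like this:
# chkconfig multipathd on
# service multipathd start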
Multipath Command…
For those used to using RDAC, DM-Multipath takes some getting used to, especially when you see the output from fdisk -l.
In one particular instance I was given /dev/sdm as the name of the new disk on this box. The output from the fdisk -l command is not exactly helpful, as there are a ton of pseudo devices showing up in my output. This is where the multipath command comes in handy.
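For example, multipath -ll prints each multipath device along with the underlying sd devices and paths behind it, which makes it much easier to figure out which pseudo devices belong to which LUN (your device names and WWIDs will obviously differ):
# multipath -ll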