Today while working in the datacenter, I accidentally destroyed a fibre cable running to one of our database servers. All the databases went poof and the DBAs freaked out. After I reminded them that this was a non-production database, I went about running a new fibre cable.
Since this box was dual-pathed back to the SAN, the server should have had additional paths back to its disks and the databases should have kept running along just fine. However, as luck would have it, the secondary path was down at the time I destroyed the primary.
Anyway, after reseating the secondary fibre and running a new primary cable, I was able to verify that I had multiple connections back to the disks with the following commands:
# vxdmpadm getdmpnode dmpnodename=3pardata0_4641
NAME            STATE    ENCLR-TYPE  PATHS  ENBL  DSBL  ENCLR-NAME
3pardata0_4641  ENABLED  3PARDATA    4      4     0     3pardata0
Where 3pardata0_4641 is the disk name and PATHS is obviously the number of paths back to the SAN disk.
Better yet, you can check DMP (Dynamic Multi-Pathing) status for all disks with the command below.
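The command itself didn't survive in the post; if memory serves, running vxdmpadm getdmpnode without a dmpnodename= argument reports every DMP node at once. The awk filter is my own addition for flagging anything with disabled paths, fed sample text in the same format as the output above (the second row is fabricated to show a degraded node):

```shell
# Report DMP status for every disk; omitting dmpnodename= lists all nodes.
vxdmpadm getdmpnode

# Flag any node with disabled paths (DSBL is the 6th column).
# Sample input mirrors the vxdmpadm output format shown above;
# the 3pardata0_4642 row is made up for illustration.
printf '%s\n' \
  'NAME STATE ENCLR-TYPE PATHS ENBL DSBL ENCLR-NAME' \
  '3pardata0_4641 ENABLED 3PARDATA 4 4 0 3pardata0' \
  '3pardata0_4642 ENABLED 3PARDATA 4 2 2 3pardata0' |
  awk 'NR > 1 && $6 > 0 {print $1 " has " $6 " disabled path(s)"}'
```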
Got a heads-up from a DBA today stating that one of their database servers was running hot from an I/O perspective. So in order to troubleshoot, I jumped on the server and ran iostat:
iostat -d -x 5 3
Based on the output below I was able to determine which disks were the most utilized by looking at the last column, %util. Two other columns to take note of are await and svctm: await is the average response time in ms for an I/O request to the device, including any time spent waiting in the queue, while svctm is the average time it took to service a request once it was sent to the disk and out of the queue. In this case service times are low, so I can pretty much rule out SAN issues.
Device:    rrqm/s wrqm/s    r/s   w/s  rsec/s  wsec/s avgrq-sz avgqu-sz await svctm %util
VxVM65519    0.00   0.00 230.80 20.80 5753.60  556.80    25.08     1.77  6.94  3.50 99.08
VxVM65516    0.00   0.00 257.60 21.80 6022.40  467.20    23.23     1.68  6.02  3.01 93.22
VxVM65515    0.00   0.00 265.80 18.80 6563.20  364.80    24.34     1.42  4.89  2.69 94.58
VxVM65513    0.00   0.00 233.20 24.00 6032.00  969.60    27.22     1.41  5.47  2.99 88.92
VxVM65493    0.00   0.00 308.80 21.00 7590.40  944.00    25.88     1.74  5.27  2.52 98.14
VxVM65492    0.00   0.00 262.80 20.40 6502.40  716.80    25.49     1.76  6.10  2.94 99.22
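To make the await/svctm distinction concrete: the difference between the two is roughly how long a request sat in the queue before the disk saw it. A quick awk pass over a saved copy of the output pulls that out (await and svctm are the 10th and 11th fields in this layout; iostat.out is a hypothetical capture file, not something from the original post):

```shell
# Queue wait per device: await minus svctm (fields 10 and 11 in
# this iostat -x layout). iostat.out is an assumed saved capture.
awk '{printf "%s queue_ms=%.2f util=%s\n", $1, $10 - $11, $12}' iostat.out
```

Near-zero queue_ms with high %util points at a device that is simply busy; large queue_ms with low svctm points at requests stacking up faster than the disk can drain them.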
In this case the svctm values are OK, so I know the disks themselves are performing well. However, I do know that the hot disks are under Veritas control, so I take a quick look at the disk groups on the box with a vxdg list.
Then I run vxstat on each disk group individually to look for the volumes with a high number of reads and writes. Then I can advise the DBAs which specific database is causing the load on the box.
>vxstat -g example_dg -i 1
It's been a while since I have done much with Veritas Storage Foundation, so I was at a bit of a loss after firing up the VEA GUI on a fresh install on CentOS 5.4 and not seeing any of the agents that I was used to seeing. A quick check on the command line showed that the Storage Agent was in fact running. A vxdisk list displayed my disks without issue, but the GUI was blank except for the server name.
A quick Google search led me to this Symantec KB article…
While this article was helpful, it did not solve my issue: as I mentioned above, my Storage Agent was in fact running without issue. However, I was able to verify similar errors in my vxsis.log:
Thu Mar 27 12:40:03 2008:3342:get_objects_by_type: Error in GetObjects : 0xc1000039
Thu Mar 27 12:40:03 2008:3342:database::HandleILCacheRequest:Error in get_objects_by_type() : 0xc1000039
Thu Mar 27 12:40:03 2008:3342:rpc_object_fetch: Could not fetch objects
So I started searching on the errors above, and ran into a post on the Symantec Forums where one of the posters suggested that the /etc/hosts file might be the culprit.
Sure enough, a quick check of my /etc/hosts revealed that my server's IP and FQDN were not present. I corrected that and then followed the steps below.
First I stopped the VEA service:
> /opt/VRTS/bin/vxsvcctrl stop
Then I reconfigured Veritas
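The reconfiguration command itself didn't survive in the post; my assumption, given the vxsvcctrl stop above, is that the counterpart start brings the service back up so it re-reads /etc/hosts:

```shell
# Assumed counterpart to the stop above: start the VEA service
# again so it picks up the corrected /etc/hosts entry.
/opt/VRTS/bin/vxsvcctrl start
```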
I restarted my VEA GUI and reconnected and found all was as it should be.