Cluster Health Monitor & OCLUMON
The Cluster Health Monitor (CHM) stores real-time operating system metrics in the CHM repository that you can use for later triage with the help of Oracle Support should you have cluster issues.
CHM consists of three components: the System Monitor Service, the Cluster Logger Service, and the CHM repository.
The OCLUMON command-line tool is included with CHM and you can use it to query the CHM repository to display node-specific metrics for a specified time period. You can also use oclumon to query and print the durations and the states for a resource on a node during a specified time period.
System Monitor Service
There is one system monitor service on every RAC node. The system monitor service (osysmond) is the monitoring and operating system metric collection service that sends the data to the cluster logger service. The cluster logger service receives the information from all the nodes and persists it in a CHM repository-based database.
[root@pk3-iub-rp-od01 bin]# ps -ef | grep osys
root 7229 27949 0 12:34 pts/0 00:00:00 grep osys
root 9865 1 4 Nov06 ? 2-05:33:49 /u01/app/11.2.0.4/grid/bin/osysmond.bin
Cluster Logger Service
A single master cluster logger service (ologgerd) runs on one node in the cluster, and the master chooses another node to house its standby. If the master cluster logger service fails (because the service is not able to come up after a fixed number of retries, or because the node where the master was running is down), the node where the standby resides takes over as master and selects a new node for the standby. The master manages the operating system metric database in the CHM repository and interacts with the standby to manage a replica of the master operating system metrics database.
[root@pk3-iub-rp-od01 bin]# ps -ef |grep olog
root 10450 1 0 Nov06 ? 10:54:06 /u01/app/11.2.0.4/grid/bin/ologgerd -m pk3-iub-rp-od02 -r -d /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
root 12824 8395 0 12:38 pts/0 00:00:00 grep olog
[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get master
Master = pk3-iub-rp-od01
Done
[root@pn3-esk-rp-od01 bin]# ./oclumon manage -get replica
Replica = pk3-iub-rp-od02
Done
CHM Repository
The CHM repository, by default, resides within the Grid Infrastructure home and requires 1 GB of disk space per node in the cluster. You can adjust its size and location, and Oracle supports moving it to shared storage. You manage the CHM repository with OCLUMON.
[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get reppath
CHM Repository Path = /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
Done
[root@pk3-iub-rp-od01 bin]# ./oclumon manage -get repsize
CHM Repository Size = 61511
Done
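If you collect these settings from several clusters, the "key = value" lines that oclumon manage -get prints are easy to parse. The following Python sketch works only on text captured from the commands above and does not call oclumon itself. The function name parse_manage_output is hypothetical, and the conversion to hours assumes that the repository size is a retention time in seconds, which is how I understand the 11.2 documentation; verify this for your release.

```python
def parse_manage_output(text):
    """Parse 'Key = value' lines printed by `oclumon manage -get ...`."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and the trailing "Done"
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Output captured from the examples above.
captured = """\
Master = pk3-iub-rp-od01
Done
Replica = pk3-iub-rp-od02
Done
CHM Repository Path = /u01/app/11.2.0.4/grid/crf/db/pk3-iub-rp-od01
Done
CHM Repository Size = 61511
Done
"""

settings = parse_manage_output(captured)
# Assumption: the size is a retention time expressed in seconds.
retention_hours = int(settings["CHM Repository Size"]) / 3600
print(settings["Master"], settings["Replica"], round(retention_hours, 1))
```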
OCLUMON Usage
Use the oclumon dumpnodeview command to view log information from the system monitor service in the form of a node view. A node view consists of seven views:
SYSTEM: Lists system metrics such as CPU COUNT, CPU USAGE, and MEM USAGE
TOP CONSUMERS: Lists the top consuming processes
PROCESSES: Lists process metrics such as PID, name, number of threads, memory usage, and number of file descriptors
DEVICES: Lists device metrics such as disk read and write rates, queue length, and wait time per I/O
NICS: Lists network interface card metrics such as network receive and send rates, effective bandwidth, and error rates
FILESYSTEMS: Lists file system metrics, such as total, used, and available space
PROTOCOL ERRORS: Lists any protocol errors. All protocol errors are cumulative values since system startup.
-- Below retrieves the info for the last one hour for a specific node
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -n pk3-iub-rp-od01 -last "01:00:00" > /tmp/oclumon.txt
-- For a specific time range
./oclumon dumpnodeview -allnodes -s "time_stamp" -e "time_stamp"
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -s "2014-12-24 10:05:00" -e "2014-12-24 10:10:00" > /tmp/oclumon.txt
-- Without -v, only the SYSTEM and TOP CONSUMERS views are shown
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
-- with -warning only node views with warning will be shown
[root@pk3-iub-rp-od01 bin]# ./oclumon dumpnodeview -allnodes -warning -v -s "2014-12-24 10:05:00" -e "2014-12-24 10:05:10" > /tmp/oclumon.txt
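When you script these queries, the main detail to get right is the timestamp format: "YYYY-MM-DD HH:MM:SS" for -s and -e, and "HH:MM:SS" for -last. The Python sketch below only builds the argument list and does not run anything. The function names are hypothetical, and it uses only the flags shown in the examples above.

```python
from datetime import datetime, timedelta


def dumpnodeview_args(start, end, verbose=False, warning=False):
    """Build the argument list for a dumpnodeview query over all nodes."""
    fmt = "%Y-%m-%d %H:%M:%S"  # timestamp format used in the examples above
    args = ["./oclumon", "dumpnodeview", "-allnodes"]
    if warning:
        args.append("-warning")
    if verbose:
        args.append("-v")
    args += ["-s", start.strftime(fmt), "-e", end.strftime(fmt)]
    return args


def last_window(delta):
    """Format a timedelta as the HH:MM:SS value passed to -last."""
    seconds = int(delta.total_seconds())
    return "%02d:%02d:%02d" % (seconds // 3600, seconds % 3600 // 60, seconds % 60)


start = datetime(2014, 12, 24, 10, 5, 0)
args = dumpnodeview_args(start, start + timedelta(seconds=10), verbose=True)
print(args)
print(last_window(timedelta(hours=1)))
```

Passing the list to subprocess.run (instead of joining it into one string) avoids shell quoting problems with the space inside each timestamp.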
-- To see the objects (NICs, OS processes, disks) present on a node at a particular time
[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects
Following nodes are attached to the loggerd
pk3-iub-rp-od01
pk3-iub-rp-od02
[root@pn3-esk-rp-od01 bin]# ./oclumon showobjects -n pn3-esk-rp-od01 -time "2014-12-24 15:00:00"
Sample Output
----------------------------------------
Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028
----------------------------------------
SYSTEM:
#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43 cpuq: 3 physmemfree: 195428192 physmemtotal: 264536300 mcache: 29107440 swapfree: 25165816 swaptotal: 25165816 ior: 0 iow: 1135 ios: 222 swpin: 0 swpout: 0 pgin: 0 pgout: 543 netr: 42.281 netw: 115.057 procs: 1282 rtprocs: 80 #fds: 29952 #sysfdlimit: 6815744 #disks: 5 #nics: 4 nicErrors: 0
TOP CONSUMERS:
topcpu: 'osysmond.bin(9813) 5.99' topprivmem: 'java(9716) 408552' topshm: 'ora_lms2_iubDB2(12021) 5600244' topfd: 'ocssd.bin(9865) 196' topthread: 'java(9716) 47'
PROCESSES:
name: 'osysmond.bin' pid: 9813 #procfdlimit: 65536 cpuusage: 5.99 privmem: 32672 shm: 58076 #fd: 66 #threads: 12 priority: -100 nice: 0
name: 'oraagent.bin' pid: 11285 #procfdlimit: 65536 cpuusage: 0.79 privmem: 25532 shm: 17752 #fd: 89 #threads: 26 priority: 20 nice: 0
name: 'tnslsnr' pid: 11607 #procfdlimit: 65536 cpuusage: 0.39 privmem: 4392 shm: 10180 #fd: 19 #threads: 3 priority: 20 nice: 0
name: 'orarootagent.bi' pid: 11289 #procfdlimit: 65536 cpuusage: 0.39 privmem: 11720 shm: 14472 #fd: 32 #threads: 11 priority: 20 nice: 0
name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0
name: 'ora_dia0_IUBDB2' pid: 11998 #procfdlimit: 65536 cpuusage: 0.39 privmem: 34392 shm: 125936 #fd: 27 #threads: 1 priority: 20 nice: 0
......
name: 'oracle+ASM2' pid: 11074 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2812 shm: 18224 #fd: 16 #threads: 1 priority: 20 nice: 0
name: 'oracle+ASM2' pid: 11335 #procfdlimit: 65536 cpuusage: 0.0 privmem: 2616 shm: 17460 #fd: 18 #threads: 1 priority: 20 nice: 0
DEVICES:
dm-2 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SWAP
dm-3 ior: 0.000 iow: 25.631 ios: 6 qlen: 0 wait: 0 type: SYS
dm-1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
dm-0 ior: 0.000 iow: 542.253 ios: 135 qlen: 0 wait: 0 type: SYS
sda ior: 0.000 iow: 567.884 ios: 80 qlen: 0 wait: 0 type: SYS
sda3 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
sda2 ior: 0.000 iow: 569.485 ios: 81 qlen: 0 wait: 0 type: SYS
sda1 ior: 0.000 iow: 0.000 ios: 0 qlen: 0 wait: 0 type: SYS
NICS:
lo netrr: 1.741 netwr: 1.741 neteff: 3.482 nicerrors: 0 pktsin: 11 pktsout: 11 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 11 innonunicast: 0 type: PUBLIC
eth0 netrr: 0.000 netwr: 0.000 neteff: 0.000 nicerrors: 0 pktsin: 0 pktsout: 0 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 0 innonunicast: 0 type: PUBLIC
bondeth0 netrr: 35.239 netwr: 109.135 neteff: 144.374 nicerrors: 0 pktsin: 154 pktsout: 157 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 154 innonunicast: 0 type: PUBLIC
bondib0 netrr: 5.300 netwr: 4.181 neteff: 9.480 nicerrors: 0 pktsin: 13 pktsout: 14 errsin: 0 errsout: 0 indiscarded: 0 outdiscarded: 0 inunicast: 13 innonunicast: 0 type: PRIVATE latency: <1
FILESYSTEMS:
mount: /u01 type: ext3 total: 103212320 used: 59980932 available: 37988508 used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]
mount: / type: rootfs total: 0 used: 0 available: 0 used%: 0 ifree%: -1 [IUBBRM2 iubDB2]
PROTOCOL ERRORS:
IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 TCPFailedConn: 35 TCPEstRst: 10738 TCPRetraSeg: 1203010 UDPUnkPort: 224 UDPRcvErr: 0
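Because each view is a run of "name: value" pairs, the text saved in /tmp/oclumon.txt can be post-processed with a few lines of code. The Python sketch below parses the node header and the SYSTEM line (shortened from the sample above) and derives the share of physical memory in use. The function name parse_system is hypothetical, and the regular expression keeps only the first word of a multi-word chipname.

```python
import re

# Header and SYSTEM line copied (shortened) from the sample output above.
sample = """\
Node: pk3-iub-rp-od02 Clock: '12-24-14 10.05.01' SerialNo:64028
SYSTEM:
#pcpus: 2 #vcpus: 32 cpuht: Y chipname: Intel(R) cpu: 2.43 cpuq: 3 physmemfree: 195428192 physmemtotal: 264536300 swapfree: 25165816 swaptotal: 25165816
"""

# A metric name (optionally starting with '#'), a colon, then one token.
PAIR = re.compile(r"(#?\w+):\s+(\S+)")


def parse_system(text):
    """Return (node, metrics) for one node view; metric values stay strings."""
    node = re.search(r"^Node:\s+(\S+)", text, re.MULTILINE).group(1)
    lines = text.splitlines()
    system_line = lines[lines.index("SYSTEM:") + 1]
    return node, dict(PAIR.findall(system_line))


node, metrics = parse_system(sample)
total = int(metrics["physmemtotal"])
free = int(metrics["physmemfree"])
mem_used_pct = 100.0 * (total - free) / total
print(node, metrics["cpu"], round(mem_used_pct, 1))
```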
Metric Descriptions
SYSTEM View Metric Descriptions
Metric | Description |
---|---|
#pcpus | Number of physical CPUs in the system |
#vcpus | Number of logical compute units |
chipname | Type of CPU |
cpuht | CPU hyperthreading enabled (Y) or disabled (N) |
cpu | Average CPU utilization per processing unit within the current sample interval (%) |
cpuq | Number of processes waiting in the run queue within the current sample interval |
physmemfree | Amount of free RAM (KB) |
physmemtotal | Amount of total usable RAM (KB) |
mcache | Amount of physical RAM used for file buffers plus the amount of physical RAM used as cache memory (KB). Note: This metric is not available on Solaris or Windows systems. |
swapfree | Amount of swap memory free (KB) |
swaptotal | Total amount of physical swap memory (KB) |
ior | Average total disk read rate within the current sample interval (KB per second) |
iow | Average total disk write rate within the current sample interval (KB per second) |
ios | Average total disk I/O operation rate within the current sample interval (I/O operations per second) |
swpin | Average swap in rate within the current sample interval (KB per second). Note: This metric is not available on Windows systems. |
swpout | Average swap out rate within the current sample interval (KB per second). Note: This metric is not available on Windows systems. |
pgin | Average page in rate within the current sample interval (pages per second) |
pgout | Average page out rate within the current sample interval (pages per second) |
netr | Average total network receive rate within the current sample interval (KB per second) |
netw | Average total network send rate within the current sample interval (KB per second) |
procs | Number of processes |
rtprocs | Number of real-time processes |
#fds | Number of open file descriptors (number of open handles on Windows) |
#sysfdlimit | System limit on number of file descriptors. Note: This metric is not available on Windows systems. |
#disks | Number of disks |
#nics | Number of network interface cards |
nicErrors | Average total network error rate within the current sample interval (errors per second) |
PROCESSES View Metric Descriptions
Metric | Description |
---|---|
name | The name of the process executable |
pid | The process identifier assigned by the operating system |
#procfdlimit | Limit on number of file descriptors for this process. Note: This metric is not available on Windows, Solaris, AIX, and HP-UX systems. |
cpuusage | Process CPU utilization (%). Note: The utilization value can be up to 100 times the number of processing units. |
memusage | Process private memory usage (KB). This field is labeled privmem in the sample output above. |
shm | Process shared memory usage (KB). Note: This metric is not available on Windows, Solaris, and AIX systems. |
workingset | Working set of a program (KB). Note: This metric is only available on Windows. |
#fd | Number of file descriptors open by this process (number of open handles by this process on Windows) |
#threads | Number of threads created by this process |
priority | The process priority |
nice | The nice value of the process |
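A common first step in triage is to rank the processes of a node view by one of these metrics. The Python sketch below parses three PROCESSES lines copied from the sample output and sorts them by private memory. Note that the sample output labels that field privmem. The function name parse_process is hypothetical.

```python
import re

# Three PROCESSES lines copied from the sample output above.
process_lines = """\
name: 'osysmond.bin' pid: 9813 #procfdlimit: 65536 cpuusage: 5.99 privmem: 32672 shm: 58076 #fd: 66 #threads: 12 priority: -100 nice: 0
name: 'oraagent.bin' pid: 11285 #procfdlimit: 65536 cpuusage: 0.79 privmem: 25532 shm: 17752 #fd: 89 #threads: 26 priority: 20 nice: 0
name: 'ocssd.bin' pid: 9865 #procfdlimit: 65536 cpuusage: 0.39 privmem: 78568 shm: 55952 #fd: 196 #threads: 26 priority: -100 nice: 0
"""


def parse_process(line):
    """Extract the name plus a few numeric metrics from one PROCESSES line."""
    name = re.search(r"name: '([^']*)'", line).group(1)
    # Numeric fields only; the quoted name is handled separately above.
    fields = dict(re.findall(r"(#?\w+): (-?[\d.]+)", line))
    return {
        "name": name,
        "pid": int(fields["pid"]),
        "cpuusage": float(fields["cpuusage"]),
        "privmem": int(fields["privmem"]),
    }


processes = [parse_process(line) for line in process_lines.splitlines()]
by_privmem = sorted(processes, key=lambda p: p["privmem"], reverse=True)
for proc in by_privmem:
    print(proc["name"], proc["pid"], proc["privmem"])
```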
DEVICES View Metric Descriptions
Metric | Description |
---|---|
ior | Average disk read rate within the current sample interval (KB per second) |
iow | Average disk write rate within the current sample interval (KB per second) |
ios | Average disk I/O operation rate within the current sample interval (I/O operations per second) |
qlen | Number of I/O requests in wait state within the current sample interval |
wait | Average wait time per I/O within the current sample interval (msec) |
type | If applicable, identifies what the device is used for. Possible values are SWAP, SYS, OCR, ASM, and VOTING. |
NICS View Metric Descriptions
Metric | Description |
---|---|
netrr | Average network receive rate within the current sample interval (KB per second) |
netwr | Average network send rate within the current sample interval (KB per second) |
neteff | Average effective bandwidth within the current sample interval (KB per second) |
nicerrors | Average error rate within the current sample interval (errors per second) |
pktsin | Average incoming packet rate within the current sample interval (packets per second) |
pktsout | Average outgoing packet rate within the current sample interval (packets per second) |
errsin | Average error rate for incoming packets within the current sample interval (errors per second) |
errsout | Average error rate for outgoing packets within the current sample interval (errors per second) |
indiscarded | Average drop rate for incoming packets within the current sample interval (packets per second) |
outdiscarded | Average drop rate for outgoing packets within the current sample interval (packets per second) |
inunicast | Average packet receive rate for unicast within the current sample interval (packets per second) |
type | Whether PUBLIC or PRIVATE |
innonunicast | Average packet receive rate for multicast (packets per second) |
latency | Estimated latency for this network interface card (msec) |
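In the sample output above, neteff equals netrr plus netwr for every interface, to within rounding. That is an observation about this one sample, not a documented definition. The Python sketch below parses three NICS lines (shortened from the sample) and checks the relationship; the function name parse_nic and the 0.002 tolerance are assumptions made for illustration.

```python
# NICS lines shortened from the sample output above.
nic_lines = """\
lo netrr: 1.741 netwr: 1.741 neteff: 3.482 nicerrors: 0 type: PUBLIC
bondeth0 netrr: 35.239 netwr: 109.135 neteff: 144.374 nicerrors: 0 type: PUBLIC
bondib0 netrr: 5.300 netwr: 4.181 neteff: 9.480 nicerrors: 0 type: PRIVATE
"""


def parse_nic(line):
    """Split one NICS line into (interface name, {metric: value string})."""
    name, rest = line.split(None, 1)
    tokens = rest.split()
    # Tokens alternate between "metric:" and its value.
    pairs = {key.rstrip(":"): value for key, value in zip(tokens[0::2], tokens[1::2])}
    return name, pairs


nics = dict(parse_nic(line) for line in nic_lines.splitlines())

# Check the observed relationship neteff ~= netrr + netwr (rounding tolerance).
consistent = {
    name: abs(float(m["netrr"]) + float(m["netwr"]) - float(m["neteff"])) <= 0.002
    for name, m in nics.items()
}
print(consistent)
```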
FILESYSTEMS View Metric Descriptions
Metric | Description |
---|---|
total | Total amount of space (KB) |
used | Amount of used space (KB) |
available | Amount of available space (KB) |
used% | Percentage of used space (%) |
mft% | Percentage of master file table utilization |
ifree% | Percentage of free file nodes (%). Note: This metric is not available on Windows systems. |
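These fields lend themselves to a simple space check across node views. The Python sketch below parses the two FILESYSTEMS lines from the sample output and lists the mounts whose used% reaches a threshold. The function name parse_filesystem and the threshold of 60 are arbitrary choices for illustration.

```python
import re

# FILESYSTEMS lines copied from the sample output above.
fs_lines = """\
mount: /u01 type: ext3 total: 103212320 used: 59980932 available: 37988508 used%: 61 ifree%: 96 [ORACLE_HOME IUBBRM2 IUBDB2]
mount: / type: rootfs total: 0 used: 0 available: 0 used%: 0 ifree%: -1 [IUBBRM2 iubDB2]
"""


def parse_filesystem(line):
    """Return the name/value pairs of one FILESYSTEMS line plus its tags."""
    fields = dict(re.findall(r"(\w+%?):\s+(\S+)", line))
    tags = re.search(r"\[(.*)\]", line)
    fields["tags"] = tags.group(1).split() if tags else []
    return fields


filesystems = [parse_filesystem(line) for line in fs_lines.splitlines()]

THRESHOLD = 60  # percent; arbitrary value chosen for this example
over = [fs["mount"] for fs in filesystems if int(fs["used%"]) >= THRESHOLD]
print(over)
```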
PROTOCOL ERRORS View Metric Descriptions
Metric | Description |
---|---|
IPHdrErr | Number of input datagrams discarded due to errors in their IPv4 headers |
IPAddrErr | Number of input datagrams discarded because the IPv4 address in their IPv4 header's destination field was not a valid address to be received at this entity |
IPUnkProto | Number of locally-addressed datagrams received successfully but discarded because of an unknown or unsupported protocol |
IPReasFail | Number of failures detected by the IPv4 reassembly algorithm |
IPFragFail | Number of IPv4 datagrams discarded due to fragmentation failures |
TCPFailedConn | Number of times that TCP connections have made a direct transition to the CLOSED state from either the SYN-SENT state or the SYN-RCVD state, plus the number of times that TCP connections have made a direct transition to the LISTEN state from the SYN-RCVD state |
TCPEstRst | Number of times that TCP connections have made a direct transition to the CLOSED state from either the ESTABLISHED state or the CLOSE-WAIT state |
TCPRetraSeg | Total number of TCP segments retransmitted |
UDPUnkPort | Total number of received UDP datagrams for which there was no application at the destination port |
UDPRcvErr | Number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port |
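Because the protocol error counters are cumulative since system startup, a single node view says little on its own; the change between two node views is what matters. The Python sketch below computes that difference. The first sample is the PROTOCOL ERRORS line from the output above, while the second sample is hypothetical and invented only to illustrate the calculation. The function names are also hypothetical.

```python
def parse_counters(line):
    """Turn 'Name: value Name: value ...' into a dict of integer counters."""
    tokens = line.split()
    return {key.rstrip(":"): int(value) for key, value in zip(tokens[0::2], tokens[1::2])}


def counter_deltas(earlier, later):
    """Return only the counters that changed between two node views."""
    return {
        name: later[name] - earlier[name]
        for name in earlier
        if later[name] != earlier[name]
    }


# PROTOCOL ERRORS line from the sample output above.
first = parse_counters(
    "IPHdrErr: 0 IPAddrErr: 0 IPUnkProto: 0 IPReasFail: 0 IPFragFail: 0 "
    "TCPFailedConn: 35 TCPEstRst: 10738 TCPRetraSeg: 1203010 "
    "UDPUnkPort: 224 UDPRcvErr: 0"
)

# Hypothetical later sample: these two values are invented for illustration.
second = dict(first, TCPRetraSeg=1203110, UDPUnkPort=226)

deltas = counter_deltas(first, second)
print(deltas)
```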