Note: All the posts are based on practical approach avoiding lengthy theory. All have been tested on some development servers. Please don’t test any post on production servers until you are sure.

Monday, October 28, 2013

Exadata: Monitoring Performance (Using Metrics)


As with the database, Exadata provides metrics and alerts that help you monitor Oracle Exadata Storage Server Software. Metrics are associated with objects such as cells and cell disks, and can be cumulative, rate, or instantaneous.
Metrics are recorded observations of important run-time properties. They are retained in memory and also stored on disk for a more permanent history.


Displaying Metrics
Use the LIST METRICDEFINITION command to display the metric definitions for the cell. A metric definition listing shows the configuration of a metric.
CellCLI> LIST METRICDEFINITION CL_CPUT DETAIL
CellCLI> LIST METRICDEFINITION WHERE objectType = 'GRIDDISK'
The following object types are available:
IORM_CONSUMER_GROUP, IORM_DATABASE, IORM_CATEGORY, CELL, CELLDISK, CELL_FILESYSTEM, GRIDDISK, HOST_INTERCONNECT, FLASHCACHE

CellCLI> LIST METRICDEFINITION WHERE name LIKE 'CD_IO_RQ.*' -
ATTRIBUTES name, metricType, description

Use the LIST METRICCURRENT command to display the current metric values:
CellCLI> LIST METRICCURRENT CL_TEMP DETAIL

CellCLI> LIST METRICCURRENT WHERE objectType = 'CELLDISK' AND -
metricValue != 0 ATTRIBUTES name, metricObjectName, -
metricValue, collectionTime

CellCLI> LIST METRICHISTORY CD_IO_RQ_R_LG WHERE alertState='critical' DETAIL

CellCLI> LIST METRICHISTORY WHERE objectType = 'CELLDISK' AND metricValue != 0 -
AND collectionTime > '2013-10-12T09:10:51-07:00' -
ATTRIBUTES name, metricObjectName, metricValue, collectionTime
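
If you capture the output of the LIST commands above (for example from a script), each line can be turned into a structured record. The following is only a minimal Python sketch. It assumes each output line holds whitespace-separated columns in the order requested by ATTRIBUTES name, metricObjectName, metricValue, collectionTime, that the value may carry a thousands separator and a unit (such as "1,234 MB"), and that object names contain no spaces. The sample line and the names MetricSample and parse_metric_line are made up for illustration; check the format against the output of your own cell.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class MetricSample(NamedTuple):
    name: str
    object_name: str
    value: float
    unit: str
    collected_at: datetime


def parse_metric_line(line: str) -> Optional[MetricSample]:
    """Parse one output line; return None for blank or too-short lines.

    Assumed layout (not verified against every release):
    name  metricObjectName  value [unit]  collectionTime
    """
    fields = line.split()
    if len(fields) < 4:
        return None
    name, object_name = fields[0], fields[1]
    # Drop thousands separators before converting, e.g. "1,234" -> 1234.0
    value = float(fields[2].replace(",", ""))
    # Anything between the number and the timestamp is treated as the unit
    unit = " ".join(fields[3:-1])
    # collectionTime is an ISO 8601 timestamp with a UTC offset
    collected_at = datetime.fromisoformat(fields[-1])
    return MetricSample(name, object_name, value, unit, collected_at)


# Hypothetical sample line, for illustration only
sample = "CD_IO_BY_R_LG\tCD_00_cell01\t1,234 MB\t2013-10-12T09:10:51-07:00"
print(parse_metric_line(sample))
```

Keeping the unit next to the value helps avoid mixing MB, percentages and request counts when several metrics are compared later.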

A Few Useful Metrics
Cell Metrics
CL_CPUT ==> The cell CPU utilization which is the instantaneous percentage of time over the previous minute that the system CPUs were not idle (from /proc/stat).
CL_CPUT_CS ==> The percentage of CPU time used by CELLSRV.
CL_CPUT_MS ==> The percentage of CPU time used by MS.
CL_FANS ==> The instantaneous number of working fans on the cell.
CL_FSUT ==> The percentage of total file system space that is currently in use. This metric shows the space utilization of the various file systems on the cell.
CL_MEMUT_CS ==> The percentage of physical memory used by CELLSRV.
CL_MEMUT_MS ==> The percentage of physical memory used by MS.
CL_RUNQ ==> The average number (over the preceding minute) of processes in the Linux run queue.
CL_SWAP_IN_BY_SEC ==> The number of swap pages read in KB per second.
CL_SWAP_OUT_BY_SEC ==> The number of swap pages written in KB per second.
CL_TEMP ==> The instantaneous temperature (Celsius) of the server, provided by the Baseboard Management Controller (BMC). On a VM it shows 0.0 C.
CL_VIRTMEM_CS ==> The amount of virtual memory used by CELLSRV in MB.
CL_VIRTMEM_MS ==> The amount of virtual memory used by MS in MB.
CL_MEMUT ==> The percentage of total physical memory used on the cell. On my test VM, with 2 GB allocated to the cell, it showed 99% used.
IORM_MODE ==> I/O Resource Management objective for the cell.
N_NIC_NW ==> The instantaneous number of non-working interconnections.
N_NIC_RCV_SEC ==> The rate which is the total number of I/O packets received by interconnections per second. Not shown on a VM.
N_NIC_TRANS_SEC ==> The rate which is the total number of I/O packets transmitted by interconnections per second. Not shown on a VM.

Cell Disk Metrics
Provide information about the I/O load for cell disks, such as the number of large blocks read from a cell disk.
CD_BY_FC_DIRTY ==> The number of data bytes in flash cache that are not synchronized to the cell disk.
CD_IO_BY_R_LG ==> The cumulative number of MB read in large blocks from a cell disk.
CD_IO_BY_R_LG_SEC ==> The rate which is the number of MB read in large blocks per second from a cell disk.
CD_IO_BY_R_SM ==> The cumulative number of MB read in small blocks from a cell disk.
CD_IO_BY_R_SM_SEC ==> The rate which is the number of MB read in small blocks per second from a cell disk.
CD_IO_BY_W_LG ==> The cumulative number of MB written in large blocks on a cell disk.
CD_IO_BY_W_LG_SEC ==> The rate which is the number of MB written in large blocks per second on a cell disk.
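
The cumulative metrics and their _SEC counterparts are related: a rate is the change in a cumulative counter divided by the elapsed time. The Python sketch below illustrates that relationship with two samples of a cumulative metric such as CD_IO_BY_R_LG. It is not how the cell computes CD_IO_BY_R_LG_SEC internally. The function name and the sample numbers are made up, and treating a decreasing counter as a reset is my own assumption.

```python
from datetime import datetime
from typing import Optional


def rate_per_second(prev_value: float, prev_time: datetime,
                    curr_value: float, curr_time: datetime) -> Optional[float]:
    """Derive an average rate from two samples of a cumulative metric."""
    elapsed = (curr_time - prev_time).total_seconds()
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = curr_value - prev_value
    if delta < 0:
        # Assumption: a smaller value means the counter was reset,
        # so no meaningful rate can be derived from this pair.
        return None
    return delta / elapsed


# Hypothetical samples: 1000 MB, then 1600 MB one minute later
t0 = datetime.fromisoformat("2013-10-12T09:10:51-07:00")
t1 = datetime.fromisoformat("2013-10-12T09:11:51-07:00")
print(rate_per_second(1000.0, t0, 1600.0, t1))  # 10.0 (MB per second)
```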


Flash Cache Metrics
Provide information about the utilization of Flash Cache, such as the number of MB read per second from Flash Cache.
FC_BY_DIRTY ==> The number of data bytes in flash cache that are not synchronized to the grid disks.
FC_BY_STALE_DIRTY ==> The number of data bytes in flash cache which cannot be synchronized because the cached disks are not accessible.
FC_BY_USED ==> The number of MB used on flash cache.
FC_BYKEEP_OVERWR ==> The number of MB pushed out of flash cache because of the space limit for keep objects.
FC_BYKEEP_OVERWR_SEC ==> The number of MB per second pushed out of flash cache because of space limit for keep objects.
FC_BYKEEP_USED ==> The number of MB used for keep objects on Flash Cache.
FC_IO_BY_R ==> The number of MB read from Flash Cache.
FC_IO_BY_R_MISS ==> The number of MB read from disks because not all requested data was in Flash Cache.
FC_IO_BY_R_MISS_SEC ==> The rate which is the number of MB read from disks per second because not all requested data was in Flash Cache.
FC_IO_BY_R_SEC ==> The rate which is the number of MB read per second from Flash Cache.
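
A rough read hit ratio (by volume) can be derived from FC_IO_BY_R and FC_IO_BY_R_MISS. This ratio is my own derived figure, not a metric reported by the cell, and it assumes both values were taken from the same collection and are in MB. The function name and the numbers are illustrative only.

```python
from typing import Optional


def flash_read_hit_ratio(fc_io_by_r_mb: float,
                         fc_io_by_r_miss_mb: float) -> Optional[float]:
    """Fraction of read MB served from Flash Cache (0.0 to 1.0)."""
    total = fc_io_by_r_mb + fc_io_by_r_miss_mb
    if total == 0:
        # No reads recorded yet, so the ratio is undefined
        return None
    return fc_io_by_r_mb / total


# Hypothetical values: 900 MB read from flash, 100 MB read from disk
ratio = flash_read_hit_ratio(900.0, 100.0)
print(f"{ratio:.0%}")  # 90%
```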

Exadata Smart Flash Log Metrics
Provide information about flash log utilization, such as the number of MB written per second.
FL_ACTUAL_OUTLIERS ==> This metric shows the number of redo writes written to flash and disk that exceeded the outlier threshold.
FL_BY_KEEP ==> This metric shows the number of redo data bytes saved on flash due to disk I/O errors.
FL_DISK_FIRST ==> This metric shows the number of redo writes first written to disk.
FL_DISK_IO_ERRS ==> This metric shows the number of disk I/O errors encountered by Oracle Exadata Smart Flash Log.
FL_EFFICIENCY_PERCENTAGE ==> This metric shows the efficiency of Oracle Exadata Smart Flash Log expressed as a percentage.
FL_EFFICIENCY_PERCENTAGE_HOUR ==> This metric shows the efficiency of Oracle Exadata Smart Flash Log over the past hour expressed as a percentage.

Grid Disk Metrics
Provide information about the I/O load for grid disks, such as the number of large blocks read from a grid disk.
GD_BY_FC_DIRTY ==> The number of data bytes cached in flash cache that are not synchronized to the grid disk.
GD_IO_BY_R_LG ==> The cumulative number of MB read in large blocks from a grid disk.
GD_IO_BY_R_LG_SEC ==> The rate which is the number of MB read in large blocks per second from a grid disk.

Host Interconnection Metrics
Provide information about the I/O transmission for hosts that access cell storage.
N_MB_SENT ==> The cumulative number of MB transmitted to a particular host.
N_MB_DROP ==> The cumulative number of MB dropped during transmission to a particular host.

IORM Metrics
IORM uses the database name, not the database identifier. The category metrics below provide information about the size of the I/O load from each category specified in the current IORM category plan.
CT_FC_IO_BY_SEC ==> This metric shows the number of megabytes of I/O per second for this category to flash cache.
CT_FC_IO_RQ ==> This metric shows the number of I/O requests issued by an IORM category to flash cache.
CT_FC_IO_RQ_SEC ==> This metric shows the number of I/O requests issued by an IORM category to flash cache per second.
CT_FD_IO_BY_SEC ==> This metric shows the number of megabytes of I/O per second for this category to flash disks.

IORM Utilization Metrics
When OLTP and DSS workloads share Exadata Cells, IORM determines whether to optimize for low latency or high throughput. To optimize for low latency, large I/O requests should be distributed so the disk is not fully utilized. To optimize for high throughput, each Exadata Cell must handle many concurrent large I/O requests, allowing the cell to be fully utilized while applying optimization algorithms. However, when a cell has many concurrent large I/O requests, I/O latency is high because each I/O is queued behind many other I/Os.
The utilization metrics for I/O requests from databases and consumer groups correspond to the amount of time a database or consumer group utilized a cell. Large I/O requests utilize more of a cell than small I/O requests. The following are the utilization metrics for determining IORM optimization:
■CG_IO_UTIL_LG
■CG_IO_UTIL_SM
■CT_IO_UTIL_LG
■CT_IO_UTIL_SM
■DB_IO_UTIL_LG
■DB_IO_UTIL_SM
By comparing the amount of I/O resources consumed with the I/O resource allocations, the database administrator can determine if IORM should be tuned for latency or throughput. The IORM metric, IORM_MODE, shows the mode for IORM. The metric value ranges between 1 and 3. The following are the definitions for the values:
■1 means the cell IORM objective was set to low_latency.
■2 means the cell IORM objective was set to balanced.
■3 means the cell IORM objective was set to high_throughput.
A value in between 1-2 or 2-3 indicates the IORM objective was not the same throughout the metric period, and the value indicates proximity to a given objective. It is also indicative of a constantly-changing mix of workloads.
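
The interpretation of IORM_MODE described above can be sketched as a small Python helper. This is only an illustration of the mapping (1 = low_latency, 2 = balanced, 3 = high_throughput, fractional values = mixed). The function name and the wording of the returned text are my own, not anything produced by CellCLI.

```python
OBJECTIVES = {1: "low_latency", 2: "balanced", 3: "high_throughput"}


def describe_iorm_mode(value: float) -> str:
    """Translate an IORM_MODE metric value into a readable description."""
    if not 1 <= value <= 3:
        raise ValueError("IORM_MODE is expected to be between 1 and 3")
    lower = int(value)  # truncation equals floor for positive values
    if value == lower:
        return OBJECTIVES[lower]
    upper = lower + 1
    # A fractional value means the objective changed during the period;
    # report which of the two neighbouring objectives it is closer to.
    nearest = lower if value - lower < 0.5 else upper
    return (f"mixed between {OBJECTIVES[lower]} and {OBJECTIVES[upper]}, "
            f"closest to {OBJECTIVES[nearest]}")


print(describe_iorm_mode(2))    # balanced
print(describe_iorm_mode(1.2))  # mixed between low_latency and balanced, closest to low_latency
```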
