在用top可以看到cpu,内存使用情况,
top
top - 19:48:16 up 140 days, 18:09,7 users,load average: 0.08, 0.15, 0.17
Tasks: 259 total,1 running, 257 sleeping,0 stopped,1 zombie
Cpu(s):0.0%us,0.0%sy,0.0%ni, 99.5%id,0.5%wa,0.0%hi,0.0%si,0.0%st
Mem:1921024k total,1910328k used,10696k free,17036k buffers
Swap:2097144k total,132612k used,1964532k free,1268668k cached
PID USERPRNIVIRTRESSHR S %CPU %MEMTIME+COMMAND
22533 oracle14-1960m 560m 524m S0 29.90:00.60 oracle
30735 oracle14-1912m 371m 366m S0 19.80:49.88 oracle
发现两个问题:1内存剩余不足。
2 PID为30735持续了49分钟,且占用系统内存19.8% 。
随后查看告警日志,发出内存空间不够,用如个脚本来跑
oracle@ccdatabase01 ~]$ cat s.sh
#!/bin/ksh
sqlplus -S /nolog <<EOF
connect system/$2;
col sid format 99999
col serial# format 999999
col username format a10
col program format a32
set lines 1000
set pages 1000
set trimspool on
select sid,serial#,username,program,
to_char(logon_time,'yyyy/mm/dd hh24:mi:ss') as login_time
from v\$session
where paddr in (select addr from v\$process where spid=$1);
--CURRENT SQL
select sql_id,sql_text
fromv\$sqltext_with_newlines
where hash_value=(
select sql_hash_value
fromv\$session
wherepaddr=(select addr from v\$process where spid=$1))
order by piece;
--PREV SQL
select sql_id,sql_text
fromv\$sqltext_with_newlines
where hash_value=(
select prev_hash_value
fromv\$session
wherepaddr=(select addr from v\$process where spid=$1))
order by piece;
exit
EOF
[oracle@ccdatabase01 ~]$ sh s.sh 30735 oracle
SID SERIAL# USERNAMEPROGRAMLOGIN_TIME
------ ------- ---------- -------------------------------- -------------------
96716 DBOICCorbaAgent.exe2013/05/28 18:38:06
no rows selected
SQL_IDSQL_TEXT
------------- ----------------------------------------------------------------
bvq9u4cfkqxr4 select deviceno, sum(abandcount) abandcountfrom (select devi
bvq9u4cfkqxr4 ceno, count(1) as abandcountfrom tCurrentBillLogwhere
bvq9u4cfkqxr4devicetype = 2and callidnum = -1and to_char(callb
bvq9u4cfkqxr4 egin,'YYYY.MM.DD')=to_char(sysdate,'YYYY.MM.DD')and calle
bvq9u4cfkqxr4 nd<=callbegingroup by devicenounion allselect devic
bvq9u4cfkqxr4 eno, count(1) as abandcountfrom tbilllog5where device
bvq9u4cfkqxr4 type = 2and callidnum = -1and to_char(callbegin,'Y
bvq9u4cfkqxr4 YYY.MM.DD')=to_char(sysdate,'YYYY.MM.DD')and callend<=cal
bvq9u4cfkqxr4 lbegingroup by deviceno)group by deviceno
查看该语句执行计划:
Execution Plan
| Id| Operation| Name| Rows| Bytes | Cost (%CPU)| Time|
| 0 | SELECT STATEMENT||||1786 (100)||
|1 |HASH GROUP BY||50 |1300 |1786(2)| 00:00:22 |
|2 |VIEW||50 |1300 |1785(2)| 00:00:22 |
|3 |UNION-ALL||||||
|4 |HASH GROUP BY||1 |27 |16(7)| 00:00:01 |
|5 |TABLE ACCESS BY INDEX ROWID| TCURRENTBILLLOG|1 |27 |15(0)| 00:00:01 |
|6 |INDEX RANGE SCAN| INDCURRENTBILLLOG_DEVICETYPE |99 ||7(0)| 00:00:01 |
|7 |HASH GROUP BY||49 |1323 |1769(2)| 00:00:22 |
|8 |TABLE ACCESS FULL| TBILLLOG5|105 |2835 |1768(2)| 00:00:22 |
执行计划时对表TBILLLOG5进行全表扫描,所有时间为22s,接下来做如下操作:
查看两个表TCURRENTBILLLOG和TBILLLOG5大小。
SQL> select count(*)from dbo.tcurrentbilllog;
COUNT(*)
293
Statistics
----------------------------------------------------------
0recursive calls
0db block gets
8357consistent gets
0physical reads
0redo size
516bytes sent via SQL*Net to client
492bytes received via SQL*Net from client
SQL> select * from dba_segments where segment_name='TCURRENTBILLLOG';
表TCURRENTBILLLOG有260M,决定对查询语句中的两个表进行碎片整理和整理空间。
SQL> alter table dbo.tbilllog5 enable row movement;
SQL> alter table dbo.tbilllog5 shrink space cascade;
SQL> alter table dbo.tcurrentbilllog enable row movement;
SQL> alter table dbo.tcurrentbilllog shrink space cascade;
2 对表TBILLLOG5建索引,再做查询时走索引。
select deviceno, count(1) as abandcount from dbo.tbilllog5
where devicetype = 2 and callidnum = -1 and
to_char(callbegin, 'YYYY.MM.DD')=to_char(sysdate, 'YYYY.MM.DD')
7* and callend<=callbegin group by deviceno) group by deviceno
根据查询条件中的三个字段,分别找出三个字段的数值:
SQL> select devicetype,count(*) from dbo.tbilllog5 group by devicetype;
DEVICETYPECOUNT(*)
---------- ----------
14719
63
2189389
56
3313817
1* select to_char(callbegin, 'YYYY.MM.DD'),count(*) from dbo.tbilllog5 group by to_char(callbegin,'YYYY.MM.DD')
SQL> /
TO_CHAR(CACOUNT(*)
---------- ----------
2011.05.204498
2011.05.234970
2011.05.284468
2012.05.078700
2012.05.235202
2012.05.253360
2012.05.294388
2013.05.06991
2013.05.173154
2013.05.121404
2013.05.274743
SQL> select callidnum,count(*) from dbo.tbilllog5 group by callidnum
CALLIDNUMCOUNT(*)
---------- ----------
1156531
616
22612
4140
589
83
32028
-1346502
710
根据以上在个条件,建立组合索引:
create index dbo.indbillog5_callbegin on dbo.tbilllog5 (to_char(callbegin, 'YYYY.MM.DD'),devicetype) tablespace ts_dbo nologging compute statistics;
最后查询的为
再次查看整个语句的awr sql report:
Stat NameStatementPer Execution % Snap
---------------------------------------- ---------- -------------- -------
Elapsed Time (ms)1160.50.2
CPU Time (ms)1160.50.4
Executions244N/AN/A
Buffer Gets5,77023.60.2
Disk Reads20.00.0
Parse Calls2441.01.4
Rows4882.0N/A
User I/O Wait Time (ms) 0N/AN/A
Cluster Wait Time (ms)0N/AN/A
Application Wait Time (ms)0N/AN/A
Concurrency Wait Time (ms)0N/AN/A
Invalidations0N/AN/A
Version Count1N/AN/A
Sharable Mem(KB)35N/AN/A
-------------------------------------------------------------
Execution Plan
----------------------------------------------------------------------------------------------------------------
| Id| Operation| Name| Rows| Bytes | Cost (%CPU)| Time|
----------------------------------------------------------------------------------------------------------------
|0 | SELECT STATEMENT||||56 (100)||
|1 |HASH GROUP BY||59 |1534 |56(6)| 00:00:01 |
|2 |VIEW||59 |1534 |55(4)| 00:00:01 |
|3 |UNION-ALL||||||
|4 |HASH GROUP BY||1 |27 |8(13)| 00:00:01 |
|5 |TABLE ACCESS BY INDEX ROWID| TCURRENTBILLLOG|1 |27 |7(0)| 00:00:01 |
|6 |INDEX RANGE SCAN| INDCURRENTBILLLOG_DEVICETYPE |13 ||4(0)| 00:00:01 |
|7 |HASH GROUP BY||58 |1566 |47(3)| 00:00:01 |
|8 |TABLE ACCESS BY INDEX ROWID| TBILLLOG5|616 | 16632 |46(0)| 00:00:01 |
|9 |INDEX RANGE SCAN| INDBILLOG5_CALLBEGIN|760 ||3(0)| 00:00:01 |
可以看到现在执行语句的buffer gets时间为1S。
接下来转载一个是涉及到开发导致CPU过高的:
昨天搞定了一个十万火急的issue,客户抱怨产品升级后系统会变慢和CPU使用率相当高,客户脾气很大,声称不尽快解决这个问题就退货,弄得我们 R&D压力很大,解决这个issue的任务分给了我,客户是南非的一个公司,由于时差问题,我只好在家远程解决问题,晚上8点半用 gotomeeting远程到客户电脑查看我们的系统,折腾了四个多小时,终于在凌晨时reproduce了这个high CPU,赶紧抓Log,用wireshark抓包,用gcore,gstack,strace和top保存了系统的相关输出。在第2天分析了这些文件后,找到了产品的bug,代码的作者分配了10K的缓冲区,并想当然认为10K足以够用,当然99%的情况下是够用的,但是在这1%的情况下出现了问题,缓冲区不幸被写满了,然后程序进入了死循环,导致high CPU。找到了问题了,fix就很容易了,客户的脾气也缓和了,fix很快就可以deliver给客户。反思解决问题的过程,觉得这个分析过程具有可复用性,值得总结一下。1.用top命令查看哪个进程占用CPU高gateway网关进程14094占用CPU高达891%,这个数值是进程内各个线程占用CPU的累加值。
PID USER PR NI VIRT RES SHR S %CPU MEM TIME+ COMMAND
14086root 25 0 922m 914m 538m R 10110.0 21:35.46 gateway
14087 root 25 0 922m 914m 538m R 10110.0 10:50.22 gateway
14081 root 25 0 922m 914m 538m S 9910.0 8:57.36 gateway
14082 root 25 0 922m 914m 538m R 9910.0 11:51.92 gateway
14089 root 25 0 922m 914m 538m R 9910.0 21:21.77 gateway
14092 root 25 0 922m 914m 538m R 9910.0 19:55.47 gateway
14094 root 25 0 922m 914m 538m R 9910.0 21:02.21 gateway
14083 root 25 0 922m 914m 538m R 9710.0 21:32.39 gateway
14088 root 25 0 922m 914m 538m R 97 10.0 11:23.12 gateway% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------------------
99.99 22.683879 3385 6702 poll
0.00 0.001132 0 6702 gettimeofday
0.00 0.000127 1 208 208 accept
0.00 0.000022 22 1 read
0.00 0.000000 0 1 write
0.00 0.000000 0 1 close
0.00 0.000000 0 14 time
0.00 0.000000 0 2 stat64
0.00 0.000000 0 4 clock_gettime
0.00 0.000000 0 7 send
0.00 0.000000 0 10 10 recvfrom
------ ----------- ----------- --------- --------- ------------------------------
100.00 22.685160 13652 218 total