
Case analysis: server responds to ping, but SSH suddenly cannot connect

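   This morning, SSH connections to one of our Oracle database servers briefly stopped working. While SSH was failing, a psping check of port 22 still looked normal (it only returned 5 packets; I did not run a continuous test), and SQL Developer could log in to the database and perform any operation. The DPA tool also showed that CPU and other resource usage on the server was very low. Having confirmed that the database services were all fine, I went out for lunch. By the time I got back, a colleague reported that SSH was working again, so I had missed the best window for diagnosis. In the meantime, another colleague had also run some checks: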

He found that ping was normal, but psping against port 8088 showed very high latency and even timeouts. He took screenshots for comparison, shown below.

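ping uses ICMP, a network-layer protocol, so a successful ping only shows that the network is reachable at layer 3. Tomcat is an application-layer service running over TCP, so it can be slow or unreachable even while ping succeeds.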
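To illustrate the difference between a layer-3 ping and a TCP-level check, here is a minimal sketch of a repeated TCP connect test, roughly what psping does against a port. The function name tcp_probe is hypothetical, and the sketch assumes that bash (with /dev/tcp support) and the coreutils timeout command are available on the machine running the check.

```shell
# tcp_probe HOST PORT [ATTEMPTS] [TIMEOUT_SECONDS]
# Tries to open a TCP connection several times and reports how many succeeded.
# Assumes bash with /dev/tcp support and coreutils `timeout` are installed.
tcp_probe() {
    probe_host=$1; probe_port=$2
    probe_tries=${3:-5}; probe_wait=${4:-3}
    probe_ok=0; probe_i=0
    while [ "$probe_i" -lt "$probe_tries" ]; do
        probe_i=$((probe_i + 1))
        # bash opens the socket; `timeout` kills it if the handshake hangs
        if timeout "$probe_wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' \
                "$probe_host" "$probe_port" 2>/dev/null; then
            probe_ok=$((probe_ok + 1))
        fi
    done
    echo "$probe_ok/$probe_tries TCP connections to $probe_host:$probe_port succeeded"
    # Exit status is 0 only if every attempt succeeded
    [ "$probe_ok" -eq "$probe_tries" ]
}
```

For example, tcp_probe mylnx01 22 20 2 would make 20 attempts against port 22 with a 2-second limit each. A host that answers ping but fails some of these attempts points at a problem above layer 3.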

 

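After lunch, SSH logins to the server worked again. Checking the sshd process showed that it had been running for more than two hundred days, which means the sshd service had neither died nor been restarted.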

Use ps -ef | grep sshd to find the PID of the sshd process, then run the following command:

[root@mylnx01 ~]# ps -eo pid,lstart,etime | grep 3423
 
 3423 Sun Feb 18 13:56:11 2018 234-09:01:48
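The etime column uses the format [[DD-]hh:]mm:ss, which is awkward to compare against a threshold. As a small sketch (the helper name etime_to_seconds is hypothetical), the value can be converted to seconds with awk:

```shell
# etime_to_seconds ETIME
# Converts a ps etime value ([[DD-]hh:]mm:ss) into a number of seconds.
etime_to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        n = NF
        s = $n + 60 * $(n - 1)               # seconds and minutes are always present
        if (n >= 3) s += 3600 * $(n - 2)     # hours, if present
        if (n >= 4) s += 86400 * $(n - 3)    # days, if present
        print s
    }'
}

etime_to_seconds 234-09:01:48   # prints 20250108
```

Any value far larger than the time since the incident confirms that the process was not restarted during the outage.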

Checking the log, I found several "Did not receive identification string from xxx" messages (some information has been masked).

[root@mylnx01 log]# tail -100 /var/log/secure
Oct  8 14:50:48 mylnx01 sshd[4341]: pam_unix(sshd:session): session opened for user oracle by (uid=0)
Oct  8 14:50:49 mylnx01 sshd[4341]: pam_unix(sshd:session): session closed for user oracle
Oct 10 12:26:41 mylnx01 sshd[742]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[743]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[790]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[789]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[745]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[744]: Did not receive identification string from 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[1007]: Connection closed by 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[1006]: Connection closed by 192.168.xxx.xxx
Oct 10 12:26:41 mylnx01 sshd[746]: Did not receive identification string from 192.168.xxx.xxx
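When the log is long, it can help to count these messages per minute and per source address to see whether they cluster around the outage. The following is a sketch only: the function name count_missing_ident is hypothetical, and it assumes the syslog line format shown above, where the source address is the last field (some newer OpenSSH versions append a port, which would need a different field).

```shell
# count_missing_ident [FILE...]
# Counts "Did not receive identification string" lines per minute and source.
# Reads the named files, or standard input when no file is given.
count_missing_ident() {
    awk '/Did not receive identification string from/ {
        # $1 = month, $2 = day, $3 = hh:mm:ss, last field = source address
        key = $1 " " $2 " " substr($3, 1, 5) " " $NF
        count[key]++
    }
    END { for (k in count) print count[k], k }' "$@" | sort -rn
}
```

Usage: count_missing_ident /var/log/secure prints one line per minute and source, with the busiest first.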

I searched for information about this error. It generally occurs for the following reason:

This one below means ssh server waited and did not receive what it needed in a timely fashion. This is typically due to connectivity issues. In an ssh connection, the server first provides its identification string, then waits for the client to then provide its identification string. If there is a loss in connection, or the client just bails, this is what you will see in the logs.
If someone uses telnet or netcat to fetch your ssh banner, or other various scans, the logs on the server side will show this as well.
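The first step of the exchange described above can be observed from a client with a simple banner probe. This is a minimal sketch with a hypothetical function name, probe_ssh_banner; it assumes bash with /dev/tcp support and the coreutils timeout command. Because the probe reads the server's identification string and then disconnects without sending its own, each run is itself expected to leave a "Did not receive identification string" entry in the server log.

```shell
# probe_ssh_banner HOST [PORT] [TIMEOUT_SECONDS]
# Connects to the SSH port and waits for the server's identification string.
# Assumes bash with /dev/tcp support and coreutils `timeout` are installed.
probe_ssh_banner() {
    banner=$(timeout "${3:-5}" bash -c \
        'exec 3<>"/dev/tcp/$0/$1" && IFS= read -r line <&3 && printf "%s\n" "$line"' \
        "$1" "${2:-22}" 2>/dev/null | tr -d '\r')
    case $banner in
        SSH-*) echo "banner received: $banner" ;;
        *)     echo "no SSH banner from $1:${2:-22} within ${3:-5}s"; return 1 ;;
    esac
}
```

If the TCP connection opens but no banner arrives before the timeout, the delay lies between the TCP handshake and sshd's first response, which is useful to know when ping and the port check both look healthy.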

 

Summary:

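This error message means that the SSH server waited but did not receive what it needed in time. It is usually caused by connectivity problems. In an SSH connection, the server first sends its identification string and then waits for the client to send its own. If the connection is lost, or the client simply gives up, this message appears in the log.
 A network problem is the prime suspect, but I have no detailed network monitoring data at hand to prove it, only circumstantial evidence: there have been network problems recently, and the day before yesterday an even more serious one was observed. The network administrator asked the provider about it, but they could not tell us anything, because they are not responsible for this part.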