Analysis of a large number of CLOSE_WAIT connections on CentOS when .NET Core uses HttpWebRequest for HTTP communication (resolved)
Environment:
dotnet core 1.0.1
CentOS 7.2
During a routine server inspection today, I found one service throwing a large number of exceptions.
The exception message was:
LockStatusPushError&&Message:One or more errors occurred. (An error occurred while sending the request. Too many open files)&InnerMessageAn error occurred while sending the request. Too many open files&
   at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at CommonHelper.HttpHelper.HttpRequest(String Url, String Method, String ContentType, Byte[] data, Encoding encoding)
   at CommonHelper.HttpHelper.PostForm(String Url, Dictionary`2 para, Encoding encoding)
   at CommonHelper.HttpHelper.PostForm(String Url, Dictionary`2 para)
   at DeviceService.Program.LockStatusPushMethod()
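My first guess was that the program had opened too many files (sockets or pipes count as files too) and had exceeded the system limit.
Running ulimit -n showed a limit of 65535, which is a normal value. (Note that ulimit -n reports the per-process limit of the current shell session, not a system-wide total.)
Running lsof | wc -l to count the currently open files was extremely slow, and the result showed more than 5 million entries.
I then checked the entries belonging to the dotnet program and found more than 4 million.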
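A figure in the millions sits oddly next to a limit of 65535. One likely explanation: the listing further down carries two numeric columns after the command name (PID and TID), which suggests lsof printed one row per thread, so each descriptor is counted many times. A minimal sketch for cross-checking, assuming a Linux /proc filesystem; the function names count_fds and max_open_files are made up for this example:

```shell
#!/bin/sh
# Count the file descriptors a process currently holds.
# /proc/<pid>/fd has one entry per descriptor, so threads do not inflate the result.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print the soft "Max open files" limit that applies to a running process.
# This is the limit of that process, which may differ from `ulimit -n` in your shell.
max_open_files() {
    awk '/^Max open files/ { print $4 }' "/proc/$1/limits"
}
```

For the process in this post the calls would be count_fds 12208 and max_open_files 12208. If the first number is close to the second, the "Too many open files" error is explained without needing the lsof total.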
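I ran lsof>>/tmp/lsof.log to save the current list of open files for later analysis.
After exporting the list, I found that the dotnet program held a large number of socket connections in the CLOSE_WAIT state, all with port 80 of the HTTP server the program calls as their destination: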
dotnet 12208 20425 root 216r FIFO 0,8 0t0 2273974 pipe
dotnet 12208 20425 root 217w FIFO 0,8 0t0 2273974 pipe
dotnet 12208 20425 root 218u IPv4 2274459 0t0 TCP txk-web:44336->txk-web:http (CLOSE_WAIT)
dotnet 12208 20425 root 219r FIFO 0,8 0t0 2274460 pipe
dotnet 12208 20425 root 220w FIFO 0,8 0t0 2274460 pipe
dotnet 12208 20425 root 221u IPv4 2271144 0t0 TCP txk-web:44340->txk-web:http (CLOSE_WAIT)
dotnet 12208 20425 root 222r FIFO 0,8 0t0 2273977 pipe
dotnet 12208 20425 root 223w FIFO 0,8 0t0 2273977 pipe
dotnet 12208 20425 root 224u IPv4 2274462 0t0 TCP txk-web:44344->txk-web:http (CLOSE_WAIT)
dotnet 12208 20425 root 225r FIFO 0,8 0t0 2271147 pipe
dotnet 12208 20425 root 226w FIFO 0,8 0t0 2271147 pipe
dotnet 12208 20425 root 227u IPv4 2272624 0t0 TCP txk-web:44348->txk-web:http (CLOSE_WAIT)
dotnet 12208 20425 root 228r FIFO 0,8 0t0 2272625 pipe
dotnet 12208 20425 root 229w FIFO 0,8 0t0 2272625 pipe
dotnet 12208 20425 root 230u IPv4 2273985 0t0 TCP txk-web:44352->txk-web:http (CLOSE_WAIT)
dotnet 12208 20425 root 231r FIFO 0,8 0t0 2271150 pipe
dotnet 12208 20425 root 232w FIFO 0,8 0t0 2271150 pipe
dotnet 12208 20425 root 233u IPv4 2272627 0t0 TCP txk-web:44356->txk-web:http (CLOSE_WAIT)
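With a log this large, a per-state tally is easier to read than the raw rows. The following is a rough sketch that assumes the state is printed in parentheses at the end of each TCP row, as in the excerpt above; tcp_states is a name invented here, not an lsof feature:

```shell
#!/bin/sh
# Tally the TCP rows of an lsof listing (read from stdin) by connection state.
# Rows without " TCP " or without a trailing "(STATE)" are ignored.
tcp_states() {
    awk '/ TCP / && match($0, /\([A-Z_0-9]+\)$/) {
             state = substr($0, RSTART + 1, RLENGTH - 2)
             count[state]++
         }
         END { for (s in count) print count[s], s }' | sort -rn
}
```

It would be run as tcp_states < /tmp/lsof.log. If lsof listed one row per thread, the absolute numbers are inflated, so they are best read as proportions between states.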
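This narrowed the cause down to the HTTP calls.
Looking at the program's log next, I found that the HTTP endpoint the program has to call was returning 500 errors.
After an error the program retries the request (the business logic requires a retry). The retry interval was 100 ms, which is too short, so a large number of requests were issued in a short time.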
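First, a word on CLOSE_WAIT:
When the peer closes the connection first, our side receives its FIN and enters the CLOSE_WAIT state. The socket stays in that state until our own program closes the connection, so our side has to close it for the connection to shut down properly.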
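My initial list of possible causes:
1. The program does not release resources after an exception is thrown
2. A bug in the lower layers of dotnet core
3. The nginx proxy forcibly closes my connection without sending me the close acknowledgement
4. HTTP request timeouts (very unlikely, since the HTTP endpoint is on the same machine)
I started with the code. My HTTP helper method looked like this: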
private static byte[] HttpRequest(string Url, string Method, string ContentType, byte[] data, Encoding encoding)
{
    WebResponse response = null;
    HttpWebRequest request = null;
    byte[] result = null;
    try
    {
        request = (HttpWebRequest)WebRequest.Create(Url);
        request.Headers["UserAgent"] = @"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)";
        request.Accept = @"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
        request.Method = Method;
        request.ContentType = ContentType;
        if (data != null)
        {
            var reqStreamAsync = request.GetRequestStreamAsync();
            //reqStreamAsync.Wait();
            using (Stream reqStream = reqStreamAsync.Result)
            {
                reqStream.Write(data, 0, data.Length);
                reqStream.Dispose();
            }
        }
        var reqAsync = request.GetResponseAsync();
        //reqAsync.Wait();
        using (response = reqAsync.Result)
        {
            using (Stream stream = response.GetResponseStream())
            {
                List<byte> byteArr = new List<byte>();
                int tmp = -1;
                while ((tmp = stream.ReadByte()) >= 0)
                {
                    byteArr.Add((byte)tmp);
                }
                result = byteArr.ToArray();
                stream.Dispose();
            }
            response.Dispose();
        }
    }
    catch (Exception ex)
    {
        throw;
    }
    finally
    {
        if (request != null)
        {
            request.Abort();
            request = null;
        }
        if (response != null)
        {
            response.Dispose();
            response = null;
        }
    }
    return result;
}
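Looking at the code, my first thought was that HttpWebRequest was neither wrapped in a using block nor disposed.
When I tried to fix that, though, I found that the class does not implement IDisposable at all, so it cannot be released manually.
After searching on Baidu, the conclusion was that the only option is to call Abort(). I added that to the finally block, added a Dispose() call for the WebResponse while I was at it, and tried again -------- no effect.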
Next I edited /etc/sysctl.conf on CentOS
and added the following keepalive settings as an experiment:
net.ipv4.tcp_keepalive_time=60
net.ipv4.tcp_keepalive_probes=2
net.ipv4.tcp_keepalive_intvl=2
Then I reloaded the configuration with sysctl -p and tried again -------- the problem persisted.
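After that I suspected again that the program was not releasing the HttpWebRequest,
so I added GC.Collect() to the finally block of the HTTP method, hoping to force a collection -------- still no use.
In the end I gave up looking for the root cause and simply added a delay at the retry point: if the HTTP request fails, Thread.Sleep(10000);
This works around the problem for now.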
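A fixed 10-second sleep is one way to slow the retries down. A common alternative is to start from the original 100 ms and double the wait after each failure, up to a cap. The sketch below only illustrates that delay schedule and is not what the program above does. The 100 ms and 10000 ms defaults come from this post, while the doubling rule and the name backoff_ms are assumptions made for the example:

```shell
#!/bin/sh
# Print the delay in milliseconds to wait before retry number $1 (0-based).
# The delay is base * 2^attempt, capped at a maximum.
#   $2 = base delay in ms (default 100), $3 = cap in ms (default 10000)
backoff_ms() {
    attempt=$1
    base_ms=${2:-100}
    cap_ms=${3:-10000}
    delay=$base_ms
    i=0
    # Stop doubling once the cap is reached, so large attempt numbers cannot overflow.
    while [ "$i" -lt "$attempt" ] && [ "$delay" -lt "$cap_ms" ]; do
        delay=$((delay * 2))
        i=$((i + 1))
    done
    if [ "$delay" -gt "$cap_ms" ]; then
        delay=$cap_ms
    fi
    echo "$delay"
}
```

With the defaults, backoff_ms 0 prints 100 and backoff_ms 3 prints 800, and from the eighth attempt on the value stays at 10000. The same arithmetic could feed the Thread.Sleep call in the retry loop.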
The problem was never properly solved.
If anyone knows the root cause, I would be glad to discuss it. Thanks.
Update 2017.04.07
Today I switched to HttpClient for the HTTP communication
and found that the problem went away.
The code is below. Corrections are welcome.
private async static Task<byte[]> HttpRequest(string Url, HttpMethodEnum HttpMethod, string ContentType, byte[] data)
{
    byte[] result = null;
    try
    {
        using (HttpClient http = new HttpClient())
        {
            http.DefaultRequestHeaders.Add("User-Agent", @"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)");
            http.DefaultRequestHeaders.Add("Accept", @"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
            HttpResponseMessage message = null;
            if (HttpMethod == HttpMethodEnum.POST)
            {
                using (Stream dataStream = new MemoryStream(data ?? new byte[0]))
                {
                    using (HttpContent content = new StreamContent(dataStream))
                    {
                        content.Headers.Add("Content-Type", ContentType);
                        message = await http.PostAsync(Url, content);
                    }
                }
            }
            else if (HttpMethod == HttpMethodEnum.GET)
            {
                message = await http.GetAsync(Url);
            }
            if (message != null && message.StatusCode == System.Net.HttpStatusCode.OK)
            {
                using (message)
                {
                    using (Stream responseStream = await message.Content.ReadAsStreamAsync())
                    {
                        if (responseStream != null)
                        {
                            byte[] responseData = new byte[responseStream.Length];
                            responseStream.Read(responseData, 0, responseData.Length);
                            result = responseData;
                        }
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        throw;
    }
    return result;
}