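Notes on scraping a desktop application
Yesterday I finally managed to scrape data out of a GPS application. This post records how it went.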
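At first, while running Fiddler, I happened to notice the application's requests.
So I clicked around in the program to trigger a request and confirmed that it does use plain HTTP. I looked at the headers first. There are no cookies, and the request looks like this: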
POST http://*****/page/webservices/TRWebService.asmx HTTP/1.1
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; MS Web Services Client Protocol 2.0.50727.8669)
Content-Type: text/xml; charset=utf-8
SOAPAction: "http://www.*****.cn/New_GetVehicleminiInfo"
Host: *****
Content-Length: 404
The SOAPAction header distinguishes which kind of information is being requested.
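To make the role of SOAPAction concrete, here is a minimal Python sketch that builds (but does not send) such a request. The envelope follows the usual SOAP 1.1 layout; the function name `build_soap_request`, the `example.invalid` namespace and the sample argument values are placeholders of mine, and the real element names have to be taken from the service description.

```python
from xml.sax.saxutils import escape

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_soap_request(namespace, method, params):
    """Return (headers, body) for a SOAP 1.1 POST; nothing is sent."""
    # One child element per parameter, in insertion order, XML-escaped.
    args = "".join(
        f"<{name}>{escape(str(value))}</{name}>" for name, value in params.items()
    )
    body = (
        '<?xml version="1.0" encoding="utf-8"?>'
        f'<soap:Envelope xmlns:soap="{SOAP_ENV}">'
        f'<soap:Body><{method} xmlns="{namespace}">{args}</{method}></soap:Body>'
        "</soap:Envelope>"
    ).encode("utf-8")
    headers = {
        "Content-Type": "text/xml; charset=utf-8",
        # The quoted SOAPAction value selects the remote method.
        "SOAPAction": f'"{namespace}{method}"',
    }
    return headers, body


# Placeholder namespace and credentials, for illustration only.
headers, body = build_soap_request(
    "http://example.invalid/",
    "New_GetVehicleInfo",
    {"strsvr": "s", "userid": "u1", "pwd": "a&b"},
)
print(headers["SOAPAction"])
```

The pair can then be passed to any HTTP client, for example `urllib.request.Request(url, data=body, headers=headers)`.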
I tried opening the URL in a browser and got a response.
I searched it for New_GetVehicleInfo and took a look.
The data in the response is of type base64Binary:
H4sIAAAAAAAEAOy9B2AcSZYlJi9tynt/SvVK1+B0oQiAYBMk2JBAEOzBiM3mkuwdaUcjKasqgcplVmVdZhZAzO2dvPfee++999577733ujudTif33/8/XGZkAWz2zkrayZ4hgKrIHz9+fB8/Ih7/Gr/Hr/Hu11j8GuWvkf4al79G/mvUv0bzaxS/RvVrLH+Nz36Nj36N3V9j/Gvs0M+Uvln+GlP6fEbfLn+NC/7281/jya+x92vco1Z79Nfv8Wsc/RqP6f2K/mv59zP6mRNsQP799e2c3/z3v3/8a7yiN+/9Gp/S/wEf31XUNud3am714NfYJ8hjarHzazzkVjW1yAiDkto03Eawm9GnrcLeo892f42DX2Obf27T3/JuRlhc/Bq/6NdY02+f8SfSL6C5v+++F+Yp/fcbvPmL/9jf4f820L7eKLpj+BdexEaR/hq/7A9Oe+MD3Nj45FM3Pvn7duN7QKN7sGFmdunbhzSmT5kO8TH93MyMwfw2M3ObUXTH8HM3M3/di+Nf4+mv8W3qDSMcHhOk8T7zm7T6uZ+ZEPPbzczNo+iO4f8NM/NwozbbYT3g5u//TTMDzM28fMgIuvj/3M3K4R+waSx7NIL7BGlMvx2w/fp/w2juevbz/wkAAP//Bpo5rZoHAAA=
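In hindsight, the leading characters of this value are already a hint. A small Python sketch (the helper name `looks_like_gzip` is mine) decodes just the start of the string and checks for the two-byte gzip magic number, 0x1f 0x8b:

```python
import base64

GZIP_MAGIC = b"\x1f\x8b"


def looks_like_gzip(b64_text):
    """True if the base64 text decodes to bytes that start with the gzip magic."""
    # Four base64 characters decode to three bytes, enough to see the magic.
    head = base64.b64decode(b64_text[:4])
    return head.startswith(GZIP_MAGIC)


prefix = "H4sIAAAAAAAEAOy9"  # first 16 characters of the value above
print(looks_like_gzip(prefix))
print(base64.b64decode(prefix).hex())
```

A match does not prove the payload is unencrypted, but it is a cheap check to run before looking for ciphers.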
After searching online for a while I found a similar case:
http://bbs.****.net/topics/390408695
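The conclusion there was that the data is encrypted and that the encryption routine has to be found.
There was also another thread: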
http://www.newsmth.net/nForum/#!article/WebDev/24420
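I could not work out how that one decoded the data.
In the end I roughly understood it as: an encryption algorithm is in use, ** is exchanged at login, and all later data is encrypted before it is sent.
That left capturing packets with Wireshark, so as not to miss any other kind of request, but I found nothing there. With no other option, I had to read the source.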
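I decompiled the program into C# source with JetBrains dotPeek. I was sure it was C# because the installation directory contains a large number of DLLs, some of them third-party, and a quick search on those confirms they are .NET libraries.
In the source I searched for cryptography-related names such as DES, AES and MD5, and eventually found one encryption method: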
public static string StrEncode(string data)
{
    // DES key and IV, both built from ASCII strings (masked here)
    byte[] bytes1 = Encoding.ASCII.GetBytes("********");
    byte[] bytes2 = Encoding.ASCII.GetBytes("********");
    DESCryptoServiceProvider cryptoServiceProvider = new DESCryptoServiceProvider();
    int keySize = cryptoServiceProvider.KeySize;
    MemoryStream memoryStream = new MemoryStream();
    CryptoStream cryptoStream = new CryptoStream((Stream) memoryStream, cryptoServiceProvider.CreateEncryptor(bytes1, bytes2), CryptoStreamMode.Write);
    StreamWriter streamWriter = new StreamWriter((Stream) cryptoStream);
    streamWriter.Write(data);
    streamWriter.Flush();
    cryptoStream.FlushFinalBlock();
    streamWriter.Flush();
    // The ciphertext is returned as a base64 string
    return Convert.ToBase64String(memoryStream.GetBuffer(), 0, (int) memoryStream.Length);
}
I then searched for this method name and found only one statement that calls it:
GlobalTool.SetCfgValue("mapType", Convert.ToString(Convert.ToUInt32(GlobalTool.gi_MapType)));
string text = this.txtName.Text;
string appValue = GlobalTool.StrEncode(this.txtPassword.Text);
if (!this.cbSavePassword.Checked)
{
    appValue = "";
}
Clearly this only encrypts the password; no other data goes through it.
Changing tack, I looked for the request method instead. Searching for New_GetVehicleInfo led to the method that calls it:
public override void loadUserVehicleInfo(string svrid, string userid, string userpassword)
{
    try
    {
        DateTime now = DateTime.Now;
        byte[] compressedData = this.m_WebSvc.New_GetVehicleInfo(svrid, userid, userpassword);
        TimeSpan span = (TimeSpan) (DateTime.Now - now);
        Console.Out.WriteLine("New_GetVehicleInfo:" + span.TotalMilliseconds.ToString());
        string strxml = GlobalTool.GzipDeCompress(compressedData);
        span = (TimeSpan) (DateTime.Now - now);
        Console.Out.WriteLine("GlobalTool.GZipDeCompress:" + span.TotalMilliseconds.ToString());
        if (strxml.Length > 0)
        {
            this.ReaderVehiceInfoXML(strxml);
        }
        span = (TimeSpan) (DateTime.Now - now);
        Console.Out.WriteLine("ReaderVehiceInfoXML:" + span.TotalMilliseconds.ToString());
    }
    catch (Exception exception)
    {
        this.ErrorInfo("loadUserVehicleInfo", exception.ToString());
    }
}
Next, the definition of New_GetVehicleInfo itself:
public byte[] New_GetVehicleInfo(string strsvr, string userid, string pwd)
{
    return (byte[]) base.Invoke("New_GetVehicleInfo", new object[] { strsvr, userid, pwd })[0];
}
Then I searched for GzipDeCompress:
public static string GzipDeCompress(byte[] compressedData)
{
    StringBuilder stringBuilder = new StringBuilder();
    int num = 0;
    byte[] buffer = compressedData;
    byte[] numArray = new byte[4096];
    Stream stream = (Stream) new GZipStream((Stream) new MemoryStream(buffer), CompressionMode.Decompress);
    while (true)
    {
        int count = stream.Read(numArray, 0, numArray.Length);
        if (count > 0)
        {
            num += count;
            stringBuilder.Append(Encoding.Unicode.GetString(numArray, 0, count));
        }
        else
            break;
    }
    stream.Close();
    return stringBuilder.ToString();
}
At this point it was clear that the data is not encrypted at all: it is gzip-compressed and then base64-encoded. So I tried an online tool,
http://www.bejson.com/enc/base64/
to base64-decode it, but nothing readable came out. I then tried to gunzip the data directly, which failed with an error.
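I installed VS2008 and tried decompressing with the GzipDeCompress method above, which also raised an error.
Finally I decoded the data with the base64 package and then decompressed it with gzip, which at last produced a result.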
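The two steps that worked, base64-decoding first and gunzipping second, can be sketched in Python roughly as follows. The function name `unpack_payload` and the sample payload are made up for illustration. Encoding the sample as UTF-16LE is an assumption that mirrors the Encoding.Unicode call in GzipDeCompress.

```python
import base64
import gzip


def unpack_payload(b64_text):
    """base64-decode the response text, then gunzip it; returns raw bytes."""
    compressed = base64.b64decode(b64_text)
    return gzip.decompress(compressed)


# Round trip on a made-up sample (not real service data).
sample_xml = '<root><Item odometer="1.5" /></root>'
sample_b64 = base64.b64encode(
    gzip.compress(sample_xml.encode("utf-16-le"))
).decode("ascii")

raw = unpack_payload(sample_b64)
print(raw[:12])
```

For ASCII characters, UTF-16LE stores a zero byte after every character, so printing these bytes as if they were single-byte text shows a gap after each letter. The real response, printed as text after these two steps, looked like this: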
< ? x m l v e r s i o n = " 1 . 0 " e n c o d i n g = " G B 2 3 1 2 " ? > < r o o t > < I t e m v _ c o d e = " �]A R 3 3 6 3 " o d o m e t e r = " 7 4 2 . 3 0 9 " r m a i l e s = " 0 " d a t e = " 2 0 1 8 - 0 1 - 2 9 " a v g q u a = " " o i l = " " / > < I t e m v _ c o d e = " �]A R 3 3 6 3 T��� " o d o m e t e r = " 7 4 2 . 3 0 9 " r m a i l e s = " " d a t e = " �N2 0 1 8 - 0 1 - 2 9 � 2 0 1 8 - 0 1 - 3 0 " a v g q u a = " 0 " o i l = " 0 " / > < I t e m v _ c o d e = " �]A R 7 6 7 3 " o d o m e t e r = " 1 7 9 . 6 3 6 " r m a i l e s = " 0 " d a t e = " 2 0 1 8 - 0 1 - 2 9 " a v g q u a = " " o i l = " " / > < I t e m v _ c o d e = " �]A R 7 6 7 3 T��� " o d o m e t e r = " 1 7 9 . 6 3 6 " r m a i l e s = " " d a t e = " �N2 0 1 8 - 0 1 - 2 9 � 2 0 1 8 - 0 1 - 3 0 " a v g q u a = " 0 " o i l = " 0 " / > < I t e m v _ c o d e = " �NA D H 9 6 7 " o d o m e t e r = " 1 1 2 5 . 3 7 " r m a i l e s = " 0 " d a t e = " 2 0 1 8 - 0 1 - 2 9 " a v g q u a = " " o i l = " " / > < I t e m v _ c o d e = " �NA D H 9 6 7 T��� " o d o m e t e r = " 1 1 2 5 . 3 7 " r m a i l e s = " " d a t e = " �N2 0 1 8 - 0 1 - 2 9 � 2 0 1 8 - 0 1 - 3 0 " a v g q u a = " 0 " o i l = " 0 " / > < I t e m v _ c o d e = " �NA D H 9 9 3 " o d o m e t e r = " 7 0 2 . 9 6 7 " r m a i l e s = " 0 " d a t e = " 2 0 1 8 - 0 1 - 2 9 " a v g q u a = " " o i l = " " / > < I t e m v _ c o d e = " �NA D H 9 9 3 T���" o d o m e t e r = " 7 0 2 . 9 6 7 " r m a i l e s = " " d a t e = " �N2 0 1 8 - 0 1 - 2 9 � 2 0 1 8 - 0 1 - 3 0 " a v g q u a = " 0 " o i l = " 0 " / > < I t e m v _ c o d e = " ;`���" o d o m e t e r = " 2 7 5 0 . 2 8 2 " r m a i l e s = " " d a t e = " �N2 0 1 8 - 0 1 - 2 9 � 2 0 1 8 - 0 1 - 3 0 " a v g q u a = " 0 " o i l = " 0 " / > < / r o o t >
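In the decoded output the characters are spaced too far apart and the Chinese text is garbled. Looking at the response content,
one encoding named there is utf-8 and the other, in the XML declaration, is gb2312. I tried decoding with each and got an error, then tried gbk and got another error. Out of ideas, I searched online.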
https://stackoverflow.com/questions/4735566/python-unicode-problem
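To my surprise, the encoding turned out to be UTF-16. In hindsight this matches the Encoding.Unicode call in GzipDeCompress, since Encoding.Unicode is .NET's UTF-16 little-endian encoding. Decoded that way, the output is: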
<?xml version="1.0" encoding="GB2312"?><root><Item v_code="川AR33**" odometer="742.309" rmailes="0" date="2018-01-29" avgqua=" " oil=" " /><Item v_code="川AR33**
合计: " odometer="74**09" rmailes="" date="从2018-01-29 至 2018-01-30" avgqua="0" oil="0" /><Item v_code="川AR76**"
odometer="17**36" rmailes="0" date="2018-01-29" avgqua=" " oil=" " /><Item v_code="川AR76**
合计: " odometer="17**36" rmailes="" date="从2018-01-29 至 2018-01-30" avgqua="0" oil="0" /><Item v_code="京ADH9**"
odometer="112**7" rmailes="0" date="2018-01-29" avgqua=" " oil=" " /><Item v_code="京ADH9**
合计: " odometer="112**7" rmailes="" date="从2018-01-29 至 2018-01-30" avgqua="0" oil="0" /><Item v_code="京ADH9**"
odometer="70**67" rmailes="0" date="2018-01-29" avgqua=" " oil=" " /><Item v_code="京ADH9**
合计:" odometer="70**67" rmailes="" date="从2018-01-29 至 2018-01-30" avgqua="0" oil="0" /><Item v_code="总计:" odometer="275**82"
rmailes="" date="从2018-01-29 至 2018-01-30" avgqua="0" oil="0" /></root>
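As a closing sketch, the decompressed bytes can be decoded as UTF-16LE and parsed with the standard library. The function name `parse_items` and the sample record (including the plate number) are made up. The XML declaration is stripped before parsing because it names GB2312, which no longer describes the text once it has been decoded to a Python string.

```python
import re
import xml.etree.ElementTree as ET


def parse_items(raw):
    """Decode UTF-16LE bytes and return the attributes of every <Item>."""
    text = raw.decode("utf-16-le")
    # Drop the XML declaration; its encoding name does not apply to a str.
    text = re.sub(r"^\s*<\?xml[^>]*\?>", "", text)
    root = ET.fromstring(text)
    return [dict(item.attrib) for item in root.iter("Item")]


# Made-up record shaped like the output above.
sample = (
    '<?xml version="1.0" encoding="GB2312"?><root>'
    '<Item v_code="川A00000" odometer="12.5" date="2018-01-29" />'
    "</root>"
).encode("utf-16-le")

rows = parse_items(sample)
for row in rows:
    print(row["v_code"], float(row["odometer"]))
```

Attribute values arrive as strings, so numeric fields such as odometer still need an explicit conversion, and empty or blank values (as in avgqua and oil above) need their own handling.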
And with that, the job was done.