How to access the contents of HTML tags in Bash using XMLStarlet

Question:

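I'm learning how to use XMLStarlet to access the contents of HTML tags in Bash. As an example, I'm trying to extract some text from the page www.wisdomofchopra.com/iframe.php. I'm having trouble specifying the "address" (the XPath) of the HTML content for XMLStarlet, and I'd appreciate some help. My attempt is as follows: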

URL="http://www.wisdomofchopra.com/iframe.php" 
webPage="$(curl -s "${URL}")" 
echo "${webPage}" | xmlstarlet sel -T -t -c "//html/body//table/tr/td[@id='quote']/header/h2/" 

This produces the following output:

-:29.12: Opening and ending tag mismatch: meta line 5 and head 
    </head> 
     ^
-:35.100: Entity 'nbsp' not defined 
te"><header><h2>&quot;Emotional intelligence is beyond total reality&quot;&nbsp; 
                      ^
-:35.106: Entity 'nbsp' not defined 
eader><h2>&quot;Emotional intelligence is beyond total reality&quot;&nbsp;&nbsp; 
                      ^
-:41.119: EntityRef: expecting ';' 
witter.com/intent/tweet?original_referer=http%3A%2F%2Fwww.wisdomofchopra.com&via 
                      ^
-:41.139: EntityRef: expecting ';' 
eet?original_referer=http%3A%2F%2Fwww.wisdomofchopra.com&via=WisdomOfChopra&text 
                      ^
-:41.196: EntityRef: expecting ';' 
via=WisdomOfChopra&text=%27Emotional+intelligence+is+beyond+total+reality%27&url 
                      ^
-:52.169: EntityRef: expecting ';' 
));document.write(' src="http://ads.adbrite.com/mb/text_group.php?sid=2171164&zs 
                      ^
-:52.186: EntityRef: expecting ';' 
(' src="http://ads.adbrite.com/mb/text_group.php?sid=2171164&zs=3436385f3630&ifr 
                      ^
-:52.209: EntityRef: expecting ';' 
ite.com/mb/text_group.php?sid=2171164&zs=3436385f3630&ifr='+AdBrite_Iframe+'&ref 
                      ^
-:53.99: EntityRef: expecting ';' 
p" href="http://www.adbrite.com/mb/commerce/purchase_form.php?opid=2171164&afsid 
                      ^
-:57.9: Opening and ending tag mismatch: head line 3 and html 
</html> 
    ^
-:58.1: Premature end of data in tag html line 2 

Edit: for convenience, here is some HTML that is roughly equivalent to the web page:

<!DOCTYPE html> 
<html> 
    <head> 
    </head> 
    <body> 
     <h3>Your random fictional Deepak Chopra quote:</h3> 
     <table border="0" cellspacing="0" cellpadding="0"> 
      <tr> 
       <td width="128" align="left" valign="top"><img src="img/imageSmall2.png" width="80" height="80" /></td> 
       <td id="quote"><header><h2>&quot;Perceptual reality serves total truth&quot;&nbsp;&nbsp;</h2></header></td> 
      </tr> 
     </table> 
    </body> 
</html> 

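Thanks for the suggestion. I don't think that's the problem. I created markup with a hierarchy similar to the example page and I still run into similar problems. I've added this simplified version to the post. – d3pd 2014-09-23 20:32:55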

I couldn't get XMLStarlet to process the HTML, so I just used grep and AWK instead:

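printDeepakChopraAdvice(){
    # Fetch the page, keep the line that holds the quote cell, and print
    # the text between the first pair of &quot; entities on that line.
    local URL="http://www.wisdomofchopra.com/iframe.php"
    local webPage text
    webPage="$(curl -s "${URL}")"
    text="$(echo "${webPage}" | grep 'id="quote"' | awk -F'&quot;' '{print $2}')"
    echo "${text}"
}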
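
For what it's worth, the errors in the question look like XML well-formedness complaints (unclosed meta tags, the undefined nbsp entity, bare & characters in URLs), so one possible XMLStarlet-only route is to let its formatter parse the page as HTML first and then run the selection on the cleaned-up result. The following is a minimal sketch under that assumption, not a tested fix for the live site: the function name extract_quote and the variable sample_html are made up for illustration, the input is the simplified page from the question rather than a real request, and the XPath is written without the trailing slash used in the original attempt.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the original post): reads HTML on stdin and
# prints the text of the <h2> inside the table cell with id="quote".
extract_quote() {
    # Step 1: "fo --html --recover" parses the input as HTML, tolerates
    #         errors, and re-emits it as well-formed XML. Parser warnings
    #         are discarded.
    # Step 2: "sel -T -t -v" prints the text value of the selected node,
    #         and "-n" appends a newline.
    xmlstarlet fo --html --recover 2>/dev/null \
        | xmlstarlet sel -T -t -v "//td[@id='quote']//h2" -n
}

# Simplified page from the question, used here instead of a live request.
sample_html='<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
     <h3>Your random fictional Deepak Chopra quote:</h3>
     <table border="0" cellspacing="0" cellpadding="0">
      <tr>
       <td width="128" align="left" valign="top"><img src="img/imageSmall2.png" width="80" height="80" /></td>
       <td id="quote"><header><h2>&quot;Perceptual reality serves total truth&quot;&nbsp;&nbsp;</h2></header></td>
      </tr>
     </table>
    </body>
</html>'

printf '%s\n' "$sample_html" | extract_quote
```

If this behaves as expected on the sample, the same function could presumably be fed from the network with something like curl -s "${URL}" | extract_quote. The looser //td[@id='quote']//h2 path is a deliberate choice: it avoids depending on the exact nesting that the HTML parser produces for the surrounding table.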