Extracting text from elements with specific attributes

Problem description:

I want to write an XPath that selects p[position() > 1 and position() < last() - 1] under div[@class="content"] and extracts the text.

So far I have this, which should cover the <h3>, <ul> and <p> tags and their nested nodes:

//div[@class="content"]/*[self::h3 or self::ul or self::p[position() > 1 and position() < last() - 1]]//text() 

But it doesn't work.

Here is the HTML: https://gist.github.com/umrashrf/5167711


I am working in 'firebug + firepath'. Have you tried 'import lxml.html'? – kev 2013-03-15 06:27:40

OK, your XML is not well-formed, so I fixed that first.

<?xml version="1.0" encoding="UTF-8"?> 
<div class="content"> 
<h1/> 
<h2> 
    <p>Certified Nursing Assistant - Full Time</p> 
Job Summary</h2> 
<p>Responsible for providing personal care and assistance for residents in long  
term care facility.</p> 
<h2> 
</h2> 
<h3>Essential Functions:</h3> 
<ul> 
    <li> 
     <span style="line-height: 1.5;">Responsible</span> for providing 
personal care and assistance to residents </li> 
    <li>Assist residents in and out of bed, dressing, feeding, grooming and 
personal hygiene. </li> 
    <li>Provide basic treatments as required and directed by nursing staff. 
</li> 
    <li>Responsible for observing and reporting changes in residents' physical 
and emotional conditions to charge nurse. </li> 
</ul> 
<h3>Qualifications: </h3> 
<p>Education:</p> 
<ul> 
    <li>High school diploma or equivalent </li> 
    <li>Successful completion of state approved certified nursing assistance 
course </li> 
</ul> 
<p>Experience:</p> 
<ul> 
    <li>Previous health care related experience preferred </li> 
</ul> 
<a id="ctl00_ctl01_namelink" class="btn" href="employment-application.aspx?positionid=34">Apply Online</a> 
<br/> 
<br/> 
<h2> 
Apply in Person</h2> 
<p> 
To apply in persion please stop by Shenandoah Medical Center to pick up a job 
application.</p> 
<h2> 
Apply by Mail</h2> 
<p> 
To apply by mail, download and print <a target="_blank" href="/filesimages/Careers/SMC Employment Application.pdf"> 
    this form</a>. Please fill out the application and then mail to:<br/> 
    <br/> 
    <strong>Shenandoah Medical Center, Human Resources<br/> 
    </strong>300 Pershing Avenue<br/> 
Shenandoah, IA 51601</p> 
</div> 

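Now, if I understand your question correctly, you want to find all h3, ul and p tags that are children of div[@class="content"], and each selected child must satisfy the condition [position() > 1 and position() < last() - 1]. For that, I think this single XPath will do it: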

//div[@class="content"]/h3[position() > 1 and position() < last() - 1] |   
//div[@class="content"]/p[position() > 1 and position() < last() - 1] | 
//div[@class="content"]/ul[position() > 1 and position() < last() - 1]
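
To see what that predicate actually selects, here is a minimal sketch that emulates it in plain Python. It uses the standard-library xml.etree.ElementTree rather than a full XPath engine, because, as far as I know, ElementTree's XPath subset does not accept comparisons such as position() > 1. The SAMPLE document and the helper names select_middle and extract_text are made up for illustration; they are not from the question or the gist.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document (NOT the page from the question).
SAMPLE = """
<div class="content">
  <p>p1</p>
  <h3>h3-1</h3>
  <p>p2</p>
  <ul><li>ul-1</li></ul>
  <p>p3</p>
  <p>p4</p>
  <p>p5</p>
</div>
"""


def select_middle(parent, tag):
    """Emulate tag[position() > 1 and position() < last() - 1] on the child axis.

    position() and last() are counted among the direct children of
    `parent` that have this tag name, not among all children.
    """
    same_tag = parent.findall(tag)  # direct children named `tag`, in document order
    last = len(same_tag)
    return [el for pos, el in enumerate(same_tag, start=1) if 1 < pos < last - 1]


def extract_text(root):
    """Collect whitespace-normalised text of the selected h3, p and ul children."""
    results = []
    for div in root.iter("div"):  # includes the root element itself
        if div.get("class") != "content":
            continue
        # Unlike an XPath union, which yields nodes in document order,
        # this loop groups the results by tag name.
        for tag in ("h3", "p", "ul"):
            for el in select_middle(div, tag):
                results.append(" ".join("".join(el.itertext()).split()))
    return results


root = ET.fromstring(SAMPLE)
texts = extract_text(root)
print(texts)
```

Because the positions are counted per tag name, a tag that occurs three times or fewer among the div's children contributes nothing: with last() = 3 the condition becomes position() > 1 and position() < 2, which no position satisfies. In the sample above only the p elements are numerous enough to match.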