Parsing content that is not inside HTML tags with Nokogiri
<form method="post" action="/M740/Biography/History/Drama/12+Years+a+Slave">
<input type="image" src="/public_site/webroot/cache/imdb/2024544_100.jpg" width="100" style="float:right;margin-left:2px;">
<strong><span style="color: rgb(255, 69, 0);">12 Years a Slave</span></strong>
<br>
In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.<br>
<br><strong>Century Cinemax - Junction</strong><br>
<a href="tel:0774136246">0774136246</a>
<a href="tel:0208022073">0208022073</a>
<br>
12:10, 19:10, 21:40<br>
<br><strong>Fox Cineplex Sarit</strong><br>
<a href="tel:0203753025">0203753025</a>
<a href="tel:0720366208">0720366208</a>
<br>
11:00, 14:00, 18:00, 20:40<br>
<br><strong>Planet Media - Kisumu </strong><br>
<a href="tel:0731999100">0731999100</a>
<a href="tel:0724999100 & 0202629388">0724999100 & 0202629388</a>
<br>
12:00, 14:30, 20:30<br>
<br>
<input type="hidden" name="cinema" value="0">
<input type="hidden" name="searchMovie" value="0">
<input type="hidden" name="movie" value="740">
<input type="hidden" name="date" value="0">
<input type="hidden" name="groupId" value="0">
<input type="submit" name="ok" value="Further Details">
</form>
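Well, this is just part of the HTML I'm trying to parse with Nokogiri. The HTML isn't semantically complete, and I'm struggling to get the content I want out of it with Nokogiri. For reference, this is the site I want to scrape: http://flix.co.ke/Frontpage/Listings
So far I've been able to get the movie title, one cinema, and two phone numbers, but with my approach I can't really get all the content I need.
Here is what I'm using: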
require 'rubygems'
require 'nokogiri'
require 'open-uri'
url = "http://flix.co.ke/Frontpage/Listings"
# URI.open is required on Ruby 3.0+, where a bare open(url) no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
doc.css(".min-width div form").each do |entry|
  title = entry.at_css("span").text
  puts title
  cinema = entry.at_css("br+ strong").text
  puts cinema
  phone = entry.at_css("a").text
  puts phone
  puts entry.at_css("a").next_element.text
end
With my current script I can only get the movie title, one cinema, and two contact numbers, so my sample output looks like this:
12 Years a Slave
Century Cinemax - Junction
0774136246
0208022073
47 Ronin 3D
Century Cinemax - Junction
0774136246
0208022073
Delivery Man
Century Cinemax - Junction
0774136246
0208022073
Frozen
Century Cinemax - Junction
0774136246
0208022073
(continued...)
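There is also the description, which sits right after the break tag that follows the title, and I can't get to it. And how do I loop through all the cinemas inside the form tag, along with the comma-separated phone numbers and the individual show times?
I just don't know where to start. For this case I'd like to end up with a result like this:
12 Years a Slave
In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
- Century Cinemax - Junction 12:10, 19:10, 21:40
- Fox Cineplex Sarit 11:00, 14:00, 18:00, 20:40
etc
Any help would be appreciated. Thanks in advance.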
The HTML really isn't that bad, and you're on the right track with br + strong. That is what you want to iterate over to loop through the cinemas:
doc.search('.min-width div form').each do |form|
  title = form.at('span').text
  description = form.at('br').next.text
  form.search('br + strong').each do |el|
    cinema = el.text
    phones = []
    while next_el = el.at('+ a', '+ br + a')
      el = next_el
      phones << el.text
    end
    times = el.at('+ br').next.text
  end
end
I can't stress enough how helpful this is. Thanks a bunch! ;-)
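This is horrible HTML :/ It is invalid, with 451 errors and 9 warnings. There are no semantics, so you have to rely on structure that may change and break your scraper.
However, you can get each description by using the sibling methods: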
doc.css('.min-width div form').each do |node|
  description = node.at_css('br').next_sibling.text
  puts description.strip
  puts '-'*10
end
# >> In the antebellum United States, Solomon Northup, a free black man from upstate New York, is abducted and sold into slavery.
# >> ----------
# >> A band of samurai set out to avenge the death and dishonor of their master at the hands of a ruthless shogun.
# >> ----------
# >> An affable underachiever finds out he's fathered 533 children through anonymous donations to a fertility clinic 20 years ago. Now he must decide whether or not to come forward when 142 of them file a lawsuit to reveal his identity.
# >> ----------
# >> Fearless optimist Anna teams up with Kristoff in an epic journey, encountering Everest-like conditions, and a hilarious snowman named Olaf in a race to find Anna's sister Elsa, whose icy powers have trapped the kingdom in eternal winter.
# >> ----------
# >> A medical engineer and an astronaut work together to survive after an accident leaves them adrift in space.
# >> ----------
# >> A pair of aging boxing rivals are coaxed out of retirement to fight one final bout -- 30 years after their last match.
# >> ----------
# >>
# >> ----------
# >> Harrison, overworked and underpaid is looking for money for bride price. A 'business' opportunity presents itself when he gets the keys to the Company house. With the CEO away on holiday, he has access to a vacant fully furnished house. He ...
# >> ----------
# >>
# >> ----------
# >> A chronicle of Nelson Mandela's life journey from his childhood in a rural village through to his inauguration as the first democratically elected president of South Africa.
# >> ----------
# >> Author P. L. Travers reflects on her difficult childhood while meeting with filmmaker Walt Disney during production for the adaptation of her novel, Mary Poppins.
# >> ----------
# >> The Manzoni family, a notorious mafia clan, is relocated to Normandy, France under the witness protection program, where fitting in soon becomes challenging as their old habits die hard.
# >> ----------
# >> The dwarves, along with Bilbo Baggins and Gandalf the Grey, continue their quest to reclaim Erebor, their homeland, from Smaug. Bilbo Baggins is in possession of a mysterious and magical ring.
# >> ----------
# >> The film begins as Katniss Everdeen has returned home safe after winning the 74th Annual Hunger Games along with fellow tribute Peeta Mellark. Winning means that they must turn around and leave their family and close friends, embarking on a ...
# >> ----------
# >> A day-dreamer escapes his anonymous life by disappearing into a world of fantasies filled with heroism, romance and action. When his job along with that of his co-worker are threatened, he takes action in the real world embarking on a global ...
# >> ----------
# >> Faced with an enemy that even Odin and Asgard cannot withstand, Thor must embark on his most perilous and personal journey yet, one that will reunite him with Jane Foster and force him to sacrifice everything to save us all.
# >> ----------
# >> A journey into the lives of a mother polar bear and her two seven-month-old cubs as they navigate the changing Arctic wilderness they call home.
# >> ----------
# >> See and feel what it was like when dinosaurs ruled the Earth, in a story where an underdog dino triumphs to become a hero for the ages.
# >> ----------
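Use css instead of at_css to get every match rather than just the first (the same way you loop through the form elements, for example).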
Much better! – Bala
Include a valid HTML fragment rather than an extract. As it stands, we have to jump through hoops to help you.