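Parsing with the scala-xml API using multiple attributes

Problem description:

I have an XML document that I am trying to process with the Scala XML API. I have XPath-style queries to retrieve data from the XML tags. I want to retrieve the <price> value from <market>, but selecting on two attributes, _id and type. I want to write an && condition so that I get a single value for each price tag, for example where market _id = 1 && type = "A".

For reference, here is the XML: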

<publisher> 
    <book _id = "0"> 
     <author _id="0">Dev</author> 
     <publish_date>24 Feb 1995</publish_date> 
     <description>Data Structure - C</description> 
     <market _id="0" type="A"> 
      <price>45.95</price>    
     </market> 
     <market _id="0" type="B"> 
      <price>55.95</price> 
     </market> 
    </book> 
    <book _id="1"> 
     <author _id = "1">Ram</author> 
     <publish_date>02 Jul 1999</publish_date> 
     <description>Data Structure - Java</description> 
     <market _id="1" type="A"> 
      <price>145.95</price>   
     </market> 
     <market _id="1" type="B"> 
      <price>155.95</price>   
     </market> 
    </book> 
</publisher>

The code below runs without errors:

import scala.xml._ 

object XMLtoCSV extends App { 

    val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml") 

    val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text // wanted 45.95, but yields 45.9555.95 
    val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text // wanted 155.95, but yields 145.95155.95 

    println("price = " + price) 
    println("price1 = " + price1) 
}

The output is:

price = 45.9555.95 
price1 = 145.95155.95

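My code above returns both values for each book, because I could not work out how to express the && condition.

Please advise which Scala function I can use instead of filter.
Also, let me know how to get all the attribute names.
If possible, please tell me where I can read up on all of these APIs.

Thanks in advance.

Answer

You can write a custom predicate that checks multiple attributes: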

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = { 
    node.attribute("_id").exists(_.text == marketId) && 
    node.attribute("type").exists(_.text == marketType) 
}

Then use it as a filter:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text 
// 45.95 

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text 
// 155.95
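
For documents where each tag carries a different set of attributes, the same idea can be generalised so that one predicate covers every case. The following is only a sketch under stated assumptions: `hasAttributes` and `attributeNames` are hypothetical helper names, not part of scala-xml, and the sample document is a cut-down copy of the XML from the question. It relies on `\@`, which returns an empty string when the attribute is missing, and on `MetaData.asAttrMap` to list the attribute names of a node.

```scala
import scala.xml._

object AttributeFilterSketch {
  // Hypothetical helper (not part of scala-xml): true when the node carries
  // every listed attribute with exactly the given value.
  // Note: `\@` returns "" for a missing attribute, so an expected value of ""
  // would also match nodes that lack the attribute.
  def hasAttributes(expected: (String, String)*)(node: Node): Boolean =
    expected.forall { case (name, value) => (node \@ name) == value }

  // Hypothetical helper: the names of all attributes present on a node, sorted.
  def attributeNames(node: Node): Seq[String] =
    node.attributes.asAttrMap.keys.toSeq.sorted

  // Cut-down sample based on the XML in the question.
  val sample: Elem = XML.loadString(
    """<book _id="1">
      |  <market _id="1" type="A"><price>145.95</price></market>
      |  <market _id="1" type="B"><price>155.95</price></market>
      |</book>""".stripMargin)

  // Select the single market that matches both attributes, then read its price.
  val priceB: String =
    ((sample \ "market").filter(hasAttributes("_id" -> "1", "type" -> "B")) \ "price").text

  // Attribute names of the first <market> element.
  val names: Seq[String] = attributeNames((sample \ "market").head)
}
```

Because the attribute pairs are passed as varargs, the same predicate works for tags with one attribute or with six, for example `hasAttributes("_id" -> "0")`.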

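I appreciate your solution, but can we do it without writing a function? Is there any Scala function that fits this case? –

One more thing: I shared a sample XML with you, but my real XML is very large. With almost 200 tags, that would mean writing 200 functions, because the attributes differ from tag to tag, anywhere from one to six attributes. I think I would have to write six functions and change the parameters. –

@PardeepSharma Ask another question with a sample of some of those tags. – ashawley

Answer

If you are interested in getting the data as a CSV-style file, it could be written like this: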

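// xmlLoad is the document loaded in the question
(xmlLoad \ "book").flatMap { bk =>
    (bk \ "market").flatMap { mkt =>
        (mkt \ "price").map { p =>
            // one row per price: book id, market id, market type, price
            Seq(
                bk \@ "_id",
                mkt \@ "_id",
                mkt \@ "type",
                p.text.toFloat
            )
        }
    }
}.map { cols =>
    cols.mkString("\t")
}.foreach {
    println
}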

It outputs the following:

0  0  A  45.95 
0  0  B  55.95 
1  1  A  145.95 
1  1  B  155.95

A common pattern worth recognizing when writing Scala: most flatMap, flatMap ... map chains can be rewritten as for-comprehensions:

// xmlLoad is the document loaded in the question
for {
    book <- xmlLoad \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
    val cols = Seq(
        book \@ "_id",
        market \@ "_id",
        market \@ "type",
        price.text.toFloat
    )
    println(cols.mkString("\t"))
}
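
A for-comprehension also accepts `if` guards, which gives a way to express the && condition from the question without a separate predicate function. This is a minimal sketch, assuming a cut-down copy of the question's XML; `GuardedForSketch`, `doc` and `pricesFor` are illustrative names, not part of scala-xml.

```scala
import scala.xml._

object GuardedForSketch {
  // Cut-down copy of the XML from the question.
  val doc: Elem = XML.loadString(
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin)

  // Hypothetical helper: prices of markets whose _id equals the book id
  // and whose type equals marketType, using guards instead of filter.
  def pricesFor(bookId: String, marketType: String): List[String] =
    (for {
      book   <- doc \ "book" if (book \@ "_id") == bookId
      market <- book \ "market" if (market \@ "_id") == bookId && (market \@ "type") == marketType
      price  <- market \ "price"
    } yield price.text).toList
}
```

For example, `GuardedForSketch.pricesFor("1", "B")` selects only the price under the market with `_id="1"` and `type="B"`, and an id that does not occur in the document yields an empty list.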

Answer

I used Spark with HiveContext, and with that I was able to evaluate the XPath.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{ 

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils") // Path for my winutils.exe 

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]") 
    val sc = new SparkContext(sparkConf) 
    val hiveContext = new HiveContext(sc) 
    val myXmlPath = "D:\\IBM\\DB\\xml" 
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") //XmlFileUtil - this is a private class in scala hence I created a Java class to use it. 

    import hiveContext.implicits._ 

    val xmlDf = xmlRDDList.toDF("tempXMLTable") 
    xmlDf.registerTempTable("tempTable") 

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()  

    /* Output 
     +------+------+ 
     |BookId| Price| 
     +------+------+ 
     |  0| 55.95| 
     |  1|155.95| 
     +------+------+ 
    */ 
}

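This is unrelated to the original question, which is about parsing XML with scala-xml, not about XPath in Spark. – ashawley

I offered an alternative; I did not say this is the answer to the question. –

Because XmlFile.withCharset belongs to a private object, I could not use it directly, so I implemented XmlFileUtil: public class XmlFileUtil { public static RDD withCharset(SparkContext context, String location, String charset, String rowTag) { return XmlFile.withCharset(context, location, charset, rowTag); } } –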
