SparkCLR: processing a text file fails

Problem description:

I am trying to learn SparkCLR by processing a text file and running a Spark SQL query on it, using a sample like the one below:

[Sample]
internal static void MyDataFrameSample()
{
    // Schema for the tab-delimited input: tagname, time, value, confidence, mode
    var schemaTagValues = new StructType(new List<StructField>
    {
        new StructField("tagname", new StringType()),
        new StructField("time", new LongType()),
        new StructField("value", new DoubleType()),
        new StructField("confidence", new IntegerType()),
        new StructField("mode", new IntegerType())
    });

    // Read the text file and split each line into an object[] of fields
    var rddTagValues1 = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath(myDataFile))
        .Map(r => r.Split('\t')
            .Select(s => (object)s).ToArray());

    var dataFrameTagValues = GetSqlContext().CreateDataFrame(rddTagValues1, schemaTagValues);
    dataFrameTagValues.RegisterTempTable("tagvalues");

    //var qualityFilteredDataFrame = GetSqlContext().Sql("SELECT tagname, value, time FROM tagvalues where confidence > 85");
    var qualityFilteredDataFrame = GetSqlContext().Sql("SELECT * FROM tagvalues");
    var data = qualityFilteredDataFrame.Collect();

    var filteredCount = qualityFilteredDataFrame.Count();
    Console.WriteLine("Filter By = 'confidence', RowsCount={0}", filteredCount);
}

But it keeps failing with the following error:

[2016-01-13 08:56:28,593] [8] [ERROR] [Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge] - JVM method execution failed: Static method collectAndServe failed for class org.apache.spark.api.python.PythonRDD when called with 1 parameters ([Index=1, Type=JvmObjectReference, Value=19],) 
    [2016-01-13 08:56:28,593] [8] [ERROR] [Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge] - 
    ******************************************************************************************************************************* 
     at Microsoft.Spark.CSharp.Interop.Ipc.JvmBridge.CallJavaMethod(Boolean isStatic, Object classNameOrJvmObjectReference, String methodName, Object[] parameters) in d:\SparkCLR\csharp\Adapter\Microsoft.Spark.CSharp\Interop\Ipc\JvmBridge.cs:line 91 
    ******************************************************************************************************************************* 

My text file looks like this:

10PC1008.AA 130908762000000000   7.059829 100 0 
10PC1008.AA 130908762050000000   7.060376 100 0 
10PC1008.AA 130908762100000000   7.059613 100 0 
10PC1008.BB 130908762150000000   7.059134 100 0 
10PC1008.BB 130908762200000000   7.060124 100 0 

Is there anything wrong with how I am using this?
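
One thing I am not sure about: the schema declares long, double and int columns, but my Map leaves every field as a string. Would the fields need to be parsed to the declared types first? A variant I considered (an untested sketch, assuming each line has exactly five tab-separated fields in schema order):

var rddTagValuesTyped = SparkCLRSamples.SparkContext.TextFile(SparkCLRSamples.Configuration.GetInputDataPath(myDataFile))
    .Map(r =>
    {
        var f = r.Split('\t');
        // Parse each field into the type declared in schemaTagValues
        return new object[]
        {
            f[0],               // tagname    -> string
            long.Parse(f[1]),   // time       -> long
            double.Parse(f[2]), // value      -> double
            int.Parse(f[3]),    // confidence -> int
            int.Parse(f[4])     // mode       -> int
        };
    });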

Edit 1

I have configured my sample project's properties as follows:

[screenshot: sample project properties]

My user environment variables are as follows (not sure whether this matters):

[screenshot: user environment variables]

I also see in the SparkCLRWorker log that it fails to load an assembly:

[2016-01-14 08:37:01,865] [1] [ERROR] [Microsoft.Spark.CSharp.Worker] - System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. 
---> System.IO.FileNotFoundException: Could not load file or assembly 'SparkCLRSamples, Version=1.5.2.0, Culture=neutral, PublicKeyToken=null' or one of its dependencies. The system cannot find the file specified. 
     at System.Reflection.RuntimeAssembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, RuntimeAssembly locationHint, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks) 
     at System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(AssemblyName assemblyRef, Evidence assemblySecurity, RuntimeAssembly reqAssembly, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks) 
     at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean forIntrospection) 
     at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection) 
     at System.Reflection.Assembly.Load(String assemblyString) 
     at System.Runtime.Serialization.FormatterServices.LoadAssemblyFromString(String assemblyName) 
     at System.Reflection.MemberInfoSerializationHolder..ctor(SerializationInfo info, StreamingContext context) 
     --- End of inner exception stack trace --- 
     at System.RuntimeMethodHandle.SerializationInvoke(IRuntimeMethodInfo method, Object target, SerializationInfo info, StreamingContext& context) 
     at System.Runtime.Serialization.ObjectManager.CompleteISerializableObject(Object obj, SerializationInfo info, StreamingContext context) 
     at System.Runtime.Serialization.ObjectManager.FixupSpecialObject(ObjectHolder holder) 
     at System.Runtime.Serialization.ObjectManager.DoFixups() 
     at System.Runtime.Serialization.Formatters.Binary.ObjectReader.Deserialize(HeaderHandler handler, __BinaryParser serParser, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) 
     at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream, HeaderHandler handler, Boolean fCheck, Boolean isCrossAppDomain, IMethodCallMessage methodCallMessage) 
     at System.Runtime.Serialization.Formatters.Binary.BinaryFormatter.Deserialize(Stream serializationStream) 
     at Microsoft.Spark.CSharp.Worker.Main(String[] args) in d:\SparkCLR\csharp\Worker\Microsoft.Spark.CSharp\Worker.cs:line 149 

Did you specify the sample data location and copy your source text file there? If not, you can refer to

https://github.com/Microsoft/SparkCLR/blob/master/csharp/Samples/Microsoft.Spark.CSharp/samplesusage.md

and set it with the [--data | sparkclr.sampledata.loc] parameter pointing to your sample data location.

Yes, my command-line arguments are like this: '--torun "MyDataFrameSample" --data D:\SparkCLR\build\run\data', and the file exists there. The log shows this: '16/01/13 12:14:14 INFO HadoopRDD: Input split: file:/D:/SparkCLR/build/run/data/data_small.txt:0+75981' – Kiran

Try explicitly setting the [--temp | spark.local.dir] option (see samplesusage.md for more information on the supported parameters). The SparkCLR worker executable is downloaded into this directory at execution time. If you use the default temp directory, the worker executable may get quarantined by your antivirus software, which can mistake it for something malicious downloaded by your browser. Overriding the default with something like c:\temp\SparkCLRTemp helps avoid that problem.

If setting the temp directory does not help, please share the entire list of command-line arguments you are using when launching your SparkCLR driver code.
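
For example, with the arguments from your earlier comment, the full argument list would look something like this (the --temp value is just an illustration):

--torun "MyDataFrameSample" --data D:\SparkCLR\build\run\data --temp C:\temp\SparkCLRTemp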

I have updated my original post with more details under 'Edit 1', and I tried setting --temp as you suggested, but I still cannot get it to work. Is there anything else I can look at? Also, any idea why I am seeing 'FileNotFoundException: Could not load file or assembly SparkCLRSamples'? - Regards – Kiran

It looks like you are trying to run SparkCLR in debug mode. See https://github.com/Microsoft/SparkCLR/blob/master/notes/windows-instructions.md#debugging-tips for instructions. As described there, you need to set the CSharpBackendPortNumber and CSharpWorkerPath configuration values. – skaarthik

I followed those debugging instructions and it now works without problems. - Thanks – Kiran

Here is how you change the port number; I hope it helps. In your app.config, add the following (for completeness, you must also specify the CSharpWorker path):

<appSettings> 
    <add key="CSharpBackendPortNumber" value="num"/> 
    <add key="CSharpWorkerPath" value="C:\MobiusRelease\samples\CSharpWorker.exe"/> 
</appSettings> 

Note the path in the CSharpWorkerPath tag. To make this work in debug mode, you should first run the following command from the %SPARKCLR_HOME%\scripts directory (under the Mobius home):

sparkclr-submit.cmd debug 

This will give you a message containing the port number, like this:

[CSharpRunner.main] Port number used by CSharpBackend is 5567
* [CSharpRunner.main] Backend running debug mode. Press enter to exit *
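
So with the message above, the app.config from earlier would end up looking like this (the port value is taken from the debug output; the worker path is still just an example path):

<appSettings>
    <add key="CSharpBackendPortNumber" value="5567"/>
    <add key="CSharpWorkerPath" value="C:\MobiusRelease\samples\CSharpWorker.exe"/>
</appSettings>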