DL4J hello world
Background: I had previously tried training models in TensorFlow and saving them as pb files for Spark to consume, but the performance was too slow, so I started looking for a way to run deep learning directly on Spark. After weighing SparkNet against DL4J, I chose DL4J.
Following the official quickstart at https://deeplearning4j.org/cn/quickstart, I first worked through an example.
Step 1: clone the examples repository locally
F:\spark project\dl4j-examples>git clone https://github.com/deeplearning4j/dl4j-examples.git
Cloning into 'dl4j-examples'...
remote: Enumerating objects: 201, done.
remote: Counting objects: 100% (201/201), done.
remote: Compressing objects: 100% (133/133), done.
error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 10054
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
The failure above was fixed as suggested in https://stackoverflow.com/questions/21277806/fatal-early-eof-fatal-index-pack-failed: turn off compression and do a shallow clone, which is far more tolerant of a flaky connection:
F:\spark project\dl4j-examples>
F:\spark project\dl4j-examples>git config --global core.compression 0
F:\spark project\dl4j-examples>git clone --depth 1 https://github.com/deeplearning4j/dl4j-examples.git
Cloning into 'dl4j-examples'...
remote: Enumerating objects: 768, done.
remote: Counting objects: 100% (768/768), done.
remote: Compressing objects: 100% (547/547), done.
remote: Total 768 (delta 161), reused 491 (delta 97), pack-reused 0
Receiving objects: 100% (768/768), 22.94 MiB | 165.00 KiB/s, done.
Resolving deltas: 100% (161/161), done.
That felt like enough, since the examples had downloaded, but I went ahead and ran the remaining suggested commands anyway:
F:\spark project\dl4j-examples>git fetch --unshallow
fatal: not a git repository (or any of the parent directories): .git
F:\spark project\dl4j-examples>git fetch --depth=2147483647
fatal: not a git repository (or any of the parent directories): .git
F:\spark project\dl4j-examples>git init
Initialized empty Git repository in F:/spark project/dl4j-examples/.git/
F:\spark project\dl4j-examples>git fetch --unshallow
fatal: --unshallow on a complete repository does not make sense
F:\spark project\dl4j-examples>git fetch --depth=2147483647
F:\spark project\dl4j-examples>git pull --all
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.
git pull <remote> <branch>
If you wish to set tracking information for this branch you can do so with:
git branch --set-upstream-to=<remote>/<branch> master
(In hindsight, the "not a git repository" errors above occurred because I ran these commands in the parent folder instead of inside the cloned dl4j-examples directory; running git fetch --unshallow inside the clone is what actually restores the full history.)
Step 2: build the examples (takes roughly half an hour and needs over 5 GB of disk space):
mvn clean install
I didn't install Maven separately; instead I took a shortcut and ran mvn clean install from another project in IDEA via Execute Maven Goal, with the working directory set to F:\spark project\dl4j-examples\dl4j-examples.
Step 3: pick an example and run it,
e.g. org.deeplearning4j.examples.feedforward.classification.MLPClassifierLinear.
Running it from IDEA failed with:
Error running 'AuthServer': Command line is too long. Shorten command line for AuthServer or also for Application default configuration.
Fix:
Edit .idea\workspace.xml in the project, find the <component name="PropertiesComponent"> tag, and add a line <property name="dynamic.classpath" value="true" /> inside it.
With that change the example ran successfully.
Step 4: run it in my own Spark project.
First add the dependencies (all DL4J/ND4J artifacts should be kept on the same version, here 1.0.0-beta5):
<dependency>
    <groupId>org.datavec</groupId>
    <artifactId>datavec-api</artifactId>
    <version>1.0.0-beta5</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native-platform</artifactId>
    <version>1.0.0-beta5</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-core</artifactId>
    <version>1.0.0-beta5</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>dl4j-spark_2.11</artifactId>
    <version>1.0.0-beta5</version>
</dependency>
<dependency>
    <groupId>com.beust</groupId>
    <artifactId>jcommander</artifactId>
    <version>1.27</version>
</dependency>
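Before wiring DL4J into the Spark job, it's worth confirming that the nd4j-native-platform backend actually loads on the machine. A minimal smoke test (the class name is my own):

import org.nd4j.linalg.factory.Nd4j

object Nd4jSmokeTest {
  def main(args: Array[String]): Unit = {
    // Build a 2x2 matrix on the native (CPU) backend pulled in by nd4j-native-platform
    val m = Nd4j.create(Array(Array(1.0, 2.0), Array(3.0, 4.0)))
    // If the BLAS backend loaded correctly, this prints the matrix product
    println(m.mmul(m))
  }
}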
I then copied org.deeplearning4j.legacyExamples.mlp.MnistMLPExample and made some small changes, ending up with:
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.deeplearning4j.datasets.iterator.impl.MnistDataSetIterator
import org.deeplearning4j.eval.Evaluation
import org.deeplearning4j.nn.conf.NeuralNetConfiguration
import org.deeplearning4j.nn.conf.layers.DenseLayer
import org.deeplearning4j.nn.conf.layers.OutputLayer
import org.deeplearning4j.nn.weights.WeightInit
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster
import org.nd4j.linalg.activations.Activation
import org.nd4j.linalg.dataset.DataSet
import org.nd4j.linalg.learning.config.Nesterovs
import org.nd4j.linalg.lossfunctions.LossFunctions
import java.util
object MnistMLPExample {
  val batchSizePerWorker = 16
  val numEpochs = 2

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf
    System.setProperty("hadoop.home.dir", "D:\\火狐下载\\hadoop-common-2.2.0-bin-master")
    sparkConf.setMaster("local[*]")
    sparkConf.setAppName("DL4J Spark MLP Example")
    val sc = new JavaSparkContext(sparkConf)
    sc.setLogLevel("WARN")

    //Load the data into memory then parallelize
    //This isn't a good approach in general - but is simple to use for this example
    //The second MnistDataSetIterator argument selects the split: true = train, false = test
    val iterTrain = new MnistDataSetIterator(batchSizePerWorker, true, 12345)
    val iterTest = new MnistDataSetIterator(batchSizePerWorker, false, 12345)
    val trainDataList = new util.ArrayList[DataSet]
    val testDataList = new util.ArrayList[DataSet]
    while (iterTrain.hasNext) trainDataList.add(iterTrain.next)
    while (iterTest.hasNext) testDataList.add(iterTest.next)
    val trainData = sc.parallelize(trainDataList)
    val testData = sc.parallelize(testDataList)
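    // sc.parallelize turns the in-memory lists into JavaRDD[DataSet]s;
    // the Spark training below consumes the data as a distributed collection of minibatches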
    //Create network configuration and conduct network training
    val conf = new NeuralNetConfiguration.Builder()
      .seed(12345)
      .activation(Activation.LEAKYRELU)
      .weightInit(WeightInit.XAVIER)
      .updater(new Nesterovs(0.1)) // To configure: .updater(Nesterovs.builder().momentum(0.9).build())
      .l2(1e-4)
      .list()
      .layer(new DenseLayer.Builder().nIn(28 * 28).nOut(500).build())
      .layer(new DenseLayer.Builder().nOut(100).build())
      .layer(new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
        .activation(Activation.SOFTMAX).nOut(10).build())
      .build()
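    // The result is a 784 -> 500 -> 100 -> 10 MLP; nIn is omitted for the later
    // layers and inferred by DL4J from the previous layer's nOut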
    //Configuration for Spark training: see http://deeplearning4j.org/spark for explanation of these configuration options
    val tm = new ParameterAveragingTrainingMaster.Builder(batchSizePerWorker) //Each DataSet object in the RDD contains batchSizePerWorker (16) examples
      .averagingFrequency(5)
      .workerPrefetchNumBatches(2) //Async prefetching: 2 minibatches per worker
      .batchSizePerWorker(batchSizePerWorker)
      .build()
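    // With parameter averaging, each worker fits its data partition for
    // averagingFrequency minibatches, after which the parameters are averaged
    // across workers and redistributed before training continues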
    //Create the Spark network
    val sparkNet = new SparkDl4jMultiLayer(sc, conf, tm)
    //Execute training:
    for (i <- 0 until numEpochs) {
      sparkNet.fit(trainData)
      println(s"Completed Epoch $i")
    }

    //Perform evaluation (distributed)
    //doEvaluation was a work-around for a 0.9.1 bug (see https://deeplearning4j.org/releasenotes);
    //on 1.0.0-beta5, sparkNet.evaluate(testData) should work as well
    val evaluation = sparkNet.doEvaluation(testData, 64, new Evaluation(10))(0)
    println("***** Evaluation *****")
    println(evaluation.stats)

    //Delete the temp training files, now that we are done with them
    tm.deleteTempFiles(sc)
    println("***** Example Complete *****")
  }
}
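Since the whole point was to train on Spark and reuse the model afterwards, note that the trained network can also be persisted to disk. A minimal sketch using DL4J's ModelSerializer (the file path is a placeholder of my own):

import org.deeplearning4j.util.ModelSerializer

// Extract the underlying MultiLayerNetwork from the Spark wrapper and save it;
// the final argument (true) also saves the updater state so training can resume later
val net = sparkNet.getNetwork
ModelSerializer.writeModel(net, "F:/spark project/mnist-mlp.zip", true)

// Load it back later for local inference
val restored = ModelSerializer.restoreMultiLayerNetwork("F:/spark project/mnist-mlp.zip")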
The run output was as follows (note: these numbers come from my first run, in which the test iterator was also built with train = true, so the evaluation below actually covers the 60,000 training images; the corrected iterator above evaluates on the 10,000-image test set instead):
......
15:03:14,619 INFO ~ Completed Epoch 1
15:03:20,498 INFO ~ ***** Evaluation *****
15:03:20,502 INFO ~
========================Evaluation Metrics========================
# of classes: 10
Accuracy: 0.9608
Precision: 0.9605
Recall: 0.9607
F1 Score: 0.9605
Precision, recall & F1: macro-averaged (equally weighted avg. of 10 classes)
=========================Confusion Matrix=========================
    0    1    2    3    4    5    6    7    8    9
---------------------------------------------------
 5807    0   10    3   11   17   24    3   40    8 | 0 = 0
    1 6583   50   17    9    8    1    7   54   12 | 1 = 1
   24   11 5759   24   27   11   19   31   47    5 | 2 = 2
   14   16   99 5717    3  113    8   29   96   36 | 3 = 3
    6   14   25    2 5615    1   31    5   16  127 | 4 = 4
   28    8   14   65   15 5180   39    8   43   21 | 5 = 5
   29    8   10    0   21   49 5775    0   26    0 | 6 = 6
   11   24   77   13   35    3    2 6006   14   80 | 7 = 7
   18   39   24   56   14   45   21    9 5600   25 | 8 = 8
   22   12    5   43  115   28    2   71   43 5608 | 9 = 9
Confusion matrix format: Actual (rowClass) predicted as (columnClass) N times
==================================================================
15:03:20,502 INFO ~ Attempting to delete temporary directory: /tmp/hadoop-lenovo/dl4j/1572245786579_-2c362db/0/
15:03:22,346 INFO ~ Deleted temporary directory: /tmp/hadoop-lenovo/dl4j/1572245786579_-2c362db/0/
15:03:22,347 INFO ~ ***** Example Complete *****
DeepLearning4J environment setup and smoke testing are done; next I'll dig into the details of DL4J...