Zipped Conda environment breaks audioread's backends (Python/PySpark)
Question:
I use conda to build a PySpark environment that packages all of my dependencies and ships them to every node at runtime. Here is how I create the environment:
`conda/bin/conda create -p conda_env --copy -y python=2 \
numpy scipy ffmpeg gcc libsndfile gstreamer pygobject audioread librosa`
`zip -r conda_env.zip conda_env`
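Then, after sourcing conda_env and starting the pyspark shell, I can successfully execute:
`import librosa
y, sr = librosa.load("test.m4a")`
Running this script without the environment sourced results in an error, because ffmpeg/GStreamer is not installed on my local machine.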
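Submitting the script to the cluster results in a librosa.load error that traces back to audioread, indicating that the backend (either GStreamer or ffmpeg) can no longer be found in the zipped archive environment. The submission command and stack trace are below.
Submission: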
`PYSPARK_PYTHON=./NODE/conda_env/bin/python spark-submit --verbose \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NODE/conda_env/bin/python \
--conf spark.yarn.appMasterEnv.PYTHON_EGG_CACHE=/tmp \
--conf spark.executorEnv.PYTHON_EGG_CACHE=/tmp \
--conf spark.yarn.executor.memoryOverhead=1024 \
--conf spark.hadoop.validateOutputSpecs=false \
--conf spark.driver.cores=5 \
--conf spark.driver.maxResultSize=0 \
--master yarn --deploy-mode cluster --queue production \
--num-executors 20 --executor-cores 5 --executor-memory 40G \
--driver-memory 20G --archives conda_env.zip#NODE \
--jars /data/environments/sqljdbc41.jar \
script.py`
Trace:
`Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 172, in main
process()
File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/worker.py", line 167, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "script.py", line 245, in <lambda>
File "script.py", line 119, in download_audio
File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/librosa/core/audio.py", line 107, in load
with audioread.audio_open(os.path.realpath(path)) as input_file:
File "/mnt/yarn/usercache/user/appcache/application_1506634200253_39889/container_1506634200253_39889_01_000003/NODE/conda_env/lib/python2.7/site-packages/audioread/__init__.py", line 114, in audio_open
raise NoBackendError()
NoBackendError`
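My question is: how can I package this archive so that librosa (really audioread) can find a backend and load .m4a files?
Answer:
This was a PATH issue: the executors could not find ffmpeg even though it was packaged in the conda environment. This hack fixed it: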
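`import os
import librosa

# Make the ffmpeg binary shipped in the zipped conda env visible to audioread.
path = os.getenv("PATH", "")
if "./NODE/conda_env/bin" not in path:
    path += os.pathsep + "./NODE/conda_env/bin"
os.environ["PATH"] = path
y, _ = librosa.load(audiofn, self.conf.sr)`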