How can I make sklearn's load_files ignore some files?

Question:

Mac OS creates a .DS_Store file in my training dataset directory, and load_files loads it and raises an exception like:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 1116

How can I filter out the .DS_Store file, other than by deleting it?

Can you show us how you iterate over the files? – miku

@miku: He's presumably relying on [sklearn.datasets.load_files](http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files) to do that, as his question implies. – abarnert

Answer

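Looking at the documentation, there doesn't seem to be any way to filter directly within load_files (or rather, you can whitelist categories, but you can't whitelist files within a category, or blacklist at either level).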

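You may want to consider filing a feature request with the scikit-learn project. Or you might consider it a bug that hidden files get loaded (hidden as defined for the platform, which on OS X and other POSIX systems should include files whose names start with a dot), and file a bug report for that.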

Meanwhile, there is a load_content flag that you can set:

load_content : boolean, optional (default=True)

Whether to load or not the content of the different files. If true a ‘data’ attribute containing the text information is present in the data structure returned. If not, a filenames attribute gives the path to the files.

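If you pass False, it will only find the filenames for you. You can then filter them however you want and load the files manually. Since ret.filenames holds full paths, check the base name, e.g., filenames = (filename for filename in ret.filenames if not os.path.basename(filename).startswith('.')).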

That seems to be the best solution available with the given tools.
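The load_content=False route described above could look roughly like the sketch below. The helper names drop_hidden, read_texts, and load_visible_files are made up for illustration, and UTF-8 is only an assumed encoding. The dot check is applied to the base name because filenames contains full paths. scikit-learn is imported inside the last function so that the two plain helpers work without it.

```python
import os


def drop_hidden(filenames, targets):
    """Keep only (path, label) pairs whose base name does not start with a dot."""
    kept = [(path, label) for path, label in zip(filenames, targets)
            if not os.path.basename(path).startswith(".")]
    paths = [path for path, _ in kept]
    labels = [label for _, label in kept]
    return paths, labels


def read_texts(paths, encoding="utf-8"):
    """Read every file in full and decode it as text (encoding is an assumption)."""
    texts = []
    for path in paths:
        with open(path, encoding=encoding) as handle:
            texts.append(handle.read())
    return texts


def load_visible_files(container_path, encoding="utf-8"):
    """Hypothetical helper: list files via load_files, filter, then read manually."""
    # Imported lazily so the helpers above do not require scikit-learn.
    from sklearn.datasets import load_files

    # With load_content=False nothing is decoded yet; only paths and labels come back.
    bunch = load_files(container_path, load_content=False)
    paths, labels = drop_hidden(bunch.filenames, bunch.target)
    return read_texts(paths, encoding), labels, list(bunch.target_names)
```

Calling load_visible_files("path/to/dataset") would then return the texts, their integer labels, and the category names, with .DS_Store never opened.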

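On the other hand, given how simple load_files actually is, especially if you're not using extra features like categories or shuffle, it may be simpler not to use it at all and to use os.walk or just os.listdir instead. In this case, assuming the files are exactly two levels deep rather than arbitrarily deep, the latter is probably simpler: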

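import os

def getfilenames(category):
    # List one category directory, skipping hidden files such as .DS_Store.
    return [filename for filename in os.listdir(category)
            if not filename.startswith('.')]

categoryfiles = [getfilenames(os.path.join(rootpath, category))
                 for category in os.listdir(rootpath)
                 if not category.startswith('.')]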
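To go one step further and actually read the files, the os.listdir approach could be extended along these lines. This is only a sketch: load_text_folders is a hypothetical name, UTF-8 is an assumed encoding, and the integer labels follow the sorted category names, which only loosely mirrors what load_files returns.

```python
import os


def load_text_folders(rootpath, encoding="utf-8"):
    """Read rootpath/<category>/<file>, skipping dot-prefixed entries at both levels."""
    # Categories are the visible subdirectories of rootpath, in sorted order.
    categories = sorted(
        name for name in os.listdir(rootpath)
        if not name.startswith(".")
        and os.path.isdir(os.path.join(rootpath, name))
    )
    data, target = [], []
    for label, category in enumerate(categories):
        folder = os.path.join(rootpath, category)
        for filename in sorted(os.listdir(folder)):
            path = os.path.join(folder, filename)
            # Skip hidden files (e.g. .DS_Store) and anything that is not a regular file.
            if filename.startswith(".") or not os.path.isfile(path):
                continue
            with open(path, encoding=encoding) as handle:
                data.append(handle.read())
            target.append(label)
    return data, target, categories
```

The returned data and target lists line up index by index, so they can be passed on to a vectorizer and a classifier much like the data and target attributes from load_files.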

Thank you very much. I've changed the load_files code based on your code. – user1687717

If we set the load_content flag to False, how do we then load the files manually? Could you answer my question here? http://stackoverflow.com/questions/17788431/scikit-learn-how-to-know-documents-in-the-cluster

Answer

A quick glance at the source of load_files suggests that your only option is to delete the .DS_Store files:

documents = [join(folder_path, d) 
    for d in sorted(listdir(folder_path))]

(If you want to get serious about .DS_Store pollution, here is a serious kernel extension: https://github.com/binaryage/asepsis.)

So passing load_content=False and then iterating over the files to load them isn't an option? – abarnert

@abarnert: I haven't used this toolkit, so I can't really answer that. – miku

Neither have I, but I did read the short documentation page. – abarnert

Answer

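I modified sklearn's load_files to accept an additional parameter, 'ignore_files', which takes a list of files to ignore. You can use this definition of load_files in place of sklearn's. It returns the same result as load_files, since I only filter out the files that should be ignored.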

Usage:

load_files(dir_path,ignore_files=".DS_Store")

The source is available as a gist.
