来自两个以上字符串的最长公共子字符串 - C++

问题描述：

我需要从C++中的一组文件名中计算出最长的公共子字符串。来自两个以上字符串的最长公共子字符串 - C++

准确地说，我的std ::字符串的std ::列表（或相当于QT，也未尝不可）

char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"}; 
std::list<std::string> files(x, x + sizeof(x)/sizeof(*x));

我需要计算n的所有字符串的不同最长公共子，在这例如对于n = 2

"File" and ".xls"

如果我能计算出最长公共子，我可以把它剪它，并再次运行算法得到第二长的，所以实际上这归结为：

有用于计算std :: strings的std :: list的LCS的（引用？）实现？

这不是一个很好的答案，但一个肮脏的解决方案，我有 - 蛮力上的QList QUrls从中仅将部分最后一个“/”取。我很乐意用“适当的”代码替换它。

（我发现http://www.icir.org/christian/libstree/ - ？这将有很大的帮助，但我不能得到它来编译我的机器上有人用这个可能）

QString SubstringMatching::getMatchPattern(QList<QUrl> urls) 
    { 
    QString a; 

    int foundPosition = -1; 
    int foundLength = -1; 
    for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++) 
    { 
     bool hit=true; 
     int xj; 
     for (int j=0; j<urls.first().toString().length()-i+1; j++) // try to match from position i up to the end of the string :: test character at pos. (i+j) 
     { 
      if (!hit) break; 

      QString firstString = urls.first().toString().right(urls.first().toString().length()-i).left(j); // this needs to match all k strings 
      //qDebug() << "SEARCH " << firstString; 

      for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number 
      { 
       if (!hit) break; 

       //qDebug() << " IN " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1); 
       //qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1); 
       if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) { 
        xj = j; 
        //qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString; 
        hit = false; 
       } 
      } 

     } 
     if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string 
     if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length 
     { 
      foundPosition = i; // at the current position 
      foundLength = xj-1; 
      //qDebug() << "Found at " << i << " length " << foundLength; 
     } 
    } 

    a = urls.first().toString().right(urls.first().toString().length()-foundPosition).left(foundLength); 
    //qDebug() << a; 
    return a; 
}

这可能有用。 http://stackoverflow.com/questions/2418504/algorithm-to-find-common-substring-across-n-strings –

我已经点击了数百个类似的问题没有找到答案，包括上面的和http：// stackoverflow.com/questions/10248728/how-to-find-longest-common-substring-using-c。我得到的最接近的是http://homepage.virgin.net/cdm.henderson/site/cpp/lcs/index.htm，但是这是为了子序列，而不是子串。 – user2707001

这是一个不平凡的问题。详尽的搜索是必要的。（如果我的第一眼看好） –

答

如果像你说的后缀树太重量级或以其他方式不切实际，以下相当简单的蛮力方法可能足以满足您的应用需求。

我假设不同的子串不重叠，从从左到右选取。

即使有了这些假设，也不需要包含一组字符串的唯一集合，其包括“不同的最长公共子串”。无论ñ是，可能有超过ň不同的共同子串都是一样的最大长度和ñ他们中间的任何选择是任意的。相应地，该解决方案找到最长的不同共同子串的最多个子集*，其中所有具有相同长度的那些子集是一个集合。

的算法如下：

Q是长度的目标配额。
字符串是字符串的问题集。
结果是最初为空的多重映射，一个长度映射到一组串， 结果[1]是设置的与长度升
Ñ，最初为0，是中代表的不同长度的数量结果
如果Q是0或字符串是空的返回结果
查找的字符串任何最短构件;保留副本S并将其删除从字符串。我们继续用S与的这些字符串比较子字符串，因为所有{字符串，小号}的共同子串必须小号的子。

迭代生成的所有子串小号，最长首先，使用由偏移量和长度控制明显嵌套循环。对于每个子SS 小号的：

如果SS不是字符串一个共同子串，下一个。

遍历结果[1]为升> =直到结果或直到SS端SS的长度被发现是被检查结果的子串。在后一种情况下，ss与已有的结果没有区别，所以接下来。

ss是不同于任何已有手段的常见子字符串。遍历 结果[1]为升 < SS的长度，删除每个结果即SS的串，因为所有这些都短于SS和不鲜明从它。 ss现在是一个不同于任何已有的常见子字符串和所有其他手头上的所有其他人都不同于ss。

对于升 = SS的长度，检查是否结果[1]存在，即如果有在手任何结果相同的长度SS。如果没有，请致电 a NewLength条件。

检查还如果ň == Q，即我们已经达到了鲜明长度的目标配额。如果NewLength获得并且还有N == Q，请拨打StickOrRise条件。

如果StickOrRaise获得然后用升 =所述长度在手最短结果的比较SS的长度。如果ss短于l 那么它对我们的配额太短了，所以接下来。如果SS比升更长 然后在手所有最短结果是有利于SS待驱逐，所以删除 结果[1]和递减Ñ。

插入ss转换成结果按其长度键。

如果NewLength获得增量N。

放弃过小号的子串内部循环有SS偏移的相同，但更短，因为它们都不是从SS不同。

进展所述的Offset S用于通过SS，到下一个非重叠子串的开始的长度的外迭代。

返回结果。

这里是一个实现中，并用演示了字符串的列表的程序：

#include <list> #include <map> #include <string> #include <iostream> #include <algorithm> using namespace std; // Get a non-const iterator to the shortest string in a list list<string>::iterator shortest_of(list<string> & strings) { auto where = strings.end(); size_t min_len = size_t(-1); for (auto i = strings.begin(); i != strings.end(); ++i) { if (i->size() < min_len) { where = i; min_len = i->size(); } } return where; } // Say whether a string is a common substring of a list of strings bool is_common_substring_of( string const & candidate, list<string> const & strings) { for (string const & s : strings) { if (s.find(candidate) == string::npos) { return false; } } return true; } /* Get a multimap whose keys are the at-most `quota` greatest lengths of common substrings of the list of strings `strings`, each key multi-mapped to the set of common substrings of that length. */ multimap<size_t,string> n_longest_common_substring_sets(list<string> & strings, unsigned quota) { size_t nlengths = 0; multimap<size_t,string> results; if (quota == 0) { return results; } auto shortest_i = shortest_of(strings); if (shortest_i == strings.end()) { return results; } string shortest = *shortest_i; strings.erase(shortest_i); for (size_t start = 0; start < shortest.size();) { size_t skip = 1; for (size_t len = shortest.size(); len > 0; --len) { string subs = shortest.substr(start,len); if (!is_common_substring_of(subs,strings)) { continue; } auto i = results.lower_bound(subs.size()); for ( ;i != results.end() && i->second.find(subs) == string::npos; ++i) {} if (i != results.end()) { continue; } for (i = results.begin(); i != results.end() && i->first < subs.size();) { if (subs.find(i->second) != string::npos) { i = results.erase(i); } else { ++i; } } auto hint = results.lower_bound(subs.size()); bool new_len = hint == results.end() || hint->first != subs.size(); if (new_len && nlengths == quota) { size_t min_len = results.begin()->first; if (min_len > subs.size()) { continue; } results.erase(min_len); --nlengths; } nlengths += new_len; results.emplace_hint(hint,subs.size(),subs); len = 1; skip = subs.size(); } start += skip; } return results; } // Testing ... int main() { list<string> strings{ "OfBitWordFirstFileWordZ.xls", "SecondZWordBitWordOfFileBlue.xls", "ThirdFileZBitWordWhiteOfWord.xls", "WordFourthWordFileBitGreenZOf.xls"}; auto results = n_longest_common_substring_sets(strings,4); for (auto const & val : results) { cout << "length: " << val.first << ", substring: " << val.second << endl; } return 0; }

输出：

length: 1, substring: Z length: 2, substring: Of length: 3, substring: Bit length: 4, substring: .xls length: 4, substring: File length: 4, substring: Word

（内置用gcc 4.8.1）

来自两个以上字符串的最长公共子字符串 - C++

相关推荐