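How do I find the longest substring that occurs in every element of an array?

Problem description:

I have a collection of texts by several authors. Each author has a unique signature or link that appears in all of their texts. How can I find the longest substring that occurs in every element of an array?

Example for Author1: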

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl']; 

Expected output for Author1: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
'Different message body text.  This is the 
*author\'s* signature. 

This is an afterthought that expresses that a signature is not always at the end.', 
'Finally, this is unwanted stuff. This is the 
*author\'s* signature.']; 

Expected output for Author2:

This is the 
*author's* signature. 

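Note in particular that there are no reliable identifying characters (or positions) that mark the start or end of a signature. It can be a URL of any length, a Twitter mention, any kind of plain text, etc.: any sequence of characters, occurring at the start, end, or middle of the string.

I am looking for a method that extracts the longest substring present in all of a single author's $texts elements.

For this task, assume that every author has a signature substring in each post/text.

IDEA: I am thinking of converting the words to vectors and measuring the similarity between the texts. We could use cosine similarity to find the signature. I think the solution must be along these lines.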
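To make the cosine-similarity idea concrete, here is a minimal sketch over word-count vectors. The helper names `wordCounts` and `cosineSimilarity` are hypothetical and the sample strings are invented. Note that the score only says how similar two texts are; it does not by itself locate the shared substring.

```php
<?php
// Hypothetical helper: builds a word => count map from whitespace-separated tokens.
function wordCounts(string $text): array
{
    return array_count_values(preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY));
}

// Hypothetical helper: cosine similarity between two word-count vectors.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0;
    foreach ($a as $word => $count) {
        $dot += $count * ($b[$word] ?? 0); // words missing from $b contribute 0
    }
    $square = function ($n) { return $n * $n; };
    $normA = sqrt(array_sum(array_map($square, $a)));
    $normB = sqrt(array_sum(array_map($square, $b)));
    return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
}

$first  = wordCounts('some body text @jhsad.sadas.com');
$second = wordCounts('other words here @jhsad.sadas.com');
echo cosineSimilarity($first, $second), "\n"; // prints 0.25 (one shared word out of four in each text)
```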

mickmackusa's commented code captures the essence of what is needed, but I would like to see whether there are other ways to achieve the desired result.

Do you need to find '@jhsad.sadas.com', or confirm that the string contains it? Do you allow loose matches, e.g. '@jhsad.sadas.com.uk'? '@jhsad\.sadas\.com\b' could work, or, if the domain is a variable, use 'preg_quote'. – chris85

@chris85, I want to find the author's signature in his posts. I don't know what it will be or where he will use it. – mrmrn

If you don't know what it is, then how do you identify it? – chris85

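Here is my idea:

  1. Sort an author's collection of posts by string length (ascending), so that you work from the shortest text to the longest.
  2. Split each post's text on one or more whitespace characters, so that only whole non-whitespace substrings are handled during processing.
  3. Find the substrings that also occur in each subsequent post, checking against an ever-narrowing array of substrings (overlaps).
  4. Group consecutive matching substrings by analyzing their index values.
  5. Reconstitute the grouped consecutive substrings into their original string form (trimmed of leading and trailing whitespace, of course).
  6. Sort the reconstituted strings by string length (descending), so that the longest string gets index 0.
  7. Print the substring assumed to be the author's signature (as a best guess), based on commonality and length.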

Code: (Demo)

$posts['Author1']=['sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl']; 

$posts['Author2']=['This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
     'Different message body text.  This is the 
*author\'s* signature. 

    This is an afterthought that expresses that a signature is not always at the end.', 
     'Finally, this is unwanted stuff. This is the 
*author\'s* signature.']; 

foreach($posts as $author=>$texts){ 
    echo "Author: $author\n"; 

    usort($texts,function($a,$b){return strlen($a)-strlen($b);}); // sort ASC by strlen; mb_strlen probably isn't advantageous 
    var_export($texts); 
    echo "\n"; 

    foreach($texts as $index=>$string){ 
        $words=preg_split('/\s+/',$string,-1,PREG_SPLIT_NO_EMPTY); // -1 = no limit; passing NULL here is deprecated since PHP 8.1 
        if(!$index){ 
            $overlaps=$words; // declare with all non-white-space substrings from first text 
        }else{ 
            $overlaps=array_intersect($overlaps,$words); // filter word bank using narrowing number of words 
        } 
    } 
    var_export($overlaps); 
    echo "\n"; 

    // batch consecutive substrings 
    $group=null; 
    $consecutives=[]; // clear previous iteration's data 
    foreach($overlaps as $i=>$word){ 
        if($group===null || $i-$last>1){ 
            $group=$i; 
        } 
        $last=$i; 
        $consecutives[$group][]=$word; 
    } 
    var_export($consecutives); 
    echo "\n"; 

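    $potential_signatures=[]; // clear previous author's candidates 
    foreach($consecutives as $words){ 
        // match potential signatures in first text for measurement; preg_quote makes each word (and any "/") literal 
        $pattern='/'.implode('\s+',array_map(function($w){return preg_quote($w,'/');},$words)).'/'; 
        if(preg_match_all($pattern,$texts[0],$out)){ 
            $potential_signatures=array_merge($potential_signatures,$out[0]); // keep candidates from every group, not only the last 
        } 
    } 
    usort($potential_signatures,function($a,$b){return strlen($b)-strlen($a);}); // sort DESC by strlen; mb_strlen probably isn't advantageous 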

    echo "Assumed Signature: {$potential_signatures[0]}\n\n"; 
} 

Output:

Author: Author1 
array (
    0 => 'sdsadsad daSDA DDASd asd aSD Sd dA SD ASD sadasdasds sadasd 

@jhsad.sadas.com sdsdADSA sada', 
    1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl 
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl', 
    2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g 
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf', 
) 
array (
    11 => '@jhsad.sadas.com', 
) 
array (
    11 => 
    array (
    0 => '@jhsad.sadas.com', 
), 
) 
Assumed Signature: @jhsad.sadas.com 

Author: Author2 
array (
    0 => 'Finally, this is unwanted stuff. This is the 
*author\'s* signature.', 
    1 => 'This is some random string representative of non-signature text. 

This is the 
*author\'s* signature.', 
    2 => 'Different message body text.  This is the 
*author\'s* signature. 

    This is an afterthought that expresses that a signature is not always at the end.', 
) 
array (
    2 => 'is', 
    5 => 'This', 
    6 => 'is', 
    7 => 'the', 
    8 => '*author\'s*', 
    9 => 'signature.', 
) 
array (
    2 => 
    array (
    0 => 'is', 
), 
    5 => 
    array (
    0 => 'This', 
    1 => 'is', 
    2 => 'the', 
    3 => '*author\'s*', 
    4 => 'signature.', 
), 
) 
Assumed Signature: This is the 
*author's* signature. 
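As an alternative to the word-based approach above, a character-level search can be sketched as follows. This is a brute-force illustration, not code from the answer: the function name `longestCommonSubstring` is hypothetical, the sample texts are invented, and it assumes the posts are short enough that trying every substring of the shortest text is affordable. It works on bytes (strlen/substr), so multibyte text would need the mb_* equivalents.

```php
<?php
// Hypothetical helper: brute-force longest substring common to every text.
function longestCommonSubstring(array $texts): string
{
    if (!$texts) {
        return '';
    }
    // Candidates come from the shortest text; no common substring can be longer than it.
    usort($texts, function ($a, $b) { return strlen($a) - strlen($b); });
    $shortest = array_shift($texts);
    $length = strlen($shortest);

    // Try candidate lengths from longest to shortest so the first hit is the answer.
    for ($size = $length; $size > 0; --$size) {
        for ($start = 0; $start + $size <= $length; ++$start) {
            $candidate = substr($shortest, $start, $size);
            foreach ($texts as $text) {
                if (strpos($text, $candidate) === false) {
                    continue 2; // missing from one text; move on to the next candidate
                }
            }
            return $candidate; // found in every remaining text
        }
    }
    return '';
}

$texts = [
    'first body text @jhsad.sadas.com trailing',
    'QWERTY UIOP @jhsad.sadas.com zxcv',
    'lkjh gfds @jhsad.sadas.com',
];
// The raw match includes the space before "@", hence the trim().
echo trim(longestCommonSubstring($texts)), "\n"; // prints @jhsad.sadas.com
```

Unlike the word-based version, this returns the match exactly as it appears, including surrounding whitespace, and it can pick up a long shared phrase from the message bodies if one happens to exceed the signature in length.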

You can use preg_match() with a regular expression to achieve this.

$str = "KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg df gfdhgf g hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf"; 

preg_match("/\@[^\s]+/", $str, $match); 

var_dump($match); //Will output the signature 
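
If the signature really is an "@" token, the same pattern could be applied to every text, keeping only the tokens common to all of them. This is a sketch under that assumption, which the question does not guarantee; the sample texts are shortened from the question and '@other.example' is an invented distractor.

```php
<?php
// Sample data: three short posts that share one "@" token.
$texts = [
    'sadasd @jhsad.sadas.com sdsdADSA sada',
    'jhg @jhsad.sadas.com sfgff @other.example',
    'hkgfl @jhsad.sadas.com dsfjdshflkds',
];

$common = null;
foreach ($texts as $text) {
    // Collect every "@" followed by non-whitespace characters in this text.
    preg_match_all('/@\S+/', $text, $matches);
    // First text seeds the list; later texts narrow it down.
    $common = ($common === null) ? $matches[0] : array_intersect($common, $matches[0]);
}

// array_intersect() preserves keys, so reindex before printing.
var_export(array_values(array_unique($common)));
```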

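Here @jhsad.sadas.com is just an example. I don't know what that author's real signature is! All I have is some of that author's texts, and I know they contain a signature. – mrmrn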

@chris85 Yes, I have changed that now! – WasteD

@mrmrn But does the signature always start with @? – WasteD