Remove all non-"word characters" from a String in Java, leaving accented characters?

Question:

Apparently Java's regex flavor counts umlauts and other special characters as non-"word characters" when I use a regex.

 "TESTÜTEST".replaceAll("\\W", "")

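returns "TESTTEST" for me. What I want is for only the truly non-"word characters" to be removed. Is there any way to do this without resorting to something along the lines of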

  "[^A-Za-z0-9äöüÄÖÜßéèáàúùóò]"

only to realize later that I forgot a character?

Full Unicode reference: http://www.regular-expressions.info/unicode.html – zaletniy 2011-08-10 09:33:54

Answer

Use [^\p{L}\p{Nd}]+ - this matches all (Unicode) characters that are neither letters nor (decimal) digits.

In Java:

String resultString = subjectString.replaceAll("[^\\p{L}\\p{Nd}]+", "");

Edit:

I changed \p{N} to \p{Nd} because the former also matches some number symbols such as ¼; the latter does not. See it on regex101.com.
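
To make the difference concrete, here is a minimal sketch. The class and method names are made up for illustration and are not part of the answer. It applies the answer's pattern, the looser \p{N} variant, and, as an alternative the answer does not mention, the (?U) flag (Pattern.UNICODE_CHARACTER_CLASS, assumed available, i.e. Java 7 or later), which makes \W itself Unicode-aware. Non-ASCII characters are written as Unicode escapes so the source stays pure ASCII.

```java
final class UnicodeWordFilter {
    private UnicodeWordFilter() {}

    // The answer's pattern: drop everything that is not a letter or a decimal digit.
    static String stripNonAlnum(String s) {
        return s.replaceAll("[^\\p{L}\\p{Nd}]+", "");
    }

    // Looser variant using \p{N}: also keeps number symbols such as U+00BC (one quarter).
    static String stripNonAlnumLoose(String s) {
        return s.replaceAll("[^\\p{L}\\p{N}]+", "");
    }

    // Alternative, not from the answer: with (?U), \W follows Unicode rules,
    // so accented letters survive. Unlike the patterns above, the underscore
    // still counts as a word character and is kept.
    static String stripNonWord(String s) {
        return s.replaceAll("(?U)\\W+", "");
    }

    public static void main(String[] args) {
        String umlaut = "TEST\u00DCTEST";            // U+00DC is U with diaeresis
        System.out.println(stripNonAlnum(umlaut).equals(umlaut));   // letter is kept
        System.out.println(stripNonAlnum("a\u00BCb1"));             // one quarter removed
        System.out.println(stripNonAlnumLoose("a\u00BCb1").length()); // one quarter kept
        System.out.println(stripNonWord("TEST_\u00DC-TEST"));       // hyphen removed
    }
}
```

Note that all three variants look at the characters as stored: if an umlaut arrives in decomposed form (base letter plus a combining mark), the first two patterns strip the combining mark, because marks are neither \p{L} nor \p{Nd}.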

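Why is there a '\\[' in your character class? – 2009-10-23 08:14:29

Oops. Typo. Corrected. – 2009-10-23 08:19:44

Works like a charm! Thanks! – Epaga 2009-10-23 08:33:35

Answer

Well, here is one solution I ended up with, but I hope there's a more elegant one...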

StringBuilder result = new StringBuilder();
for (int i = 0; i < name.length(); i++) {
    char tmpChar = name.charAt(i);
    // Keep Unicode letters and digits plus the underscore, mirroring \w
    if (Character.isLetterOrDigit(tmpChar) || tmpChar == '_') {
        result.append(tmpChar);
    }
}

result ends up holding the desired output...
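
One caveat with the charAt loop: it inspects UTF-16 code units, so letters outside the Basic Multilingual Plane (stored as surrogate pairs) fail the isLetterOrDigit(char) test and get dropped. A code-point-based variant could look like the sketch below; the class and method names are illustrative only.

```java
final class CodePointFilter {
    private CodePointFilter() {}

    // Same rule as the loop above (letter, digit, or underscore),
    // but evaluated per Unicode code point instead of per char.
    static String keepWordCodePoints(String name) {
        StringBuilder result = new StringBuilder(name.length());
        for (int i = 0; i < name.length(); ) {
            int cp = name.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || cp == '_') {
                result.appendCodePoint(cp);
            }
            // Advance by 1 for BMP characters, by 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // U+10400 (a Deseret capital letter) needs two chars in UTF-16.
        String supplementary = new String(Character.toChars(0x10400));
        String cleaned = keepWordCodePoints(supplementary + "!");
        System.out.println(cleaned.equals(supplementary)); // letter kept, '!' dropped
    }
}
```

For ordinary European text the two loops behave the same; the difference only shows up with supplementary characters.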

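Indeed, the fact that your String variable is called 'name' suggests it won't be a large string. But even if it were large (thousands of characters), I would stick with the for loop as you do now. – 2009-10-23 09:34:35

Answer

Sometimes you don't want to simply remove the characters, but just strip the accents. I came up with the following utility class, which I use in my Java REST web projects whenever I need to include a String in a URL: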

import java.text.Normalizer; 
import java.text.Normalizer.Form; 

import org.apache.commons.lang.StringUtils; 

/** 
* Utility class for String manipulation. 
* 
* @author Stefan Haberl 
*/ 
public abstract class TextUtils { 
    private static String[] searchList = { "Ä", "ä", "Ö", "ö", "Ü", "ü", "ß" }; 
    private static String[] replaceList = { "Ae", "ae", "Oe", "oe", "Ue", "ue", 
      "sz" }; 

    /** 
    * Normalizes a String by removing all accents to original 127 US-ASCII 
    * characters. This method handles German umlauts and "sharp-s" correctly 
    * 
    * @param s 
    *   The String to normalize 
    * @return The normalized String 
    */ 
    public static String normalize(String s) { 
     if (s == null) 
      return null; 

     String n = null; 

     n = StringUtils.replaceEachRepeatedly(s, searchList, replaceList); 
     n = Normalizer.normalize(n, Form.NFD).replaceAll("[^\\p{ASCII}]", ""); 

     return n; 
    } 

    /** 
    * Returns a clean representation of a String which might be used safely 
    * within an URL. Slugs are a more human friendly form of URL encoding a 
    * String. 
    * <p> 
    * The method first normalizes a String, then converts it to lowercase and 
    * removes ASCII characters, which might be problematic in URLs: 
    * <ul> 
    * <li>all whitespaces 
    * <li>dots ('.') 
    * <li>(semi-)colons (';' and ':') 
    * <li>equals ('=') 
    * <li>ampersands ('&') 
    * <li>slashes ('/') 
    * <li>angle brackets ('<' and '>') 
    * </ul> 
    * 
    * @param s 
    *   The String to slugify 
    * @return The slugified String 
    * @see #normalize(String) 
    */ 
    public static String slugify(String s) { 

     if (s == null) 
      return null; 

     String n = normalize(s); 
     n = StringUtils.lowerCase(n); 
     n = n.replaceAll("[\\s.:;&=<>/]", ""); 

     return n; 
    } 
}

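Since I am a German speaker, I have also included proper handling of German umlauts - the list should be easy to extend for other languages.

HTH

Edit: Note that it may be unsafe to include the returned String in a URL. You should at least HTML-encode it to prevent XSS attacks.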
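
If pulling in Commons Lang just for replaceEachRepeatedly is not desirable, the normalize step can be approximated with the JDK alone. The following is a sketch under that assumption, not the author's class: AsciiFolder and fold are hypothetical names, and plain String.replace stands in for StringUtils.replaceEachRepeatedly.

```java
import java.text.Normalizer;

final class AsciiFolder {
    private AsciiFolder() {}

    // German special cases, mirroring the search/replace lists of the class above
    // (A/O/U with diaeresis in both cases, plus sharp s).
    private static final String[] SEARCH =
            { "\u00C4", "\u00E4", "\u00D6", "\u00F6", "\u00DC", "\u00FC", "\u00DF" };
    private static final String[] REPLACE =
            { "Ae", "ae", "Oe", "oe", "Ue", "ue", "sz" };

    static String fold(String s) {
        if (s == null) {
            return null;
        }
        String n = s;
        for (int i = 0; i < SEARCH.length; i++) {
            n = n.replace(SEARCH[i], REPLACE[i]);
        }
        // NFD splits an accented letter into base letter + combining mark;
        // the mark is non-ASCII, so the final replaceAll removes it.
        return Normalizer.normalize(n, Normalizer.Form.NFD)
                .replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C4rger \u00FCber Stra\u00DFe"));
        System.out.println(fold("Cr\u00E8me br\u00FBl\u00E9e"));
    }
}
```

As in the original class, characters without an ASCII base letter (for example æ) are removed outright rather than transliterated.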

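Important information on this: you can get the StringUtils class/package etc. at http://commons.apache.org/lang/download_lang.cgi – cV2 2011-12-19 11:39:21

Answer

You might want to remove the accents and diacritic signs first, then check at each character position whether the "simplified" string has an ASCII letter there - if it does, the original position contains a word character; if not, it can be removed.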
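
The answer gives no code, so the following is only one possible reading of it. Because NFD normalization changes string length, the sketch simplifies each code point on its own instead of comparing positions across two strings. SimplifiedWordFilter and keepWordChars are hypothetical names, and treating ASCII digits as word characters is an added assumption.

```java
import java.text.Normalizer;

final class SimplifiedWordFilter {
    private SimplifiedWordFilter() {}

    static String keepWordChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            // Decompose this one character and drop its combining marks.
            String simplified = Normalizer
                    .normalize(new String(Character.toChars(cp)), Normalizer.Form.NFD)
                    .replaceAll("\\p{M}+", "");
            // Keep the ORIGINAL character if its simplified form is plain ASCII.
            if (simplified.length() == 1 && isAsciiLetterOrDigit(simplified.charAt(0))) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Assumption: digits count as word characters, as in the question's bracket class.
    private static boolean isAsciiLetterOrDigit(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    public static void main(String[] args) {
        System.out.println(keepWordChars("TEST\u00DC-TEST!").length()); // 9 characters remain
    }
}
```

A limitation of this reading: letters with no ASCII base form, such as ß, do not simplify to an ASCII letter and are therefore removed, and non-Latin scripts are dropped entirely.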

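The class java.text.Normalizer is not supported on Android before API level 9, so if your app has to be compatible with API level 8 (according to Google's Android dashboard), this method is not viable – 2012-10-16 17:37:19

Answer

I was trying to achieve the exact opposite when I bumped into this thread. I know it is quite old, but here is my solution nonetheless. You can use Unicode blocks. In this case, compile the following code (with the right imports):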

String s = "äêìóblah";
Pattern p = Pattern.compile("[\\p{InLatin-1Supplement}]+"); // this regex uses a block
Matcher m = p.matcher(s);
System.out.println(m.find());
System.out.println(s.replaceAll(p.pattern(), "#"));

You should see the following output:

true

#blah

Best,

Answer

You can use StringUtils from Apache Commons Lang.

I think a bit more information would help... this isn't really an answer. – Micha 2013-10-09 18:29:29

Please add some code examples. – Saurabh 2015-05-21 17:32:02
