Handling the "invalid XML character (Unicode: 0x7)" exception when parsing XML files
Error message:
2015-01-29 00:10:22,075 ERROR commonapi.CommonApiAction - errorCode:5000,5000-00;Description:程序异常。Error on line 1 of document : An invalid XML character (Unicode: 0x19) was found in the CDATA section. Nested exception: An invalid XML character (Unicode: 0x19) was found in the CDATA section.
org.dom4j.DocumentException: Error on line 1 of document : An invalid XML character (Unicode: 0x19) was found in the CDATA section. Nested exception: An invalid XML character (Unicode: 0x19) was found in the CDATA section.
    at org.dom4j.io.SAXReader.read(SAXReader.java:482)
    at org.dom4j.DocumentHelper.parseText(DocumentHelper.java:278)
    at com.hoodong.engine.commonapi.CommonApiAction.getWapDocsSearchJsonInfo(CommonApiAction.java:1866)
    at sun.reflect.GeneratedMethodAccessor43.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
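Cause:
Some documents contain these invalid characters because a word processor used them as control codes (Microsoft chose the characters between 0x82 and 0x95 for its "smart" punctuation). Unicode also reserves them as control codes, and they are not legal in XML. "Invalid characters" here does not mean characters such as < and > that cannot appear outside of tags in an XML file, nor garbled text caused by an encoding mismatch. It means invisible characters that fall outside the range of legal XML characters. According to the W3C specification, the following characters should be kept out of XML files: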
Document authors are encouraged to avoid "compatibility characters", as defined in Unicode [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters: [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF], [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF], [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF], [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF], [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF], [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF], [#x10FFFE-#x10FFFF].
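Solution:
To make sure that common XML parsers can successfully parse the XML you generate, either filter the invalid characters out of the file before parsing it, or check each character's validity while generating the XML and discard the invalid ones.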
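To see the effect of filtering before parsing, the error can be reproduced with the XML parser bundled in the JDK instead of dom4j. The following is a minimal sketch under that assumption; the class name `FilterBeforeParse` and its methods are hypothetical helpers, not part of the original application.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class, for illustration only.
final class FilterBeforeParse {
    private FilterBeforeParse() {}

    /** Parses XML text with the JDK's built-in DOM parser. */
    static Document parse(String xml) throws SAXException {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string with the default configuration.
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the parser rejects the text as not well-formed. */
    static boolean isRejected(String xml) {
        try {
            parse(xml);
            return false;
        } catch (SAXException e) {
            return true;
        }
    }

    /** Returns the text content of the root element of well-formed XML. */
    static String rootText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (SAXException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this sketch, `isRejected("<doc><![CDATA[a\u0019b]]></doc>")` reports a failure, while the same text with the 0x19 character removed parses normally. The JDK parser prints its own diagnostic to standard error when it rejects the input.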
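Unicode is a character encoding scheme, defined by an international organization, that can accommodate all of the world's scripts and symbols. Unicode code points currently run from 0x0000 to 0x10FFFF and are organized into 17 groups called planes. Each plane holds 65,536 code points, for 1,114,112 in total. Only a few of the planes are in use so far. Encoding schemes then convert these code point numbers into program data.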
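The plane arithmetic can be checked with a few lines of Java. This is only an illustrative sketch; the class name `UnicodePlanes` and its members are hypothetical.

```java
// Hypothetical helper class, for illustration only.
final class UnicodePlanes {
    static final int PLANE_SIZE = 0x10000; // 65,536 code points per plane
    static final int PLANE_COUNT = 17;     // planes 0 through 16

    private UnicodePlanes() {}

    /** Returns the plane number (0 to 16) that contains the code point. */
    static int planeOf(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("Not a Unicode code point: " + codePoint);
        }
        // Each plane spans 0x10000 values, so the plane is the value above the low 16 bits.
        return codePoint >>> 16;
    }

    /** Returns the total number of code points across all planes. */
    static int totalCodePoints() {
        return PLANE_COUNT * PLANE_SIZE;
    }
}
```

The 0x19 character from the error message lies in plane 0, and the last code point, 0x10FFFF, lies in plane 16.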
The W3C definition of XML 1.0 gives the following range of legal Unicode characters (in hexadecimal):
Character Range
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
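The Char production maps directly onto a predicate over Unicode code points. A minimal sketch follows; the class name `XmlChars` and the method `isValidXmlChar` are hypothetical, and the check covers XML 1.0 only.

```java
// Hypothetical helper class, for illustration only.
final class XmlChars {
    private XmlChars() {}

    /** Returns true if the code point matches the XML 1.0 Char production. */
    static boolean isValidXmlChar(int codePoint) {
        return codePoint == 0x9
            || codePoint == 0xA
            || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }
}
```

The parameter is an `int` rather than a `char` because the last range lies above 0xFFFF, which a 16-bit `char` cannot represent.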
Method 1:
// Keep only the characters that are legal in XML 1.0.
public String stripNonValidXMLCharacters(String in) {
    if (in == null || in.isEmpty()) return ""; // Nothing to filter.
    StringBuilder out = new StringBuilder(in.length()); // Holds the output.
    // Iterate by code point rather than by char: a char is 16 bits and can
    // never reach 0x10000, so a per-char check would drop every surrogate
    // pair, that is, every character above U+FFFF.
    for (int i = 0; i < in.length(); ) {
        int current = in.codePointAt(i); // The current code point.
        if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.appendCodePoint(current);
        i += Character.charCount(current); // Advance by 1 or 2 chars.
    }
    return out.toString();
}
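Method 2:
// Filter out illegal characters.
// Note: this regular expression is not exhaustive. It only removes
//   0x00 - 0x08
//   0x0b - 0x0c
//   0x0e - 0x1f
public static String stripNonValidXMLChars(String str) {
    if (str == null || "".equals(str)) {
        return str;
    }
    // The backslashes are doubled so that the regex engine, not the Java
    // compiler, interprets the \x escapes.
    return str.replaceAll("[\\x00-\\x08\\x0b-\\x0c\\x0e-\\x1f]", "");
}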
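The comment in Method 2 notes that its pattern is incomplete. One possible way to close the gap is a negated character class that lists the whole XML 1.0 Char production, so that everything outside it is removed. This is a sketch, not the original author's code; the class name `XmlTextCleaner` and the method `stripInvalidXmlChars` are hypothetical. It relies on `java.util.regex.Pattern` supporting the `\x{h...h}` escape for code points above 0xFFFF (available since Java 7).

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class XmlTextCleaner {
    // Matches any code point that is NOT in the XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    private static final Pattern INVALID_XML_CHARS = Pattern.compile(
        "[^\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\x{10000}-\\x{10FFFF}]");

    private XmlTextCleaner() {}

    /** Removes every character that XML 1.0 does not allow; null stays null. */
    static String stripInvalidXmlChars(String text) {
        if (text == null || text.isEmpty()) {
            return text;
        }
        return INVALID_XML_CHARS.matcher(text).replaceAll("");
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call, which `String.replaceAll` would do.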