Post by Olivier Miakinen
Post by Pascal
Post by Olivier Miakinen
<cit. http://www.w3.org/TR/REC-xml/#charsets>
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
</cit.>
$xml =
preg_replace('/[^\x{9}|\x{A}|\x{D}|\x{20}-\x{D7FF}|\x{E000}-\x{FFFD}|\x{10000}-\x{10FFFF}]+/u',
' ', $xmlString);
'/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u'
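For what it's worth, here is a minimal sketch of how the second pattern might be used. The function name `strip_invalid_xml10` is made up for illustration and does not come from the thread. Inside a character class, `|` is a literal pipe rather than an alternation, which is why the separators are dropped. Note that, unlike the Char production, this pattern also rejects U+007F to U+009F.

```php
<?php
// Hypothetical helper (name invented for this example): replaces every run of
// characters outside the allowed set with a single space.
function strip_invalid_xml10(string $xmlString): ?string
{
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{7E}\x{A0}-\x{D7FF}'
             . '\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';
    // With the /u modifier, preg_replace() returns null when the subject
    // is not valid UTF-8, so the return type is nullable.
    return preg_replace($pattern, ' ', $xmlString);
}
```

Because of the trailing `+`, a run of several rejected characters collapses into one space, so the length of the string can change.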
The definition of Char in XML 1.1 rather than XML 1.0, together with the
small note on a grey background, lets me propose another regexp.
<cit. http://www.w3.org/TR/xml11/#charsets>
Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
[...]
Note:
Document authors are encouraged to avoid "compatibility characters", as
defined in Unicode [Unicode]. The characters defined in the following
ranges are also discouraged. They are either control characters or
permanently undefined Unicode characters:
[#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
[#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
[#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
[#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
[#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
[#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
[#x10FFFE-#x10FFFF].
</cit.>
Hence:
$EXCLUDE_CHAR =
/* Control characters (including U+0000 and U+0085) */
'\x{0}-\x{8}' . '\x{B}\x{C}' . '\x{E}-\x{1F}' . '\x{7F}-\x{9F}' .
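/* Surrogates (U+D800-U+DFFF) are deliberately not listed: recent PCRE
   versions refuse them in a /u pattern ("disallowed Unicode code point"),
   and they cannot occur in a valid UTF-8 subject anyway */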
/* Noncharacters within Arabic Presentation Forms-A */
'\x{FDD0}-\x{FDEF}' . /* FDEF and not FDDF, see errata */
/* Noncharacters *FFFE and *FFFF */
'\x{FFFE}\x{FFFF}' . '\x{1FFFE}\x{1FFFF}' . '\x{2FFFE}\x{2FFFF}' .
'\x{3FFFE}\x{3FFFF}' . '\x{4FFFE}\x{4FFFF}' . '\x{5FFFE}\x{5FFFF}' .
'\x{6FFFE}\x{6FFFF}' . '\x{7FFFE}\x{7FFFF}' . '\x{8FFFE}\x{8FFFF}' .
'\x{9FFFE}\x{9FFFF}' . '\x{AFFFE}\x{AFFFF}' . '\x{BFFFE}\x{BFFFF}' .
'\x{CFFFE}\x{CFFFF}' . '\x{DFFFE}\x{DFFFF}' . '\x{EFFFE}\x{EFFFF}' .
'\x{FFFFE}\x{FFFFF}' . '\x{10FFFE}\x{10FFFF}';
$xml = preg_replace('/[' . $EXCLUDE_CHAR . ']/u', ' ', $xmlString);
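One possible way to package this, as a sketch only. The function name `replace_discouraged_xml_chars` is invented here. The *FFFE/*FFFF pairs of planes 1 to 16 are generated in a loop instead of being spelled out. The surrogate range is left out, on the assumption that a subject accepted by the /u modifier is valid UTF-8, in which surrogates cannot occur.

```php
<?php
// Hypothetical helper (name invented for this example): replaces each
// discouraged or forbidden character with one space.
function replace_discouraged_xml_chars(string $xmlString): ?string
{
    // Control characters (including U+0000 and U+0085)
    $exclude = '\x{0}-\x{8}\x{B}\x{C}\x{E}-\x{1F}\x{7F}-\x{9F}'
    // Noncharacters within Arabic Presentation Forms-A
             . '\x{FDD0}-\x{FDEF}'
    // Noncharacters U+FFFE and U+FFFF of the Basic Multilingual Plane
             . '\x{FFFE}\x{FFFF}';

    // Same pair at the end of each supplementary plane: U+1FFFE, U+1FFFF,
    // up to U+10FFFE, U+10FFFF.
    for ($plane = 1; $plane <= 0x10; $plane++) {
        $base = $plane << 16;
        $exclude .= sprintf('\x{%X}\x{%X}', $base | 0xFFFE, $base | 0xFFFF);
    }

    // Returns null if $xmlString is not valid UTF-8 (because of /u).
    return preg_replace('/[' . $exclude . ']/u', ' ', $xmlString);
}
```

Since there is no `+` quantifier here, each excluded character becomes its own space and the number of characters in the string is preserved. The caller should check for a null result before using the value.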