Bug #684
open
Check whether delimiters (such as a backslash) are treated correctly during character set conversion
50%
Description
The following excerpt is from DICOM PS 3.5 Section 6.1.2.3:
The Unicode and GB18030 standards have distinct Yen symbol, backslash, and several forms of reverse solidus. The separator for multi-valued data elements in DICOM is the character valued 05/12 regardless of what glyph is used to enter or display this character. The other reverse solidus characters that have a very similar appearance are not separators. The choice of font can affect the appearance of 05/12 significantly. Multi-byte encoding systems, such as GB18030, GBK and ISO 2022, may generate encodings that contain a byte valued 05/12. Only the character that encodes as a single byte valued 05/12 is a delimiter.
It is unclear whether the DcmSpecificCharacterSet class handles the backslash (and other delimiters) correctly for multi-byte character sets such as ISO 2022 IR 87, GB18030 and GBK.
See sample datasets from Bhuvan Bose (sent by email).
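To illustrate the rule quoted from PS 3.5, here is a minimal Python sketch (not DCMTK code) of a scanner that splits a GB18030 or GBK byte string only at a single-byte 05/12 (0x5C). The function name `split_gb18030_values` is hypothetical. The sketch assumes well-formed input in which every byte from 0x81 to 0xFE starts a two-byte or four-byte sequence, and it does not cover ISO 2022 escape sequences.

```python
from __future__ import annotations


def split_gb18030_values(raw: bytes) -> list[bytes]:
    """Split a GB18030/GBK byte string at single-byte 0x5C delimiters only.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    values = []
    start = 0  # start offset of the current value
    i = 0
    n = len(raw)
    while i < n:
        b = raw[i]
        if 0x81 <= b <= 0xFE and i + 1 < n:
            # Lead byte of a multi-byte character. A second byte in
            # 0x30..0x39 marks a four-byte sequence, anything else a
            # two-byte sequence. The trail bytes are skipped, so a trail
            # byte valued 0x5C is never treated as a delimiter.
            i += 4 if 0x30 <= raw[i + 1] <= 0x39 else 2
        else:
            if b == 0x5C:
                # Single-byte 05/12: this is the real value delimiter.
                values.append(raw[start:i])
                start = i + 1
            i += 1
    values.append(raw[start:])
    return values


# Byte-wise split, which cuts through the two-byte characters.
naive = b"\x81\x5c\x82\x5c".split(b"\\")
# Encoding-aware split, which keeps the value in one piece.
aware = split_gb18030_values(b"\x81\x5c\x82\x5c")
```

For the input `81 5c 82 5c`, the byte-wise split yields the fragments `81`, `82` and an empty string, whereas the encoding-aware scanner returns the four bytes as one value.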
Updated by Jörg Riesmeier almost 9 years ago
- Status changed from New to Closed
- % Done changed from 0 to 100
Fixed with commit 9e6362e.
Updated by Jörg Riesmeier 18 days ago
- Status changed from Closed to Reopened
- Target version changed from 3.6.2 to 3.7.0
There still seems to be a problem, e.g., with GB18030:
From an email sent by Mathieu Malaterre <mathieu.malaterre@gmail.com>:
I disagree with the following behavior:
% curl -O https://dclunie.com/images/charset/charsettests.20070405.tar.bz2
% cp charsettests/SCSX2 new.dcm
% dcmodify -i 8,1030=$( printf "\x81\x5c\x82\x5c") new.dcm
Using dcmtk git/master + oficonv:
$ dcmdump --log-level trace +U8 new.dcm
[...]
D: DcmSpecificCharacterSet: Converting '\201\\202\' (with delimiters '\')
T: Converting sub-string '\201'
T: -> ERROR: Cannot convert character encoding: Invalid argument
T: Appending delimiter '\' to the output
W: DcmItem: An error occurred during the conversion to UTF-8 encoding,
the value of SpecificCharacterSet (0008,0005) is not updated
E: dcmdump: Cannot convert character encoding: Invalid argument:
converting file to UTF-8: new.dcm
I would have expected something like the following:
(0008,1030) LO "StudyDescription" : [乗俓] (4 bytes)
Which seems consistent with my version of iconv:
$ echo -n '乗俓' | iconv -f utf8 -t gb18030 | hexdump -C
00000000 81 5c 82 5c |.\.\|
00000004
And, from another email:
Just in case that helps, here is another example with an actual '\' this time:
$ echo -n '乗\俓' | iconv -f utf8 -t gbk | hexdump -C
00000000 81 5c 5c 82 5c |.\\.\|
00000005
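The two iconv examples can be cross-checked with a short Python sketch that decodes the whole value first and only then splits on the backslash character. This is just one possible way to obtain the expected result, not a description of what DcmSpecificCharacterSet does. The helper name `split_after_decoding` is hypothetical, and the sketch assumes that Python's built-in `gb18030` and `gbk` codecs map these bytes the same way as the iconv build used above.

```python
from __future__ import annotations


def split_after_decoding(raw: bytes, encoding: str) -> list[str]:
    """Decode the complete value, then split on the character U+005C.

    Hypothetical helper for illustration. After decoding, a 0x5C trail
    byte has become part of a multi-byte character, so only a real
    backslash is left to act as a delimiter.
    """
    return raw.decode(encoding).split("\\")


# First email: four bytes, two characters, no delimiter.
single_value = split_after_decoding(b"\x81\x5c\x82\x5c", "gb18030")

# Second email: the same two characters separated by a real backslash.
two_values = split_after_decoding(b"\x81\x5c\x5c\x82\x5c", "gbk")
```

Under these assumptions, the first call returns a single value and the second returns two values, which matches the behavior expected in the emails.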
Updated by Jörg Riesmeier 14 days ago
- Target version changed from 3.7.0 to 3.7.1+