Bug #684: Check whether delimiters (such as a backslash) are treated correctly during character set conversion - DCMTK - OFFIS DCMTK and DICOM Projects

Bug #684

closed

Check whether delimiters (such as a backslash) are treated correctly during character set conversion

Added by Jörg Riesmeier over 9 years ago. Updated about 2 months ago.

Status:

Closed

Priority:

Normal

Assignee:

Jörg Riesmeier

Category:

Library

Target version:

3.7.1

Start date:

2016-05-27

Due date:

% Done:

100%

Estimated time:

8:00 h

Module:

dcmdata

Operating System:

Compiler:

Description

The following excerpt is from DICOM PS 3.5 Section 6.1.2.3:

The Unicode and GB18030 standards have distinct Yen symbol, backslash, and several forms of reverse solidus. The separator for multi-valued data elements in DICOM is the character valued 05/12 regardless of what glyph is used to enter or display this character. The other reverse solidus characters that have a very similar appearance are not separators. The choice of font can affect the appearance of 05/12 significantly. Multi-byte encoding systems, such as GB18030, GBK and ISO 2022, may generate encodings that contain a byte valued 05/12. Only the character that encodes as a single byte valued 05/12 is a delimiter.
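The rule in the last sentence of the excerpt can be sketched for GB18030 and GBK as a byte-level scan that skips over multi-byte characters before testing for 05/12 (0x5C). This is only an illustration of the rule, not the code used in DcmSpecificCharacterSet; the function name `split_gb18030_values` is hypothetical, and the sketch assumes well-formed input in which lead bytes are 0x81..0xFE and a second byte of 0x30..0x39 marks a four-byte sequence. It does not handle ISO 2022 escape sequences.

```python
from typing import List


def split_gb18030_values(raw: bytes, delimiter: int = 0x5C) -> List[bytes]:
    """Split raw GB18030/GBK bytes at single-byte delimiters only (hypothetical helper)."""
    values = []
    start = 0  # start offset of the current value
    i = 0
    n = len(raw)
    while i < n:
        b = raw[i]
        if 0x81 <= b <= 0xFE and i + 1 < n:
            # Lead byte of a multi-byte character. A second byte in
            # 0x30..0x39 marks a four-byte sequence, otherwise two bytes.
            # Trail bytes are skipped, so a trail byte of 0x5C is never
            # mistaken for a delimiter.
            i += 4 if 0x30 <= raw[i + 1] <= 0x39 else 2
            continue
        if b == delimiter:
            # A 0x5C outside any multi-byte character separates two values.
            values.append(raw[start:i])
            start = i + 1
        i += 1
    values.append(raw[start:])
    return values
```

With this scan, the byte string `81 5c 82 5c` stays a single value, while `81 5c 5c 82 5c` is split into `81 5c` and `82 5c`.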

It is unclear whether the DcmSpecificCharacterSet class handles the backslash (and other delimiters) correctly for multi-byte character sets such as ISO 2022 IR 87, GB18030 and GBK.

See sample datasets from Bhuvan Bose (sent by email).

Updated by Jörg Riesmeier over 9 years ago

Description updated (diff)

Updated by Jörg Riesmeier about 9 years ago

Status changed from New to Closed
% Done changed from 0 to 100

Fixed with commit 9e6362e.

Updated by Jörg Riesmeier 4 months ago

Status changed from Closed to Reopened
Target version changed from 3.6.2 to 3.7.0

There still seems to be a problem, e.g., with GB18030:

From an email sent by Mathieu Malaterre <mathieu.malaterre@gmail.com>:

I disagree with the following behavior:

% curl -O https://dclunie.com/images/charset/charsettests.20070405.tar.bz2
% cp charsettests/SCSX2 new.dcm
% dcmodify -i 8,1030=$( printf "\x81\x5c\x82\x5c") new.dcm

Using dcmtk git/master + oficonv:

$ dcmdump --log-level trace +U8 new.dcm
[...]
D: DcmSpecificCharacterSet: Converting '\201\\202\' (with delimiters '\')
T: Converting sub-string '\201'
T: -> ERROR: Cannot convert character encoding: Invalid argument
T: Appending delimiter '\' to the output
W: DcmItem: An error occurred during the conversion to UTF-8 encoding, the value of SpecificCharacterSet (0008,0005) is not updated
E: dcmdump: Cannot convert character encoding: Invalid argument: converting file to UTF-8: new.dcm

I would have expected something like the following:

(0008,1030) LO "StudyDescription" : [乗俓] (4 bytes)

Which seems consistent with my version of iconv:

$ echo -n '乗俓' | iconv -f utf8 -t gb18030 | hexdump -C
00000000 81 5c 82 5c |.\.\|
00000004

And, from another email:

Just in case that helps, here is another example with an actual '\' this time:

$ echo -n '乗\俓' | iconv -f utf8 -t gbk | hexdump -C
00000000 81 5c 5c 82 5c |.\\.\|
00000005
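The two iconv checks above can also be reproduced with Python's built-in `gb18030` and `gbk` codecs. The sketch below additionally shows what happens when the GB18030 bytes are split at every 0x5C byte before decoding, which is what the trace output suggests: the first sub-string is the lone lead byte 0x81, an incomplete sequence. The helper name `naive_split_decodes` is hypothetical, and Python's codecs are used here only as a stand-in for iconv/oficonv.

```python
# Encodings of the two sample strings from the emails above.
two_chars = "乗俓".encode("gb18030")
with_backslash = "乗\\俓".encode("gbk")


def naive_split_decodes(raw: bytes, encoding: str) -> bool:
    """Split at every 0x5C byte, then try to decode each piece (hypothetical helper)."""
    try:
        for part in raw.split(b"\\"):
            part.decode(encoding)
    except UnicodeDecodeError:
        # A piece such as b"\x81" is a lead byte without its trail byte.
        return False
    return True


print(two_chars.hex())
print(with_backslash.hex())
print(naive_split_decodes(two_chars, "gb18030"))
```

Decoding the unsplit bytes with `two_chars.decode("gb18030")` round-trips to the original two characters, so the failure comes from the split and not from the data.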
