Bug #684: Check whether delimiters (such as a backslash) are treated correctly during character set conversion - DCMTK - OFFIS DCMTK and DICOM Projects

Added by Jörg Riesmeier almost 10 years ago. Updated 2 months ago.

Status: Closed
Priority: Normal
Assignee: Jörg Riesmeier
Category: Library
Target version: 3.7.1
Start date: 2016-05-27
Due date:
% Done: 100%
Estimated time: 8:00 h
Module: dcmdata
Operating System:
Compiler:

Description

The following excerpt is from DICOM PS 3.5 Section 6.1.2.3:

The Unicode and GB18030 standards have distinct Yen symbol, backslash, and several forms of reverse solidus. The separator for multi-valued data elements in DICOM is the character valued 05/12 regardless of what glyph is used to enter or display this character. The other reverse solidus characters that have a very similar appearance are not separators. The choice of font can affect the appearance of 05/12 significantly. Multi-byte encoding systems, such as GB18030, GBK and ISO 2022, may generate encodings that contain a byte valued 05/12. Only the character that encodes as a single byte valued 05/12 is a delimiter.

It is unclear whether the DcmSpecificCharacterSet class handles the backslash (and other delimiters) correctly for multi-byte character sets such as ISO 2022 IR 87, GB18030 and GBK.
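As an illustration of the rule quoted from PS 3.5, the following is a minimal sketch of a splitter that only treats a single-byte 0x5C (05/12) as a delimiter in GBK/GB18030 data. The function name split_gb18030_values is hypothetical, the byte ranges are the usual GB18030 layout (lead byte 0x81-0xFE, second byte 0x30-0x39 for four-byte sequences), and this is not how DcmSpecificCharacterSet is implemented. ISO 2022 encodings with escape sequences would need different handling.

```python
from typing import List


def split_gb18030_values(data: bytes) -> List[bytes]:
    """Split GBK/GB18030 bytes on single-byte 0x5C delimiters only.

    Hypothetical helper for illustration; assumes well-formed input.
    """
    values = []
    start = 0
    i = 0
    n = len(data)
    while i < n:
        byte = data[i]
        if 0x81 <= byte <= 0xFE and i + 1 < n:
            # Lead byte of a multi-byte character: skip the whole character,
            # so a trailing byte valued 0x5C is never seen as a delimiter.
            if 0x30 <= data[i + 1] <= 0x39:
                i += 4  # four-byte GB18030 sequence
            else:
                i += 2  # two-byte GBK/GB18030 sequence
            continue
        if byte == 0x5C:
            # A stand-alone 0x5C is the DICOM value delimiter.
            values.append(data[start:i])
            start = i + 1
        i += 1
    values.append(data[start:])
    return values


if __name__ == "__main__":
    print(split_gb18030_values(b"\x81\x5c\x82\x5c"))
    print(split_gb18030_values(b"\x81\x5c\x5c\x82\x5c"))
```

With this approach, a byte sequence such as 81 5c 82 5c stays a single value, while 81 5c 5c 82 5c is split into two values at the middle 5c.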

See sample datasets from Bhuvan Bose (sent by email).

Updated by Jörg Riesmeier almost 10 years ago

Description updated (diff)

Updated by Jörg Riesmeier about 9 years ago

Status changed from New to Closed
% Done changed from 0 to 100

Fixed with commit 9e6362e.

Updated by Jörg Riesmeier 5 months ago

Status changed from Closed to Reopened
Target version changed from 3.6.2 to 3.7.0

There still seems to be a problem, e.g., with GB18030:

From an email sent by Mathieu Malaterre <mathieu.malaterre@gmail.com>:

I disagree with the following behavior:

% curl -O https://dclunie.com/images/charset/charsettests.20070405.tar.bz2
% cp charsettests/SCSX2 new.dcm
% dcmodify -i 8,1030=$( printf "\x81\x5c\x82\x5c") new.dcm

Using dcmtk git/master + oficonv:

$ dcmdump --log-level trace +U8 new.dcm
[...]
D: DcmSpecificCharacterSet: Converting '\201\\202\' (with delimiters '\')
T: Converting sub-string '\201'
T: -> ERROR: Cannot convert character encoding: Invalid argument
T: Appending delimiter '\' to the output
W: DcmItem: An error occurred during the conversion to UTF-8 encoding, the value of SpecificCharacterSet (0008,0005) is not updated
E: dcmdump: Cannot convert character encoding: Invalid argument: converting file to UTF-8: new.dcm

I would have expected something like the following:

(0008,1030) LO "StudyDescription" : [乗俓] (4 bytes)

Which seems consistent with my version of iconv:

$ echo -n '乗俓' | iconv -f utf8 -t gb18030 | hexdump -C
00000000 81 5c 82 5c |.\.\|
00000004
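The trace above shows the sub-string '\201' (octal for 0x81) being converted on its own, which would be consistent with the value having been split on every 0x5C byte. A small sketch using Python's built-in gb18030 codec (an assumption made for illustration, it is not the oficonv code path) reproduces both the iconv result and the failure of such a naive split. The helper name decode_or_none is hypothetical.

```python
from typing import Optional


def decode_or_none(part: bytes) -> Optional[str]:
    """Decode GB18030 bytes, returning None if they are not valid."""
    try:
        return part.decode("gb18030")
    except UnicodeDecodeError:
        return None


# Same bytes as in the iconv/hexdump output: 81 5c 82 5c
encoded = "乗俓".encode("gb18030")

# Naive split on every 0x5C byte, ignoring multi-byte characters
naive_parts = encoded.split(b"\\")

if __name__ == "__main__":
    print(encoded)
    print(naive_parts)
    # The isolated lead byte 0x81 cannot be decoded on its own
    print(decode_or_none(naive_parts[0]))
    # The unsplit value decodes to the expected two characters
    print(decode_or_none(encoded))
```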

And, from another email:

Just in case that helps, here is another example with an actual '\' this time:

$ echo -n '乗\俓' | iconv -f utf8 -t gbk | hexdump -C
00000000 81 5c 5c 82 5c |.\\.\|
00000005
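For this second example, one way to obtain the expected two values is to decode the whole byte string first and split on the backslash character afterwards, so that only a real delimiter is matched. This is a sketch using Python's built-in gbk codec and is only an assumption about one possible strategy, not a description of the DCMTK fix.

```python
# Bytes from the hexdump above: 81 5c 5c 82 5c
raw = bytes.fromhex("81 5c 5c 82 5c")

# Decode the complete value, then split on the backslash character.
# After decoding, trailing bytes valued 0x5C are already part of
# their characters and can no longer be mistaken for a delimiter.
values = raw.decode("gbk").split("\\")

# Re-encoding each value gives back the per-value byte strings
reencoded = [value.encode("gbk") for value in values]

if __name__ == "__main__":
    print(values)
    print(reencoded)
```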
