Bug #684

open

Check whether delimiters (such as a backslash) are treated correctly during character set conversion

Added by Jörg Riesmeier over 9 years ago. Updated 14 days ago.

Status: Reopened
Priority: Normal
Category: Library
Target version:
Start date: 2016-05-27
Due date:
% Done: 50%
Estimated time:
Module: dcmdata
Operating System:
Compiler:

Description

The following excerpt is from DICOM PS 3.5 Section 6.1.2.3:

The Unicode and GB18030 standards have distinct Yen symbol, backslash, and several forms of reverse solidus. The separator for multi-valued data elements in DICOM is the character valued 05/12 regardless of what glyph is used to enter or display this character. The other reverse solidus characters that have a very similar appearance are not separators. The choice of font can affect the appearance of 05/12 significantly. Multi-byte encoding systems, such as GB18030, GBK and ISO 2022, may generate encodings that contain a byte valued 05/12. Only the character that encodes as a single byte valued 05/12 is a delimiter.

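It is unclear whether the DcmSpecificCharacterSet class handles the backslash (and other delimiters) correctly for multi-byte character sets such as ISO 2022 IR 87, GB18030 and GBK. In these encodings, a byte valued 05/12 (0x5C) can occur as part of a multi-byte character, in which case it must not be treated as a value delimiter.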
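To make the rule from PS 3.5 concrete, here is a minimal sketch in Python (not DCMTK code) of a splitter that only treats a single-byte 0x5C as a delimiter. It assumes the usual GB18030/GBK byte layout: a lead byte in the range 0x81 to 0xFE starts either a two-byte character or, when the second byte is in 0x30 to 0x39, a four-byte character. The function name is hypothetical, and ISO 2022 escape sequences are not covered.

```python
def split_on_single_byte_backslash(raw: bytes) -> list[bytes]:
    """Split a GB18030/GBK encoded value at single-byte 0x5C only.

    Assumption: lead bytes are 0x81..0xFE; a second byte in 0x30..0x39
    marks a four-byte character, anything else a two-byte character.
    """
    values = []
    start = 0  # start offset of the current value
    i = 0
    n = len(raw)
    while i < n:
        byte = raw[i]
        if 0x81 <= byte <= 0xFE and i + 1 < n:
            # Multi-byte character: skip it as a whole, so that a trail
            # byte valued 0x5C is never mistaken for a delimiter.
            i += 4 if 0x30 <= raw[i + 1] <= 0x39 else 2
            continue
        if byte == 0x5C:
            # A 0x5C that is not part of a multi-byte character.
            values.append(raw[start:i])
            start = i + 1
        i += 1
    values.append(raw[start:])
    return values


if __name__ == "__main__":
    # Two characters whose trail bytes are 0x5C: one value, no delimiter.
    print(split_on_single_byte_backslash(bytes.fromhex("815c825c")))
    # Same characters separated by a real backslash: two values.
    print(split_on_single_byte_backslash(bytes.fromhex("815c5c825c")))
```

Each resulting sub-string could then be passed to the character set converter on its own, which is roughly what the current per-value conversion expects.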

See sample datasets from Bhuvan Bose (sent by email).

Actions #1

Updated by Jörg Riesmeier over 9 years ago

  • Description updated

Actions #2

Updated by Jörg Riesmeier almost 9 years ago

  • Status changed from New to Closed
  • % Done changed from 0 to 100

Fixed with commit 9e6362e.

Actions #3

Updated by Jörg Riesmeier 18 days ago

  • Status changed from Closed to Reopened
  • Target version changed from 3.6.2 to 3.7.0

There still seems to be a problem, e.g., with GB18030:

From an email sent by Mathieu Malaterre:

I disagree with the following behavior:

% curl -O https://dclunie.com/images/charset/charsettests.20070405.tar.bz2
% cp charsettests/SCSX2 new.dcm
% dcmodify -i 8,1030=$( printf "\x81\x5c\x82\x5c") new.dcm

Using dcmtk git/master + oficonv:

$ dcmdump --log-level trace +U8 new.dcm
[...]
D: DcmSpecificCharacterSet: Converting '\201\\202\' (with delimiters '\')
T: Converting sub-string '\201'
T: -> ERROR: Cannot convert character encoding: Invalid argument
T: Appending delimiter '\' to the output
W: DcmItem: An error occurred during the conversion to UTF-8 encoding,
the value of SpecificCharacterSet (0008,0005) is not updated
E: dcmdump: Cannot convert character encoding: Invalid argument:
converting file to UTF-8: new.dcm

I would have expected something like the following:

(0008,1030) LO "StudyDescription" : [乗俓] (4 bytes)

Which seems consistent with my version of iconv:

$ echo -n '乗俓' | iconv -f utf8 -t gb18030 | hexdump -C
00000000 81 5c 82 5c |.\.\|
00000004

And, from another email:

Just in case that helps, here is another example with an actual '\' this time:

$ echo -n '乗\俓' | iconv -f utf8 -t gbk | hexdump -C
00000000 81 5c 5c 82 5c |.\\.\|
00000005
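The two examples from these emails can be reproduced outside of DCMTK with a short Python sketch. It relies on Python's built-in "gb18030" and "gbk" codecs (not on oficonv or libiconv), and the helper names are made up for illustration. The first helper mimics what the trace log suggests is happening (split the raw bytes at every 0x5C, then convert each part); the second converts the whole value first and splits afterwards, which is only one possible alternative and not necessarily how a fix in dcmdata would look.

```python
def naive_split(raw: bytes, codec: str) -> list:
    """Split the raw bytes at every 0x5C, then decode each part.

    Parts that cannot be decoded are reported as None.
    """
    results = []
    for part in raw.split(b"\\"):
        try:
            results.append(part.decode(codec))
        except UnicodeDecodeError:
            results.append(None)
    return results


def decode_then_split(raw: bytes, codec: str) -> list:
    """Decode the whole value first, then split at U+005C."""
    return raw.decode(codec).split("\\")


if __name__ == "__main__":
    no_delimiter = bytes.fromhex("815c825c")      # first email
    with_delimiter = bytes.fromhex("815c5c825c")  # second email

    # The trail bytes are cut off, so the lead bytes fail to decode.
    print(naive_split(no_delimiter, "gb18030"))
    # Decoding first keeps both characters in a single value.
    print(decode_then_split(no_delimiter, "gb18030"))
    # With a real backslash, two values are returned.
    print(decode_then_split(with_delimiter, "gbk"))
```

In the naive variant, the lone byte 0x81 cannot be decoded, which corresponds to the "Converting sub-string '\201'" error in the trace output above.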

Actions #4

Updated by Jörg Riesmeier 18 days ago

  • % Done changed from 100 to 50

Actions #5

Updated by Jörg Riesmeier 14 days ago

  • Target version changed from 3.7.0 to 3.7.1+