malformed xml output for non-ascii filenames #302
I must admit I never got into the XML specs proper; the parsers I tried up to now were all happy to eat whatever UTF-8 I feed them, and I (naively?) expected that putting in the XML header as we have it now should suffice:
I'd be happy to make the necessary changes to make the output compliant, but it would help if someone who knows about these things could hold my hand and tell me how to do it :)
Yes, that's how it is supposed to work: UTF-8 should never be a problem. But this is about malformed filenames, which cannot be represented as UTF-8.
What is the problem?
From my point of view the problem has two layers:
My "solution" (in Python)
In my case I am sanitizing the output of duc xml --database db.duc | python3 -c 'import re, sys; regex = re.compile(b"name=\"([^\"]+)\""); converter = lambda m: b"name=\"" + m.groups()[0].decode(errors="surrogateescape").encode(errors="backslashreplace") + b"\""; raw_xml = sys.stdin.buffer.read(); sys.stdout.buffer.write(regex.sub(converter, raw_xml))' (I hope my example is not too hard to read for non-native Python speakers)
The result (happily accepted by
In short:
Here
Conclusion
First of all: I am sorry for bringing up such a tricky problem :) I guess it is up to you as the maintainer to decide whether the edge case of "malformed" (non-UTF-8) input is worth handling (with regard to the users of
Sorry that I cannot offer more detailed advice regarding XML or encoding details. My knowledge is limited in these fields. Anyway: thank you for your time!
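For anyone who finds the one-liner above hard to read, here is roughly the same logic unpacked into a small script. It is only a sketch: the names `sanitize_name` and `sanitize_xml` and the sample `<ent ...>` fragment are made up for illustration, and like the one-liner it assumes that file names appear only in `name="..."` attributes and contain no literal double quotes.

```python
import re

# Same pattern as in the one-liner: capture the value of each name="..." attribute.
NAME_ATTR = re.compile(rb'name="([^"]+)"')


def sanitize_name(match):
    """Rewrite one name attribute so that its value is valid UTF-8."""
    raw = match.group(1)
    # Bytes that are not valid UTF-8 become lone surrogates (U+DC80..U+DCFF).
    text = raw.decode("utf-8", errors="surrogateescape")
    # Lone surrogates cannot be encoded, so they are written as \udcXX escapes.
    safe = text.encode("utf-8", errors="backslashreplace")
    return b'name="' + safe + b'"'


def sanitize_xml(raw_xml: bytes) -> bytes:
    """Apply sanitize_name to every name attribute in the raw XML bytes."""
    return NAME_ATTR.sub(sanitize_name, raw_xml)


# Hypothetical sample: "café" encoded as Latin-1, which is not valid UTF-8.
print(sanitize_xml(b'<ent name="caf\xe9" />'))
```

The invalid byte `0xe9` ends up as the literal text `\udce9`, so the output is valid UTF-8 but no longer matches the name on disk. To use this as a filter, pass `sys.stdin.buffer.read()` to `sanitize_xml` and write the result to `sys.stdout.buffer`.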
Right; that all makes perfect sense, I hadn't realized we were talking about malformed file names here. The problem here is that duc is basically oblivious to encodings; it simply does not care about what the data means, the file names are just a sequence of bytes. If it happens to be valid UTF-8, that's nice, but Duc does not care. XML does care however, so when exporting we should take proper care to emit valid UTF-8. The problem is how to do this cleanly in plain C. We would at least need to interpret all names as UTF-8 to see when a sequence of bytes results in invalid UTF-8, and find another way to encode these bytes. I'm not comfortable pulling in some large dependency like iconv, but I guess we should be able to whip up something lightweight for this.
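To make the idea concrete, here is a rough sketch of the kind of lightweight check described above: walk the bytes, pass well-formed UTF-8 sequences through unchanged, and replace every other byte with a textual escape. It is written in Python only to keep it short; it uses nothing but byte comparisons, so it should translate to plain C fairly directly. The function names and the `\xNN` escape format are assumptions for illustration, not something duc implements, and the sketch does not deal with the control characters that XML 1.0 also forbids.

```python
def utf8_sequence_length(data: bytes, i: int) -> int:
    """Return the length of the well-formed UTF-8 sequence at data[i], or 0."""
    b0 = data[i]
    if b0 <= 0x7F:
        return 1  # plain ASCII
    # need = number of continuation bytes; lo..hi = allowed range of the first
    # continuation byte (narrower for some lead bytes, to reject overlong
    # encodings, UTF-16 surrogates and code points above U+10FFFF).
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif b0 == 0xE0:
        need, lo, hi = 2, 0xA0, 0xBF
    elif b0 == 0xED:
        need, lo, hi = 2, 0x80, 0x9F
    elif 0xE1 <= b0 <= 0xEF:
        need, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xF0:
        need, lo, hi = 3, 0x90, 0xBF
    elif 0xF1 <= b0 <= 0xF3:
        need, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF4:
        need, lo, hi = 3, 0x80, 0x8F
    else:
        return 0  # 0x80..0xC1 and 0xF5..0xFF can never start a sequence
    if i + need >= len(data):
        return 0  # sequence is cut off at the end of the name
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, need + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return need + 1


def escape_invalid_utf8(data: bytes) -> bytes:
    """Copy valid UTF-8 through and write every other byte as \\xNN text."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data, i)
        if n:
            out += data[i:i + n]
            i += n
        else:
            out += b"\\x%02x" % data[i]
            i += 1
    return bytes(out)


print(escape_invalid_utf8(b"B\xfccher"))          # Latin-1 umlaut, invalid as UTF-8
print(escape_invalid_utf8("Bücher".encode("utf-8")))  # valid UTF-8, left untouched
```

Escaping one byte at a time keeps the loop simple and guarantees progress, at the cost of making the exported name differ from the real one whenever a byte had to be escaped.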
I stumbled upon an issue with some local filenames (e.g. containing German umlauts).
At least Python's xml library refuses to load the data generated by duc if special characters in filenames are involved.
The following procedure demonstrates the issue (comments are included):
I do not know the XML specification in detail, but I think non-trivial characters (everything outside of 7-bit ASCII?) need to be escaped.
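A small check may help pin this down. It suggests that non-ASCII characters are accepted as long as the bytes are valid UTF-8, and that the parser only fails when the bytes do not match the declared encoding. The sketch below uses Python's `xml.etree.ElementTree` with a made-up `<ent name=...>` fragment (not actual duc output), encoded once as UTF-8 and once as Latin-1; the helper `parses` is hypothetical.

```python
import xml.etree.ElementTree as ET

HEADER = b'<?xml version="1.0" encoding="UTF-8"?>\n'

# The same made-up element, encoded in two different ways.
valid = HEADER + '<ent name="Bücher" />'.encode("utf-8")
invalid = HEADER + '<ent name="Bücher" />'.encode("latin-1")


def parses(document: bytes) -> bool:
    """Return True if ElementTree accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("UTF-8 bytes accepted:  ", parses(valid))
print("Latin-1 bytes accepted:", parses(invalid))
```

The unescaped umlaut loads fine when it is encoded as UTF-8, while the Latin-1 byte `0xfc` is rejected because it is not a valid UTF-8 sequence.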
Python would emit the following for the above special character:
What do you think?
Thank you for maintaining duc!