Po Based Documentation Translations
Revision as of 15:37, 14 November 2018
Our current documentation translation workflow is based on creating a separate set of xml files per language. Each set is copied from the original English set and then translated in place.
While this works very well for a new translation, it becomes increasingly difficult to keep up with changes in the original English set, particularly when those changes are big.
So we're considering moving to a po based workflow. In this scenario there is only one set of xml files for the documentation. These sources are primarily written in English, with a few sections that are language specific. The language-specific sections will only appear in the derived documentation set (that is, the one generated by the build system) for that language. For translation, all strings in the source xml set are extracted into a po file. Several tools exist to handle translation of po files into multiple languages.
Proof of concept
As a first proof of concept I have created an itstool branch on GitHub. The name 'itstool' was chosen because itstool will be a central piece in making this work. It's a tool that extracts translatable strings from xml files, combines them in a po file and generates translated xml files from the base xml and the po file.
On this branch a subset of the German guide and help files has been converted into a po file that matches the English sources in guide/C and help/C. Not everything has been converted, as this is only a proof of concept. I have also merged a sizeable PR involving those same files to test how that affects the po file.
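The three roles of itstool described above (extract, combine, generate) can be sketched as a small command builder. This is only an illustration, not what the branch's makefile rules actually do: the file names are hypothetical, and it assumes the usual itstool invocation (`-o` for the output, `-m` to merge a compiled mo file) together with gettext's `msgfmt` to compile the po file first.

```python
import subprocess


def extract_cmd(xml_files, pot_file):
    """Extract all translatable strings from the xml files into one template."""
    return ["itstool", "-o", str(pot_file), *map(str, xml_files)]


def compile_cmd(po_file, mo_file):
    """Compile a translated po file into the binary mo file itstool merges from."""
    return ["msgfmt", "-o", str(mo_file), str(po_file)]


def merge_cmd(mo_file, xml_files, out_dir):
    """Generate translated copies of the xml files in out_dir."""
    return ["itstool", "-m", str(mo_file), "-o", str(out_dir), *map(str, xml_files)]


def translation_pipeline(xml_files, po_file, out_dir, dry_run=True):
    """Return the commands for one language; run them unless dry_run is set."""
    mo_file = str(po_file)[:-3] + ".mo" if str(po_file).endswith(".po") else str(po_file) + ".mo"
    commands = [
        compile_cmd(po_file, mo_file),
        merge_cmd(mo_file, xml_files, out_dir),
    ]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # requires itstool and gettext
    return commands


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    sources = ["guide/C/index.docbook"]
    print(" ".join(extract_cmd(sources, "guide/po/guide.pot")))
    for command in translation_pipeline(sources, "guide/po/de.po", "guide/de"):
        print(" ".join(command))
```

With `dry_run=True` the sketch only prints the commands, so it can be inspected without itstool or gettext being installed.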
The process to get there involved several steps:
- First I have reduced the English and German guide and help files to a smaller test set to limit the test scope somewhat. The retained set is just big enough to merge sunfish62's PR, in which he moved large chunks from help to guide, so that we can evaluate how a big change affects the translations.
- Next I have lined up the xml tags in the English and German xml files. The goal of this step is to ensure itstool extracts the exact same msgid structure from the English and the German set. This is influenced by tag order and by punctuation/capitalization of the msgids. Note that this alignment is only required once, during the conversion from separate xml file sets to a po file. Once everything is in a po file, the gettext tools know how to handle msgid changes.
- If the msgid order can be made exactly the same, a script will generate a real po file from the message catalogs that itstool extracts from the aligned sets of step 2.
- Next I have written a number of extra makefile rules to help in this process. The commit log gives all the details.
- With all of that I have made a first version of de.po and committed it on the branch. Keep in mind this only deals with the limited subset of files. If the po route proves viable, we have to extend this to include the messages from all xml files.
- Finally I have merged in sunfish62's changes to estimate their impact on de.po.
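To illustrate why the alignment step matters, here is a minimal sketch of how two aligned message lists could be combined into a po file. It is not the script used on the branch. It assumes the messages have already been read from the two catalogs into plain Python lists, pairs them purely by position, and uses hypothetical function names.

```python
def po_escape(text):
    """Escape a string for use inside a po file's double quotes."""
    return text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def pair_catalogs(english, german):
    """Pair messages by position; refuse to guess if the sets are not aligned."""
    if len(english) != len(german):
        raise ValueError(
            f"catalogs not aligned: {len(english)} English vs {len(german)} German messages"
        )
    return list(zip(english, german))


def render_po(pairs, language="de"):
    """Render (msgid, msgstr) pairs as po text with a minimal header."""
    lines = [
        'msgid ""',
        'msgstr ""',
        f'"Language: {language}\\n"',
        '"Content-Type: text/plain; charset=UTF-8\\n"',
        "",
    ]
    seen = set()
    for source, translation in pairs:
        if source in seen:
            continue  # a po file may hold each msgid only once
        seen.add(source)
        lines.append(f'msgid "{po_escape(source)}"')
        lines.append(f'msgstr "{po_escape(translation)}"')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    english = ["Open a file", "Save your work"]
    german = ["Datei öffnen", "Arbeit speichern"]
    print(render_po(pair_catalogs(english, german)))
```

Because the pairing is positional, a single extra or missing message in one language shifts every later translation onto the wrong msgid. The length check catches only the crudest case, which is why the tags have to be lined up by hand first.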
Todo
- The process is not complete yet. The end deliverable is still a set of xml files per language, so rules should be added to merge the source xml set with a po file into a language-specific xml set.
- The proof of concept also only works on a small subset of the German xml file set. If we decide to go ahead, the same effort still has to be made for all other xml files (the remainder of the German translation and all files in the other translations).
- As pointed out by Frank H. Ellenberger on irc, we should consider which frequently recurring strings to convert into entities. Good first candidates are our base urls (https://www.gnucash.org, https://bugs.gnucash.org,...) and possibly the terms in the glossary file in the code repository. This needs more thought.
- We should still determine how to deal with untranslated content. In the separate xml file set workflow this content was simply missing. However in the po based workflow this will appear in English. There are different opinions on which is better. There are several possible alternatives to deal with untranslated content:
- the untranslated content will appear in English in the native document. So the content will be mixed. This is the default with a po based workflow.
- we could post process the generated files to add annotations to the untranslated content. For example, this might be an opportunity to solicit help with further translations.
- another way to post process would be to completely remove untranslated sections again. This would be similar to the original separate xml file sets workflow. However we have less control of this as all sections are extracted automatically.
- we leave it up to the translator to decide. A translator can use ITS-based language selectors in the source documents to control which sections should appear in the translated files. That would allow each language to decide for itself what to include and what not.
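Whichever alternative is chosen, the post processing options first need to know which messages are still untranslated. A minimal sketch of that check, assuming a plain po file in which an untranslated message has an empty msgstr (fuzzy flags and plural forms are ignored here, and strings are kept in their escaped form). The function names are hypothetical.

```python
def parse_po(text):
    """Return (msgid, msgstr) pairs from po text, skipping the header entry."""
    entries = []
    current = None
    field = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("msgid "):
            current = {"msgid": "", "msgstr": ""}
            entries.append(current)
            field = "msgid"
            current[field] += line[len("msgid "):].strip()[1:-1]
        elif line.startswith("msgstr ") and current is not None:
            field = "msgstr"
            current[field] += line[len("msgstr "):].strip()[1:-1]
        elif line.startswith('"') and field is not None:
            current[field] += line[1:-1]  # continuation of a multi-line string
        else:
            field = None  # comment, blank line or a keyword not handled here
    return [(e["msgid"], e["msgstr"]) for e in entries if e["msgid"]]


def untranslated(entries):
    """List the msgids that would show up in English in the generated files."""
    return [msgid for msgid, msgstr in entries if not msgstr]


if __name__ == "__main__":
    sample = (
        'msgid ""\nmsgstr ""\n"Language: de\\n"\n\n'
        'msgid "Open a file"\nmsgstr "Datei öffnen"\n\n'
        'msgid "Save your work"\nmsgstr ""\n'
    )
    entries = parse_po(sample)
    missing = untranslated(entries)
    print(f"{len(missing)} of {len(entries)} messages untranslated")
    for msgid in missing:
        print("  ", msgid)
```

The resulting list could feed either an annotation pass or a removal pass. It could also give translators a quick overview of how much English would remain in their language's documents.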
Known Issues
- I have noticed the extraction of mediaobjects is less than ideal. It's a large blob of text. We may have to extend the itstool ruleset to better suit our needs.