File size observations on the IATE TBX Termbase

It has been known for a while now that, since June 2014, a database dump of IATE, the EU terminology database, has been available for download and no longer only through a web search form. The ZIP file is ~116 MB, the unpacked database 2.2 GB (!). Since it contains all EU languages, I split this file into four subfiles and extracted four trilingual DE/FR/EN files using an XSL transformation sheet. xsltproc.exe from Apache’s Xerces XML Parser package couldn’t cope with the complete file, but the four 550 MB files passed through in about 10 minutes each and dropped to about half their original size.
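For readers who would rather not fight with an XSLT processor on a 2.2 GB file, here is a minimal sketch of the same language extraction as a streaming pass in Python. It is not the XSL sheet I used. It assumes the usual TBX layout (termEntry elements containing langSet elements with an xml:lang attribute, no XML namespace on the element names); the function name and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this expanded name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_languages(source, keep=("de", "fr", "en")):
    """Stream a TBX file and yield each termEntry (serialized as a string)
    reduced to the langSet elements whose xml:lang is in `keep`."""
    wanted = set(keep)
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "termEntry":
            continue
        for lang_set in list(elem.findall("langSet")):
            if lang_set.get(XML_LANG) not in wanted:
                elem.remove(lang_set)
        # Entries without any wanted language are skipped entirely.
        if elem.find("langSet") is not None:
            yield ET.tostring(elem, encoding="unicode")
        # Drop the entry's content so memory use stays small; only an
        # empty shell per entry remains attached to the tree.
        elem.clear()


# Tiny hand-made sample (not real IATE data).
SAMPLE = b"""<martif type="TBX"><text><body>
<termEntry id="demo-1">
  <langSet xml:lang="de"><tig><term>Haus</term></tig></langSet>
  <langSet xml:lang="pl"><tig><term>dom</term></tig></langSet>
  <langSet xml:lang="en"><tig><term>house</term></tig></langSet>
</termEntry>
<termEntry id="demo-2">
  <langSet xml:lang="pl"><tig><term>kot</term></tig></langSet>
</termEntry>
</body></text></martif>"""

entries = list(extract_languages(io.BytesIO(SAMPLE)))
```

Because the parser hands over one termEntry at a time, this approach should not need the file to be split first, although I have not timed it against the full dump.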

About 250-275 MB per file is still quite fat, so I thought about ways to reduce this further. (Un-)fortunately, IATE isn’t exactly renowned for its accuracy – colleagues in the know will always tell you to use IATE with caution. IATE has a “Reliability” rating which is assigned to each entry, running from 1 (unchecked) via 2 (minimal reliability) and 3 (reliable) to 4 (very reliable/assessed). Thus, I was tempted to throw out all Reliability 1+2 entries and considered also doing away with Rel. 3 entries, since the IATE team itself notes:

This code was automatically assigned to many entries, regardless of their previous validation status, following the merger of existing databases to create IATE. Therefore some entries marked as ‘reliable’ are not necessarily so.

Uh-huh. So basically, all sorts of stuff was thrown in and instead of correctly classifying it as minimally reliable (Rel. 2) until the material could be reviewed, it was decided to recommend it as “reliable” (Rel. 3). That was the point at which I wrote two more XSL sheets, one to filter for Reliability 3+4 (R34) and one exclusively for Reliability 4 (R4). Since that run looked promising, I wrote yet another XSLT script to clean up the results (C), deleting empty language groups (“tig” elements) or even empty term entries (“termEntry” elements). Here’s what happened:

IATE TBX File Size Reductions for DE/FR/EN
| Filename | Orig. Size | R34 Size | R34 % from Orig | R34 Cleaned Size | R34C % from R34 | R34C % from Orig | R4 Size | R4 % from Orig | R4 Cleaned Size | R4C % from R4 | R4C % from Orig |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IATE-de-fr-en-1of4.tbx | 273 MB | 166 MB | -39% | 125 MB | -25% | -54% | 113 MB | -59% | 57 MB | -50% | -79% |
| IATE-de-fr-en-2of4.tbx | 276 MB | 233 MB | -16% | 212 MB | -9% | -23% | 106 MB | -62% | 50 MB | -53% | -82% |
| IATE-de-fr-en-3of4.tbx | 253 MB | 213 MB | -16% | 192 MB | -10% | -24% | 100 MB | -61% | 46 MB | -54% | -82% |
| IATE-de-fr-en-4of4.tbx | 271 MB | 245 MB | -10% | 231 MB | -6% | -15% | 107 MB | -61% | 56 MB | -48% | -79% |
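The reliability filter and the clean-up step behind these numbers can be sketched in a few lines of Python. This is an illustration under assumptions, not a port of my XSL sheets: it assumes that each “tig” carries its rating as a descrip element with type="reliabilityCode", it merges the filter and clean-up passes into one function, and all names and sample terms are invented.

```python
import xml.etree.ElementTree as ET


def reliability_of(tig):
    """Return the tig's reliability code as an int, or 0 if missing or invalid.
    Assumes the code is stored as <descrip type="reliabilityCode">."""
    code = (tig.findtext("descrip[@type='reliabilityCode']") or "").strip()
    return int(code) if code.isdigit() else 0


def filter_and_clean(entry, minimum):
    """Remove tigs rated below `minimum`, then remove language groups that
    ended up empty. Returns False if the whole termEntry is now empty."""
    for lang_set in list(entry.findall("langSet")):
        for tig in list(lang_set.findall("tig")):
            if reliability_of(tig) < minimum:
                lang_set.remove(tig)
        if lang_set.find("tig") is None:
            entry.remove(lang_set)
    return entry.find("langSet") is not None


# Hand-made sample entry (not real IATE data).
ENTRY = """<termEntry id="demo-1">
  <langSet xml:lang="de"><tig><term>Haus</term>
    <descrip type="reliabilityCode">3</descrip></tig></langSet>
  <langSet xml:lang="fr"><tig><term>maison</term>
    <descrip type="reliabilityCode">4</descrip></tig></langSet>
  <langSet xml:lang="en"><tig><term>house</term>
    <descrip type="reliabilityCode">1</descrip></tig></langSet>
</termEntry>"""

entry = ET.fromstring(ENTRY)
keep_r34 = filter_and_clean(entry, minimum=3)       # R34 run
terms_r34 = [t.text for t in entry.iter("term")]
keep_r4 = filter_and_clean(entry, minimum=4)        # stricter R4 run
terms_r4 = [t.text for t in entry.iter("term")]
```

A caller would write out only those entries for which the function returns True, which corresponds to dropping empty termEntry elements.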

Now, what does this mean?

Apparently, German, English and French make up roughly 50% of the whole IATE database, judging by file size. This isn’t astonishing, as DE, FR and EN are the OFFICIAL official languages of the EU (that is, all documents must be made available in at least one of these three languages). But it also means that on average, 80% of the chosen DE/FR/EN data subset (again measured by file size, not by entry count) is classed as “reliable or very reliable” and still almost 40% as “very reliable”.
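To make the 80% and 40% figures reproducible, here is the arithmetic as a small Python sketch. The sizes are the uncleaned R34 and R4 file sizes from the table; the variable and function names are just for illustration.

```python
# (original, R34, R4) file sizes in MB for the four DE/FR/EN files
SIZES_MB = {
    "1of4": (273, 166, 113),
    "2of4": (276, 233, 106),
    "3of4": (253, 213, 100),
    "4of4": (271, 245, 107),
}


def mean_remaining(column):
    """Average share of the original size (in percent) that remains
    in the given column: 1 = R34, 2 = R4."""
    shares = [row[column] / row[0] for row in SIZES_MB.values()]
    return 100 * sum(shares) / len(shares)


r34_share = mean_remaining(1)  # about 80
r4_share = mean_remaining(2)   # just under 40
```

Keep in mind that these are shares of bytes, so long entries weigh more than short ones.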

Additionally, this means that by cutting out all unreliable entries and all the unnecessary bits (empty tags, superfluous whitespace, etc.), we can achieve significant file size reductions. This plays an important role during import of the TBX database into other systems, notably SDL Trados Studio’s beloved companion, SDL MultiTerm, which didn’t manage to import the original DE-FR-EN files without lots of “file lock limit” errors. More on that in another post, perhaps, but Paul Filkin already wrote about it in “What A Whopper”. The message is: “Don’t use IATE as-is, adapt it to your needs!” For example, one could further filter IATE by the “field” column to adapt it to one’s own fields of expertise as a translator.
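As a rough idea of what such a field filter could look like, here is a hedged Python sketch. It assumes that the subject field is stored per entry as a descrip element with type="subjectField" holding one or more comma-separated codes; I have not checked this against every IATE entry, and the codes and names below are invented for the example.

```python
import xml.etree.ElementTree as ET


def subject_codes(entry):
    """Collect the subject-field codes of a termEntry.
    Assumes <descrip type="subjectField">code, code, ...</descrip>."""
    codes = set()
    for descrip in entry.iter("descrip"):
        if descrip.get("type") == "subjectField" and descrip.text:
            codes.update(c.strip() for c in descrip.text.split(",") if c.strip())
    return codes


def in_my_fields(entry, wanted):
    """True if the entry shares at least one subject code with `wanted`."""
    return not subject_codes(entry).isdisjoint(wanted)


# Hand-made sample entries with made-up codes (not real IATE data).
IT_ENTRY = ET.fromstring(
    '<termEntry><descripGrp><descrip type="subjectField">3236, 6411'
    "</descrip></descripGrp></termEntry>"
)
LAW_ENTRY = ET.fromstring(
    '<termEntry><descripGrp><descrip type="subjectField">1206'
    "</descrip></descripGrp></termEntry>"
)

MY_FIELDS = {"3236"}  # hypothetical code for one's own specialty
it_match = in_my_fields(IT_ENTRY, MY_FIELDS)
law_match = in_my_fields(LAW_ENTRY, MY_FIELDS)
```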

If you are interested in the XSL transformation sheets used, you can download them as a 3 kB ZIP file. If you don’t know anything about XML/XSL, but would like to have a look at the resulting varieties of DE-FR-EN TBX files, send me a nice-to-read e-mail to info ~at~ defrent ~dot~ de (no “me too!” blog comments, please). The “unedited TBX” ZIP file weighs in at ~55 MB, the filtered Reliability 3+4 ZIP is ~37 MB and the Reliability 4 ZIP is only ~7.5 MB. Since the resulting SDL MultiTerm termbases are 5 times as heavy as the corresponding TBX file, I am reluctant to send those out, but with the free MultiTerm Convert tool from the SDL OpenExchange, conversion should be a matter of minutes. Of course, the IATE usage conditions from their download site apply to the edited files, too:

You are allowed to reproduce the data provided on this page for your personal needs, to distribute it for non-commercial and commercial purposes, and to make and distribute derivative works, provided the source is acknowledged as follows: Download IATE, European Union, 2014.

Edit (1st Oct. 2014): @jeromobot pointed me to Paul Filkin’s recommendation, which I will repeat here in short: If you are looking for more thoroughly cleaned IATE files that are ready for import into your CAT tool, you might want to visit Henk Sanderson’s site SanTrans, where he also mentions additional IATE pitfalls, like terms-that-aren’t and escaped (pseudo-)HTML codes like <i>some term</i> inside entries.

Christopher Köbel

IT / IT marketing / Tech in DE / FR / EN defrent.de | XING profile

2 comments on “File size observations on the IATE TBX Termbase”
  1. Thank you, Christopher, for mentioning my website, and also for giving me ideas of how to improve on the usability of my extractions by filtering on the reliability code.
    Regards,
    Henk
