Generic XML generation with AppleScriptObjectiveC (introduction)

Jean-Christophe Helary <jean.christophe.helary@...>

-- This document is intended for my blog, but I thought that I'd share it here first, just in case there are comments/suggestions/questions, etc. Including the code, it is 3000 words. It's trivial for most of you, but it took me quite some time to put everything together and then to document everything. Everything is still not super clear, so go ahead and tear it apart. :)

This code is an attempt at putting together a practical introduction to using AppleScriptObjectiveC from all the information I gathered when creating a stand-alone application that required XML generation.

The application itself, a script that converts Excel contents to TMX, is trivial as far as AppleScript itself is concerned. The problem that needed a solution was simple, generic XML generation.

There are a number of solutions in AppleScript to create XML but nothing out of the box to generate generic data. System Events cannot create generic XML; it can only modify existing files. The only format it can natively produce is the "property list" format, used to store application preferences, etc. There are solutions that involve concatenating strings and doing a lot of checks on the data to make sure the output is valid (in XML a number of characters are forbidden, so all strings that are concatenated need to be thoroughly checked for those), but they are neither elegant nor robust. There are libraries available, but they do a lot more than what casual users need.
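To make the point about forbidden characters concrete, here is a minimal sketch comparing plain concatenation with a Foundation node. The "note" element name and the sample text are made up for the illustration; only elementWithName:stringValue: (used later in this document) and XMLString() come from Foundation.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Hypothetical sample text containing characters that are special in XML
set rawText to "Tom & Jerry <draft>"

-- Fragile approach: plain concatenation leaves "&" and "<" as they are
set naiveXML to "<note>" & rawText & "</note>"

-- Foundation approach: the node escapes its own content when it is serialized
set noteElement to current application's NSXMLNode's elementWithName:"note" stringValue:rawText
set safeXML to (noteElement's XMLString()) as text

log naiveXML
log safeXML
```

The second log line should show the special characters replaced by entities, which is the check we would otherwise have to write by hand.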

I thought that offering a simple solution for a simple problem that would provide the reader with a step-by-step introduction to reading the Foundation documentation and understanding how to use it with vanilla AppleScript was a better approach, at least for me, since short of having scriptable applications that do what you want, the only way to do really complex things in AppleScript is to use Foundation.

My problem, which is common to all amateur AppleScript users, is a problem of discoverability and of fluency. AppleScript is not a trivial language, a systematic description of its features is non-existent, and when you want to go beyond the "tell Finder to sort the mess on my Desktop" you end up needing to check a lot of references that either consider that you know a lot, or that you don't know much. There is no middle ground and no easy way to find what you need without actually asking. In fact, I still wonder how I could have written this code without the help of people who just seem to "know". There is no clear discoverability path to the information I gathered here.

AppleScriptObjC is not described in the AppleScript Language Guide issued by Apple. The 3rd edition of "Learn AppleScript" by Hamish Sanderson and Hanaan Rosenthal from Apress (2010) discusses some aspects of AppleScriptObjC, but the document that best describes it is "Everyday AppleScriptObjC" by Shane Stanley (2015). The Foundation framework, along with the other frameworks that can be accessed by AppleScriptObjC, is described in the developer documentation (either in Xcode or online) with Objective-C use in mind (syntax, etc.), so we'll use all 3 references to go through the code.

Becoming fluent in a language is a problem both for the linguist and for the programmer. Fluency only comes with regular practice based on sound references, and regular contact with "natives". As far as AppleScript (and AppleScriptObjC) is concerned, "natives" can be found in a number of very interesting places on the internet. The first is the official AppleScript User List (ASUL) hosted by Apple. Apple has been a bad host in recent years, so fear of losing this resource has motivated some of its users to create an independent list, hosted on [...]. There is also the MacScripter web forum. I'm not super fond of web forums. Their user interface is clumsy more often than not and this one is no exception. But it's been around forever and a number of world-class experts are there to comment on code and generally manage the community. Then, there is the Script Debugger user forum. Conversations there generally revolve around higher-level topics, the web forum is a modern one, and the user experience is of extremely high quality. Considering that the next version of Script Debugger (7, in beta at the time of this writing) will offer a Lite version à la BBEdit once the trial period expires, you really have no reason to prefer Apple's Script Editor over Script Debugger.

Ok, so, you're all set, no more chatting, let's check the code.

use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

As seen in the script headers above, we'll be using the "Foundation" framework which allows us to directly work on the XML tools that macOS offers. Without this declaration we can't access what we need.

This code will generate an XML file that will reproduce the following typical omegat.project file, with default values for the project folders, English as source language, Japanese as target language, sentence segmentation and default translations enabled, and no external command to process when creating the target files.

Depending on how an XML file is produced, the order of the tags and the white space used to present them can differ. Strictly speaking, XML parsers do preserve the order of elements, so whether the order matters depends on the application that reads the file. The output may not look exactly like a "genuine" OmegaT file, but as far as the meaning of the data and OmegaT proper are concerned, that is not relevant.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <project version="1.0">

The tokenizers are generally not set by the user because OmegaT automatically selects them based on the corresponding language codes.
Anything else is user settable, even the "excludes" files, even though that is an advanced setting that most users should not have to consider.

So, we're going to create a handler that takes as input the following values:

• source_dir_value: path to directory
• source_dir_excludes_masks: list of mask tags
• target_dir_value: path to directory
• tm_dir_value: path to directory
• glossary_dir_value: path to directory
• glossary_file_value:  path to text file
• dictionary_dir_value: path to directory
• source_lang_value: string language code (we do need to check the validity of the code based on the appropriate ISO standard, but we consider here that the check has been made upstream, if only to keep the XML generation code free of such external checks)
• target_lang_value: string language code (see above)
• source_tok_value: string automatically proposed by the script, based on language code (here again, the tokenizer corresponding to the source language should be a valid tokenizer, but we let the upstream code select and validate it so as to stick to the core XML generation code)
• target_tok_value: string automatically proposed by the script, based on language code (see above)
• sentence_seg_value: boolean
• support_default_translations_value: boolean
• remove_tags_value: boolean
• external_command_value: string to be sent to exec when the target files are created, can be empty if no action is requested.

We'll make this code a handler so that we can call it without having to copy it every time we need it. This is better because it allows us to logically separate portions of the code.

on CreateOmegaTProjectFile(ParametersList)

The first few lines below are necessary to initialize the XML tags. First comes "project_tags", a list of the tags that are used in the file. There are 15 tags, 14 of which only take a simple value: a path, a text string, etc. As we can see above, "source_dir_excludes" takes a list of "mask" tags, which we'll create from "masks", a list of the patterns that describe the type of files to be ignored by OmegaT. Last, we initialize "valueindex", an index that will be used to associate each tag with its value in the list that we feed the handler.
  Then comes the beginning of the XML structure creation.

set project_tags to {"source_dir", "source_dir_excludes", "target_dir", "tm_dir", "glossary_dir", "glossary_file", "dictionary_dir", "source_lang", "target_lang", "source_tok", "target_tok", "sentence_seg", "support_default_translations", "remove_tags", "external_command"}
set masks to {"**/.svn/**", "**/CVS/**", "**/.cvs/**", "**/desktop.ini", "**/Thumbs.db", "**/.DS_Store"}
set valueindex to 0

We're first going to create an XML element that we'll then use as the XML document root. We'll create the other elements once that is done.

set projectRoot to current application's NSXMLNode's elementWithName:"omegat"

This is our first line of AppleScriptObjC. How do we call the Foundation items that we need to work with?

From "Everyday AppleScriptObjC":
"Class names are effectively properties of the current application (which is, in turn, the parent of AppleScript in your script)."

From "Learn AppleScript":
"AppleScriptObjC presents Cocoa classes as class elements of the current application object."

So, everything we'll call from Cocoa will be called from "current application", hence the use of "current application's NSXMLNode".

In the Xcode documentation, NSXMLNode is described as, well, an XML node.


As the documentation says, an XML node can be anything from: an element, an attribute, text, a processing instruction, a namespace, or a comment.


Here we use the "elementWithName" method to create the element. Clicking on its description in the Xcode documentation browser shows that the method is a "type method" and returns "an NSXMLElement object with a given tag identifier, or name".


"Type methods" apply to classes, they are generally used to create objects which are specific instances of a given class. When we'll work on the objects themselves we'll need "instance methods". In the documentation, "type methods" have a "+" prefixed to their name and "instance methods" have a "-" prefixed instead.


The syntax for such method calls is methodName:parameter, so here we have: elementWithName:"omegat". This syntax will be used for all the other elements that we create.


It is important to note that we are using a shortcut here. As Shane mentions on ASUL:
"(elementWithName is) what's known as a convenience method. You're actually making a particular subclass of NSXMLNode, an NSXMLElement.
You could have used:
set projectRoot to current application's NSXMLElement's alloc()'s initWithName:"omegat"
But convenience methods are common, if not following any particular logic of when and where they're provided. Something like "stringWithString:" instead of "alloc()'s initWithString:" is a common example."


To make sense of that we need to know how Objective-C/Cocoa works:

From "Everyday AppleScriptObjC":
"The equivalent (of AppleScript's make) in Cocoa is actually a two-stage process: first the object is created by allocating memory for it, and then it is initialized.  The first stage is done using the +alloc method. You will not see it listed (in the documentation) because it is a method all classes inherit from the NSObject class."

If we check the NSObject description, we see "+ alloc" which is described as "Returns a new instance of the receiving class" and later "You must use an init... method to complete the initialization process." Then we see "- init", which is described as "Implemented by subclasses to initialize a new object (the receiver) immediately after memory for it has been allocated." In the method description we also see that it only exists for Objective-C and not for Swift which only has "init()" to cover allocation and initialization in one fell swoop.

So, instead of using that longer process, we prefer to use that "convenience method" and make the code slightly shorter.
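As a quick way to see that the two forms are equivalent, the following sketch creates the same element both ways and compares the results. The variable names longForm and shortForm are only for this illustration; the two creation lines are the ones quoted above.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Long form: allocate memory first, then initialize the new object
set longForm to current application's NSXMLElement's alloc()'s initWithName:"omegat"

-- Short form: the convenience method does both steps for us
set shortForm to current application's NSXMLNode's elementWithName:"omegat"

-- Both should serialize to the same empty element
log (longForm's XMLString()) as text
log (shortForm's XMLString()) as text

-- The convenience method is called on NSXMLNode but returns an NSXMLElement
set isElement to (shortForm's isKindOfClass:(current application's NSXMLElement)) as boolean
log isElement
```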

Foundation reference:

set theProject to current application's NSXMLNode's documentWithRootElement:projectRoot

Here again, we use a convenience method that does not require us to go through the alloc-init process. Now we have an XML document and its root element. We'll need a few more things to make our output look like what OmegaT needs.

Foundation reference:

theProject's setCharacterEncoding:"UTF-8"

"setCharacterEncoding" is not documented in the Foundation documentation. The closest thing related to an NSXMLDocument that we have is "characterEncoding" (notice the case for "Character": lower case in the documentation, upper case in the code), which is described as an "instance property", which seems to correspond since theProject is indeed an instance of NSXMLDocument.

The "set" prefix is explained in "Everyday AppleScriptObjC":
"To set a new value for a Cocoa property, assuming it is not read-only, you use the word set, followed by the property name with the  first letter in uppercase, followed by a colon or underscore and parentheses, depending on the syntax.  The single argument is then the proposed new value."

Now we understand why "set" is prefixed and why characterEncoding is changed to CharacterEncoding when prefixed with set. If we need to get the characterEncoding property of the document, let's not forget about the letter case...
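Here is a small sketch of the setter and getter side by side, so the change of letter case is visible in one place. demoRoot and demoDoc are names made up for this example; the getter is simply the property name followed by parentheses.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set demoRoot to current application's NSXMLNode's elementWithName:"omegat"
set demoDoc to current application's NSXMLNode's documentWithRootElement:demoRoot

-- Setter: "set" + property name with its first letter in upper case
demoDoc's setCharacterEncoding:"UTF-8"

-- Getter: the property name exactly as documented, first letter in lower case
set currentEncoding to (demoDoc's characterEncoding()) as text
log currentEncoding
```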

Setting characterEncoding happens to be enough to create the <?xml version="1.0" encoding="UTF-8"?> line in our example *and* to actually encode the data in UTF-8.

Foundation reference:

theProject's setStandalone:true

setStandalone works like setCharacterEncoding and adds the standalone="yes" part to our xml declaration.

I had the following line in a previous version of this code:

theProject's setDocumentContentKind:(current application's NSXMLDocumentXMLKind)

but we can dispense with it since XML is the default kind of XML document (other kinds include HTML, XHTML and text). Still there is something in this line that we ought to remember for other occasions. "documentContentKind" is an instance property, like "standalone". To set it we must thus use "setDocumentContentKind". The possible values for a documentContentKind are documented as "enumerations", of which NSXMLDocumentXMLKind is the default value in the case of an XML document. To use NSXMLDocumentXMLKind as a value, we must do as we've done for other Cocoa items: call them from the current application, hence the (current application's NSXMLDocumentXMLKind).

Foundation reference:

set project to current application's NSXMLNode's elementWithName:"project"
set projectVersion to current application's NSXMLNode's attributeWithName:"version" stringValue:"1.0"
project's addAttribute:projectVersion
projectRoot's addChild:project

We're still in NSXMLNode territory here. Now we're creating the "project" element with a "version" attribute that has a "1.0" value.

To do this, we first create the element and we then separately create an attribute node by using the "attributeWithName:stringValue:" method (see the Xcode description) that actually comes in two parts: the attributeWithName: part and the stringValue: part.

Once created, the 2 nodes have no relation to each other or to what we've created so far. We need to "link" everything together now and we do that with the 2 lines that follow.
"addAttribute" is documented as an instance method that "adds an attribute node to the receiver", which is exactly what we're trying to do. That works because "project" is an instance of NSXMLElement (the subclass of NSXMLNode that elementWithName: returns). The parameter is "projectVersion", the attribute node that we created one line earlier.

Now we have an element that would look like <project version="1.0"></project> and we need to add it as a child of the document root element. That's what "addChild" does: it is an instance method that "adds a child node after the last of the receiver’s existing children". The receiver is projectRoot and the child is project.
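At this stage it can help to look at what has been built so far. The sketch below repeats the lines we have seen up to this point and logs the document as text. It relies on XMLStringWithOptions:, an NSXMLNode method that is not used elsewhere in this script, and xmlSoFar is a name made up for the example.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Same steps as above: root element, document, declaration settings
set projectRoot to current application's NSXMLNode's elementWithName:"omegat"
set theProject to current application's NSXMLNode's documentWithRootElement:projectRoot
theProject's setCharacterEncoding:"UTF-8"
theProject's setStandalone:true

-- The "project" element, its "version" attribute, and the links between them
set project to current application's NSXMLNode's elementWithName:"project"
set projectVersion to current application's NSXMLNode's attributeWithName:"version" stringValue:"1.0"
project's addAttribute:projectVersion
projectRoot's addChild:project

-- Serialize the whole document to text so we can inspect it
set xmlSoFar to (theProject's XMLStringWithOptions:(current application's NSXMLNodePrettyPrint)) as text
log xmlSoFar
```

Logging the intermediate result like this is a cheap way to check each step before adding the next one.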

Foundation reference:

repeat with child in project_tags
    set valueindex to valueindex + 1
    if contents of child is not "source_dir_excludes" then
        -- simple tag: its text comes from the same position in ParametersList
        set child to (current application's NSXMLNode's elementWithName:child stringValue:(item valueindex of ParametersList))
        (project's addChild:child)
    else
        -- special case: an element that holds a list of <mask> children
        set child to (current application's NSXMLNode's elementWithName:"source_dir_excludes")
        (project's addChild:child)
        set source_dir_excludes to (project's elementsForName:"source_dir_excludes")'s firstObject()
        repeat with mask in masks
            set mask to (current application's NSXMLNode's elementWithName:"mask" stringValue:mask)
            (source_dir_excludes's addChild:mask)
        end repeat
    end if
end repeat

Now that we've been through the basics of using Foundation to create a basic XML structure, the rest of the code is very straightforward. We're running a loop on the tag list that was created at the beginning of the script and for each "child" we'll set a "stringValue" that corresponds to the item at the same position in the ParametersList that has been sent to this handler.

Once the element and its string value are created, we add it as a child to the "project" element that we created just above.

The only exception to that process is when we bump into "source_dir_excludes". Here, what happens is that we do not use stringValue: to set its value since it is going to be made of a list of <mask> tags. So instead of that, we first create the tags and add them one by one as children to "source_dir_excludes". When we're done with that special content, we resume the loop and deal with the other tags.

How we identify "source_dir_excludes" within the list of existing tags is interesting. We use elementsForName: an instance method for project that returns an array of all the elements with the name given as the parameter (here "source_dir_excludes"), and since we only have one here, we specify that we want to work with that array's "firstObject". That way we create a reference to the NSXMLElement "source_dir_excludes" that we can later use to add children to it. It took me a while to figure that out. Thank you Shane :)
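For comparison, here is a sketch of a variation that keeps the reference returned when the element is created, instead of looking it up afterwards. This is not what the handler above does. excludesElement, maskElement and foundElement are names made up for the example, and only two masks are used to keep it short. The lookup with elementsForName: is kept at the end only to check that both approaches reach the same element.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

set project to current application's NSXMLNode's elementWithName:"project"

-- Keep the reference returned at creation time
set excludesElement to current application's NSXMLNode's elementWithName:"source_dir_excludes"
project's addChild:excludesElement

-- Add the <mask> children directly through that reference
repeat with aMask in {"**/.svn/**", "**/.DS_Store"}
    set maskElement to (current application's NSXMLNode's elementWithName:"mask" stringValue:(contents of aMask))
    (excludesElement's addChild:maskElement)
end repeat

-- The lookup used in the handler finds the element with its children in place
set foundElement to (project's elementsForName:"source_dir_excludes")'s firstObject()
set maskCount to (foundElement's childCount()) as integer
log maskCount
```

The lookup remains useful when the element was created somewhere else and no reference to it is at hand.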

The creation of the list of <mask> tags is straightforward and does not introduce any new concept.

Foundation reference:

set theData to theProject's XMLDataWithOptions:((get current application's NSXMLNodePrettyPrint) + (get current application's NSXMLDocumentTidyXML))
theData's writeToFile:(item 16 of ParametersList) options:(current application's NSDataWritingAtomic) |error|:(missing value)

We're reaching the end of this XML generation code. Now we need to output the data in readable form. "XMLDataWithOptions" is a method that returns an NSData object, and the options are listed under "NSXMLNodeOptions". The description says, "One or more options (bit-OR'd if multiple) to affect the output of the document..." where "bit-OR'd if multiple" means, in practice, "added": adding the values gives the same result as a bitwise OR as long as each option is used only once. Another thing to notice is that the second and following options (if any) seem to need a "get". Adding the "get" to the first option does not seem to be necessary.
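The following sketch looks at the two options as plain numbers, to show why adding them works. It assumes that each option is a single-bit flag (a power of two), which is how such option sets are normally defined; prettyPrint, tidyXML and combined are names made up for the example.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Fetch the two options as numbers ("get" is used on both to be safe)
set prettyPrint to (get current application's NSXMLNodePrettyPrint)
set tidyXML to (get current application's NSXMLDocumentTidyXML)

-- Adding distinct single-bit flags gives the same result as a bitwise OR
set combined to prettyPrint + tidyXML
log {prettyPrint, tidyXML, combined}

-- Checking whether a given flag is part of the combined value
set hasTidy to ((combined div tidyXML) mod 2) is 1
log hasTidy

-- Caution: adding the same flag twice is NOT the same as OR-ing it twice,
-- so each option should appear only once in the sum
```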

"writeToFile:options:error:" is an instance method of NSData, which theData is. It requires a Unix absolute path, and options given in the same way as on the line before. |error| is put between vertical bars to avoid a naming conflict, since the AppleScript "error" is not the same as the Foundation "error" and we're calling the latter here. The description of "error" says we can use a "NULL" value, for which the AppleScript equivalent is missing value (Everyday AppleScriptObjC, p. 52: "The terms nil and its close relative NULL are used commonly in Cocoa.  They are essentially the equivalent of AppleScript’s missing value, and the scripting bridge converts between them.").

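Passing missing value means that a failed write goes unnoticed. As a possible refinement, here is a sketch that asks for the error instead, using the "reference" placeholder that AppleScriptObjC accepts for this kind of parameter; the call then returns a list with the result and the error. This is not what the handler above does, and demoData, badPath, didWrite and writeError are names made up for the example. The path is deliberately invalid so that the write fails.

```applescript
use AppleScript version "2.4" -- Yosemite (10.10) or later
use framework "Foundation"
use scripting additions

-- Some data to write (a stand-in for the NSData produced by XMLDataWithOptions:)
set demoString to current application's NSString's stringWithString:"<omegat></omegat>"
set demoData to demoString's dataUsingEncoding:(current application's NSUTF8StringEncoding)

-- Hypothetical path inside a folder that does not exist, so the write fails
set badPath to "/nonexistent-folder-for-demo/omegat.project"

-- With "reference", the call returns {result, error} instead of just the result
set {didWrite, writeError} to (demoData's writeToFile:badPath options:(current application's NSDataWritingAtomic) |error|:(reference))

if (didWrite as boolean) then
    log "File written"
else
    -- writeError is an NSError; localizedDescription() gives a readable message
    log (writeError's localizedDescription()) as text
end if
```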
Foundation reference:

end CreateOmegaTProjectFile

Let's now test the code with the following values:

set ProjectSettings to {"PATH_1", "MASKS", "PATH_2", "PATH_3", "PATH_4", "FILE_5", "PATH_6", "LANG_7", "LANG_8", "TOKENIZER_9", "TOKENIZER_10", "SEGMENTATION_11", "DEFAULT_TRANSLATION_12", "REMOVE_TAGS_13", "EXTERNAL_COMMAND_14", "/Users/suzume/Desktop/omegat.project"}

my CreateOmegaTProjectFile(ProjectSettings)

Jean-Christophe Helary
----------------------------------------------- @brandelune

Jean-Christophe Helary <jean.christophe.helary@...>

Thank you for the comments. I eventually posted the thing here after a number of modifications:
