Working with XML...


Jean-Christophe Helary <jean.christophe.helary@...>
 

It looks like System Events has problems with big XML files...

When I try 

tell application "System Events"
return XML elements of contents of XML file XMLfile
end tell

on a small XML file, I get all the info I need. When I do that on a big file (125 MB here), which is otherwise a valid XML document, I get this:

"System Events got an error: Some data was the wrong type." which highlights "XML elements" in SD and suggests that it is expecting an item instead...

Somebody is going to tell me that the XML suite in System Events doesn't work and I'd better use [asobjc|satimage's xmllib|something else] but before I move forward with other tools I'd like to confirm the above issue...

Thank you in advance.

Jean-Christophe Helary
-----------------------------------------------
@brandelune http://mac4translators.blogspot.com



Shane Stanley
 

On 18 Dec 2017, at 4:05 pm, Jean-Christophe Helary <jean.christophe.helary@...> wrote:

Somebody is going to tell me that the XML suite in System Events doesn't work and I'd better use [asobjc|satimage's xmllib|something else] but before I move forward with other tools I'd like to confirm the above issue...

I've never had it not work, so much as become unbearably slow with anything other than small files.

So yes, you'll need to use something else. Last I played with XMLLib.osax -- and it was only a play -- its CFRef type didn't play well in scripts using AppleScriptObjC.

And yes, it's easy enough, relatively, to do in AppleScriptObjC.

--
Shane Stanley <sstanley@...>
<www.macosxautomation.com/applescript/apps/>, <latenightsw.com>


Jean-Christophe Helary <jean.christophe.helary@...>
 

Thank you Shane,

I guess I'll have to try AppleScriptObjC then... Which means diving into your book...
Ok, I'm off the lists, I'll come back when I have questions!

Jean-Christophe 




Shane Stanley
 

On 18 Dec 2017, at 5:13 pm, Jean-Christophe Helary <jean.christophe.helary@...> wrote:

I guess I'll have to try AppleScriptObjC then... Which means diving into your book...

It doesn't get much more than a passing reference, with one example using XPath queries. It's the sort of thing where the approach depends largely on the structure of the XML you're dealing with. That said, if XPath suits the job, it saves a lot of work and time.
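
For illustration only, here is a minimal sketch of what an XPath query can look like in AppleScriptObjC. It is not the example from the book. The sample XML string, the element names (tu, tuv, seg) and the variable names are all assumptions made up for the sketch; a real script would load the document from a file and use an XPath expression that matches its actual structure.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample data: a tiny TMX-like document held in a string
set xmlText to "<tmx><body>"
set xmlText to xmlText & "<tu tuid=\"1\"><tuv><seg>Hello</seg></tuv></tu>"
set xmlText to xmlText & "<tu tuid=\"2\"><tuv><seg>World</seg></tuv></tu>"
set xmlText to xmlText & "</body></tmx>"

-- Parse the string into an NSXMLDocument (options 0 means no special options)
set {theDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:xmlText options:0 |error|:(reference)
if theDoc is missing value then error (theError's localizedDescription() as text)

-- Run an XPath query that matches every <seg> inside a <tu>/<tuv>
set {theNodes, theError} to theDoc's nodesForXPath:"//tu/tuv/seg" |error|:(reference)
if theNodes is missing value then error (theError's localizedDescription() as text)

-- Collect the text of each matching node as an AppleScript list
set segList to (theNodes's valueForKey:"stringValue") as list
```

With the sample string above, segList should hold the text of the two seg elements. The query is a single call, which is where the saving comes from compared with walking the element tree by hand.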

--
Shane Stanley <sstanley@...>
<www.macosxautomation.com/applescript/apps/>, <latenightsw.com>


Jean-Christophe Helary <jean.christophe.helary@...>
 



On Dec 18, 2017, at 19:59, Shane Stanley <sstanley@...> wrote:

It doesn't get much more than a passing reference, with one example using XPath queries. It's the sort of thing where the approach depends largely on the structure of the XML you're dealing with. That said, if XPath suits the job, it saves a lot of work and time.

Well, I'm not sure what will suit the job. Basically I want to create a few scripts that do QA checks on XML files of a specific type (TMX). The checks involve node uniqueness, data consistency between nodes, searches, etc. I have very little idea where to start, so I'll just try a few things, see how that fares, and repeat until I find better ideas...
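
As a rough idea of what a uniqueness check of this kind might look like in AppleScriptObjC, here is a minimal sketch. Everything in it is an assumption for illustration: the sample string, the choice of seg text as the thing that should be unique, and the variable names. It counts values with NSCountedSet and reports any that occur more than once.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Hypothetical sample data: the first and third <seg> carry the same text
set xmlText to "<tmx><body>"
set xmlText to xmlText & "<tu><tuv><seg>Hello</seg></tuv></tu>"
set xmlText to xmlText & "<tu><tuv><seg>World</seg></tuv></tu>"
set xmlText to xmlText & "<tu><tuv><seg>Hello</seg></tuv></tu>"
set xmlText to xmlText & "</body></tmx>"

set {theDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:xmlText options:0 |error|:(reference)
if theDoc is missing value then error (theError's localizedDescription() as text)

-- Gather the text of every <seg> element
set {segNodes, theError} to theDoc's nodesForXPath:"//seg" |error|:(reference)
if segNodes is missing value then error (theError's localizedDescription() as text)
set segValues to segNodes's valueForKey:"stringValue"

-- A counted set keeps one entry per distinct value plus its number of occurrences
set countedSet to current application's NSCountedSet's alloc()'s initWithArray:segValues
set distinctValues to countedSet's allObjects()

-- Report every value that occurs more than once
set duplicateSegs to {}
repeat with i from 0 to ((distinctValues's |count|()) - 1)
	set aValue to (distinctValues's objectAtIndex:i)
	if ((countedSet's countForObject:aValue) as integer) > 1 then set end of duplicateSegs to (aValue as text)
end repeat
duplicateSegs
```

The same pattern should carry over to other keys, such as an attribute value selected with a different XPath expression, depending on what "unique" means for the file at hand.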



Jean-Christophe Helary
-----------------------------------------------
@brandelune http://mac4translators.blogspot.com



Sandor Szatmari
 

The version of libxslt that shipped with El Capitan, and maybe later, has large file support disabled. You can rebuild from source and add XML_PARSE_HUGE to the default XSLT_PARSE_OPTIONS to parse files larger than 10 MB.

Sandor




Jean-Christophe Helary <jean.christophe.helary@...>
 

Thank you Sandor,

I was just working on a 125 MB file with 500,000 end nodes (not sure that's the right term) and all the tools I tried were pretty slow *except* for the dedicated tool I use for such tasks (which I suspect is not a strict XML parser, since the file was not valid XML in the first place)...

Anyway, I'll keep investigating, but I'd rather work with out-of-the-box solutions...

Jean-Christophe



Sandor Szatmari
 

Yeah, it was a bummer that Apple shipped it with large file support disabled. We regularly have XML files well above 100 MB. This was a big surprise when we moved to El Capitan. We just downloaded the source from Apple’s open source repo, added that one option, and rebuilt. Stashing it in /usr/local/... gives us the option to use it when needed.

Good luck!

Regards,
Sandor






Shane Stanley
 

On 20 Dec 2017, at 9:49 am, Sandor Szatmari <admin.szatmari.net@...> wrote:

The version of libxslt that shipped with El Capitan, and maybe later, has large file support disabled.

I'm not sure how that relates to using AppleScriptObjC, which mainly involves using NSXMLDocument/NSXMLElement/NSXMLNode. I just tried loading a ~110MB file, and it loaded fine. Actual loading took under six seconds, although displaying it is another matter.

--
Shane Stanley <sstanley@...>
<www.macosxautomation.com/applescript/apps/>, <latenightsw.com>


Sandor Szatmari
 

Well...

On Dec 20, 2017, at 07:14, Shane Stanley <sstanley@...> wrote:

I'm not sure how that relates to using AppleScriptObjC, which mainly involves using NSXMLDocument/NSXMLElement/NSXMLNode. I just tried loading a ~110MB file, and it loaded fine. Actual loading took under six seconds, although displaying it is another matter.

The OP said the following AppleScript would not work on large files.

tell application "System Events"
return XML elements of contents of XML file XMLfile
end tell

I don’t know how AppleScript would handle this under the hood, but I had an experience where a tool Apple shipped was hobbled.  Perhaps under the hood libxml is used.  Perhaps not...  Perhaps there was some other similar limitation going on.

I was just sharing my experiences in hopes to help...

Regards,
Sandor


Shane Stanley
 

On 21 Dec 2017, at 2:19 pm, Sandor Szatmari <admin.szatmari.net@...> wrote:

I was just sharing my experiences in hopes to help...

Great -- I just didn't see that you were talking about System Events rather than AppleScriptObjC.

FWIW, I suspect the issue is that because System Events is not saving any state, the XML is loaded repeatedly. For small files this doesn't matter, but obviously it does for big ones.

So for a ~110MB file here, this takes about 9 seconds:

tell application "System Events"
set x to contents of XML file "/Users/shane/Desktop/SwissProt.xml"
end tell

And this takes about 18 seconds:

tell application "System Events"
set x to contents of XML file "/Users/shane/Desktop/SwissProt.xml"
set y to XML element 1 of x
end tell

While this takes about 32 seconds:

tell application "System Events"
set x to contents of XML file "/Users/shane/Desktop/SwissProt.xml"
set y to XML element 1 of x
set z to XML elements of y
end tell

So I don't think it's a size limit per se.
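
For comparison, here is a minimal AppleScriptObjC sketch of the parse-once pattern: the document is parsed a single time, the script keeps hold of the resulting object, and later steps work on it in memory. The in-memory test document, the element names and the variable names are assumptions made up for the sketch, and the timings will vary by machine, so no figures are claimed for it.

```applescript
use AppleScript version "2.4"
use framework "Foundation"
use scripting additions

-- Build a hypothetical test document in memory (2,000 <entry> elements)
set xmlParts to {"<root>"}
repeat with i from 1 to 2000
	set end of xmlParts to "<entry id=\"" & i & "\">item</entry>"
end repeat
set end of xmlParts to "</root>"
set savedDelimiters to AppleScript's text item delimiters
set AppleScript's text item delimiters to ""
set xmlText to xmlParts as text
set AppleScript's text item delimiters to savedDelimiters

-- Step 1: parse once, timing only the parse
set startTime to current application's NSDate's |date|()
set {theDoc, theError} to current application's NSXMLDocument's alloc()'s initWithXMLString:xmlText options:0 |error|:(reference)
if theDoc is missing value then error (theError's localizedDescription() as text)
set parseSeconds to -(startTime's timeIntervalSinceNow())

-- Step 2: later steps reuse the parsed document held in theDoc
set startTime to current application's NSDate's |date|()
set rootElement to theDoc's rootElement()
set childCount to (rootElement's childCount()) as integer
set querySeconds to -(startTime's timeIntervalSinceNow())

{childCount, parseSeconds, querySeconds}
```

Because theDoc stays in the script's memory, the second step does not parse the XML again, which is the difference from the System Events steps timed above.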


Jean-Christophe Helary <jean.christophe.helary@...>
 

Shane,

Would you mind clarifying how you time a job? I'd like to test my file with your code.


Jean-Christophe Helary
-----------------------------------------------
@brandelune http://mac4translators.blogspot.com


Shane Stanley
 

On 21 Dec 2017, at 5:58 pm, Jean-Christophe Helary <jean.christophe.helary@...> wrote:

Would you mind clarifying how you time a job? I'd like to test my file with your code.

I used the timer in Script Debugger for those figures. For stuff that's more like a small fraction of a second, I'll use:

use framework "Foundation"
use scripting additions

set startTime to current application's NSDate's |date|()
-- do stuff
return -(startTime's timeIntervalSinceNow())




Jean-Christophe Helary <jean.christophe.helary@...>
 

On 2017/12/21, at 16:26, Shane Stanley <sstanley@...> wrote:

I used the timer in Script Debugger for those figures. For stuff that's more like a small fraction of a second, I'll use:

Thank you! That was pretty obvious. Sorry :)


Jean-Christophe Helary
-----------------------------------------------
@brandelune http://mac4translators.blogspot.com