









This paper presents an adaptive XML parser based on table-driven XML (TDX) parsing technology. The technique can be used to develop extensible high-performance Web services for large, complex systems that typically require extensible schemas. The parser integrates scanning, parsing, and validation into a single pass without backtracking by using compact tabular representations of schemas and a push-down automaton (PDA) at runtime. The tabular forms are constructed from a set of schemas or WSDL descriptions through the use of permutation grammar. The engine is implemented as a PDA-based, table-driven driver and is therefore independent of XML schemas. When XML schemas are updated or extended, the tabular forms can be regenerated and loaded into the generic engine without redeploying the parser. This adaptive approach balances the need for performance against the cost of reconstructing and redeploying Web services. Our experiments show that the adaptive parser is usually about 5 times faster than traditional validating parsers, and that its performance is typically within 20% of the fastest fully compiled traditional validating parsers.
The Extensible Markup Language (XML) format delivers key advantages in interoperability and is widely adopted as a standard for exchanging structured information by Web services. Web services technologies and applications have built on the success of XML by providing standardized delivery of structurally and semantically rich content over the Web, as defined by the Simple Object Access Protocol (SOAP) and Web Services Description Language (WSDL) W3C standards. However, the interoperability of XML Web services often comes at the price of reduced efficiency of message composition, transfer, and parsing compared to simple binary protocols. Several studies have evaluated the performance of SOAP and concluded that SOAP and XML incur a substantial performance penalty compared to binary protocols [4, 9, 10]. Parsing and validating XML against a schema is expensive [12, 19], as is deserialization into usable in-memory objects for applications [6, 10].
Several efforts have addressed parsing and validation performance through grammar-based parser generation, leveraging XML schema languages such as DTD [23], XML schema [14], and Relax NG [8] at compile time. Compiled schema-specific parsers [7, 11, 15, 16, 20–22, 24, 25] have shown significant performance improvements. Schema-specific parsers encode parsing states and validation rules at compile time by exploiting schema structures and validation rules to increase processing efficiency at runtime.
However, each generated schema-specific parser must be appropriate to the operating system, compiler, supporting libraries, and hardware on which the applications will run. The parser must be regenerated and redeployed whenever an XML schema is updated. This is a significant challenge for developing extensible Web services. A Web service is usually a long-term agreement that allows consumers to interact with the service. To address schema updates, service designers typically add new elements to their schema by changing the source code, adding the required business logic, and rebuilding the service. However, this approach only works for simple services. Consider, for example, a large business application that requires customizations to fit specific industries, countries, and customers. Exposing such business applications as Web services is difficult because they must be customizable over time, and these customizations must work for all consumers, even consumers that have made changes to the application. This issue requires that the services be designed to be extensible, i.e., extensible Web services that typically require extensible XML schemas.
Our previous work [26] presents a table-driven XML (TDX) parsing and validation technique for high-performance Web services. The TDX technique uses a compact tabular representation of schemas and a push-down automaton (PDA) for single-pass parsing and validation without backtracking. To avoid backtracking on XML elements and attributes defined by XML schema constructs such as unordered sequences of elements (xs:all) and attributes (xs:attribute), in our later work [25] we extend Backus-Naur Form (BNF) with support for a permutation phrase grammar representation of a schema (we therefore call a TDX parser with permutation phrase support a pTDX parser). The permutation phrase grammar is a compact representation of common XML element and attribute permutations that have specific occurrence constraints. It requires a specialized recognizer, which is implemented by a two-stack push-down automaton. In the pTDX parser, the parsing engine is implemented as a generic table-driven driver that is independent of the XML schema at runtime. However, the DFA-based scanner is built from a schema-directed Flex [18] description^1. The Flex description of the scanner is fed to Flex to generate DFA-based scanner source code in C. As a result, the pTDX parser must be rebuilt whenever a schema is updated. We refer to this parser as the Flex-based pTDX parser, or pTDX-flex for short.
This paper presents an adaptive XML parsing and validation technique that can be used to develop high-performance extensible Web services. Unlike the pTDX-flex parser, which separates the scanner from the parsing engine, this approach implements a table-driven engine that integrates scanning, parsing, and validation. It is based on the observation that the parsing table not only drives parsing but can also direct scanning. We call this approach a table-directed pTDX parser, or pTDX-table for short.
The remainder of this paper is organized as follows. We first give a brief description of an adaptive pTDX parser in Section 2. In Section 3, we introduce mapping rules from XML schema components to an augmented LL(1) grammar. The construction of modular tables is described in Section 4. Section 5 presents the table-driven, two-stack PDA-based engine that integrates scanning, parsing, and validation. A performance evaluation is given in Section 6, and related work is discussed in Section 7. Conclusions are drawn in Section 8.
(^1) Flex is a tool frequently used by compiler developers to generate high-performance scanners.
Figure 1. Architecture of a pTDX-based extensible Web service with swappable modules.
The architecture of a pTDX-based Web service with swappable modules is shown in Figure 1. The front-end consists of a generic parsing engine and several modules containing an LL(1) parsing table, a token table, an LL(1) production rule table, a tag name table, and an action table. The generic engine, implemented by a push-down automaton (PDA), scans the XML messages for tags and character data (CDATA), converts each recognized tag into a token, and checks the well-formedness and validity of the XML content by consulting the parsing table combined with the production rules and their semantic actions. Well-formedness, most structural constraints, and some XML content types imposed by the XML schema are incorporated into the parsing table and production rules, so the parsing engine verifies them automatically. Some schema built-in types and types derived through <xs:restriction> cannot easily be incorporated into grammar productions. Such types are checked by semantic actions associated with grammar productions. The engine invokes a semantic action function to validate whether the content of an XML element conforms to its constraints. The engine also invokes functions at the back-end through the application action table, which contains function pointers (callbacks) to the application logic for performing application tasks. Both type-checking and application actions are encoded as indices to entries in the action tables. Because indices are used, the semantic actions associated with grammar productions do not need to be pre-compiled. This ensures that the modular tables are swappable. In addition, the scanner and parsing engine are implemented as a table-driven generic scanner and parser, and are thus independent of XML schemas or WSDL descriptions. This approach therefore offers a flexible and adaptive mechanism for dealing with XML schema updates. When an XML schema is updated, the modular tables can be regenerated and reloaded without recompiling the scanner or the engine.
The back-end, which consists of an application-specific action table and a shared parameter table, serves the application service logic. Application logic tasks are carried out through well-defined APIs that are indexed and stored in the application-specific action table. These application functions are also triggered by semantic actions associated with production rules.
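The role of action indices can be illustrated with a minimal sketch. The element names, the two checking functions, and the dictionary layout of the production table below are hypothetical and chosen only for illustration; the paper does not specify the concrete table format.

```python
from typing import Callable, Dict, List


# Hypothetical type-checking actions; the names are illustrative only.
def check_boolean(text: str) -> bool:
    # Lexical forms allowed for xs:boolean.
    return text in ("true", "false", "1", "0")


def check_non_negative_int(text: str) -> bool:
    # Simplified check: one or more digits, no sign.
    return text.isdigit()


# Action table: productions refer to actions by index, never by name.
ACTION_TABLE: List[Callable[[str], bool]] = [check_boolean, check_non_negative_int]

# A swappable module (assumed layout): each production carries only an
# integer action index, so the table is plain data that can be regenerated
# when the schema changes without recompiling the engine.
PRODUCTIONS: Dict[str, dict] = {
    "flag": {"rhs": ["bFLAG", "CDATA", "eFLAG"], "action": 0},
    "count": {"rhs": ["bCOUNT", "CDATA", "eCOUNT"], "action": 1},
}


def validate(element: str, content: str) -> bool:
    """Look up the action index of the element's production and invoke it."""
    index = PRODUCTIONS[element]["action"]
    return ACTION_TABLE[index](content)
```

In this sketch, replacing `PRODUCTIONS` with a regenerated table changes which checks run for which elements while `validate` and the engine code stay untouched.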
As with type-checking functions, the use of index tokens to refer to application functions ensures that the scanner and the parsing engine are independent of the application logic in the back-end. The parameter data table temporarily holds primitive data passed from the engine that can be directly used by application functions without
Rule #   Translation
1        Γ[[ ]]N = {N → T N′, N′ → T N′, N′ → ε} ∪ Γ[[X]]N
Table 1. Example rules mapping schema components to augmented LL(1) grammar productions.
cases (see [26] for examples). Such violations can be eliminated by applying left-factoring [1]. We left-factor the generated grammar to ensure that the LL(1) property is preserved.
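As an illustration of the left-factoring step, the following is a minimal sketch of the textbook transformation [1], not the generator used in this work. The grammar encoding (a dictionary from nonterminals to lists of alternatives) and the primed names for fresh nonterminals are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List

Grammar = Dict[str, List[List[str]]]  # nonterminal -> alternatives; [] is epsilon


def common_prefix(alternatives: List[List[str]]) -> List[str]:
    """Longest sequence of symbols shared by the start of every alternative."""
    prefix = []
    for column in zip(*alternatives):
        if any(symbol != column[0] for symbol in column):
            break
        prefix.append(column[0])
    return prefix


def left_factor(grammar: Grammar) -> Grammar:
    """Factor out common prefixes until no two alternatives of a
    nonterminal start with the same symbol."""
    used = set(grammar)
    result: Grammar = {}
    work = list(grammar.items())
    while work:
        nonterminal, alternatives = work.pop()
        # Group the alternatives by their first symbol (None for epsilon).
        groups = defaultdict(list)
        for alt in alternatives:
            groups[alt[0] if alt else None].append(alt)
        factored = []
        for first_symbol, group in groups.items():
            if first_symbol is None or len(group) == 1:
                factored.extend(group)
                continue
            # Several alternatives share a prefix: move the differing
            # suffixes into a fresh nonterminal.
            prefix = common_prefix(group)
            fresh = nonterminal + "'"
            while fresh in used:
                fresh += "'"
            used.add(fresh)
            factored.append(prefix + [fresh])
            work.append((fresh, [alt[len(prefix):] for alt in group]))
        result[nonterminal] = factored
    return result
```

For example, a nonterminal `S` with alternatives `bA x eA`, `bA y eA`, and `bB` becomes `S → bA S' | bB` with `S' → x eA | y eA`, so a single lookahead token again selects a unique production.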
Modular tables play a key role in table-driven XML parsing. Modularity offers a flexible and adaptive mechanism for dealing with schema updates. In this section, we describe the construction of these modular tables from a set of schemas or WSDL descriptions.
Tokenizing a string once and then matching on tokens is more efficient than repeatedly comparing strings, and tokenization also simplifies the parsing table and the grammar production rules. Tokens are defined by schema element tag names, attribute tag names, schema built-in types such as xs:boolean, and some facets such as xs:enumeration. Element tag names are further classified as starting element tags and closing element tags. Throughout this paper, we use bNAME and eNAME to denote the starting element tag
Some schema built-in types and types derived through <xs:restriction> cannot easily be incorporated into grammar productions. Such types are checked by semantic actions associated with grammar productions. Schema built-in types are implemented as libraries. Types derived through <xs:restriction> are constructed from the schemas as routines that the engine invokes to check the content types of elements and attributes.
The parsing table is a two-dimensional array M[A, a], where A is a nonterminal and a is a terminal. Each entry of the table is either an index that refers to a production rule or a token indicating an error entry^2. The parsing table is constructed through the use of FIRST and FOLLOW sets [1]. Our augmented LL(1) grammar differs from the LL(1) grammar in [1] in that it supports occurrence production rules and permutation phrase production rules. The former have no effect on the calculation of FIRST and FOLLOW sets, while the latter do. Because a permutation phrase production is unordered, every constituent element must be treated as a possible first element when constructing the FIRST and FOLLOW sets. Thus, the permutation grammar composition symbol is commutative and associative, and the FIRST and FOLLOW sets are computed as the union over all elements (see [25]).
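The difference between an ordered sequence and a permutation phrase when computing FIRST sets can be sketched as follows. This is a simplified illustration that assumes the FIRST sets of the individual symbols and the set of nullable symbols are already known; the function and variable names are hypothetical.

```python
from typing import Dict, List, Set


def first_of_sequence(symbols: List[str],
                      first: Dict[str, Set[str]],
                      nullable: Set[str]) -> Set[str]:
    """FIRST of an ordered sequence: walk left to right and stop at the
    first symbol that cannot derive the empty string."""
    result: Set[str] = set()
    for symbol in symbols:
        result |= first[symbol]
        if symbol not in nullable:
            break
    return result


def first_of_permutation(symbols: List[str],
                         first: Dict[str, Set[str]]) -> Set[str]:
    """FIRST of a permutation phrase: any constituent may come first,
    so the result is the union over all constituents."""
    result: Set[str] = set()
    for symbol in symbols:
        result |= first[symbol]
    return result


# Three element nonterminals, each starting with its own opening-tag token.
FIRST = {"A": {"bA"}, "B": {"bB"}, "C": {"bC"}}
sequence_first = first_of_sequence(["A", "B", "C"], FIRST, nullable=set())
permutation_first = first_of_permutation(["A", "B", "C"], FIRST)
```

Here `sequence_first` contains only `bA`, while `permutation_first` contains `bA`, `bB`, and `bC`; the table row for the permutation nonterminal would therefore hold a production entry under each of the three tokens.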
The engine operates in two modes: scanning mode and parsing mode. In scanning mode, the engine works as a scanner that scans tags and converts recognized tags into tokens. In parsing mode, the engine consumes tokens and performs parsing and validation. When the top of the stack is a nonterminal and there is no current token to parse, the engine enters scanning mode. Once a tag name or CDATA is recognized, the engine converts it into a token and enters parsing mode.
(^2) We say that a parsing table entry is either a production or an error entry to simplify the description hereafter.
In scanning mode, the engine scans the input string each time to match a specific tag name. Tag names are classified as starting element names, closing element names, attribute names, and character data. Once a match is found, the engine converts the tag into a token and enters parsing mode. The parsing table provides information about the specific tag name. From the scanning point of view, the parsing table restricts the strings that can appear next in the input. Each row of the parsing table is indexed by a nonterminal and each column is indexed by a terminal, i.e., a token representing a tag name. Consequently, the expected tag name must be among those that the nonterminal can generate, and exactly one of them is expected to match. The engine checks the entry indexed by the nonterminal and the token. If it is a production entry, the engine picks up the token's corresponding tag name and starts to scan. Otherwise, the engine tries the next token. If no match is found for any of the tokens that correspond to a production entry for the nonterminal, the behavior depends on the type of the nonterminal. If it is a regular nonterminal, this indicates an error. If it is a permutation nonterminal, the nonterminal is pushed onto the auxiliary stack. Typically each row contains few production entries, unless the grammar consists largely of permutation phrases.
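Table-directed scanning can be sketched as follows. The sketch makes several simplifying assumptions that are not part of the original description: tag names are matched as complete literal strings without attributes or whitespace, the tables are small hand-written dictionaries, and the error marker is the integer -1.

```python
from typing import Dict, Optional, Tuple

ERROR = -1  # assumed marker for an error entry

# Hypothetical tag name table: token -> literal text to look for.
TAG_NAMES: Dict[str, str] = {
    "bORDER": "<order>",
    "eORDER": "</order>",
    "bITEM": "<item>",
}

# Hypothetical parsing table: row = nonterminal, column = token,
# entry = production index or ERROR.
PARSING_TABLE: Dict[str, Dict[str, int]] = {
    "Order": {"bORDER": 0, "eORDER": ERROR, "bITEM": ERROR},
    "Items": {"bORDER": ERROR, "eORDER": 2, "bITEM": 1},
}


def scan(nonterminal: str, text: str, pos: int) -> Optional[Tuple[str, int]]:
    """Try only the tag names whose tokens have a production entry in the
    row of the nonterminal on top of the stack. Return the matched token
    and the new input position, or None if no candidate matches."""
    for token, entry in PARSING_TABLE[nonterminal].items():
        if entry == ERROR:
            continue  # this tag cannot start anything the nonterminal generates
        literal = TAG_NAMES[token]
        if text.startswith(literal, pos):
            return token, pos + len(literal)
    return None
```

A `None` result corresponds to the case described above in which no candidate matches: the caller would report an error for a regular nonterminal or push a permutation nonterminal onto the auxiliary stack.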
In parsing mode, the engine behaves similarly to a permutation parsing engine. From the parsing point of view, the parsing table encodes a top-down parse tree for each instance of the XML schema from which the parsing table is constructed. A predictive parsing engine maintains a local stack to track the parser's states. The nonterminal on top of the stack and the current token determine a unique production in the parsing table that needs to be expanded. To parse a permutation phrase grammar, an auxiliary stack is required to temporarily hold permutation nonterminals that cannot be expanded at this point, that is, permutation nonterminals that do not generate the current symbol. Two flags are also needed for parsing occurrence constraints. The main stack is initialized with $, the endmarker, and S, the start symbol, on top. The current symbol X, which is the symbol on top of the main stack, and c, the current token generated in scanning mode, determine the parsing action.
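The role of the main stack can be illustrated with a standard single-stack predictive LL(1) loop [1]. This sketch deliberately omits the auxiliary stack, the occurrence flags, and semantic actions, so it does not handle permutation phrases; the grammar, table layout, and token names are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical production rule table: index -> right-hand side.
RULES: List[List[str]] = [
    ["bORDER", "Items", "eORDER"],  # 0: Order -> bORDER Items eORDER
    ["bITEM", "eITEM", "Items"],    # 1: Items -> bITEM eITEM Items
    [],                             # 2: Items -> epsilon
]

# Hypothetical parsing table M[A, a]; a missing entry is an error entry.
TABLE: Dict[str, Dict[str, int]] = {
    "Order": {"bORDER": 0},
    "Items": {"bITEM": 1, "eORDER": 2},
}


def parse(tokens: Sequence[str], start: str = "Order") -> bool:
    """Return True if the token sequence is accepted by the grammar."""
    stack = ["$", start]          # endmarker at the bottom, start symbol on top
    stream = list(tokens) + ["$"]
    i = 0
    while stack:
        x = stack.pop()           # current symbol X
        c = stream[i]             # current token c
        if x not in TABLE:        # terminal or endmarker: must match c
            if x != c:
                return False
            i += 1
        else:                     # nonterminal: expand via M[X, c]
            rule = TABLE[x].get(c)
            if rule is None:
                return False
            stack.extend(reversed(RULES[rule]))
    return i == len(stream)
```

With these tables, `parse(["bORDER", "bITEM", "eITEM", "eORDER"])` accepts the input, while a sequence such as `["bORDER", "eITEM"]` is rejected because the entry for `Items` and `eITEM` is an error entry.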
Table 2. Test cases and measurements. The last six columns give throughput in MB/sec for the validating and non-validating parsers.
Test case  Schema file  Schema bytes  <xs:all> elts  Instance file  Instance bytes  pTDX-flex  pTDX-table  gSOAP  Xerces  DFA  Expat
G21 2k     g.xsd        4021          21             g.xml          2341            41         29          11     5       34   23
A50 64k    a.xsd        3155          50             a 64k.xml      68060           33         27          10     3       38   26
A50 3k     a.xsd        3155          50             a 3k           3016            31         25          8      3       38   21
B5 0.2k    b.xsd        814           5              b.xml          291             24         22          3      3       26   13
B5 8k      b.xsd        814           5              b 8k.xml       8232            40         38          19     7       44   37
A50 16k    a.xsd        4021          50             a50 16k        17156           32         25          11     10      39   26
A2 0.3k    a2.xsd       569           2              a2 0.3k        341             28         28          3      2       31   12
A4 0.4k    a4.xsd       668           4              a4 0.4k        452             32         28          4      2       35   14
A8 0.6k    a8.xsd       881           8              a8 0.6k        678             34         32          5      4       38   16
A16 1k     a16.xsd      1314          16             a16 1k         1124            35         31          6      2       39   18
A32 2k     a32.xsd      2190          32             a32 2k         2036            36         30          8      3       39   20
A32 4k     a32.xsd      2190          32             a32 4k         3886            34         27          10     3       42   22
A32 8k     a32.xsd      2190          32             a32 8k         7584            35         28          10     3       42   25
A32 16k    a32.xsd      2190          32             a32 16k        16826           35         28          17     2       38   25
A32 32k    a32.xsd      2190          32             a32 32k        33462           36         28          11     4       42   26
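The relative figures can be recomputed directly from the throughput columns of Table 2. The short script below is only an arithmetic aid; comparing pTDX-table against Xerces and against pTDX-flex is our choice of illustration, and the variable names are hypothetical.

```python
from statistics import median

# (test case, pTDX-flex, pTDX-table, Xerces) throughput in MB/sec, from Table 2.
ROWS = [
    ("G21 2k", 41, 29, 5), ("A50 64k", 33, 27, 3), ("A50 3k", 31, 25, 3),
    ("B5 0.2k", 24, 22, 3), ("B5 8k", 40, 38, 7), ("A50 16k", 32, 25, 10),
    ("A2 0.3k", 28, 28, 2), ("A4 0.4k", 32, 28, 2), ("A8 0.6k", 34, 32, 4),
    ("A16 1k", 35, 31, 2), ("A32 2k", 36, 30, 3), ("A32 4k", 34, 27, 3),
    ("A32 8k", 35, 28, 3), ("A32 16k", 35, 28, 2), ("A32 32k", 36, 28, 4),
]

# Speedup of the table-directed parser over Xerces, per test case.
speedup_vs_xerces = [table / xerces for _, _, table, xerces in ROWS]

# Relative throughput drop of pTDX-table compared with pTDX-flex.
drop_vs_flex = [(flex - table) / flex for _, flex, table, _ in ROWS]

for (name, _, _, _), speedup, drop in zip(ROWS, speedup_vs_xerces, drop_vs_flex):
    print(f"{name:8s}  speedup over Xerces: {speedup:5.2f}x   drop vs pTDX-flex: {drop:6.1%}")

print(f"median speedup over Xerces: {median(speedup_vs_xerces):.2f}x")
print(f"median drop vs pTDX-flex:   {median(drop_vs_flex):.1%}")
```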
parsing [7, 11, 13, 15, 16, 19–22, 24, 26]. Schema-specific XML parsing achieves performance gains by exploiting schema information to compose a parser at compile time and utilizing the parsing states at runtime to verify schema validation constraints.
Our previous work on the gSOAP toolkit [20] is the earliest work on a schema-specific LL(1) recursive descent parser for XML with namespace support and validation. To our knowledge, it was also the first published work in the literature to suggest an integrated approach to schema-specific parsing that collapses scanning, parsing, validation, and deserialization into one phase. However, gSOAP implements a recursive descent parser, which incurs function-call overhead and is blocking.
In [21], van Engelen presents a method that integrates parsing and validation into a single stage by using a two-level scheme in which a lower-level Flex scanner drives DFA-based validation. The DFA is constructed directly from a schema based on a set of mapping rules. However, this approach can only process a non-cyclic subset of XML schema due to the limitations of the regular languages described by DFAs. Furthermore, the approach is impractical for permutation phrases with even a moderate number of elements, because the number of DFA states grows exponentially.
Chiu et al. [7] also suggest an approach to merge all aspects of low-level parsing and validation by extending DFAs to nondeterministic generalized automata. They also provide a technique for translating these into deterministic generalized automata. However, translating from an NFA to a DFA may blow up the number of states, thus limiting these parsers to small occurrence constraints. Furthermore, their approach does not support namespaces, which is an essential requirement for SOAP compliance.
Cardinality-constraint automata (CCA) [16] offer an efficient schema-aware XML parsing technique by extending deterministic finite automata with cardinality constraints on state transitions. These automata can easily handle the occurrence constraints imposed by a schema. Unfortunately, CCA does not provide a mechanism for well-formedness checking.
XML Screamer [11] is an efficient parser generator that translates an XML schema into a parser in either C or Java code. Similar to gSOAP and the work by Chiu et al., XML Screamer integrates deserialization with scanning, parsing, and validation. It demonstrates that high performance can be obtained by careful design of APIs. The tool uses recursive descent with backtracking and covers a large schema space. As with all recursive descent parsers, XML Screamer is a blocking parser. More recent work that builds on XML Screamer is iScreamer [13]. iScreamer is a schema-directed interpretive XML parser that achieves high performance by using a carefully tuned set of special-purpose bytecodes. iScreamer does not support the full set of schema features. Also, its reliance on specialized bytecodes may hinder its acceptance.
TDX [24, 26] provides an integrated approach that combines well-formedness checking, content-type validation, and application-specific events by pre-encoding parsing states in tabular form at compile time and by using an efficient push-down automaton at runtime. However, TDX relies on exponential enumeration of permutation phrases and is therefore not space-optimal. pTDX-flex [25] proposes a TDX-based approach that achieves both time and memory space efficiency by extending Backus-Naur Form (BNF) with support for a permutation phrase grammar representation of a schema. The permutation phrase grammar is a compact representation of common XML element and attribute permutations that have specific occurrence constraints. It requires a specialized recognizer, which is implemented by a two-stack push-down automaton. However, this Flex-based TDX parser cannot accommodate schema updates without being rebuilt.
In this paper we presented an adaptive table-driven XML parsing and validation technique that can be used to develop extensible high-performance Web services. The adaptive TDX parser encodes XML parsing states in compact tabular forms with the support of a permutation phrase grammar, which makes it memory-efficient. The adaptive approach uses interpretive scanning at runtime, leveraging these tabular forms to improve scanning performance.
[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley Publishing Company, Reading MA, 1985.
[2] Apache Foundation. Xerces XML Parser. http://xerces.apache.org/.
[3] A. I. Baars, A. Löh, and S. D. Swierstra. Functional pearl: Parsing permutation phrases. Journal of Functional Programming, 14(6):635–646, 2004.
[4] F. E. Bustamante, G. Eisenhauer, K. Schwan, and P. Widener. Efficient wire formats for high performance computing. In Supercomputing '00: Proceedings of the 2000 ACM/IEEE conference on Supercomputing (CDROM), page 39, Washington, DC, USA, 2000. IEEE Computer Society.
[5] R. D. Cameron. Extending context-free grammars with permutation phrases. ACM Letters on Programming Languages and Systems, 2(1-4):85–94, 1993.
[6] K. Chiu, M. Govindaraju, and R. Bramley. Investigating the Limits of SOAP Performance for Scientific Computing. In HPDC '02: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, page 246, Washington, DC, USA, 2002. IEEE Computer Society.
[7] K. Chiu and W. Lu. A compiler-based approach to schema-specific XML parsing. In Proceedings of The First International Workshop on High Performance XML Processing,
[8] J. Clark and M. Makoto. Relax NG specification, November
[12] W. Lowe, M. Noga, and T. Gaul. Foundations of fast communication via XML. Annals of Software Engineering, 13:357–379, 2002.
[13] M. Matsa, E. Perkins, A. Heifets, M. G. Kostoulas, D. Silva, N. Mendelsohn, and M. Leger. A high-performance interpretive approach to schema-directed parsing. In WWW '07: Proceedings of the 16th international conference on World Wide Web, pages 1093–1102, New York, NY, USA, 2007. ACM.
[14] OMG. XML metadata interchange (XMI) specifications. Available from http://www.omg.org/.
[15] E. Perkins, M. Matsa, M. G. Kostoulas, A. Heifets, and N. Mendelsohn. Generation of efficient parsers through direct compilation of XML schema grammars. IBM Syst. J., 45(2):225–244, 2006.
[16] F. Reuter. Cardinality automata: A core technology for efficient schema-aware parsers, 2003. http://www.swarms.de/publications/cca.pdf.
[17] SourceForge.net. http://expat.sourceforge.net.
[18] SourceForge.net. Flex: The fast lexical analyzer. http://flex.sourceforge.net/.
[19] H. S. Thompson and R. Tobin. Using finite state automata to implement W3C XML schema content model validation and restriction checking. In Proceedings of XML Europe,
[20] R. van Engelen. The gSOAP toolkit 2.1, 2001. http://gsoap2.sourceforge.net.
[21] R. van Engelen. Constructing finite state automata for high performance XML Web services. In Proceedings of the International Symposium on Web Services (ISWS), 2004.
[22] R. van Engelen and K. Gallivan. The gSOAP toolkit for web services and peer-to-peer computing networks. In Proceedings of the 2nd IEEE International Symposium on Cluster Computing and the Grid, pages 128–135, Berlin, Germany, May 2002.
[23] W3C XML Specification DTD. XML metadata interchange (XMI) specifications. Available from http://www.omg.org/.
[24] W. Zhang and R. van Engelen. A table-driven streaming XML parsing methodology for high-performance Web services. In ICWS '06: Proceedings of the IEEE International Conference on Web Services (ICWS'06), pages 197–204, Washington, DC, USA, 2006. IEEE Computer Society.
[25] W. Zhang and R. van Engelen. High-performance XML parsing and validation with permutation phrase grammar parsers. In ICWS '08: Proceedings of the IEEE International Conference on Web Services (ICWS'08), pages 286–294, Beijing, China, 2008. IEEE Computer Society.
[26] W. Zhang and R. A. van Engelen. TDX: a high-performance table-driven XML parser. In ACM-SE 44: Proceedings of the 44th annual Southeast regional conference, pages 726–731, New York, NY, USA, 2006. ACM.